
Performance Problems: A General Approach from Discovery to Optimization

2022-09-05 15:00 https://my.oschina.net/u/4850089/blog/5572623 JAVA前线


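1 Overview

A technical system evolves over time. In the early days of a business the priority is to deliver features and hit business goals; data volume and traffic are both small, so performance is not a primary concern.

As the business grows, data volume and traffic increase, sometimes sharply, and problems appear, such as a home page that takes five seconds to render. That kind of experience drives users away, and at that point performance must be addressed. We can divide the life of a technical system into three stages:

Early stage: focus on implementing business requirements; performance is not a priority
Middle stage: performance problems gradually emerge and start to hold the business back
Late stage: every technical iteration must consider performance and business together

How to discover performance problems, and ultimately how to solve them, is the subject of this article.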

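2 What Is Performance

Performance can be described along four dimensions:

Two dimensions define performance:
- Speed: the system is slow
- Load: the system is under heavy pressure
Two dimensions describe performance:
- Qualitative: what users perceive directly
- Quantitative: analysis of metrics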

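3 Discovering Performance Problems

3.1 Qualitative + Speed

A page that takes a long time to open, a list that loads slowly, an API call that ends in a timeout exception: obvious problems like these fall into this category.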

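3.2 Quantitative + Speed

3.2.1 Speed Metrics

A company has 7200 employees. Everyone punches in between 8:00 and 8:30 in the morning, and the system takes 5 seconds to process each punch-in. What are the RT, QPS, and concurrency?

RT is the response time, which the question already states:

RT = 5 seconds

QPS is the number of requests per second. Assuming the requests are evenly distributed over the 30-minute window:

QPS = 7200 / (30 * 60) = 4

Concurrency is the number of requests the system is processing at the same time:

Concurrency = QPS x RT = 4 x 5 = 20

The example leads to the general formula:

Concurrency = QPS x RT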
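
The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it keeps the article's assumption that requests are spread evenly across the window. In practice punch-ins cluster near the deadline, so the peak is likely to be higher than this average.

```java
// Sketch: recompute the punch-in example. All names here are illustrative.
class ConcurrencyEstimate {

    /** Average requests per second, assuming requests are spread evenly over the window. */
    static double qps(long totalRequests, long windowSeconds) {
        return (double) totalRequests / windowSeconds;
    }

    /** Average number of in-flight requests: concurrency = QPS x RT. */
    static double concurrency(double qps, double rtSeconds) {
        return qps * rtSeconds;
    }

    public static void main(String[] args) {
        double qps = qps(7200, 30 * 60);         // 7200 punch-ins in 30 minutes
        double concurrent = concurrency(qps, 5); // each request takes 5 seconds
        // prints "QPS=4.0, concurrency=20.0"
        System.out.println("QPS=" + qps + ", concurrency=" + concurrent);
    }
}
```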

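3.2.2 QPS vs. TPS

QPS (Queries Per Second): the number of queries per second

TPS (Transactions Per Second): the number of transactions per second

Note that a transaction here is not a database transaction. It covers the following three stages:

Receive the request
Process the business logic
Return the result

QPS = N * TPS (N>=1)

N = 1 means the interface performs one transaction: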

public class OrderService {

    public Order queryOrderById(String orderId) {
        return orderMapper.selectById(orderId);
    }
}

N > 1 means the interface performs multiple transactions:

public class OrderService {

    public void updateOrder(Order order) {
        // transaction1
        orderMapper.update(order);
        // transaction2
        sendOrderUpdateMessage(order);
    }
}

3.2.3 发现问题

(1) Logging

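import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FastTestService {

    // Assumption: the `log` used in the catch blocks below is an SLF4J logger
    private static final Logger log = LoggerFactory.getLogger(FastTestService.class);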

    public void test01() {
        long start = System.currentTimeMillis();
        biz1();
        biz2();
        long costTime = System.currentTimeMillis() - start;
        System.out.println("costTime=" + costTime);
    }

    private void biz1() {
        try {
            System.out.println("biz1");
            Thread.sleep(500L);
        } catch (Exception ex) {
            log.error("error", ex);
        }
    }

    private void biz2() {
        try {
            System.out.println("biz2");
            Thread.sleep(1000L);
        } catch (Exception ex) {
            log.error("error", ex);
        }
    }
}
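
When the same measurement is needed in several places, the start/stop bookkeeping can be pulled into a small helper. The following is a minimal sketch, not part of the original code: the `Timing` class and its method names are hypothetical. It uses `System.nanoTime()` rather than `System.currentTimeMillis()` because the former is intended for measuring elapsed time and is not affected by wall-clock adjustments.

```java
import java.util.concurrent.TimeUnit;

// Sketch of a reusable timing helper. Class and method names are illustrative.
class Timing {

    /** Runs the task and returns the elapsed time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime(); // monotonic clock, suited to elapsed-time measurement
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Sleeps for the given time and restores the interrupt flag if interrupted. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a business method such as biz1()
        long costTime = timeMillis(() -> sleepQuietly(200L));
        System.out.println("costTime=" + costTime);
    }
}
```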

(2) StopWatch

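import com.alibaba.fastjson.JSON; // assumption: JSON below is fastjson's com.alibaba.fastjson.JSON
import org.springframework.util.StopWatch;
import org.springframework.util.StopWatch.TaskInfo;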

public class FastTestService {

    public void test02() {
        StopWatch sw = new StopWatch("testWatch");
        sw.start("biz1");
        biz1();
        sw.stop();

        sw.start("biz2");
        biz2();
        sw.stop();

        // Print the total elapsed time
        System.out.println("costTime=" + sw.getTotalTimeMillis());
        System.out.println();

        // Print per-task information
        TaskInfo[] taskInfos = sw.getTaskInfo();
        for (TaskInfo task : taskInfos) {
            System.out.println("taskInfo=" + JSON.toJSONString(task));
        }
        System.out.println();

        // Print the formatted task summary
        System.out.println(sw.prettyPrint());
    }
}

Output:

costTime=1526

taskInfo={"taskName":"biz1","timeMillis":510,"timeNanos":510811200,"timeSeconds":0.5108112}
taskInfo={"taskName":"biz2","timeMillis":1015,"timeNanos":1015439700,"timeSeconds":1.0154397}

StopWatch 'testWatch': running time = 1526250900 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
510811200  033%  biz1
1015439700  067%  biz2
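
The percentage column can be reproduced from the nanosecond figures: the two task times add up exactly to the reported running time, and each percentage is that task's share of the total. The sketch below redoes the calculation with plain arithmetic; `TaskShare` is a hypothetical name, and rounding to the nearest whole percent is an assumption that happens to match the output above, not a statement about how Spring formats the value internally.

```java
// Sketch: derive the per-task percentages from the nanosecond values printed above.
class TaskShare {

    /** Share of the total as a whole-number percentage, rounded to the nearest integer. */
    static long percent(long taskNanos, long totalNanos) {
        return Math.round(100.0 * taskNanos / totalNanos);
    }

    public static void main(String[] args) {
        long biz1 = 510_811_200L;
        long biz2 = 1_015_439_700L;
        long total = biz1 + biz2; // 1526250900 ns, the running time reported above
        System.out.println("biz1=" + percent(biz1, total) + "%"); // prints "biz1=33%"
        System.out.println("biz2=" + percent(biz2, total) + "%"); // prints "biz2=67%"
    }
}
```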

(3) trace

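Arthas is an open-source Java diagnostic tool from Alibaba:

Arthas is an online monitoring and diagnostic product. It gives a global, real-time view of an application's load, memory, GC, and thread state, and it can diagnose business problems without modifying application code, including inspecting method arguments, return values, and exceptions, measuring method execution time, and viewing class loading information, which greatly improves the efficiency of troubleshooting production issues.

The Arthas trace command measures the time spent at each node of a call path:

https://arthas.aliyun.com/doc/trace.html

Let's walk through an example. First, write and run the following code: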

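package java.front.optimize;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FastTestService {

    // Assumption: the `log` used in the catch blocks below is an SLF4J logger
    private static final Logger log = LoggerFactory.getLogger(FastTestService.class);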

    public static void main(String[] args) {
        FastTestService service = new FastTestService();
        while (true) {
            service.test03();
        }
    }

    public void test03() {
        biz1();
        biz2();
    }

    private void biz1() {
        try {
            System.out.println("biz1");
            Thread.sleep(500L);
        } catch (Exception ex) {
            log.error("error", ex);
        }
    }

    private void biz2() {
        try {
            System.out.println("biz2");
            Thread.sleep(1000L);
        } catch (Exception ex) {
            log.error("error", ex);
        }
    }
}

Step 1: start the Arthas console:

$ java -jar arthas-boot.jar
[INFO] arthas-boot version: 3.6.2
[INFO] Found existing java process, please choose one and input the serial number of the process, eg : 1. Then hit ENTER
* [1]: 14121
  [2]: 20196 java.front.optimize.FastTestService

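Step 2: enter the serial number of the process to monitor (2 in this example, for FastTestService) and press ENTER.

Step 3: use the trace command to monitor the target method:

trace java.front.optimize.FastTestService test03

Step 4: inspect the time spent along the call path: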

`---[1518.7362ms] java.front.optimize.FastTestService:test03()
    +---[33.66% 511.2817ms ] java.front.optimize.FastTestService:biz1() #54
    `---[66.32% 1007.2962ms ] java.front.optimize.FastTestService:biz2() #55

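3.3 Qualitative + Load

A system under heavy load also looks slow, but this kind of slowness is not just a page that opens after a few seconds: the page stays in the loading state and finally ends up blank.

3.4 Quantitative + Load

Common server load metrics are:

Memory
CPU
Disk
Network

Server-side development is most likely to cause memory and CPU problems, so we focus on those two.

3.4.1 Discovering CPU Problems

First, write and run a piece of code that drives CPU usage up: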

public class FastTestService {

    public static void main(String[] args) {
        FastTestService service = new FastTestService();
        while (true) {
            service.test();
        }
    }

    public void test() {
        biz();
    }

    private void biz() {
        System.out.println("biz");
    }
}

(1) dashboard + thread

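The dashboard command shows a real-time panel for the current system. Here the thread with ID=1 has very high CPU usage (this ID is the Java-level thread ID and cannot be matched against the nativeID in jstack output):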

$ dashboard
ID    NAME    GROUP  PRIORI   STATE    %CPU   DELTA TIME   TIME    INTERRU   DAEMON
1     main    main     5      RUNNA    96.06    4.812     2:41.2    false    false

The thread command lists the N busiest threads:

$ thread -n 1

"main" Id=1 deltaTime=203ms time=1714000ms RUNNABLE
    at app//java.front.optimize.FastTestService.biz(FastTestService.java:83)
    at app//java.front.optimize.FastTestService.test(FastTestService.java:61)
    at app//java.front.optimize.FastTestService.main(FastTestService.java:17)
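
A similar ranking can be produced from inside the JVM with the standard `java.lang.management` API. The sketch below is not how Arthas implements `thread -n`, and the `BusyThreads` class is hypothetical. It also ranks threads by CPU time accumulated since each thread started, whereas the Arthas output above also reports a `deltaTime` measured over a sampling interval.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: rank live threads by accumulated CPU time using ThreadMXBean.
class BusyThreads {

    /** Java-level thread ID, thread name, and accumulated CPU time in nanoseconds. */
    static final class Usage {
        final long id;
        final String name;
        final long cpuNanos;

        Usage(long id, String name, long cpuNanos) {
            this.id = id;
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns at most n threads, busiest first; empty if CPU timing is unavailable. */
    static List<Usage> top(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Usage> usages = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return usages;
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);     // -1 if the thread is no longer alive
            ThreadInfo info = mx.getThreadInfo(id); // null if the thread is no longer alive
            if (cpu >= 0 && info != null) {
                usages.add(new Usage(id, info.getThreadName(), cpu));
            }
        }
        usages.sort(Comparator.comparingLong((Usage u) -> u.cpuNanos).reversed());
        return new ArrayList<>(usages.subList(0, Math.min(n, usages.size())));
    }

    public static void main(String[] args) {
        for (Usage u : top(3)) {
            System.out.printf("\"%s\" Id=%d cpuTime=%dms%n", u.name, u.id, u.cpuNanos / 1_000_000);
        }
    }
}
```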

3.4.2 Discovering Memory Problems

(1) free

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            10G        5.5G        3.1G         28M        1.4G        4.4G
Swap:          2.0G        435M        1.6G

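total
Total memory on the server

used
Memory in use

free
Memory that is completely unused

shared
Shared physical memory (mostly tmpfs)

cache
Page cache: memory the kernel uses to cache file contents

buff
Buffer cache: memory used by kernel buffers for block-device I/O

available
An estimate of how much memory applications can use without swapping (free memory plus reclaimable cache)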

(2) memory

The Arthas memory command shows JVM memory information:

https://arthas.aliyun.com/doc/memory.html

JVM memory information (example from the official documentation):

$ memory
Memory                           used      total      max        usage
heap                             32M       256M       4096M      0.79%
g1_eden_space                    11M       68M        -1         16.18%
g1_old_gen                       17M       184M       4096M      0.43%
g1_survivor_space                4M        4M         -1         100.00%
nonheap                          35M       39M        -1         89.55%
codeheap_'non-nmethods'          1M        2M         5M         20.53%
metaspace                        26M       27M        -1         96.88%
codeheap_'profiled_nmethods'     4M        4M         117M       3.57%
compressed_class_space           2M        3M         1024M      0.29%
codeheap_'non-profiled_nmethods' 685K      2496K      120032K    0.57%
mapped                           0K        0K         -          0.00%
direct                           48M       48M        -          100.00%
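
Comparable figures are available inside the JVM through `MemoryMXBean` and `MemoryPoolMXBean`. The sketch below prints one line per memory area. The `MemorySnapshot` class and its output layout are hypothetical. The sample output above appears to divide by `total` when `max` is -1, so the sketch falls back to the committed size in that case; that reading of the sample is an assumption, not documented Arthas behavior.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Sketch: print used/committed/max per memory area, loosely modeled on the table above.
class MemorySnapshot {

    static String format(String name, MemoryUsage u) {
        // getMax() returns -1 when no limit is defined; fall back to the committed size
        long base = u.getMax() > 0 ? u.getMax() : u.getCommitted();
        String usage = base > 0 ? String.format("%.2f%%", 100.0 * u.getUsed() / base) : "n/a";
        String max = u.getMax() < 0 ? "-1" : (u.getMax() / 1024) + "K";
        return String.format("%-36s used=%dK total=%dK max=%s usage=%s",
                name, u.getUsed() / 1024, u.getCommitted() / 1024, max, usage);
    }

    public static void main(String[] args) {
        System.out.println(format("heap", ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()));
        System.out.println(format("nonheap", ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage()));
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage(); // null if the pool is no longer valid
            if (u != null) {
                System.out.println(format(pool.getName(), u));
            }
        }
    }
}
```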

(3) jmap

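Find the process ID of the Java program

jps -l

View current heap usage

jhsdb jmap --heap --pid 20196

Export a heap snapshot to a file

jmap -dump:format=b,file=/home/tmp/my-dump.hprof 20196

Export a heap snapshot automatically on OutOfMemoryError

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/tmp/my-dump.hprof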

(4) heapdump

The Arthas heapdump command exports a heap snapshot:

https://arthas.aliyun.com/doc/heapdump.html

Dump to a specified file

heapdump /home/tmp/my-dump.hprof

Dump only live objects to a specified file

heapdump --live /home/tmp/my-dump.hprof

Dump to a temporary file

heapdump
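
A heap snapshot can also be triggered from application code on HotSpot-based JVMs through `com.sun.management.HotSpotDiagnosticMXBean`. This is a sketch under assumptions: `HeapDumper` is a hypothetical class, the temporary directory is chosen only for the demo, and the approach is a JDK facility unrelated to how Arthas produces its dump. The target file must not already exist, and recent JDKs require the name to end in `.hprof`.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: write a heap snapshot of the current JVM (HotSpot-specific API).
class HeapDumper {

    /** Dumps the heap to the given file; liveOnly=true dumps only reachable objects. */
    static void dump(Path file, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(file.toString(), liveOnly);
    }

    public static void main(String[] args) throws IOException {
        // Demo location only; the file must not exist before the dump
        Path file = Files.createTempDirectory("heapdump").resolve("my-dump.hprof");
        dump(file, true);
        System.out.println("dumped " + Files.size(file) + " bytes to " + file);
    }
}
```

The `liveOnly` flag plays the same role as `--live` in the Arthas command above.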

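(5) Garbage collection

jstat shows garbage collection statistics, which lets you see whether the program collects too often or spends too long in GC:

jstat -gcutil <pid> <interval(ms)>

Print garbage collection statistics once per second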

$ jstat -gcutil 20196 1000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
  0.00   0.00  57.69   0.00      -      -      0    0.000     0    0.000     0    0.000    0.000
  0.00   0.00  57.69   0.00      -      -      0    0.000     0    0.000     0    0.000    0.000
  0.00   0.00  57.69   0.00      -      -      0    0.000     0    0.000     0    0.000    0.000

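The columns are:

S0: Survivor space 0 utilization, as a percentage of its current capacity

S1: Survivor space 1 utilization, as a percentage of its current capacity

E: Eden space utilization, as a percentage of its current capacity

O: Old generation utilization, as a percentage of its current capacity

M: Metaspace utilization, as a percentage of its current capacity (JDK 8 and later; replaces the permanent generation column P of older JDKs)

CCS: Compressed class space utilization, as a percentage of its current capacity

YGC: Number of Young GC events since the application started

YGCT: Time spent in Young GC since the application started (seconds)

FGC: Number of Full GC events since the application started

FGCT: Time spent in Full GC since the application started (seconds)

CGC: Number of concurrent GC events since the application started

CGCT: Time spent in concurrent GC since the application started (seconds)

GCT: Total time spent in garbage collection since the application started (seconds)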
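
The collection counts and times can also be read from inside the JVM through `GarbageCollectorMXBean`, which is convenient for logging them alongside application metrics. The sketch below uses a hypothetical `GcStats` class. The bean reports time in milliseconds, whereas jstat prints seconds, and the collector names depend on the garbage collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: read GC counts and accumulated GC time from the running JVM.
class GcStats {

    /** Total number of collections across all collectors since JVM start. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount()); // -1 means undefined
        }
        return total;
    }

    /** Total accumulated collection time in milliseconds since JVM start. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // -1 means undefined
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total: count=" + totalCollections()
                + ", timeMs=" + totalCollectionMillis());
    }
}
```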