JVM Performance Tuning

The Java Virtual Machine Memory Model

The JVM divides the memory it manages into the following areas:

pc register

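  • pc register (program counter): a register that holds the address of the instruction currently being executed

The pc register is the only area for which the Java Virtual Machine Specification defines no OutOfMemoryError condition.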

stack

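The stack can throw two errors: StackOverflowError and OutOfMemoryError. In the HotSpot VM, the option -Xss1M sets the stack size to 1 MB. The more parameters and local variables a method has, the more stack space each call needs, so the maximum recursion depth is not a fixed number. The nesting depth of calls is bounded by the stack size: the larger the stack, the deeper calls can nest. For a given method, the more parameters and local variables it has, the larger its stack frame and the fewer nested calls fit.

  • -Xss1M: sets the stack size (here to 1 MB)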
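The link between frame size and recursion depth can be observed directly. The following is a minimal sketch rather than a benchmark: the class and method names are invented for this illustration, and the depths it reports vary with the JVM, the platform, JIT activity, and the -Xss value.

```java
public class StackDepthDemo {
    private static int depth;

    // Small frame: no parameters and no local variables.
    private static void smallFrame() {
        depth++;
        smallFrame();
    }

    // Larger frame: several parameters and local variables per call.
    private static void largeFrame(long a, long b, long c) {
        long x = a + 1, y = b + 1, z = c + 1;
        depth++;
        largeFrame(x, y, z);
    }

    // Recurses until the stack is exhausted and returns the depth reached.
    static int measure(boolean useLargeFrame) {
        depth = 0;
        try {
            if (useLargeFrame) {
                largeFrame(0, 0, 0);
            } else {
                smallFrame();
            }
        } catch (StackOverflowError expected) {
            // The stack is full; depth now holds the number of nested calls.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("small frames: " + measure(false));
        System.out.println("large frames: " + measure(true));
    }
}
```

Running it with different settings, for example java -Xss256k StackDepthDemo and java -Xss2M StackDepthDemo, should show the reachable depth growing with the stack size.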

native method stack

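Like the JVM stack, the native method stack can throw two errors: StackOverflowError and OutOfMemoryError. Sun's HotSpot VM does not distinguish between the native method stack and the JVM stack.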

HEAP

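  • -Xmx: sets the maximum heap size
  • -Xms: sets the minimum heap size, i.e. the amount of operating-system memory the JVM takes for the heap at startup. The JVM tries to keep heap usage within -Xms, so when usage reaches that size it triggers a Full GC before growing the heap. Setting -Xms equal to -Xmx therefore reduces the number and cost of GCs early in the application's life.
  • -Xmn: sets the size of the young generation. It is equivalent to setting -XX:NewSize and -XX:MaxNewSize to the same value. Setting those two to different values makes the young generation resize back and forth, which adds unnecessary overhead.
    • -XX:NewSize: sets the initial size of the young generation
    • -XX:MaxNewSize: sets the maximum size of the young generation

After mistakenly setting -Xmn where -Xmx was intended: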

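Getting the free memory, the memory currently allocated to the heap, and the maximum available heap (all in MB):

long freeMb  = Runtime.getRuntime().freeMemory()  / 1024 / 1024; // free memory inside the current heap
long totalMb = Runtime.getRuntime().totalMemory() / 1024 / 1024; // heap currently allocated to the JVM
long maxMb   = Runtime.getRuntime().maxMemory()   / 1024 / 1024; // upper bound the heap may grow to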

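Escape Analysis

Since Java 7, HotSpot supports escape analysis and stack allocation of objects. When an object never escapes the method that creates it, this mechanism can turn a heap allocation into a stack allocation:

void myMethod() {
    V v = new V(); // v never leaves this method, so it does not escape
    // use v
    v = null;
}

  • -server: escape analysis can only be enabled in server mode
  • -XX:+DoEscapeAnalysis: enables escape analysis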

method area

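The method area mainly stores class metadata: types, constant pools, fields, and methods. In the HotSpot VM the method area is also called the permanent generation (PermGen), and it too can be reclaimed by GC. The size of the permanent generation directly determines how many class definitions and constants the system can hold. For applications that use dynamic bytecode generation tools such as CGLIB or Javassist, a sensibly sized permanent generation helps keep the system stable.

If the system uses dynamic proxies, it may generate a large number of classes at run time. In that case the permanent generation must be sized so that it does not overflow.

  • -XX:MaxPermSize=4M: sets the maximum size of the permanent generation
  • -XX:PermSize=4M: sets the initial size of the permanent generation

In JDK 1.8 the permanent generation was removed entirely and replaced by the metaspace. The metaspace is native memory outside the heap; if no maximum is specified, by default it can keep growing until it exhausts the available system memory.

  • -XX:MaxMetaspaceSize: sets the maximum size of the metaspace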

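Direct Memory

Since the introduction of NIO, direct memory has become very common. Direct memory bypasses the Java heap and accesses native memory directly. It suits cases where buffers are allocated rarely but accessed frequently; if buffers have to be allocated frequently, direct memory is a poor fit.

  • -XX:MaxDirectMemorySize: the maximum amount of direct memory; defaults to the value of -Xmx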
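The "allocate rarely, access often" advice can be sketched with java.nio.ByteBuffer. The class and method names below are invented for this illustration; the point is only that one direct buffer is allocated up front and then reused for every access.

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    // Allocated once and reused: allocation is the costly part of direct memory.
    private static final ByteBuffer SHARED = ByteBuffer.allocateDirect(1024);

    // Writes a value into the direct buffer and reads it back.
    static long writeAndReadBack(long value) {
        SHARED.clear();        // reset position and limit so the buffer can be reused
        SHARED.putLong(value); // write into native memory
        SHARED.flip();         // switch from writing to reading
        return SHARED.getLong();
    }

    public static void main(String[] args) {
        System.out.println("shared buffer is direct: " + SHARED.isDirect());
        System.out.println("heap buffer is direct:   " + ByteBuffer.allocate(1024).isDirect());
        System.out.println(writeAndReadBack(42L));
    }
}
```

A heap buffer from ByteBuffer.allocate would work the same way through this API; the difference is where the bytes live and how expensive the allocation is.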

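Region Ratios

  • -XX:SurvivorRatio=8: sets the ratio of eden space to S0 (one survivor space) in the young generation
  • -XX:NewRatio=2: sets the ratio of the old generation to the young generation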
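As a worked example of what these two ratios mean, the sketch below splits a heap according to NewRatio and SurvivorRatio. It is plain arithmetic under the assumption that the ratios are applied exactly; a real JVM also aligns and rounds the sizes, so actual numbers differ slightly. The class name is invented for this illustration.

```java
public class GenerationSizing {
    // Returns {young, old, eden, survivor} in the same unit as heap.
    static long[] split(long heap, int newRatio, int survivorRatio) {
        long young = heap / (newRatio + 1);          // old : young = newRatio : 1
        long old = heap - young;
        long survivor = young / (survivorRatio + 2); // eden : S0 : S1 = survivorRatio : 1 : 1
        long eden = young - 2 * survivor;
        return new long[] { young, old, eden, survivor };
    }

    public static void main(String[] args) {
        long[] s = split(300, 2, 8); // a 300 MB heap with NewRatio=2 and SurvivorRatio=8
        System.out.printf("young=%d old=%d eden=%d survivor=%d%n", s[0], s[1], s[2], s[3]);
        // prints: young=100 old=200 eden=80 survivor=10
    }
}
```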

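Garbage Collection Algorithms

  • Reference counting: cannot handle circular references
  • Mark-Sweep:
    1. Mark the objects reachable from the root nodes
    2. Sweep away all unmarked objects
    3. Main drawback: the reclaimed space is not contiguous
  • Copying (young generation):
    1. The memory is split into two halves, and only one is used at a time
    2. Live objects are copied into the unused half
    3. All objects in the half that was in use are cleared
    4. The two halves swap roles
    5. Suited to the young generation, where garbage objects usually outnumber live ones
  • Mark-Compact:
    1. Mark the objects reachable from the root nodes
    2. Compact all live (marked) objects toward one end of memory
    3. Reclaim everything beyond the boundary between the compacted live objects and the rest

  • Generational collecting:
    1. Use a different algorithm for each memory space according to its characteristics: the young generation (few live objects, much garbage) uses the copying algorithm, and the old generation (mostly live objects) uses mark-compact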
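The mark and sweep phases described above can be sketched on a toy object graph. This is an illustration of the idea only, not how HotSpot implements it: the Node type, the list standing in for the heap, and all names are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class MarkSweepSketch {
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    // Mark phase: flag every node reachable from the given roots.
    static void mark(Node... roots) {
        Deque<Node> pending = new ArrayDeque<>();
        for (Node root : roots) pending.push(root);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (n.marked) continue; // already visited, which also stops cycles
            n.marked = true;
            for (Node ref : n.refs) pending.push(ref);
        }
    }

    // Sweep phase: keep marked nodes, drop the rest, and clear the marks.
    static List<Node> sweep(List<Node> heap) {
        List<Node> survivors = new ArrayList<>();
        for (Node n : heap) {
            if (n.marked) {
                n.marked = false;
                survivors.add(n);
            }
        }
        return survivors;
    }

    static List<Node> collect(List<Node> heap, Node... roots) {
        mark(roots);
        return sweep(heap);
    }
}
```

Because reachability is traced from the roots, two unreachable objects that reference each other are still swept, which is exactly the case reference counting cannot handle.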

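To support frequent young-generation collections, the VM may use a data structure called a card table. A card table is a set of bits, each of which records whether any object in one region of the old generation holds a reference to a young-generation object. During a young-generation GC, the collector scans the card table first and then scans only the old-generation regions whose bit is 1; a region whose bit is 0 is guaranteed to hold no references into the young generation.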
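A minimal sketch of the card-table idea follows. The card size of 512 bytes, the use of plain integer offsets for old-generation addresses, and all names are assumptions made for this illustration; a real VM maintains the table through write barriers emitted by the compiler.

```java
import java.util.BitSet;

public class CardTableSketch {
    static final int CARD_SIZE = 512; // bytes of old generation covered by one card (assumed)

    private final BitSet dirty;

    CardTableSketch(int oldGenBytes) {
        dirty = new BitSet((oldGenBytes + CARD_SIZE - 1) / CARD_SIZE);
    }

    // Called when an old-generation object at this offset stores a reference to a young object.
    void recordWrite(int oldGenOffset) {
        dirty.set(oldGenOffset / CARD_SIZE);
    }

    // A young-generation GC only needs to scan regions whose card is set.
    boolean mustScan(int cardIndex) {
        return dirty.get(cardIndex);
    }

    int dirtyCardCount() {
        return dirty.cardinality();
    }
}
```

With a 4096-byte old generation there are 8 cards; a single recorded write at offset 600 dirties card 1 only, so the other 7 regions can be skipped during the young-generation collection.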

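What Is Really Garbage

  • Reachable: the object can be reached from a root node
  • Resurrectable: the object is no longer reachable but may be revived in finalize()
  • Unreachable: finalize() has run and the object was not revived

The finalize() method is called only once per object.

@Override
protected void finalize() throws Throwable {
    super.finalize();
    obj = this; // revive this object by storing it in a reachable field (obj is assumed to be a static field)
}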

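StringBuffer str = new StringBuffer("Hello world");

Assuming the code above runs inside a method body:

Strong reference: the local variable str holds a strong reference to the StringBuffer instance. A strongly reachable object is never reclaimed.

Soft reference: java.lang.ref.SoftReference. The referent may be reclaimed.

Weak reference: reclaimed as soon as the GC discovers it. Because the garbage collector thread usually runs at low priority, an object that is only weakly referenced is not necessarily discovered quickly.

Phantom reference: used to track the garbage collection process.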
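The reference kinds above map onto classes in java.lang.ref. The sketch below shows only behavior that is deterministic; what happens after System.gc() is left unasserted because the call is merely a hint. The class name is invented for this illustration.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class ReferenceKindsDemo {
    public static void main(String[] args) {
        StringBuffer str = new StringBuffer("Hello world"); // strong reference
        SoftReference<StringBuffer> soft = new SoftReference<>(str);
        WeakReference<StringBuffer> weak = new WeakReference<>(str);
        ReferenceQueue<StringBuffer> queue = new ReferenceQueue<>();
        PhantomReference<StringBuffer> phantom = new PhantomReference<>(str, queue);

        // While str is strongly reachable, soft and weak references still return it.
        System.out.println("soft returns referent: " + (soft.get() == str));
        System.out.println("weak returns referent: " + (weak.get() == str));
        // A phantom reference never hands out its referent.
        System.out.println("phantom.get() is null: " + (phantom.get() == null));

        str = null;  // drop the strong reference
        System.gc(); // only a hint; the JVM may or may not collect here
        System.out.println("weak after GC hint: " + weak.get());
    }
}
```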

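Garbage Collectors

Serial Collectors

  • Young-generation serial collector: -XX:+UseSerialGC selects the serial collector for both the young and the old generation. This collector is old but battle-tested. It collects garbage on a single thread and stops all application threads while it runs (stop-the-world). It is the default collector when the VM runs in Client mode.

  • Old-generation serial collector: uses the mark-compact algorithm. The following options all use it for the old generation:
    • -XX:+UseSerialGC: serial collector for both the young and the old generation
    • -XX:+UseParNewGC: ParNew for the young generation, serial collector for the old generation
    • -XX:+UseParallelGC: ParallelGC for the young generation, serial collector for the old generation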


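Parallel Collectors

Young-generation ParNew collector:

  • -XX:+UseParNewGC
  • -XX:+UseConcMarkSweepGC

The number of threads the collector uses can be set with -XX:ParallelGCThreads. It is generally best to keep it close to the number of CPUs and avoid using too many threads. The default is computed as follows:

int getGCThreadsCount(int countOfCPU) {
    if (countOfCPU < 8)
        return countOfCPU;
    else
        return 3 + ((5 * countOfCPU) / 8);
}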


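Young-generation ParallelGC collector: focuses on system throughput

  • -XX:+UseParallelGC
  • -XX:+UseParallelOldGC

Two important options control throughput, and a third lets the VM tune them itself:

  • -XX:MaxGCPauseMillis: sets the maximum GC pause time
  • -XX:GCTimeRatio: sets the throughput target
  • -XX:+UseAdaptiveSizePolicy: turns on the adaptive GC policy

Old-generation ParallelOldGC collector: uses the mark-compact algorithm

  • Parallel collectors are multi-threaded versions of the serial collector. The number of threads they use can be set with -XX:ParallelGCThreads; it is generally best to keep it close to the number of CPUs, because too many threads hurt collection performance.
    • -XX:+UseParNewGC: parallel collector (ParNew) for the young generation, serial collector for the old generation
    • -XX:+UseConcMarkSweepGC: parallel collector (ParNew) for the young generation, CMS for the old generation
  • Young-generation parallel collector: uses the copying algorithm
    • -XX:+UseParallelGC: parallel collector (ParallelGC) for the young generation, serial collector for the old generation
  • Old-generation parallel collector: uses the mark-compact algorithm
    • -XX:+UseParallelOldGC: ParallelGC for the young generation, ParallelOldGC for the old generation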

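CMS (Concurrent Mark Sweep): focuses on pause time and runs mostly concurrently with the application instead of stopping it

  • -XX:+UseConcMarkSweepGC
  • -XX:CMSInitiatingOccupancyFraction: starts a CMS collection when old-generation occupancy reaches this percentage (68% by default)
  • -XX:+UseCMSCompactAtFullCollection: defragments memory after a collection completes

  • G1 collector: officially available since JDK 1.7. G1 does not require the eden space, the young generation, or the old generation to be contiguous.

The young-generation and old-generation serial collectors are both serial, stop-the-world collectors.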


GC output printed with -XX:+UseSerialGC:

[GC (Allocation Failure) 
[DefNew: 18954K->897K(28864K), 0.0020543 secs]
18954K->897K(93056K), 0.0020917 secs]
[Times: user=0.00 sys=0.00, real=0.00 secs]

GC output printed with -XX:+UseParNewGC:

[GC (Allocation Failure) 
[ParNew: 19468K->880K(28864K), 0.0033698 secs]
19468K->880K(93056K), 0.0034037 secs]
[Times: user=0.00 sys=0.00, real=0.00 secs]

GC output printed with -XX:+UseParallelOldGC (the default):

[GC (Allocation Failure) 
[PSYoungGen: 24485K->448K(28160K)]
368549K->344520K(379904K), 0.0039329 secs]
[Times: user=0.02 sys=0.00, real=0.01 secs]

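The G1 (Garbage-First) Collector

Earlier collectors (serial, parallel, CMS) divide the heap into three fixed-size areas: the young generation, the old generation, and the permanent generation.

G1 takes a different approach:

The heap is divided into a set of equal-sized regions, and the set of regions playing a given role is no longer fixed in size, which gives more flexibility in memory use. When a collection starts, G1 operates much like CMS:

  1. It performs a concurrent global marking phase to determine which objects are live
  2. It collects the regions that contain the most garbage first, which is why it is called Garbage-First

Other collectors reclaim memory using the JVM's built-in threads, whereas G1 lets application threads take on part of the collection work.

G1 vs. CMS

G1 is planned as the replacement for the Concurrent Mark-Sweep Collector (CMS). Compared with CMS:

  • G1 is a compacting collector. G1 compacts sufficiently to completely avoid the use of fine-grained free lists for allocation, and instead relies on regions. This considerably simplifies parts of the collector, and mostly eliminates potential fragmentation issues.
  • G1 offers more predictable garbage collection pauses and lets users specify a pause-time target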

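Useful JVM Options

  • Capturing a heap snapshot

When an OutOfMemoryError occurs, -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:\m.hprof saves the current heap snapshot to a file. You can also add -XX:OnOutOfMemoryError=c:\reset.bat to run a script.

When an OutOfMemoryError occurs (it happened to me on a 32-bit Windows system), try increasing the available heap:

java -Xmx1024M -jar xxx.jar

TODO: Question: how can you tell how much memory a program actually needs?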

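  • Getting GC information

Use -verbose:gc or -XX:+PrintGC for brief GC information, or -XX:+PrintGCDetails for more detail. To print when each GC happened, add -XX:+PrintGCTimeStamps for the time relative to VM start or -XX:+PrintGCDateStamps for the absolute time. To see the actual threshold at which young objects are promoted to the old generation, run the program with -XX:+PrintTenuringDistribution -XX:MaxTenuringThreshold=15 (15 is the largest value JDK 8 accepts). To print detailed heap information at each GC, turn on -XX:+PrintHeapAtGC.

  • Controlling GC

-XX:+DisableExplicitGC disables explicit GC, i.e. it prevents System.gc() calls in the program from triggering a Full GC. Another useful GC control option is -Xincgc, which makes the system perform incremental GC.

The main steps of JVM tuning are: choosing the heap size (-Xmx, -Xms), dividing it sensibly between the young and old generations (-XX:NewRatio, -Xmn, -XX:SurvivorRatio), choosing the permanent generation size (-XX:PermSize, -XX:MaxPermSize), selecting a garbage collector, and configuring that collector appropriately. Beyond that, disabling explicit GC (-XX:+DisableExplicitGC), disabling class metadata collection (-Xnoclassgc), and disabling class verification (-Xverify:none) can also help performance to some degree.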

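  • GC log examples

GC log produced with -XX:+PrintGC (the first line is an annotated reading of the second):

[GC (Allocation Failure) heap used before GC 20M -> heap used after GC (current heap capacity 90M), this GC took 0.0028389 seconds]
[GC (Allocation Failure) 20409K->432K(92672K), 0.0028389 secs]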
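The fields of such a line can be pulled apart with a regular expression. The sketch below handles only the "before->after(capacity), seconds" shape shown above; the class name and the pattern are assumptions made for this illustration, and other collectors or JDK versions print different formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogLine {
    // Matches "<before>K-><after>K(<capacity>K), <seconds> secs".
    private static final Pattern HEAP_CHANGE =
            Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs");

    final long beforeKb;
    final long afterKb;
    final long capacityKb;
    final double seconds;

    private GcLogLine(long beforeKb, long afterKb, long capacityKb, double seconds) {
        this.beforeKb = beforeKb;
        this.afterKb = afterKb;
        this.capacityKb = capacityKb;
        this.seconds = seconds;
    }

    static GcLogLine parse(String line) {
        Matcher m = HEAP_CHANGE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a recognized GC line: " + line);
        }
        return new GcLogLine(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3)), Double.parseDouble(m.group(4)));
    }

    long reclaimedKb() {
        return beforeKb - afterKb;
    }

    public static void main(String[] args) {
        GcLogLine gc = parse("[GC (Allocation Failure) 20409K->432K(92672K), 0.0028389 secs]");
        System.out.println(gc.reclaimedKb() + "K reclaimed in " + gc.seconds + " s");
    }
}
```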

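GC log for the same code with -XX:+PrintGCDetails (the first two lines are annotated readings of the log that follows):

[GC (Allocation Failure) [young generation: from 20M down to 0.4M (28M available)] whole heap from 20M down to 0.4M (90M available), 0.0151333 secs] [Times: time spent in user mode, time spent in kernel mode, wall-clock time of the GC]
young generation: total 28M, used 13M [lower bound, current upper bound, upper bound)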
[GC (Allocation Failure) [PSYoungGen: 20409K->448K(28160K)] 20409K->456K(92672K), 0.0151333 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
Heap
PSYoungGen total 28160K, used 13461K [0x00000000e1380000, 0x00000000e4a80000, 0x0000000100000000)
eden space 24576K, 52% used [0x00000000e1380000,0x00000000e20356d0,0x00000000e2b80000)
from space 3584K, 12% used [0x00000000e2b80000,0x00000000e2bf0020,0x00000000e2f00000)
to space 3584K, 0% used [0x00000000e4700000,0x00000000e4700000,0x00000000e4a80000)
ParOldGen total 64512K, used 8K [0x00000000a3a00000, 0x00000000a7900000, 0x00000000e1380000)
object space 64512K, 0% used [0x00000000a3a00000,0x00000000a3a02000,0x00000000a7900000)
Metaspace used 3264K, capacity 4494K, committed 4864K, reserved 1056768K
class space used 363K, capacity 386K, committed 512K, reserved 1048576K

For more complete heap information, use -XX:+PrintHeapAtGC, which prints the heap before and after every GC:

{Heap before GC invocations=1 (full 0):
...
Heap after GC invocations=1 (full 0):
...
}

To analyze when GCs happen, use -XX:+PrintGCTimeStamps; the time printed is the offset since the VM started:

0.174: [GC (Allocation Failure)  20409K->504K(92672K), 0.0016586 secs]
0.179: [GC (Allocation Failure) 19415K->464K(92672K), 0.0031200 secs]
0.186: [GC (Allocation Failure) 19812K->432K(92672K), 0.0009531 secs]

Because GC also pauses the application, -XX:+PrintGCApplicationConcurrentTime prints how long the application ran, and -XX:+PrintGCApplicationStoppedTime prints how long the application was paused by GC:

Application time: 0.0084849 seconds
[GC (Allocation Failure) 20409K->520K(92672K), 0.0044274 secs]
Total time for which application threads were stopped: 0.0045452 seconds, Stopping threads took: 0.0000210 seconds
Application time: 0.0033066 seconds
[GC (Allocation Failure) 19431K->440K(117248K), 0.0020202 secs]
Total time for which application threads were stopped: 0.0021438 seconds, Stopping threads took: 0.0000258 seconds
Application time: 0.0082455 seconds

To trace soft, weak, and phantom references and the finalizer queue, turn on -XX:+PrintReferenceGC. Starting the VM with -Xloggc:log/gc.log writes the GC log to the file gc.log:

Java HotSpot(TM) 64-Bit Server VM (25.111-b14) for linux-amd64 JRE (1.8.0_111-b14), built on Sep 22 2016 16:14:03 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 6052560k(316636k free), swap 6233084k(4248464k free)
CommandLine flags: -XX:InitialHeapSize=96840960 -XX:MaxHeapSize=1549455360 -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
0.183: Application time: 0.0107645 seconds
0.183: [GC (Allocation Failure) 20409K->432K(92672K), 0.0033748 secs]
0.187: Total time for which application threads were stopped: 0.0035825 seconds, Stopping threads took: 0.0000191 seconds
0.192: Application time: 0.0054269 seconds
0.193: [GC (Allocation Failure) 19343K->496K(117248K), 0.0108382 secs]
0.204: Total time for which application threads were stopped: 0.0116746 seconds, Stopping threads took: 0.0000766 seconds
0.212: Application time: 0.0084699 seconds

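Inspecting VM options:

  • -XX:+PrintVMOptions: prints the explicit command-line options the VM received
  • -XX:+PrintCommandLineFlags: prints the explicit and implicit options passed to the VM
  • -XX:+PrintFlagsFinal: prints the values of all VM options

# Print the heap, PermGen, and thread stack size settings
java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'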

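Minor GC, Major GC, and Full GC

  • Minor GC: collects garbage from the young generation. It is triggered when the JVM cannot allocate a new object, i.e. when the eden space is full
  • Major GC: cleans the tenured space
  • Full GC: cleans the whole heap, both the young and the tenured spaces

Default Garbage Collector by Java Version

Reference 1 says: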

On server-class machines running the server VM, the garbage collector (GC) has changed from the previous serial collector […] to a parallel collector

Reference 2 says:

Starting with J2SE 5.0, when an application starts up, the launcher can attempt to detect whether the application is running on a “server-class” machine and, if so, use the Java HotSpot Server Virtual Machine (server VM) instead of the Java HotSpot Client Virtual Machine (client VM).

Also, reference 2 says:

Note: For Java SE 6, the definition of a server-class machine is one with at least 2 CPUs and at least 2GB of physical memory.

Java 7 and Java 8 both use the Parallel GC by default; Java 9 uses the G1 collector.
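Which collectors a running JVM actually uses can be checked from inside the process through the standard java.lang.management API. The sketch below is one way to do that; the class name is invented for this illustration, and the bean names depend on the JVM (HotSpot typically reports names such as "PS Scavenge" and "PS MarkSweep" for the Parallel GC, or "G1 Young Generation" and "G1 Old Generation" for G1).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ActiveCollectors {
    // Returns the names of the collectors registered in this JVM.
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```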

JVM Operating Modes

  • java -version: shows the Server VM
  • java -client -version: shows the Client VM

The defaults of many options can differ considerably between Client and Server mode.

Heap Memory Best Practices

  • Are too many instances being allocated? Use jcmd 8998 GC.class_histogram to see how many instances of each class exist; jmap -histo 8998 gives the same result
  • Analyzing heap dumps: use tools such as jhat, jvisualvm, or mat to analyze hprof files
    • jcmd 8998 GC.heap_dump /path/to/heap_dump.hprof
    • jmap -dump:live,file=/path/to/heap_dump.hprof 8998: adding live forces a full GC

Common Java Monitoring Tools

jstack: dumps the stacks of a Java process

jstack $PID > $DATE_DIR/jstack-$PID.dump 2>&1
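When attaching jstack is not an option, a rough in-process equivalent can be produced with Thread.getAllStackTraces(). This is a sketch with an invented class name; its output format is an assumption of this example and carries less detail than a real jstack dump (no lock or monitor information).

```java
import java.util.Map;

public class ThreadDumpSketch {
    // Builds a text dump of every live thread and its current stack frames.
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            out.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```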

jinfo: Provides visibility into the system properties of the JVM, and allows some system properties to be set dynamically.

root@zk-pc:~# jinfo 18772
Attaching to process ID 18772, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.144-b01
Java System Properties:

com.sun.management.jmxremote.authenticate = false
java.runtime.name = Java(TM) SE Runtime Environment
java.vm.version = 25.144-b01
... (many lines omitted)

VM Flags:
Non-default VM flags: -XX:CICompilerCount=3 -XX:InitialHeapSize=98566144 -XX:+ManagementServer -XX:MaxHeapSize=1549795328 -XX:MaxNewSize=516423680 -XX:MinHeapDeltaBytes=524288 -XX:NewSize=32505856 -XX:OldSize=66060288 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseParallelGC
Command line: -Dcom.sun.management.jmxremote.port=5780 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -javaagent:/usr/lib/intellij_idea/idea-IC-172.3968.16/lib/idea_rt.jar=35487:/usr/lib/intellij_idea/idea-IC-172.3968.16/bin -Dfile.encoding=UTF-8

jstat: provides information about GC and class-loading activity

List the nine available options:

jstat -options

One useful option is -gcutil, which displays the time spent in GC as well as the percentage of each GC area that is currently filled. Other options to jstat will display the GC sizes in terms of KB.

Remember that jstat takes an optional argument—the number of milliseconds to repeat the command—so it can monitor over time the effect of GC in an application.

jstat -gcutil process_id 1000

The output is:

root@zk-pc:~# jstat -gcutil 18772
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 71.53 97.93 34.02 96.70 93.37 29 0.133 1 0.040 0.172

The -gccapacity option shows the capacity and usage of the generations in VM memory (young, old, and perm; metaspace on Java 8):

jstat -gccapacity process_id

The output is:

root@zk-pc:~# jstat -gccapacity 18772
NGCMN NGCMX NGC S0C S1C EC OGCMN OGCMX OGC OC MCMN MCMX MC CCSMN CCSMX CCSC YGC FGC
31744.0 504320.0 30720.0 4608.0 4608.0 21504.0 64512.0 1009152.0 44032.0 44032.0 0.0 1069056.0 22272.0 0.0 1048576.0 2560.0 32 1

jmap (Memory Map): Provides heap dumps and other information about JVM memory usage.

jmap $PID

The output looks like this:

root@zk-pc:~# jmap 18772
Attaching to process ID 18772, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.144-b01
0x0000000000400000 7K /usr/lib/jvm/oracle_jdk8/jdk1.8.0_144/bin/java
0x00007f7072978000 98K /lib/x86_64-linux-gnu/libresolv-2.23.so
0x00007f7072b93000 26K /lib/x86_64-linux-gnu/libnss_dns-2.23.so
0x00007f7072d9a000 10K /lib/x86_64-linux-gnu/libnss_mdns4_minimal.so.2
0x00007f70737a1000 87K /lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007f70739b7000 251K /usr/lib/jvm/oracle_jdk8/jdk1.8.0_144/jre/lib/amd64/libsunec.so
... (many lines omitted)

Print a histogram of the Java object heap; if the “live” suboption is specified, only live objects are counted:

jmap -histo $PID
jmap -histo:live $PID

root@zk-pc:~# jmap -F -histo 18772
Object Histogram:

num #instances #bytes Class description
--------------------------------------------------------------------------
1: 65711 10183976 char[]
2: 13523 8919400 byte[]
3: 54732 2159368 java.lang.Object[]
4: 7341 1451792 int[]
5: 56423 1354152 java.lang.String
6: 15476 619040 java.util.TreeMap$Entry
7: 16562 529984 java.io.ObjectStreamClass$WeakClassKey
8: 11915 476600 java.util.LinkedHashMap$Entry
9: 9716 466368 java.util.HashMap
10: 3993 453312 java.lang.Class
11: 11568 370176 java.util.concurrent.ConcurrentHashMap$Node
12: 6160 306952 java.util.HashMap$Node[]
13: 4210 279856 java.util.Hashtable$Entry[]
14: 8320 266240 java.util.Vector
15: 8070 258240 java.util.HashMap$Node
16: 10495 251880 org.jsoup.nodes.Attribute
17: 4181 200688 java.util.Hashtable
... (many lines omitted)

Print java heap summary:

jmap -heap $PID

The output looks like this:

root@zk-pc:~# jmap -heap 18772
Attaching to process ID 18772, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.144-b01

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
MinHeapFreeRatio = 0
MaxHeapFreeRatio = 100
MaxHeapSize = 1549795328 (1478.0MB)
NewSize = 32505856 (31.0MB)
MaxNewSize = 516423680 (492.5MB)
OldSize = 66060288 (63.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
capacity = 23068672 (22.0MB)
used = 11772712 (11.227333068847656MB)
free = 11295960 (10.772666931152344MB)
51.03333213112571% used
From Space:
capacity = 11010048 (10.5MB)
used = 2035424 (1.941131591796875MB)
free = 8974624 (8.558868408203125MB)
18.48696754092262% used
To Space:
capacity = 11534336 (11.0MB)
used = 0 (0.0MB)
free = 11534336 (11.0MB)
0.0% used
PS Old Generation
capacity = 45088768 (43.0MB)
used = 13718432 (13.082916259765625MB)
free = 31370336 (29.917083740234375MB)
30.42538665061773% used

8999 interned Strings occupying 836656 bytes.

Best Practices for Using Heap Memory

Heap Analysis

(1) View the histogram

# jcmd GC.class_histogram performs a full GC by default
jcmd 6808 GC.class_histogram
jmap -histo 6808
# With the :live suboption, jmap forces a full GC first
jmap -histo:live 6808

 num     #instances         #bytes  class name
----------------------------------------------
1: 12227 1303424 [C
2: 1003 627856 [B
3: 1917 461864 [I
4: 3828 421768 java.lang.Class
5: 11665 279960 java.lang.String
6: 6065 194080 java.util.concurrent.ConcurrentHashMap$Node
7: 2794 173144 [Ljava.lang.Object;
8: 3072 122880 org.apache.lucene.index.FreqProxTermsWriter$PostingList
9: 2760 110400 java.util.LinkedHashMap$Entry
10: 1097 101144 [Ljava.util.HashMap$Node;
11: 5440 87040 java.lang.Object
12: 2680 85760 java.util.HashMap$Node
13: 520 45760 java.lang.reflect.Method
14: 44 44064 [Ljava.util.concurrent.ConcurrentHashMap$Node;
15: 781 43736 java.util.LinkedHashMap
16: 96 41088 [Lorg.apache.lucene.index.RawPostingList;
...

(2) Dump the heap

# With live, jmap forces a full GC first
jmap -dump:live,file=/tmp/heap_dump.hprof 6808
# or
jmap -F -dump:format=b,file=filename.hprof 20961
# or, more simply
jmap -F -dump:file=filename.hprof 20961

Note: always specify the path explicitly; otherwise it is hard to tell where the file was saved.

Three tools are commonly used to analyze .hprof files:

  • jhat
  • jvisualvm
  • mat

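(3) Out of memory

An out-of-memory error usually occurs when:

  • Native memory is exhausted
  • PermGen (Java 7) or metaspace (Java 8) is exhausted
  • The Java heap is exhausted
  • The JVM spends too much time in GC

Using Less Memory

(1) Reduce object size

(2) Lazy initialization
(3) Immutable objects
(4) String interning

Object Lifecycle Management

JIT

(1) Compiled or interpreted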

Languages like C++ and Fortran are called compiled languages because their programs are delivered as binary (compiled) code: the program is written, and then a static compiler produces a binary. The assembly code in that binary is targeted to a particular CPU. Complementary CPUs can execute the same binary: for example, AMD and Intel CPUs share a basic, common set of assembly language instructions, and later versions of CPUs almost always can execute the same set of instructions as previous versions of that CPU.

Languages like PHP and Perl, on the other hand, are interpreted. The same program source code can be run on any CPU as long as the machine has the correct interpreter (that is, the program called php or perl). The interpreter translates each line of the program into binary code as that line is executed.

Java attempts to find a middle ground here. Java applications are compiled, but instead of being compiled into a specific binary for a specific CPU, they are compiled into an idealized assembly language. This assembly language (known as Java bytecodes) is then run by the java binary (in the same way that an interpreted PHP script is run by the php binary). This gives Java the platform independence of an interpreted language. Because it is executing an idealized binary code, the java program is able to compile the code into the platform binary as the code executes. This compilation occurs as the program is executed: it happens “just in time.”

(2) What the name HotSpot means

In a typical program, only a small subset of code is executed frequently, and the performance of an application depends primarily on how fast those sections of code are executed. These critical sections are known as the hot spots of the application; the more the section of code is executed, the hotter that section is said to be.

Hence, when the JVM executes code, it does not begin compiling the code immediately. There are two basic reasons for this. First, if the code is going to be executed only once, then compiling it is essentially a wasted effort; it will be faster to interpret the Java bytecodes than to compile them and execute (only once) the compiled code.

Second, the more times that the JVM executes a particular method or loop, the more information it has about that code. This allows the JVM to make a number of optimizations when it compiles the code.

(3) Registers and memory

If the value of sum were to be retrieved from (and stored back to) main memory on every iteration of this loop, performance would be dismal. Instead, the compiler will load a register with the initial value of sum, perform the loop using that value in the register, and then (at an indeterminate point in time) store the final result from the register back to main memory.

Register usage is a general optimization of the compiler, and when escape analysis is enabled (see the end of this chapter), register use is quite aggressive.

(4) Choosing a Java compiler

  • A 32-bit client version (-client)
  • A 32-bit server version (-server)
  • A 64-bit server version (-d64)

For the sake of compatibility, the argument specifying which compiler to use is not rigorously followed. If you have a 64-bit JVM and specify -client, the application will use the 64-bit server compiler anyway. If you have a 32 bit JVM and you specify -d64, you will get an error that the given instance does not support a 64-bit JVM.

The client compiler begins compiling sooner than the server compiler does. Code produced by the server compiler will be faster than that produced by the client compiler. Couldn’t the JVM start with the client compiler, and then use the server compiler as code gets hotter? That technique is known as tiered compilation. With tiered compilation, code is first compiled by the client compiler; as it becomes hot, it is recompiled by the server compiler.

# Must be enabled explicitly on Java 7; enabled by default on Java 8
-server -XX:+TieredCompilation

  • For GUI programs, use the client compiler by default. Performance is often all about perception: if the initial startup seems faster, and everything else seems fine, users will tend to view the program that has started faster as being faster overall.
  • For long-running applications, always choose the server compiler, preferably in conjunction with tiered compilation.

Check the default compiler:

java -version

(5) Further considerations

Code cache: When the JVM compiles code, it holds the set of assembly-language instructions in the code cache. The code cache has a fixed size, and once it has filled up, the JVM is not able to compile any additional code.


Compilation thresholds: The major factor involved here is how often the code is executed; once it is executed a certain number of times, its compilation threshold is reached, and the compiler deems that it has enough information to compile the code.

Compilation is based on two counters in the JVM: the number of times the method has been called, and the number of times any loops in the method have branched back. When the JVM executes a Java method, it checks the sum of those two counters and decides whether or not the method is eligible for compilation. This kind of compilation has no official name but is often called standard compilation.

But what if the method has a really long loop—or one that never exits and provides all the logic of the program? In that case, the JVM needs to compile the loop without waiting for a method invocation. So every time the loop completes an execution, the branching counter is incremented and inspected. If the branching counter has exceeded its individual threshold, then the loop (and not the entire method) becomes eligible for compilation.

This kind of compilation is called on-stack replacement (OSR), because even if the loop is compiled, that isn’t sufficient: the JVM has to have the ability to start executing the compiled version of the loop while the loop is still running. When the code for the loop has finished compiling, the JVM replaces the code (on-stack), and the next iteration of the loop will execute the much-faster compiled version of the code.

Standard compilation is triggered by the value of the -XX:CompileThreshold=N flag. The default value of N for the client compiler is 1,500; for the server compiler it is 10,000.


To watch the compilation process, use -XX:+PrintCompilation.
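A small program with one long-running loop gives the flag something to show. The class below is an invented example: running it with java -XX:+PrintCompilation OsrLoopDemo should list compilation events, and on HotSpot an OSR compilation is marked with a % in that output. Whether and when the loop is compiled depends on the JVM version and its thresholds, so treat the log as something to explore rather than a fixed result.

```java
public class OsrLoopDemo {
    // A single call containing a long loop: a candidate for on-stack replacement.
    static long sumUpTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumUpTo(1_000_000)); // prints 499999500000
    }
}
```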

jstat has two options to provide information about the compiler. The -compiler option supplies summary information about how many methods have been compiled (here 5003 is the process ID of the program to be inspected):

jstat -compiler 5003

Alternately, you can use the -printcompilation option to get information about the last method that is compiled. In this example, jstat repeats the information for process ID 5003 every second (1,000 ms):

jstat -printcompilation 5003 1000

Number of compilation threads:


Inlining:

One of the most important optimizations the compiler makes is to inline methods.

public class Point {
    private int x, y;
    public int getX() { return x; }
    public void setX(int i) { x = i; }
}

When you write code like this:

Point p = getPoint();
p.setX(p.getX() * 2);

the compiled code will essentially execute:

Point p = getPoint();
p.x = p.x * 2;

The basic decision about whether to inline a method depends on how hot it is and how big it is. The JVM determines if a method is hot (i.e., called frequently) based on an internal calculation; it is not directly subject to any tunable parameters. If a method is eligible for inlining because it is called frequently, then it will be inlined only if its bytecode size is less than 325 bytes (or whatever is specified as the -XX:MaxFreqInlineSize=N flag). Otherwise, it is eligible for inlining only if it is small: less than 35 bytes (or whatever is specified as the -XX:MaxInlineSize=N flag).


Escape analysis:

The server compiler performs some very aggressive optimizations if escape analysis is enabled (-XX:+DoEscapeAnalysis, which is on by default).

import java.math.BigInteger;

public class Factorial {
    private BigInteger factorial;
    private int n;
    public Factorial(int n) {
        this.n = n;
    }
    public synchronized BigInteger getFactorial() {
        if (factorial == null)
            factorial = ...;
        return factorial;
    }
}

The factorial object is referenced only inside that loop; no other code can ever access that object. Hence, the JVM is free to perform a number of optimizations on that object:

  • It needn’t get a synchronization lock when calling the getFactorial() method.
  • It needn’t store the field n in memory; it can keep that value in a register. Similarly it can store the factorial object reference in a register.
  • In fact, it needn’t allocate an actual factorial object at all; it can just keep track of the individual fields of the object.

(6) Deoptimization

Deoptimization means that the compiler has to undo some earlier optimization; the effect is that the performance of the application will be reduced, at least until the compiler can recompile the code in question. There are two cases of deoptimization: when code is “made not entrant,” and when code is “made zombie”.


Not Entrant Code:

There are two things that cause code to be made not entrant. One is due to the way classes and interfaces work, and one is an implementation detail of tiered compilation

StockPriceHistory sph;
String log = request.getParameter("log");
if (log != null && log.equals("true")) {
    sph = new StockPriceHistoryLogger(...);
}
else {
    sph = new StockPriceHistoryImpl(...);
}
// Then the JSP makes calls to:
sph.getHighPrice();
sph.getStdDev();
// and so on

If a bunch of calls are made to http://localhost:8080/StockServlet (that is, without the log parameter), the compiler will see that the actual type of the sph object is StockPriceHistoryImpl. It will then inline code and perform other optimizations based on that knowledge. Later, say a call is made to http://localhost:8080/StockServlet?log=true. Now the assumption the compiler made regarding the type of the sph object is false; the previous optimizations are no longer valid. This generates a deoptimization trap, and the previous optimizations are discarded. If a lot of additional calls are made with logging enabled, the JVM will quickly end up compiling that code and making new optimizations.

In tiered compilation, code is compiled by the client compiler, and then later compiled by the server compiler (and actually it’s a little more complicated than that, as discussed in the next section). When the code compiled by the server compiler is ready, the JVM must replace the code compiled by the client compiler. It does this by marking the old code as not entrant and using the same mechanism to substitute the newly compiled (and more efficient) code.


Deoptimizing Zombie Code:

Recall that the compiled code is held in a fixed-size code cache; when zombie methods are identified, it means that the code in question can be removed from the code cache, making room for other classes to be compiled (or limiting the amount of memory the JVM will need to allocate later).

The possible downside here is that if the code for the class is made zombie and then later reloaded and heavily used again, the JVM will need to recompile and reoptimize the code.

TODO

Remote JVisualVM

Running jstatd on the remote machine fails with:

Could not create remote object
access denied ("java.util.PropertyPermission" "java.rmi.server.ignoreSubClasses" "write")
java.security.AccessControlException: access denied ("java.util.PropertyPermission" "java.rmi.server.ignoreSubClasses" "write")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
at java.security.AccessController.checkPermission(AccessController.java:884)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
at java.lang.System.setProperty(System.java:792)
at sun.tools.jstatd.Jstatd.main(Jstatd.java:139)

You need to create a security policy file, jstatd.all.policy, containing this line:

grant codebase "file:/opt/java/jdk1.7.0_21/lib/tools.jar" { permission java.security.AllPermission; };

Then restart jstatd with:

jstatd -J-Djava.security.policy=/home/user/jstatd.all.policy

From your local machine, test whether you can telnet to the jstatd service:

telnet 10.108.112.218 1099

Sometimes jstatd does not bind to the correct network interface:

-J-Djava.rmi.server.hostname=10.1.1.123

Force IPv4:

-J-Djava.net.preferIPv4Stack=true

Log RMI calls for troubleshooting:

-J-Djava.rmi.server.logCalls=true

The final command:

jstatd -J-Djava.security.policy=./jstatd.all.policy -J-Djava.rmi.server.hostname=10.108.112.218 -J-Djava.rmi.server.logCalls=true

GC Log Analysis Tools

What to Dump

The following is copied from Dubbo's dump.sh:

DUMP_DATE=`date +%Y%m%d%H%M%S`
DATE_DIR=$DUMP_DIR/$DUMP_DATE

echo -e "Dumping the $SERVER_NAME ...\c"
for PID in $PIDS ; do
    jstack $PID > $DATE_DIR/jstack-$PID.dump 2>&1
    echo -e ".\c"
    jinfo $PID > $DATE_DIR/jinfo-$PID.dump 2>&1
    echo -e ".\c"
    jstat -gcutil $PID > $DATE_DIR/jstat-gcutil-$PID.dump 2>&1
    echo -e ".\c"
    jstat -gccapacity $PID > $DATE_DIR/jstat-gccapacity-$PID.dump 2>&1
    echo -e ".\c"
    jmap $PID > $DATE_DIR/jmap-$PID.dump 2>&1
    echo -e ".\c"
    jmap -heap $PID > $DATE_DIR/jmap-heap-$PID.dump 2>&1
    echo -e ".\c"
    jmap -histo $PID > $DATE_DIR/jmap-histo-$PID.dump 2>&1
    echo -e ".\c"
    if [ -r /usr/sbin/lsof ]; then
        /usr/sbin/lsof -p $PID > $DATE_DIR/lsof-$PID.dump
        echo -e ".\c"
    fi
done

if [ -r /bin/netstat ]; then
    /bin/netstat -an > $DATE_DIR/netstat.dump 2>&1
    echo -e ".\c"
fi
if [ -r /usr/bin/iostat ]; then
    /usr/bin/iostat > $DATE_DIR/iostat.dump 2>&1
    echo -e ".\c"
fi
if [ -r /usr/bin/mpstat ]; then
    /usr/bin/mpstat > $DATE_DIR/mpstat.dump 2>&1
    echo -e ".\c"
fi
if [ -r /usr/bin/vmstat ]; then
    /usr/bin/vmstat > $DATE_DIR/vmstat.dump 2>&1
    echo -e ".\c"
fi
if [ -r /usr/bin/free ]; then
    /usr/bin/free -t > $DATE_DIR/free.dump 2>&1
    echo -e ".\c"
fi
if [ -r /usr/bin/sar ]; then
    /usr/bin/sar > $DATE_DIR/sar.dump 2>&1
    echo -e ".\c"
fi
if [ -r /usr/bin/uptime ]; then
    /usr/bin/uptime > $DATE_DIR/uptime.dump 2>&1
    echo -e ".\c"
fi

As the script shows, the following items are typically collected:

  • jstack: thread information
  • jinfo: configuration information, including Java system properties and Java Virtual Machine (JVM) command-line flags.
  • jstat -gcutil: garbage collection statistics
  • jstat -gccapacity: displays statistics about the capacities of the generations and their corresponding spaces.
  • jmap: prints shared object memory maps or heap memory details for a process, core file, or remote debug server.
  • jmap -heap: prints a heap summary: the garbage collector used, the heap configuration, and generation-wise heap usage. In addition, the number and size of interned Strings are printed.
  • jmap -histo: prints a histogram of the heap
  • lsof -p: lists the files opened by the process
  • netstat -an: lists all sockets with numeric addresses
  • iostat: reports Central Processing Unit (CPU) statistics and input/output statistics for devices, partitions and network filesystems (NFS).
  • mpstat: reports processor-related statistics.
  • vmstat: vmstat (virtual memory statistics) is a system monitoring tool that collects and displays summary information about operating system memory, processes, interrupts, paging and block I/O.
  • free -t: displays the amount of free and used memory in the system. -t adds a line showing the column totals.
  • sar: sar (System Activity Report) is a Unix System V-derived system monitor command used to report on various system loads, including CPU activity, memory/paging, device load, and network. Linux distributions provide sar through the sysstat package.
  • uptime: gives a one-line display of the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

How can you observe a running JVM clearly in practice?

  • GUI tools: JProfiler, JConsole, Java VisualVM
  • Command-line tools: jps, jstack, jmap, jhat, jstat

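How to advance your JVM knowledge

Q: How do I advance my JVM knowledge? I have read the second edition of Zhou Zhiming's 《深入理解JVM》 (Understanding the JVM in Depth) twice and can recite most of its content from the table of contents. What else do I need to learn?

A: Zhou Zhiming's book is only an introductory JVM text. Next, read The Java Virtual Machine Specification. Much of Zhou's book is drawn from it, but the specification itself is more detailed; be sure to read the original English edition. Then read Oracle's documents: "Hotspot Memory Management white paper" and "Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning Guide". After that, go deeper into memory management, for example with 《垃圾回收算法与实现》 (Garbage Collection: Algorithms and Implementations); if that is still not enough, read The Garbage Collection Handbook: The Art of Automatic Memory Management. By then you will have a solid grasp of the theory, and it is time to download the HotSpot source, build it, and step through it with breakpoints. At that point I recommend 《揭秘Java虚拟机》 (Revealing the Java Virtual Machine), published this year by an Alibaba engineer, although reading Computer Systems: A Programmer's Perspective first will make it more rewarding. Beyond that, you explore on your own; I am still exploring too.

CPU usage is high in production while memory usage is low. Is there a quick way to find the cause?

Here is a script. On Linux, save it as a .sh file and run it directly.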

#!/bin/sh
ts=$(date +"%s")
jvmPid=$1
defaultLines=100
defaultTop=20

threadStackLines=${2:-$defaultLines}
topThreads=${3:-$defaultTop}

jvmCapture=$(top -b -n1 | grep java )
threadsTopCapture=$(top -b -n1 -H | grep java )
jstackOutput=$(echo "$(jstack $jvmPid )" )
topOutput=$(echo "$(echo "$threadsTopCapture" | head -n $topThreads | perl -pe 's/\e\[?.*?[\@-~] ?//g' | awk '{gsub(/^ +/,"");print}' | awk '{gsub(/ +|[+-]/," ");print}' | cut -d " " -f 1,9 )\n ")

echo "*************************************************************************************************************"

uptime

echo "Analyzing top $topThreads threads"

echo "*************************************************************************************************************"

printf %s "$topOutput" | while IFS= read line
do
    pid=$(echo $line | cut -d " " -f 1)
    hexapid=$(printf "%x" $pid)
    cpu=$(echo $line | cut -d " " -f 2)
    echo -n $cpu"% [$pid] "
    echo "$jstackOutput" | grep "tid.*0x$hexapid " -A $threadStackLines | sed -n -e '/0x'$hexapid'/,/tid/ p' | head -n -1
    echo "\n"
done

echo "\n"

In short, the script prints the JVM's busiest threads, ordered by CPU usage, each followed by its stack trace from jstack.
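The same ranking can also be sketched from inside a JVM. The example below is only an illustration, not part of the script above: it uses the standard ThreadMXBean API, and the class name BusyThreads and its Row type are made up for this sketch. Note that it reports accumulated CPU time for threads of the current JVM, whereas the shell script samples instantaneous CPU usage of another process from the outside.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BusyThreads {
    /** One report row: thread name and accumulated CPU time in nanoseconds. */
    static final class Row {
        final String name;
        final long cpuNanos;
        Row(String name, long cpuNanos) { this.name = name; this.cpuNanos = cpuNanos; }
        long cpuNanos() { return cpuNanos; }
    }

    /** Returns up to {@code limit} live threads of this JVM, busiest first. */
    static List<Row> top(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Row> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return rows; // CPU time measurement is unavailable on this JVM
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);      // -1 if the thread is no longer alive
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread is no longer alive
            if (cpu >= 0 && info != null) {
                rows.add(new Row(info.getThreadName(), cpu));
            }
        }
        rows.sort(Comparator.comparingLong(Row::cpuNanos).reversed());
        return rows.subList(0, Math.min(limit, rows.size()));
    }

    public static void main(String[] args) {
        for (Row r : top(20)) {
            System.out.printf("%8d ms  %s%n", r.cpuNanos / 1_000_000, r.name);
        }
    }
}
```

For a remote process, the same bean can be reached over JMX, which is what tools such as JConsole do.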

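Q: Hello, I have a fairly basic JVM question. Today's garbage collectors are all generational. When the young generation is collected, does the old generation have to be scanned as well? Is it a full scan, or is there some strategy, such as G1's remembered set, which only records reference relationships? For other generational combinations, such as CMS with ParNew, is scanning the old generation the only option when collecting the young generation? Wouldn't that reduce efficiency considerably?

A: For references from the old generation into the young generation, the JVM maintains a data structure called the card table. A young collection therefore does not need to traverse the whole old generation; it only needs to walk the card table and scan the old-generation regions that it marks as dirty.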
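To make the idea concrete, here is a toy model of a card table. It is a sketch under simplifying assumptions, not HotSpot's implementation: the old generation is treated as a flat range of byte offsets, 512-byte cards are assumed, and the names CardTableSketch, markDirty and drainDirtyCards are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy card table: the old generation is split into fixed-size cards, one flag byte per card. */
final class CardTableSketch {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards (assumed for this sketch)

    private final byte[] cards;

    CardTableSketch(int oldGenBytes) {
        int cardSize = 1 << CARD_SHIFT;
        cards = new byte[(oldGenBytes + cardSize - 1) >>> CARD_SHIFT];
    }

    /** Write barrier: run on every reference store into an old-generation object. */
    void markDirty(int oldGenOffset) {
        cards[oldGenOffset >>> CARD_SHIFT] = 1;
    }

    /** Young collection: returns the cards whose memory must be scanned, then cleans them. */
    List<Integer> drainDirtyCards() {
        List<Integer> dirty = new ArrayList<>();
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] != 0) {
                dirty.add(i);
                cards[i] = 0;
            }
        }
        return dirty;
    }

    int cardCount() {
        return cards.length;
    }
}
```

With a 4096-byte old generation the table has 8 cards; after stores at offsets 0, 511 and 1024, only cards 0 and 2 need scanning, so the cost of a young collection depends on how many cards were written to rather than on the size of the old generation.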

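To track down a JVM out-of-memory error in production, is there any other approach besides printing stack traces and analyzing them?

A: Export a JVM heap dump and analyze it locally with MAT (the Eclipse Memory Analyzer plugin). Visual analysis is the most convenient, intuitive, and effective approach.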
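Besides running jmap against the process, a heap dump can also be triggered from inside the application. The sketch below assumes a HotSpot-based JVM, because it relies on the com.sun.management.HotSpotDiagnosticMXBean API; the class name HeapDumper and the output path are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

final class HeapDumper {
    /**
     * Writes a heap dump of the current JVM to {@code path}.
     * With live = true only reachable objects are dumped.
     */
    static File dump(String path, boolean live) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(path, live); // fails if the target file already exists
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new File(path);
    }

    public static void main(String[] args) {
        // Hypothetical default location; pass a path as the first argument to override it.
        String path = args.length > 0 ? args[0] : "app-heap.hprof";
        System.out.println("Wrote " + dump(path, true).getAbsolutePath());
    }
}
```

The resulting .hprof file can be opened in MAT. Keep in mind that dumping a large heap pauses the application while the file is written.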

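How to choose a garbage collector

  • To minimize memory footprint and parallelism overhead, choose Serial GC
  • To maximize application throughput, choose Parallel GC
  • To minimize GC interruptions or pause times, choose CMS GC

Concurrency and parallelism both describe two or more tasks making progress together, but the emphasis differs. Concurrency is about tasks taking turns; at any instant they may still be running one at a time. Parallelism means the tasks truly execute at the same moment.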
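After picking a collector with a flag such as -XX:+UseSerialGC, -XX:+UseParallelGC or -XX:+UseConcMarkSweepGC (the last one only exists on older JDKs), it is worth confirming which collectors the JVM actually ended up with. A minimal sketch using the standard GarbageCollectorMXBean API follows; the class name ActiveCollectors is made up for the example, and the bean names printed vary by JVM version and collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class ActiveCollectors {
    /** Names of the collectors registered in this JVM, typically one per generation. */
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount/getCollectionTime return -1 when the value is undefined.
            System.out.printf("%-28s collections=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running the same program under different flags shows how the choice maps to young and old generation collectors, and the counters give a rough picture of how much time each one has cost so far.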

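Memory leak code example

// Leaks when Key does not override equals() (assumed here; Key and the map m are
// not shown): containsKey() never finds a match, so every pass adds new entries.
while (true) {
    for (int i = 0; i < 10000; i++) {
        if (!m.containsKey(new Key(i))) {
            m.put(new Key(i), "Number:" + i);
        }
    }
}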
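The fragment does not show Key or the map, so the following self-contained version fills them in as assumptions: LeakyKey defines hashCode() but not equals(), which is one common way this kind of leak arises, and FixedKey adds the missing equals(). The loop is bounded so that the effect can be observed without exhausting the heap.

```java
import java.util.HashMap;
import java.util.Map;

final class LeakDemo {
    /** hashCode() without equals(): two keys with the same id are never equal. */
    static final class LeakyKey {
        final int id;
        LeakyKey(int id) { this.id = id; }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    /** Same key with equals() defined, so lookups find the existing entry. */
    static final class FixedKey {
        final int id;
        FixedKey(int id) { this.id = id; }
        @Override public int hashCode() { return Integer.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof FixedKey && ((FixedKey) o).id == id;
        }
    }

    /** Map size after {@code rounds} passes over {@code keys} ids using the leaky key. */
    static int leakySize(int rounds, int keys) {
        Map<LeakyKey, String> m = new HashMap<>();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < keys; i++) {
                if (!m.containsKey(new LeakyKey(i))) {
                    m.put(new LeakyKey(i), "Number:" + i);
                }
            }
        }
        return m.size();
    }

    /** Same loop with the fixed key: the map stops growing after the first pass. */
    static int fixedSize(int rounds, int keys) {
        Map<FixedKey, String> m = new HashMap<>();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < keys; i++) {
                if (!m.containsKey(new FixedKey(i))) {
                    m.put(new FixedKey(i), "Number:" + i);
                }
            }
        }
        return m.size();
    }
}
```

With 3 rounds over 100 ids, leakySize returns 300 while fixedSize returns 100: the leaky map grows with every pass, which is exactly what the endless loop turns into an OutOfMemoryError.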

Interned Strings

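The String constant pool is a special case. There are two main ways to use it:

  • A String declared directly with double quotes (a literal) is stored in the constant pool.
  • For a String that was not declared as a literal, call the intern method that String provides. intern first checks whether an equal string already exists in the constant pool; if not, it adds the current string to the pool.

In JDK 6 the string constant pool lives in the Perm area, whose default size is only 4 MB. Starting with JDK 7, it lives in the heap.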
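The difference between the two cases can be checked with reference comparison. This is a small illustrative sketch; the class name InternDemo and the sample string are arbitrary.

```java
final class InternDemo {
    private static final char[] CHARS = {'j', 'v', 'm', '-', 't', 'u', 'n', 'i', 'n', 'g'};

    /** A string built at run time is a separate object from the pooled literal. */
    static boolean sameBeforeIntern() {
        String literal = "jvm-tuning";   // stored in the constant pool
        String built = new String(CHARS); // ordinary heap object with equal content
        return built == literal;
    }

    /** intern() returns the pooled instance, so the references match. */
    static boolean sameAfterIntern() {
        String literal = "jvm-tuning";
        String built = new String(CHARS);
        return built.intern() == literal;
    }

    public static void main(String[] args) {
        System.out.println(sameBeforeIntern());
        System.out.println(sameAfterIntern());
    }
}
```

sameBeforeIntern() is false and sameAfterIntern() is true. Application code should still compare strings with equals(); the == comparison is used here only to make object identity visible.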

MAT

1) The Dominator Tree:

The key to understanding your retained heap is the dominator tree. The dominator tree is derived from the complex object graph in your system and lets you identify the largest memory graphs. An object X is said to dominate an object Y if every path from the root to Y must pass through X.
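The definition can be tried out on a tiny object graph. The sketch below uses the textbook iterative data-flow algorithm for dominator sets; it is not taken from MAT, it assumes every node is reachable from node 0 (the root), and the class name Dominators is made up for the example.

```java
import java.util.BitSet;

final class Dominators {
    /**
     * edges[i] lists the objects that object i references; node 0 is the root.
     * Returns dom, where dom[y].get(x) is true when x dominates y.
     */
    static BitSet[] compute(int[][] edges) {
        int n = edges.length;
        BitSet[] dom = new BitSet[n];
        for (int i = 0; i < n; i++) {
            dom[i] = new BitSet(n);
            if (i == 0) {
                dom[i].set(0);      // the root is dominated only by itself
            } else {
                dom[i].set(0, n);   // start from "everything" and narrow down
            }
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int y = 1; y < n; y++) {
                BitSet next = new BitSet(n);
                next.set(0, n);
                // Intersect the dominator sets of all objects that reference y.
                for (int p = 0; p < n; p++) {
                    for (int target : edges[p]) {
                        if (target == y) {
                            next.and(dom[p]);
                        }
                    }
                }
                next.set(y); // every node dominates itself
                if (!next.equals(dom[y])) {
                    dom[y] = next;
                    changed = true;
                }
            }
        }
        return dom;
    }
}
```

For the graph 0→1, 0→2, 1→3, 2→3, 3→4, object 3 dominates object 4, because every path from the root to 4 goes through 3. Object 1 does not dominate 3, because 3 can also be reached through 2. In heap terms, freeing 3 would free 4, so 4 counts toward the retained size of 3.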

https://javaeesupportpatterns.blogspot.jp/2013/03/openjpa-memory-leak-case-study.html

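JVM diagnosis examples

1) A healthy JVM:

2) Memory surge at startup:

3) Spike:

4) Memory leak

JVisualVM

You need to install the Visual GC plugin:

Only then can it display the GC process in detail: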

TODO

https://plumbr.eu/handbook/garbage-collection-algorithms-implementations#serial-minor-gc

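How to debug in production with BTrace

Most problems are solved either by setting breakpoints locally or by adding log output in a test environment. This approach is simple and direct, but the process is tedious: it takes repeated redeployments and restarts, and there is no guarantee of finding the root cause on the first attempt.

BTrace is a dynamic, safe tracing (monitoring) tool for Java released by Sun. It can monitor a running system without a restart and makes it easy to obtain runtime data such as method arguments, return values, global variables, and stack traces, with minimal intrusion and minimal use of system resources.

Because BTrace injects the script logic directly into the running code, it places many restrictions on what a script may do:

  1. No object creation
  2. No arrays
  3. No throwing or catching exceptions
  4. No loops
  5. No synchronized keyword
  6. Fields and methods must be declared static

According to the official statement, improper use of BTrace can crash the JVM, for example when a BTrace script uses the wrong class files. Before using it in production, be sure to verify the script thoroughly in a local environment.

What can BTrace do?

  • When an interface becomes slow, analyze how long each method takes;
  • When a large amount of data is inserted into a Map, analyze how it resizes;
  • Find out which method called System.gc()
  • When a method throws an exception, analyze its runtime arguments

Suppose the server is running the following code: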

import java.util.Random;

public class BtraceCase {
    public static Random random = new Random();
    public int size;

    public static void main(String[] args) throws Exception {
        new BtraceCase().run();
    }

    public void run() throws Exception {
        while (true) {
            add(random.nextInt(10), random.nextInt(10));
        }
    }

    // Sleeps 0 to 900 ms so that each call has a measurable, varying duration.
    public int add(int a, int b) throws Exception {
        Thread.sleep(random.nextInt(10) * 100);
        return a + b;
    }
}

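We want to analyze the arguments, return value, and execution time of the add method:

Use jps to get the server process ID (8454 here), then run:

btrace 8454 Debug.java

This attaches the monitoring to the running code:

As you can see, BTrace captures the data of every call to add. It can do far more than this, such as reporting the current JVM heap usage or the execution stack of the current thread.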

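Parameter notes

// clazz: the class to monitor
// method: the method to monitor
// clazz and method can be specified by regular expression, interface, annotation, and so on
// location: where to intercept
// Kind.ENTRY: invoke the script on method entry
// Kind.RETURN: invoke the script when the method finishes
// @Return (the return value) and @Duration are only available when the location is RETURN
@OnMethod(clazz="com.metty.rpc.common.BtraceCase",
        method="add",
        location=@Location(Kind.RETURN))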

Locating problems with BTrace

  • Find every Filter that takes longer than 1 ms

The time reported by @Duration is in nanoseconds, so it has to be converted.
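The conversion itself is plain arithmetic. The sketch below is ordinary Java rather than a BTrace script (BTrace scripts are subject to the restrictions listed earlier), and the names SlowCalls, isSlow and format are made up for illustration.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

final class SlowCalls {
    /** True when a duration measured in nanoseconds exceeds a threshold given in milliseconds. */
    static boolean isSlow(long durationNanos, long thresholdMillis) {
        // Compare in nanoseconds: converting the duration to milliseconds would truncate it.
        return durationNanos > TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /** Formats a nanosecond duration as milliseconds with three decimals. */
    static String format(String name, long durationNanos) {
        return String.format(Locale.ROOT, "%s took %.3f ms", name, durationNanos / 1_000_000.0);
    }

    public static void main(String[] args) {
        long sample = 1_500_000L; // 1.5 ms expressed in nanoseconds
        if (isSlow(sample, 1)) {
            System.out.println(format("filter", sample));
        }
    }
}
```

Converting the threshold to nanoseconds, rather than the measurement to milliseconds, keeps durations such as 1.5 ms from being rounded down to 1 ms and slipping past a "longer than 1 ms" check.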

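  • Which method called System.gc(), and what does the call stack look like?

  • Count the calls to a method and print the count every minute

BTrace's @OnTimer annotation runs a method of the script on a fixed schedule.

  • Inspect an object's instance fields while a method executes

Reflection makes it easy to read the field values of the current instance.

Summary

BTrace can do a great deal, but always check that a script is sound before using it: once a BTrace script has been injected into the system, only a restart will undo it.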

To read

JConsole

JVisualVM

jmap fails with "Operation not permitted"

The fix is to run the following command before running jmap:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

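jstack analysis

Thread states:

  • BLOCKED: the thread is waiting for another thread to release a lock
  • WAITING: the thread is waiting after calling wait, join, or park
  • TIMED_WAITING: the thread is waiting after calling sleep, wait, join, or park; the maximum wait time is determined by the method argument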
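These states can be reproduced in a few lines, which helps when reading a jstack dump. The sketch below is illustrative only; the class name ThreadStates is made up, and it polls for up to about two seconds because a freshly started thread needs a moment to reach its waiting state.

```java
import java.util.concurrent.locks.LockSupport;

final class ThreadStates {
    /** Polls until the thread reaches the wanted state or about two seconds pass. */
    static Thread.State await(Thread t, Thread.State wanted) {
        long deadline = System.nanoTime() + 2_000_000_000L;
        Thread.State seen = t.getState();
        while (seen != wanted && System.nanoTime() < deadline) {
            Thread.yield();
            seen = t.getState();
        }
        return seen;
    }

    /** Returns the observed states of a parked, a sleeping, and a lock-contending thread. */
    static Thread.State[] observe() {
        final Object lock = new Object();
        Thread parked = new Thread(() -> {
            // park() may return spuriously, so loop until interrupted.
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();
            }
        });
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // interrupted on purpose to end the demo
            }
        });
        Thread blocked = new Thread(() -> {
            synchronized (lock) {
                // enters only after the observing thread releases the monitor
            }
        });
        for (Thread t : new Thread[] {parked, sleeper, blocked}) {
            t.setDaemon(true); // daemon threads never keep the JVM alive
        }
        Thread.State[] seen = new Thread.State[3];
        synchronized (lock) { // hold the monitor so that `blocked` cannot enter
            parked.start();
            sleeper.start();
            blocked.start();
            seen[0] = await(parked, Thread.State.WAITING);
            seen[1] = await(sleeper, Thread.State.TIMED_WAITING);
            seen[2] = await(blocked, Thread.State.BLOCKED);
        }
        parked.interrupt();
        sleeper.interrupt();
        return seen;
    }
}
```

Calling observe() should yield WAITING, TIMED_WAITING and BLOCKED in that order; running jstack against the process while the threads are alive shows the same states in the dump.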

Thread types:

  • daemon threads: once no non-daemon threads remain, daemon threads stop automatically

Analyzing HotSpot Crashes

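import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class Crash {

    static final Unsafe UNSAFE = getUnsafe();

    // Unsafe.getUnsafe() rejects application callers, so read the singleton reflectively.
    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void crash(int x) {
        UNSAFE.putInt(0x99, x); // write to an invalid native address to crash the JVM
    }

    public static void main(String... args) {
        crash(0x42);
    }
}

  • RAX: register a extended, 64 bit register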
  • RDX: register d extended, 64 bit register
  • EAX: 32 bit register
  • R9: register 9

JVM Tuning at Twitter

Web services' biggest enemy:

  • Latency

Server-side latency contributors:

  • By far the biggest contributor is the garbage collector
  • Others are, in no particular order:
    • in-process locking and thread scheduling
    • I/O
    • application algorithm inefficiencies

Areas of performance tuning:

  • Memory tuning
  • Lock contention tuning
  • CPU usage tuning
  • I/O tuning

Areas of memory performance tuning:

  • Memory footprint tuning
  • Allocation rate tuning
  • Garbage collection tuning

Memory footprint tuning:

  • Maybe you have too much data!
  • Maybe your data representation is fat!
  • You can also have a genuine memory leak …

Too much data:

  • Run with -verbosegc
  • Observe numbers in “Full GC” messages:
    [Full GC $before->$after($total), $time secs]
  • Can you give the JVM more memory?
  • Do you need all that data in memory? Consider using:
    • a LRU cache
    • soft references
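As an illustration of the first suggestion, a small LRU cache can be built on java.util.LinkedHashMap. This is a minimal sketch, not something taken from the talk; the class name LruCache is arbitrary, and the map is not thread-safe, so concurrent use would need external synchronization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Bounded cache that evicts the least recently used entry once capacity is exceeded. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the end
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put(); returning true drops the least recently used entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, inserting a and b, reading a, and then inserting c evicts b, because a was touched more recently. The bound keeps the amount of data held in memory fixed regardless of how many distinct keys the application sees.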

Fat data: object header

  • A JVM object header is normally two machine words.
  • That’s 16 bytes, or 128 bits, on a 64-bit JVM!
  • new java.lang.Object() takes 16 bytes.
  • new byte[0] takes 24 bytes.
    • 16 bytes object header
    • 4 bytes for the length of the array
    • 4 bytes of padding

Fat data: padding

class A {
    byte x;
}

class B extends A {
    byte y;
}

  • new A() takes 24 bytes
    • 16 bytes object header
    • 1 byte field
    • 7 bytes padding
  • new B() takes 32 bytes
    • 16 bytes object header
    • 1 byte for x
    • 7 bytes padding
    • 1 byte for y
    • 7 bytes padding
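The figures above follow from rounding up to 8-byte boundaries. The sketch below reproduces that arithmetic under the layout model stated in the text: a 16-byte header, 8-byte alignment, and each class's fields padded separately. The class name ObjectSizes is made up, and a real JVM (for example one using compressed oops) may lay objects out differently.

```java
final class ObjectSizes {
    static final int HEADER = 16; // two 8-byte machine words, as stated in the text
    static final int ALIGN = 8;   // sizes are rounded up to a multiple of 8 bytes

    /** Rounds a size up to the next multiple of ALIGN. */
    static int align(int bytes) {
        return (bytes + ALIGN - 1) / ALIGN * ALIGN;
    }

    /**
     * Instance size for a class hierarchy, given the field bytes declared by each class,
     * superclass first. Each class's fields are padded before the subclass's fields start.
     */
    static int instanceSize(int... fieldBytesPerClass) {
        int size = HEADER;
        for (int fieldBytes : fieldBytesPerClass) {
            size = align(size + fieldBytes);
        }
        return size;
    }

    /** Size of a byte array: header, 4-byte length, elements, then padding. */
    static int byteArraySize(int length) {
        return align(HEADER + 4 + length);
    }

    public static void main(String[] args) {
        System.out.println(instanceSize());     // new Object()
        System.out.println(byteArraySize(0));   // new byte[0]
        System.out.println(instanceSize(1));    // new A()
        System.out.println(instanceSize(1, 1)); // new B()
    }
}
```

The program prints 16, 24, 24 and 32, matching the numbers quoted for new Object(), new byte[0], new A() and new B(). Two single-byte fields cost 16 bytes of instance data, 14 of which are padding.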

Fat data: no inline structs

class C {
    Object A = new Object();
}

  • new C() takes 40 bytes
    • 16 bytes object header
    • 8 bytes for the reference
    • 16 bytes for the referenced Object

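JVM overview

What performance penalties does a 64-bit VM bring:

The internal representation of Java objects, called ordinary object pointers (oops), grows from 32 bits to 64 bits. Fewer oops fit in a CPU cache line, which reduces CPU cache efficiency.

Later, Java 6 added compressed oops (-XX:+UseCompressedOops), which compress 64-bit pointers to 32 bits by means of alignment and offsets. CPU cache utilization improves as a result.

When is class loading triggered:

The HotSpot VM is responsible for resolving constant pool symbols, which requires loading, linking, and then initializing Java classes and interfaces. The best time for class loading is when constant pool symbols in a Java class file are being resolved. Class.forName(), ClassLoader.loadClass(), the reflection APIs, and JNI_FindClass can all trigger class loading.

The HotSpot VM itself also triggers class loading. At startup, in addition to many ordinary classes, it loads core classes such as java.lang.Object and java.lang.Thread.

Loading a class also requires loading all of its superclasses and superinterfaces.

In practice, class loading is a cooperation between the HotSpot VM and a specific class loader such as java.lang.ClassLoader.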
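Loading and initialization are separate steps, which a small experiment can show. In the sketch below, ClassLoader.loadClass() loads a class without running its static initializer, while the one-argument Class.forName() also initializes it. The names LoadingDemo and Probe are invented for the example.

```java
final class LoadingDemo {
    /** Set by Probe's static initializer, so it tells us when initialization happened. */
    static boolean initialized = false;

    static final class Probe {
        static {
            LoadingDemo.initialized = true; // runs at initialization, not at load time
        }
    }

    /** Returns {initialized after loadClass, initialized after Class.forName}. */
    static boolean[] run() {
        try {
            String name = LoadingDemo.class.getName() + "$Probe";
            ClassLoader loader = LoadingDemo.class.getClassLoader();

            loader.loadClass(name);  // loads the class but does not initialize it
            boolean afterLoad = initialized;

            Class.forName(name);     // loads if necessary and initializes
            boolean afterForName = initialized;

            return new boolean[] {afterLoad, afterForName};
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("after loadClass: " + r[0] + ", after forName: " + r[1]);
    }
}
```

On the first call, run() returns false and then true. This distinction matters for code that depends on static initializers having run, JDBC driver registration being a classic case.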

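Which exceptions can occur during class loading:

  • No binary representation is found for the class name: NoClassDefFoundError
  • The class file is malformed or has an unsupported version: ClassFormatError or UnsupportedClassVersionError
  • The class hierarchy is circular: ClassCircularityError
  • A direct superinterface is not actually an interface, or the direct superclass is an interface: IncompatibleClassChangeError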

References
