-XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler
What can it give us, what improvements can we expect, and, moreover, which dirty hacks can be dropped? Let's explore an example that is a bit synthetic but based on real production code.
Guava
I bet many of you have heard of, or even use, the Preconditions class from the Guava library:
checkArgument(value > 0, "Non-negative value is expected, was %s", value);
Everything is fine until that piece of code ends up on a critical execution path - the issue is implicit garbage production.
Here is the body of the checkArgument method:
public static void checkArgument(
        boolean expression,
        @Nullable String errorMessageTemplate,
        @Nullable Object... errorMessageArgs) {
    if (!expression) {
        throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
    }
}
Let's turn implicit into explicit:
boolean expression = value > 0;
Object[] errorMessageArgs = new Object[]{Integer.valueOf(value)};
if (!expression) {
    throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
}
This leaves us with a dilemma. In production code such checks are usually safeguards: on the one hand we don't want to pay for them with extra, unnecessary garbage; on the other hand we don't want to drop fail-fast checks.
The root cause of the problem is autoboxing and the varargs array: objects that may never be used at all, especially in the positive scenario.
Unfortunately, when Escape Analysis faces a conditional branch, it cannot prove that the objects are unnecessary: they are used on one of the branches, so they are allocated on every call.
OK, how can we address this problem?
For instance, by overloading the checkArgument method (in fact, this has been done in Guava >= 20 for the case of 1 or 2 primitive arguments):
public static void checkArgument(boolean expression, @Nullable String errorMessageTemplate, int p1) {
    if (!expression) {
        throw new IllegalArgumentException(format(errorMessageTemplate, p1));
    }
}
Well, what if there are more than one or two arguments (the cases for which Guava has overloaded methods)?
The answer is: write your own hack (adding more and more overloaded methods) or live with the extra garbage pressure. I came across a place in our production code that takes a combination of 3 ints and 1 String, is executed millions of times, and has its response time constrained by an SLA.
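As an illustration of such a hack, here is a minimal sketch of a hand-written overload for the "3 ints and 1 String" case. The class name Checks and the parameter names are hypothetical, and String.format stands in for Guava's internal format helper, so treat it as a sketch rather than the actual production code.

```java
public final class Checks {
    private Checks() {}

    // Hypothetical overload: the primitives are passed as-is, so the happy
    // path creates neither Integer boxes nor a varargs Object[].
    public static void checkArgument(boolean expression, String template,
                                     int p1, int p2, int p3, String p4) {
        if (!expression) {
            // Boxing and the varargs array happen only on the failure path.
            // String.format is an assumption standing in for Guava's format.
            throw new IllegalArgumentException(String.format(template, p1, p2, p3, p4));
        }
    }

    public static void main(String[] args) {
        // Happy path: nothing is thrown.
        checkArgument(1 > 0, "bad input: %s, %s, %s, %s", 1, 2, 3, "A");
        try {
            checkArgument(-1 > 0, "bad input: %s, %s, %s, %s", -1, 2, 3, "A");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "bad input: -1, 2, 3, A"
        }
    }
}
```

The price is one more overload for every new combination of argument types, which is exactly why this counts as a hack.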
Graal
Now let's have a look at Java 10 and
-XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler
Graal contains tons of improvements and new kinds of optimizations, Partial Escape Analysis among them. In short, it can detect that some object allocations are used in only one branch of a condition, and therefore it is legal to move those allocations from outside the branch into the particular branch where the objects are used.
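To make the idea concrete, here is a hand-written sketch of the effect of that transformation at the source level. It only illustrates the result, not how Graal implements it; the class and method names (PartialEscapeSketch, eager, sunk) are made up for the example.

```java
public final class PartialEscapeSketch {
    private PartialEscapeSketch() {}

    // As written by the programmer: the boxes and the array are created
    // before the check, even when the check passes.
    static String eager(boolean cond, String msg, int a, int b) {
        Object[] args = new Object[]{Integer.valueOf(a), Integer.valueOf(b)};
        if (!cond) {
            return String.format(msg, args);
        }
        return null;
    }

    // What the optimized code conceptually looks like: the allocations are
    // moved into the only branch that uses them.
    static String sunk(boolean cond, String msg, int a, int b) {
        if (!cond) {
            Object[] args = new Object[]{Integer.valueOf(a), Integer.valueOf(b)};
            return String.format(msg, args);
        }
        return null; // happy path: nothing was allocated
    }

    public static void main(String[] args) {
        System.out.println(eager(false, "%s, %s", -1, 1000)); // prints "-1, 1000"
        System.out.println(sunk(false, "%s, %s", -1, 1000));  // prints "-1, 1000"
        System.out.println(sunk(true, "%s, %s", 1, 1000));    // prints "null"
    }
}
```

Both methods return the same result for any input; the only difference is where the allocation happens.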
The moment of truth. JMH
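PartialEATest:
package com.elastic;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Warmup(iterations = 5, time = 5000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 5000, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class PartialEATest {
    // -1 exercises the failing branch, 1 the happy path
    @Param(value = {"-1", "1"})
    private int value;

    @Benchmark
    public void allocate(Blackhole bh) {
        // value is autoboxed and all arguments are wrapped into a varargs Object[]
        checkArg(bh, value > 0, "expected non-negative value: %s, %s", value, 1000, "A", 700);
    }

    private static void checkArg(Blackhole bh, boolean cond, String msg, Object ... args){
        if (!cond){
            bh.consume(String.format(msg, args));
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(PartialEATest.class.getSimpleName())
                .addProfiler(GCProfiler.class) // reports gc.alloc.rate.norm
                .build();
        new Runner(opt).run();
    }
}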
Among other things, we are interested in allocations, which is why I turned GCProfiler on:
Options | Benchmark | (value) | Score | Error | Units |
---|---|---|---|---|---|
-Graal | PartialEATest.allocate:·gc.alloc.rate.norm | -1 | 1008,000 | ± 0,001 | B/op |
-Graal | PartialEATest.allocate:·gc.alloc.rate.norm | 1 | 32,000 | ± 0,001 | B/op |
+Graal | PartialEATest.allocate:·gc.alloc.rate.norm | -1 | 1024,220 | ± 0,908 | B/op |
+Graal | PartialEATest.allocate:·gc.alloc.rate.norm | 1 | ≈ 10⁻⁴ | | B/op |
It is pretty clear that Graal does not allocate objects without a reason. It's the right time to drop dirty performance hacks such as the overloaded methods shown above.
Compiled method
To be 100% sure, let's compare the assembly code generated by good old C2 with the code generated by Graal. For that we need hsdis - download it from somewhere or [build it on your own](http://dolzhenko.blogspot.com/2018/03/build-hsdis-with-java-10-on-macosx.html), and add some JVM parameters:
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:CompileCommand=print,"com/elastic/PartialEATest.*"
Compiled method :: C2
C2 generates tons of code, so let's look at it only up to the first occurrence of autoboxing:
ImmutableOopMap{rbx=Oop }pc offsets: 1684 1697
Compiled method (c2) 619 736 4 com.elastic.PartialEATest::allocate (55 bytes)
total in heap [0x00000001189a0c90,0x00000001189a1410] = 1920
relocation [0x00000001189a0e08,0x00000001189a0e38] = 48
main code [0x00000001189a0e40,0x00000001189a1060] = 544
stub code [0x00000001189a1060,0x00000001189a1078] = 24
oops [0x00000001189a1078,0x00000001189a10a0] = 40
metadata [0x00000001189a10a0,0x00000001189a10b0] = 16
scopes data [0x00000001189a10b0,0x00000001189a1210] = 352
scopes pcs [0x00000001189a1210,0x00000001189a13c0] = 432
dependencies [0x00000001189a13c0,0x00000001189a13c8] = 8
handler table [0x00000001189a13c8,0x00000001189a1410] = 72
----------------------------------------------------------------------
com/elastic/PartialEATest.allocate(Lorg/openjdk/jmh/infra/Blackhole;)V [0x00000001189a0e40, 0x00000001189a1078] 568 bytes
[Entry Point]
[Constants]
# {method} {0x000000022ea937b8} 'allocate' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'com/elastic/PartialEATest'
# this: rsi:rsi = 'com/elastic/PartialEATest'
# parm0: rdx:rdx = 'org/openjdk/jmh/infra/Blackhole'
# [sp+0x30] (sp of caller)
0x00000001189a0e40: cmp rax,QWORD PTR [rsi+0x8]
0x00000001189a0e44: jne 0x0000000110eb7580 ; {runtime_call ic_miss_stub}
0x00000001189a0e4a: xchg ax,ax
0x00000001189a0e4c: nop DWORD PTR [rax+0x0]
[Verified Entry Point]
0x00000001189a0e50: mov DWORD PTR [rsp-0x14000],eax
0x00000001189a0e57: push rbp
0x00000001189a0e58: sub rsp,0x20 ;*synchronization entry
; - com.elastic.PartialEATest::allocate@-1 (line 26)
0x00000001189a0e5c: mov r11d,DWORD PTR [rsi+0x10]
;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - com.elastic.PartialEATest::allocate@1 (line 26)
0x00000001189a0e60: mov DWORD PTR [rsp],r11d
0x00000001189a0e64: test r11d,r11d
0x00000001189a0e67: jle 0x00000001189a0ffc ;*ifle {reexecute=0 rethrow=0 return_oop=0}
; - com.elastic.PartialEATest::allocate@4 (line 26)
0x00000001189a0e6d: cmp r11d,0xffffff80
0x00000001189a0e71: jl 0x00000001189a100e ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@3 (line 1048)
; - com.elastic.PartialEATest::allocate@24 (line 26)
0x00000001189a0e77: cmp r11d,0x7f
0x00000001189a0e7b: jg 0x00000001189a0ea9 ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@10 (line 1048)
; - com.elastic.PartialEATest::allocate@24 (line 26)
0x00000001189a0e7d: mov ebp,r11d
0x00000001189a0e80: add ebp,0x80 ;*iadd {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@20 (line 1049)
; - com.elastic.PartialEATest::allocate@24 (line 26)
0x00000001189a0e86: cmp ebp,0x100
0x00000001189a0e8c: jae 0x00000001189a101e
0x00000001189a0e92: movsxd r10,r11d
0x00000001189a0e95: movabs r11,0x12ed02000 ; {oop(a 'java/lang/Integer'[256] {0x000000012ed02000})}
0x00000001189a0e9f: mov rbp,QWORD PTR [r11+r10*8+0x418]
;*aaload {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@21 (line 1049)
; - com.elastic.PartialEATest::allocate@24 (line 26)
................
(the listing is truncated here; the entire C2-generated code is much longer)
Compiled method :: Graal
The same trick, but for Graal:
ImmutableOopMap{rbx=Oop }pc offsets: 251 264
Compiled method (JVMCI) 1850 3888 4 com.elastic.PartialEATest::allocate (55 bytes)
total in heap [0x0000000119292590,0x0000000119292830] = 672
relocation [0x0000000119292708,0x0000000119292718] = 16
main code [0x0000000119292720,0x0000000119292795] = 117
stub code [0x0000000119292795,0x0000000119292798] = 3
oops [0x0000000119292798,0x00000001192927a0] = 8
metadata [0x00000001192927a0,0x00000001192927a8] = 8
scopes data [0x00000001192927a8,0x00000001192927c8] = 32
scopes pcs [0x00000001192927c8,0x0000000119292828] = 96
dependencies [0x0000000119292828,0x0000000119292830] = 8
----------------------------------------------------------------------
com/elastic/PartialEATest.allocate(Lorg/openjdk/jmh/infra/Blackhole;)V (com.elastic.PartialEATest.allocate(Blackhole)) [0x0000000119292720, 0x0000000119292798] 120 bytes
[Entry Point]
[Constants]
# {method} {0x0000000231e007b8} 'allocate' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'com/elastic/PartialEATest'
# this: rsi:rsi = 'com/elastic/PartialEATest'
# parm0: rdx:rdx = 'org/openjdk/jmh/infra/Blackhole'
# [sp+0x20] (sp of caller)
0x0000000119292720: cmp rax,QWORD PTR [rsi+0x8]
0x0000000119292724: jne 0x000000010eadc300 ; {runtime_call ic_miss_stub}
0x000000011929272a: nop
0x000000011929272b: data16 data16 nop WORD PTR [rax+rax*1+0x0]
0x0000000119292736: data16 nop WORD PTR [rax+rax*1+0x0]
[Verified Entry Point]
0x0000000119292740: mov DWORD PTR [rsp-0x14000],eax
0x0000000119292747: sub rsp,0x18
0x000000011929274b: mov QWORD PTR [rsp+0x10],rbp
0x0000000119292750: cmp DWORD PTR [rsi+0x10],0x1
0x0000000119292754: jl 0x000000011929276d ;*ifle {reexecute=0 rethrow=0 return_oop=0}
; - com.elastic.PartialEATest::allocate@4 (line 26)
0x000000011929275a: mov rbp,QWORD PTR [rsp+0x10]
0x000000011929275f: add rsp,0x18
0x0000000119292763: mov rcx,QWORD PTR [r15+0x70]
0x0000000119292767: test DWORD PTR [rcx],eax ; {poll_return}
0x0000000119292769: vzeroupper
0x000000011929276c: ret ;*return {reexecute=0 rethrow=0 return_oop=0}
; - com.elastic.PartialEATest::allocate@54 (line 27)
0x000000011929276d: mov DWORD PTR [r15+0x314],0xffffffed
;*ifle {reexecute=0 rethrow=0 return_oop=0}
; - com.elastic.PartialEATest::allocate@4 (line 26)
0x0000000119292778: mov QWORD PTR [r15+0x320],0x0
0x0000000119292783: call 0x000000010eadd2a4 ; ImmutableOopMap{rsi=Oop }
;*aload_0 {reexecute=1 rethrow=0 return_oop=0}
; - com.elastic.PartialEATest::allocate@0 (line 26)
; {runtime_call DeoptimizationBlob}
0x0000000119292788: nop
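It is obvious how much bigger the C2-generated code is than the Graal code: 544 bytes of main code versus 117. There is no autoboxing and no varargs array in the Graal version. The happy path is just a compare and a return, and the failing branch is a single call into the deoptimization stub.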