Apr 12, 2018

Guava, Graal and Partial Escape Analysis

Java 10 was released recently. Graal was in fact available earlier, but it is now much easier to get and use (Congratulations, you're running #Graal!): just add a couple of options:
-XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler
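A quick way to confirm that a JVM was really started with these two options is to inspect its input arguments. This is a minimal sketch: the class name `JvmciFlagCheck` and its methods are hypothetical, and it only checks that the flags were passed on the command line, not that Graal has actually compiled anything.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class JvmciFlagCheck {
    private JvmciFlagCheck() {}

    /** True if both options from the text are present in the given JVM arguments. */
    static boolean requested(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UnlockExperimentalVMOptions")
                && jvmArgs.contains("-XX:+UseJVMCICompiler");
    }

    /** Applies the same check to the arguments of the running JVM. */
    static boolean requestedForThisJvm() {
        return requested(ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Calling `JvmciFlagCheck.requestedForThisJvm()` at the start of a benchmark run helps rule out the case where the options were silently dropped, for example by a launcher script.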
What can it provide for us, what kind of enhancements can we expect, and moreover, which dirty hacks could be dropped?
Let's explore an example that is a bit synthetic but based on real production code.

Guava

I bet lots of you have heard of, or even use, the Preconditions class from the guava library:
checkArgument(value > 0, "Non-negative value is expected, was %s", value);
Everything is fine until that piece of code ends up on a critical execution path: the issue is implicit garbage production. Here is the body of the checkArgument method:
  public static void checkArgument(
      boolean expression,
      @Nullable String errorMessageTemplate,
      @Nullable Object... errorMessageArgs) {
    if (!expression) {
      throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
    }
  }

Let's turn implicit into explicit:

boolean expression = value > 0;
Object[] errorMessageArgs = new Object[]{Integer.valueOf(value)};
if (!expression) {
  throw new IllegalArgumentException(format(errorMessageTemplate, errorMessageArgs));
}

This leaves us with a dilemma. Such checks in production code are usually safeguards: on the one hand we don't want to pay for them with extra, unnecessary garbage; on the other hand we don't want to drop fail-fast checks.

In fact, the root cause of the problem is the autoboxed and varargs objects, which may not be used at all (especially in the positive scenario).

Unfortunately, when Escape Analysis (rus.) faces a conditional branch, it cannot prove that the object is unnecessary.

OK, how can we address this problem?

For instance, overload checkArgument (in fact, this has been done in guava >= 20 for the cases of 1 or 2 primitive arguments):
  public static void checkArgument(boolean expression, @Nullable String errorMessageTemplate, int p1) {
    if (!expression) {
      throw new IllegalArgumentException(format(errorMessageTemplate, p1));
    }
  }

Well, what if there are more than one or two arguments (the cases guava provides overloads for)?
The answer is: write your own hack (adding more and more overloaded methods) or live with the extra garbage pressure. I came across a place in our prod code with a combination of 3 ints and 1 String that is executed millions of times, with a response time constrained by an SLA.
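For the 3 ints and 1 String case, such a hand-written hack could look roughly like the sketch below. The class name `FastPreconditions` and the parameter names are hypothetical, and `String.format` stands in for guava's internal `format` helper, so the message formatting may differ slightly from guava's.

```java
/** Hypothetical helper; not part of guava. */
final class FastPreconditions {
    private FastPreconditions() {}

    /**
     * Overload for exactly three ints and one String. Because the parameters
     * are primitives, nothing is boxed at the call site; boxing and the
     * varargs array are only needed inside the failing branch.
     */
    static void checkArgument(boolean expression, String errorMessageTemplate,
                              int p1, int p2, int p3, String p4) {
        if (!expression) {
            // Boxing of p1..p3 and the Object[] for String.format happen only here.
            throw new IllegalArgumentException(
                    String.format(errorMessageTemplate, p1, p2, p3, p4));
        }
    }
}
```

A call such as `FastPreconditions.checkArgument(a > 0, "bad input: %s %s %s %s", a, b, c, name)` then resolves to this overload instead of the varargs one. The price is one more overload to maintain for every argument combination that turns out to be hot.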

Graal

Now let's look at Java 10 with -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler. Graal brings tons of improvements and new kinds of optimizations, Partial Escape Analysis among them. In short, it can detect that an allocated object is used in only one branch of a condition, so it is legal to move the allocation from before the branch into the particular branch where the object is used.
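The effect of moving the allocation can be pictured as a source-level rewrite. The sketch below is only an illustration of the idea: the class and method names are hypothetical, and Graal performs this on its intermediate representation, not on Java source.

```java
/** Hypothetical illustration of what Partial Escape Analysis achieves. */
final class PartialEscapeSketch {
    private PartialEscapeSketch() {}

    /** As written: the boxed Integer and the Object[] are created before the branch. */
    static String asWritten(int value) {
        Object[] args = new Object[] {Integer.valueOf(value)};
        if (value <= 0) {
            // The only place where args is actually used (and escapes).
            return String.format("expected positive value: %s", args);
        }
        return null; // args was allocated for nothing on this path
    }

    /** Conceptually after the optimization: the allocation is sunk into the branch. */
    static String afterPartialEscapeAnalysis(int value) {
        if (value <= 0) {
            Object[] args = new Object[] {Integer.valueOf(value)};
            return String.format("expected positive value: %s", args);
        }
        return null; // nothing is allocated on this path
    }
}
```

Both methods return the same result for every input; they differ only in whether the passing path allocates.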

The moment of truth. JMH

PartialEATest:
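package com.elastic;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)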
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Warmup(iterations = 5, time = 5000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 5000, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class PartialEATest {

    @Param(value = {"-1", "1"})
    private int value;

    @Benchmark
    public void allocate(Blackhole bh) {
        checkArg(bh, value > 0, "expected non-negative value: %s, %s", value, 1000, "A", 700);
    }

    private static void checkArg(Blackhole bh, boolean cond, String msg, Object ... args){
        if (!cond){
            bh.consume(String.format(msg, args));
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(PartialEATest.class.getSimpleName())
                .addProfiler(GCProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

Above all, we are interested in allocations, which is why I turned GCProfiler on:
Options  Benchmark                                   (value)      Score      Error  Units
-Graal   PartialEATest.allocate:·gc.alloc.rate.norm       -1   1008.000 ±    0.001   B/op
-Graal   PartialEATest.allocate:·gc.alloc.rate.norm        1     32.000 ±    0.001   B/op
+Graal   PartialEATest.allocate:·gc.alloc.rate.norm       -1   1024.220 ±    0.908   B/op
+Graal   PartialEATest.allocate:·gc.alloc.rate.norm        1     ≈ 10⁻⁴              B/op

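It is pretty clear that Graal does not allocate objects without reason: on the passing path (value = 1) it allocates practically nothing, whereas without Graal the same call costs 32 B/op. It's the right time to drop dirty performance hacks such as the overloaded methods shown above.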

Compiled method


To be 100% sure, let's compare the assembly generated by good old C2 with what Graal generates. For that we need hsdis: download it from somewhere or [build it on your own](http://dolzhenko.blogspot.com/2018/03/build-hsdis-with-java-10-on-macosx.html), and add some JVM parameters:
-XX:+UnlockDiagnosticVMOptions 
-XX:PrintAssemblyOptions=intel 
-XX:CompileCommand=print,"com/elastic/PartialEATest.*" 

Compiled method :: C2


There are tons of generated code (see the entire C2 generated code), so let's look at it up to the first occurrence of autoboxing:
ImmutableOopMap{rbx=Oop }pc offsets: 1684 1697
Compiled method (c2)     619  736       4       com.elastic.PartialEATest::allocate (55 bytes)
 total in heap  [0x00000001189a0c90,0x00000001189a1410] = 1920
 relocation     [0x00000001189a0e08,0x00000001189a0e38] = 48
 main code      [0x00000001189a0e40,0x00000001189a1060] = 544
 stub code      [0x00000001189a1060,0x00000001189a1078] = 24
 oops           [0x00000001189a1078,0x00000001189a10a0] = 40
 metadata       [0x00000001189a10a0,0x00000001189a10b0] = 16
 scopes data    [0x00000001189a10b0,0x00000001189a1210] = 352
 scopes pcs     [0x00000001189a1210,0x00000001189a13c0] = 432
 dependencies   [0x00000001189a13c0,0x00000001189a13c8] = 8
 handler table  [0x00000001189a13c8,0x00000001189a1410] = 72
----------------------------------------------------------------------
com/elastic/PartialEATest.allocate(Lorg/openjdk/jmh/infra/Blackhole;)V  [0x00000001189a0e40, 0x00000001189a1078]  568 bytes
[Entry Point]
[Constants]
  # {method} {0x000000022ea937b8} 'allocate' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'com/elastic/PartialEATest'
  # this:     rsi:rsi   = 'com/elastic/PartialEATest'
  # parm0:    rdx:rdx   = 'org/openjdk/jmh/infra/Blackhole'
  #           [sp+0x30]  (sp of caller)
  0x00000001189a0e40: cmp    rax,QWORD PTR [rsi+0x8]
  0x00000001189a0e44: jne    0x0000000110eb7580  ;   {runtime_call ic_miss_stub}
  0x00000001189a0e4a: xchg   ax,ax
  0x00000001189a0e4c: nop    DWORD PTR [rax+0x0]
[Verified Entry Point]
  0x00000001189a0e50: mov    DWORD PTR [rsp-0x14000],eax
  0x00000001189a0e57: push   rbp
  0x00000001189a0e58: sub    rsp,0x20           ;*synchronization entry
                                                ; - com.elastic.PartialEATest::allocate@-1 (line 26)

  0x00000001189a0e5c: mov    r11d,DWORD PTR [rsi+0x10]
                                                ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@1 (line 26)

  0x00000001189a0e60: mov    DWORD PTR [rsp],r11d
  0x00000001189a0e64: test   r11d,r11d
  0x00000001189a0e67: jle    0x00000001189a0ffc  ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@4 (line 26)

  0x00000001189a0e6d: cmp    r11d,0xffffff80
  0x00000001189a0e71: jl     0x00000001189a100e  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@3 (line 1048)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)

  0x00000001189a0e77: cmp    r11d,0x7f
  0x00000001189a0e7b: jg     0x00000001189a0ea9  ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@10 (line 1048)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)

  0x00000001189a0e7d: mov    ebp,r11d
  0x00000001189a0e80: add    ebp,0x80           ;*iadd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@20 (line 1049)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)

  0x00000001189a0e86: cmp    ebp,0x100
  0x00000001189a0e8c: jae    0x00000001189a101e
  0x00000001189a0e92: movsxd r10,r11d
  0x00000001189a0e95: movabs r11,0x12ed02000    ;   {oop(a 'java/lang/Integer'[256] {0x000000012ed02000})}
  0x00000001189a0e9f: mov    rbp,QWORD PTR [r11+r10*8+0x418]
                                                ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - java.lang.Integer::valueOf@21 (line 1049)
                                                ; - com.elastic.PartialEATest::allocate@24 (line 26)
................
(truncated; again, see the entire C2 generated code)

Compiled method :: Graal

The same trick, but for Graal:
ImmutableOopMap{rbx=Oop }pc offsets: 251 264
Compiled method (JVMCI)    1850 3888       4       com.elastic.PartialEATest::allocate (55 bytes)
 total in heap  [0x0000000119292590,0x0000000119292830] = 672
 relocation     [0x0000000119292708,0x0000000119292718] = 16
 main code      [0x0000000119292720,0x0000000119292795] = 117
 stub code      [0x0000000119292795,0x0000000119292798] = 3
 oops           [0x0000000119292798,0x00000001192927a0] = 8
 metadata       [0x00000001192927a0,0x00000001192927a8] = 8
 scopes data    [0x00000001192927a8,0x00000001192927c8] = 32
 scopes pcs     [0x00000001192927c8,0x0000000119292828] = 96
 dependencies   [0x0000000119292828,0x0000000119292830] = 8
----------------------------------------------------------------------
com/elastic/PartialEATest.allocate(Lorg/openjdk/jmh/infra/Blackhole;)V (com.elastic.PartialEATest.allocate(Blackhole))  [0x0000000119292720, 0x0000000119292798]  120 bytes
[Entry Point]
[Constants]
  # {method} {0x0000000231e007b8} 'allocate' '(Lorg/openjdk/jmh/infra/Blackhole;)V' in 'com/elastic/PartialEATest'
  # this:     rsi:rsi   = 'com/elastic/PartialEATest'
  # parm0:    rdx:rdx   = 'org/openjdk/jmh/infra/Blackhole'
  #           [sp+0x20]  (sp of caller)
  0x0000000119292720: cmp    rax,QWORD PTR [rsi+0x8]
  0x0000000119292724: jne    0x000000010eadc300  ;   {runtime_call ic_miss_stub}
  0x000000011929272a: nop
  0x000000011929272b: data16 data16 nop WORD PTR [rax+rax*1+0x0]
  0x0000000119292736: data16 nop WORD PTR [rax+rax*1+0x0]
[Verified Entry Point]
  0x0000000119292740: mov    DWORD PTR [rsp-0x14000],eax
  0x0000000119292747: sub    rsp,0x18
  0x000000011929274b: mov    QWORD PTR [rsp+0x10],rbp
  0x0000000119292750: cmp    DWORD PTR [rsi+0x10],0x1
  0x0000000119292754: jl     0x000000011929276d  ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@4 (line 26)

  0x000000011929275a: mov    rbp,QWORD PTR [rsp+0x10]
  0x000000011929275f: add    rsp,0x18
  0x0000000119292763: mov    rcx,QWORD PTR [r15+0x70]
  0x0000000119292767: test   DWORD PTR [rcx],eax  ;   {poll_return}
  0x0000000119292769: vzeroupper 
  0x000000011929276c: ret                       ;*return {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@54 (line 27)

  0x000000011929276d: mov    DWORD PTR [r15+0x314],0xffffffed
                                                ;*ifle {reexecute=0 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@4 (line 26)

  0x0000000119292778: mov    QWORD PTR [r15+0x320],0x0
  0x0000000119292783: call   0x000000010eadd2a4  ; ImmutableOopMap{rsi=Oop }
                                                ;*aload_0 {reexecute=1 rethrow=0 return_oop=0}
                                                ; - com.elastic.PartialEATest::allocate@0 (line 26)
                                                ;   {runtime_call DeoptimizationBlob}
  0x0000000119292788: nop
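It is pretty obvious how much bigger the C2 generated code is than the Graal code: 544 bytes of main code versus 117. There is no autoboxing or varargs in the Graal version: the passing path is just a comparison and a return, and the failing branch is a single call to the deoptimization blob.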

Outcome

It seems Graal performs lots of the optimizations that we used to do manually. On the one hand, it is still an experimental part of OpenJDK and it will take some time before it becomes an enabled-by-default compiler; on the other hand, it is not clear which cases it can handle (especially in comparison to C2).
