Sumatra

Project Sumatra Wiki

This page, with its child pages, contains design notes for Project Sumatra

OpenJDK project page: http://openjdk.java.net/projects/sumatra
Repositories: http://hg.openjdk.java.net/sumatra/sumatra-dev/{scratch,hotspot,jdk,...} (repo info)
Developer list: http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev

Goals

  • Enable Java applications to take advantage of heterogeneous processing units (GPUs/APUs)
  • Extend JVM JITs to generate code for heterogeneous processing hardware
  • Integrate the JVM data model with data types efficiently processed by such hardware
  • Allow the JVM to efficiently interoperate with high-performance libraries built for such hardware
  • Extend the JVM managed runtime to track pointers and storage allocation throughout such a system

Challenges

Here are some of the specific technical challenges.

  • mitigate the complexities of present-day GPU backends and layered standards
    • standards include: OpenCL, CUDA, Intel Phi, PTX, HSA (forthcoming), ...
    • FIXME: choose 1-3 of the standards (e.g., PTX, HSAIL/HSA) for initial backend development
  • build compromise data layouts that serve both the JVM and GPU hardware
    • define a Java model for "value types" which can be pervasively unboxed (like tuples or structs)
    • need to support flatter data structures (Complex values, vector and RGBA values, 2D arrays) from Java (see the layout sketch after this list)
    • need to support mix of primitives and JVM-managed pointers
      • range of solutions: "don't"; like JNI array-critical; pinning read barrier; stack maps and safepoints in GPU
      • range of solutions: no pointers; pointers are opaque (e.g., indices into Java-side array); arena pointers; pinning read barrier.
    • need "foreign data interface" that is competent to interoperate (without copying) to standard sparse array packages
    • adapt (or extend if necessary) JNI as a foreign invocation interface that is competent to call purpose-built C code for complex GPU requests
  • reduce data copying and inter-phase latency between ISAs and loop kernels
    • agreement on data structures will reduce copying
    • more flexible loop kernel container will allow loop kernel fusion
  • cope with dynamically varying mixes of managed parallel and serial data and code
    • use JVM dynamic compilation techniques to build customized kernels and execution strategies
    • optimize computation requests relative to online data
  • automatically (at each appropriate level of the system) sense load and distribute work cleanly between CPUs and GPUs
    • compile (online) JDK 8 parallel collection pipelines to data-parallel compute requests (see the stream pipeline example below)
    • partition simple Java bytecode call graphs (after profile-directed inlining) into CPU and GPU
  • learn to efficiently flatten nested or keyed parallel constructs
    • apply existing technology on nested data parallelism (to JVM execution of GPU code)
    • apply existing technology on MapReduce (to JVM execution of GPU code)
    • ensure that Java views of flattened and grouped parallel data sets are compatible with GPU capabilities
    • efficiently implement "nonlinear streams" in JDK 8 parallel collections
  • create a practical and predictable story for loop vectorization, presumably user-assisted, and with useful failure modes
    • build a low-level library of vector intrinsics (e.g., AVX-style) that can be called (manually) from Java
    • apply existing technology for loop vectorization
    • build user-assisted loop vectorizers for Java, possibly based on type annotations (JSR 308)
  • deal with exceptional conditions as they arise in loop kernels
    • allow GPU loop kernels to call back to CPU for infrequent edge cases (argument reduction, exceptions, allocation overflows, deoptimization of slow paths)
    • engineer a loop kernel container API which accounts for multiple CPU outcomes, and aggregates per kernel iteration (perhaps with continuation-passing style)
  • define a robust and clear data-parallel execution model on top of the JVM bytecode, memory, and thread specifications
    • interpret (or adapt if necessary) the Java Memory Model (JSR 133) to the needs of data parallel programming
    • interpret (or adapt if necessary) the thread-based Java concurrency model (define GPU kernel effects in terms of bytecode execution by weakened quasi-threads)
  • investigate use of Java language constructs and programming idioms that can be effectively compiled for a data-parallel execution engine (such as a GPU)
    • potential candidate: lambda methods and expressions
    • other options?
  • investigate opportunities for GPU-enabled 'intrinsic' versions of existing JDK APIs
    • candidates may include sort, (de)compression, CRC checking, search, convolutions, etc.
  • adopt and adapt insights from previous work on data-parallel Java projects
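
For reference, here is an illustrative sketch (plain Java as it exists today, not a Sumatra API; the class and method names are invented for this page) of the data-layout gap behind the "flatter data structures" item above. An array of ordinary Complex objects is an array of references to heap objects, while a hand-flattened structure-of-arrays version is closer to what data-parallel hardware wants, at the cost of losing the object abstraction. A pervasively unboxed "value type" would let the two views coincide.

    // Today: a Complex "value" is a full heap object, and Complex[] is an
    // array of pointers to such objects -- awkward input for a GPU kernel.
    final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
    }

    // Hand-flattened structure-of-arrays layout: two primitive arrays,
    // contiguous and copy-friendly, but the Complex abstraction is gone.
    final class ComplexArray {
        final double[] re, im;
        ComplexArray(int n) { re = new double[n]; im = new double[n]; }

        // Element-wise multiply; every iteration is independent, so this
        // loop body is the kind of kernel a GPU backend could take over.
        void mulInto(ComplexArray other, ComplexArray out) {
            for (int i = 0; i < re.length; i++) {
                out.re[i] = re[i] * other.re[i] - im[i] * other.im[i];
                out.im[i] = re[i] * other.im[i] + im[i] * other.re[i];
            }
        }
    }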

FIXME: Most of these items need their own wiki pages and/or email conversations.
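
Also for reference, the "compile (online) JDK 8 parallel collection pipelines to data-parallel compute requests" item has in mind ordinary stream code like the following (plain JDK 8 API, nothing Sumatra-specific; the class and method names are invented). Each lambda body is free of side effects and each index is independent, which is exactly the shape a JIT could turn into a GPU compute request instead of a fork/join pipeline on CPU threads.

    import java.util.stream.IntStream;

    public class DotProduct {
        // A data-parallel reduction written with the JDK 8 stream API.
        static double dot(double[] a, double[] b) {
            return IntStream.range(0, a.length)
                            .parallel()
                            .mapToDouble(i -> a[i] * b[i])
                            .sum();
        }

        public static void main(String[] args) {
            double[] a = {1, 2, 3, 4};
            double[] b = {4, 3, 2, 1};
            System.out.println(dot(a, b));   // prints 20.0
        }
    }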

Roadmap

FIXME: In what order will we address these challenges?

Known investigations

  • Eric Caspole and Tom Deneau are investigating how we might intercept C1/C2 compilation triggers to convert lambda-based IntStream/IntConsumer pipelines to OpenCL as a demonstration vehicle. They are using the Aparapi infrastructure as a proxy for a 'real' backend compiler. (An illustrative example of the pipeline shape they target appears below.)
  • Vasanth Venkatachalam is investigating the use of Graal to create a backend compiler for the soon-to-be-released HSAIL standard from the HSA Foundation.
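
For context, the IntStream/IntConsumer shape mentioned in the first item is ordinary JDK 8 code along the following lines (an illustrative example only, not taken from their prototype; the class and array names are arbitrary). The lambda passed to forEach is an IntConsumer whose body touches only index i, so it is a natural candidate to lift into an OpenCL or HSAIL kernel rather than run on fork/join worker threads.

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class VectorAdd {
        public static void main(String[] args) {
            int n = 1_000_000;
            float[] a = new float[n], b = new float[n], c = new float[n];
            Arrays.fill(a, 1.0f);
            Arrays.fill(b, 2.0f);

            // forEach(IntConsumer): each iteration writes only c[i], so the
            // body can be extracted as a data-parallel kernel over the range.
            IntStream.range(0, n).parallel().forEach(i -> c[i] = a[i] + b[i]);

            System.out.println(c[0]);   // prints 3.0
        }
    }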

FIXME: Add your work here!



