Java Corpus Tools



  1. Create a platform to enable language designers and the community to validate extensions to Java.
  2. Encourage and facilitate empirical research of features in Java.

A Query Language for Language Designers

Java Source Code Query Languages

In this section, we give an overview of the seven query languages that we evaluate in this paper: Java Tools Language, Browse-By-Query, SOUL, JQuery, .QL, Jackpot and PMD. We selected these languages because they cover a variety of design choices and each provides a dedicated query language. For example, we did not select FindBugs because it only lets programmers query source code by writing new classes against a Java framework. We also only selected source code query languages that included a guide or a working implementation.

Name | Paradigm | Model | Input | Date
Java Tools Language | Logic | Relational | Bytecode | 2006
Browse-By-Query | Declarative (English-like Queries) | Relational | Bytecode | 2005
SOUL | Logic | Relational | Source | 2011
JQuery | Logic | Relational | Source | 2003
.QL | Object-Oriented, SQL-like | Relational | Source | 2007
Jackpot | Declarative | Relational | Source | 2009
PMD | XPath | Tree | Source | 2004

Java Tools Language

The Java Tools Language (JTL) is a logic-paradigm query language for selecting Java elements in a code base. The current implementation is based on an analysis of Java bytecode classes. The JTL syntax is inspired by Query-by-Example ideas in order to increase user productivity, and JTL relies on Datalog-like semantics. For example, one could find all methods taking three int parameters and returning a subclass of Java's Date class using the following query:

In addition, JTL features variable binding and data flow queries.


Browse-By-Query

Browse-By-Query (BBQ) reads Java bytecode files and creates a database representing classes, method calls, fields, field references, string constants and string constant references. This database can then be interrogated through English-like queries. The syntax is motivated by the desire to be intuitive. For example, one could find all the methods that call a method whose name matches start by composing the following query:

In addition, BBQ provides filtering mechanisms, set operators and relational operators that can be combined to compose more complex queries.


SOUL

SOUL is a logic-paradigm query language. It contains an extensive predicate library called CAVA that matches queries against AST nodes of a Java program generated by the Eclipse JDT. SOUL facilitates the specification of queries by using example-driven matching of templates and structural unification to match a code excerpt with an AST node. In practice, this means a user can create a logic variable to match an AST node and reuse this variable within the query regardless of the execution path where the variable appears. For example, one could specify a query that finds instances of Scanner that are read after they were closed as follows:


JQuery

JQuery is a logic-paradigm query language built on top of the logic programming language TyRuBa. The implementation of JQuery analyses the AST of a Java program by making calls to the Eclipse JDT. JQuery includes a library of predicates that allows querying Java elements and the relationships between them. For example, the following query finds all method declarations ?M that have at least one parameter of type Integer:


.QL

.QL is an object-oriented query language. It enables programmers to query Java source code by composing queries that look like SQL. The motivation for this design choice is to lower the barrier to entry for developers learning it. In addition, the authors argue that object-orientation provides the structure necessary for building reusable queries. An implementation called SemmleCode is available, which includes an editor and various optimisations. As an example, the following query describes how to find all classes that declare a method equals but do not declare a method hashCode.


Jackpot

Jackpot is a module for the NetBeans IDE for querying and transforming Java source files. Jackpot lets users query the AST of a Java program by composing rules in the form of a Java expression. In addition, one can specify variables to bind to a matching AST node. For example, the following query will match any code surrounded by a call to readLock() and readUnlock():


PMD

PMD is a ruleset-based Java source code analyzer that identifies bugs or potential problems including dead code, duplicate code or overcomplicated expressions. PMD has an extensive archive of built-in rules that can be used to identify such problems. One can specify new rules by writing them in Java and making use of the PMD helper classes. Alternatively, one can compose custom rules via an XPath expression that queries the AST of the program to analyze. For example, the following query finds all method declarations that have at least one parameter of type Integer:

Use Cases

In this section, we describe the use cases examined for the evaluation. We selected use cases that are a source of language design discussions and make use of a variety of Java features.

Final Array and Anonymous Inner Classes

Java lets programmers create inner classes. An inner class is a nested class that is not declared static. There are three kinds of inner classes: non-static member classes, local classes and anonymous classes.
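As a minimal sketch of the three kinds of inner classes, consider the class below. The names InnerClassKinds, Member, Local and kinds are ours and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; all names here are hypothetical.
final class InnerClassKinds {

    // 1. Non-static member class: declared directly in the enclosing class body.
    class Member {
        String kind() { return "member"; }
    }

    List<String> kinds() {
        // 2. Local class: declared inside a method body.
        class Local {
            String kind() { return "local"; }
        }

        // 3. Anonymous class: declared and instantiated in a single expression.
        Object anonymous = new Object() {
            @Override
            public String toString() { return "anonymous"; }
        };

        return Arrays.asList(new Member().kind(), new Local().kind(), anonymous.toString());
    }

    public static void main(String[] args) {
        System.out.println(new InnerClassKinds().kinds());
    }
}
```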

Inner classes have a restriction that any local variable, formal parameter, or exception parameter used but not declared in the inner class must be declared final.

However, programmers can circumvent this restriction by declaring a final array with only one element and mutating the element of the array. The following code illustrates this mechanism:
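A minimal sketch of the idiom, assuming a simple counter as the shared state. The names FinalArrayDemo, countRuns and counter are ours and not taken from any corpus.

```java
// Illustrative sketch; all names here are hypothetical.
final class FinalArrayDemo {

    static int countRuns(int times) {
        // The array reference is final, but its single element stays mutable.
        final int[] counter = new int[1];

        Runnable increment = new Runnable() {
            @Override
            public void run() {
                // The anonymous inner class mutates state of the enclosing scope
                // through the final array variable.
                counter[0]++;
            }
        };

        for (int i = 0; i < times; i++) {
            increment.run();
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countRuns(3));
    }
}
```

Use case 1 targets exactly this shape: an anonymous class body that contains an assignment or increment on an element of a final array declared in the enclosing scope.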

Use Case 1: Find occurrences of an anonymous inner class whose code references a final array variable in the enclosing scope and which mutates array elements via that variable.

Generic Constructors

A constructor can have two sets of type arguments. First, a constructor can use the type parameters declared by its generic class. The type arguments are then specified after the class name: new Foo<Integer>(). Second, a constructor can declare its own type parameters. These type arguments are specified between the new token and the class name: new <Integer> Foo<Number>(). The code below illustrates a constructor of class Foo which declares its own type parameter S that extends the class's own type parameter.
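A minimal sketch of such a constructor is given below. The wrapper class GenericConstructorDemo, the field value and the helper example are ours and purely illustrative; only the shape of Foo follows the description above.

```java
// Illustrative sketch; GenericConstructorDemo, value and example are hypothetical names.
final class GenericConstructorDemo {

    static class Foo<T> {
        private final T value;

        // Generic constructor: S is the constructor's own type parameter,
        // bounded by the class's type parameter T.
        <S extends T> Foo(S initial) {
            this.value = initial;
        }

        T get() {
            return value;
        }
    }

    static Foo<Number> example() {
        // Explicit constructor type argument <Integer> and class type argument <Number>.
        return new <Integer>Foo<Number>(42);
    }

    public static void main(String[] args) {
        System.out.println(example().get());
    }
}
```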

Use Case 2: Find generic constructors whose type parameters extend the enclosing class's own type parameters.

Capture Conversion Idiom

Java 5 introduced wildcards as a variance mechanism for generics. Safety is achieved by restricting accesses to fields and methods based on the position of the type parameter. The unbounded wildcard <?> represents any type and can be used to provide a simple form of bivariance. In practice, this means that List<?> is a supertype of List<T> for any T. 
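The following sketch shows the access restriction in practice, assuming a small helper of our own named countNonNull. A List<?> accepts lists of any element type and its elements can be read as Object, but adding a non-null element is rejected by the compiler.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; UnboundedWildcardDemo and countNonNull are hypothetical names.
final class UnboundedWildcardDemo {

    // Accepts List<Integer>, List<String>, and so on, since List<?> is a supertype of all of them.
    static int countNonNull(List<?> items) {
        int count = 0;
        for (Object item : items) { // reading is safe: every element is at least an Object
            if (item != null) {
                count++;
            }
        }
        // items.add("x"); // would not compile: the element type is unknown
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonNull(Arrays.asList(1, null, 3)));
        System.out.println(countNonNull(Arrays.asList("a")));
    }
}
```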

The unbounded wildcard is frequently used as part of the capture conversion idiom. In the code below, the signature of reverse is preferred over that of rev as it doesn't expose implementation information to the caller. The argument of type List<?> is passed to rev, which takes a List<T>. Treating List<?> as a subtype of List<T> for an arbitrary T would be unsafe. Java provides capture conversion as a solution: the unknown type <?> is captured in a type variable and T is inferred to be that type variable within the method rev.
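A minimal sketch of the idiom, using the method names reverse and rev from the text. The enclosing class name CaptureConversionDemo and the body of rev are our own illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; the class name and the body of rev are hypothetical.
final class CaptureConversionDemo {

    // Public signature: no type parameter is exposed to the caller.
    static void reverse(List<?> list) {
        rev(list); // capture conversion: T is inferred to be the captured type of ?
    }

    // Private helper: needs a name for the element type in order to write back into the list.
    private static <T> void rev(List<T> list) {
        List<T> copy = new ArrayList<>(list);
        for (int i = 0; i < list.size(); i++) {
            list.set(i, copy.get(list.size() - 1 - i));
        }
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3);
        reverse(numbers);
        System.out.println(numbers);
    }
}
```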

Use Case 3: Find occurrences of the capture conversion idiom.

Overloaded Methods

Overloading methods allows programmers to declare methods with the same name but with different signatures.

For example, one could write an add method that takes a different number of parameters:
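A minimal sketch, assuming int parameters; the class name OverloadedAdd is ours.

```java
// Illustrative sketch; OverloadedAdd is a hypothetical name.
final class OverloadedAdd {

    // Same method name, different number of parameters, all sharing the type int.
    static int add(int a, int b) {
        return a + b;
    }

    static int add(int a, int b, int c) {
        return a + b + c;
    }

    static int add(int a, int b, int c, int d) {
        return a + b + c + d;
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));
        System.out.println(add(1, 2, 3));
        System.out.println(add(1, 2, 3, 4));
    }
}
```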

Often this pattern can be rewritten by using the varargs feature if the overloaded methods' parameters share a single type:
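A sketch of one possible varargs rewrite, again assuming int parameters; the class name VarargsAdd is ours. Note that, unlike a fixed set of overloads, this version also accepts zero or one argument.

```java
// Illustrative sketch; VarargsAdd is a hypothetical name.
final class VarargsAdd {

    // A single varargs method replaces the family of fixed-arity overloads.
    static int add(int... values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));
        System.out.println(add(1, 2, 3, 4));
    }
}
```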

Related to this use case, recent work [4] has investigated overloading in Java and found that a quarter of overloaded methods are simulating default arguments.

Use Case 4: Find overloaded methods with multiple parameters that share a single type.

Covariant Arrays

In Java and C#, array subtyping is covariant, meaning that type B[] is considered a subtype of A[] whenever B is a subtype of A. However, this relation can cause runtime exceptions. Consider the following Java code where Banana and Apple are subtypes of Fruit:

The assignment to the first element of the variable fruit on line 3 will cause an ArrayStoreException. Although the variable fruit has static type Fruit[], its runtime type is Banana[], and thus we cannot use it to store an Apple.
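The scenario described above can be reproduced with the following self-contained sketch. The class name CovariantArrayDemo and the helper storeApple are ours, and the empty Fruit, Banana and Apple classes are stand-ins; the failing store is marked with a comment.

```java
// Illustrative sketch; CovariantArrayDemo and storeApple are hypothetical names.
final class CovariantArrayDemo {

    static class Fruit {}
    static class Banana extends Fruit {}
    static class Apple extends Fruit {}

    // Returns true if the store succeeds, false if the runtime array type rejects it.
    static boolean storeApple(Fruit[] fruit) {
        try {
            fruit[0] = new Apple(); // compiles, but is checked against the runtime element type
            return true;
        } catch (ArrayStoreException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Banana[] bananas = new Banana[5];
        Fruit[] fruit = bananas; // allowed because array subtyping is covariant
        System.out.println(storeApple(fruit));        // runtime type is Banana[]
        System.out.println(storeApple(new Fruit[5])); // runtime type is Fruit[]
    }
}
```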

Use Case 5: Find occurrences of covariant array uses in assignments, method calls, constructor instantiations and return statements.

Rethrown Exception

Java 7 introduced improved checking for rethrown exceptions. Previously, a rethrown exception was treated as throwing the type of the catch parameter. Now, when a catch parameter is declared final, the rethrown type is known to be only those exception types that were thrown in the try block and are subtypes of the catch parameter type.

This new feature introduced two source incompatibilities with respect to Java 6. The code below illustrates the changes. Previously, the statement throw exception would throw a Foo exception. Now, it throws a DaughterOfFoo exception. As a consequence, the catch block catch(SonOfFoo anotherException) is not reachable anymore as the try only throws a DaughterOfFoo exception.
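The sketch below uses the exception names from the text. The class name RethrowDemo and the method run are ours. Because a Java 7 or later compiler is expected to reject the unreachable catch (SonOfFoo anotherException) clause, that clause appears only in a comment and the inner try catches DaughterOfFoo instead, so that the sketch compiles.

```java
// Illustrative sketch; RethrowDemo and run are hypothetical names.
final class RethrowDemo {

    static class Foo extends Exception {}
    static class SonOfFoo extends Foo {}
    static class DaughterOfFoo extends Foo {}

    static String run() {
        try {
            throw new DaughterOfFoo();
        } catch (final Foo exception) {
            try {
                // Java 6 treated this as throwing Foo; with the Java 7 analysis
                // the rethrown type is known to be DaughterOfFoo.
                throw exception;
            } catch (DaughterOfFoo anotherException) {
                return "DaughterOfFoo";
            }
            // Under the Java 7 rules, replacing the clause above with
            //   catch (SonOfFoo anotherException) { ... }
            // is a compile-time error, since SonOfFoo is never thrown by the inner try.
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```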

Use Case 6: Find occurrences of nested try/catch blocks that rethrow an exception.


[This section is work in progress]

X = not supported

= supported

? = not sure

  Final Array & Anonymous Class Generic Constructors Capture Conversion Idiom Overloaded Methods Sharing Single Type Covariant Arrays Rethrown Exception 
JQuery X X X X (*1) X X
.QL (*2) ? ? ?
Jackpot X
PMD (Xpath) X X X X X X

*1: can find overloaded methods, but cannot restrict them to those sharing a single type: method(?C, ?M1), method(?C, ?M2), likeThis(?M1, ?M2). Tested with the operations available in the Eclipse plugin. The paper describes different operations that don't seem to be supported.

*2: the .QL documentation and tool are kept confidential to protect competitive advantage


  •  no working Eclipse plugin found. (authors emailed)
  • argument list pattern (sec 2.2) public _ (_, String, *) : any public method that accepts a String as its second argument and returns any type
  • quantifiers: no, all, exists...
  • anonymous class
  • pattern naming (integral := byte | short | int | long)
  • variable binding
  • no structural matching of AST (deemed difficult because it uses class files), so one cannot match on a loop, for example, or find all local variables in a method
  • no submethod information (local vars etc)
  • read[F]/write[F] predicates to indicate whether a method reads from or writes to a field. Nice feature: write[_] tells whether a method writes to a field at all. Q: How do we do this recursively? All methods within the method are not writing to fields...
  • no support for generics (according to documentation)

Use case 1 is not possible because there is no support for local variable declarations. Use cases 2 & 3 are not possible because there is no support for generics/wildcards (due to the bytecode source). Use case 4 is not possible because there is no support for statements & types of expressions. Use case 6 is not possible because there is no control flow support.


  • doesn’t detect local inner classes (local & anonymous). Only inner classes (doesn't differentiate): class in all classes
  • no access to local variable declared in methods
  • no support for generics.
  • no support for constructors (considered as method init) 
  • no AST structural matching. (e.g loops ...)
  • no variable binding/unification
  • set operators (union, intersection)
  • support for read/write of fields references

Similar reasons to JTL.


  • All Java AST supported by logic queries. (tight symbiosis with Eclipse JDT) 
  • When there isn't a dedicated predicate, the language allows querying the Eclipse semantic analyzer directly. (this solves the covariant arrays use case)
  • template matching doesn't support all features. (for example generics, wildcards, try/catch blocks)


Similar reasons to JTL.


  • New version stores every AST node unit in DB
  • has notion of aggregates (count, sum, max, min, avg)
  • ".QL: Object-Oriented Queries made Easy"

Very expressive, though SQL statements may not scale for control flow matching. (direct AST pattern matching is clearer for some use cases)



No variable binding support which restricts a lot of the analysis. 


[This section is work in progress]

Query Language Features

  • variable binding
  • quantifiers (forall, exists, no)
  • predicates
  • set operators
  • aggregate operators
  • behavioural template matching
  • list pattern matching (parameters, generics)
  • querying style

Java Language Features supported

  • Attributed AST Nodes
  • control flow statements
  • local variables
  • generics / wildcards
  • anonymous class


[This section is work in progress]

Relational-based query languages are not low-level enough.

Tree-based query languages are not expressive enough.

Idea: mix both

Idea: decouple query-by-example from constraints on variables. - two different views-

It seems pure template-based languages are not expressive enough, and neither are pure logic-based queries.

The sweet spot is a combination of both mechanisms (which SOUL provides).


Tools available

  • Parser generator
  • Language Development Toolkit
  • Database/Indexing
    • Lucene (jackpot is based on it)
    • Relational databases (Oracle, MySQL, PostgreSQL...)
  • Static Analysis
    • PMD api
    • Eclipse AST
    • Tree API (annotation processing framework)

Time Plan

Week Date Log Milestones
1 9 July - 13 July
  • Source Code Query Languages Research
  • Write up of Review on wiki
2 16 July - 20 July
  • Source Code Query Languages Evaluation & Taxonomy 
  • Write up of Results on wiki
3 23 July - 27 July
  • Tools research (parser, ide support, static analysis, database)
  • Comparison of static analysis API (pmd, tree, eclipse)
  • Working on Eclipse Plugin to query JCT use cases
Milestone 1 
  • Comparative Study of Source Code Query Languages
4 30 July - 3 August
  • Developed Eclipse plugin
  • Investigating Diagnostic Listener
5 6 August - 10 August
  • Come up with syntax queries expressing our use cases
  • Enhancing diagnostics created by the Java compiler
6 13 August - 17 August
  • Research relations between Java elements that we are interested to store in database
  • Enhancing diagnostics created by the Java compiler
Milestone 2 
  • Database representation of Java Source Code for Assignments and Method Declarations
7 20 August - 24 August
  • Develop backend API to query database
Milestone 3 
  • Develop backend API to query database
8 27 August - 31 August
  • Design Grammar of the Language
Milestone 4 
  • Design what general queries will look like in the query language
9 3 September - 7 September
  • Develop prototype to query database
  • Output in web interface
Milestone 5 
  • Develop Parser and Compiler to query covariant assignment and overloaded method with single type use cases
  • Front end displaying results of query
10 10 September - 14 September
  • Implement Test Coverage
11 17 September - 21 September
  • Extend database & backend support
12 24 September - 28 September
  • Preparing presentation for JavaOne
13 1 October - 5 October
  • BOF presentation at JavaOne
14 8 October - 12 October
  • Stockholm presentation

Demo Plan


Give a live demonstration of the query language for two use cases: Covariant Arrays assignments & Overloaded Methods sharing single type


7th September


Build the minimum vertical implementation to make this possible. This consists of the following milestones:

1) Create a database representation of a Java program, which stores information about attributed AST nodes of assignments and method declarations. (deadline: 17/8)

2) Develop backend API that will query the database. (deadline: 23/8)

3) Design query language, different types of queries it will support (deadline: 29/8)

4) Build minimum grammar & parser & compiler to query the two use cases (deadline: 6/9)

5) Output results frontend (deadline: 7/9)

Extra Time

Extra time will be used to expand the infrastructure horizontally

- Support more AST nodes and relations

- Support more types of queries

- Extend backend

- Test coverage

- Web interface for Demo

Relevant Literature

[1] Brian Goetz. Language designer’s notebook: Quantitative language design.

[2] Chris Parnin, Christian Bird, and Emerson Murphy-Hill. 2011. Java generics adoption: how new features are introduced, championed, or ignored. In Proceedings of the 8th Working Conference on Mining Software Repositories (MSR '11)

[3] Ewan Tempero, Craig Anslow, Jens Dietrich, Ted Han, Jing Li, Markus Lumpe, Hayden Melton, and James Noble. 2010. The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies. In Proceedings of the 2010 Asia Pacific Software Engineering Conference (APSEC '10)

[4] Joseph Gil and Keren Lenz. 2010. The use of overloading in JAVA programs. In Proceedings of the 24th European conference on Object-oriented programming (ECOOP '10)

[5] Raoul-Gabriel Urma and Janina Voigt. Using the OpenJDK to Investigate Covariance in Java. Java Magazine May/June 2012.

Related Projects

[a] Refactoring NG.

[b] Tal Cohen, Joseph (Yossi) Gil, and Itay Maman. 2006. JTL: the Java tools language. In Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications (OOPSLA '06)

[c] Browse By Query.
