Wednesday, 10 July 2013

Performance Gotchas Using Guava Collections

We're using the Google Guava Collection Library in our current codebase - its a very nice set of APIs and utilities that greatly simplify the use of collections.

Amongst other things it allows for easy filtering of collections - this will feel very natural if you've used comparably built in functionality from languages like Ruby (or if like me you like functional languages).

There is one class though that can cause serious performance issues: Collections2

The problem is that this class maintains a "live view" onto the original collection and although can be useful, it can also introduce problems into in your application in surprising ways.

I was recently asked to look at a sudden decrease in responsiveness in our application. I started with memory first and found that we were at 99% use in our PermGen space. I initially thought that this was the problem but it only took a minute or so to realize that it was 99% of currently allocated space, not 99% of total allowed space (i.e. there was room for the JVM to allocate more if necessary).

I then investigated the problematic code paths and found that certain blocks of code, which looks completely innocuous, were incredibly slow - for example:

public void doSomething(Collection<Integer> someData) {
    Map<Integer, String> someOtherData = 
        new HashMap<Integer, String>(someData.size());
    ...more code...

I found that as soon as this method was stepped into (via a debugger or by watching logs) it would suddenly pause for a period of time before continuing.  This was especially surprising when the collection passed in appeared to be empty!

Working my way up the call chain I found that the collection passed in was created using Collections2.filter

Taking a look at the javadocs I soon realised what the problem was - although the collection passed in (someData in my example) appeared to be empty, most operations on it still iterated over every item in the original collection using the Predicate originally specified.

The problem is best shown with a simplified example:

public static void main(String[] args) {
    List<Integer> list = new ArrayList<Integer>();
    for (int i = 0; i < 1000000; i++) {

    Collection<Integer> filtered = 
        Collections2.filter(list, new Predicate<Integer>() {
        public boolean apply(Integer integer) {
            return integer < 5;

    long start = System.nanoTime();
        String.format("list.size(): %19d", (System.nanoTime() - start)));

    start = System.nanoTime();
        String.format("filtered.size(): %15d", (System.nanoTime() - start)));

Your output may vary, but on my PC I get:

list.size():                9817
filtered.size():        24157424

The problem here is that the filtered collection will iterate 1,000,000 (in this example) for each operation - even for a simply operation such as size.  If you placed trace statements in the apply method you would see that it would be called the same number of times as the number of elements in the original collection.

Now to be fair the javadocs do warn about using this class:

Many of the filtered collection's methods, such as size(), iterate across every element in the underlying collection and determine which elements satisfy the filter. When a live view is not needed, it may be faster to copy Iterables.filter(unfiltered, predicate) and use the copy.

It turned out that a colleague had innocently started using this instead of FluentIterable (or just Iterables as stated above) and this had caused the performance issues we were seeing.

Although you could say the javadocs should have been read more carefully I'd additionally argue that perhaps the name of Collections2 could have been improved upon to make its use/intent clearer.

Still, its a nice class to know about - just don't use it when a straight copy will be better.

No comments:

Post a Comment