HashMaps, Immutable, and Mutable Objects in Java

The immutable vs mutable debate has a lot of muscle behind it.  On the mutable side of the world, people argue performance.  If you modify your data structures in place, then you are going to have a faster program.   You don’t have to create a new object every time you replace a value.   On the immutable side of the world,  folks argue correctness.  If you have an object that doesn’t change, then you can guarantee that any reference to it will remain valid forever.  No worries about some other sneaky thread coming along and messing with your data (a nightmare bug scenario if I’ve ever heard of one).

Java is a language in which the standard object wrappers are immutable.  No changing the characters of that string, instead you have to create a new string.  To get around this problem, the Hadoop echo-system has introduced Writable objects, for example Text or LongWritable.  These are mutable object containers for primitive types.  Inside of a Mapper or Reducer, you can’t keep references to these objects and expect them to stay around between function calls, as the underlying data will change.

But what is the big deal?  Lets look at a toy problem to explore the performance differences in Java of using Mutable vs. Immutable objects.  In our toy problem:

  • We have a random sequence of numerical IDs.
  • Those IDs come from a small unknown set
  • But the range of ID values is large
  • We would like to count the number of occurrences of each ID
  • To do this we will use a [long -> long] map, where the key is the ID, and the value is the count

Our TestHarness class:

  • We are going to bake in the loop to force the Java runtime to compile the loop code into machine code.
  • We are going to then run a sequence of operations to update the counters.
  • We are going to finally clear the map and do this a number of times, to sample the runtime a few times.

import java.util.HashMap;
class TestPerformance
{
  static final int SAMPLES = 10;
  static final int ITERATIONS = 1000000;
  static final int NUMKEYS = 1000;
//
  static class MutableLong
  {
    public long val;
    public MutableLong(long val) { this.val = val; }
    public int hashCode() { return (int)this.val; }
    public boolean equals(Object other)
    {
      return other instanceof MutableLong ? this.val == ((MutableLong)other).val : false;
    }
  };
//
  // Uses the building java.lang.Long class
  static HashMap<Long,Long> dataMap = new HashMap<Long,Long>();
  static HashMap<Long, MutableLong> mixedDataMap = new HashMap<Long, MutableLong>();
  static HashMap<MutableLong, MutableLong> mutableDataMap = new HashMap<MutableLong, MutableLong>();
  static long[] sequence = new long[ITERATIONS];
//
  static void init()
  {
    long[] keys = new long[NUMKEYS];
    for (int i = 0; i < NUMKEYS; i++)
      keys[i] = (long)(Math.random() * Long.MAX_VALUE);
    for (int i = 0; i < ITERATIONS; i++)
      sequence[i] = keys[(int)(Math.random() * NUMKEYS)];
  }
//
  static void runLoop()
  {
    for ( int iteration = 0; iteration < ITERATIONS; iteration++)
    {
      //TODO: LOOP CODE GOES HERE!!!
    }
  }
//
  public static void main(String[] args)
  {
    init();
//
    // Allow the just in time compiler to kick in first
    runLoop();
//
    for (int sample = 0; sample < SAMPLES; sample++)
    {
      long start = System.nanoTime();
      runLoop();
      long end = System.nanoTime();
      System.out.println(end - start);
    }
  }
}

So lets think about the code to put in the hole above.  Here is Immutable Baseline code:


long nextKey = sequence[iteration];
if(dataMap.containsKey(nextKey))
{
  dataMap.put(nextKey, dataMap.get(nextKey) + 1);
}
else
{
  dataMap.put(nextKey, 1L);
}

This code uses the Immutable Long object (the Java compiler automatically boxes and unboxes the Long object around the Long constants in order to do this). However, it is really, really inefficient. It queries the hashmap for the object twice, when once would be sufficient.

We see a better version of this loop, still using the Immutable Object HashMap in the Immutable Optimized code:


long nextKey = sequence[iteration];
Long currentVal = dataMap.get(nextKey);
if(currentVal != null)
{
  dataMap.put(nextKey, currentVal + 1);
}
else
{
  dataMap.put(nextKey, 1L);
}

This code is much better because we only use one get and one put for each item in the sequence.

However, we can do even better with a mutable value for our counter: Mutable Counter code:


long nextKey = sequence[iteration];
MutableLong currentVal = mixedDataMap.get(nextKey);
if(currentVal != null)
{
  currentVal.val += 1;
}
else
{
  mixedDataMap.put(nextKey, new MutableLong(1));
}

Here we are only using a mutable object for the counter. This gives a huge boost, because now we only put an object in the map when we first see that key.

What if we use a mutable object for the key as well? This should hypothetically gain us time because we don’t have to allocate a key object for most items in the sequence: Then we have the Mutable Key code:


MutableLong mutableKey = new MutableLong(0);
for (int iteration = 0; iteration < ITERATIONS; iteration++)
{
  long nextKey = sequence[iteration];
  mutableKey.val = nextKey;
  MutableLong currentVal = mutableDataMap.get(mutableKey);
  if(currentVal != null)
  {
    currentVal.val += 1;
  }
  else
  {
    mutableDataMap.put(new MutableLong(nextKey), new MutableLong(1));
  }
}

So how do these code snippets compare?

Here are the timing results using Java 1.7.0_25 on a 64-bit Linux machine:

dyerware.com


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>