Counting Words in Java with a Mutable Counter

One of the hallmark features of the Java programming language is the immutability of basic “wrapper” data types like String and Integer. Immutable types are inherently concurrency safe and have other properties that can improve code quality, but sometimes immutable types can be costly because of their inherent unchangeability.

Immutable types are safe for concurrent access because nothing about them can be changed once they have been created – they carry whatever data was given to them when they are created. This also has other benefits; for example, immutable types can be safely used as keys in Maps. The only way to change the value of one of these immutable types is to create a new instance of the same type, replacing or combining the value of the existing object with a new value.

So for example, say there’s a String whose value is “Hello”, and you want it to change it to say “Hello, world!”. The code would go something like this:

String message = "Hello";  // initial value
message = message + ", world!";
System.out.println(message);  // now prints "Hello, world!"

Behind the scenes, the compiled Java code will create an instance of a StringBuilder object, then add the original value of the message object to it, then append the “, world!” string to it, then call the toString() on the StringBuilder object to obtain a new String object, which is bound to the message variable. The old instance of String object as well as the StringBuilder instance becomes “garbage” which is eventually garbage collected by the Java Virtual Machine.

As a more practical example, consider counting distinct words in a text file. The core of the code would keep a Map to hold the counts of each distinct word that’s encountered. Here’s how this code might look:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
    private Map<String,Integer> mWordMap;
    private int mTotalWordCount;
 
    private void count(BufferedReader wordFile) throws IOException {
	BreakIterator biter = BreakIterator.getWordInstance();
	mWordMap = new HashMap<String,Integer>( INITIAL_MAP_CAPACITY );
	mTotalWordCount = 0;
 
	int start, end;
	String buf, word;
	while ( (buf = wordFile.readLine()) != null ) {
	    biter.setText( buf );
	    start = biter.first();
	    for ( end = biter.next(); end != BreakIterator.DONE;
		  start = end, end = biter.next() ) {
		word = buf.substring( start, end );
		word = word.toLowerCase();
 
		if ( isExtraneousCharacter(word) || leadingCharNotLetter(word) ) {
		    continue;
		}
		mTotalWordCount += 1;
 
		Integer count = mWordMap.get( word );		if ( count == null ) {		    mWordMap.put( word, 1 );		} else {		    count += 1;		    mWordMap.put( word, count );		}	    }
	}
    }

The code of interest is highlighted: try to get a count for the word; if not found add the word to the Map with an initial count of 1; otherwise, add 1 to the count and save the new count in the Map. The process of incrementing the count creates a new instance of Integer; the old instance becomes garbage which is eventually garbage collected. If many large documents are being counted, this constant churning of Integer objects can become a performance issue: it takes time (as well as memory) to create instances of objects, and it takes time to garbage collect them later.

To avoid some of this creation-destruction cycle, a Counter class can be created. This is essentially a mutable holder of an integer value;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
    public static class Counter {
	private int mCount;
	public Counter() {
	    mCount = 1;
	}
	public int add1() {
	    mCount += 1;
	    return mCount;
	}
	public int getCount() {
	    return mCount;
	}
	public String toString() {
	    return "" + mCount;
	}
    }

An instance of Counter is created with an initial value of 1 (the assignment of the value 1 is implicit in the constructor), and the Counter class provides an add1() method to increment the count as more instances of a word is found. Consider the potential difference in performance here: if a 40,000 word document was being counted, and there are 2000 unique words in the document, there would be 40000 – 2000 = 38000 instances of Integer created and then thrown away very quickly. Keeping a Map to count the words, only 2000 instances of Counter would ever be created, and there would be no garbage collection attributable to the counting process.

Here’s a revised version of the above code using the Counter class:

24
25
26
27
28
29
    Counter count = mWordMap.get( word );
    if ( count == null ) {
        mWordMap.put( word, new Counter() );
    } else {
        count.add1();
    }

The revised code is very similar to the previous, but the hidden cost of object instantiation and garbage collection is no longer there.

Some testing on a MacBook Pro with 2.3 GHz processors shows that the version of WordCount with the mutable Counter runs in roughly 5% less time than the immutable Integer version (running over several test documents, each 1000-8000 words in length).

A heap dump of the two versions of the code showed that the counts of the Integer and MutableCounter object instances were dwarfed by the counts of other objects in the running code:

WordCountInteger (10 iterations):
Class                                    Instance Count  Total Size
class char[]                             206448          1763214
class java.lang.String                   201449          1611592
class java.text.StringCharacterIterator   20710           331360
class java.util.HashMap$Node              20393           326288
class byte[]                               1329           137760
class java.lang.Integer                   10571            42284class java.lang.Class                       631            30288
class java.lang.Object[]                    623            18840
class int[]                                  60             6952
...more listings followed...
WordCountMutable (10 iterations):
Class                                    Instance Count  Total Size
class char[]                             206436          1762898
class java.lang.String                   201436          1611488
class java.text.StringCharacterIterator   20710           331360
class java.util.HashMap$Node              20393           326288
class byte[]                               1339           137840
class WordCountMutable$Counter            19950            79800class java.lang.Class                       634            30432
class java.lang.Object[]                    623            18852
class int[]                                  60             6952
...more listings followed...


Getting the heap dump:
$ java -agentlib:hprof=file=memdump.hprof,format=b” WordCountInteger …textfile… 10
$ jhat memdump.hprof
…opens a web server on port 7000…
…browse to “http://localhost:7000/histo/”

The number of iterations was reduced from 1000’s to 10 because heap data collection really slows down the Java VM!

Something else to notice is the difference between the number of Integer instances shown by the heap dump with the “immutable” version of the code compared to the number of instances of Counter in the latter heap dump. There were 1995 unique words found in the document, run for 10 iterations in each heap map test, so the counts of the number of Counter instances (19950) is exactly right. But the number of instances of Integer (10571) is much lower than that. What happened here? A likely possibility: the Integer class has its own built-in caching mechanism – it will cache any Integer objects created with values between -127 and 128 (see the private IntegerCache class in the J2SE source code). So for this application, words with any counts that fall between 1 and 128 will get the same Integer instance back to represent that count, not create a new instance each time. This caching is in effect only when ‘Integer.valueOf(int)” static method is used to create a new instance. So this could account for the difference in the number of Integers and Counters in the two cases.

On a final note, the Apache Commons Lang library for Java does provide implementations of Mutable objects with many more features than the simple Counter used in the above code.

$ jar tf commons-lang3-3.6.jar | grep Mutable
org/apache/commons/lang3/mutable/MutableByte.class
org/apache/commons/lang3/mutable/MutableFloat.class
org/apache/commons/lang3/mutable/MutableObject.class
org/apache/commons/lang3/tuple/MutablePair.class
org/apache/commons/lang3/mutable/MutableLong.class
org/apache/commons/lang3/tuple/MutableTriple.class
org/apache/commons/lang3/mutable/MutableBoolean.class
org/apache/commons/lang3/mutable/Mutable.class
org/apache/commons/lang3/mutable/MutableDouble.class
org/apache/commons/lang3/mutable/MutableInt.class
org/apache/commons/lang3/mutable/MutableShort.class

The documentation says that these classes have been available since version 2.1, and further says:

“A typical use case would be to enable a primitive or string to be passed
to a method and allow that method to effectively change the value of the
primitive/string.”

“Another use case is to store a frequently changing primitive in a collection
(for example a total in a map) without needing to create new Integer/Long
wrapper objects.”

All the mutable numeric classes implement the Mutable interface, which looks like this:

$ javap -classpath commons-lang3-3.6.jar org.apache.commons.lang3.mutable.Mutable
Compiled from "Mutable.java"
public interface org.apache.commons.lang3.mutable.Mutable {
  public abstract T getValue();
  public abstract void setValue(T);
}

There is a lot of value in relying on immutable data types; by virtual of their unchangeable nature:

  • They’re thread safe
  • They can be cached
  • They can be safely used as keys in any kind of hash table
  • They improve the understandability and predictability of the code

But there can be some cases where using a mutable type can help improve code performance.

TL; DR

A mutable counter class can be used to reduce some of the load from repeated instantiation/replacement of instances of the normally immutable classes like java.lang.Integer.

The java.lang.Integer class does provide an efficiency boost by caching and reusing instances of Integer that are created via the static Integer.valueOf() method for a range of small integer values.

The ‘jhat’ tool was used to get counts of object creation, revealing the effect of caching in the java.lang.Integer class:

$ java -agentlib:hprof=file=memdump.hprof,format=b" WordCountInteger ...textfile... 10
$ jhat memdump.hprof
...opens a web server on port 7000...
...browse to "http://localhost:7000/histo/"

Further Reading

Immutable object – Wikipedia
immutability – If immutable objects are good, why do people keep creating mutable objects? – stackexchange.com
MutableInt (Commons Lang 2.4 API)
Integer (Java Platform SE 8 ) – valueOf(int) – Java 8 documentation of Integer.valueOf(int)
GC: Integer – java.lang.Integer (.java) – GrepCode Class Source – IntegerCache implementation for OpenJDK

Versions

$ java -version
java version “1.8.0_40”
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

Add a Comment