import java.*: Collections and Algorithms

Chuck Allison

Java lacks both a standard and templates, but it nevertheless offers something resembling the C++ Standard Template Library.


Everyone needs to implement lists and trees at least once, wouldn't you say? Computer science departments certainly think so and so do I, but not in the workplace. I suppose it didn't hurt me when I wrote binary search trees in FORTRAN at my first job in the late '70s, nor when I wrote a generic list library in C almost ten years ago, but you just can't afford to roll all of your own tools in production. The proliferation of C libraries in the '80s and C++ libraries in the '90s helped a great deal, but we C/C++ folks pride ourselves in writing portable code (or at least we used to, but that's another topic of discussion), and historically, few libraries have ported well across platforms. This is a big enough problem that Bjarne Stroustrup, inventor of C++, once said that his biggest mistake with C++ was releasing it before it had a robust foundation library, including data structures [1].

That "mistake" has been more than rectified by the generic containers and algorithms in the now-standard C++ library (also erroneously called STL, for Standard Template Library [2]). C++ programmers now routinely and effectively use containers of any type of object with maximum efficiency on any platform with a standard C++ system. Iterators and function objects are everyday terms now, and yet we're all still learning how far we can go with this brilliantly designed library.

But this is a column about Java, and Java has no STL, nor even templates (although there is talk in that direction). In 1997 Objectspace, Inc. ported STL to Java and called it JGL [3], which originally stood for "The Java Generic Library," but Sun put the kibosh on such an official-sounding title, so now it's called "The Generic Collection Library for Java" (but the JGL acronym has stuck). JGL has everything STL-like that a Java programmer could hope for, even without templates. Function objects, for example, by convention use the execute method to substitute for overloading operator() in C++. All the standard function objects and adapters are there.

JGL had a growing audience in the days of JDK 1.x, especially since the latter has only four wimpy containers: Vector, Stack, Bitset, and Hashtable. With Java 2, Sun has given us a lot less than JGL, but the new collections and their associated algorithms are well designed and easy to use. You may find that they meet most day-to-day production software needs. In this article, I survey the collections in both Java 1 and 2, with an emphasis on the latter [4].

Collections in JDK 1.x

The Vector class is pretty much what you would expect: a container that holds an arbitrary number of instances derived from Object, which means you can store anything but primitives in a Vector. (This is how Java and some other object-oriented languages achieve genericity — by processing instances of a "Mother of All Classes" such as Object). The program in Listing 1 populates a Vector with seven integers and uses the indexOf method to perform various searches. indexOf searches linearly from the beginning for a match (using the equals method, of course) and returns the 0-based index of where it first occurs, or -1 if the object isn't there. There is also a contains method that returns a boolean indicating whether or not an object is present (see Listing 2). The program in Listing 2 also illustrates an Enumeration object, which is an iterator of sorts. The Vector.elements method returns an Enumeration that points to the first (zeroth) vector element. To traverse the container you query Enumeration.hasMoreElements; if it returns true, you can access the next element via Enumeration.nextElement, which as a side effect advances the iterator for the next query. You can't delete an element via an Enumeration, but you can with Java 2's iterators, which you'll see shortly.

A serious drawback to the Vector class is that most of its methods are synchronized, which is fine if you need to write thread-safe code, but unacceptable otherwise, since your code has thread overhead you'll never use. To address this problem, the Java 2 collections come in both synchronized and non-synchronized flavors (more on this below).

The Stack class attempts to give you the typical last-in-first-out container semantics that you expect, but evidently its designers skipped Object-Oriented Design 101. For some mysterious reason, Stack extends Vector, instead of just using a Vector for implementation. The problem with this approach, of course, is that it lets you use a Stack as Vector, so you can add or remove elements anywhere you want instead of just at the beginning, which is all a Stack should let you do. I'm so appalled that I'm not even going to show an example that uses Stack (really!).

The BitSet class represents what its name implies: a collection of bits. Actually, a BitSet object is an ordered set of bits, beginning with bit 0, just like arrays. You can set, reset, and test individual bits and perform set operations on BitSets as a whole (e.g., OR, AND, XOR, etc.). Oddly enough, there is no method to toggle a bit directly, so you'll have to do it manually. Bitsets grow as needed to contain whatever bit number you need to refer to. The length method returns how many bits are in use. (In other words, length returns n if n-1 is the highest bit number you've referenced.) The program in Listing 3 creates two BitSets and computes their intersection relative to the first.

Hashtables are fast lookup tables. A Hashtable object stores key-value pairs by hashing the key into an open hash table, which is basically a dynamic array of linked lists, each list holding pairs whose keys hash to the same value. Hash tables tend to perform faster than tree-based solutions, but are by nature unordered. As you can see in Listing 4, you add a key-value pair with the put method and perform lookups with the get method. If you put a duplicate entry in the table for an existing key, the new value replaces the old (i.e., no duplicate keys, or "multimaps," allowed). If you search for a non-existent key, you get a null back. You can also search for the existence of keys and values independently and even traverse them via Enumerations.

That's pretty much all you get with JDK 1.x — no sets, no lists, no queues. And what's worse: no generic algorithms. If you want to sort a Vector, for example, tough luck; you have to write your own sort routine. Java 2 rectifies this situation with a well-defined hierarchy of collections and associated algorithms and also provides replacements for all the 1.x containers except BitSet.

The Java 2 Collection Hierarchy

Java 2 provides two families of containers, Collections and Maps. Collections are containers that implement the Collection interface, which defines basic collection methods for inserting, removing, and visiting elements (see Listing 5). The Collection interface is further refined into List and Set (see Figure 1). Lists are ordered sequences that support direct indexing(!) as well as bi-directional traversal. (For the STL-aware, a List is a sequence — see Listing 6 for its methods). Java 2 comes with LinkedList and ArrayList as implementations for List, as shown in Figure 2. (The leaves of the hierarchy are implementations.) Vector is also there for historical reasons, but ArrayList is preferred over Vector in new code. LinkedList implements a doubly-linked list and is optimal for inserting and removing elements anywhere in a list. ArrayList is simply an expandable array and is preferred when you need to use random access to list elements.

The program in Listing 7 puts both ArrayList and LinkedList to the test. The get method returns the object at a particular position, while indexOf performs a linear search to find the index of an element and returns -1 if it's not present. The subList method is a useful feature of the List interface. It returns a List that is a subset of the original, using an asymmetric range specification, similar to STL ranges, where the first index is inclusive and the second is exclusive. For example, the statements

LinkedList list = new LinkedList(array);
List view = list.subList(1,3);

define view as containing the second and third elements of list (i.e., indexes 1 and 2), where list is itself a shallow copy of array. Since view is just a wrapper for a subset of list, any operations on view affect list. Appending Gregory's instance to view, therefore, inserts it after James, not Charles, the latter being outside of view.

Listing 7 also illustrates various algorithms found in the java.util.Collections class, such as sort, binarySearch, max, min, and reverse. These are similar in spirit to those in java.util.Arrays, but apply only to collections. The complete listing of algorithms in java.util.Collections is in Listing 8. Notice the methods containing the prefixes unmodifiable and synchronized. These methods act as wrappers for existing collections and restrict their use. If you call unmodifiableCollection on an existing collection, for example, the object returned would throw an UnsupportedOperationException if you try to call any of the Collection methods that modify a collection. The synchronized methods return objects that are thread-safe (i.e., the mutator methods in the wrapper are synchronized), which is the only way to use collections in multithreaded situations since for efficiency reasons none of the Collection methods are synchronized. The wrappers act as a synchronized view into the original collection, and any changes you make in the wrapper are reflected in the backing collection. If want to iterate through a synchronized collection, you need to synchronize on the collection to ensure thread-safety, as in:

Collection coll = Collections.synchronizedCollection(aCollection);
          ...
synchronized(coll)
{
    Iterator iter = coll.iterator();
    while (iter.hasNext())
    {
        // use iter.next()...
    }
}

You may have noticed the glaring omission of some popular data structures, such as queue, stack, and deque. Although they're not there explicitly, you can use LinkedList to achieve the desired effect. LinkedList includes the following methods, among others:

// Some LinkedList methods:
void addFirst(Object o);// push_front()
void addLast(Object o); // push_back()
Object getFirst();      // front()
Object getLast();       // back()
Object removeFirst();   // pop_front()
Object removeLast();    // pop_back()

For you STL folks, I have included the names of the equivalent C++ functions in the comments above. To implement a queue, all you have to do is use a LinkedList that supports addFirst and removeLast, as follows:

class Queue
{
    private LinkedList data;
    
    public Queue()
    {
        data = new LinkedList();
    }
    public void put(Object o)
    {
        data.addFirst(o);
    }
    public Object get()
        throws NoSuchElementException
    {
        return data.removeLast();
    }
    public int size()
    {
        return data.size();
    }
}

For a stack, just replace removeLast with removeFirst. For a deque, which allows insertions and removals at both ends, just use LinkedList as it is.

Iterators

Iterators, like Enumerations, are objects that facilitate traversing through an ordered sequence (see Listing 9). Unlike Enumerations, you can remove elements via an iterator. All collections support iterators, but the order of the traversal depends on the collection itself. For lists, you visit objects in the order they are stored. For sets (discussed below), objects appear in no particular order unless the set is ordered. The ListIterator interface, also in Listing 9, lets you update an object through an iterator, as shown in Listing 10. All the iterators returned by the Java 2 collections are fail-fast, which means that if the underlying structure changes during the lifetime of an iterator, other than through operations performed through the iterator itself, the iterator becomes invalid and any attempted use of it thereafter will result in a ConcurrentModificationException.

Sets

The Set interface conforms to the usual notion of a mathematical set, which basically is just an unordered receptacle for elements, and provides operations for insertion, removal, and testing of membership. Since these are already part of the Collections interface, the Set interface is identical to Collection, with the added specification that duplicates are not allowed. The SortedSet interface adds methods useful for processing a set ordered by comparator (see Listing 11). The HashSet class implements Set, and TreeSet implements SortedSet. HashSets are more efficient, since they use hashing algorithms, but if order is important, TreeSets provide logarithmic-time algorithms since they store the object in balanced tree structures. As you can see in Listing 12, any attempts to insert duplicates into a set are ignored. For TreeSets, headSet returns a view from the beginning up to but not including the specified element, and tailSet comprises the rest of the sequence.

Maps

A map, like a hash table, is a dictionary, or associative array. A map stores pairs, treating the first element of the pair as a key, and the second as its associated value. Maps provide fast search times for keys, and duplicate keys are not allowed. Like Sets, there are HashMaps and TreeMaps, the former being faster and unordered while instances of the latter are ordered. For some reason, maps are in no way related to collections in Java 2; they have their own interface and implementation hierarchies (see Figures 3 and 4). A WeakHashMap stores keys as weak references, which means that if the only reference to the key object is the weak reference in the map, then that object can be garbage-collected, whereupon the entry is automatically removed from the map. (This basically gives you caching for free.)

The key methods of the Map interface are put and get (see Listing 13 for the complete specification). The put method stores a pair of objects in the map, and get retrieves the value for the given key. If you want to just see the keys, keySet returns them as a Set view, and the valuesmethod returns the values as a Collection. If you want to iterate through the pairs, you can call entrySet, which returns the pairs as a Set of instances of Map.EntrySet (which also appears in Listing 13). All the above is illustrated in the program in Listing 14. Note that calling put a second time with the same key replaces its associated value.

An Application

Let's finish by doing something useful. The program in Listing 15 implements a typical cross-reference lister, which prints each word from a text file with the line numbers where it appears. Applying Listing 15 to its own text produces the output:

a: 2, 28
Add: 45, 54
addLast: 55
already: 49
and: 45
args: 81, 84, 93, 100, 101
at: 28
BufferedReader: 21, 88, 102
(etc.)

This program uses a Map that associates a token with a list of line numbers. Since I want the tokens to appear in alphabetical order, I use a TreeMap with the default comparator. I want case to be ignored in comparisons, so I define a comparator, called NoCase, that calls on String.compareToIgnoreCase to do the job. Before inserting a new pair, I first check to see if it is already in the map. If not, I add it with the statement

map.put(token, new LinkedList());

Then I retrieve the list of lines with

LinkedList lines = (LinkedList) map.get(token);

I then check to see if the current line number has already been added to the list; if not, I append it. It would have been simpler to just use a TreeSet, but the constant re-balancing of the tree would degrade performance, and since the lines are entered in the correct order, a list suffices.

Summary

I hope you'll agree that Java has come of age as far as data structures and algorithms are concerned. The Arrays and Collections classes have useful algorithms for arrays and collections, respectively, and the families of Collection and Map implementations provide virtually everything you need in the way of data structures. As a C++ programmer, I must admit that I miss function objects a little, but only a little. I must also confess that I find the Java 2 collections easier to use than STL. Although my C++ version of the cross-reference program was 20 lines shorter than the Java version, it took me longer to get it right. To be fair, the C++ collections tend to be much faster than Java's, but this is the story of these two languages in general. (I realize not everyone agrees with me here!) If you want blinding speed, use C++. For typical applications, it's hard to imagine a situation that the Java 2 collections won't nicely handle, and if you need more intricate data structures, you have a nice hierarchy to plug into and algorithms ready to use.

Notes

[1] See my interview with Bjarne Stroustrup, "C++: The Making of a Standard," CUJ, October 1996, also available at http://www.freshsources.com (select "C++" under Articles).

[2] Don't get me started. The term STL is both misused and obsolete.

[3] Read about JGL at http://www.objectspace.com/jgl/prodJGL.asp.

[4] The array algorithms in java.util.Arrays were discussed in the March 2000 issue of this column.

Chuck Allison is a columnist with CUJ. He is the owner of Fresh Sources, a company specializing in object-oriented software development, training, and mentoring. He has been a contributing member of J16, the C++ Standards Committee, since 1991, and is the author of C and C++ Code Capsules: A Guide for Practitioners, Prentice-Hall, 1998. You can email Chuck at chuck@freshsources.com.