import java.*: Basic Stream I/O

Chuck Allison

Java supports input/output of streams with a gazillion combinations of options.


You can't use a language effectively if it doesn't enable you to easily communicate with the outside world, whether through a display, a file system, or a network. Furthermore, a language's I/O facilities should be robust, cohesive, and easy to use. In 1979 I was on a project that used a proprietary language called PL/S III. It had a small keyword list, was nicely block-structured, and was easy to learn — but it had no I/O capability whatsoever! I had been out of school only one year and was thoroughly dumbfounded. An I/O-less programming language? But...! What...? How...? Needless to say, I was full of questions. The answer: our local BAL [1] expert wrote assembler language routines to give us the I/O features we needed.

Today's Internet culture will not tolerate such impediments to rapid development. When it comes to I/O power, Java has everything most developers could possibly wish for. In the unlikely case you don't find what you need, there are (as always) classes to extend and interfaces to implement, and as fast as you can define int MyInputClass.read(){...}, you've solved your problem.

Abandon printf, All Ye Who Enter Here

If you're looking to save keystrokes, though, look somewhere else. Certainly you've discovered by now that Java is not known for its economy of expression, and nowhere does verbosity rule and reign more mightily than in Java's I/O classes. As in other areas of the Java library, flexibility is the name of the game. A quick glance at the java.io package reveals over forty classes spread throughout four hierarchies, with a few extra classes on the side (see Figures 1, 2, 3, 4, and 5). There are even more classes in these hierarchies if you open the lid on the packages java.util.zip and java.security, but I'll leave those for another time. It is indeed a daunting maze at first glance, but there is good design lurking beneath.

The first thing to understand about these hierarchies is that the first two, InputStream and OutputStream, deal with streams of bytes, while the Readers and Writers traffic in characters, which are 16-bit Unicode quantities. If you're working with binary data, therefore, you'll probably want to use byte streams, whereas character streams are suitable for text processing.

The second thing to note is that these classes fall into two broad categories, roughly referred to as low-level and high-level streams. Low-level streams are objects you define independently of any other stream, like a stream connected to files or arrays in memory, or pipes between processes. They include the following:

ByteArrayInputStream & ByteArrayOutputStream
FileInputStream & FileOutputStream
PipedInputStream & PipedOutputStream
CharArrayReader & CharArrayWriter
FileReader & FileWriter
PipedReader & PipedWriter
StringReader & StringWriter

The rest are either abstract classes or high-level streams that act as wrappers to existing streams to add functionality thereto. For example, if you have a FileOutputStream you can make it a BufferedOutputStream for increased efficiency like this:

// First define a file stream:
FileOutputStream fs =
    new FileOutputStream("myfile.out");

// Wrap in a buffered stream:
BufferedOutputStream bfs =
    new BufferedOutputStream(fs);

Or you can do it all at once like this:

BufferedOutputStream bfs =
    new BufferedOutputStream(
        new FileOutputStream(
            "myfile.out"));

The wrapper buffers the output for efficiency when processing large files, and when you call bfs.close the underlying file stream also closes. This design obviates the need for a combinatorial explosion of classes such as BufferedFileOutputStream, BufferedFileInputStream, BufferedByteArrayInputStream, etc. [2].

Overwhelmed yet? I warned you about all the typing! If you read my article in the July 2000 issue about text formatting, you saw a whole family of classes just for formatting different kinds of numbers. The separation of formatting from I/O is good design, just like the separation between low-level and high-level streams, but it keeps your fingers busy and your source files full. Alas, there is no concise printf-like operation in Java Land!

Byte Streams

The basic contracts for all byte streams are defined in the abstract classes InputStream and OutputStream and consist of methods to read or write one or more bytes. The following command-line filter uses the pre-defined streams System.in and System.out to copy standard input to standard output.

import java.io.*;

class CopyInput
{
    public static void 
    main(String[] args)
        throws IOException
    {
        int b;
        while ((b = System.in.read()) !=
                -1)
            System.out.write(b);
    }
}

Most of the methods found in the java.io package may throw an exception derived from java.io.IOException. In real programs, you'll want to catch and process these exceptions, but for convenience and clarity, in this article I just include a throws clause in the specification of main. As always, consult the online Java API documentation for more detail.
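For comparison, here is a rough sketch of what the same filter might look like with the exception caught instead of propagated. The class name SafeCopy and the helper method copy are my own inventions for illustration, and printing a message and exiting is only one of many reasonable ways to respond to a failure.

```java
import java.io.*;

class SafeCopy
{
    // Copies every byte from in to out; returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out)
        throws IOException
    {
        long count = 0;
        int b;
        while ((b = in.read()) != -1)
        {
            out.write(b);
            ++count;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args)
    {
        try
        {
            copy(System.in, System.out);
        }
        catch (IOException x)
        {
            // Report the problem on standard error and signal failure
            System.err.println("I/O error: " + x.getMessage());
            System.exit(1);
        }
    }
}
```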

InputStream.read extracts the next byte from its underlying source and returns it as an int for the same reason that getc returns an int in C: so -1, the end-of-file indicator, will be distinct from any byte. OutputStream.write pushes a byte onto its underlying sink. It's important not to use System.out.print here. System.out is actually an instance of PrintStream, whose print methods do some minimal formatting, which in this case would output the decimal character representation of the numeric value of its argument. For example, in the ASCII character set an opening brace would print as "123" instead of "{".
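The difference between write and print is easy to see in isolation. The following sketch uses a PrintStream wrapped around an in-memory buffer as a stand-in for System.out; the class name WriteVersusPrint and the method emit are hypothetical names chosen for this illustration.

```java
import java.io.*;

class WriteVersusPrint
{
    // Returns the text a PrintStream produces for write(b) or print(b).
    static String emit(int b, boolean usePrint)
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true);
        if (usePrint)
            out.print(b);   // formats the int as decimal digits
        else
            out.write(b);   // passes the low-order byte through as-is
        out.flush();
        return sink.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(emit(123, false));   // {
        System.out.println(emit(123, true));    // 123
    }
}
```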

To copy files with this program, you have to use redirection on the command line, of course, as in:

C:> java CopyInput <infile >outfile

To read command-line arguments as file names explicitly, you can use main's string array parameter, as follows:

import java.io.*;

class CopyFile
{
    public static void 
    main(String[] args)
        throws IOException
    {
        // Copy files explicitly:
        FileInputStream fin =
            new FileInputStream(
                args[0]);
        FileOutputStream fout =
            new FileOutputStream(
                args[1]);

        int b;
        while ((b = fin.read()) != -1)
            fout.write(b);

        fin.close();
        fout.close();
    }
}

Each constructor opens file streams for input/output automatically and throws an exception if the input file doesn't exist or if some other error occurs. You always need to close any top-level stream that you create. (Remember, Java doesn't have destructors like C++!) The program above will of course throw an ArrayIndexOutOfBoundsException if you don't provide two filenames on the command line. The program in Listing 1 combines the flexibility of the CopyInput.java and CopyFile.java above by defaulting to standard input or standard output if you omit any filenames.
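The idea of falling back on the standard streams can be sketched roughly as follows. This is not Listing 1 itself, only one plausible way to get the same effect; the names CopyDefault, openInput, and openOutput are assumptions of mine. The finally clause closes only the file streams the program opened and leaves System.in and System.out alone.

```java
import java.io.*;

class CopyDefault
{
    // Uses the first argument as the input file, if present.
    static InputStream openInput(String[] args)
        throws IOException
    {
        if (args.length > 0)
            return new FileInputStream(args[0]);
        return System.in;
    }

    // Uses the second argument as the output file, if present.
    static OutputStream openOutput(String[] args)
        throws IOException
    {
        if (args.length > 1)
            return new FileOutputStream(args[1]);
        return System.out;
    }

    public static void main(String[] args)
        throws IOException
    {
        InputStream in = openInput(args);
        OutputStream out = openOutput(args);
        try
        {
            int b;
            while ((b = in.read()) != -1)
                out.write(b);
            out.flush();
        }
        finally
        {
            // Close only what this program opened
            if (in != System.in)
                in.close();
            if (out != System.out)
                out.close();
        }
    }
}
```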

The program in Listing 2 shows how easy it is to define your own streams. In the case of an input stream, all you have to do is extend InputStream and override the read method, which returns the next byte from your stream. Here I just return a random byte value. The other read methods that extract an array of bytes are implemented in terms of InputStream.read, so I get those for free. After reading the first byte, I save the next three bytes in a byte array. I then wrap a ByteArrayInputStream around that array and read it again. The method InputStream.read(byte[] arr) returns the number of bytes read, up to arr.length. The InputStream.available method returns the number of bytes that can be read without blocking, which in this case is the entire array. InputStream.mark is like ftell in C — it stores a file position that you can return to with the reset method, provided you don't read more than the number of bytes that you passed to mark initially. The skip method is an efficient way of ignoring bytes in a stream. The casts to byte are significant in this example. Remember that ints are actually returned. If I didn't do the cast, then negative numbers would print as their positive two's-complements.
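To make these methods concrete, here is a small sketch along the same lines. It is not Listing 2; the class names RandomInputStream and MarkDemo are hypothetical, and I use a fixed array in the second half so the values are predictable. The comments show the value each call yields for this particular array.

```java
import java.io.*;
import java.util.Random;

// A hypothetical stream that supplies an endless run of random bytes.
class RandomInputStream extends InputStream
{
    private final Random rand = new Random();

    public int read()
    {
        return rand.nextInt(256);   // always 0..255, never -1
    }
}

class MarkDemo
{
    // Exercises available, mark, skip, and reset on a known array.
    static int[] demo()
    {
        byte[] data = {10, 20, 30, 40, 50};
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int avail = in.available();   // 5
        int first = in.read();        // 10
        in.mark(data.length);         // remember the current position
        int second = in.read();       // 20
        in.skip(2);                   // ignore 30 and 40
        int afterSkip = in.read();    // 50
        in.reset();                   // return to the marked position
        int again = in.read();        // 20
        return new int[] {avail, first, second, afterSkip, again};
    }

    public static void main(String[] args)
        throws IOException
    {
        InputStream r = new RandomInputStream();
        byte[] three = new byte[3];
        int n = r.read(three);        // inherited; calls read() repeatedly
        System.out.println(n + " random bytes read");

        int[] v = demo();
        for (int i = 0; i < v.length; ++i)
            System.out.println(v[i]);
    }
}
```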

High-Level Streams

The program in Listing 3 illustrates SequenceInputStream, a high-level input stream that wraps an arbitrary number of existing streams so that you can process them in sequence as a single stream. In this example, I treat the three log files depicted in Listings 4, 5, and 6 as a single log file. First I have to open each stream separately and place their stream references into any collection that can yield an Enumeration [3]. That enumeration then becomes the argument to the constructor for my SequenceInputStream, which I process as a single entity. SequenceInputStream.close automatically closes the underlying files' streams.
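A stripped-down sketch of the technique follows. To keep it self-contained I substitute in-memory ByteArrayInputStreams for the log files, and I use Collections.enumeration to get an Enumeration from the list; the class name SequenceDemo and the method readAll are assumptions for illustration, not the code in Listing 3.

```java
import java.io.*;
import java.util.*;

class SequenceDemo
{
    // Treats several in-memory streams as one and returns the combined text.
    static String readAll(String[] parts)
        throws IOException
    {
        List<InputStream> streams = new ArrayList<InputStream>();
        for (int i = 0; i < parts.length; ++i)
            streams.add(new ByteArrayInputStream(
                parts[i].getBytes("US-ASCII")));

        SequenceInputStream in =
            new SequenceInputStream(
                Collections.enumeration(streams));

        StringBuilder text = new StringBuilder();
        int b;
        while ((b = in.read()) != -1)
            text.append((char) b);
        in.close();   // also closes the underlying streams
        return text.toString();
    }

    public static void main(String[] args)
        throws IOException
    {
        String[] logs = {"first\n", "second\n", "third\n"};
        System.out.print(readAll(logs));
    }
}
```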

It is often convenient to write program values out to files so that you can read them back later. You don't really need to know how it's done, nor do you ever plan on reading the intermediate file(s) — you just want to reconstitute objects at some future time. This is a well known technique called serialization and is supported by two sets of classes in Java. The DataOutputStream class serializes Java's primitive types, as well as String objects, to an existing OutputStream with methods like writeBoolean, writeInt, writeFloat, etc.; a DataInputStream object reads those serialized bytes and reconstructs the corresponding objects, as shown in Listing 7.

It's instructive to look at the intermediate file, data.dat. Here it is in hexadecimal format (with linefeeds added for clarity):

/* Contents of data.dat in hex:
00
41
00 41
00 05 68 65 6C 6C 6F
3F 80 00 00
00 00 00 01
*/

As you can see, Booleans are stored as single bytes with the value 0 for false (and non-zero for true), and a char is stored in two-byte Unicode format. The writeUTF method stores a string in UTF-8 format, which is a standard and efficient way of serializing Unicode strings. The first two bytes (00 05) constitute a short integer representing the number of bytes in the serialized representation of the string, which bytes then follow. Traditional ASCII characters require only one byte, the next 1,920 Unicode code points (U+0080 through U+07FF) require two bytes, and the rest require three, so UTF-8 is good for ASCII, but wasteful for Asian characters.
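You can check the byte counts for yourself with a few lines of code. The sketch below writes a string with writeUTF to an in-memory stream and dumps the result in hex; the class name DataHex and its helper methods are hypothetical. The three sample strings contain a run of ASCII letters, the Latin-1 character U+00E9, and the Chinese character U+4E2D, respectively.

```java
import java.io.*;

class DataHex
{
    // Formats a byte array as space-separated, two-digit hex values.
    static String hex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; ++i)
        {
            if (i > 0)
                sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Returns the bytes that writeUTF produces for s, in hex.
    static String utfBytes(String s)
        throws IOException
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        out.writeUTF(s);
        out.close();
        return hex(sink.toByteArray());
    }

    public static void main(String[] args)
        throws IOException
    {
        System.out.println(utfBytes("hello"));    // 00 05 68 65 6C 6C 6F
        System.out.println(utfBytes("\u00E9"));   // 00 02 C3 A9
        System.out.println(utfBytes("\u4E2D"));   // 00 03 E4 B8 AD
    }
}
```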

These data streams only work for primitives and strings. You can serialize arbitrary objects, including arrays, however, with ObjectOutputStream and ObjectInputStream. These are very intelligent classes. You can have objects within objects extending other objects, serialize them to a byte stream, and when you deserialize them all the relationships are intact. The program in Listing 8 defines a class Person, which extends class CarbonUnit and contains an instance of class Name and class Date. To serialize a Person to a byte stream, all classes involved must implement the marker interface Serializable, otherwise you get a java.io.NotSerializableException (Person is implicitly serializable since it extends CarbonUnit, which implements the Serializable interface). All non-static, non-transient fields are serialized; if you want to ignore a field during serialization, qualify it with the transient keyword. This typically applies to computed fields or values or references that are cached. When an object is reconstituted, its transient fields are zero initialized.

When I'm ready to serialize a Person object, I wrap a FileOutputStream with an ObjectOutputStream and call writeObject. That's it! To deserialize, I wrap the FileInputStream in an ObjectInputStream and call readObject. Easy enough. Since readObject returns an Object, I have to cast to a Person.
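The write-then-read sequence can be sketched in miniature like this. The Reading class below is a hypothetical stand-in, not the Person class of Listing 8, and I serialize to a byte array instead of a file so the example needs nothing on disk. Note that the transient field comes back as zero.

```java
import java.io.*;

// A hypothetical serializable class with one transient (cached) field.
class Reading implements Serializable
{
    int x, y;
    transient int cachedSum;

    Reading(int x, int y)
    {
        this.x = x;
        this.y = y;
        cachedSum = x + y;
    }
}

class RoundTrip
{
    // Serializes obj to memory and reconstitutes a copy from those bytes.
    static Object copyViaSerialization(Object obj)
        throws IOException, ClassNotFoundException
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(sink);
        out.writeObject(obj);
        out.close();

        ObjectInputStream in =
            new ObjectInputStream(
                new ByteArrayInputStream(sink.toByteArray()));
        Object result = in.readObject();
        in.close();
        return result;
    }

    public static void main(String[] args)
        throws IOException, ClassNotFoundException
    {
        Reading r = (Reading) copyViaSerialization(new Reading(3, 4));
        System.out.println(r.x);           // 3
        System.out.println(r.y);           // 4
        System.out.println(r.cachedSum);   // 0 (transient, so not saved)
    }
}
```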

How are these methods so smart? First of all, ObjectOutputStream.writeObject determines the actual type information of its argument via reflection, an object introspection capability in Java that I'll discuss in a future article. Also, object references are replaced with local serial numbers that have meaning within each serialization stream, so when objects are deserialized, the restored references point to the right objects.
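A quick way to see the reference tracking at work is to serialize an array that holds the same object twice. In the sketch below (the class name SharedRefs is hypothetical, and StringBuffer is used only because it happens to be serializable), the two slots of the restored array still refer to a single object, which is a new copy distinct from the original.

```java
import java.io.*;

class SharedRefs
{
    // Returns true if two references to one object are still shared
    // after a trip through a serialization stream.
    static boolean stillShared()
        throws IOException, ClassNotFoundException
    {
        StringBuffer shared = new StringBuffer("shared");
        Object[] pair = {shared, shared};

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(sink);
        out.writeObject(pair);
        out.close();

        ObjectInputStream in =
            new ObjectInputStream(
                new ByteArrayInputStream(sink.toByteArray()));
        Object[] copy = (Object[]) in.readObject();
        in.close();

        // One restored object, referenced twice, and not the original
        return copy[0] == copy[1] && copy[0] != shared;
    }

    public static void main(String[] args)
        throws IOException, ClassNotFoundException
    {
        System.out.println(stillShared());   // true
    }
}
```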

You might want to experiment with this example in a couple of ways. First, remove the "implements Serializable" qualifier from the CarbonUnit class. You'll get no errors, but the id numbers will be wrong in the reconstituted objects (3 and 4, respectively, instead of 0 and 1). Why is that so [4]? And if you try removing the same qualifier from Name, what will happen [5]?

Character Streams

With only a few exceptions, all byte streams have a character-based Reader/Writer counterpart. A character stream version of CopyFile.java above, for example, looks like this:

import java.io.*;

class CopyChars
{
    public static void 
    main(String[] args)
        throws IOException
    {
        // Copy files
        // character-by-character:
        FileReader fin =
            new FileReader(args[0]);
        FileWriter fout =
            new FileWriter(args[1]);

        int c;
        while ((c = fin.read()) != -1)
            fout.write(c);

        fin.close();
        fout.close();
    }
}

This version doesn't add any functionality over the byte stream version and is of little use. Character streams do allow you to read a line at a time, however, which can be useful in text processing applications. To read lines you need a BufferedReader. To behave like a command-line filter, the utility in Listing 9 wraps System.in in a BufferedReader and calls BufferedReader.readLine repeatedly. InputStreamReader is a bridge from byte streams to character streams: it takes a byte stream and wraps it in a Reader that returns characters [6]. There is also an OutputStreamWriter for converting output byte streams to Writers. Since keeping track of lines is a common task, java.io provides the LineNumberReader class that keeps count for you (see Listing 10).
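As a rough illustration of line-oriented input, the following filter prefixes each line with its number. It is a sketch of the idea and not a copy of Listing 9 or 10; the class name NumberLines and the method number are assumptions. LineNumberReader extends BufferedReader, so readLine is available directly.

```java
import java.io.*;

class NumberLines
{
    // Returns the text from source with each line prefixed by its number.
    static String number(Reader source)
        throws IOException
    {
        LineNumberReader in = new LineNumberReader(source);
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null)
            sb.append(in.getLineNumber())
              .append(": ")
              .append(line)
              .append('\n');
        return sb.toString();
    }

    public static void main(String[] args)
        throws IOException
    {
        // Bridge the System.in byte stream to a character stream
        System.out.print(number(new InputStreamReader(System.in)));
    }
}
```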

Token Parsing

A good deal of text processing consists of fishing through input files for data surrounded by delimiters, which are ignored on input. A common situation in data processing requires reading comma-delimited files, such as when exporting data from one database to import into another. With the StreamTokenizer class you can define which characters in a character stream make up tokens and which don't (similar to strtok in C). The program in Listing 11 reads files of employee tokens that come in groups of three: a name, a number, and a title. The first and third fields are strings and can contain spaces. By default, StreamTokenizer treats white space as a separator between tokens, so to preserve spaces in a token it should be surrounded by quotes, which are subsequently discarded on input. To tag the comma character as an ordinary (i.e., non-token) character, call the ordinaryChar method. The parseNumbers method tells the tokenizer to recognize numbers and not just strings in general. A nice feature for compiler writers is the ability to ignore C and C++-style comments, both of which are valid in Java, of course (see the calls to slashSlashComments and slashStarComments). The tokenizing loop is driven by a call to nextToken, which returns StreamTokenizer.TT_EOF when input is exhausted. To retrieve a token as a string, simply access the public sval field. For numbers, use nval. Whenever an ordinary character is extracted, it is stored in the ttype field.
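Here is a minimal sketch of such a tokenizing loop. It is not Listing 11; the class name EmployeeTokens, the method parse, and the sample data are all made up for illustration. Quoted strings come back with the quote character in ttype and their text in sval, and the loop simply skips the commas.

```java
import java.io.*;
import java.util.*;

class EmployeeTokens
{
    // Collects the name, number, and title fields from comma-delimited text.
    static List<String> parse(Reader source)
        throws IOException
    {
        StreamTokenizer in = new StreamTokenizer(source);
        in.ordinaryChar(',');          // commas are delimiters only
        in.slashSlashComments(true);   // ignore // comments
        in.slashStarComments(true);    // ignore /* */ comments

        List<String> fields = new ArrayList<String>();
        int t;
        while ((t = in.nextToken()) != StreamTokenizer.TT_EOF)
        {
            if (t == StreamTokenizer.TT_NUMBER)
                fields.add(String.valueOf((int) in.nval));
            else if (t == StreamTokenizer.TT_WORD || t == '"')
                fields.add(in.sval);
            // anything else (here, a comma) is discarded
        }
        return fields;
    }

    public static void main(String[] args)
        throws IOException
    {
        String data =
            "// name, number, title\n" +
            "\"John Doe\", 1234, \"Chief Cook\"\n" +
            "\"Jane Roe\", 5678, Programmer\n";
        System.out.println(parse(new StringReader(data)));
    }
}
```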

With fixed-sized records like this, you may prefer to read a line at a time and parse each string by searching for commas. If you want to allow empty fields, then you'll need to use the String class search methods to find commas and then calculate substrings. If no empty fields are allowed, then the StringTokenizer class provides a simpler solution. When I create a StringTokenizer object for each input line in Listing 12, I pass it the characters I'm not interested in (a comma and linefeed, in this case — I suppose I should have included other punctuation as well). It collects tokens consisting of all other characters and returns them via the nextToken method. One drawback to this approach is that it does not ignore comments, so I had to remove the leading comment line from the input.
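The StringTokenizer approach boils down to something like the following sketch (the class name SplitLine and the method fields are hypothetical, and this is not Listing 12). The second call in main shows why empty fields are a problem: adjacent delimiters are treated as one, so the empty field silently disappears.

```java
import java.util.*;

class SplitLine
{
    // Splits one line into its comma-separated fields.
    static List<String> fields(String line)
    {
        // The second argument lists the delimiter characters
        StringTokenizer st = new StringTokenizer(line, ",\n");
        List<String> result = new ArrayList<String>();
        while (st.hasMoreTokens())
            result.add(st.nextToken().trim());
        return result;
    }

    public static void main(String[] args)
    {
        // Three fields
        System.out.println(fields("John Doe,1234,Chief Cook\n"));
        // Only two fields; the empty one is lost
        System.out.println(fields("a,,b"));
    }
}
```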

Quite often, though, lines don't matter, as when parsing a Java program. The program in Listing 13 shows how easy it is to extract tokens from a free-form text file. It looks for the class keyword and then reads the next word, interpreting it as a class name. This time the loop inspects the ttype field, which holds TT_WORD if a word was found, TT_NUMBER if a number was found instead (which doesn't apply in this case), TT_EOL if it found the end of a line (which only works if you have previously called eolIsSignificant), and TT_EOF if input was exhausted; otherwise it holds the non-token character that was found.
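A bare-bones version of that search might look like the sketch below. It is an approximation of the idea and not Listing 13; the names FindClasses and classNames are assumptions, and it makes no attempt to handle every corner of Java syntax (an expression such as Foo.class, for example, would need extra care).

```java
import java.io.*;
import java.util.*;

class FindClasses
{
    // Returns each word that immediately follows the keyword "class".
    static List<String> classNames(Reader source)
        throws IOException
    {
        StreamTokenizer in = new StreamTokenizer(source);
        in.slashSlashComments(true);
        in.slashStarComments(true);
        in.wordChars('_', '_');   // allow underscores in identifiers

        List<String> names = new ArrayList<String>();
        boolean sawClass = false;
        int t;
        while ((t = in.nextToken()) != StreamTokenizer.TT_EOF)
        {
            if (t == StreamTokenizer.TT_WORD)
            {
                if (sawClass)
                    names.add(in.sval);
                sawClass = in.sval.equals("class");
            }
            else
                sawClass = false;
        }
        return names;
    }

    public static void main(String[] args)
        throws IOException
    {
        String code =
            "// a class comment\n" +
            "class Foo { int x; }\n" +
            "/* class Hidden */ public class Bar extends Foo {}\n";
        System.out.println(classNames(new StringReader(code)));
    }
}
```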

Summary

Even if you have to type almost to the point of suffering from carpal tunnel syndrome, I'm sure you'll agree that the design of the Java I/O system is Good Work. You have facilities to create a stream that talks to just about anything, with common filtering features such as buffering, counting lines, or literally interpreting program objects. The high-level classes come nicely equipped to function as decorators for the low-level ones, so the number of classes needed is much smaller than it would have been otherwise. And in many of the cases where you have to define your own stream, you can get away with just overriding a single read or write method. In January's article, I'll talk more about files and working with your local file system from Java programs.

Notes

[1] If you entered the programming biz after 1990, you probably don't know that this is an acronym for Basic Assembler Language, which was the lingua franca of IBM mainframes.

[2] This is an application of the Decorator pattern, described in the pioneering patterns book, Design Patterns: Elements of Reusable Object-oriented Software, Gamma et al (Addison-Wesley, 1995). A Decorator enhances an object by adding to and/or modifying its functionality while maintaining its interface, so the new object can act as an instance of the old, just like inheritance does, but with the flexibility of combining layers at run time.

[3] I chose ArrayList instead of Vector, which is my wont, because the former is one of the new collection classes. If you need a refresher on collections and enumerations, see my previous article in the September 2000 issue of this magazine.

[4] If a non-serializable base class has a no-arg constructor, it is called when the derived object is reconstituted, which in this case erroneously increments CarbonUnit.nextID. By making CarbonUnit serializable, the true values are stored and retrieved. Not all base classes need to be Serializable — just the ones that you don't want to be default-initialized when the derived object is built (in which case there must be a no-arg constructor).

[5] Since Name is a non-transient, non-static field of Person, writeObject attempts to serialize it, which results in a NotSerializableException. The standard Date class is also serializable.

[6] How many bytes make up a character depends on the encoding you choose when you create the reader. (There is an optional constructor that takes a second argument for specifying the encoding.)

Chuck Allison is a long-time columnist with CUJ. During the day he does Internet-related development in Java and C++ as a Software Engineering Senior in the Custom Development Department at Novell, Inc. in Provo, Utah. He was a contributing member of the C++ standards committee for most of the 1990s and authored the book C & C++ Code Capsules: A Guide for Practitioners (Prentice-Hall, 1998). He has taught mathematics and computer science at six western colleges and universities and at many corporations throughout the U.S. You can email Chuck at chuck@freshsources.com.