Java offers more formatting power than C, but not necessarily in a more convenient package.
Hello, World ... Goodbye printf()
The first thing you usually do when you learn a new language is experiment with some simple I/O. Java provides System.out, an object of type PrintStream, for output to the console. Here's the proverbial "Hello, World" using System.out:
class Hello {
    public static void main(String[] args) {
        System.out.println("Hello, world");
    }
}

Although this little programette is neither longer nor more complex than its C++ equivalent, the Java plot thickens quickly if you plan on formatting your output. In fact, Java makes you build a separate object to express your format. Consider the following C snippet, for instance:
#include <stdio.h>

int main()
{
    double x = 123.456;
    printf("%8.2f\n", x);
    return 0;
}

/* Output:
  123.46
*/

A rough Java equivalent looks like this:
import java.text.*;

class Decimals {
    public static void main(String[] args) {
        double x = 123.456;
        DecimalFormat fmt = new DecimalFormat("#####.##");
        System.out.println(fmt.format(x));
    }
}

/* Output:
123.46
*/

The java.text package defines the DecimalFormat class, which you use for formatting numbers (both integers and reals). The pound sign acts as a mask character for any decimal digit.
I say that the program above is a "rough equivalent" of the C version because the output isn't right-justified in a field of eight characters, as you would expect. If you want alignment, you have to do it manually. Argh!
Objects for Formatting
As you can see above, formatting output in Java is basically a two-step process: 1) build a string with a format object, and 2) send the string to the desired output stream. In C terms, it's like having to always use sprintf first to build strings before actually shipping them off to your output stream. Some people praise this design for its flexibility, but it can drive C hackers nuts while learning Java.
There are seven Format classes altogether, arranged into the hierarchy depicted in Figure 1. Format, NumberFormat, and DateFormat are abstract classes. The usual technique for getting a Format object is to call a version of getInstance, a factory method provided with each of these abstract classes, as shown in Figure 2. For most locales, NumberFormat.getInstance returns a DecimalFormat object, but in FormatIntegers.java I'm just using the format method, which is declared in NumberFormat and overridden in DecimalFormat. Since I requested a minimum of three integer digits, I get a leading zero when printing the number 10, and since I make the upper limit four digits, I lose the most significant digit of 10,000 in the last output line.
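Figure 2 is not reproduced here, so the following is only a minimal sketch of the digit limits just described, not the original FormatIntegers.java. The class name IntegerDigitsSketch is made up for illustration, and the locale is pinned to Locale.US (an assumption) so that the grouping separator does not depend on the machine's default locale.

```java
import java.text.NumberFormat;
import java.util.Locale;

class IntegerDigitsSketch {
    // Formats a whole number with at least 3 and at most 4 integer digits.
    static String format(long value) {
        // Assumption: Locale.US is pinned so the grouping separator is a comma.
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);
        fmt.setMinimumIntegerDigits(3);  // pads short numbers with leading zeros
        fmt.setMaximumIntegerDigits(4);  // silently drops higher-order digits
        return fmt.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(10));     // 010
        System.out.println(format(1000));   // 1,000
        System.out.println(format(10000));  // 0,000 (the leading 1 is lost)
    }
}
```

Note that the upper limit truncates without any warning, so it is worth setting only when losing high-order digits is acceptable.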
Notice the commas in the last two lines of output in Figure 2. All format classes except ChoiceFormat are sensitive to locale, which is a set of conventions for displaying numbers, currencies, and dates for different languages and cultures. Since the default locale is English/United States, the comma appears as the grouping separator for numerals. (See the next section for more on locales.)
Financial applications usually need dollar signs and negative numbers in parentheses instead of being preceded by a negative sign. The NumberFormat.getCurrencyInstance factory method returns a DecimalFormat object with a predefined pattern for the occasion (see Figure 3). The first half of the program in Figure 4 shows how you can define a DecimalFormat pattern for currency yourself. Zeroes in a pattern string are replaced by the digit 0 when there is no corresponding digit in the input number, while pound signs are ignored [1]. The second half of Figure 4 shows how to express numbers in scientific notation. And as you can see in Figure 5, there is also a predefined formatter for percentages available via NumberFormat.getPercentInstance.
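Since Figures 3 through 5 are not shown here, this sketch illustrates the same three ideas under stated assumptions: a hand-written currency pattern, a scientific-notation pattern, and the predefined percent formatter. The class name PatternSketch and the particular pattern strings are my own choices rather than the ones in the figures, and the symbols are pinned to Locale.US. The negative half of the currency pattern is spelled out in full instead of relying on the shorthand discussed in note [1].

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

class PatternSketch {
    // Assumption: US symbols, so '.' is the decimal point and ',' groups digits.
    private static final DecimalFormatSymbols US = new DecimalFormatSymbols(Locale.US);

    // Positive and negative subpatterns are separated by a semicolon.
    static String currency(double amount) {
        return new DecimalFormat("$#,##0.00;($#,##0.00)", US).format(amount);
    }

    // One integer digit, up to three fraction digits, then the exponent.
    static String scientific(double value) {
        return new DecimalFormat("0.###E0", US).format(value);
    }

    // The predefined percent formatter multiplies by 100 and appends '%'.
    static String percent(double fraction) {
        return NumberFormat.getPercentInstance(Locale.US).format(fraction);
    }

    public static void main(String[] args) {
        System.out.println(currency(1234.5));      // $1,234.50
        System.out.println(currency(-1234.5));     // ($1,234.50)
        System.out.println(scientific(12345.678)); // 1.235E4
        System.out.println(percent(0.25));         // 25%
    }
}
```

The zeroes in "0.00" force two fraction digits even for a whole-dollar amount, while the pound signs only mark where optional digits and the grouping separator may appear.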
I mentioned earlier that numbers are not right justified as they are in C (or every other language in popular use, for that matter!). Sad but true. The program in Figure 6 shows how to do it manually. You need to know the length of the formatted number so you can compute how many spaces to prepend before you print the number itself. That information is available in a FieldPosition object, which an overloaded version of method DecimalFormat.format fills in. First you create a FieldPosition object, giving its constructor the flag indicating the type of quantity you want it to track. (Other types include FRACTION_FIELD and various date components.) The call to format then populates fpos with information about where the formatted field begins and ends in the string it returns. In this case all we need to know is how long the formatted string is, which FieldPosition.getEndIndex tells us. Again, notice the commas in the display.
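Here is a rough sketch of the manual padding technique, written without access to Figure 6. The class and method names (RightJustifySketch, rightJustify) are hypothetical, the locale is pinned to Locale.US, and the code assumes non-negative whole numbers, for which the integer field ends exactly where the formatted string ends.

```java
import java.text.FieldPosition;
import java.text.NumberFormat;
import java.util.Locale;

class RightJustifySketch {
    // Right-justifies a non-negative whole number in a field of the given width.
    static String rightJustify(long value, int width) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);

        // Ask the formatter to track where the integer part lands.
        FieldPosition fpos = new FieldPosition(NumberFormat.INTEGER_FIELD);
        StringBuffer text = fmt.format(value, new StringBuffer(), fpos);

        // Assumption: with no fraction or suffix, the end of the integer
        // field is also the length of the formatted text.
        int pad = Math.max(0, width - fpos.getEndIndex());

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pad; i++) {
            out.append(' ');
        }
        return out.append(text).toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + rightJustify(42, 12) + "]");
        System.out.println("[" + rightJustify(1234567, 12) + "]");
    }
}
```

If the number can carry a fraction, measuring the length of the formatted string directly is probably the simpler choice, since the integer field then stops at the decimal point.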
Locales
Historically, computing has been woefully provincial in favor of the United States. It's bad enough that programming languages use English keywords, but popular computing environments have supported only those character sets that accommodate English (viz. ASCII and EBCDIC), and in some cases other Western European languages. Over time other character sets have been developed, but there has been no universal platform, as it were, fit to handle text formatting and display for all languages and cultures. One of the first efforts to solve this problem resulted in the concept of locales.
Locales originated with ANSI C as a means to provide program support for localizing software for different geopolitical regions. A locale in Standard C is a collection of preferences for the processing and display of information that is sensitive to culture, language, or national origin, such as date and monetary formats. There are five categories of information, named by macros defined in <locale.h>, on which locales have an effect (see Table 1). Each of these categories can be set to a different locale (e.g., "american" or "italian").
Standard C defines the following two functions to deal with locales directly:
struct lconv *localeconv(void);
char *setlocale(int category, const char *locale);

localeconv returns a static lconv object containing settings for the LC_MONETARY and LC_NUMERIC categories, and setlocale changes the locale for the given category to that specified in locale. You can set all categories to the given locale by specifying a category of LC_ALL, as I do in Figure 7. All Standard C implementations must support the minimalist "C" locale, and a native locale named by the empty string (which may be the same as the "C" locale). Unfortunately, few U.S. vendors of C/C++ compilers provide any additional locale support, even today, eleven years after C was originally standardized.
Java, on the other hand, comes with support for 145 locales out of the box (i.e., with the Sun JDK). The program in Figure 8 displays selected locales. The static method Locale.getAvailableLocales returns an array of objects representing the locales supported by your installation of the Java library. (The Locale class is defined in the package java.util.) As you can see from the results of Locale.getDisplayName, a locale represents a language-country pair, since some languages are spoken in many countries, and yet the cultural conventions that govern display of currency and date values can differ among those countries.
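As a stand-in for Figure 8, which is not reproduced here, the following sketch lists the installed locales that carry a country code. The class name LocaleListSketch is hypothetical, and asking for the display names in English is my own choice to keep the listing readable on any machine. The number of locales you see depends on your Java installation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class LocaleListSketch {
    // Returns the English display names of all installed language-country locales.
    static List<String> displayNames() {
        List<String> names = new ArrayList<>();
        for (Locale loc : Locale.getAvailableLocales()) {
            // Skip language-only entries so each name is a language-country pair.
            if (!loc.getCountry().isEmpty()) {
                names.add(loc.getDisplayName(Locale.ENGLISH));
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        List<String> names = displayNames();
        System.out.println(names.size() + " language-country locales installed");
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```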
The program in Figure 9 illustrates how three different locales format numeric and date component text. Note that I need to specify both the language and country when creating a Locale object. Languages are represented by two-character codes in lower case, and countries by codes in upper case [2]. This program also illustrates an overloaded version of getXXXInstance that associates a formatter with a specific locale. DateFormat.getDateInstance takes an additional first parameter specifying how much formatting you want. The choices, in decreasing order of detail, are FULL, LONG, MEDIUM, and SHORT.
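The sketch below shows the locale-taking factory methods in action. It is not the program from Figure 9: the class name LocaleFormatSketch, the sample number, and the sample date are all assumptions made for illustration. It uses the predefined constants Locale.US, Locale.GERMANY, and Locale.FRANCE, which denote the same locales you would get by passing the language and country codes to the Locale constructor. Date output varies between Java releases, so no expected date strings are shown.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class LocaleFormatSketch {
    // Formats a number with the conventions of the given locale.
    static String number(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    // Formats a date in the LONG style of the given locale.
    static String longDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.FRANCE };
        Date sample = new GregorianCalendar(2000, Calendar.JULY, 4).getTime();

        for (Locale loc : locales) {
            System.out.println(loc.getDisplayName(Locale.ENGLISH));
            System.out.println("  " + number(1234567.891, loc));
            System.out.println("  " + longDate(sample, loc));
        }
    }
}
```

With the US locale the number comes out as 1,234,567.891, and with the German locale as 1.234.567,891; the grouping and decimal separators trade places.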
Parsing
All the examples so far have dealt only with formatted output, but all Format classes also support a parse method for reading input according to the same conventions we've been discussing for output. The following program shows how to read numbers in two different locales:
import java.util.*;
import java.text.*;

class ParseInput {
    public static void main(String[] args) throws ParseException {
        NumberFormat fmt1 = NumberFormat.getInstance();
        System.out.println(fmt1.parse("1,234.56"));
        NumberFormat fmt2 = NumberFormat.getInstance(Locale.GERMANY);
        System.out.println(fmt2.parse("1.234,56"));
    }
}

/* Output:
1234.56
1234.56
*/

The parse method will throw a ParseException if the input string is not valid. For convenience, the Locale class defines several static finals for commonly used locales, such as Locale.GERMANY above.
The program in Figure 10 parses dates. Recall that SimpleDateFormat is the concrete class that extends the abstract DateFormat class. The abstract classes do not support patterns, so I have to cast to SimpleDateFormat to use the toPattern and applyPattern methods. As you can see, the default pattern for dates is "MMM d, yyyy". Three M's result in an abbreviated month string, such as "Jul". Anything less than that resolves to the numeral (e.g., 7 for July), and four M's or more gives the full name ("July"). As always, the input must match the expected pattern if you don't want a ParseException.
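To make the pattern rules concrete, here is a small sketch (not the program from Figure 10) that casts a DateFormat to SimpleDateFormat, parses a date with the "MMM d, yyyy" pattern, and prints it back with a different pattern. The class name DateReformatSketch and the method reformat are hypothetical. The pattern is applied explicitly rather than relying on the default, because the default pattern can differ between Java releases, and the locale is pinned to Locale.US.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class DateReformatSketch {
    // Parses input as "MMM d, yyyy" and re-renders it with outPattern.
    static String reformat(String input, String outPattern) {
        DateFormat df = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.US);
        // Assumption: the factory hands back a SimpleDateFormat, as it
        // usually does; the abstract class has no pattern methods.
        if (!(df instanceof SimpleDateFormat)) {
            throw new IllegalStateException("expected a SimpleDateFormat");
        }
        SimpleDateFormat sdf = (SimpleDateFormat) df;
        sdf.applyPattern("MMM d, yyyy");
        try {
            Date date = sdf.parse(input);
            sdf.applyPattern(outPattern);
            return sdf.format(date);
        } catch (ParseException e) {
            // The input did not match the expected pattern.
            throw new IllegalArgumentException("bad date: " + input, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformat("Jul 4, 2000", "MMMM d, yyyy")); // July 4, 2000
        System.out.println(reformat("Jul 4, 2000", "M/d/yyyy"));     // 7/4/2000
    }
}
```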
Parsing more than one number from a single string will certainly leave you longing for C's sscanf function. As the program in Figure 11 illustrates, you have to keep track of where you are in the input string with a ParsePosition object. When you call the overloaded version of parse that takes a ParsePosition as a second argument, that ParsePosition is updated with the index of the next unread character in the input string. Unfortunately, you can't just turn around and call parse again, because it doesn't skip white space. You have to know the structure of the input string and manually skip the characters you want to ignore, like I did here by incrementing the parse index. Double argh!
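The following sketch shows one way to walk a string of white-space-separated numbers with a ParsePosition. It is not the code from Figure 11; the class name MultiParseSketch, the method parseAll, and the choice to skip any run of white space (rather than a single known separator) are assumptions of mine, and the locale is pinned to Locale.US.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class MultiParseSketch {
    // Reads every white-space-separated number from the input string.
    static List<Double> parseAll(String input) {
        NumberFormat fmt = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        List<Double> values = new ArrayList<>();

        while (true) {
            // parse does not skip white space, so do it by hand.
            int i = pos.getIndex();
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i >= input.length()) {
                break;  // nothing left to read
            }
            pos.setIndex(i);

            // On success, pos now points just past the number that was read.
            Number n = fmt.parse(input, pos);
            if (n == null) {
                break;  // the text at this position is not a number
            }
            values.add(n.doubleValue());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parseAll("10 20.5 1,234"));  // [10.0, 20.5, 1234.0]
    }
}
```

Unlike the one-argument parse, this overload reports failure by returning null instead of throwing a ParseException, which is why the loop checks the result.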
Summary
I really don't mind the separation of formatting from I/O in Java. It makes sense conceptually. The design of the Format classes is clean and flexible, but if you're coming from C, you might not find them convenient. Most of all, I miss C's flexibility of choosing right vs. left alignment in output formats. I'd be interested in your opinion. It was necessary to discuss locales since the formatting classes use them, but this was mainly an article on formatting, so I omitted discussion of internationalization issues other than locales. If you have an immediate need to localize your software, take a look in the JDK documentation for the class descriptions of ResourceBundle and MessageFormat, which facilitate run-time substitution of strings for programming in multiple languages.
Notes
[1] The pattern for negative numbers borrows everything from the pattern for non-negatives; only prefixes and suffixes need be specified for negatives. You're supposed to be able to specify these directly in the pattern, as follows: "#,###.00;(#)". Following the rule just explained, only the prefix and suffix are inferred from the pattern after the semicolon, so only a single '#' is required. Unfortunately, this is not correctly implemented as of JDK 1.3 Release Candidate 3, the latest JDK download as of this writing. (It drops the suffix.)
[2] These are standard ISO codes found at http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt and http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html, respectively.
Chuck Allison is a columnist with CUJ. He is the owner of Fresh Sources, a company specializing in object-oriented software development, training, and mentoring. He has been a contributing member of J16, the C++ Standards Committee, since 1991, and is the author of C and C++ Code Capsules: A Guide for Practitioners, Prentice-Hall, 1998. You can email Chuck at chuck@freshsources.com.