Perl Practicum: Fun With Formats

by Hal Pomeranz

Before Perl became a general purpose programming language, it was PERL: the Practical Extraction and Report Language. You can find the evolutionary remains of Perl's humble beginnings hidden away in dark corners of the language. Formats, for example, have a syntax unlike any other Perl construct, and most of what they do can be emulated with other routines (notably printf()). For these and other reasons, most people first learning Perl seem to skip over information about formats, but if you write any reasonable number of scripts to produce reports from long files of data, formats can be a valuable tool.

Simple Reporting

One of the first useful Perl applications I wrote was a little program to balance my checkbook: the application reads in a file of data containing all of the transactions I have made to date, and prints a nicely formatted statement with a running balance. I originally wrote the output portion using printf() statements, but when I gave the code to Tom Limoncelli, he sent it back to me with all of the printf() statements replaced with format code. Darn it, his version was nicer (but my checkbook was balanced first).

I wanted to make the data file as easy to type as possible, so the format is very simple. The first line of the input file is the starting balance, in pennies (no need to type a decimal point and no floating point arithmetic). Each of the following lines represents a transaction: four tab-separated fields giving the check number or transaction code, the date, a description, and the amount (again in pennies). Deposits and other credits to the account are represented as negative values (I seem to put money into my accounts much less frequently than I take it out). Here is a simple program to read this input file and generate a statement of the account:

     format STDOUT =
     @<<<<< @>>>> @<<<<<<<<<<<<<<<<<<< $@######.## $@######.##
     $code, $date,$descript,           $amt,       $balance
     .

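     open(INP, "transactions") || die "Can't read transactions file: $!\n";
     chomp($penny_balance = <INP>);     # first line: starting balance in pennies
     while (<INP>) {
          chomp;
          ($code, $date, $descript, $penny_amt) = split(/\t/);
          $penny_balance -= $penny_amt;      # credits are negative amounts
          $amt = $penny_amt / 100;           # convert pennies to dollars for display
          $balance = $penny_balance / 100;
          write;                             # emit one line using the STDOUT format
     }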
     close(INP);

     format top =
     Trans: Date: Description:         Amount:     Balance:
     ====== ===== ============         =======     ========
     .

The first four lines in the example are a format declaration. The first line defines the format's name. When the write() function is called to print a line of formatted data, it uses the format named for the currently selected file handle. In our example, the program is sending the report to the standard output. Note that if no format name is specified, STDOUT is assumed, but it is always better to name formats explicitly, even when you are using STDOUT.

The second line is a picture of how each output line will look. Each group of characters beginning with an @ is an output field specifier - everything else is a literal (e.g., the $ signs at the beginning of the two money fields). Less-than (<) signs mean that the field should be left justified, and greater-than (>) signs mean right justified; the pipe symbol (|) specifies centered fields. Numeric fields are indicated with hash marks (#) and an optional decimal point. The field width is the number of special characters, INCLUDING the @ sign (in the example above, the first field is six characters wide, the second is five, etc.). This enables the picture to resemble a somewhat abstract but perfectly aligned example of the output.
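As a small illustration of the field types just described, here is a sketch with one field of each kind. The format name FIELDS and the variable names are made up for this example, and it assumes Perl 5.8 or later so that the output can be captured in a string through an in-memory file handle; neither is part of the checkbook program.

```perl
use strict;
use warnings;

# Package globals, since the format below refers to them by name.
our ($left, $right, $center, $number);

# One field of each kind, all eight characters wide (the @ counts).
# The square brackets are literals that make the field edges visible.
format FIELDS =
[@<<<<<<<] [@>>>>>>>] [@|||||||] [@####.##]
$left,     $right,    $center,   $number
.

# Assumption: Perl 5.8 or later, so a file handle can be opened on a
# scalar and the formatted line captured in $output.
my $output = '';
open(FIELDS, '>', \$output) || die "Can't open in-memory file: $!\n";

($left, $right, $center, $number) = ('Perl', 'Perl', 'Perl', 3.14159);
write(FIELDS);
close(FIELDS);

print $output;
```

This should print `[Perl    ] [    Perl] [  Perl  ] [    3.14]`: the same four-character string left justified, right justified, and centered in an eight-character field, followed by the number rounded to two decimal places.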

The third line of the format associates a variable with each field. When the write() function is called, the current value of each of the named variables is printed using the specified format. The format is easier to read if you line up the variable names with their associated field specifications on the line above.

The last line of a format declaration is always a dot on a line by itself. This terminates the format declaration.

Format declarations can appear anywhere in the program. The example above contains two format declarations: one before the code and one after. This was done to make the point; in your own code, I recommend you group all formats together near the top of the script. If there are multiple formats with the same name in the program, the one defined last will be the one that gets used.

If a format with the special name top is defined in the program, this format will be printed at the beginning of each page of formatted output. The special variable $= defines the number of lines per page; 60 is the default, but you can assign a smaller number if you like (for example, when printing to a terminal or small window). The special variable $- gives the number of lines left on the current page. You can force a new page by setting $- to 0. However: DO NOT mix print() and printf() statements with write() or else the $- variable will not be decremented correctly.
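The paging behaviour can be seen with a deliberately short page. The following is only a sketch under a few assumptions: the handle name REPORT and the data are invented, the output goes to an in-memory file handle (Perl 5.8 or later) so it can be inspected, and the header uses the handle-specific name REPORT_TOP, which Perl 5 documents as the default top-of-page format name for a handle called REPORT.

```perl
use strict;
use warnings;

our ($item, $count);

# Header printed at the top of every page (two lines).
format REPORT_TOP =
Item       Count
========== =====
.

# Body: one line per record.
format REPORT =
@<<<<<<<<< @>>>>
$item,     $count
.

my $output = '';
open(REPORT, '>', \$output) || die "Can't open in-memory file: $!\n";

my $old = select(REPORT);   # $= and $% apply to the selected handle
$= = 5;                     # 5 lines per page: 2 header + 3 records

foreach my $n (1 .. 7) {
    ($item, $count) = ("widget$n", $n * 10);
    write;
}

my $pages = $%;             # current page number after the last write
select($old);
close(REPORT);

print $output;
print "Pages used: $pages\n";
```

With seven records and room for three per page, the report runs to three pages. Perl emits a form feed before the second and third headers, so the captured output contains two form feed characters.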

Dirty Tricks

While you can define a special top format for page headers, there is no way to define a format for page footers. There is, however, a trick for dealing with this situation. While write() usually uses the format named for the file handle that the output is going to, you can use a different format by assigning the alternate format's name to the special $~ variable. The trick then, is to keep track of the number of lines left on the page and emit a special footer format at the bottom of the page. Here is the program logic for doing this:

     format top =
     Trans: Date: Description:         Amount:     Balance:
     ====== ===== ============         =======     ========
     .
     format STDOUT =
     @<<<<< @>>>> @<<<<<<<<<<<<<<<<<<< $@######.## $@######.##
     $code, $date,$descript,           $amt,       $balance
     .

     format footer =

     Page @###
          $%
     .

     $footer_depth = 2;

     open(INP, "transactions") || die "Can't read transactions file: $!\n";
     chomp($penny_balance = <INP>);     # first line: starting balance in pennies
     while (<INP>) {
          chomp;
          ($code, $date, $descript, $penny_amt) = split(/\t/);
          $penny_balance -= $penny_amt;      # credits are negative amounts
          $amt = $penny_amt / 100;           # convert pennies to dollars for display
          $balance = $penny_balance / 100;
          write;
          if ($- == $footer_depth) {
               $~ = "footer";
               write;
               $~ = "STDOUT";
          }
     }
     close(INP);

First we introduce a new footer format and a new global constant, $footer_depth, which is the number of lines that the footer occupies on the page. The footer format in our example uses yet another special variable, $%, which gives the current page number (numbered starting with 1).

Each time we emit a line with write(), we check $- for the number of lines remaining on the page. When we have exactly $footer_depth lines left, it is time to write the page footer. To write the footer, we simply set $~ to the name of the footer format (footer, in this example), issue a write(), and then reset $~ to the usual format (STDOUT) before getting the next line from the transaction file. This line will appear on the next page after the header in the usual fashion.

While this method works very cleanly when each write() statement outputs only a single line, anticipating the end of page when using multi-line formats can get tricky. Also notice that no footer will be output on the last page. Additional code would have to be added after the while() loop to output additional blank lines and the footer. This is left as an exercise to the reader.
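One possible approach to that exercise is sketched below. It is not the original checkbook code: the handle REPORT, the formats BLANK and FOOTER, the helper write_footer(), and the sample records are all invented for illustration, and the output goes to an in-memory file handle (Perl 5.8 or later). The idea is to pad the last page with a one-line blank format, so that $- keeps counting correctly, until only the footer's lines remain.

```perl
use strict;
use warnings;

our $line;
my $footer_depth = 2;       # lines occupied by the FOOTER format

format REPORT_TOP =
Report
======
.

format REPORT =
@<<<<<<<<<<<<<<
$line
.

# A single empty line, used only to pad out the last page.
format BLANK =

.

format FOOTER =

Page @###
     $%
.

my $output = '';
open(REPORT, '>', \$output) || die "Can't open in-memory file: $!\n";
my $old = select(REPORT);   # $=, $-, $~ and $% refer to the selected handle
$= = 8;                     # short page: 2 header + 4 records + 2 footer lines

# Hypothetical helper: emit the footer, then restore the body format.
sub write_footer {
    $~ = "FOOTER";
    write;
    $~ = "REPORT";
}

foreach my $n (1 .. 6) {
    $line = "record $n";
    write;
    write_footer() if $- == $footer_depth;
}

# Last page: if it is only partly filled, pad it and add the footer.
if ($- > 0) {
    $~ = "BLANK";
    write() while $- > $footer_depth;
    write_footer();
}

select($old);
close(REPORT);
print $output;
```

Because the padding is done with write() rather than print(), Perl keeps $- up to date and the final footer lands on the last two lines of the page. If the loop ends exactly at a page boundary, $- is 0 and no extra footer is written.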

If you ever want to change header formats for any reason - for example if you wanted a large header on the first page, but only minimal headers on the other pages - you can use the special $^ variable. This variable behaves like $~, but selects the header format instead. Never set $^ (or $~ for that matter) to a non-existent format because this will cause your program to exit with a fatal error at run time. If you want a null header, never define the top format at all, or set $^ to an empty format.
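Here is a minimal sketch of the large-header-then-small-header idea. The format names BIG_TOP and SMALL_TOP, the handle REPORT, and the records are assumptions made for this example, and the report is written to an in-memory file handle (Perl 5.8 or later) so that the result can be examined.

```perl
use strict;
use warnings;

our $line;

# Large header, intended for the first page only.
format BIG_TOP =
==============================
      Monthly Statement
==============================
.

# Minimal header for every later page.
format SMALL_TOP =
(continued)
.

format REPORT =
@<<<<<<<<<<<<<<
$line
.

my $output = '';
open(REPORT, '>', \$output) || die "Can't open in-memory file: $!\n";
my $old = select(REPORT);   # $= and $^ refer to the selected handle
$= = 6;                     # short pages so the example spans two of them
$^ = "BIG_TOP";             # header format for the first page

foreach my $n (1 .. 7) {
    $line = "record $n";
    write;
    $^ = "SMALL_TOP";       # any page started from now on gets the short header
}

select($old);
close(REPORT);
print $output;
```

The first write() starts page one with the big header. Because $^ is changed immediately afterwards, the page break that occurs at the fourth record picks up the one-line header instead.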

Multi-Line Formats

Consider a couple of important facts about the top format in the two examples. First, there are no field definitions anywhere in the format declaration. It is perfectly legal to have a format with no field declarations, though in practice you will probably only do this for header formats.

Second, the format declaration defines multiple lines of output. This also is perfectly legal and each line can have zero, one, or more field declarations in it. The general pattern for multi-line format declarations is one line of field descriptions, followed by a line containing the variables associated with those fields, followed by another line of field descriptions, etc.
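The alternating pattern of picture lines and variable lines might look like the following sketch, which prints a small address label per record. The LABEL format, the field names, and the two made-up addresses are purely illustrative, and the output is captured through an in-memory file handle (Perl 5.8 or later).

```perl
use strict;
use warnings;

our ($name, $street, $city, $zip);

# Three picture lines, each followed by its variable line, then a
# literal separator line that has no fields and so no variable line.
format LABEL =
Name:   @<<<<<<<<<<<<<<<<<<<
        $name
Street: @<<<<<<<<<<<<<<<<<<<
        $street
City:   @<<<<<<<<<<<<< @<<<<
        $city,         $zip
----------------------------
.

# Made-up sample records.
my @people = (
    ['Ada Lovelace', '12 Analytical Way', 'London',        'W1'],
    ['Alan Turing',  '7 Bletchley Road',  'Milton Keynes', 'MK3'],
);

my $output = '';
open(LABEL, '>', \$output) || die "Can't open in-memory file: $!\n";
foreach my $person (@people) {
    ($name, $street, $city, $zip) = @$person;
    write(LABEL);           # each write() emits all four lines of the format
}
close(LABEL);

print $output;
```

Each call to write() produces the whole four-line block, so two records yield eight lines of output.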

The next example shows an interesting use of multi-line formats. For purposes of this example program, we are assuming a function called mailparse() which processes email messages one at a time from the standard input. For each message, mailparse() puts all header information in a global associative array, %header, indexed by the header tag (e.g., From, To) and all of the body lines in a global scalar variable called $body. The output is shown below the example. By the way, my editor never sent me that message: I made it up. Like all writers, I am always early for all deadlines. Well, that last part was a lie, but I really did make up the email message.

     format message =
     Date: @<<<<<<<<<<<<<<<<<<<<<<<<<   ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
           $header{'Date'},             $body
     From: @<<<<<<<<<<<<<<<<<<<<<<<<<   ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
           $header{'From'},             $body
     To  : @<<<<<<<<<<<<<<<<<<<<<<<<<   ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
           $header{'To'},               $body
     Subj: ^<<<<<<<<<<<<<<<<<<<<<<<<<   ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
           $header{'Subject'},          $body
     ~~    ^<<<<<<<<<<<<<<<<<<<<<<<<<   ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
           $header{'Subject'},          $body
     .

     $~ = "message";
     while (<STDIN>) {
          &mailparse();
          write;
     }


     Date: Tue, 11 Apr 1995 16:39:06    Hal-- What's the status of your Perl
     From: tmd@iwi.com (Tina M. Darmo   article for the upcoming issue of ;login:?
     To  : hal@netmarket.com (Hal Pom   Rob needs to review the article before
     Subj: Your ;login: article is      giving it to Carolyn for typesetting.
           *OVERDUE*                    Please send email soon-- the fate of the
                                        universe is at stake. --Tina

There are a number of new constructs in the message format in this example. First are the fields that begin with ^ instead of @. For these fields, Perl outputs as much text as will fit in the field and then removes that text from the string variable. By stacking several ^ fields together using the same long string, you can output that string as a block of text with a ragged right margin, as shown in the output, with both the body of the message and the Subj: line. The special $: variable (last special variable in this column, I promise) is the set of characters on which Perl can legally break the line; the default value for $: is " \n-" (space, newline, or hyphen).

The special ~~ marker on the last line means "keep outputting lines until all variables ($body and $header{Subject} in this case) are exhausted." This is useful for situations where you are not sure how long your text may run, but you want to be able to output all of the information. You can put the ~~ anywhere on the line, but it is best to put it in a very visible location (the beginning of the line is almost always best).
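The combination of a ^ field and the ~~ marker can be tried in isolation with a sketch like this one. The PARA format and the sample sentence are invented for the example, and the output is captured through an in-memory file handle (Perl 5.8 or later). Note that the text is copied into $text first, because the ^ field eats the variable it prints.

```perl
use strict;
use warnings;

our $text;

# The ~~ marker repeats this line until $text has been used up.
# The ^ field is 20 characters wide (the ^ counts toward the width).
format PARA =
~~ ^<<<<<<<<<<<<<<<<<<<
   $text
.

my $output = '';
open(PARA, '>', \$output) || die "Can't open in-memory file: $!\n";

my $original = "The quick brown fox jumps over the lazy dog";
$text = $original;      # work on a copy: the ^ field consumes its variable
write(PARA);
close(PARA);

print $output;
print "Left in \$text: '$text'\n";
```

A single write() produces as many lines as the text needs, each broken at a space so that no line exceeds the field width. Afterwards $text is empty while $original is untouched.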

Conclusion

I have run across many Perl programs with complex printf() blocks that would have been much easier to write and much more readable if the developer had used formats instead. If you need to quickly produce reports, or output large amounts of tabulated data, formats are an extremely effective tool.

Reproduced from ;login: Vol. 20 No. 3, June 1995.

