Perl Practicum: It Slices, It Dices...

by Hal Pomeranz

Splitting Data

A common task that seems to have been generating a lot of questions on comp.lang.perl recently is how to split input records of data in order to extract the data fields. The first impulse is to head straight for the split() routine, but "there is always more than one way to do it," and split() is not always your best choice.

For example, split() does not deal gracefully with data in fixed-width fields. Sometimes you can split() on whitespace, but suppose one or more of the fields contain whitespace (perhaps a "full name" field), or suppose you would like to preserve the alignment of the data? You could use substr(), but that only allows you to pull out one field at a time, and you typically have to remove any trailing spaces yourself. Consider using unpack() when faced with fixed-width data: it gracefully solves all of these problems.

Fixed-Width Data

As an example, here is an "ls -n" equivalent (same as "ls -lg" but with numeric user and group IDs instead of names) that uses pack() and unpack() to manipulate the output of a BSD-style ls:
     $template = "a14 A9 A9 a*";
     open(LS, "ls -lg |") || die "Can't ls!\n";
     while (<LS>)
     {
          ($first, $uid, $gid, $last) = unpack($template, $_);
          $uids{$uid} = (getpwnam($uid))[2] unless ($uids{$uid});
          $gids{$gid} = (getgrnam($gid))[2] unless ($gids{$gid});
          print pack($template, $first, $uids{$uid}, $gids{$gid}, $last);
     }
There is actually a subtle bug in the above program. A completely pointless prize will be awarded to the first person who correctly identifies the bug to me.

The first argument to unpack() is a template describing each field and how wide the field is. Whitespace in the template is for readability only - it is strictly ignored by unpack(). The first "a14" in the template means the first field is a string of ASCII which is 14 characters long (in the ls output, this pulls off the mode bits and link count information). This is followed by two ASCII strings which are 9 characters long (the owner and group of the file), but the upper-case "A" also means strip off any trailing whitespace (so we can feed the result to the appropriate get*nam() function). The final "a*" means just pull off everything else on the line into the last field.

Notice that we can use the same template when we put the line back together with pack(). Perl's interpretation of "a" and "A" in pack() templates has been specifically designed to make this possible. The numeric value after each operator in the template gives the field width: "a" pads the field with nulls, and "A" pads with spaces. A "*" instead of a number means make the field exactly as long as the data supplied.
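
As a minimal sketch of these rules, the following uses a made-up record layout (a 10-character name, a 5-character code, then free text); the field names and values are assumptions chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical layout: a 10-character name, a 5-character code, then free text.
my $template = "A10 a5 a*";
my $record   = "Hal" . (" " x 7) . "x409 " . "trailing text";

# unpack(): "A" strips trailing spaces, "a" returns the field untouched.
my ($name, $code, $rest) = unpack($template, $record);
print "name=[$name] code=[$code] rest=[$rest]\n";

# pack(): "A" pads short values with spaces, "a" pads with nulls.
my $packed = pack("A10 a5", "Hal", "x409");
(my $visible = $packed) =~ s/\0/\\0/g;    # make the null padding printable
print "packed=[$visible]\n";

# The same template reassembles the original record.
my $rebuilt = pack($template, $name, $code, $rest);
print "round trip ", ($rebuilt eq $record ? "ok" : "differs"), "\n";
```

Because the "a" field keeps its trailing space on the way in, nothing is lost when the record is packed again.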

Irregular Data...

The split() function may also not be your best choice if your fields are very irregular. For example, the previous Perl Practicum showed regular expressions to match fields in the following data record:
     Pomeranz, Hal   (pomeranz) 	    x409
Recall that the last two fields are optional, that the whitespace was sometimes literal spaces, sometimes tabs, and sometimes a mixture of the two, and that there tended to be trailing whitespace. I needed to lose the comma, the "x" before the phone extension, and the parentheses around the email address.

I could have used split() to pull the line apart (the example below also illustrates that the first argument to split() is a fully-fledged regular expression):

     @fields = split(/[\s,()]+/);
though I still would have had the leading "x" in the extension field (it cannot go in the list of delimiters since the other fields might contain an "x"). By default split() discards null fields at the end of the list, but to be safe with trailing whitespace (for instance, if a negative LIMIT argument is supplied) I could first eliminate it with
     s/\s+$//;
Furthermore, what happens when split() only returns a list of three values: is the last value an email address or a phone extension? One could examine the field to see if it matches /x\d{3}/, but it would be nice to be able to say
     ($last, $first, $email, $ext) = some_expression
and have $email or $ext be null if there is no such information on the line.
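
A small sketch of the ambiguity follows. The second sample record is an assumption (the same entry with the email address left out), added only to show the three-value case:

```perl
use strict;
use warnings;

# Sample records modeled on the one above; the second has no email address.
my @records = (
    "Pomeranz, Hal   (pomeranz) \t    x409",
    "Pomeranz, Hal   x409",
);

my @counts;
for my $record (@records) {
    my @fields = split(/[\s,()]+/, $record);
    push @counts, scalar @fields;
    print scalar(@fields), " fields: ", join("|", @fields), "\n";
}
```

With only three values returned, nothing but the content of the third one says whether it is an email address or an extension.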

...And the Pattern Match

The pattern match operator, when in a list context, returns a list containing the values matching "sub-expressions" in the pattern. A sub-expression is anything in the pattern enclosed by parentheses; sub-expressions are returned in the order determined by the opening (left) parenthesis of each expression, reading from left to right. For example, consider the following expression to extract the time from an ASCII date string:
     $_ = "Wed Apr 20 20:39:34 PDT 1994";
     @fields = / ((\d+):(\d+):(\d+)) /;
There are four sub-expressions in the above pattern match: one sub-expression enclosing three others. The opening parenthesis for the larger sub-expression is left-most, followed by the three smaller expressions in order. So, $fields[0] will be "20:39:34" and the next three elements of the list will be set to "20", "39", and "34".
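
The same example can be run as a short script. This is only a sketch; the variable names $time, $hour, $min, and $sec and the non-matching test string are assumptions added for illustration:

```perl
use strict;
use warnings;

my $date = "Wed Apr 20 20:39:34 PDT 1994";

# List context: one element per sub-expression, ordered by opening parenthesis.
my @fields = $date =~ / ((\d+):(\d+):(\d+)) /;
print join(", ", @fields), "\n";

# The usual idiom names the pieces directly.
my ($time, $hour, $min, $sec) = @fields;

# A failed match returns an empty list, so the test doubles as validation.
my @none = "no time here" =~ / ((\d+):(\d+):(\d+)) /;
print "no match\n" unless @none;
```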

This behavior in a list context makes pattern match a very flexible split operator. It is worth mentioning here that if you assign a pattern match expression to a list, Perl 4 does not set the special $1, $2,..., $9 variables (Perl 5 does set them).

Taking the regular expression developed in the previous Perl Practicum and assigning it to a list yields:

     ($last,$first,$junk1,$email,$junk2,$ext) = /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/;
The junk fields are necessary because we had to enclose the optional expressions (which match the email and extension fields) in parentheses. Larry Wall is working on a regular expression grouping operator which will not generate sub-expressions, but we will probably have to wait for a later release of Perl5.
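
Released versions of Perl 5 do provide such a grouping operator, written (?:...). The following is a sketch of how the expression might look with it; parse_record() is a hypothetical helper name, and the second test record is an assumption added to show a missing field:

```perl
use strict;
use warnings;

# parse_record() is a hypothetical helper name used only for this sketch.
sub parse_record {
    my ($line) = @_;
    # (?:...) groups without capturing, so no junk variables are needed.
    return $line =~ /^(.+),\s+(\w+)(?:\s+\((\w+)\))?(?:\s+x(\d+))?\s*$/;
}

my ($last, $first, $email, $ext) =
    parse_record("Pomeranz, Hal   (pomeranz) \t    x409  ");
print "$first $last <$email> ext. $ext\n";

# With the email address missing, $email2 is undefined but $ext2 is still set.
my ($last2, $first2, $email2, $ext2) = parse_record("Pomeranz, Hal   x409");
print((defined($email2) ? "email: $email2" : "no email"), ", ext. $ext2\n");
```

An optional group that does not take part in the match yields an undefined value in its slot, which gives exactly the "null if there is no such information" behavior asked for earlier.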

Quotes

A very difficult splitting problem is the breaking up of records that have quoted fields which enclose the delimiter character(s). Suppose I had records like this:
     "Pomeranz, Hal", Support, "Saratoga, CA, USA"
with some fields quoted and some not. Although Perl regular expressions are more powerful than regular expressions in the strict mathematical sense, they still cannot be used to solve the general problem of matching opening and closing delimiters (like parentheses or braces), particularly if the delimiter is a multi-character string or if you have nested delimiters. A one-line expression to match C-style comments has become the holy grail of comp.lang.perl and is believed to be in the class of problems that includes trisecting an angle with only a compass and straight-edge.

One option is to simply use split() to break each record up (using the chosen delimiter) and then reconstruct quoted fields after the fact. Of course, you would have to preserve the delimiters if you take this approach. Luckily, split() allows you to do this by using parentheses in the first argument to create a sub-expression:

     @list = split(/(,\s+)/);
Assuming the data line above is in $_, $list[0] will be `"Pomeranz', $list[1] will be `, ' (the delimiter itself), $list[2] will be `Hal"', etc. Examine the elements of @list for leading and trailing double quotes to reassemble the fields.
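
One way the reassembly step might look is sketched below. It assumes every quoted field opens and closes with a double quote at a piece boundary and that there are no stray or nested quotes; the variable names are illustrative:

```perl
use strict;
use warnings;

my $line = '"Pomeranz, Hal", Support, "Saratoga, CA, USA"';
my @list = split(/(,\s+)/, $line);    # parentheses keep the delimiters

my (@fields, $partial);
for my $piece (@list) {
    if (defined $partial) {
        # Inside a quoted field: keep everything, delimiters included.
        $partial .= $piece;
        if ($piece =~ /"$/) {
            push @fields, $partial;
            undef $partial;
        }
    }
    elsif ($piece =~ /^,\s+$/) {
        next;                         # a delimiter between complete fields
    }
    elsif ($piece =~ /^"/ && $piece !~ /"$/) {
        $partial = $piece;            # opening half of a quoted field
    }
    else {
        push @fields, $piece;         # an ordinary, complete field
    }
}
s/^"(.*)"$/$1/ for @fields;           # drop the enclosing quotes

print join(" | ", @fields), "\n";
```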

Another approach would be to try to create an expression that matches the individual fields:

     @fields = /("[^"]+"|[^,]+),\s+("[^"]+"|[^,]+),\s+("[^"]+"|[^,]+)/;
This expression requires each field to be either a double-quoted string (a double quote, followed by one or more non-quote characters, followed by a double quote) or one or more non-comma characters. This tactic will work as long as the records contain no nested quotes. However, both the split() tactic and the regular expression above will fail on records like:
     This ", would be" nasty
The general solution to this problem requires a small function. The Perl distribution includes shellwords.pl, which contains a function to parse lines of space-delimited, optionally quoted fields. I have written a modified version of this library, quotewords.pl, which accepts any regular expression as a delimiter. You can obtain quotewords.pl from one of the Perl archives, or directly from me via email.
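
In Perl 5 distributions, a quotewords() function with this kind of interface ships in the standard Text::ParseWords module. The sketch below uses that module; whether it behaves identically to the quotewords.pl described here is an assumption:

```perl
use strict;
use warnings;
use Text::ParseWords;    # exports quotewords() by default

my $line = '"Pomeranz, Hal", Support, "Saratoga, CA, USA"';

# Arguments: delimiter pattern, a "keep quotes" flag, and the line(s) to parse.
my @fields = quotewords(',\s+', 0, $line);
print scalar(@fields), " fields: ", join(" | ", @fields), "\n";

# The delimiter inside the quotes is not treated as a field separator here.
my @nasty = quotewords(',\s+', 0, 'This ", would be" nasty');
print scalar(@nasty), " field(s) in the nasty record\n";
```

Passing a true value as the second argument keeps the quote characters in the returned fields.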

Conclusion

Data reduction is a fairly common task for Perl programs, and the method you use should be carefully tailored for the data you are operating on. The split() function is good for data with regular delimiters that do not appear inside the fields themselves (the classic example is the UNIX password file). For data in fixed-width fields, use pack() and unpack(), or substr() if you only need to extract a single field. Pattern match is a good generic split function, particularly if the data are very irregular. Dealing with quoted fields is always difficult, but the problem has been solved, so you do not have to reinvent the wheel.


Reproduced from ;login: Vol. 19 No. 3, June 1994
