split() routine, but "there is always more than one way to do it," and
split() is not always your best choice.
For example, split() does not deal gracefully with data in fixed-width
fields. Sometimes you can split() on whitespace, but suppose one or
more of the fields contain whitespace (perhaps a "full name" field) or
suppose you would like to preserve the alignment of the data? You could
use substr(), but that only allows you to pull out one field at a time
and you typically have to remove any trailing spaces yourself. Consider
using unpack() when faced with fixed-width data: it gracefully solves
all of these problems.
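As a rough illustration of the difference, the sketch below pulls fields out of a made-up fixed-width record both ways. The record layout (a 16-column name, a 6-column extension, then a department) and the variable names are assumptions invented for this example, not data from the column.

```perl
use strict;
use warnings;

# Hypothetical fixed-width record, invented for illustration:
# a 16-column name, a 6-column extension, then a department.
my $record = sprintf("%-16s%-6s%s", "Hal Pomeranz", "x409", "Support");

# substr() extracts one field per call and keeps the padding,
# so the trailing spaces have to be removed by hand.
my $name_substr = substr($record, 0, 16);
$name_substr =~ s/\s+$//;

# unpack() extracts every field in one call; "A" strips the
# trailing spaces, and "a*" takes whatever is left on the line.
my ($name, $ext, $dept) = unpack("A16 A6 a*", $record);

print "$name_substr|$name|$ext|$dept\n";
```

Note that the name field contains a space, so splitting this record on whitespace would have broken it in two.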
Here is an "ls -n" equivalent (same as "ls -lg" but with numeric user
and group IDs instead of names) that uses pack() and unpack() to
manipulate the output of a BSD-style ls:

    $template = "a14 A9 A9 a*";
    open(LS, "ls -lg |") || die "Can't ls!\n";
    while (<LS>) {
        ($first, $uid, $gid, $last) = unpack($template, $_);
        $uids{$uid} = (getpwnam($uid))[2] unless ($uids{$uid});
        $gids{$gid} = (getgrnam($gid))[2] unless ($gids{$gid});
        print pack($template, $first, $uids{$uid}, $gids{$gid}, $last);
    }

There is actually a subtle bug in the above program. A completely
pointless prize will be awarded to the first person who correctly
identifies the bug to me.
The first argument to unpack() is a template describing each field and
how wide the field is. Whitespace in the template is for readability
only: it is strictly ignored by unpack(). The first "a14" in the
template means the first field is a string of ASCII which is 14
characters long (in the ls output, this pulls off the mode bits and
link count information). This is followed by two ASCII strings which
are 9 characters long (the owner and group of the file), but the
upper-case "A" also means strip off any trailing whitespace (so we can
feed the result to the appropriate get*nam() function). The final "a*"
means just pull off everything else on the line into the last field.
Notice that we can use the same template when we put the line back
together with pack(). Perl's interpretation of "a" and "A" in pack()
templates has been specifically designed to make this possible. The
numeric value after each operator in the template gives the field
width: "a" pads the field with nulls, and "A" pads with spaces. A "*"
instead of a number means make the field exactly as long as the data
supplied.
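The padding rules are easy to check directly. The following is a minimal sketch; the sample line and the template widths are invented for illustration and are not taken from real ls output.

```perl
use strict;
use warnings;

# "a" pads with nulls, "A" pads with spaces, "*" uses the data's own length.
my $null_padded  = pack("a6", "abc");    # "abc" plus three "\0" bytes
my $space_padded = pack("A6", "abc");    # "abc" plus three spaces
my $unpadded     = pack("a*", "abc");    # just "abc"

# Round trip on an invented line: two 6-column fields and a remainder.
my $template = "a6 A6 a*";
my $line     = "abc   def   rest";
my @fields   = unpack($template, $line);   # "a" keeps padding, "A" strips it
my $rebuilt  = pack($template, @fields);   # "A6" pads "def" back out to 6 columns

print $rebuilt eq $line ? "round trip ok\n" : "round trip differs\n";
```

Because "A6" strips the spaces on the way out and restores them on the way back in, the columns stay aligned even though the middle field was handled as a bare "def" in between.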
The split() function may also not be your best choice if your fields
are very irregular. For example, the previous Perl Practicum showed
regular expressions to match fields in the following data record:

    Pomeranz, Hal (pomeranz) x409

Recall that the last two fields are optional, that the whitespace was
sometimes literal spaces, sometimes tabs, and other times a mixture of
the two, and that there tended to be trailing whitespace. I needed to
lose the comma, the "x" before the phone extension, and the parentheses
around the email address.
I could have used split() to pull the line apart (the example below
also illustrates that the first argument to split() is a fully-fledged
regular expression)

    @fields = split(/[\s,()]+/);

though I still would have had the leading "x" in the extension field
(it cannot go in the list of delimiters since the other fields might
contain an "x"). I could also get null fields at the end of the list
unless I first eliminated the trailing whitespace with

    s/\s+$//;

Furthermore, what happens when split() only returns a list of three
values: is the last value an email address or a phone extension? One
could examine the field to see if it matches /x\d{3}/, but it would be
nice to be able to say

    ($last, $first, $email, $ext) = some_expression

and have $email or $ext be null if there is no such information on the
line.
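The ambiguity is easy to reproduce. In the sketch below, split_record() is a hypothetical helper name and the three sample lines are variations on the record above, made up for illustration.

```perl
use strict;
use warnings;

# Strip trailing whitespace, then split on runs of whitespace,
# commas, and parentheses.
sub split_record {
    my ($line) = @_;
    $line =~ s/\s+$//;
    return split(/[\s,()]+/, $line);
}

my @full       = split_record("Pomeranz, Hal (pomeranz) x409  ");
my @email_only = split_record("Pomeranz, Hal\t(pomeranz)");
my @ext_only   = split_record("Pomeranz, Hal x409");

# Both short records yield three fields, so the third one has to be
# inspected to tell an extension from an email address.
for my $fields (\@full, \@email_only, \@ext_only) {
    my $kind = @$fields == 3
        ? ($fields->[2] =~ /^x\d{3}$/ ? "extension" : "email")
        : "both";
    print join("|", @$fields), " ($kind)\n";
}
```

The extension also still carries its leading "x", which is the other shortcoming mentioned above.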
    $_ = "Wed Apr 20 20:39:34 PDT 1994";
    @fields = / ((\d+):(\d+):(\d+)) /;

There are four sub-expressions in the above pattern match: one
sub-expression enclosing three others. The opening parenthesis for the
larger sub-expression is left-most, followed by the three smaller
expressions in order. So, $fields[0] will be "20:39:34" and the next
three elements of the list will be set to "20", "39", and "34".
This behavior in a list context makes the pattern match a very flexible split operator. It is worth mentioning here that if you assign a pattern match expression to a list, then Perl 4 does not set the special $1, $2,..., $9 variables; Perl 5 sets them in list context as well.
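A small sketch of the list-context behavior, using the date string from the example. The variable names and the non-matching string are assumptions added for illustration.

```perl
use strict;
use warnings;

my $date = "Wed Apr 20 20:39:34 PDT 1994";

# In list context the match returns the captured sub-expressions,
# ordered by the position of their opening parentheses.
my ($time, $hour, $min, $sec) = $date =~ / ((\d+):(\d+):(\d+)) /;

# A failed match returns an empty list.
my @none = "no time here" =~ / ((\d+):(\d+):(\d+)) /;

print "$time -> $hour, $min, $sec\n";
print "failed match returned ", scalar(@none), " values\n";
```

Binding with =~ rather than matching against $_ keeps the example self-contained; the pattern itself is unchanged.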
Taking the regular expression developed in the previous Perl Practicum and assigning it to a list yields

    ($last,$first,$junk1,$email,$junk2,$ext) =
        /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/;

The junk fields are necessary because we had to enclose the optional expressions (which match the email and extension fields) in parentheses. Larry Wall is working on a regular expression grouping operator which will not generate sub-expressions, but we will probably have to wait for a later release of Perl5.
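The sketch below applies this kind of assignment to sample records, with the parentheses balanced so that six sub-expressions line up with six variables. The second pattern is an addition that goes beyond the text: it uses the (?:...) non-capturing group found in Perl 5 releases, which removes the need for junk variables. The variable names and the second sample line are invented for illustration.

```perl
use strict;
use warnings;

# Capturing groups around the optional parts produce two junk fields.
my $with_junk = qr/^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/;

# Same pattern with (?:...) grouping: only the four wanted fields are captured.
my $no_junk   = qr/^(.+),\s+(\w+)(?:\s+\((\w+)\))?(?:\s+x(\d+))?\s*$/;

my ($last, $first, $junk1, $email, $junk2, $ext) =
    "Pomeranz, Hal (pomeranz) x409" =~ $with_junk;

# No email address and trailing whitespace: $email2 is left undefined.
my ($last2, $first2, $email2, $ext2) = "Pomeranz, Hal x409 " =~ $no_junk;

print "$last, $first, $email, $ext\n";
print "$last2, $first2, ", (defined $email2 ? $email2 : "(none)"), ", $ext2\n";
```

Either way, a missing optional field comes back undefined rather than shifting the later fields to the left, which is what the split() approach could not offer.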
"Pomeranz, Hal", Support, "Saratoga, CA, USA"with some fields quoted and some not. While Perl regular expressions are not regular expressions in the strict mathematical sense, they cannot be used to generally solve the problem of matching opening and closing delimiters (like parentheses or braces)-particularly if the delimiter is a multi-character string or if you have nested delimiters. A one-line expression to match C-style comments has become the holy grail of comp.lang.perl and is believed to be in the class of problems that includes trisecting an angle with only a compass and straight-edge.
One option is to simply use split() to break each record up (using the
chosen delimiter) and then reconstruct quoted fields after the fact. Of
course, you would have to preserve the delimiters if you take this
approach. Luckily, split() allows you to do this by using parentheses
in the first argument to create a sub-expression:

    @list = split(/(,\s+)/);

Assuming the data line above is in $_, $list[0] will be `"Pomeranz',
$list[1] will be `, ' (the comma and the space that follows it), etc.
Examine the list's elements for leading and trailing double quotes to
reassemble the fields.
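One possible way to do the reassembly is sketched below. It is not the author's code: split_quoted() is a hypothetical helper, and instead of checking only the first and last character of each piece it counts double quotes, treating an odd count as "still inside a quoted field". It assumes quotes are never nested or escaped.

```perl
use strict;
use warnings;

# Split on the delimiter while keeping it, then glue pieces back
# together until the double quotes in the pending field balance.
sub split_quoted {
    my ($line) = @_;
    my @pieces = split(/(,\s+)/, $line);
    my (@fields, $pending);
    while (@pieces) {
        my $piece = shift @pieces;    # field text
        my $delim = shift @pieces;    # delimiter after it (undef at the end)
        $pending = defined $pending ? $pending . $piece : $piece;
        my $quotes = ($pending =~ tr/"//);    # count the double quotes
        if ($quotes % 2 == 0) {
            push @fields, $pending;           # quotes balance: field complete
            undef $pending;
        } elsif (defined $delim) {
            $pending .= $delim;               # delimiter was inside the quotes
        }
    }
    push @fields, $pending if defined $pending;   # unterminated quote
    return @fields;
}

my @fields = split_quoted('"Pomeranz, Hal", Support, "Saratoga, CA, USA"');
print scalar(@fields), " fields: ", join(" | ", @fields), "\n";
```

The quotes themselves are left in place here; stripping them from the finished fields is a separate one-line substitution.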
Another approach would be to try to create an expression that matches the individual fields:

    @fields = /("[^"]+"|[^,]+),\s+("[^"]+"|[^,]+),\s+("[^"]+"|[^,]+)/;

This expression requires each field to be either a double quoted string (a double quote, followed by one or more non-quote characters, followed by a double quote) or one or more non-comma characters. This tactic will work as long as the records contain no nested quotes. However, both the split() tactic and the regular expression above will fail on records like

    This ", would be" nasty

The general solution to this problem requires a small function. The
Perl distribution includes shellwords.pl, which contains a function to
parse lines of space delimited, optionally quoted fields. I have
written a modified version of this library, quotewords.pl, which
accepts any regular expression as a delimiter. You can obtain
quotewords.pl from one of the Perl archives, or directly from me via
email.
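The interface of quotewords.pl is not shown in this column, so the sketch below uses the Text::ParseWords module bundled with Perl 5 as a stand-in; treating its quotewords() function as equivalent to the library described here is an assumption. The delimiter pattern is likewise chosen just for this example.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# quotewords($delim, $keep, @lines): $delim is a regular expression;
# a false $keep drops the quote characters from the returned fields.
my @fields = quotewords(',\s*', 0,
    '"Pomeranz, Hal", Support, "Saratoga, CA, USA"');

# The record that defeats both split() and the single regular expression:
# the comma sits inside quotes, so the whole line comes back as one field.
# A true $keep leaves the quote characters in place.
my @nasty = quotewords(',\s*', 1, 'This ", would be" nasty');

print scalar(@fields), " fields: ", join(" | ", @fields), "\n";
print scalar(@nasty), " field: $nasty[0]\n";
```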
The split() function is good for data with regular delimiters that do not appear inside the fields themselves (the classic example is the UNIX password file). For data in fixed-width fields, use pack() and unpack(), or substr() if you only need to extract a single field. Pattern match is a good generic split function, particularly if the data are very irregular. Dealing with quoted fields is always difficult, but the problem has been solved, so you do not have to reinvent the wheel.
Reproduced from ;login: Vol. 19 No. 3, June 1994
11/26/96ah