The last two
The first basic rule is, always take full advantage of naturally
occurring delimiters. We put spaces between words in written (and
spoken) English because it helps us to understand it better -- look
for fixed tokens that help you break up your regular expression
"utterances". For example,
This rule should also be applied when building up regular
expressions. Suppose we wanted to match date strings
The process we used to build up the last example brings us to our
second rule: start simple and increase your complexity and level of
refinement gradually. For example, it was my recent misfortune to have
to parse a file with lines like
A first cut might be
The email address and phone extension are optional
The third rule is, never use a complex expression when a simple one
will do. For example, one expression to match IP addresses might be
Rule number four is never forget that Perl pattern matching is greedy:
the `
This greedy behavior can be a problem as well, particularly when you
are trying to match pairs of delimiters. For example, suppose you
wanted to match the first double quoted field in
The fifth and final rule is, be careful about anchoring your patterns
with
Another place where this can bite you is when you are trying to
verify the format of some data. The pattern
Reproduced from ;login: Vol. 19 No. 2, April 1994.
11/26/96ah
/^[-+]?\d+(\.\d+)?([eE][-+]?\d+)?$/
is just so much Greek if you try to read it all at once. Use the ()
and [] groupings to break the expression up into four manageable
pieces:
[-+]?   \d+   (\.\d+)?   ([eE][-+]?\d+)?
The first one is easy, an optional plus or minus sign, and the second
is trivial, one or more digits. The third says, "a literal period
followed by one or more digits," and the trailing question mark makes
the whole group optional. The fourth (also optional) group is a little
trickier: an upper or lower case `E', followed by an optional plus or
minus, followed by one or more digits. Put it all together and you
match ordinary decimal numbers such as 42, -3.14, or 6.02e23, but you
probably figured this out by now.
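One way to keep those four pieces visible in the code itself is the /x
modifier, which lets a pattern carry white space and comments. The
following is only a sketch; the sub name is_decimal_number is made up
for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str matches the four-piece number pattern.
sub is_decimal_number {
    my ($str) = @_;
    return $str =~ /
        ^
        [-+]?               # optional sign
        \d+                 # one or more digits
        (\.\d+)?            # optional fraction
        ([eE][-+]?\d+)?     # optional exponent
        $
    /x ? 1 : 0;
}

for my $s ("42", "-3.14", "6.02e23", "1.", "abc") {
    printf "%-8s %s\n", $s,
        is_decimal_number($s) ? "matches" : "does not match";
}
```

The first three strings match; "1." and "abc" do not, because the
fraction group insists on at least one digit after the period.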
Fri Jan 28 13:12:02 PST 1994
There are six different space separated blobs in that line, but there
are only two fundamental "types" of things to match: words ("Fri",
"Jan", and "PST") and numbers ("28", "1994", and the hours, minutes,
and seconds in the time string). Well, we can just use "\w+" for words
and "\d+" for numbers, and the regular expression just pops out
# the expression below is wrong!
/^\w+ \w+ \d+ \d+:\d+:\d+ \w+ \d+$/
Actually, this is not quite right. The day of the month and the hour
of the day can both be single digit values, and the leading digit
position will then just be a space. So, we modify our pattern slightly
/^\w+ \w+\s+\d+\s+\d+:\d+:\d+ \w+ \d+$/
I generally find "\s+" clearer than " +" (that's space-plus, see what
I mean?) in regular expressions, even though they don't strictly mean
the same thing.
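As a rough sketch of how such a date pattern might be used, the code
below wraps each field in parentheses and uses "\s+" around the day of
the month so that padded single-digit values still match. The sub name
parse_date and the sample strings are assumptions made for this
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the eight fields of a date(1)-style string,
# or an empty list if the string does not match.
sub parse_date {
    my ($str) = @_;
    return $str =~ /^(\w+) (\w+)\s+(\d+)\s+(\d+):(\d+):(\d+) (\w+) (\d+)$/;
}

for my $date ("Fri Jan 28 13:12:02 PST 1994",
              "Sat Feb  5  3:04:05 PST 1994") {
    my ($wday, $mon, $mday, $hour, $min, $sec, $zone, $year)
        = parse_date($date);
    print "$wday: $mon $mday, $year at $hour:$min:$sec $zone\n";
}
```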
Pomeranz, Hal (pomeranz) x409
Sometimes the white space was literal spaces, sometimes tabs, other
times a mixture of the two, and there tended to be lots of trailing
white space. Sometimes there was no email address, sometimes there was
no extension, and sometimes there was neither.
/^\w+, \w+ \(\w+\) x\d+$/
You can clearly see the four blocks corresponding to last name, first
name, email, and phone extension. Note that we have to backwhack the
parentheses around the email address because of their special meaning
in regular expressions. Now we can begin to address special cases.
/^\w+, \w+( \(\w+\))?( x\d+)?$/
Note that we have incorporated the space before the email address and
phone extension in the optional block along with each of those
fields. Theoretically, the line of data could simply end after the
first name with no additional white space. As a further refinement, we
have to deal with trailing white space, and the case where field
delimiters are not single spaces
/^\w+,\s+\w+(\s+\(\w+\))?(\s+x\d+)?\s*$/
Actually, last names can look like "Van Der Sluis" or "Cody-Lang", so
we remember Rule #1 (take advantage of naturally occurring delimiters)
and say that the last name is anything before the comma
/^.+,\s+\w+(\s+\(\w+\))?(\s+x\d+)?\s*$/
All right, we know the above expression accurately matches all the
data we might encounter because we have tested it thoroughly on actual
data (you did test thoroughly, right?). Actually, I really needed this
pattern so that I could extract the last and first names, email
address, and phone extension from the line. So now we have to make
everything we want to extract from the line into a subexpression by
throwing parentheses around the individual fields
/^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/
As Randal Schwartz is fond of saying, "Perl: checksummed line noise
with a sense of purpose."
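To make the extraction step concrete, here is a minimal sketch that
pulls the four fields out of a line. With the optional blocks wrapped
in their own parentheses, the email address and extension land in $4
and $6 rather than $3 and $4. The sub name parse_entry and the sample
lines other than the first are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (last, first, email, extension).
# Email and extension are undef when the line omits them.
sub parse_entry {
    my ($line) = @_;
    if ($line =~ /^(.+),\s+(\w+)(\s+\((\w+)\))?(\s+x(\d+))?\s*$/) {
        return ($1, $2, $4, $6);
    }
    return;    # empty list: the line did not match
}

for my $line ("Pomeranz, Hal (pomeranz) x409",
              "Van Der Sluis, Jan\t x12  ",
              "Cody-Lang, Pat") {
    my ($last, $first, $email, $ext) = parse_entry($line) or next;
    printf "%s %s, email=%s, ext=%s\n", $first, $last,
        defined $email ? $email : "(none)",
        defined $ext   ? $ext   : "(none)";
}
```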
/^([12]?\d?\d\.){3}[12]?\d?\d$/
but why bother? In most cases either
/^\d+\.\d+\.\d+\.\d+$/
or
/^(\d+\.){3}\d+$/
is more than sufficient. The first expression is probably more
readable, but your mileage may vary. In either case, the person who
has to maintain your code six months from now (who, you should
remember, might just be yourself) will thank you.
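If you ever do need to reject out-of-range values, one option is to
keep the simple pattern and do the range check in ordinary code. The
following is a sketch only; the sub name is_dotted_quad is an
assumption for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: the simple pattern does the parsing,
# and a plain numeric comparison does the range check.
sub is_dotted_quad {
    my ($addr) = @_;
    my @octets = $addr =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/;
    return 0 unless @octets == 4;
    for my $octet (@octets) {
        return 0 if $octet > 255;
    }
    return 1;
}

for my $addr ("192.168.1.10", "300.1.1.1", "1.2.3") {
    print "$addr: ", (is_dotted_quad($addr) ? "ok" : "rejected"), "\n";
}
```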
*' and `+' operators will eat as much
as they can as long as the pattern can be satisfied. This can work in
your favor when you are doing something like
$_ = "/usr/local/bin/perl";
($dir, $prog) = /^(.*)\/(.*)$/;
The first ".*" will eat up everything but the last `/' which we force
it to match (Rule #1 again) before we pull off the program name.
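A short sketch of the greedy split follows, with a minimal ("*?")
version shown purely for contrast. Both sub names are made up for this
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the greedy first group runs to the LAST slash.
sub split_path {
    my ($path) = @_;
    my ($dir, $prog) = $path =~ /^(.*)\/(.*)$/;
    return ($dir, $prog);
}

# For contrast: a minimal first group stops at the FIRST slash.
sub split_at_first_slash {
    my ($path) = @_;
    my ($head, $rest) = $path =~ /^(.*?)\/(.*)$/;
    return ($head, $rest);
}

my ($dir, $prog) = split_path("/usr/local/bin/perl");
print "dir=$dir prog=$prog\n";        # dir=/usr/local/bin prog=perl

my ($head, $rest) = split_at_first_slash("/usr/local/bin/perl");
print "head='$head' rest=$rest\n";    # head='' rest=usr/local/bin/perl
```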
$_ = 'pomeranz "Hal Pomeranz" "S Clara"';
The expression
($name) = /"(.*)"/; # wrong!
will set $name equal to
Hal Pomeranz" "S Clara
which is not what we wanted. Instead you want
($name) = /"([^"]+)"/;
which says match a double quote, followed by one or more things that
are NOT a double quote, terminated with another double quote. This
"match everything except my trailing delimiter" concept is a useful
trick for your Perl toolkit.
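The difference between the two patterns can be seen side by side in a
small sketch. The sub names first_quoted and all_quoted are
hypothetical, and the /g variant is an extra shown only to illustrate
that the same trick picks out every quoted field.

```perl
use strict;
use warnings;

# Hypothetical helper: the first double quoted field, or undef if none.
sub first_quoted {
    my ($str) = @_;
    my ($field) = $str =~ /"([^"]+)"/;
    return $field;
}

# Hypothetical helper: every double quoted field, in order.
sub all_quoted {
    my ($str) = @_;
    return $str =~ /"([^"]+)"/g;
}

my $line = 'pomeranz "Hal Pomeranz" "S Clara"';

my ($greedy) = $line =~ /"(.*)"/;
print "greedy:  $greedy\n";                   # Hal Pomeranz" "S Clara
print "negated: ", first_quoted($line), "\n"; # Hal Pomeranz
print "all:     ", join(" | ", all_quoted($line)), "\n";
```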
^ and $. Err towards using ^ and $, even when they are not strictly
necessary. For example, a common idiom is
@files = grep(!/^\.\.?$/, readdir(DIR));
which gives you a list of files from directory handle DIR, except for
the "." (dot) and ".." (dot-dot) files. Leaving off the ^ and $
accidentally will throw away all filenames with a dot in them, and
leaving off the $ will throw out all dot files in the directory.
Either way, the result is bound to be unexpected.
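The effect of each missing anchor is easy to check on a plain list of
names, with no directory handle involved. The three sub names and the
sample filenames below are invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same grep with and without anchors.
sub anchored   { return grep { !/^\.\.?$/ } @_; }   # drops only . and ..
sub no_anchors { return grep { !/\.\.?/ }   @_; }   # drops any name with a dot
sub no_dollar  { return grep { !/^\.\.?/ }  @_; }   # drops every dot file

my @names = (".", "..", ".profile", "notes.txt", "Makefile");

print "anchored:   @{[ anchored(@names) ]}\n";    # .profile notes.txt Makefile
print "no anchors: @{[ no_anchors(@names) ]}\n";  # Makefile
print "no dollar:  @{[ no_dollar(@names) ]}\n";   # notes.txt Makefile
```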
/\d+/
will match valid integers, but it also matches "foo2bar" and other
things which are definitely not numbers. To validate that values are
numbers you have to use
/^\d+$/
or a more complex expression like the one at the beginning of this
article.
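A minimal sketch of the difference, using two hypothetical subs, one
unanchored and one anchored:

```perl
use strict;
use warnings;

# Hypothetical helpers for comparing the two patterns.
sub contains_digits { my ($s) = @_; return $s =~ /\d+/   ? 1 : 0; }
sub is_integer      { my ($s) = @_; return $s =~ /^\d+$/ ? 1 : 0; }

for my $value ("1994", "foo2bar", "x409") {
    printf "%-8s unanchored=%d anchored=%d\n",
        $value, contains_digits($value), is_integer($value);
}
```

Only "1994" passes the anchored test, while all three values pass the
unanchored one.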
You simply must become comfortable with regular expressions to use
Perl effectively. Always remember to break complex regular expressions
up into manageable pieces before trying to write or understand
them. Always work up from a simple case to greater stages of
refinement and complexity. Never make expressions any more complex
than they have to be or you will never be able to modify them without
breaking something else. Use greedy pattern matching to your advantage
but beware of the dark side. Finally, use ^ and $ freely to avoid
unexpected problems.