Richard's Perl Step 06
Pattern Matching



- I used "Sams Teach Yourself Perl in 24 Hours" - 3rd Edition as a reference for this page
- Run from a Windows DOS prompt:
perl thisprogram.pl

perl c:\......\thisprogram.pl

 

Examples following Perl 24 hours book, Hour 6:

eg013.pl (text only) - pattern matching

eg014.pl (text only) - pattern matching, delimiter, + * exact match etc

eg015.pl (text only) - substitution s///, global /g

 

 

What you need is the ability to recognize patterns in the input stream and to pick and choose data based on those patterns. The terms "regular expression" and "pattern" are used almost interchangeably here. You can look at the perlre documentation for details (heavy stuff).

In Perl, patterns are enclosed inside a pattern match operator, which is sometimes represented as m//. A simple pattern might appear as follows:
m/Simon/

Pattern matches occur against $_ unless you tell Perl otherwise (which you'll learn about later).

If the pattern specified by m// is found anywhere in the variable $_, the match operator returns true.

if (m/Piglet/) 
{        
    # the pattern "Piglet" is in $_    
} 
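
As a small sketch of matching against $_, the loop below lets for set $_ to each line in turn and tests it with m//. The sample lines and the @found array are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, only for illustration.
my @lines = ("Pooh and Piglet went out", "Eeyore stayed home");
my @found;

for (@lines) {            # each element becomes $_ in turn
    if (m/Piglet/) {      # no variable is named, so the match tests $_
        push @found, $_;
        print "Matched: $_\n";
    }
}
```

Only the first line contains Piglet, so only that line is printed and stored.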

Inside the pattern, every character matches itself unless it is a metacharacter.
Metacharacters are characters that change the behavior of the pattern match. The list of metacharacters is as follows:

^ $ ( ) \ | @ [ { ? . + *

As long as you keep the leading m, you can replace the slashes with almost any other punctuation character, such as the commas in the following example:

if (m,Waldo,) 
{ 
     print "Found Waldo.\n"; 
}  


The slash or other character that marks the beginning and end of the pattern is called the delimiter. Often you replace the delimiter when the pattern contains slashes (/) and the end of the pattern could be confused with the slashes inside the pattern. If you stick with slashes to delimit that pattern, the enclosed slashes need to have backslashes in front of them, as shown here:

if (m/\/usr\/local\/bin\/hangman/) 
{ 
     print "Found the hangman game!" 
}  
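
To compare delimiters side by side, here is a sketch that matches the same path three ways. The variable names are assumptions for illustration; m{...} uses a pair of braces as the delimiter, which Perl also accepts.

```perl
use strict;
use warnings;

$_ = "/usr/local/bin/hangman";

# Same pattern, three delimiters. Only the first needs backslashes.
my $with_slashes = m/\/usr\/local\/bin\/hangman/ ? 1 : 0;
my $with_braces  = m{/usr/local/bin/hangman}     ? 1 : 0;
my $with_commas  = m,/usr/local/bin/hangman,     ? 1 : 0;

print "slashes: $with_slashes, braces: $with_braces, commas: $with_commas\n";
```

All three report 1; the versions without slashes as delimiters are just easier to read.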
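
Scalar variables inside a pattern are interpolated before the match is attempted, so the pattern can come from user input:

$pat=<STDIN>;
chomp $pat;
$_="The phrase that pays";
if (/$pat/)
{    # Look for the user's pattern
     print "\"$_\" contains the pattern $pat\n";
}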


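To match a metacharacter literally, put a backslash in front of it:

/\^\$/; # A literal caret and dollar sign

The unprintables

Some characters cannot be typed directly into a pattern; they are written as backslash escapes:

Character    Matches
\n           A newline character
\r           A carriage return
\t           A tab
\f           A formfeed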
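
A short sketch of these escapes in use, with made-up sample strings. grep tests each element as $_ and keeps the ones where the pattern matches.

```perl
use strict;
use warnings;

# Hypothetical samples: one with a tab, one with a dollar sign, one with a caret.
my @samples = ("name\tvalue", "name value", "costs \$5", "2^10");

my @with_tab    = grep { /\t/ } @samples;   # contains a tab character
my @with_dollar = grep { /\$/ } @samples;   # contains a literal dollar sign
my @with_caret  = grep { /\^/ } @samples;   # contains a literal caret

print scalar(@with_tab), " tab, ", scalar(@with_dollar), " dollar, ",
      scalar(@with_caret), " caret\n";
```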




Quantifiers

A quantifier is a kind of metacharacter that tells the regular expression how many consecutive occurrences of something to match.

The + causes the preceding character to match at least once, or as many times as it can and still have a matching expression. Thus, /do+g/ would behave as follows:

Match These      But Not These    Why Not
hounddog         badge            The required o is missing.
hotdog           doofus           The g is missing.
doogie howser    Doogie           D is not the same as d.
doooooogdoog     pagoda           The d, o, and g do not appear in order.
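
The /do+g/ examples can be checked with a few lines of Perl. This is only a sketch; the array names are my own.

```perl
use strict;
use warnings;

my @words = ("hounddog", "hotdog", "doogie howser", "doooooogdoog",
             "badge", "doofus", "Doogie", "pagoda");

my @matched   = grep {  /do+g/ } @words;   # d, one or more o, then g
my @unmatched = grep { !/do+g/ } @words;

print "Matched:   @matched\n";
print "Unmatched: @unmatched\n";
```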

The * metacharacter is similar to the + metacharacter, but it causes the preceding character to be matched zero or more times. Thus, /car*t/ would behave as follows:

Match These    But Not These    Why Not
carted         carrot           The o interrupts the pattern before the t (in carted, the ed simply follows the match).
cat            carl             The t in the pattern isn't optional, but the r is.
carrrt         caart            The a in the pattern can't be repeated, but the r can.
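
The same kind of check works for /car*t/. Again a sketch, with array names of my own choosing.

```perl
use strict;
use warnings;

my @words = ("carted", "cat", "carrrt", "carrot", "carl", "caart");

my @matched   = grep {  /car*t/ } @words;   # c, a, zero or more r, then t
my @unmatched = grep { !/car*t/ } @words;

print "Matched:   @matched\n";
print "Unmatched: @unmatched\n";
```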

The ? metacharacter causes the preceding character to be matched either zero times or once (but no more).
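/c?ola/ would match cola and ola. In ccola only one c can take part in the match: the pattern still succeeds there, because an unanchored pattern may match anywhere in the string, but the matched text is just cola, not ccola.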
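
A quick way to see what /c?ola/ actually matches is to print $&, the special variable holding the text covered by the last successful match. This is a sketch; the %matched_text hash is only for illustration. Note that an unanchored pattern can succeed on part of a string, so ccola matches through its cola substring.

```perl
use strict;
use warnings;

my %matched_text;

for ("cola", "ola", "ccola") {
    if (/c?ola/) {
        $matched_text{$_} = $&;          # text the pattern actually covered
        print "$_ matched: $&\n";
    }
}
```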

Perl allows you to match exactly as many occurrences as you need by using braces, {}.
The general form is pat{n,m}, where n is the minimum number of matches, m is the maximum number of matches, and pat is the character or group of characters you're trying to quantify. You can omit m (pat{n,} means n or more). Omitting n (pat{,m}) is recognized as a quantifier only in Perl 5.34 and later, so pat{0,m} is the portable spelling.

Pattern      Meaning
/x{5,10}/    x occurs at least 5 times, but no more than 10.
/x{9,}/      x occurs at least 9 times, possibly more.
/x{0,4}/     x occurs up to 4 times, possibly not at all.
/x{8}/       x must occur exactly 8 times.
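
The counts describe how many x's the quantifier itself takes; an unanchored pattern can still succeed inside a longer run. The sketch below, which assumes a made-up test string of twelve x's, records how many characters each pattern covered by measuring $& (the matched text).

```perl
use strict;
use warnings;

$_ = "x" x 12;     # a run of twelve x's

my %consumed;      # pattern text => number of characters matched
$consumed{'x{5,10}'} = length($&) if /x{5,10}/;
$consumed{'x{9,}'}   = length($&) if /x{9,}/;
$consumed{'x{0,4}'}  = length($&) if /x{0,4}/;
$consumed{'x{8}'}    = length($&) if /x{8}/;

for my $pat (sort keys %consumed) {
    print "$pat matched $consumed{$pat} characters\n";
}
```

Each pattern succeeds on the twelve-character string, covering 10, 12, 4, and 8 characters respectively.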

A common idiom in regular expressions is .*. You can use it to match anything, usually anything between two other things that you're interested in. For example, /first.*last/ matches each of the following lines:
first then last
The good players get picked first, the bad last.
The first shall be last, and the last shall be first.
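
A sketch that runs /first.*last/ over those three lines, plus one extra line (my own addition, not from the book) where last comes before first and so should not match.

```perl
use strict;
use warnings;

my @lines = (
    "first then last",
    "The good players get picked first, the bad last.",
    "The first shall be last, and the last shall be first.",
    "last but not first",     # hypothetical counter-example: wrong order
);

my @matched = grep { /first.*last/ } @lines;

print "$_\n" for @matched;
```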


Character Classes

Another common practice in regular expressions is to ask for a match of "any of these characters." To write a character class, you enclose the characters it contains in square brackets, [].

Character Class    Explanation
[abcde]            Match any of a, b, c, d, or e
[a-e]              Same as above; match any of a, b, c, d, or e
[Gg]               Match an uppercase G or lowercase g
[0-9]              Match a digit
[0-9]+             Match one or more digits in sequence
[A-Za-z]{5}        Match any group of five alphabetic characters
[*!@#$%&()]        Match any of these punctuation marks

Within a character class, most metacharacters lose their "meta-ness." Thus, the * really represents a literal *.
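
Here is a sketch that tries a few character classes against made-up sample strings. The word gopher and the array names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my @samples = ("Gopher", "gopher", "room 101", "ab12", "hello world", "***");

my @gophers  = grep { /[Gg]opher/ }    @samples;  # either capitalization
my @digits   = grep { /[0-9]+/ }       @samples;  # one or more digits
my @five     = grep { /[A-Za-z]{5}/ }  @samples;  # five letters in a row
my @asterisk = grep { /[*]/ }          @samples;  # * is literal inside a class

print "gophers:  @gophers\n";
print "digits:   @digits\n";
print "five:     @five\n";
print "asterisk: @asterisk\n";
```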

If a caret (^) occurs as the first character of a character class, the character class is negated. That is, the character class matches any single character that is not in the class, as in this example:

/[^A-Z]/; # Matches non-uppercase-alphabetic characters.

Because ], ^, and - are special in a character class, some rules apply about trying to match those characters literally in a character class. To match a literal ^ in a character class, you must make sure it does not occur first in the class. To match a literal ], you either need to put it first in the class or put a backslash in front of it (for example, /[abc\]]/). To put a literal hyphen (-) in a character class, you can simply put it first in the class or put a backslash in front of it.
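
The placement rules can be tried out as follows. This is a sketch with made-up samples; the + in each class is just a second, ordinary member added for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my @samples = ("ABC", "ABc", "a-b", "a^b", "a]b", "ab");

my @not_all_upper = grep { /[^A-Z]/ } @samples;  # negated class: any non A-Z character
my @has_hyphen    = grep { /[-+]/ }   @samples;  # hyphen placed first is literal
my @has_caret     = grep { /[+^]/ }   @samples;  # caret not first is literal
my @has_bracket   = grep { /[\]+]/ }  @samples;  # backslashed ] is literal

print "not all uppercase: @not_all_upper\n";
print "hyphen: @has_hyphen, caret: @has_caret, bracket: @has_bracket\n";
```

Only ABC is rejected by the negated class, because every one of its characters is in A-Z.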

Perl contains shortcuts for certain commonly used character classes.

Pattern    Matches
\w         A word character; for ASCII text, same as [a-zA-Z0-9_]
\W         A nonword character (the inverse of \w)
\d         A digit; for ASCII text, same as [0-9]
\D         A nondigit
\s         A whitespace character; for ASCII text, same as [ \t\f\r\n] (plus the vertical tab since Perl 5.18)
\S         A nonwhitespace character
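
A sketch of the shortcuts applied to a few made-up ASCII strings; the array names are my own.

```perl
use strict;
use warnings;

# Hypothetical sample strings (the second is three spaces).
my @samples = ("abc_123", "   ", "42", "no digits", "!!!");

my @has_digit   = grep { /\d/ } @samples;   # at least one digit
my @has_space   = grep { /\s/ } @samples;   # at least one whitespace character
my @has_word    = grep { /\w/ } @samples;   # at least one word character
my @has_nonword = grep { /\W/ } @samples;   # at least one nonword character

print scalar(@has_digit), " with digits, ", scalar(@has_space), " with whitespace, ",
      scalar(@has_word), " with word characters, ",
      scalar(@has_nonword), " with nonword characters\n";
```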

 

Something important

Greediness in Matching

The match repetition characters, * and +, are as greedy as they can be when trying to make a match. In other words, they take as much text as possible, so the match runs up to the last occurrence of whatever follows them in the pattern.
If you want to curb this greedy behaviour and make them stop at the first occurrence instead, put the greed-inhibiting character, "?", directly after the quantifier, as in:
(.*?)
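
The difference is easiest to see by comparing the matched text, available in $& after a successful match. A minimal sketch, reusing one of the sample sentences from earlier on this page; the variable names are my own.

```perl
use strict;
use warnings;

$_ = "The first shall be last, and the last shall be first.";

my ($greedy, $lazy);
$greedy = $& if /first.*last/;    # .* runs on to the final "last"
$lazy   = $& if /first.*?last/;   # .*? stops at the nearest "last"

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version covers "first shall be last, and the last", while the non-greedy version stops at "first shall be last".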