perlretut: Grammar, clarifications, white-space #18486

Closed
wants to merge 1 commit into from
52 changes: 28 additions & 24 deletions pod/perlretut.pod
@@ -20,17 +20,20 @@ expressions will allow you to manipulate text with surprising ease.
What is a regular expression? At its most basic, a regular expression
is a template that is used to determine if a string has certain
characteristics. The string is most often some text, such as a line,
sentence, web page, or even a whole book, but less commonly it could be
some binary data as well.
sentence, web page, or even a whole book, but it doesn't have to be. It
could be binary data, for example. Biologists often use Perl to look
for patterns in long DNA sequences.

Suppose we want to determine if the text in variable, C<$var> contains
the sequence of characters S<C<m u s h r o o m>>
(blanks added for legibility). We can write in Perl

$var =~ m/mushroom/

The value of this expression will be TRUE if C<$var> contains that
sequence of characters, and FALSE otherwise. The portion enclosed in
C<'E<sol>'> characters denotes the characteristic we are looking for.
sequence of characters anywhere within it, and FALSE otherwise. The
portion enclosed in C<'E<sol>'> characters denotes the characteristic we
are looking for.
We use the term I<pattern> for it. The process of looking to see if the
pattern occurs in the string is called I<matching>, and the C<"=~">
operator along with the C<m//> tell Perl to try to match the pattern
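
The matching expression discussed in this hunk can be tried out with a short script. This is a minimal sketch; the sample strings in `@samples` are made up for illustration and do not come from the tutorial.

```perl
use strict;
use warnings;

# Hypothetical sample strings, not taken from perlretut.
my @samples = ("cream of mushroom soup", "tomato soup", "MUSHROOM");

my @verdicts;
for my $var (@samples) {
    # m/mushroom/ is true when that sequence of characters
    # occurs anywhere within $var (matching is case-sensitive).
    my $verdict = $var =~ m/mushroom/ ? "matches" : "does not match";
    push @verdicts, $verdict;
    print "'$var' $verdict\n";
}
```

Only the first string matches: the second lacks the sequence, and the third differs in case.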
@@ -60,7 +63,7 @@ many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
comfortable with the basics, and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting-edge innovations.

@@ -135,7 +138,7 @@ And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

"Hello World" =~ m!World!; # matches, delimited by '!'
"Hello World" =~ m{World}; # matches, note the matching '{}'
"Hello World" =~ m{World}; # matches, note the paired '{}'
"/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
# '/' becomes an ordinary char

@@ -151,7 +154,7 @@ Let's consider how different regexps would match C<"Hello World">:
"Hello World" =~ /oW/; # doesn't match
"Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
The first regexp C<world> doesn't match because regexps are by default
case-sensitive. The second regexp matches because the substring
S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space
character C<' '> is treated like any other character in a regexp and is
@@ -169,8 +172,8 @@ always match at the earliest possible point in the string:
"That hat is red" =~ /hat/; # matches 'hat' in 'That'

With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used "as
is" in a match. Some characters, called I<metacharacters>, are
need to know about. First of all, not all characters can be used
"as-is" in a match. Some characters, called I<metacharacters>, are
generally reserved for use in regexp notation. The metacharacters are

{}[]()^$.|*+?-#\
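
To see what "reserved" means in practice, the sketch below matches a string containing `+` three ways. The variable names are illustrative, and `quotemeta` is shown as one common way to escape a whole string; the hunk itself does not mention it.

```perl
use strict;
use warnings;

my $sum = "2+2=4";

# Unescaped, '+' is a quantifier: /2+2/ wants one or more '2's
# followed by another '2', which "2+2=4" does not contain.
my $plain = $sum =~ /2+2/ ? 1 : 0;

# A backslash turns '+' into an ordinary character.
my $escaped = $sum =~ /2\+2/ ? 1 : 0;

# quotemeta escapes every metacharacter in a string for us.
my $literal = quotemeta("2+2");
my $quoted  = $sum =~ /$literal/ ? 1 : 0;

print "plain=$plain escaped=$escaped quoted=$quoted\n";
```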
@@ -832,8 +835,8 @@ Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group. A more convenient technique became available
with Perl 5.10: relative backreferences. To refer to the immediately
preceding capture group one now may write C<\g{-1}>, the next but
last is available via C<\g{-2}>, and so on.
preceding capture group one now may write C<\g-1> or C<\g{-1}>, the next but
last is available via C<\g-2> or C<\g{-2}>, and so on.

Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
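
Separately from the tutorial's own example, which the hunk boundary cuts off here, the two spellings added by this change can be compared in a small sketch. The strings and the `$pair` sub-pattern are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "this is is a test";

# Both spellings refer to the capture group immediately before them.
my $braced = $text =~ /\b(\w+)\s+\g{-1}\b/ ? $1 : undef;
my $bare   = $text =~ /\b(\w+)\s+\g-1\b/   ? $1 : undef;
print "braced: $braced, bare: $bare\n";

# A relative reference keeps working when the sub-pattern is embedded
# in a larger regexp, where its group is no longer group number 1.
my $pair = qr/(\w)\g{-1}/;
my $doubled = "xaay" =~ /(x)$pair/ ? $2 : undef;
print "doubled letter: $doubled\n";
```

Had `$pair` used `\1`, embedding it after `(x)` would have made it refer to the `x` group instead of its own.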
@@ -1970,10 +1973,11 @@ C<\x>I<XY> (without curly braces and I<XY> are two hex digits) doesn't
go further than 255. (Starting in Perl 5.14, if you're an octal fan,
you can also use C<\o{oct}>.)

/\x{263a}/; # match a Unicode smiley face :)
/\x{263a}/; # match a Unicode smiley face :)
/\x{ 263a }/; # Same

B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no more the case: for
utf8> to use any Unicode features. This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed. (The only case where it matters is if your Perl script is in
Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.)
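
A minimal sketch of matching a code point written with `\x{...}`. It reports the code point instead of printing the character, so no output encoding layer is needed. The blank-padded `\x{ 263a }` form added in this hunk is left out on the assumption that it needs a recent perl.

```perl
use strict;
use warnings;

# Build the string with an escape, so the script itself stays ASCII
# and needs no "use utf8".
my $line = "I am happy \x{263a}";

my $found;
if ($line =~ /(\x{263a})/) {
    # Report the matched character by its code point.
    $found = sprintf "U+%04X", ord $1;
    print "matched $found\n";
}
```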
@@ -2050,16 +2054,16 @@ C<\p{Mark}>, meaning things like accent marks.

The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
used to categorize every Unicode character into the language script it
is written in. (C<Script_Extensions> is an improved version of
C<Script>, which is retained for backward compatibility, and so you
should generally use C<Script_Extensions>.)
For example,
is written in. For example,
English, French, and a bunch of other European languages are written in
the Latin script. But there is also the Greek script, the Thai script,
the Katakana script, I<etc>. You can test whether a character is in a
particular script (based on C<Script_Extensions>) with, for example
C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
the Balinese script, you would use C<\P{Balinese}>.
the Katakana script, I<etc>. (C<Script> is an older, less advanced,
form of C<Script_Extensions>, retained only for backwards
compatibility.) You can test whether a character is in a particular
script with, for example C<\p{Latin}>, C<\p{Greek}>, or
C<\p{Katakana}>. To test if it isn't in the Balinese script, you would
use C<\P{Balinese}>. (These all use C<Script_Extensions> under the
hood, as that gives better results.)

What we have described so far is the single form of the C<\p{...}> character
classes. There is also a compound form which you may run into. These
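
The single-form script tests named in this hunk can be exercised as below. This is a sketch: the sample characters are my own picks (Latin `a`, Greek small alpha U+03B1, Katakana KA U+30AB), not examples from the tutorial.

```perl
use strict;
use warnings;

# Each entry: [ label, character, pattern expected to match it ].
my @checks = (
    [ "Latin a",          "a",        qr/\p{Latin}/    ],
    [ "Greek alpha",      "\x{3b1}",  qr/\p{Greek}/    ],
    [ "Katakana KA",      "\x{30ab}", qr/\p{Katakana}/ ],
    # Upper-case \P negates: true for anything not in the script.
    [ "a is not Balinese", "a",       qr/\P{Balinese}/ ],
);

my @results;
for my $check (@checks) {
    my ($label, $char, $re) = @$check;
    my $ok = $char =~ $re ? 1 : 0;
    push @results, $ok;
    print "$label: ", ($ok ? "yes" : "no"), "\n";
}
```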
@@ -2439,7 +2443,7 @@ substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
is no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point. This can be done by
enclosing the inner quantifier as an independent subexpression:
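
As an illustration of the technique, and not necessarily the tutorial's own code (which lies outside this hunk), the sketch below wraps the inner quantifier in `(?>...)`. The test strings are assumptions.

```perl
use strict;
use warnings;

# (?>[^()]+) is an independent subexpression: once it has consumed a
# run of non-parenthesis characters it never gives any of them back.
my $re = qr/\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

# A balanced string with one level of nesting matches in full.
my $matched = "(a(b)c)" =~ $re ? $& : undef;
print "matched: $matched\n";

# No closing parenthesis: the match fails without retrying every
# possible way of splitting up the run of 'a's.
my $unclosed = ("(" . "a" x 40) =~ $re ? 1 : 0;
print "unclosed matches: $unclosed\n";
```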

@@ -2625,8 +2629,8 @@ section L</"Pragmas and debugging"> below.

More fun with C<?{}>:

$x =~ /(?{print "Hi Mom!";})/; # matches,
# prints 'Hi Mom!'
$x =~ /(?{print "Hi Mom!";})/; # matches,
# prints 'Hi Mom!'
$x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
# prints '1'
$x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
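
The behaviour of `$^R` in the last line above can be checked without printing from inside the pattern. In this sketch the package variables `$c` and `$seen` and the subject string are illustrative assumptions.

```perl
use strict;
use warnings;

# Package variables keep the embedded code blocks simple.
our ($c, $seen);

# The first block assigns to $c; its value becomes $^R, which the
# second block copies into $seen.
my $ok = "x" =~ /(?{ $c = 1; })(?{ $seen = $^R; })/ ? 1 : 0;

print "matched=$ok c=$c seen=$seen\n";
```

Both blocks are zero-width, so the pattern matches at the start of any string.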