Created on Oct. 30, 2012, 2:21 p.m. by Hevok & updated on Nov. 10, 2012, 8:40 p.m. by Hevok

===================
Regular Expressions
===================

A regular expression (regex or RE) describes a set of strings. Used in the right way, REs are powerful tools for text mining and manipulation.

.. contents:: Contents

RE Concatenation
================
REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, the string pq matches AB.
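
The concatenation rule can be checked directly with Python's ``re`` module. This is a minimal sketch: the sub-patterns ``A`` and ``B`` and the strings ``p`` and ``q`` are made up for illustration only.

```python
import re

# Hypothetical sub-patterns, chosen only for illustration.
A = r"\d+"     # one or more digits
B = r"[a-z]+"  # one or more lowercase letters

p, q = "2012", "hevok"

# p matches A and q matches B ...
p_matches_a = re.fullmatch(A, p) is not None
q_matches_b = re.fullmatch(B, q) is not None

# ... so the concatenated string pq matches the concatenated RE AB.
pq_matches_ab = re.fullmatch(A + B, p + q) is not None

# The order matters: qp does not match AB.
qp_matches_ab = re.fullmatch(A + B, q + p) is not None

print(p_matches_a, q_matches_b, pq_matches_ab, qp_matches_ab)
```

``re.fullmatch`` is used here so that the whole string has to match, which mirrors the set-of-strings view of an RE.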

Special Characters
==================
Regexes allow the use of escaped letters and special symbols to match a wide range of strings according to certain rules.

``.``
    Matches any character except a newline. With the DOTALL flag it matches any character, including a newline.

``^``
    Matches the start of the string.

``$``
    Matches the end of the string, or just before a newline at the end of the string.

``x*``
    Matches zero or more repetitions of the preceding pattern x.

``x+``
    Matches x one or more times. ``\w+`` matches one or more word characters (such as ``abc`` or ``20``), but never the empty string.

``x?``
    Matches an optional x, that is, zero or one occurrence of the pattern.

``*?, +?, ??``
    Add ``?`` after ``*``, ``+`` or ``?`` to make them less "greedy". ``.*?`` matches the minimum it can, rather than as much as possible.

``{m}``
    Matches exactly m repetitions of the preceding regex.

``x{m,n}``
    Matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.

``{m,n}?``
    Matches from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.

``(a|b|c)``
    Matches either a or b or c.

``|``
    Or operator. Matches either the expression on the left of the pipe or the expression on the right.

``[]``
    Indicates a set of characters for a single position in the regex. Put the allowed characters between the brackets.

``()``
    Put parentheses around a part of the regex and pull the matched text out later using the ``groups()`` method of the match object.

``(...)``
    Indicates a grouping for the regex (a remembered group). The matched values can be retrieved with the ``groups()`` method of the object returned by ``re.search``.
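
``(?...)``
    Extension notation. The first character after the ``?`` determines the meaning of the construct.

``(?iLmsux)``
    Each letter sets the corresponding flag for the whole regex: ``i`` (ignore case), ``L`` (locale dependent), ``m`` (multi-line), ``s`` (dot matches all), ``u`` (Unicode dependent), ``x`` (verbose).

``(?:...)``
    Non-capturing group. Groups the regex, but the matched text cannot be retrieved afterwards.

``(?P<name>...)``
    Gives the name 'name' to the group for later usage.

``(?P=name)``
    Recalls the text matched by the group named 'name'.

``(?#...)``
    A comment/remark. The parentheses and their contents are ignored.

``(?=...)``
    Lookahead assertion. Matches if the regex in parentheses matches next, without consuming any of the string.

``(?!...)``
    Negative lookahead assertion. Matches when the part of the regex preceding the parentheses is not followed by the regex in parentheses.

``(?<=...)``
    Positive lookbehind assertion. Matches the expression to the right of the parentheses when it is preceded by a match for the regex in parentheses.

``(?<!...)``
    Negative lookbehind assertion. Matches the expression to the right of the parentheses when it is not preceded by a match for the regex in parentheses.

``(?(id/name)yes-pattern|no-pattern)``
    Conditional. Tries to match yes-pattern if the group with the given id or name exists, and no-pattern otherwise. The no-pattern part is optional.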

``\``
    Escapes special characters so that they are matched literally, or signals a special sequence.

``\number``
    Matches the contents of the group of the same number.

``\A``
    Matches only at the start of the string. This is similar to ``^``.

``\b``
    A word boundary must occur right here. Matches the empty string that forms the boundary at the beginning or end of a word.

``\B``
    Matches the empty string, but only where it is not at the beginning or end of a word.

``\d``
    Any numeric digit, 0-9.

``\D``
    Matches any character that is not a digit.

``\s``
    A white space character, such as space or tab.

``\S``
    A non-white space character.

``\w``
    A "word" character: a-z, A-Z, 0-9, and a few others, such as underscore (_).

``\W``
    A non-word character, the opposite of ``\w``. Examples include '&', '$', '@', etc.

``\Z``
    Matches only at the end of the string. This is similar to ``$``.
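
A few of the less obvious entries above (non-greedy quantifiers, lookahead and lookbehind, named back-references and the conditional construct) are easier to follow with a short sketch. The sample strings are invented for illustration; the e-mail pattern is only a toy and not a real address validator.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.*?>", html)
print(greedy)
print(lazy)

prices = "apple 30 USD, pear 45 EUR, plum $12"
# Lookahead: digits followed by " USD"; the suffix is not part of the match.
usd = re.findall(r"\d+(?= USD)", prices)
# Lookbehind: digits preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", prices)
print(usd, dollars)

# Named group plus back-reference: find a word that is repeated.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")
print(doubled.group("word"))

# Conditional: require a closing ">" only if an opening "<" was matched.
address = re.compile(r"(<)?\w+@\w+(?:\.\w+)+(?(1)>|$)")
print(bool(address.match("<user@host.com>")))
print(bool(address.match("user@host.com")))
print(bool(address.match("<user@host.com")))
```

In the conditional pattern, group 1 is the optional ``<``. If it took part in the match, the yes-pattern ``>`` is required; otherwise the no-pattern ``$`` must match.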

Examples
========
.. sourcecode:: python

    import random
    import re

    # {m,n}: after "Ba", the letter "l" must occur 2 to 5 times.
    r = re.search(r'Bal{2,5}', '34eBallllll342')
    if r:
        print("Positive")
    else:
        print("Negative")

    RE, seq = r'(?<=abc)def', 'abcdef'  # Looks for "def" preceded by "abc"
    RE, seq = r'(?<=-)\w+', 'spam-egg'  # Looks for a word following a hyphen

    m = re.search(RE, seq)
    print(m.group(0))

    print('\n#Matching vs. Searching')
    print(re.match("c", "abcdef"))   # None: match() only checks the start of the string
    print(re.search("c", "abcdef"))  # search() scans the whole string

    print('\n#Module')
    pattern = 'ABC'
    string = 'ABCD'
    prog = re.compile(pattern)
    result = prog.match(string)
    # Equivalent one-off call. re.compile is more efficient when the
    # expression will be used several times.
    result = re.match(pattern, string)

    # prog.findall(string[, pos[, endpos]]): the optional pos and endpos
    # arguments limit the search region.

    print('\n#group([group1, ...])')
    m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    print(m.group(0))
    print(m.group(1))
    print(m.group(2))
    print(m.group(1, 2))

    m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(m.group('first_name'))
    print(m.group('last_name'))

    print('\n#Named groups can also be referred to by their index:')
    print(m.group(1))
    print(m.group(2))

    print('\n#If a group matches multiple times, only the last match is accessible:')
    m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
    print(m.group(1))                 # Returns only the last match.

    print('\n#groupdict([default])')
    m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(m.groupdict())

    print('\n#Making a Phonebook')
    phonebook = """Ross McFluff: 834.345.1254 155 Elm Street
    Ronald Heathmore: 892.345.3428 436 Finley Avenue
    Frank Burger: 925.541.7625 662 South Dogwood Way
    Heather Albrecht: 548.326.4584 919 Park Place"""
    entries = re.split(r"\n+", phonebook)
    print(entries)
    print()
    print([re.split(":? ", entry, maxsplit=3) for entry in entries])
    print()
    print([re.split(":? ", entry, maxsplit=4) for entry in entries])

    print('\n#Text Munging')
    def repl(m):
        # Shuffle the inner letters of a word, keeping first and last in place.
        inner_word = list(m.group(2))
        random.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    text = '''Some long text here.'''
    print(re.sub(r"(\w)(\w+)(\w)", repl, text))

    print('\n#Finding all Adverbs')
    text = "He was carefully disguised but captured quickly by police."
    print(re.findall(r"\w+ly", text))

    print('\n#Finding all Adverbs and their Positions')
    text = "He was carefully disguised but captured quickly by police."
    for m in re.finditer(r"\w+ly", text):
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

Get positions of matches
========================
.. sourcecode:: python

    # Source: http://stackoverflow.com/questions/250271/python-regex-use-how-to-get-positions-of-matches
    import re

    p = re.compile("[a-z]")
    for m in p.finditer('a1b2c3d4'):
        print(m.start(), m.group())

Negation
========
To negate an expression defined inside of square brackets, include a ``^`` at the beginning; it negates all the characters included. For instance, to match every non-alphanumeric character except spaces and colons, one could formulate a pattern like this::

    [^a-zA-Z\d\s:]

* ``\d`` - numeric class
* ``\s`` - white-space
* ``a-zA-Z`` - matches all letters (``\w`` would also include digits and the underscore)
* ``^`` - negates them all, so that in effect the set matches all characters that are not letters, not numeric, not white-space and not colons
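
As a quick check of the negated set, the following sketch collects every character of a made-up sample sentence that is neither a letter, a digit, white-space nor a colon.

```python
import re

# Invented sample text, for illustration only.
text = "Time: 10:30 - call Bob & Ann (urgent)!"

# Every character that is NOT a letter, digit, white-space or colon.
leftovers = re.findall(r"[^a-zA-Z\d\s:]", text)
print(leftovers)  # ['-', '&', '(', ')', '!']
```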

Anchors
=======

``^``
    Start of string, or start of line in multi-line pattern
``\A``
    Start of string
``$``
    End of string, or end of line in multi-line pattern
``\Z``
    End of string
``\b``
    Word boundary
``\B``
    Not word boundary
``\<``
    Start of word
``\>``
    End of word
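
The difference between the line anchors and the string anchors shows up once the multi-line flag is set. The sketch below uses an invented two-line string. Note that Python's ``re`` module has no dedicated start-of-word or end-of-word anchors, so ``\b`` stands in for both here.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ only matches at the very start of the string.
start_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
# \A ignores MULTILINE and still matches only at the start of the string.
start_string = re.findall(r"\A\w+", text, re.MULTILINE)

# $ with MULTILINE matches at the end of every line, \Z only at the very end.
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)
end_string = re.findall(r"\w+\Z", text, re.MULTILINE)

# \b requires a word boundary on both sides, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat's")
# \B requires that there is no boundary, so only the "cat" inside "concat" matches.
inside_words = re.findall(r"\Bcat", "cat concat cat's")

print(start_default, start_multiline, start_string)
print(end_multiline, end_string)
print(whole_words, inside_words)
```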

Character Classes
=================
``\c``
    Control character
``\s``
    White space
``\S``
    Not white space
``\d``
    Digit
``\D``
    Not digit
``\w``
    Word
``\W``
    Not word
``\x``
    Hexadecimal digit
``\O``
    Octal digit
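
The classes that Python's ``re`` module understands can be tried out as follows. Python has no shorthand class for hexadecimal or octal digits, so this sketch falls back on explicit character sets for those two; the sets and the sample strings are assumptions made for illustration.

```python
import re

sample = "Room 42\tis free_now!"

digits = re.findall(r"\d", sample)       # single digits
words = re.findall(r"\w+", sample)       # runs of word characters
chunks = re.findall(r"\S+", sample)      # runs of non-white-space characters
separators = re.findall(r"\W", sample)   # single non-word characters

# Assumed stand-ins for the hexadecimal and octal digit classes.
HEX_DIGITS = r"[0-9A-Fa-f]+"
OCTAL_DIGITS = r"[0-7]+"

is_hex = re.fullmatch(HEX_DIGITS, "CAFE") is not None
not_hex = re.fullmatch(HEX_DIGITS, "CAGE") is not None
is_octal = re.fullmatch(OCTAL_DIGITS, "755") is not None
not_octal = re.fullmatch(OCTAL_DIGITS, "789") is not None

print(digits, words, chunks, separators)
print(is_hex, not_hex, is_octal, not_octal)
```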

Quantifiers
===========

``*``
    0 or more
``+``
    1 or more
``?``
    0 or 1
``{n}``
    Exactly n times
``{n,}``
    n or more
``{n,m}``
    n to m

Add a ``?`` to a quantifier to make it ungreedy.
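
A small sketch of the counted quantifiers and of the ungreedy variant. The test string of repeated ``a`` characters is made up, and the ``\b`` anchors are only there to make each word match as a whole.

```python
import re

s = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", s)      # exactly 2 times
two_or_more = re.findall(r"\ba{2,}\b", s)     # 2 or more times
two_to_three = re.findall(r"\ba{2,3}\b", s)   # 2 to 3 times
optional_u = re.findall(r"colou?r", "color colour")  # 0 or 1 "u"

# Greedy takes as many repetitions as allowed, ungreedy as few as allowed.
greedy = re.search(r"a{2,3}", "aaaa").group()
ungreedy = re.search(r"a{2,3}?", "aaaa").group()

print(exactly_two, two_or_more, two_to_three, optional_u)
print(greedy, ungreedy)
```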

Groups & Ranges
===============
``.``
    Any character except new line (``\n``)
``(a|b)``
    a or b
``(...)``
    Group
``(?:...)``
    Passive (non-capturing) group
``[abc]``
    Range (a or b or c)
``[^abc]``
    Not a or b or c
``[a-q]``
    Letter from a to q
``[A-Q]``
    Upper case letter from A to Q
``[0-7]``
    Digit from 0 to 7
``\n``
    nth group/subpattern (n is the number of the group)

Note: Ranges are inclusive.
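
The difference between capturing and passive groups is easiest to see with ``re.findall``, which returns the group content when the pattern has a capturing group. The sample strings in this sketch are invented.

```python
import re

# Capturing group: findall returns only what the group matched.
captured = re.findall(r"gr(a|e)y", "gray grey")
# Passive group: findall returns the whole match.
passive = re.findall(r"gr(?:a|e)y", "gray grey")

# Ranges are inclusive: [a-q] accepts both "a" and "q", but not "r".
in_range = re.findall(r"[a-q]", "aqr")
negated = re.findall(r"[^abc]", "abcd")
octal_run = re.findall(r"[0-7]+", "0789")

# \1 refers back to the first group: a character followed by itself.
double_letter = re.search(r"(\w)\1", "hello").group()

print(captured, passive)
print(in_range, negated, octal_run, double_letter)
```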

Common Metacharacters
=====================
``^ [ . $ { * ( \ + ) | ? < >``

The escape character is usually the backslash (``\``).
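
To match a metacharacter literally, it has to be escaped. The sketch below shows the effect for the dot and uses ``re.escape`` to escape a whole literal string at once; the price strings are made up.

```python
import re

# Unescaped, "." matches any character; escaped, it matches only a dot.
any_char = re.findall(r"9.99", "9.99 9x99")
literal_dot = re.findall(r"9\.99", "9.99 9x99")

# re.escape puts a backslash in front of every metacharacter of a literal.
literal = "$9.99?"
pattern = re.escape(literal)
found = re.search(pattern, "cost (USD): $9.99?").group()

print(any_char, literal_dot, found)
```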

Special Character
=================
``\n``
    New line
``\r``
    Carriage return
``\t``
    Tab
``\v``
    Vertical tab
``\f``
    Form feed
``\xxx``
    Octal character xxx
``\xhh``
    Hex character hh
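
These escapes can be used inside a pattern just like ordinary characters. A minimal sketch with invented sample strings follows; in Python's ``re`` module an octal escape needs three digits (or a leading zero) so that it is not mistaken for a group reference.

```python
import re

record = "name\tvalue\r\n"

# \t splits on the tab; \r\n matches a Windows-style line ending.
fields = re.split(r"\t", record.rstrip("\r\n"))
has_crlf = re.search(r"\r\n", record) is not None

# Octal \101 and hex \x41 both denote the character "A" (code point 65).
octal_hits = re.findall(r"\101", "ABA")
hex_hits = re.findall(r"\x41", "ABA")

# Vertical tab and form feed inside a character set.
controls = re.findall(r"[\v\f]", "a\vb\fc")

print(fields, has_crlf)
print(octal_hits, hex_hits, controls)
```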

Resources
=========
There is an excellent book on regex [1]_, the official Python HOWTO guide [2]_, a nice regex primer [3]_ and a cheat-sheet [4]_.

.. [1] Mastering Regular Expressions by Jeffrey E. F. Friedl, published by O'Reilly, 1st edition
.. [2] Regular Expression HOWTO [http://docs.python.org/howto/regex.html#regex-howto]
.. [3] Regex Primer [http://python.about.com/od/regularexpressions/a/regexprimer.htm]
.. [4] Cheat-Sheet [http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/]