RegEx

Created on Oct. 30, 2012, 2:21 p.m. by Hevok & updated by Hevok on May 2, 2013, 5:14 p.m.

Regular Expressions

A regular expression (regex or RE) describes a set of strings. Used in the right way, REs become very powerful tools for text mining and text manipulation.

RE Concatenation

REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, the string pq will match AB.
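The concatenation rule can be sketched with Python's re module. The patterns A and B and the strings p and q below are illustrative choices only.

```python
import re

A = r"\d+"      # illustrative RE: one or more digits
B = r"[a-z]+"   # illustrative RE: one or more lowercase letters
AB = A + B      # joining the two pattern strings gives the RE "AB"

p, q = "2013", "regex"

# The trailing $ makes each test cover the whole string.
p_matches = re.match(A + r"$", p) is not None
q_matches = re.match(B + r"$", q) is not None
pq_matches = re.match(AB + r"$", p + q) is not None
qp_matches = re.match(AB + r"$", q + p) is not None   # wrong order
```

Here p_matches, q_matches and pq_matches are all True, while qp_matches is False. Plain string joining is only safe when neither pattern contains a top-level |; otherwise wrapping each part in (?:...) keeps the parts separate.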

Special Characters

Regexes allow the use of escaped letters and special symbols to match a wide range of strings according to certain rules.

.
Matches any character except a newline. With the DOTALL flag, it matches any character including a newline.
^
Matches the start of the string. In MULTILINE mode it also matches immediately after each newline.
$
Matches the end of the string, or just before the newline at the end of the string. In MULTILINE mode it also matches before each newline.
x*
Matches zero or more repetitions of x.
x+
Matches x one or more times. \w+ matches one or more word characters; digits count as word characters, so it matches 'spam' and also '20', but not the empty string.
x?
Matches zero or one occurrence of x, i.e. x is optional.
*?, +?, ??
Add ? after *, + or ? to make the quantifier less "greedy". .*? will match the minimum it can, rather than as much as possible.
{m}
Matches exactly m repetitions of the preceding RE.
x{m,n}
Matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
{m,n}?
Matches from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.
(a|b|c)
Matches either a or b or c.
|
Or operator. Matches either the expression on the left of the pipe or the expression on the right.
[]
Indicates a set of characters for a single position in the regex. Put the allowed characters between the brackets, e.g. [abc] or [a-z].
()
Put parentheses around part of the pattern and pull the matched text out later using the .groups() method of the match object.
(...)
Indicates a grouping for the regex. A remembered (capturing) group. The text that matched can be retrieved by using the group() or groups() method of the match object returned by re.search.
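(?...)
Extension notation. The first character after the ? determines the meaning and further syntax of the construct; all of the forms below start this way.
(?iLmsux)
One or more letters from the set i, L, m, s, u, x. Each letter sets the corresponding flag (re.I, re.L, re.M, re.S, re.U, re.X) for the entire regex.
(?:...)
Non-capturing group. Groups the regex, but the matched text is not stored and cannot be retrieved afterwards.
(?P<name>...)
Gives the name 'name' to the group so that the matched text can be accessed by name later.
(?P=name)
Recalls the text matched by the group named 'name'.
(?#...)
A comment/remark. The parentheses and their contents are ignored.
(?=...)
Lookahead assertion. Matches if ... matches next, without consuming any of the string. For example, Isaac (?=Asimov) matches 'Isaac ' only if it is followed by 'Asimov'.
(?!...)
Negative lookahead assertion. Matches when the part of the regex preceding the parenthesis is not followed by the regex in parentheses.
(?<=...)
Positive lookbehind assertion. Matches the expression to the right of the parentheses when it is preceded by the value of ...
(?<!...)
Negative lookbehind assertion. Matches the expression to the right of the parentheses when it is not preceded by the value of ...
(?(id/name)yes-pattern|no-pattern)
Conditional. Tries to match yes-pattern if the group with the given id or name matched, and no-pattern if it did not. The no-pattern part is optional.
\
Regular expressions use a backslash to escape special characters or to introduce a special sequence.
\number
Backreference. Matches the contents of the group of the same number; groups are numbered starting from 1. For example, (.+) \1 matches 'the the'.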
\A
Matches only at the start of the string. Unlike '^', it is not affected by the MULTILINE flag.
\b
A word boundary must occur right here. Matches the empty string that forms the boundary at the beginning or end of a word.
\B
Matches the empty string, but only where it is not at the beginning or end of a word.
\d
Any decimal digit, 0-9.
\D
Matches any character that is not a digit.
\s
A white space character, such as space or tab.
\S
A non-white space character.
\w
A "word" character: a-z, A-Z, 0-9 (i.e. alphanumeric), and a few others, such as underscore (_).
\W
A non-word character - the opposite of \w. Examples include '&', '$', '@', etc.
\Z
Matches only at the end of the string. Unlike '$', it does not match before a trailing newline and is not affected by the MULTILINE flag.
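The less obvious constructs in the list above (lookaround, backreferences, the conditional form and inline flags) are easier to follow with small examples. The following is a minimal sketch; all sample strings and variable names are made up for illustration, and the e-mail pattern is deliberately simplistic.

```python
import re

# Lookahead: 'Isaac ' matches only when 'Asimov' follows; 'Asimov' is not consumed.
ahead_text = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group(0)

# Negative lookahead: no match here, because 'Asimov' does follow.
not_ahead = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")

# Negative lookbehind: a number that is not preceded by '$'.
# \b keeps the engine from matching the tail of a longer number.
behind_text = re.search(r"(?<!\$)\b\d+", "cost $20 for 15 items").group(0)

# Backreference: \1 must repeat exactly what group 1 matched.
doubled_word = re.search(r"\b(\w+) \1\b", "Paris in the the spring").group(1)

# Conditional: if group 1 ('<') matched, require '>', otherwise require the end.
cond = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")
bracketed = cond.match("<user@host.com>")
unbalanced = cond.match("<user@host.com")

# Inline flag: (?i) at the start makes the whole pattern case-insensitive.
flagged = re.search(r"(?i)regex", "RegEx")
```

With these inputs, ahead_text is 'Isaac ', not_ahead is None, behind_text is '15', doubled_word is 'the', bracketed is a match object, unbalanced is None, and flagged is a match object.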

Examples

import re
r = re.search('Bal{2,5}','34eBallllll342')
if r: print "Positive"
else: print "Negative"

RE, seq = r'(?<=abc)def', 'abcdef'    # Looks for 'def' preceded by 'abc'
RE, seq = r'(?<=-)\w+', 'spam-egg'    # Looks for a word following a hyphen (replaces the pair above)

m = re.search(RE, seq)
print m.group(0)

print '''\n#Matching vs. Searching'''
print re.match("c", "abcdef")
print re.search("c", "abcdef")

print '''\n#Module'''
pattern = 'ABC'
string = 'ABCD'
prog = re.compile(pattern)
result = prog.match(string)
result = re.match(pattern, string)  # Equivalent; re.compile is more efficient when the expression will be used several times

# prog.findall(string[, pos[, endpos]])  # pos and endpos limit the region of the string that is searched

print '''\n#group([group1, ...])'''
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print m.group(0)
print m.group(1)
print m.group(2)
print m.group(1,2)

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print m.group('first_name')
print m.group('last_name')

print '''\n#Named groups can also be referred to by their index:'''
print m.group(1)
print m.group(2)

print '''\n#If a group matches multiple times, only the last match is accessible:'''
m = re.match(r"(..)+", "a1b2c3")    # Matches 3 times.
print m.group(1)                    # Returns only the last match.

print '''\n#groupdict([default])'''
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print m.groupdict()

print '''\n#Making a Phonebook'''
phonebook = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger:             925.541.7625       662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place"""
entries = re.split("\n+", phonebook)  # renamed from 'input' to avoid shadowing the built-in
print entries
print
print [re.split(":? ", entry, 3) for entry in entries]
print
print [re.split(":? ", entry, 4) for entry in entries]

print '''\n#Text Munging'''
import random
def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)
text = '''Some long text here.'''

# Shuffle the inner letters of every word; the first and last letters stay in place.
print re.sub(r"(\w)(\w+)(\w)", repl, text)

print '''\n#Finding all Adverbs'''
text = "He was carefully disguised but captured quickly by police."
print re.findall(r"\w+ly", text)

print '''\n#Finding all Adverbs and their Positions'''
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly",text):
    print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))

#234567891123456789212345678931234567894123456789512345678961234567897123456789

Get Positions of Matches

import re # http://stackoverflow.com/questions/250271/python-regex-use-how-to-get-positions-of-matches
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print m.start(), m.group()

Negation

A ^ at the beginning of square brackets negates the set, so it matches any character that is not listed. For instance, to match every non-alphanumeric character except some defined ones, one could formulate a regex like this:

[^a-zA-Z\d\s:]
  • \d - digit class
  • \s - white-space class
  • a-zA-Z - matches all ASCII letters (\w would also include digits and the underscore)
  • : - a literal colon
  • ^ - negates them all, so the set matches any character that is not a letter, a digit, white space or a colon
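A short sketch of the negated set in use; the sample sentence is invented for illustration.

```python
import re

text = "Time: 14:30, price = 20 EUR (approx.)"

# Every character that is not a letter, digit, white space or colon.
others = re.findall(r"[^a-zA-Z\d\s:]", text)

# The same set can be used to strip those characters.
cleaned = re.sub(r"[^a-zA-Z\d\s:]", "", text)
```

For this input, others is [',', '=', '(', '.', ')'], and the colons survive in cleaned because ':' is listed inside the negated set.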

Anchors

^
Start of string, or start of line in multi-line pattern
\A
Start of string
$
End of string, or end of line in multi-line pattern
\Z
End of string
\b
Word boundary
\B
Not word boundary
\<
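Start of word (other regex flavours only; not supported by Python's re module, use \b instead)
\>
End of word (other regex flavours only; not supported by Python's re module, use \b instead)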
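The difference between the anchors shows up mainly with multi-line text. A minimal sketch using Python's re module, with an invented two-line string:

```python
import re

text = "first line\nsecond line\n"

# ^ matches only at the start of the string unless MULTILINE is set.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A always means the start of the string, even with MULTILINE.
starts_A = re.findall(r"\A\w+", text, re.MULTILINE)

# $ also matches just before a trailing newline; \Z does not.
dollar = re.search(r"line$", text)
big_z = re.search(r"line\Z", text)

# \b requires a word boundary on both sides of 'line'.
whole_words = re.findall(r"\bline\b", "line inline lines")
```

Here starts_default is ['first'], starts_multiline is ['first', 'second'], starts_A is ['first'], dollar is a match object, big_z is None, and whole_words is ['line'].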

Character Classes

\c
Control character (other regex flavours only; not supported by Python's re module)
\s
White space
\S
Not white space
\d
Digit
\D
Not digit
\w
Word
\W
Not word
\x
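Hexadecimal digit (other regex flavours only; in Python's re module \x introduces a hex escape such as \x41)
\O
Octal digit (other regex flavours only; not supported by Python's re module)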
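The classes that Python's re module does support (\d, \w, \s and their negations) can be tried out as follows. The sample string is made up for illustration.

```python
import re

sample = "Order 66: 3 items,\ttotal_42"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of letters, digits and underscores
non_word = re.findall(r"\W", sample)     # single non-word characters
fields = re.split(r"\s+", sample)        # split on any run of white space
```

For this input, digits is ['66', '3', '42'] and words is ['Order', '66', '3', 'items', 'total_42']. The underscore belongs to \w, which is why 'total_42' stays in one piece.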

Quantifiers

*
0 or more
+
1 or more
?
0 or 1
{n}
Exactly n times
{n,}
n or more
{n,m}
n to m

Add a ? to a quantifier to make it ungreedy.
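Greedy and ungreedy quantifiers can be compared side by side. This is a small sketch with invented input strings:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r"<.*>", html)

# Ungreedy: .*? stops at the first '>' it can.
lazy = re.findall(r"<.*?>", html)

# The same applies to counted repetitions.
counted_greedy = re.search(r"a{2,4}", "aaaaaa").group(0)
counted_lazy = re.search(r"a{2,4}?", "aaaaaa").group(0)
```

Here greedy holds the whole string as a single match, lazy is ['<b>', '</b>', '<i>', '</i>'], counted_greedy is 'aaaa' and counted_lazy is 'aa'.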

Groups & Ranges

.
Any character except new line (\n)
(a|b)
a or b
(...)
Group
(?:...)
Passive (non-capturing) group
[abc]
Range (a or b or c)
[^abc]
Not a or b or c
[a-q]
Letter from a to q
[A-Q]
Upper case letter from A to Q
[0-7]
Digit from 0 to 7
\n
nth group/subpattern, where n is a number (e.g. \1 refers to the first group)

Note: Ranges are inclusive.
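A minimal sketch of groups, passive groups, ranges and numbered references with Python's re module. The date string and the other inputs are illustrative only.

```python
import re

date = "2013-05-02"

# Capturing groups are numbered from 1 and stored.
captured = re.match(r"(\d{4})-(\d{2})-(\d{2})", date).groups()

# A passive group is used for repetition but stores nothing.
year_only = re.match(r"(\d{4})(?:-\d{2}){2}", date).groups()

# Alternation inside a passive group.
colours = re.findall(r"gr(?:a|e)y", "gray or grey, not groy")

# Ranges are inclusive: [a-q] accepts 'a' and 'q' but not 'r'.
in_range = re.findall(r"[a-q]", "aqr")

# \1, \2, \3 in the replacement refer back to the numbered groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3.\2.\1", date)
```

With these inputs, captured is ('2013', '05', '02'), year_only is ('2013',), colours is ['gray', 'grey'], in_range is ['a', 'q'] and reordered is '02.05.2013'.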

Common Metacharacters

^ [ . $ { * ( + ) | ? < >

The escape character is usually the backslash.
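In Python, metacharacters can be escaped by hand with a backslash or, for text that should be matched literally, with re.escape. A small sketch with an invented price string:

```python
import re

haystack = "3.50 3x50"
price = "3.50"   # literal text to look for

# Unescaped, the dot matches any character, so '3x50' is found as well.
unescaped = re.findall(r"3.50", haystack)

# Escaped by hand: only the literal dot matches.
escaped = re.findall(r"3\.50", haystack)

# re.escape adds the backslashes for us.
auto = re.findall(re.escape(price), haystack)
```

Here unescaped is ['3.50', '3x50'], while escaped and auto are both ['3.50'].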

Special Character

\n
New line
\r
Carriage return
\t
Tab
\v
Vertical tab
\f
Form feed
\xxx
Octal character xxx
\xhh
Hex character hh

Only utf-8 Regex Pattern

Using the UNICODE regex flag it is possible to do something like ur'(?u)^[^\W\d_]+$', which will match any string consisting solely of alphabetic Unicode characters. The set [^\W\d_] reads as "word characters, minus digits and the underscore".
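The pattern can be sketched in a form that should run on both Python 2.7 and Python 3. That portability is an assumption behind the notation used here: the ur'' prefix exists only in Python 2, so the example uses a u'' literal with doubled backslashes and passes re.UNICODE instead of the inline (?u). The test words are invented.

```python
import re

# Word characters, minus digits and the underscore, for the whole string.
alphabetic = re.compile(u"^[^\\W\\d_]+$", re.UNICODE)

words = [u"M\u00fcller", u"regex", u"abc123", u"snake_case", u"two words"]

# Keep only the words made up solely of alphabetic characters.
only_letters = [w for w in words if alphabetic.match(w)]
```

Only u"M\u00fcller" (Müller) and u"regex" end up in only_letters; the other entries contain a digit, an underscore or a space.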

Resources

There is an excellent book on regex [1], the official Python HOWTO guide [2] and the re module documentation [3], as well as a nice regex primer [4] and a cheat-sheet [5]. There is also an informational website dedicated to regular expressions [6].

[1] Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly, 1st edition
[2] Regular Expression HOWTO [http://docs.python.org/howto/regex.html#regex-howto]
[3] Module re [http://docs.python.org/2.7/library/re.html#module-re]
[4] Regex Primer [http://python.about.com/od/regularexpressions/a/regexprimer.htm]
[5] Cheat-Sheet [http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/]
[6] Char classes [http://www.regular-expressions.info/charclass.html]

Tags: rest, text mining
Categories: Tutorial
Parent: Tutorials
