RegEx

Created on Oct. 30, 2012, 2:21 p.m. by Hevok & updated by Hevok on May 2, 2013, 5:14 p.m.

Regular Expressions

A regular expression (regex or RE) is a pattern that describes a set of strings. Used in the right way, REs are very powerful for text mining and manipulation.

Contents

Regular Expressions
- RE Concatenation
- Special Characters
- Examples
- Get Positions of Matches
- Negation
- Anchors
- Character Classes
- Quantifiers
- Groups & Ranges
- Common Metacharacters
- Special Character
- Only utf-8 Regex Pattern
- Resources

RE Concatenation

REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, then the string pq matches AB.
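The concatenation rule can be checked directly. The following is a minimal sketch; the sub-patterns A and B, the strings p and q, and the helper matches_fully are hypothetical names chosen only for illustration.

```python
import re

# Hypothetical sub-patterns, chosen only for this illustration.
A = r"\d+"       # one or more digits
B = r"[a-z]+"    # one or more lowercase letters

p, q = "2013", "regex"

def matches_fully(pattern, string):
    """Return True if the whole string matches the pattern."""
    # (?:...) keeps the appended \Z from binding to only part of the pattern.
    return re.match(r"(?:%s)\Z" % pattern, string) is not None

print(matches_fully(A, p))          # p matches A
print(matches_fully(B, q))          # q matches B
print(matches_fully(A + B, p + q))  # so pq matches AB
```

When a sub-pattern contains a top-level |, wrapping it in (?:...) before concatenating keeps the alternation from swallowing its neighbour.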

Special Characters

Regexes allow the use of escaped letters and special symbols to match a wide range of strings according to certain rules.

.: Matches any character except a newline. With the DOTALL flag it matches any character, including a newline.
^: Matches the start of the string.
$: Matches the end of the string, or just before the newline at the end of the string.
x*: Means zero or more instances of the previous pattern.
x+: Matches x one or more times. \w+ will match one or more word characters, so it matches 'spam' or '20', but not an empty string.
x?: 0 or 1 instances of the pattern. Matches an optional x character (matches are zero or one times).
*?, +?, ??: Add ? after *, +, or ? to make them less "greedy". .*? will match the minimum it can, rather than as much as possible.
{m}: Matches exactly m instances of the preceding regex.
x{m,n}: Matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. Thus it specifies a range for the number of instances that should be matched.
{m,n}?: Specifies a range of the numbers of instances that should be matched, matching as few as possible.
(a|b|c): Matches either a or b or c.
|: Or operator. Matches either the expression on the left of the pipe or the one on the right.
[]: Indicates a set of characters for a single position in the regex. Put the allowed characters between the brackets.
(): Put parentheses around a part of the pattern and pull the matched text out later using the .groups() method of the match object.
(...): Indicates a grouping for the regex. A remembered group. Values of what matched can be retrieved by using the groups() method of the object returned by re.search.
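(?...): Extension notation. The first character after the ? determines the meaning of the construct.
(?iLmsux): Sets the corresponding flags (one or more letters from i, L, m, s, u, x) for the whole regex.
(?:...): Non-capturing group. Groups the regex without remembering what it matched.
(?P<name>...): Gives the name 'name' to the group for later usage.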
(?P=name): Recalls the text matched by the regex named 'name'.
(?#...): A comment/remark. The parentheses and their contents are ignored.
(?=...): Lookahead assertion. Matches if ... matches next, without consuming any of the string.
(?!...): Negative lookahead. Matches when the part of the regex preceding the parentheses is not followed by the regex in parentheses.
(?<=...): Matches the expression to the right of the parentheses when it is preceded by the value of ...
(?<!...): Matches the expression to the right of the parentheses when it is not preceded by the value of ...
(?(id/name)yes-pattern|no-pattern): Tries to match yes-pattern if the group with the given id or name exists, and the optional no-pattern otherwise.
\: Escapes special characters so that they match literally, or signals a special sequence.
\number: Matches the contents of the group of the same number (a backreference). For example, (.+) \1 matches 'the the'.
\A: Matches the start of the string. This is similar to '^'.
\b: A word boundary must occur right here. Matches the empty string that forms the boundary at the beginning or end of a word.
\B: Matches the empty string that is not the beginning or end of a word.
\d: Any numeric digit. A digit character, 0-9.
\D: Matches any character that is not a digit.
\s: A white space character, such as space or tab.
\S: A non-white space character.
\w: A "word" character: a-z, A-Z, 0-9 (i.e. alphanumeric), and a few others, such as underscore (_).
\W: A non-word character - the opposite of \w. Examples include '&', '$', '@', etc.
\Z: Matches the end of a string. This is similar to '$'.
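Several of the less obvious constructs above (lookahead, lookbehind, backreferences, named groups and the conditional pattern) are easier to grasp from a small experiment. The sample strings and variable names below are assumptions made up for this sketch; the conditional pattern follows the e-mail style example commonly used to illustrate it.

```python
import re

# Lookahead: "foo" only when followed by "bar"; "bar" itself is not consumed.
lookahead = re.search(r"foo(?=bar)", "foobar").group(0)

# Negative lookbehind: words that do not directly follow a hyphen.
not_after_hyphen = re.findall(r"(?<!-)\b\w+", "spam-egg ham")

# Backreference by number: a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(0)

# Named group plus (?P=name): the closing quote must equal the opening one.
quoted = re.match(r"(?P<quote>['\"]).*?(?P=quote)", "'hello' world").group(0)

# Conditional: require ">" only if group 1 (the "<") took part in the match.
bracketed = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")
balanced = bracketed.match("<user@host.com>") is not None
unbalanced = bracketed.match("<user@host.com") is not None

print(lookahead)
print(not_after_hyphen)
print(doubled)
print(quoted)
print(balanced)
print(unbalanced)
```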

Examples

import re
r = re.search('Bal{2,5}','34eBallllll342')
if r: print "Positive"
else: print "Negative"

RE, seq = r'(?<=abc)def', 'abcdef'    # Looks for 'def' preceded by 'abc'
RE, seq = r'(?<=-)\w+', 'spam-egg'    # Looks for a word following a hyphen (replaces the line above)

m = re.search(RE, seq)
print m.group(0)

print '''\n#Matching vs. Searching'''
print re.match("c", "abcdef")
print re.search("c", "abcdef")

print '''\n#Module'''
pattern = 'ABC'
string = 'ABCD'
prog = re.compile(pattern)
result = prog.match(string)
result = re.match(pattern, string)  # Equivalent; re.compile is more efficient when the expression will be used several times

# prog.findall(string[, pos[, endpos]])  # The optional pos and endpos arguments limit the search region

print '''\n#group([group1, ...])'''
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print m.group(0)
print m.group(1)
print m.group(2)
print m.group(1,2)

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print m.group('first_name')
print m.group('last_name')

print '''\n#Named groups can also be referred to by their index:'''
print m.group(1)
print m.group(2)

print '''\n#If a group matches multiple times, only the last match is accessible:'''
m = re.match(r"(..)+", "a1b2c3")    # Matches 3 times.
print m.group(1)                    # Returns only the last match.

print '''\n#groupdict([default])'''
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print m.groupdict()

print '''\n#Making a Phonebook'''
phonebook = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger:             925.541.7625       662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place"""
entries = re.split("\n+", phonebook)   # The name "phonebook" avoids shadowing the built-in input()
print entries
print
print [re.split(":? ", entry, 3) for entry in entries]
print
print [re.split(":? ", entry, 4) for entry in entries]

print '''\n#Text Munging'''
import random
def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)
text = '''Some long text here.'''
print re.sub(r"(\w)(\w+)(\w)", repl, text)   # Shuffles the inner letters of every word

print '''\n#Finding all Adverbs'''
text = "He was carefully disguised but captured quickly by police."
print re.findall(r"\w+ly", text)

print '''\n#Finding all Adverbs and their Positions'''
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly",text):
    print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))

#234567891123456789212345678931234567894123456789512345678961234567897123456789

Get Positions of Matches

import re # http://stackoverflow.com/questions/250271/python-regex-use-how-to-get-positions-of-matches
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print m.start(), m.group()

Negation

A ^ at the beginning of a character set (square brackets) negates the set, so that it matches any character not listed. For instance, to match every non-alphanumeric character except some defined ones, one could formulate a regex like this:

[^a-zA-Z\d\s:]

\d - numeric class
\s - white-space
a-zA-Z - matches all ASCII letters (\w would also include digits and the underscore)
^ - negates them all, so the set matches any character that is not a letter, a digit, white-space or a colon
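A quick way to see what the negated set picks up is to run it over a sample line. This is only a sketch; the sample string and variable names are invented for the illustration.

```python
import re

# Anything that is NOT a letter, digit, white-space or colon.
leftover = re.compile(r"[^a-zA-Z\d\s:]")

sample = "Time: 12:30 (approx.) & cost = $5!"

found = leftover.findall(sample)     # the characters the set matches
cleaned = leftover.sub("", sample)   # the same text with them removed

print(found)
print(cleaned)
```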

Anchors

^: Start of string, or start of line in multi-line pattern
\A: Start of string
$: End of string, or end of line in multi-line pattern
\Z: End of string
\b: Word boundary
\B: Not word boundary
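\<: Start of word (not supported by Python's re module; use \b instead)
\>: End of word (not supported by Python's re module; use \b instead)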
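The difference between ^ and \A, and between the word-boundary anchors, shows up once a string has more than one line. The following sketch uses made-up sample text and variable names.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ and $ only anchor at the ends of the whole string.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)

# With MULTILINE, ^ and $ also anchor at every line break.
start_multi = re.findall(r"^\w+", text, re.MULTILINE)
end_multi = re.findall(r"\w+$", text, re.MULTILINE)

# \A ignores MULTILINE and always means the start of the string.
start_abs = re.findall(r"\A\w+", text, re.MULTILINE)

# \b requires a word boundary, \B requires that there is none.
whole_word = re.findall(r"\bline\b", "line outline lines")
inside_word = re.findall(r"\Bline", "line outline lines")

print(start_default)
print(start_multi)
print(start_abs)
print(end_default)
print(end_multi)
print(whole_word)
print(inside_word)
```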

Character Classes

\c: Control character
\s: White space
\S: Not white space
\d: Digit
\D: Not digit
\w: Word
\W: Not word
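\x: Hexadecimal digit
\O: Octal digit

Note: \c, \x and \O are listed in the general cheat-sheet [5], but Python's re module does not provide them as character classes. In Python, use explicit sets such as [0-9A-Fa-f] or [0-7] instead.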
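The shorthand classes can be tried out on a short string. This is a sketch with an invented sample; the explicit hexadecimal set is an assumption about how one might express a "hex digit" class in Python's re module.

```python
import re

sample = "Order 66: 3 items\t@ $4.50"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of word characters
spaces = re.findall(r"\s", sample)    # single white-space characters

# An explicit set stands in for a hexadecimal-digit class.
hex_tokens = re.findall(r"\b[0-9A-Fa-f]+\b", "ff 1A zz 7g")

print(digits)
print(words)
print(spaces)
print(hex_tokens)
```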

Quantifiers

*: 0 or more
+: 1 or more
?: 0 or 1
{n}: Exactly n times
{n,}: n or more
{n,m}: n to m

Add a ? to a quantifier to make it non-greedy, so that it matches as few repetitions as possible.
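The effect of the extra ? is easiest to see by running the greedy and the non-greedy form of the same pattern side by side. The sample strings and variable names here are illustrative assumptions.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole line.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", html)

# The same applies to counted repetition.
greedy_count = re.search(r"\d{2,4}", "12345").group(0)
lazy_count = re.search(r"\d{2,4}?", "12345").group(0)

print(greedy)
print(lazy)
print(greedy_count)
print(lazy_count)
```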

Groups & Ranges

.: Any character except new line (\n)
(a|b): a or b
(...): Group
(?:...): Passive (non-capturing) group
[abc]: Set (a or b or c)
[^abc]: Not a or b or c
[a-q]: Letter from a to q
[A-Q]: Upper case letter from A to Q
[0-7]: Digit from 0 to 7
\n: nth group/subpattern (n is a number, e.g. \1)

Note: Ranges are inclusive.
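As a rough sketch of how groups, sets and numbered references behave in Python, consider the following; all sample strings and variable names are invented for illustration.

```python
import re

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"gr(?:a|e)y", "gray grey griy")

# Capturing group: findall returns only what the group matched.
captured = re.findall(r"gr(a|e)y", "gray grey griy")

# Ranges are inclusive: [0-7] accepts 0 and 7 but not 8 or 9.
octal_runs = re.findall(r"[0-7]+", "0178 9 42")

# \1 and \2 refer back to the first and second group in a replacement.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(whole)
print(captured)
print(octal_runs)
print(swapped)
```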

Common Metacharacters

^ [ . $ { * ( \ + ) | ? < >

The escape character is usually the backslash.
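To match a metacharacter literally, it has to be escaped. A minimal sketch, assuming an invented sample string; re.escape is the standard-library helper that escapes a whole literal string at once.

```python
import re

price = "Total: $4.50 (approx.)"

# Unescaped, $ is an anchor and . is a wildcard, so this finds nothing.
unescaped = re.search(r"$4.50", price)

# Escaped by hand with backslashes.
by_hand = re.search(r"\$4\.50", price).group(0)

# Escaped automatically with re.escape.
automatic = re.search(re.escape("$4.50"), price).group(0)

# An unescaped dot also matches characters that were never intended.
too_loose = re.search(r"4.50", "4x50") is not None

print(unescaped)
print(by_hand)
print(automatic)
print(too_loose)
```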

Special Character

\n: New line
\r: Carriage return
\t: Tab
\v: Vertical tab
\f: Form feed
\xxx: Octal character xxx
\xhh: Hex character hh
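These escapes can be written either in a normal string (where Python converts them first) or in a raw string (where the re module interprets them). The sketch below uses invented sample data to show that both routes reach the same characters.

```python
import re

# A raw string keeps the backslash; a normal string converts the escape.
raw_len = len(r"\n")       # two characters: backslash and n
plain_len = len("\n")      # one character: the newline itself

# Both forms match a real newline, because re understands \n too.
raw_hit = re.search(r"\n", "a\nb") is not None
plain_hit = re.search("\n", "a\nb") is not None

# Splitting on any run of tabs, carriage returns and newlines.
fields = re.split(r"[\t\r\n]+", "a\tb\r\nc")

# Hex 41 and octal 101 both denote the letter A.
hex_a = re.match(r"\x41", "A") is not None
octal_a = re.match(r"\101", "A") is not None

print(raw_len)
print(plain_len)
print(fields)
print(hex_a)
print(octal_a)
```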

Only utf-8 Regex Pattern

Using the UNICODE regex flag it is possible to do something like ur'(?u)^[^\W\d_]+$', which will match any string consisting solely of alphabetic unicode characters.
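A small check of this pattern might look as follows. It is a sketch under two assumptions: the flag is passed as re.UNICODE instead of the inline (?u), and the test words are written with u"..." literals and \x escapes so that the snippet runs unchanged on Python 2 and Python 3. The variable names are made up.

```python
import re

# Word characters minus digits and underscore, i.e. letters only.
letters_only = re.compile(r"^[^\W\d_]+$", re.UNICODE)

umlauts = letters_only.match(u"Gr\xf6\xdfe") is not None      # non-ASCII letters
with_digits = letters_only.match(u"abc123") is not None
with_underscore = letters_only.match(u"snake_case") is not None

print(umlauts)
print(with_digits)
print(with_underscore)
```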

Resources

There is an excellent book on regex [1], the official Python HOWTO guide [2] and the re module documentation [3], as well as a nice regex primer [4] and a cheat-sheet [5]. There is also an informational website dedicated to regular expressions [6].

[1]	Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly, 1st edition

[2]	Regular Expression HOWTO [http://docs.python.org/howto/regex.html#regex-howto]

[3]	Module re [http://docs.python.org/2.7/library/re.html#module-re]

[4]	Regex Primer [http://python.about.com/od/regularexpressions/a/regexprimer.htm]

[5]	Cheat-Sheet [http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/]

[6]	Char classes [http://www.regular-expressions.info/charclass.html]

Tags: rest, text mining
Categories: Tutorial
Parent: Tutorials
