Change: RegEx

created on Oct. 30, 2012, 2:21 p.m. by Hevok & updated on Nov. 10, 2012, 6:49 p.m. by Hevok

===================
Regular Expressions
===================

A regular expression (regex or RE) describes a set of strings. Used in the right way, REs become very powerful for text mining and manipulation.

RE Concatenation

REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, the string pq will match AB.
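The concatenation rule can be checked directly. The following is a minimal sketch, assuming Python 3.4 or later (for ``re.fullmatch``); the sub-patterns ``A`` and ``B`` and the strings ``p`` and ``q`` are made up purely for illustration.

```python
import re

# Hypothetical sub-patterns, chosen only to illustrate the rule.
A = r"\d+"       # one or more digits
B = r"[a-z]+"    # one or more lowercase letters
AB = A + B       # joining the pattern strings concatenates the two REs

p, q = "2012", "regex"

# p matches A and q matches B ...
assert re.fullmatch(A, p)
assert re.fullmatch(B, q)

# ... so pq matches AB.
concat_match = re.fullmatch(AB, p + q)
print(concat_match.group(0))   # 2012regex
```

Joining pattern strings this way is only safe when neither part contains a top-level ``|``; in that case each part probably needs to be wrapped in ``(?:...)`` first.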

Special Characters

Regexes allow the use of escaped letters and special symbols to match a wide range of strings according to certain rules.

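. Matches any character except a newline. With the DOTALL flag it matches any character, including a newline.

^ Matches the start of the string. With the MULTILINE flag it also matches immediately after each newline.

$ Matches the end of the string, or just before a newline at the end of the string. With the MULTILINE flag it also matches before each newline.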

x* Matches zero or more instances of the preceding pattern x.

x+ Matches x one or more times. It extends the preceding pattern to match one or more times: \w+ matches a run of at least one word character, so it matches 20 but never the empty string.

x? Matches an optional x, that is, zero or one instance of the pattern.

*?, +?, ?? Put a ? after \*, +, or ? to make the quantifier less "greedy." .*? will match the minimum it can, rather than as much as possible.

{m} Specifies that exactly m instances of the preceding regex should be matched.

x{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. It thus specifies a range for the number of instances that should be matched.

{m,n}? Specifies a range for the number of instances that should be matched, matching as few as possible.

(a|b|c) Matches either a or b or c.

| Or operator. Matches either the expression on the left of the pipe or the expression on the right.

[] Indicates a set of characters for a single position in the regex. Put the allowed characters between the brackets, e.g. [abc] or the range [a-z].

() Put parentheses around a set of characters and pull them out later using the .groups() method of the match object.

(...) Indicates a grouping for the regex. A remembered group. The values that matched can be retrieved with the groups() method of the object returned by re.search.

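(?...) Extension notation. The first character after the ? determines the meaning of the construct; the entries that follow list the individual forms.

(?iLmsux) One or more of the letters i, L, m, s, u, x. Each letter sets the corresponding flag (re.I, re.L, re.M, re.S, re.U, re.X) for the entire regex.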

(?:...) Non-capturing group. Groups the regex, but the matched text cannot be retrieved afterwards.

(?P<name>...) Gives the name 'name' to the group for later use.

(?P=name) Recalls the text matched by the group named 'name'.

(?#...) A comment/remark. The parentheses and their contents are ignored.

(?=...) Lookahead. Matches when the part of the regex preceding the parentheses is followed by the regex in parentheses, which itself is not consumed.

(?!...) Negative lookahead. Matches when the part of the regex preceding the parentheses is not followed by the regex in parentheses.

(?<=...) Lookbehind. Matches the expression to the right of the parentheses when it is preceded by the value of ...

(?<!...) Negative lookbehind. Matches the expression to the right of the parentheses when it is not preceded by the value of ...

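(?(id/name)yes-pattern|no-pattern) Conditional match. Tries yes-pattern if the group with the given id or name took part in the match, and no-pattern otherwise; no-pattern is optional.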

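\\ Regular expressions use a backslash to introduce special sequences and to escape special characters; a doubled backslash matches a literal backslash.

\number Matches the contents of the group with the same number (a backreference). Groups are numbered starting from 1.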

\A Matches the start of the string. This is similar to '^'.

\b A word boundary must occur right here. Matches the empty string that forms the boundary at the beginning or end of a word.

\B Matches the empty string, but only where it is not at the beginning or end of a word.

\d Any numeric digit. A digit character, 0-9.

\D Matches any character that is not a digit.

\s A white space character, such as space or tab.

\S A non-white space character.

\w A "word" character: a-z, A-Z, 0-9, and a few others, such as underscore (_).

\W A non-word character - the opposite of \w. Examples include '&', '$', '@', etc.

\Z Matches only at the end of the string. This is similar to '$'.
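A few of the entries above are easier to grasp when seen in action. The sketch below is only an illustration, assuming Python 3; the sample strings and variable names are invented. It contrasts greedy and non-greedy quantifiers, uses a lookahead, and refers back to a group by number and by name.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so a single match spans all tags.
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first '>' it can reach.
lazy = re.findall(r"<.*?>", html)

# Lookahead: a word only if a colon follows; the colon is not consumed.
keys = re.findall(r"\w+(?=:)", "name: Ada, born: 1815")

# Backreferences: a word that is immediately repeated,
# once referenced by group number and once by group name.
doubled_num = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")

print(greedy)                        # ['<b>bold</b> and <i>italic</i>']
print(lazy)                          # ['<b>', '</b>', '<i>', '</i>']
print(keys)                          # ['name', 'born']
print(doubled_num.group(1))          # is
print(doubled_name.group("word"))    # is
```

The \b anchors in the last two patterns keep the "is" inside "this" from being counted as the first half of a repeated word.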

Examples

.. sourcecode:: python

import re

# 'l{2,5}' matches two to five consecutive 'l' characters
r = re.search('Bal{2,5}', '34eBallllll342')
if r:
    print("Positive")
else:
    print("Negative")

RE, seq = r'(?<=abc)def', 'abcdef'    # Overwritten by the next line; swap the two lines to try it
RE, seq = r'(?<=-)\w+', 'spam-egg'    # Looks for a word following a hyphen

m = re.search(RE, seq)
print(m.group(0))                     # egg

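print('''\n#Matching vs. Searching''')
print(re.match("c", "abcdef"))     # None: match() only looks at the start of the string
print(re.search("c", "abcdef"))    # A match object: search() scans the whole string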

print('''\n#Module''')
pattern = 'ABC'
string = 'ABCD'
prog = re.compile(pattern)
result = prog.match(string)
result = re.match(pattern, string)  # Equivalent one-step call; re.compile is more efficient when the expression will be used several times

# prog.findall(string[, pos[, endpos]]): pos and endpos limit the region of the string that is searched

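print('''\n#group([group1, ...])''')
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0))       # The entire match
print(m.group(1))       # The first parenthesized subgroup
print(m.group(2))       # The second parenthesized subgroup
print(m.group(1, 2))    # Multiple arguments give a tuple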

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print(m.group('first_name'))
print(m.group('last_name'))

print('''\n#Named groups can also be referred to by their index:''')
print(m.group(1))
print(m.group(2))

print('''\n#If a group matches multiple times, only the last match is accessible:''')
m = re.match(r"(..)+", "a1b2c3")    # Matches 3 times.
print(m.group(1))                   # Only the last match ('c3') is accessible.

print('''\n#groupdict([default])''')
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print(m.groupdict())

print('''\n#Making a Phonebook''')
# Named 'phonebook' rather than 'input' to avoid shadowing the built-in input()
phonebook = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger:             925.541.7625       662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place"""
entries = re.split("\n+", phonebook)
print(entries)
print("")
print([re.split(":? ", entry, maxsplit=3) for entry in entries])
print("")
print([re.split(":? ", entry, maxsplit=4) for entry in entries])

print('''\n#Text Munging''')
import random

def repl(m):
    # Keep the first and last letter of a word and shuffle the letters in between
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = '''Some long text here.'''
print(re.sub(r"(\w)(\w+)(\w)", repl, text))

print('''\n#Finding all Adverbs''')
text = "He was carefully disguised but captured quickly by police."
print(re.findall(r"\w+ly", text))

print('''\n#Finding all Adverbs and their Positions''')
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))


Get positions of matches

.. sourcecode:: python

import re # http://stackoverflow.com/questions/250271/python-regex-use-how-to-get-positions-of-matches
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print("%d %s" % (m.start(), m.group()))    # Start index and matched letter

Negation

To negate an expression defined in square brackets, include a ^ at the very beginning. For instance, to match every non-alphanumeric character except whitespace and the colon, use this pattern::

[^a-zA-Z\d\s:]
  • \d - any digit
  • \s - any whitespace character
  • a-zA-Z - matches all ASCII letters (\w would also include the digits and the underscore)
  • ^ - negates the whole set, so the pattern matches any single character that is not a letter, a digit, whitespace or a colon
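The negated set can be tried on a short string. This is a small sketch assuming Python 3; the sample text and variable names are invented for the example.

```python
import re

sample = "Time: 12:30, cost = $5 & tax!"

# Every character that is not a letter, digit, whitespace or colon.
leftovers = re.findall(r"[^a-zA-Z\d\s:]", sample)
print(leftovers)    # [',', '=', '$', '&', '!']

# A ^ that is not the first character of the set is an ordinary character.
literal_caret = re.findall(r"[a^]", "a^b")
print(literal_caret)    # ['a', '^']
```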

Resources

There is an excellent book on regex [1] and the official Python HOWTO guide [2] as well as a nice regex primer [3].

.. [1] Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly, 1st edition
.. [2] Regular Expression HOWTO [http://docs.python.org/howto/regex.html#regex-howto]
.. [3] Regex Primer [http://python.about.com/od/regularexpressions/a/regexprimer.htm]


Tags: rest, text mining
Categories: Tutorial
Parent: Tutorials

Comment: Updated entry
