Change - RegEx

Created on Oct. 30, 2012, 2:21 p.m. by Hevok & updated on Nov. 10, 2012, 8:40 p.m. by Hevok

=================== ¶
Regular Expressions ¶
=================== ¶
A regular expression (regex or RE) describes a set of strings. Used in the right way REs become very powerful for text mining and manipulation. ¶
¶
.. contents:: Contents ¶
¶
RE Concatenation ¶
================ ¶
REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, the string pq will match AB.
¶
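The concatenation rule can be checked directly in Python. This is only a minimal sketch: the patterns ``A`` and ``B`` and the sample strings are made up for illustration.

```python
import re

# Hypothetical patterns chosen only for illustration.
A = r"[a-z]+"   # one or more lowercase letters
B = r"\d+"      # one or more digits

p, q = "spam", "42"

# p matches A and q matches B ...
assert re.fullmatch(A, p)
assert re.fullmatch(B, q)

# ... so the concatenated string pq matches the concatenated pattern AB.
AB = A + B
concat_match = re.fullmatch(AB, p + q)
print(concat_match.group(0))
```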
¶
Special Characters ¶
================== ¶
Regexes allow the use of escaped letters and special symbols to match a wide range of strings according to certain rules.
¶
.
Matches any character except a newline. With the DOTALL flag it matches any character, including a newline.

^
Matches the start of the string.

$
Matches the end of the string, or just before a newline at the end of the string.
¶
x*
Matches x zero or more times.

x+
Matches x one or more times. ``\w+`` matches a run of at least one word character, but it does not match an empty string.

x?
Matches an optional x (zero or one times).

``*?, +?, ??``
Add ? after ``*``, ``+``, or ``?`` to make them less "greedy". ``.*?`` will match the minimum it can, rather than as much as possible.

{m}
Matches exactly m repetitions of the preceding regex.

x{m,n}
Matches from m to n repetitions of the preceding regex, attempting to match as many repetitions as possible.

{m,n}?
Matches from m to n repetitions of the preceding regex, attempting to match as few as possible.

(a|b|c)
Matches either a or b or c.

``|``
Or operator. Matches either the expression on the left of the pipe or the expression on the right.
¶
[]
Indicates a set of characters for a single position in the regex. List the allowed characters between the brackets.

()
Put parentheses around a part of the pattern to pull the matched text out later using the .groups() method of the match object.
¶
(...) ¶
Indicates a grouping for the regex. A remembered group. Values of what matched can be retrieved by using the groups() method of the object returned by re.search. ¶
¶
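(?...)
Extension notation. The first character after the ? determines the meaning of the construct, as in the entries that follow.

(?iLmsux)
Each letter sets the corresponding flag (re.I, re.L, re.M, re.S, re.U, re.X) for the whole regex.

(?:...)
A non-capturing group. The text it matches cannot be retrieved afterwards.

(?P<name>...)
Gives the name 'name' to the group for later use.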
¶
(?P=name) ¶
Recalls the text matched by the regex named 'name'. ¶
¶
(?#...) ¶
A comment/remark. The parentheses and their contents are ignored. ¶
¶
(?=...)
Lookahead assertion. Matches if ... matches next, without consuming any of the string.
¶
(?!...) ¶
Matches expressions when the part of the regex preceding the parenthesis is not followed by the regex in parentheses. ¶
¶
(?<=...) ¶
Matches the expression to the right of the parentheses when it is preceded by the value of ... ¶
¶
(?<!...) ¶
Matches the expression to the right of the parentheses when it is not preceded by the value of ... ¶
¶
(?(id/name)yes-pattern|no-pattern)
Tries to match yes-pattern if the group with the given id or name exists, and no-pattern if it does not. The no-pattern part is optional.

`\`
Regular expressions use a backslash to escape special characters and to introduce special sequences.

``\number``
Matches the contents of the group of the same number. Groups are numbered starting from 1.

``\A``
Matches only at the start of the string. This is similar to '^'.

``\b``
A word boundary must occur right here. Matches the empty string that forms the boundary at the beginning or end of a word.

``\B``
Matches the empty string that is not at the beginning or end of a word.

``\d``
Any numeric digit, 0-9.

``\D``
Matches any character that is not a digit.

``\s``
A white space character, such as space or tab.

``\S``
A non-white space character.

``\w``
A "word" character: a-z, A-Z, 0-9, and the underscore (_).

``\W``
A non-word character, the opposite of ``\w``. Examples include '&', '$', '@', etc.

``\Z``
Matches only at the end of the string. This is similar to '$'.
¶
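The difference between greedy and non-greedy quantifiers is easiest to see side by side. The following is a small sketch; the sample strings are invented for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.findall(r"<.*?>", html)

# Bounded repetition: {2,4} takes up to four digits, {2,4}? only two.
greedy_digits = re.match(r"\d{2,4}", "123456").group(0)
lazy_digits = re.match(r"\d{2,4}?", "123456").group(0)

print(greedy)
print(lazy)
print(greedy_digits, lazy_digits)
```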
¶
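The lookaround, named-group and conditional constructs listed above are hard to grasp from a one-line description. The sketch below tries each of them on made-up input; all sample strings and variable names are assumptions for illustration only.

```python
import re

prices = "apple 30 USD, pear 45 EUR, plum 12 USD"

# Lookahead (?=...): numbers followed by " USD"; the currency is not consumed.
usd_amounts = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead (?!...): whole numbers not followed by " USD".
# The \b anchors stop the engine from backtracking to a partial number.
other_amounts = re.findall(r"\b\d+\b(?! USD)", prices)

# Lookbehind (?<=...) and negative lookbehind (?<!...).
after_hyphen = re.findall(r"(?<=-)\w+", "spam-egg")
not_after_hyphen = re.findall(r"(?<!-)\b\w+", "spam-egg")

# Named group plus backreference (?P=name): find a doubled word.
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it was the the best")

# Conditional (?(1)yes|no): require '>' only if the optional '<' matched.
address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")
candidates = ("<user@host.com>", "user@host.com", "<user@host.com")
accepted = [s for s in candidates if address.match(s)]

print(usd_amounts, other_amounts)
print(after_hyphen, not_after_hyphen)
print(doubled.group("word"))
print(accepted)
```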
Examples ¶
======== ¶
.. sourcecode:: python

    import re
    import random

    r = re.search('Bal{2,5}', '34eBallllll342')
    if r:
        print("Positive")
    else:
        print("Negative")

    RE, seq = '(?<=abc)def', 'abcdef'
    RE, seq = r'(?<=-)\w+', 'spam-egg'  # Looks for a word following a hyphen

    m = re.search(RE, seq)
    print(m.group(0))

    print('''\n#Matching vs. Searching''')
    print(re.match("c", "abcdef"))   # None: match() only looks at the start of the string
    print(re.search("c", "abcdef"))  # search() scans the whole string

    print('''\n#Module''')
    pattern = 'ABC'
    string = 'ABCD'
    prog = re.compile(pattern)
    result = prog.match(string)
    result = re.match(pattern, string)  # re.compile is more efficient when the expression will be used several times

    # prog.findall(string[, pos[, endpos]]): pos and endpos limit the region that is searched

    print('''\n#group([group1, ...])''')
    m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    print(m.group(0))     # The entire match
    print(m.group(1))     # The first parenthesized subgroup
    print(m.group(2))     # The second parenthesized subgroup
    print(m.group(1, 2))  # Multiple arguments give a tuple

    m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(m.group('first_name'))
    print(m.group('last_name'))

    print('''\n#Named groups can also be referred to by their index:''')
    print(m.group(1))
    print(m.group(2))

    print('''\n#If a group matches multiple times, only the last match is accessible:''')
    m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
    print(m.group(1))                 # Returns only the last match.

    print('''\n#groupdict([default])''')
    m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(m.groupdict())

    print('''\n#Making a Phonebook''')
    phonebook = """Ross McFluff: 834.345.1254 155 Elm Street
    Ronald Heathmore: 892.345.3428 436 Finley Avenue
    Frank Burger: 925.541.7625 662 South Dogwood Way
    Heather Albrecht: 548.326.4584 919 Park Place"""
    entries = re.split("\n+", phonebook)
    print(entries)
    print()
    print([re.split(":? ", entry, maxsplit=3) for entry in entries])
    print()
    print([re.split(":? ", entry, maxsplit=4) for entry in entries])

    print('''\n#Text Munging''')
    def repl(m):
        # Shuffle the inner letters of a word; keep the first and last letter.
        inner_word = list(m.group(2))
        random.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)
    text = '''Some long text here.'''
    print(re.sub(r"(\w)(\w+)(\w)", repl, text))

    print('''\n#Finding all Adverbs''')
    text = "He was carefully disguised but captured quickly by police."
    print(re.findall(r"\w+ly", text))

    print('''\n#Finding all Adverbs and their Positions''')
    text = "He was carefully disguised but captured quickly by police."
    for m in re.finditer(r"\w+ly", text):
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
¶
Get positions of matches ¶
======================== ¶
.. sourcecode:: python

    import re  # http://stackoverflow.com/questions/250271/python-regex-use-how-to-get-positions-of-matches

    p = re.compile("[a-z]")
    for m in p.finditer('a1b2c3d4'):
        print(m.start(), m.group())
¶
Negation ¶
======== ¶
To negate an expression defined inside square brackets, include a ^ at the beginning. For instance, to match every non-alphanumeric character except spaces and colons, one could formulate a pattern like this::

    [^a-zA-Z\d\s:]

* ``\d`` - numeric class
* ``\s`` - white space
* a-zA-Z - matches all letters (``\w`` would also include digits and the underscore)
* ^ - negates them all, so that the set matches every character that is not a letter, a digit, white space or a colon
¶
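As a quick check of the negated set, the following sketch runs it over a made-up sample string and lists the characters that survive the negation.

```python
import re

# Negated set: anything that is not a letter, digit, white space or colon.
negated = re.compile(r"[^a-zA-Z\d\s:]")

sample = "Time: 12:30, cost = 5$ (approx.)"  # invented sample text
leftovers = negated.findall(sample)
print(leftovers)
```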
Anchors ¶
======= ¶
¶
``^``
Start of string, or start of line in multi-line pattern
``\A``
Start of string
``$``
End of string, or end of line in multi-line pattern
``\Z``
End of string
``\b``
Word boundary
``\B``
Not word boundary
``\<``
Start of word
``\>``
End of word
¶
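The difference between ``^``/``$`` and ``\A``/``\Z`` only shows up in multi-line mode. The sketch below uses Python's ``re`` module and an invented two-line string. Note that, as far as Python's ``re`` is concerned, the ``\<`` and ``\>`` anchors from the table are not available; ``\b`` is used for word boundaries instead.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ only anchors at the start of the whole string.
start_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ and $ also anchor at the start and end of every line.
start_multi = re.findall(r"^\w+", text, re.MULTILINE)
end_multi = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z always refer to the whole string, even in MULTILINE mode.
start_string = re.findall(r"\A\w+", text, re.MULTILINE)
end_string = re.findall(r"\w+\Z", text, re.MULTILINE)

# \b requires a word boundary, \B requires that there is none.
whole_words = re.findall(r"\bline\b", "line inline lines line")
inside_words = re.findall(r"\Bline", "line inline lines line")

print(start_default, start_multi, end_multi)
print(start_string, end_string)
print(whole_words, inside_words)
```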
¶
Character Classes ¶
================= ¶
``\c``
Control character
``\s``
White space
``\S``
Not white space
``\d``
Digit
``\D``
Not digit
``\w``
Word
``\W``
Not word
``\x``
Hexadecimal digit
``\O``
Octal digit
¶
Quantifiers ¶
=========== ¶
¶
``*`` ¶
0 or more ¶
``+`` ¶
1 or more ¶
``?`` ¶
0 or 1 ¶
{n} ¶
Exactly n times ¶
{n,} ¶
n or more ¶
{n,m} ¶
n to m ¶
¶
Add a ? to a quantifier to make it ungreedy. ¶
¶
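The quantifier table can be exercised with a few whole-string checks. This is a minimal sketch; the helper ``matches`` and the test strings are hypothetical and exist only for this illustration.

```python
import re

def matches(pattern, string):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# (pattern, string, expected result)
checks = [
    (r"ab*c", "ac", True),          # * allows zero b's
    (r"ab+c", "ac", False),         # + needs at least one b
    (r"ab?c", "abbc", False),       # ? allows at most one b
    (r"ab{2}c", "abbc", True),      # exactly two b's
    (r"ab{2,}c", "abbbbc", True),   # two or more b's
    (r"ab{2,3}c", "abbbbc", False), # two to three b's, four is too many
]

outcomes = [matches(pattern, string) for pattern, string, _ in checks]
for (pattern, string, expected), outcome in zip(checks, outcomes):
    print(pattern, repr(string), outcome)
```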
Groups & Ranges ¶
=============== ¶
.
Any character except new line (``\n``)
(a|b) ¶
a or b ¶
(...) ¶
Group ¶
(?:...) ¶
Passive (non-capturing) group ¶
[abc] ¶
Range (a or b or c) ¶
[^abc] ¶
Not a or b or c ¶
[a-q] ¶
Letter from a to q ¶
[A-Q] ¶
Upper case letter from A to Q ¶
[0-7] ¶
Digit from 0 to 7 ¶
``\n``
nth group/subpattern (for example ``\1`` for the first group)
¶
Note: Ranges are inclusive. ¶
¶
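A short sketch of how groups, backreferences and ranges behave in Python's ``re`` module; the input strings are made up for illustration.

```python
import re

# Capturing groups are returned by groups(); the (?:...) group is not.
date = re.search(r"(\d{4})-(?:\d{2})-(\d{2})", "released 2012-10-30")
captured = date.groups()

# Alternation inside a group: group(1) holds the branch that matched.
spelling = re.fullmatch(r"gr(a|e)y", "grey").group(1)

# Backreference \1 repeats whatever the first group matched (doubled letters).
doubled_letters = re.findall(r"(\w)\1", "letter bookkeeper")

# Ranges are inclusive: [a-q] matches both 'a' and 'q', but not 'r'.
in_range = re.findall(r"[a-q]", "apqrz")

print(captured, spelling)
print(doubled_letters, in_range)
```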
Common Metacharacters ¶
===================== ¶
^ [ . $ { * ( + ) | ? < > ¶
¶
The escape character is usually the backslash. ¶
¶
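To match a metacharacter literally it has to be escaped. The sketch below shows manual escaping with a backslash and, as one possible shortcut, ``re.escape`` from Python's standard library; the sample strings are invented.

```python
import re

literal = "1+1=2 (really?)"

# Used as a pattern, '+', '(', ')' and '?' act as operators,
# so the text does not match itself.
unescaped = re.search(literal, literal)

# re.escape() puts a backslash in front of every special character.
escaped_pattern = re.escape(literal)
escaped = re.search(escaped_pattern, literal)

# Manual escaping: \$ and \. stand for a literal dollar sign and dot.
amounts = re.findall(r"\$\d+\.\d{2}", "cost: $3.50 or $12.00")

print(unescaped)
print(escaped.group(0))
print(amounts)
```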
Special Character ¶
================= ¶
``\n``
New line
``\r``
Carriage return
``\t``
Tab
``\v``
Vertical tab
``\f``
Form feed
``\xxx``
Octal character xxx
``\xhh``
Hex character hh
¶
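These escapes can be used directly inside a pattern. The following sketch assumes Python's ``re`` module and a made-up tab-separated record; in Python a three-digit octal escape such as ``\101`` is read as a character, not as a group reference.

```python
import re

record = "name\tAlice\r\nname\tBob\n"  # invented sample data

# \t matches the tab; \r?\n accepts both Windows and Unix line endings.
names = re.findall(r"name\t(\w+)\r?\n", record)

# \x41 (hex) and \101 (octal) both stand for the character 'A'.
hex_hits = re.findall(r"\x41", "ABCA")
octal_hits = re.findall(r"\101", "ABCA")

print(names)
print(hex_hits, octal_hits)
```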
¶
Resources ¶
========= ¶
There is an excellent book on regex [1], the official Python HOWTO guide [2], a nice regex primer [3] and a cheat-sheet [4].

.. [1] Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly, 1st edition
.. [2] Regular Expression HOWTO [http://docs.python.org/howto/regex.html#regex-howto]
.. [3] Regex Primer [http://python.about.com/od/regularexpressions/a/regexprimer.htm]
.. [4] Cheat-Sheet [http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/]


Comment: Reformulated negation description.
