Change - RegEx

Created on Oct. 30, 2012, 2:21 p.m. by Hevok & updated on Nov. 16, 2012, 12:15 a.m. by Hevok

===================
Regular Expressions
===================

A regular expression (regex or RE) describes a set of strings. Used well, REs are a powerful tool for text mining and manipulation.

.. contents:: Contents

RE Concatenation
================

REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, the string pq will match AB.
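
A minimal sketch of the concatenation rule. The sub-patterns ``A`` and ``B`` and the strings ``p`` and ``q`` are hypothetical values chosen only for illustration:

```python
import re

# Hypothetical sub-patterns, chosen only for illustration.
A = r"\d+"      # one or more digits
B = r"[a-z]+"   # one or more lowercase letters
AB = A + B      # concatenating the pattern strings concatenates the REs

p, q = "42", "abc"

# p matches A and q matches B ...
p_ok = re.fullmatch(A, p) is not None
q_ok = re.fullmatch(B, q) is not None

# ... so pq matches AB.
matched = re.fullmatch(AB, p + q)
print(p_ok, q_ok, matched.group(0))
```

``re.fullmatch`` is used here so that each string has to match its pattern completely, not just at the beginning.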

Special Characters
==================

Regexes allow the use of escaped letters and special symbols to match a wide range of strings according to certain rules.

``.``
    Matches any character except a newline. With the DOTALL flag it matches any character, including a newline.
``^``
    Matches the start of the string.
``$``
    Matches the end of the string, or just before a newline at the end of the string.
``x*``
    Matches zero or more repetitions of x.
``x+``
    Matches x one or more times. For example, ``\w+`` matches a run of at least one word character, but never the empty string.
``x?``
    Matches zero or one x, i.e. x is optional.
``*?, +?, ??``
    Add ``?`` after ``*``, ``+``, or ``?`` to make the quantifier less "greedy". ``.*?`` will match the minimum it can, rather than as much as possible.
``{m}``
    Matches exactly m repetitions of the preceding RE.
``x{m,n}``
    Matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
``{m,n}?``
    Matches from m to n repetitions of the preceding RE, attempting to match as few as possible.
``(a|b|c)``
    Matches either a or b or c.
``|``
    Or operator. Matches either the expression on the left of the pipe or the expression on the right.
``[]``
    Indicates a set of characters for a single position in the regex. Put the allowed characters between the brackets.
``()``
    Put parentheses around part of a pattern and pull the matched text out later using the ``.groups()`` method of the match object.
``(...)``
    Indicates a grouping for the regex (a remembered group). The matched values can be retrieved with the ``groups()`` method of the returned match object.
``(?...)``
    Extension notation. The first character after the ``?`` determines the meaning and further syntax of the construct.
``(?iLmsux)``
    One or more letters from the set i, L, m, s, u, x. Each letter sets the corresponding flag (``re.I``, ``re.L``, ``re.M``, ``re.S``, ``re.U``, ``re.X``) for the entire regular expression.
``(?:...)``
    Non-capturing group. Groups the enclosed regex, but the matched text cannot be retrieved afterwards.
``(?P<name>...)``
    Gives the name 'name' to the group so that the matched text can be referenced later.
``(?P=name)``
    Matches again the text that was matched by the group named 'name'.
``(?#...)``
    A comment/remark. The parentheses and their contents are ignored.
``(?=...)``
    Lookahead assertion. Matches if ``...`` matches next, without consuming any of the string.
``(?!...)``
    Negative lookahead assertion. Matches when the part of the regex preceding the parenthesis is not followed by the regex in parentheses.
``(?<=...)``
    Positive lookbehind assertion. Matches the expression to the right of the parentheses when it is preceded by ``...``.
``(?<!...)``
    Negative lookbehind assertion. Matches the expression to the right of the parentheses when it is not preceded by ``...``.
``(?(id/name)yes-pattern|no-pattern)``
    Tries to match yes-pattern if the group with the given id or name exists, and no-pattern otherwise. The no-pattern part is optional.
``\``
    Either escapes a special character so that it is matched literally, or introduces one of the special sequences below.
``\number``
    Matches the contents of the group with that number. Groups are numbered starting from 1.
``\A``
    Matches only at the start of the string. This is similar to ``^``.
``\b``
    A word boundary must occur right here. Matches the empty string that forms the boundary at the beginning or end of a word.
``\B``
    Matches the empty string where it is not at the beginning or end of a word.
``\d``
    Any numeric digit, 0-9.
``\D``
    Matches any character that is not a digit.
``\s``
    A white space character, such as space or tab.
``\S``
    A non-white space character.
``\w``
    A "word" character: a-z, A-Z, 0-9 (i.e. alphanumeric), and the underscore (_).
``\W``
    A non-word character, the opposite of ``\w``. Examples include '&', '$', '@', etc.
``\Z``
    Matches only at the end of the string. This is similar to ``$``.
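
A few of the constructs listed above can be tried out as follows. This is only a sketch: all sample strings and variable names are made up for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample input

# Greedy: .* extends to the last '>' in the string.
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first '>' it can reach.
lazy = re.findall(r"<.*?>", html)

# Lookahead: words that are followed by a colon; the colon is not consumed.
keys = re.findall(r"\w+(?=:)", "name: Ada, role: engineer")

# Negative lookbehind: digit runs not preceded by '$' or by another digit.
plain_numbers = re.findall(r"(?<![$\d])\d+", "$100 and 200")

# Named group plus backreference: the closing quote must equal the opening one.
quoted = [m.group(0) for m in
          re.finditer(r"(?P<quote>['\"]).*?(?P=quote)", "say \"hi\" or 'bye'")]

# Word boundary: 'cat' as a whole word only, not inside 'concatenate'.
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")

print(greedy, lazy, keys, plain_numbers, quoted, whole_words)
```

The greedy pattern returns the whole string as a single match, while the non-greedy one returns the four tags separately.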

Examples
========

.. sourcecode:: python

    import re

    r = re.search('Bal{2,5}', '34eBallllll342')
    if r:
        print("Positive")
    else:
        print("Negative")

    RE, seq = '(?<=abc)def', 'abcdef'
    RE, seq = r'(?<=-)\w+', 'spam-egg'  # Looks for a word following a hyphen
    m = re.search(RE, seq)
    print(m.group(0))

    print('\n#Matching vs. Searching')
    print(re.match("c", "abcdef"))   # None: match() only matches at the start
    print(re.search("c", "abcdef"))  # search() scans through the string

    print('\n#Module')
    pattern = 'ABC'
    string = 'ABCD'
    prog = re.compile(pattern)
    result = prog.match(string)
    # Equivalent one-off call; re.compile is more efficient when the
    # expression will be used several times.
    result = re.match(pattern, string)
    # A compiled pattern's findall(string[, pos[, endpos]]) takes optional
    # positions that limit the search region.

    print('\n#group([group1, ...])')
    m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
    print(m.group(0))     # The entire match
    print(m.group(1))     # The first parenthesized subgroup
    print(m.group(2))     # The second parenthesized subgroup
    print(m.group(1, 2))  # Multiple arguments give a tuple
    m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(m.group('first_name'))
    print(m.group('last_name'))

    print('\n#Named groups can also be referred to by their index:')
    print(m.group(1))
    print(m.group(2))

    print('\n#If a group matches multiple times, only the last match is accessible:')
    m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
    print(m.group(1))                 # Returns only the last match.

    print('\n#groupdict([default])')
    m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
    print(m.groupdict())

    print('\n#Making a Phonebook')
    phonebook = """Ross McFluff: 834.345.1254 155 Elm Street
    Ronald Heathmore: 892.345.3428 436 Finley Avenue
    Frank Burger: 925.541.7625 662 South Dogwood Way
    Heather Albrecht: 548.326.4584 919 Park Place"""
    entries = re.split("\n+", phonebook)
    print(entries)
    print()
    print([re.split(":? ", entry, maxsplit=3) for entry in entries])
    print()
    print([re.split(":? ", entry, maxsplit=4) for entry in entries])

    print('\n#Text Munging')
    import random
    def repl(m):
        # Shuffle the inner letters, keep the first and last letter in place.
        inner_word = list(m.group(2))
        random.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)
    text = '''Some long text here.'''
    print(re.sub(r"(\w)(\w+)(\w)", repl, text))

    print('\n#Finding all Adverbs')
    text = "He was carefully disguised but captured quickly by police."
    print(re.findall(r"\w+ly", text))

    print('\n#Finding all Adverbs and their Positions')
    text = "He was carefully disguised but captured quickly by police."
    for m in re.finditer(r"\w+ly", text):
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

Get Positions of Matches
========================

.. sourcecode:: python

    import re

    p = re.compile("[a-z]")
    for m in p.finditer('a1b2c3d4'):
        print(m.start(), m.group())

Negation
========

A ``^`` at the beginning of square brackets negates all characters included. For instance, to match every non-alphanumeric character except some defined ones, one could formulate a regex like this::

    [^a-zA-Z\d\s:]

* ``\d`` - numeric class
* ``\s`` - white-space
* ``a-zA-Z`` - matches all letters (``\w`` would also include digits and the underscore)
* ``^`` - negates them all; in effect the set matches any character that is not a letter, a digit, white space, or a colon
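
The negated set above can be tried out like this. The sample string is a hypothetical input made up for illustration:

```python
import re

# The negated set from the text: anything that is not a letter,
# a digit, white space, or a colon.
pattern = re.compile(r"[^a-zA-Z\d\s:]")

sample = "Time: 12:30 pm, room #4 (west)!"  # hypothetical input

leftovers = pattern.findall(sample)  # characters matched by the negated set
cleaned = pattern.sub("", sample)    # the same characters stripped out

print(leftovers)
print(cleaned)
```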

Anchors
=======

``^``
    Start of string, or start of line in multi-line pattern
``\A``
    Start of string
``$``
    End of string, or end of line in multi-line pattern
``\Z``
    End of string
``\b``
    Word boundary
``\B``
    Not word boundary
``\<``
    Start of word
``\>``
    End of word
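
The difference between ``^``/``$`` and ``\A`` becomes visible in multi-line mode. A small sketch using Python's ``re.MULTILINE`` flag, with a made-up three-line string:

```python
import re

text = "first line\nsecond line\nthird line"  # hypothetical input

# Without MULTILINE, ^ and $ only anchor at the ends of the whole string.
first_word = re.findall(r"^\w+", text)
last_word = re.findall(r"\w+$", text)

# With MULTILINE, ^ and $ also anchor at every line start and line end.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# \A keeps meaning "start of string", even in multi-line mode.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

print(first_word, last_word, line_starts, line_ends, string_start)
```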

Character Classes
=================

``\c``
    Control character
``\s``
    White space
``\S``
    Not white space
``\d``
    Digit
``\D``
    Not digit
``\w``
    Word
``\W``
    Not word
``\x``
    Hexadecimal digit
``\O``
    Octal digit

Quantifiers
===========

``*``
    0 or more
``+``
    1 or more
``?``
    0 or 1
``{n}``
    Exactly n times
``{n,}``
    n or more
``{n,m}``
    n to m

Add a ``?`` to a quantifier to make it ungreedy.
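
The counted quantifiers can be compared on one made-up input. The ``\b`` anchors are an addition of this sketch so that each run of digits is tested as a whole:

```python
import re

digits = "1 22 333 4444 55555"  # hypothetical input

exactly_three = re.findall(r"\b\d{3}\b", digits)    # {n}: exactly n times
three_or_more = re.findall(r"\b\d{3,}\b", digits)   # {n,}: n or more
two_to_four = re.findall(r"\b\d{2,4}\b", digits)    # {n,m}: n to m

# Greedy vs. ungreedy version of the same range quantifier.
greedy = re.match(r"\d{2,4}", "4444").group(0)
ungreedy = re.match(r"\d{2,4}?", "4444").group(0)

print(exactly_three, three_or_more, two_to_four, greedy, ungreedy)
```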

Groups & Ranges
===============

``.``
    Any character except new line (``\n``)
``(a|b)``
    a or b
``(...)``
    Group
``(?:...)``
    Passive (non-capturing) group
``[abc]``
    Set (a or b or c)
``[^abc]``
    Not a or b or c
``[a-q]``
    Letter from a to q
``[A-Q]``
    Upper case letter from A to Q
``[0-7]``
    Digit from 0 to 7
``\n``
    nth group/subpattern, where n is the group number

Note: Ranges are inclusive.
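
A short sketch of groups, backreferences and ranges in Python's ``re`` module. The inputs are hypothetical:

```python
import re

# \1 refers back to whatever group 1 matched: find doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it was was the the best")

# A passive group (?:...) groups without capturing, so only (c) is remembered.
captured = re.match(r"(?:ab)+(c)", "ababc").groups()

# Ranges are inclusive: [0-7] accepts both 0 and 7, but not 8 or 9.
octal_digits = re.findall(r"[0-7]", "0789")

print(doubled, captured, octal_digits)
```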

Common Metacharacters
=====================

``^ [ . $ { * ( \ + ) | ? < >``

The escape character is usually the backslash. A metacharacter preceded by the escape character is matched literally.
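
In Python, metacharacters can be escaped by hand with a backslash or with the standard library helper ``re.escape``. A minimal sketch with made-up inputs:

```python
import re

# Unescaped, '.' is a metacharacter and matches any character.
loose = re.search("3.5", "3x5")

# re.escape() puts a backslash in front of the metacharacters, so the
# pattern only matches the literal text "3.5".
literal_pattern = re.escape("3.5")
strict_miss = re.search(literal_pattern, "3x5")
strict_hit = re.search(literal_pattern, "price 3.5")

# Escaping by hand: \$ matches a literal dollar sign.
dollars = re.search(r"\$\d+", "cost: $40").group(0)

print(loose is not None, strict_miss, strict_hit.group(0), dollars)
```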

Special Character
=================

``\n``
    New line
``\r``
    Carriage return
``\t``
    Tab
``\v``
    Vertical tab
``\f``
    Form feed
``\xxx``
    Octal character xxx
``\xhh``
    Hex character hh
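
As a rough illustration of these escapes in Python's ``re`` module (the record string is made up):

```python
import re

# Hex escape \x41 and octal escape \101 both denote the character 'A'.
hex_match = re.match(r"\x41", "A")
octal_match = re.match(r"\101", "A")

# Split a hypothetical record on tabs, carriage returns and new lines.
fields = re.split(r"[\t\r\n]+", "id\tname\r\nvalue")

print(hex_match.group(0), octal_match.group(0), fields)
```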

Resources
=========

There is an excellent book on regex [1], the official Python HOWTO guide [2] and the re module documentation [3], as well as a nice regex primer [4] and a cheat-sheet [5].

.. [1] Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly, 1st edition
.. [2] Regular Expression HOWTO
.. [3] Module re
.. [4] Regex Primer
.. [5] Cheat-Sheet

Comment: Included module re.
