5.1. Regexp Syntax

5.1.1. Rationale

  • Regular Expressions are also known as regexp, regex or re

  • Identifiers - what to find

  • Qualifiers - range to find

  • Quantifiers - how many occurrences of preceding qualifier or identifier

  • Recall information about raw strings r'...'

  • Recall information about escape characters, i.e.:

    • \n - newline,

    • \\n - string of characters with \ and then n

    • . - in regexp means any character

    • \. - just a dot

    • * - in regexp means any number of repetitions (zero or more)

    • \* - just asterisk character
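
The difference between normal and raw strings, and between an escaped and an unescaped dot, can be sketched as follows. The sample text is a made-up value used only for illustration.

```python
import re

# A normal string turns '\n' into one newline character;
# a raw string keeps it as two characters: backslash and 'n'.
normal_length = len('\n')
raw_length = len(r'\n')

SAMPLE = 'Price: 3.50 or 3 50'  # hypothetical sample text

# Unescaped dot matches any character, including the space in '3 50'
any_char = re.findall(r'3.50', SAMPLE)

# Escaped dot matches only a literal dot
literal_dot = re.findall(r'3\.50', SAMPLE)

print(normal_length, raw_length)  # 1 2
print(any_char)                   # ['3.50', '3 50']
print(literal_dot)                # ['3.50']
```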

5.1.2. Identifiers

  • What to find

  • \s - whitespace (space, tab, newline)

  • \S - anything but whitespace

  • \d - digit

  • \D - anything but digit

  • \b - word boundary (zero-width position between a word character and a non-word character)

  • \B - any position that is not a word boundary

  • \w - any Unicode word character: letters (lowercase or uppercase, including diacritics such as ąćęłńóśżź), digits and the underscore

  • \W - anything but a Unicode word character (e.g. whitespace, dots, commas, dashes)

  • \t - tab

  • \n - newline

  • \v - vertical tab

  • \f - form feed
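
A minimal sketch of a few identifiers in use with Python's `re.findall`. The sample strings are assumptions chosen for illustration.

```python
import re

TEXT = 'Mark Watney, age 42, sol 549'  # hypothetical sample text

# \d matches a single digit
digits = re.findall(r'\d', TEXT)

# \S+ matches runs of non-whitespace characters
chunks = re.findall(r'\S+', TEXT)

# \b is a zero-width word boundary, so 'sol' inside 'solar' is skipped
whole_words = re.findall(r'\bsol\b', 'sol solar sol')

print(digits)       # ['4', '2', '5', '4', '9']
print(chunks)       # ['Mark', 'Watney,', 'age', '42,', 'sol', '549']
print(whole_words)  # ['sol', 'sol']
```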

5.1.3. Qualifier

  • Range to find

  • [a-z] - any lowercase ASCII letter from a to z

  • [A-Z] - any uppercase ASCII letter from A to Z

  • [0-9] - any digit from 0 to 9

  • [a-zA-Z] - any ASCII letter from: a to z or from A to Z

  • [a-zA-Z0-9] - any ASCII letter from a to z or from A to Z or digit from 0 to 9

  • [abc] - letter a or b or c

  • a|b - letter a or b (also works with expressions)

  • [a-z]|[0-9] - any lowercase ASCII letter from a to z or digit from 0 to 9

  • . - any character besides newline

  • ^ - start of a string

  • $ - end of a string

Examples:

  • [d-m] - any lowercase letter in the range d-m

  • [3-7] - any digit in the range 3-7

  • [d-mK-P3-8] - any lowercase letter d-m, or any uppercase letter K-P, or any digit 3-8

  • [xz2] - x or z or 2

  • d|x - d or x

  • [d-k]|[ABC]|[3-8] - any lowercase letter d-k, or uppercase A, B or C, or digit 3-8

  • [A-Z][a-z]+ - one uppercase letter followed by one or more lowercase letters
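
The qualifiers above can be tried out with a short sketch like the one below. The input strings are illustrative assumptions.

```python
import re

TEXT = 'Mark Watney of Ares 3'  # hypothetical sample text

# One uppercase ASCII letter followed by one or more lowercase letters
capitalized = re.findall(r'[A-Z][a-z]+', TEXT)

# Only digits within the range 3-7
in_range = re.findall(r'[3-7]', 'sol 2, 5, 9')

# Alternative: d or x
either = re.findall(r'd|x', 'index')

# Anchors: ^ is the start of the string, $ is the end
starts_with_mark = re.search(r'^Mark', TEXT) is not None
ends_with_three = re.search(r'3$', TEXT) is not None

print(capitalized)  # ['Mark', 'Watney', 'Ares']
print(in_range)     # ['5']
print(either)       # ['d', 'x']
```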

5.1.4. Quantifier

  • How many occurrences of preceding qualifier or identifier

Greedy (prefer longest matches):

  • {n} - exactly n times

  • {,n} - maximum n times

  • {n,} - minimum n times

  • {n,m} - minimum n times, maximum m times

  • * - minimum 0 times, no maximum

  • + - minimum 1 time, no maximum

  • ? - minimum 0 times, maximum 1 time (the preceding element is optional)

Non-Greedy (prefer shortest matches):

  • {,n}? - maximum n times, but prefer shorter

  • {n,}? - minimum n times, but prefer shorter

  • {n,m}? - minimum n times, maximum m times, but prefer shorter

  • *? - minimum 0 times, no maximum, but prefer shorter

  • +? - minimum 1 time, no maximum, but prefer shorter

  • ?? - minimum 0 times, maximum 1 time (the preceding element is optional), but prefer shorter

Examples:

  • [0-9]{2} - exactly two digits from 0 to 9

  • \d{2} - exactly two digits from 0 to 9

  • [A-Z]{2,10} - uppercase letter A-Z, at least 2 and at most 10 times

  • [A-Z]{2,10}-[0-9]{,5} - 2 to 10 uppercase letters A-Z, then a hyphen, then at most 5 digits

  • [a-z]+ - at least one lowercase letter, matching as many letters as possible

  • \d+ - an integer (one or more digits)

  • \d+\.\d+ - a decimal fraction
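
A short sketch of the quantifier examples above, run against a made-up sample string (the text and the code format `AB-12345` are assumptions for illustration only).

```python
import re

TEXT = 'Sol 7: pressure 101.3 kPa, crew 6, code AB-12345'  # hypothetical

# One or more digits
integers = re.findall(r'\d+', TEXT)

# Digits, a literal dot, then digits
decimals = re.findall(r'\d+\.\d+', TEXT)

# Exactly two digits; longer digit runs are consumed two at a time
pairs = re.findall(r'\d{2}', TEXT)

# 2 to 10 uppercase letters, a hyphen, then at most 5 digits
codes = re.findall(r'[A-Z]{2,10}-[0-9]{,5}', TEXT)

print(integers)  # ['7', '101', '3', '6', '12345']
print(decimals)  # ['101.3']
print(pairs)     # ['10', '12', '34']
print(codes)     # ['AB-12345']
```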

5.1.5. Negation

  • Logically inverts qualifier

  • [^abc] - anything but letter a or b or c
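
As a minimal illustration, a caret placed as the first character inside square brackets negates the set, while a caret outside brackets anchors the match to the start of the string. The sample inputs are assumptions.

```python
import re

# Every character that is NOT a, b or c
not_abc = re.findall(r'[^abc]', 'aabbccxyz')

# Runs of characters that are not a comma, i.e. the fields between commas
fields = re.findall(r'[^,]+', 'Mark,Watney,botanist')

# Outside brackets ^ means "start of the string", not negation
first_letter = re.findall(r'^[abc]', 'cab')

print(not_abc)       # ['x', 'y', 'z']
print(fields)        # ['Mark', 'Watney', 'botanist']
print(first_letter)  # ['c']
```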

5.1.6. Groups

  • Catch expression results

  • Can be named or positional

  • Can be referenced by position or by name (keyword)

  • () - group

Define:

  • (...) - unnamed (positional) group

  • (?P<name>...) - group named name

Backreference:

  • \1 - refer by position to the first group

  • $1 - refer by position to the first group (some other programming languages)

  • (?P=name) - refer to the group named name

Examples:

  • (\w+) - words or whole numbers

  • \d+(\.\d+)? - a number with or without a decimal fraction part

  • \d+(,\d+)? - a number with a thousands separator (US style), i.e. a comma ,

  • (?P<word>\w+) - group named word consisting of \w+ (any Unicode word character, one or more times)

import re

DATA = 'Mark Watney'
result = re.search(r'(?P<firstname>\w+) (?P<lastname>\w+)', DATA)

result.groupdict()
# {'firstname': 'Mark', 'lastname': 'Watney'}
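
For comparison, a sketch of the same data accessed through positional groups as well as named ones. It reuses the DATA value from above; the variable names are illustrative.

```python
import re

DATA = 'Mark Watney'

# Unnamed groups are numbered from 1; group 0 is the whole match
positional = re.search(r'(\w+) (\w+)', DATA)
whole = positional.group(0)
first = positional.group(1)
both = positional.groups()

# Named groups can be read by name, and still by number
named = re.search(r'(?P<firstname>\w+) (?P<lastname>\w+)', DATA)
lastname = named.group('lastname')
firstname = named['firstname']  # subscript access, Python 3.6+

print(whole)      # Mark Watney
print(first)      # Mark
print(both)       # ('Mark', 'Watney')
print(lastname)   # Watney
print(firstname)  # Mark
```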

5.1.7. Flags

  • re.IGNORECASE - case-insensitive matching

  • re.MULTILINE - changes the meaning of the anchors: ^ matches at the start of each line, $ matches at the end of each line

  • re.DOTALL - . also matches newline characters
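
A minimal sketch of the three flags, including combining them with the `|` operator. The two-line sample text is an assumption for illustration.

```python
import re

TEXT = 'Mark Watney\nmelissa lewis'  # hypothetical two-line text

# IGNORECASE: letter case in the pattern and the text is ignored
ignorecase = re.findall(r'mark|MELISSA', TEXT, flags=re.IGNORECASE)

# By default the dot does not match a newline...
dot_default = re.findall(r'Watney.melissa', TEXT)

# ...with DOTALL it does
dot_all = re.findall(r'Watney.melissa', TEXT, flags=re.DOTALL)

# Flags are combined with the bitwise OR operator
combined = re.findall(r'^m\w+', TEXT, flags=re.IGNORECASE | re.MULTILINE)

print(ignorecase)   # ['Mark', 'melissa']
print(dot_default)  # []
print(dot_all)      # ['Watney\nmelissa']
print(combined)     # ['Mark', 'melissa']
```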

5.1.8. Extensions

  • In other programming languages

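  • [:alnum:] == [a-zA-Z0-9] (POSIX character class, not supported by Python's re module)

  • [:alpha:] == [a-zA-Z] (POSIX character class, not supported by Python's re module)

  • [a-Z] == [a-zA-Z] (accepted only by some locale-aware tools; Python raises re.error for this range)

  • [a-9] == [a-zA-Z0-9] (same caveat as above)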
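
These shortcuts do not carry over to Python. As a sketch, the hypothetical helper `is_valid_pattern` below (not part of the `re` module) shows that Python rejects ranges such as `[a-Z]`, because the code point of `a` is greater than that of `Z`.

```python
import re


def is_valid_pattern(pattern):
    """Return True if Python's re module can compile the pattern.

    Hypothetical helper written for this example only.
    """
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


explicit_ok = is_valid_pattern(r'[a-zA-Z]')
a_to_capital_z = is_valid_pattern(r'[a-Z]')
a_to_nine = is_valid_pattern(r'[a-9]')

print(explicit_ok)     # True
print(a_to_capital_z)  # False
print(a_to_nine)       # False
```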

5.1.9. Matching

  • \ - Escapes special characters (allows matching *, ?, etc)

Table 5.3. Regular Expression Pattern Matching

Syntax         Description
[a-z]          One lowercase letter from a to z
[A-Z]          One uppercase letter from A to Z
[0-9]          One digit from 0 to 9
[a-zA-Z0-9]    One of the following: lowercase letter, uppercase letter or digit
[abc]          One of the following: a, b or c
A|B            Either pattern A or pattern B

5.1.10. Negation

Table 5.4. Regular Expression Pattern Negation

Syntax            Description
[^abc]            None of the following: a, b or c
^(?!.*word).*$    Not containing word
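
The second pattern in the table relies on a negative lookahead `(?!...)`. A minimal sketch, with `error` substituted for `word` and made-up log lines as input:

```python
import re

# Match a whole line only if 'error' does not appear anywhere in it
PATTERN = r'^(?!.*error).*$'

LINES = [  # hypothetical sample lines
    'all systems nominal',
    'fatal error in airlock',
    'errors: none',
]

clean = [line for line in LINES if re.search(PATTERN, line)]

print(clean)  # ['all systems nominal']
```

Note that the third line is rejected as well, because `errors` contains the substring `error`; wrapping the word in `\b` would restrict the test to whole words.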

5.1.11. Unicode

  • \w - Includes most characters that can be part of a word in any language, as well as numbers and the underscore

Table 5.5. Regular Expression Patterns

Syntax    Description
\w        Unicode word character
\d        Unicode decimal digit: [0-9] and many other digit characters
\s        Unicode whitespace characters: [ \t\n\r\f\v] and others, such as non-breaking spaces
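
A sketch of how the Unicode-aware classes behave on `str` patterns, compared with the `re.ASCII` flag that restricts them to ASCII. The sample text is an assumption; `\u0664\u0662` are the Arabic-Indic digits four and two.

```python
import re

TEXT = 'łódź 42 \u0664\u0662'  # hypothetical sample text

# By default \w and \d are Unicode-aware for str patterns
unicode_words = re.findall(r'\w+', TEXT)
unicode_digits = re.findall(r'\d+', TEXT)

# re.ASCII limits \w to [a-zA-Z0-9_] and \d to [0-9]
ascii_words = re.findall(r'\w+', TEXT, flags=re.ASCII)
ascii_digits = re.findall(r'\d+', TEXT, flags=re.ASCII)

# \s also matches a non-breaking space (U+00A0)
nbsp = re.findall(r'\s', 'a\u00a0b')

print(unicode_words)  # three items: the word, '42' and the Arabic-Indic digits
print(ascii_words)    # ['d', '42']
print(ascii_digits)   # ['42']
```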

5.1.12. Qualifiers

Table 5.6. Regular Expression Qualifiers

Syntax    Description
.         Any character except a newline
^         Start of the string
$         End of the string
*         Zero or more repetitions of the preceding pattern (as many as possible)
+         One or more repetitions of the preceding pattern
?         Zero or one repetition of the preceding pattern

5.1.13. Quantifiers

Table 5.7. Regular Expression Quantifiers

Syntax    Description
{m}       Exactly m repetitions of the preceding RE
{m,}      At least m repetitions
{,n}      At most n repetitions
{m,n}     From m to n repetitions of the preceding RE (as many as possible)
{m,n}?    From m to n repetitions of the preceding RE (as few as possible)

5.1.14. Non-Greedy

  • Adding ? after the quantifier makes it non-greedy

  • Non-greedy - as few as possible

  • Greedy - as many as possible

Table 5.8. Regular Expression Greedy and Non-Greedy Qualifiers

Syntax    Description
?         zero or one (greedy)
*         zero or more (greedy)
+         one or more (greedy)
??        zero or one (non-greedy)
*?        zero or more (non-greedy)
+?        one or more (non-greedy)
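
The practical difference between the two modes can be sketched on a made-up HTML fragment and a digit string:

```python
import re

HTML = '<p>first</p><p>second</p>'  # hypothetical sample text

# Greedy .* runs to the LAST '</p>' it can reach
greedy = re.findall(r'<p>.*</p>', HTML)

# Non-greedy .*? stops at the FIRST '</p>' it can reach
lazy = re.findall(r'<p>.*?</p>', HTML)

# The same applies to counted quantifiers
greedy_digits = re.search(r'\d{2,4}', '123456').group()
lazy_digits = re.search(r'\d{2,4}?', '123456').group()

print(greedy)         # ['<p>first</p><p>second</p>']
print(lazy)           # ['<p>first</p>', '<p>second</p>']
print(greedy_digits)  # 1234
print(lazy_digits)    # 12
```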

5.1.15. Flags

Table 5.9. Regular Expression Flags

Flag             Description
re.IGNORECASE    Case-insensitive (with Unicode support, e.g. Ü and ü)
re.MULTILINE     ^ matches at the beginning of the string and of each line
re.MULTILINE     $ matches at the end of the string and of each line
re.DOTALL        . matches newlines

5.1.16. Multiline

  • re.MULTILINE - Flag turns on Multiline search

  • ^ - Matches the start of the string, and immediately after each newline

  • $ - Matches the end of the string or just before the newline at the end of the string; in MULTILINE mode it also matches before each newline
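
A minimal sketch of how `re.MULTILINE` changes both anchors. The sample text (with a trailing newline) is an assumption for illustration.

```python
import re

TEXT = 'Mark Watney\nMelissa Lewis\n'  # hypothetical text, ends with a newline

# Without the flag: ^ only at the start of the string,
# $ only at the end (or just before the final newline)
first_default = re.findall(r'^\w+', TEXT)
last_default = re.findall(r'\w+$', TEXT)

# With MULTILINE: ^ after each newline, $ before each newline
first_multiline = re.findall(r'^\w+', TEXT, flags=re.MULTILINE)
last_multiline = re.findall(r'\w+$', TEXT, flags=re.MULTILINE)

print(first_default)    # ['Mark']
print(last_default)     # ['Lewis']
print(first_multiline)  # ['Mark', 'Melissa']
print(last_multiline)   # ['Watney', 'Lewis']
```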

5.1.17. Groups

  • (?P<name>...) - Define named group

  • (?P=name) - Backreferencing by group name

  • \number - Backreferencing by group number

Table 5.10. Regular Expression Groups

Syntax           Description
(...)            Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group
(?P<name>...)    Substring matched by the group is accessible via the symbolic group name name
(?P=name)        A backreference to a named group
\number          Matches the contents of the group of the same number

Example:

  • (?P<tag><.*?>)text(?P=tag)

  • (?P<tag><.*?>)text\1

  • (.+) \1 - matches 'the the' or '55 55'

  • (.+) \1 - does not match 'thethe' (note the space after the group)
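
The backreference examples above can be checked with a short sketch. The doubled-word sentence at the end is an added illustration, not taken from the examples.

```python
import re

# Positional backreference: the same text must appear twice, separated by a space
repeated = re.search(r'(.+) \1', 'the the')
numbers = re.search(r'(.+) \1', '55 55')
glued = re.search(r'(.+) \1', 'thethe')  # no space, so no match

# Named backreference: the second tag must equal the first one
same_tag = re.search(r'(?P<tag><.*?>)text(?P=tag)', '<b>text<b>')
other_tag = re.search(r'(?P<tag><.*?>)text(?P=tag)', '<b>text<i>')

# Finding accidentally doubled words in a hypothetical sentence
doubled = re.findall(r'\b(\w+) \1\b', 'it is is what it it is')

print(repeated.group())       # the the
print(numbers.group(1))       # 55
print(glued)                  # None
print(same_tag.group('tag'))  # <b>
print(other_tag)              # None
print(doubled)                # ['is', 'it']
```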

import string

string.punctuation
# '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

string.whitespace
# ' \t\n\r\x0b\x0c'

string.ascii_lowercase
# 'abcdefghijklmnopqrstuvwxyz'

string.ascii_uppercase
# 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

string.ascii_letters
# 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

string.digits
# '0123456789'

string.hexdigits
# '0123456789abcdefABCDEF'

string.octdigits
# '01234567'

string.printable
# '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
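
One possible use of these constants is building character classes programmatically. The sketch below is an assumption about how they might be combined with `re`; `re.escape` is needed because `string.punctuation` contains characters such as `]`, `\`, `^` and `-` that are special inside square brackets.

```python
import re
import string

# Character class matching any single ASCII punctuation character
punctuation_class = '[' + re.escape(string.punctuation) + ']'
cleaned = re.sub(punctuation_class, '', 'Hello, World! (test)')

# Hex digits contain no special characters, so no escaping is needed
hex_pattern = '[' + string.hexdigits + ']+'
is_hex = re.fullmatch(hex_pattern, 'c0ffee') is not None
not_hex = re.fullmatch(hex_pattern, 'coffee') is not None

print(cleaned)  # Hello World test
print(is_hex)   # True
print(not_hex)  # False
```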

5.1.18. Examples

  • r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,20}$'
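
A sketch of the pattern above applied to a few made-up addresses. The helper name `is_email` is hypothetical, and the pattern is a simplified check rather than a complete validator of email address syntax.

```python
import re

PATTERN = r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,20}$'


def is_email(text):
    """Return True if text matches PATTERN (hypothetical helper)."""
    return re.match(PATTERN, text) is not None


valid = is_email('mark.watney@nasa.gov')
no_domain_dot = is_email('mark.watney@nasa')      # missing '.tld' part
leading_dot = is_email('.mark@nasa.gov')          # first character must be a letter or digit
with_space = is_email('mark watney@nasa.gov')     # space is not allowed

print(valid)          # True
print(no_domain_dot)  # False
print(leading_dot)    # False
print(with_space)     # False
```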

5.1.19. Visualization

../../_images/regexp-vizualization.png

Figure 5.1. Visualization for pattern r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,20}$'