4.8. Regex Syntax Flag

re.ASCII - perform ASCII-only matching instead of full Unicode matching
re.IGNORECASE - case-insensitive matching
re.LOCALE - make \w, \W, \b, \B and case-insensitive matching dependent on the current locale (discouraged, bytes patterns only)
re.MULTILINE - ^ and $ match at the start and end of each line, not only of the whole string
re.DOTALL - dot (.) also matches newline characters
re.UNICODE - full Unicode matching; the default for str patterns, so the flag is redundant in Python 3
re.VERBOSE - ignores whitespace in the pattern (except inside a character class or when escaped) and allows comments
re.DEBUG - display debugging information about the compiled pattern

The final piece of regex syntax that Python's regular expression engine offers is a means of setting the flags. Usually the flags are set by passing them as additional parameters when calling the re.compile() function, but sometimes it's more convenient to set them as part of the regex itself. The syntax is simply (?flags) where flags is one or more of the following:

a - re.ASCII
i - re.IGNORECASE
L - re.LOCALE
m - re.MULTILINE
s - re.DOTALL
u - re.UNICODE
x - re.VERBOSE

re.DEBUG has no inline equivalent.

If the flags are set this way, they must be put at the start of the regex (since Python 3.11 placing them anywhere else raises an error); they match nothing, so their only effect is to set the flags. The letters used for the flags are the same as the ones used by Perl's regex engine, which is why s is used for re.DOTALL and x is used for re.VERBOSE [1].
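
As a minimal sketch of the two ways of setting flags described above, the snippet below applies the same two flags first through the flags argument and then through an inline (?im) prefix. The sample text and the variable names are illustrative assumptions, not part of the original material.

```python
import re

# Illustrative sample text (an assumption, not from the original section).
TEXT = 'Hello\nworld'

# Flags passed as an argument; several flags are combined with |.
by_argument = re.findall(r'^[a-z]+', TEXT, flags=re.IGNORECASE | re.MULTILINE)

# The same two flags set inline, at the very start of the pattern.
by_inline = re.findall(r'(?im)^[a-z]+', TEXT)

print(by_argument)  # ['Hello', 'world']
print(by_inline)    # ['Hello', 'world']
```

The inline form is handy when the pattern is stored as plain text, for example in a configuration file, and there is no place to pass a flags argument.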

4.8.1. SetUp

>>> import re

4.8.2. ASCII

Short: a
Long: re.ASCII
Perform ASCII-only matching instead of full Unicode matching
Works for \w, \W, \b, \B, \d, \D, \s and \S
ASCII-only matching may be faster, but it does not match non-ASCII characters such as ś or ć

>>> TEXT = 'cześć'  # 'hello' in Polish
>>> re.findall(r'\w', TEXT)
['c', 'z', 'e', 'ś', 'ć']
>>>
>>> re.findall(r'\w', TEXT, flags=re.ASCII)
['c', 'z', 'e']

Note that a range such as [a-z] lists explicit code points, so it matches only the ASCII letters a to z with or without the flag:

>>> TEXT = 'cześć'  # 'hello' in Polish
>>>
>>> re.findall(r'[a-z]', TEXT)
['c', 'z', 'e']
>>>
>>> re.findall(r'[a-z]', TEXT, flags=re.ASCII)
['c', 'z', 'e']
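
The effect of re.ASCII on \d and \s is less obvious than on \w. The following sketch uses a made-up sample string containing an Arabic-Indic digit and a no-break space, and compares both shorthand classes with and without the flag, including the inline (?a) form.

```python
import re

# Hypothetical sample: ASCII digit, ARABIC-INDIC DIGIT TWO (U+0662),
# NO-BREAK SPACE (U+00A0), ASCII digit.
TEXT = '1\u0662\u00a03'

# Default (Unicode) matching: \d accepts any Unicode decimal digit.
digits_unicode = re.findall(r'\d', TEXT)

# ASCII-only matching: \d is equivalent to [0-9].
digits_ascii = re.findall(r'\d', TEXT, flags=re.ASCII)

# Default matching: \s accepts Unicode whitespace such as the no-break space.
spaces_unicode = re.findall(r'\s', TEXT)

# The same flag set inline with (?a): only ASCII whitespace counts.
spaces_ascii = re.findall(r'(?a)\s', TEXT)

print(len(digits_unicode), len(digits_ascii))  # 3 2
print(len(spaces_unicode), len(spaces_ascii))  # 1 0
```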

4.8.3. IGNORECASE

Short: i
Long: re.IGNORECASE
Case-insensitive matching
Has Unicode support, e.g. Ą matches ą

>>> TEXT = 'Email from Mark Watney <mwatney@nasa.gov> received on: Sat, Jan 1st, 2000 at 12:00 AM'
>>>
>>> re.findall(r'NASA', TEXT)
[]
>>>
>>> re.findall(r'NASA', TEXT, flags=re.IGNORECASE)
['nasa']
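
The Unicode support mentioned above can be checked with a short sketch. The sample string is an assumption, chosen only because it contains the letter Ą. Combining re.IGNORECASE with re.ASCII restricts case-insensitive matching to ASCII letters.

```python
import re

# Hypothetical sample: two uppercase Polish letters.
TEXT = 'ĄĆ'

# Case-sensitive: lowercase ą does not match uppercase Ą.
case_sensitive = re.findall(r'ą', TEXT)

# Case-insensitive with full Unicode matching (the default for str patterns).
case_insensitive = re.findall(r'ą', TEXT, flags=re.IGNORECASE)

# Case-insensitive, but limited to ASCII letters.
ascii_only = re.findall(r'ą', TEXT, flags=re.IGNORECASE | re.ASCII)

print(case_sensitive)    # []
print(case_insensitive)  # ['Ą']
print(ascii_only)        # []
```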

4.8.4. LOCALE

Short: L
Long: re.LOCALE
Makes \w, \W, \b, \B and case-insensitive matching dependent on the current locale
Can be used only with bytes patterns
Use of this flag is discouraged as the locale mechanism is very unreliable
It only works with 8-bit locales

>>> import locale
>>>
>>> locale.getlocale()  
('en_US', 'UTF-8')
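
A minimal sketch of how the flag behaves in practice, assuming a current Python 3 interpreter: re.LOCALE is accepted with a bytes pattern and rejected with a str pattern. The sample bytes and the variable names are illustrative only.

```python
import re

# re.LOCALE works with bytes patterns; ASCII letters count as word
# characters in any locale, so this result does not depend on the locale.
words = re.findall(rb'\w+', b'hello world', flags=re.LOCALE)

# With a str pattern the flag is rejected when the pattern is compiled.
try:
    re.compile(r'\w+', flags=re.LOCALE)
except ValueError as error:
    str_pattern_error = str(error)
else:
    str_pattern_error = None

print(words)              # [b'hello', b'world']
print(str_pattern_error)  # message explaining that LOCALE needs a bytes pattern
```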

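4.8.5. MULTILINE

Short: m
Long: re.MULTILINE
Does not let a match span lines; it only changes where ^ and $ match
^ matches at the start of the string and at the start of each line (right after every newline)
$ matches at the end of the string and at the end of each line (right before every newline)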

>>> TEXT = 'hello\nworld'
>>>
>>> re.findall('^[a-z]', TEXT)
['h']
>>>
>>> re.findall('^[a-z]', TEXT, flags=re.MULTILINE)
['h', 'w']

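A match that spans several lines does not need re.MULTILINE; it is enough for the pattern itself to accept newline characters:

>>> TEXT = """We choose to go to the moon.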
... We choose to go to the moon in this decade and do the other things,
... not because they are easy,
... but because they are hard,
... because that goal will serve to organize and measure the best of our energies and skills,
... because that challenge is one that we are willing to accept,
... one we are unwilling to postpone,
... and one which we intend to win,
... and the others, too."""
>>>
>>>
>>> sentence = r'[A-Z][a-z, ]+\.'
>>> re.findall(sentence, TEXT)
['We choose to go to the moon.']
>>>
>>> sentence = r'[A-Z][a-z, \n]+\.'
>>> re.findall(sentence, TEXT)  
['We choose to go to the moon.',
 'We choose to go to the moon in this decade and do the other things,\nnot because they are easy,\nbut because they are hard,\nbecause that goal will serve to organize and measure the best of our energies and skills,\nbecause that challenge is one that we are willing to accept,\none we are unwilling to postpone,\nand one which we intend to win,\nand the others, too.']
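
The first example in this section only exercises ^. As a small sketch with an invented three-line sample, the snippet below shows that $ is affected in the same way, and that the inline (?m) form is equivalent to passing the flag.

```python
import re

# Invented sample with three lines (an assumption for illustration).
TEXT = 'alpha one\nbeta two\ngamma three'

# Without the flag, $ matches only at the end of the whole string.
last_default = re.findall(r'\w+$', TEXT)

# With the flag, $ also matches right before every newline.
last_multiline = re.findall(r'\w+$', TEXT, flags=re.MULTILINE)

# Inline form of the same flag, this time anchoring with ^.
first_multiline = re.findall(r'(?m)^\w+', TEXT)

print(last_default)     # ['three']
print(last_multiline)   # ['one', 'two', 'three']
print(first_multiline)  # ['alpha', 'beta', 'gamma']
```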

4.8.6. DOTALL

Short: s
Long: re.DOTALL
Dot (.) also matches newline characters
By default . matches any character except a newline

>>> TEXT = 'hello\nworld'
>>>
>>> re.findall(r'.', TEXT)
['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
>>>
>>> re.findall(r'.', TEXT, flags=re.DOTALL)
['h', 'e', 'l', 'l', 'o', '\n', 'w', 'o', 'r', 'l', 'd']

Note the \n character among the results when the re.DOTALL flag is turned on.
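
A typical use of re.DOTALL is capturing a block that spans several lines. The sketch below assumes a hypothetical text delimited by BEGIN and END markers; without the flag, .* stops at the first newline and the pattern does not match at all.

```python
import re

# Hypothetical multi-line block delimited by BEGIN and END markers.
TEXT = 'BEGIN\nline one\nline two\nEND'

# Without the flag, . cannot cross the newline, so there is no match.
without_flag = re.search(r'BEGIN(.*)END', TEXT)

# With the flag, .* runs across newlines up to the END marker.
with_flag = re.search(r'BEGIN(.*)END', TEXT, flags=re.DOTALL)
body = with_flag.group(1)

print(without_flag)  # None
print(repr(body))    # '\nline one\nline two\n'
```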

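4.8.7. UNICODE

Short: u
Long: re.UNICODE
On by default for str patterns, so the flag is redundant in Python 3
Selects full Unicode matching
Works for \w, \W, \b, \B, \d, \D, \s and \S
Cannot be combined with re.ASCII or used with bytes patterns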

>>> TEXT = 'cześć'  # 'hello' in Polish
>>>
>>> re.findall(r'\w', TEXT)
['c', 'z', 'e', 'ś', 'ć']
>>>
>>> re.findall(r'\w', TEXT, flags=re.UNICODE)
['c', 'z', 'e', 'ś', 'ć']
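
Because the flag is already on for str patterns, passing it changes nothing. The sketch below, written under the assumption of a current Python 3 interpreter, checks the default flags of a compiled pattern and shows the two combinations that the re module rejects. The variable names are illustrative.

```python
import re

# A compiled str pattern already carries re.UNICODE without asking for it.
default_flags = re.compile(r'\w').flags
unicode_is_default = bool(default_flags & re.UNICODE)

# re.ASCII and re.UNICODE contradict each other and cannot be combined.
try:
    re.compile(r'\w', flags=re.ASCII | re.UNICODE)
except ValueError:
    ascii_with_unicode_ok = False
else:
    ascii_with_unicode_ok = True

# Bytes patterns are always ASCII-only, so re.UNICODE is rejected there too.
try:
    re.compile(rb'\w', flags=re.UNICODE)
except ValueError:
    bytes_with_unicode_ok = False
else:
    bytes_with_unicode_ok = True

print(unicode_is_default)     # True
print(ascii_with_unicode_ok)  # False
print(bytes_with_unicode_ok)  # False
```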