6. Regular Expressions

6.1. Constructing Regular Expressions

6.1.1. Visualizing RegExps

Fig. 6.1. Visualization for pattern r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,20}$'

6.1.2. Regular Expression Syntax

Tab. 6.3. Regular Expression Syntax
Syntax Description
. (Dot.) In the default mode, this matches any character except a newline
^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline
$ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline
* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible
+ Causes the resulting RE to match 1 or more repetitions of the preceding RE
? Causes the resulting RE to match 0 or 1 repetitions of the preceding RE
*?, +?, ?? Adding ? after a quantifier makes it match in non-greedy (minimal) fashion; as few characters as possible will be matched
{m} Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match.
{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
{m,n}? Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.
\ Either escapes special characters (permitting you to match characters like ‘*’, ‘?’, and so forth), or signals a special sequence
[a-z] any character from a to z
[A-Z] any character from A to Z
[0-9] any digit from 0 to 9
[abc] will match a, b or c
| A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
(...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group
(?P<name>...) Like (...), but the substring matched by the group is also accessible via the symbolic group name name
(?P=name) A backreference to a named group; it matches whatever text was matched by the earlier group named name. For example, (?P<tag><.*?>)text(?P=tag) is equivalent to (?P<tag><.*?>)text\1
\number Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group).
\d Matches any Unicode decimal digit; this includes [0-9] and many other digit characters
\s Matches Unicode whitespace characters; this includes [ \t\n\r\f\v] and non-breaking spaces
\w Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore
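
The quantifier and backreference rows of the table are easier to follow with a small experiment. The sketch below is illustrative only; the sample strings and the group name quote are made up for this example.

```python
import re

# Greedy vs. non-greedy quantifiers on the same input
HTML = '<b>bold</b>'
greedy = re.match(r'<.*>', HTML).group()
lazy = re.match(r'<.*?>', HTML).group()

# Numbered backreference: the same word twice, separated by a space
repeated = re.search(r'(\w+) \1', 'it was the the best').group()

# Named backreference: the closing quote must equal the opening quote
quoted = re.search(r'(?P<quote>[\'"]).*?(?P=quote)', 'say "hi" now').group()

print(greedy)    # <b>bold</b>
print(lazy)      # <b>
print(repeated)  # the the
print(quoted)    # "hi"
```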

6.1.3. Regex Flags

Tab. 6.4. Regular Expression Flags
Flag Description
re.IGNORECASE Case-insensitive matching (with Unicode support, e.g. Ü matches ü)
re.MULTILINE ^ matches at the beginning of the string and at the beginning of each line
re.MULTILINE $ matches at the end of the string and at the end of each line
re.DOTALL . also matches newlines
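
A minimal sketch of how each flag from the table changes a match. The sample text and variable names are assumptions chosen for illustration; flags can be combined with the | operator.

```python
import re

TEXT = 'First line\nsecond LINE'

# IGNORECASE: 'line' also matches 'LINE'
ignorecase = re.findall(r'line', TEXT, flags=re.IGNORECASE)
# ['line', 'LINE']

# MULTILINE: ^ anchors at the start of every line, not only the string
multiline = re.findall(r'^\w+', TEXT, flags=re.MULTILINE)
# ['First', 'second']

# DOTALL: without it . stops at the newline, with it . crosses the newline
without_dotall = re.match(r'First.*', TEXT).group()
# 'First line'
with_dotall = re.match(r'First.*', TEXT, flags=re.DOTALL).group()
# 'First line\nsecond LINE'

# Flags are combined with the bitwise OR operator
combined = re.findall(r'^SECOND', TEXT, flags=re.IGNORECASE | re.MULTILINE)
# ['second']
```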

6.2. Most frequently used functions in the re module

6.2.1. re.match()

Code Listing 6.1. Usage of re.match()
import re

PATTERN = r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,20}$'


def is_valid_email(email: str) -> bool:
    """
    Check an email address against the regular expression PATTERN.

    >>> is_valid_email('[email protected]')
    True
    >>> is_valid_email('[email protected]')
    True
    >>> is_valid_email('[email protected]')
    False
    >>> is_valid_email('[email protected]')
    True
    >>> is_valid_email('[email protected]')
    True
    >>> is_valid_email('[email protected]')
    False
    >>> is_valid_email('@nasa.gov')
    False
    >>> is_valid_email('[email protected]')
    False
    """
    return re.match(PATTERN, email) is not None

6.2.3. re.findall() and re.finditer()

Code Listing 6.3. Usage of re.findall() and re.finditer()
import re

# matches issue IDs as used in Redmine and Trac
PATTERN = r'#[0-9]+'
TEXT = "Refs #23919, #31337 Removed obsolete comments"


re.findall(PATTERN, TEXT)
# ['#23919', '#31337']
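
The listing above covers only re.findall(). As a complementary sketch on the same pattern and text, re.finditer() yields match objects instead of strings, so the position of each hit is also available. The name found is an assumption for this example.

```python
import re

PATTERN = r'#[0-9]+'
TEXT = "Refs #23919, #31337 Removed obsolete comments"

# Each item is a match object, so both the text and its span are available
found = [(match.group(), match.span()) for match in re.finditer(PATTERN, TEXT)]

for value, span in found:
    print(value, span)
# #23919 (5, 11)
# #31337 (13, 19)
```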

6.2.4. re.compile()

Code Listing 6.4. Passing the pattern string on every loop iteration (re.match() looks it up in the module's cache, or compiles it, on each call)
import re


DATABASE = [
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '@nasa.gov',
    '[email protected]',
]

PATTERN = r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,}$'


for email in DATABASE:
    re.match(PATTERN, email)

Code Listing 6.5. Compiling once before the loop, so only matching happens inside it
import re


DATABASE = [
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '@nasa.gov',
    '[email protected]',
]

PATTERN = re.compile(r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,}$')


for email in DATABASE:
    PATTERN.match(email)
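
One rough way to compare the two approaches is timeit. The sketch below makes no claim about the size of the difference, which depends on the interpreter and on the fact that the re module caches recently used patterns. SAMPLE is a hypothetical address, and the names EMAIL, COMPILED, by_string and by_object are assumptions for this example.

```python
import re
import timeit

EMAIL = r'^[a-zA-Z0-9][\w.+-]*@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,}$'
COMPILED = re.compile(EMAIL)
SAMPLE = 'user@example.com'  # hypothetical address, not from the listing

# Module-level function: pattern string is resolved on every call
by_string = timeit.timeit(lambda: re.match(EMAIL, SAMPLE), number=10_000)

# Precompiled pattern object: only the matching itself is repeated
by_object = timeit.timeit(lambda: COMPILED.match(SAMPLE), number=10_000)

print(f'string pattern: {by_string:.4f}s, compiled pattern: {by_object:.4f}s')
```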

6.2.5. re.sub()

Code Listing 6.6. Usage of re.sub()
import re


PATTERN = r'\s[a-z]{3}\s'
TEXT = 'Baked Beans And Spam'


re.sub(PATTERN, ' & ', TEXT, flags=re.IGNORECASE)
# 'Baked Beans & Spam'

6.2.6. re.split()

Code Listing 6.7. Usage of re.split()
import re

PATTERN = r'\s[a-z]{3}\s'
TEXT = 'Baked Beans And Spam'


re.split(PATTERN, TEXT, flags=re.IGNORECASE)
# ['Baked Beans', 'Spam']

6.2.7. Comparison between re.match(), re.search() and re.findall()

Code Listing 6.8. Comparison between re.match(), re.search() and re.findall()
import re


PATTERN = r'#[0-9]+'
TEXT = "Refs #23919, #31337 Removed obsolete comments"


re.findall(PATTERN, TEXT)
# ['#23919', '#31337']

re.search(PATTERN, TEXT).group()
# '#23919'

re.match(PATTERN, TEXT)
# None
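
Because re.search() and re.match() return None when nothing matches, calling .group() directly on the result raises AttributeError in that case. A minimal sketch of a guarded lookup follows; the helper name first_issue is hypothetical.

```python
import re

PATTERN = r'#[0-9]+'


def first_issue(text):
    """Return the first issue ID found in text, or None if there is none."""
    match = re.search(PATTERN, text)
    return match.group() if match else None


print(first_issue('Refs #23919, #31337 Removed obsolete comments'))  # #23919
print(first_issue('Removed obsolete comments'))                      # None
```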

6.3. RegEx parameters (variables)

Code Listing 6.9. Usage of group in re.match()
import re

PATTERN = r'(?P<first_name>\w+) (?P<last_name>\w+)'
TEXT = 'José Jiménez'

matches = re.match(PATTERN, TEXT)


matches.group('first_name')
# 'José'

matches.group('last_name')
# 'Jiménez'

matches.group(1)
# 'José'

matches.group(2)
# 'Jiménez'

matches.groups()
# ('José', 'Jiménez')

matches.groupdict()
# {'first_name': 'José', 'last_name': 'Jiménez'}

6.4. Multi line searches

Code Listing 6.10. Usage of re.MULTILINE with re.findall()
import re


PATTERN = r'^#[0-9]+'

TEXT = """
#27533 Fixed inspectdb crash;
#31337 Remove commented out code
"""


re.findall(PATTERN, TEXT)
# []

re.findall(PATTERN, TEXT, flags=re.MULTILINE)
# ['#27533', '#31337']

6.6. Practical example of Regex usage

6.6.1. Making a Phonebook

Code Listing 6.12. Practical example of Regex usage
import re

TEXT = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""


entries = re.split('\n+', TEXT)
# ['Ross McFluff: 834.345.1254 155 Elm Street',
# 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
# 'Frank Burger: 925.541.7625 662 South Dogwood Way',
# 'Heather Albrecht: 548.326.4584 919 Park Place']

out = [re.split(r':?\s', entry, maxsplit=3) for entry in entries]
# [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
# ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
# ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
# ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

# One dict per entry, keyed by field name
addresses = [dict(zip(('fname', 'lname', 'phone', 'address'), row)) for row in out]

from pprint import pprint
pprint(out)

6.6.2. Finding all Adverbs

Code Listing 6.13. Finding all Adverbs
import re


TEXT = 'He was carefully disguised but captured quickly by police.'
ADVERBS = r'\w+ly'

re.findall(ADVERBS, TEXT)
# ['carefully', 'quickly']

6.6.3. Writing a Tokenizer

Code Listing 6.14. Writing a Tokenizer.
import collections
import re

"""
A tokenizer or scanner analyzes a string to categorize groups of characters.
This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions.
The technique is to combine those into a single master regular
expression and to loop over successive matches.
"""

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])


def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',  r':='),           # Assignment operator
        ('END',     r';'),            # Statement terminator
        ('ID',      r'[A-Za-z]+'),    # Identifiers
        ('OP',      r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH',r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0

    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)

        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        else:
            if kind == 'ID' and value in keywords:
                kind = value
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

# Token(typ='IF', value='IF', line=2, column=4)
# Token(typ='ID', value='quantity', line=2, column=7)
# Token(typ='THEN', value='THEN', line=2, column=16)
# Token(typ='ID', value='total', line=3, column=8)
# Token(typ='ASSIGN', value=':=', line=3, column=14)
# Token(typ='ID', value='total', line=3, column=17)
# Token(typ='OP', value='+', line=3, column=23)
# Token(typ='ID', value='price', line=3, column=25)
# Token(typ='OP', value='*', line=3, column=31)
# Token(typ='ID', value='quantity', line=3, column=33)
# Token(typ='END', value=';', line=3, column=41)
# Token(typ='ID', value='tax', line=4, column=8)
# Token(typ='ASSIGN', value=':=', line=4, column=12)
# Token(typ='ID', value='price', line=4, column=15)
# Token(typ='OP', value='*', line=4, column=21)
# Token(typ='NUMBER', value='0.05', line=4, column=23)
# Token(typ='END', value=';', line=4, column=27)
# Token(typ='ENDIF', value='ENDIF', line=5, column=4)
# Token(typ='END', value=';', line=5, column=9)

6.6.4. National Identification Numbers (Worldwide)

6.7. Assignments

6.7.1. PESEL Validation

  1. Conduct a thought experiment (do not write code, just think)

  2. How would you use regular expressions to check:

    • whether the PESEL is valid
    • what the date of birth is (give a datetime.date object)
    • the sex of the user who supplied the PESEL
  3. Given the PESEL “6907”

  4. What expression can be in the first position of a PESEL?

  5. What expression can be in the second position of a PESEL?

  6. What expression can be in the third position of a PESEL?

  7. What expression can be in the fourth position of a PESEL?

  8. What expression can be in the fifth position of a PESEL?

  9. What expression can be in the sixth position of a PESEL?

About:
  • Filename: regex_pesel.py
  • Lines of code to write: 0 lines
  • Estimated time of completion: 10 min
Extra credit:
  • check the validation of PESEL numbers for people born after the year 2000.
  • check the checksum

6.7.2. Parsing text from webpage

  1. The text of John F. Kennedy's “Moon Speech” was downloaded from https://er.jsc.nasa.gov/seh/ricetalk.htm and placed in the listing below
  2. Copy the contents of the listing to the file moon-speech.html
  3. Using regular expressions, extract the text of the JFK speech fragment
  4. Return the first paragraph of the speech text, beginning with the words “We choose to go to the moon”
About:
  • Filename: regex_html.py
  • Lines of code to write: 5 lines
  • Estimated time of completion: 20 min
<html><body><bgsound src="jfktalk.wav" loop="2"><p></p><center><h3>John F. Kennedy Moon Speech - Rice Stadium</h3><img src="jfkrice.jpg"><h3>September 12, 1962</h3></center><p></p><hr><p></p><center>Movie clips of JFK speaking at Rice University: <a href="JFKatRice.mov">(.mov)</a> or <a href="jfkrice.avi">(.avi)</a> (833K)</center><p><a href="jfkru56k.asf">See and hear</a> the entire speech for 56K modem download [8.7 megabytes in a .asf movie format which requires Windows Media Player 7 (speech lasts about 33 minutes)].<br><a href="jfkru100.asf">See and hear</a> the entire speech for higher speed access [25.3 megabytes in .asf movie format which requires Windows Media Player 7].<br><a href="jfkslide.asf">See and hear</a> a five minute audio version of the speech with accompanying slides and music. This is a most inspirational presentation of, perhaps, the most famous space speech ever given. The file is a streaming video Windows Media Player 7 format. [11 megabytes in .asf movie format which requires Windows Media Player 7]. <br><a href="jfk_rice_speech.mpg">See and hear</a> the 17 minute 48 second speech in the .mpg format. This is a very large file of 189 megabytes and only suggested for those with DSL, ASDL, or cable modem access as the download time on a 28.8K or 56K modem would be many hours duration. </p><p></p><hr><p></p><center><h4>TEXT OF PRESIDENT JOHN KENNEDY'S RICE STADIUM MOON SPEECH</h4></center><p>President Pitzer, Mr. Vice President, Governor, CongressmanThomas, Senator Wiley, and Congressman Miller, Mr. Webb, Mr.Bell, scientists, distinguished guests, and ladies and gentlemen:</p><p>We choose to go to the moon. 
We choose to go to the moon in this decade and do the other things, not because they areeasy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills,because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win,and the others, too. </p><p>It is for these reasons that I regard the decision last year to shift our efforts in space from low to high gear as among the mostimportant decisions that will be made during my incumbency in the office of the Presidency. </p><p>In the last 24 hours we have seen facilities now being created for the greatest and most complex exploration in man's history.We have felt the ground shake and the air shattered by the testing of a Saturn C-1 booster rocket, many times as powerful asthe Atlas which launched John Glenn, generating power equivalent to 10,000 automobiles with their accelerators on the floor.We have seen the site where the F-1 rocket engines, each one as powerful as all eight engines of the Saturn combined, will beclustered together to make the advanced Saturn missile, assembled in a new building to be built at Cape Canaveral as tall as a48 story structure, as wide as a city block, and as long as two lengths of this field.</p><p></p><hr><p></p><center><a href="movies.html">Return to Space Movies Cinema</a></center></body></html>