2.6. Type Str Methods

2.6.1. Rationale

  • str is immutable

  • str methods return a new, modified str; the original str stays unchanged

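a = 'Python'
a.replace('P', 'J')  # the returned str is discarded

print(a)  # Python

a = 'Python'
b = a.replace('P', 'J')  # the returned str is stored under a new name

print(a)  # Python
print(b)  # Jython

a = 'Python'
a = a.replace('P', 'J')  # the name a now points to the new str

print(a)  # Jython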
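
A minimal sketch of what immutability means in practice: a str does not support item assignment, so every change has to produce a new object. The variable names below are illustrative only.

```python
a = 'Python'

try:
    a[0] = 'J'  # str does not support item assignment
except TypeError as error:
    message = str(error)

b = 'J' + a[1:]  # build a new str instead of modifying a

print(a)  # Python
print(b)  # Jython
```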

2.6.2. Change Case

  • Unify data format before analysis

name = 'Angus MacGyver III'

name.upper()       # 'ANGUS MACGYVER III'
name.lower()       # 'angus macgyver iii'
name.title()       # 'Angus Macgyver Iii'
name.capitalize()  # 'Angus macgyver iii'
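
Why unifying the format matters can be sketched as follows: two records that differ only in letter case compare as different until both are converted to the same case. The second value is a hypothetical variant made up for illustration.

```python
first = 'Angus MacGyver III'
second = 'ANGUS MACGYVER iii'  # hypothetical variant of the same name

raw_equal = first == second
normalized_equal = first.lower() == second.lower()

print(raw_equal)         # False
print(normalized_equal)  # True
```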

2.6.3. Replace

name = 'Jan Twardowski Iii'

name.replace('Iii', 'III')
# 'Jan Twardowski III'
Listing 2.34. Naive sanitization. With the commands in reverse order (rm -fr / && ls), the result would still delete files
cmd = input('Type system command to execute: ').strip()
# Type system command to execute: ls && rm -fr /

cmd = cmd.replace('&&', '#')
print(cmd)
# ls # rm -fr /
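
To illustrate the caveat from the listing caption, here is a sketch with a hard-coded, hypothetical input in which the dangerous command comes first. After the replacement only ls is commented out, while rm -fr / is left intact. The string is only printed, never executed.

```python
cmd = 'rm -fr / && ls'  # hypothetical input with the commands swapped

sanitized = cmd.replace('&&', '#')
print(sanitized)
# rm -fr / # ls
```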

2.6.4. Strip Whitespace

name = '\tJan Twardowski    \n'

name.strip()        # 'Jan Twardowski'
name.rstrip()       # '\tJan Twardowski'
name.lstrip()       # 'Jan Twardowski    \n'
cmd = input('Type system command to execute: ').strip()
print(cmd)

2.6.5. Starts With

'Jan Twardowski'.startswith('Jan')  # True
START = ('vir', 'ver')

'virginica'.startswith(START)       # True
'versicolor'.startswith(START)      # True
'setosa'.startswith(START)          # False
Listing 2.35. Checks whether the command typed by the user starts with a forbidden command
forbidden = ('rm', 'cp', 'mv')

cmd = input('Type system command to execute: ').strip()
cmd.startswith(forbidden)
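
The same check can be tried without input() on a few hard-coded, hypothetical commands. This is only a sketch; it also shows that a prefix check is neither precise nor hard to bypass.

```python
forbidden = ('rm', 'cp', 'mv')

a = 'rm -fr /'.startswith(forbidden)       # True
b = 'ls -la'.startswith(forbidden)         # False
c = 'rmdir build'.startswith(forbidden)    # True, 'rmdir' also starts with 'rm'
d = 'sudo rm -fr /'.startswith(forbidden)  # False, the check is easy to bypass
```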

2.6.6. Ends With

'Jan Twardowski'.endswith(';')      # False
allowed = ('gov', 'int')

'nasa.gov'.endswith(allowed)         # True
'esa.int'.endswith(allowed)          # True
'roscosmos.ru'.endswith(allowed)     # False
Listing 2.36. Checks whether the email typed by the user ends with an allowed domain
allowed = ('gov', 'int')

email = input('Type your email: ').strip()
email.endswith(allowed)
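
Note that a suffix without the leading dot also matches unrelated endings. The sketch below compares the tuple used above with a stricter variant; the name strict and the sample addresses are assumptions made for illustration.

```python
loose = ('gov', 'int')
strict = ('.gov', '.int')  # hypothetical stricter variant

email = 'mark.watney@example.notgov'  # hypothetical address

a = email.endswith(loose)                      # True, 'notgov' ends with 'gov'
b = email.endswith(strict)                     # False
c = 'mark.watney@nasa.gov'.endswith(strict)    # True
```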

2.6.7. Split by Line

DATA = """First Line
Second Line
Third Line
"""

DATA.splitlines()
# [
#   'First Line',
#   'Second Line',
#   'Third Line'
# ]
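
For comparison, a short sketch of how splitlines() differs from split('\n') on the same data: splitlines() does not produce a trailing empty string, and it also recognizes Windows line endings.

```python
DATA = """First Line
Second Line
Third Line
"""

a = DATA.split('\n')
# ['First Line', 'Second Line', 'Third Line', '']

b = DATA.splitlines()
# ['First Line', 'Second Line', 'Third Line']

c = 'First Line\r\nSecond Line'.splitlines()
# ['First Line', 'Second Line']
```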

2.6.8. Split by Character

  • With no argument, split() splits on any run of whitespace characters

setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']
text = 'We choose to go to the Moon'

text.split(' ')
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
Listing 2.37. Naive sanitization. The standard library provides shlex.split() for splitting shell commands
cmd = input('Type system command to execute: ').strip()
# Type system command to execute: ls && rm -fr /

cmd.split('&&')
# ['ls ', ' rm -fr /']
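
A minimal sketch of shlex.split() from the standard library, using a hard-coded string instead of input(). It splits a command line into tokens the way a POSIX shell would, including quoted arguments. It only tokenizes; deciding which tokens are safe is still up to the program.

```python
import shlex

cmd = 'ls && rm -fr /'  # hard-coded instead of input()

a = shlex.split(cmd)
# ['ls', '&&', 'rm', '-fr', '/']

b = shlex.split('echo "Hello World"')
# ['echo', 'Hello World']
```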

2.6.9. Join by Character

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
# 'We choose to go to the Moon'
setosa = ['5.1', '3.5', '1.4', '0.2', 'setosa']

','.join(setosa)
# '5.1,3.5,1.4,0.2,setosa'
crew = ['Mark Watney', 'Jan Twardowski', 'Melissa Lewis']

'\n'.join(crew)
# 'Mark Watney\nJan Twardowski\nMelissa Lewis'

print('\n'.join(crew))
# Mark Watney
# Jan Twardowski
# Melissa Lewis
TEXT = ['We choose to go to the Moon!',
        'We choose to go to the Moon in this decade and do the other things,',
        'not because they are easy, but because they are hard;',
        'because that goal will serve to organize and measure the best of our energies and skills,',
        'because that challenge is one that we are willing to accept, one we are unwilling to postpone,',
        'and one we intend to win, and the others, too.']

print('\n'.join(TEXT))
# We choose to go to the Moon!
# We choose to go to the Moon in this decade and do the other things,
# not because they are easy, but because they are hard;
# because that goal will serve to organize and measure the best of our energies and skills,
# because that challenge is one that we are willing to accept, one we are unwilling to postpone,
# and one we intend to win, and the others, too.
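
All examples above join lists of str. As a sketch of what happens otherwise: join() raises TypeError for non-str items, so numbers have to be converted first. The list of floats is a hypothetical sample.

```python
values = [5.1, 3.5, 1.4, 0.2]  # floats, not str

try:
    ','.join(values)
except TypeError as error:
    message = str(error)  # join() accepts only str items

result = ','.join(str(x) for x in values)
# '5.1,3.5,1.4,0.2'
```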

2.6.10. Expand Tabs

'01\t012\t0123\t01234'.expandtabs()
# '01      012     0123    01234'

'01\t012\t0123\t01234'.expandtabs(4)
# '01  012 0123    01234'

2.6.11. Is Whitespace

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True

Figure 2.5. ISS - International Space Station. Credits: NASA/Crew of STS-132 (img: s132e012208).

2.6.12. Is Alphabetic

'hello'.isalpha()   # True
'hello1'.isalpha()  # False

2.6.13. Is Numeric

'1'.isdecimal()     # True
'+1'.isdecimal()    # False
'-1'.isdecimal()    # False
'1.'.isdecimal()    # False
'1,'.isdecimal()    # False
'1.0'.isdecimal()   # False
'1,0'.isdecimal()   # False
'1_0'.isdecimal()   # False
'10'.isdecimal()    # True

'1'.isdigit()       # True
'+1'.isdigit()      # False
'-1'.isdigit()      # False
'1.'.isdigit()      # False
'1,'.isdigit()      # False
'1.0'.isdigit()     # False
'1,0'.isdigit()     # False
'1_0'.isdigit()     # False
'10'.isdigit()      # True

'1'.isnumeric()     # True
'+1'.isnumeric()    # False
'-1'.isnumeric()    # False
'1.'.isnumeric()    # False
'1,'.isnumeric()    # False
'1.0'.isnumeric()   # False
'1,0'.isnumeric()   # False
'1_0'.isnumeric()   # False
'10'.isnumeric()    # True

'1'.isalnum()       # True
'+1'.isalnum()      # False
'-1'.isalnum()      # False
'1.'.isalnum()      # False
'1,'.isalnum()      # False
'1.0'.isalnum()     # False
'1,0'.isalnum()     # False
'1_0'.isalnum()     # False
'10'.isalnum()      # True
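
For ASCII input isdecimal(), isdigit() and isnumeric() give the same answers, as the tables above show. The difference appears with other Unicode characters. The sketch below uses escape sequences for a superscript two and a vulgar fraction one half.

```python
superscript_two = '\u00b2'  # the character '²'
one_half = '\u00bd'         # the character '½'

a = ('2'.isdecimal(), '2'.isdigit(), '2'.isnumeric())
# (True, True, True)

b = (superscript_two.isdecimal(), superscript_two.isdigit(), superscript_two.isnumeric())
# (False, True, True)

c = (one_half.isdecimal(), one_half.isdigit(), one_half.isnumeric())
# (False, False, True)
```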

2.6.14. Find Sub-String Position

text = 'We choose to go to the Moon'

text.find('M')      # 23
text.find('Moo')    # 23
text.find('x')      # -1
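
A short sketch of related methods on the same text: rfind() searches from the right, and index() behaves like find() but raises ValueError instead of returning -1.

```python
text = 'We choose to go to the Moon'

a = text.find('to')     # 10, first occurrence
b = text.rfind('to')    # 16, last occurrence
c = text.index('Moo')   # 23, same as find() when the substring exists

try:
    text.index('x')
    raised = False
except ValueError:
    raised = True       # index() raises instead of returning -1
```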

2.6.15. Contains

'Monty' in 'Python'  # False
'Py' in 'Python'     # True
'py' in 'Python'     # False
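
Since the in operator is case-sensitive, a case-insensitive check can be sketched by normalizing both sides first, for example with lower() or casefold().

```python
text = 'Python'

a = 'py' in text                        # False
b = 'py' in text.lower()                # True
c = 'PY'.casefold() in text.casefold()  # True
```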

2.6.16. Count Occurrences

text = 'Moon'

text.count('o')     # 2
text.count('Moo')   # 1
text.count('x')     # 0

2.6.17. Remove Prefix or Suffix

New in Python 3.9. See PEP 616: String methods to remove prefixes and suffixes.

filename = '1969-apollo11.tmp'

filename.removeprefix('1969-')
# 'apollo11.tmp'

filename.removesuffix('.tmp')
# '1969-apollo11'

filename.removeprefix('1969-').removesuffix('.tmp')
# 'apollo11'
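
A common mistake is to use rstrip() or lstrip() for this purpose. Their argument is a set of characters, not a substring, as the sketch below shows on a hypothetical file name. The last lines show one possible fallback for Python older than 3.9.

```python
filename = 'report.tmp'  # hypothetical file name

a = filename.removesuffix('.tmp')
# 'report'

b = filename.rstrip('.tmp')
# 'repor', the final 't' of 'report' is stripped too

# Possible fallback for Python older than 3.9
c = filename[:-len('.tmp')] if filename.endswith('.tmp') else filename
# 'report'
```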

2.6.18. Method Chaining

a = 'Python'

a = a.upper()
a = a.replace('P', 'C')
a = a.title()

print(a)
# Cython
a = 'Python'
a = a.upper().replace('P', 'C').title()

print(a)
# Cython
a = 'Python'
a.upper().replace('P', 'C').title()

# a -> 'Python'
# 'Python'.upper() -> 'PYTHON'
# 'PYTHON'.replace('P', 'C') -> 'CYTHON'
# 'CYTHON'.title() -> 'Cython'
Listing 2.38. Note that no character, not even a space, may follow the \ character
a = 'Python'

a = a \
    .upper() \
    .replace('P', 'C') \
    .title()

print(a)
a = 'Python'

a = (a
    .upper()
    .replace('P', 'C')
    .title())

print(a)
a = 'Python'

a = a.upper().startswith('P').replace('P', 'C')
# Traceback (most recent call last):
#     ...
# AttributeError: 'bool' object has no attribute 'replace'

2.6.19. Cleaning User Input

  • It is often said that about 80% of machine learning and data science work is cleaning data

  • Is This the Same Address?

  • This is a dump of distinct records of a single address

  • Which of the records below is the correct address?

Listing 2.39. Addresses
'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three lowercase letters 'l', not capital 'I'
Listing 2.40. Streets
'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'

'os'
'os.'
'Os.'
'osiedle'
'oś'
'oś.'
'Oś.'
'ośedle'

'pl'
'pl.'
'Pl.'
'plac'

'al'
'al.'
'Al.'
'aleja'
'aleia'
'alei'
'aleii'
'aleji'
Listing 2.41. House and Apartment Number
'Ćwiartki 3/4'
'Ćwiartki 3 / 4'
'Ćwiartki 3 m. 4'
'Ćwiartki 3 m 4'
'Brighton Beach 1st apt 2'
'Brighton Beach 1st apt. 2'
'Myśliwiecka 3/5/7'

'Jana Twardowskiego 180f/8f'
'Jana Twardowskiego 180f/8'
'Jana Twardowskiego 180/8f'

'Jana Twardowskiego III 3 m. 3'
'Jana Twardowskiego 13d bud. A piętro II sala 3'
Listing 2.42. Phone Numbers
+48 (12) 355 5678
+48 123 555 678

123 555 678

+48 12 355 5678
+48 123-555-678
+48 123 555 6789

+1 (123) 555-6789
+1 (123).555.6789

+1 800-python
+48123555678

+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,,2
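
A minimal sketch of how the simplest of the phone number formats above could be unified using only replace() and method chaining. It handles separators only; extensions such as wew. 1337, pauses written as commas, and letters as in 800-python would need additional rules.

```python
phone = '+48 123-555-678'  # one of the records listed above

result = (phone
    .replace(' ', '')
    .replace('-', '')
    .replace('(', '')
    .replace(')', '')
    .replace('.', ''))

print(result)
# +48123555678
```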

2.6.20. Assignments

2.6.20.1. Type String Normalize

  • Assignment name: Type String Normalize

  • Last update: 2020-10-01

  • Complexity level: easy

  • Lines of code to write: 8 lines

  • Estimated time of completion: 5 min

  • Solution: solution/type_str_normalize.py

English
  1. Use data from "Input" section (see below)

  2. Use str methods to clean DATA

  3. Compare result with "Output" section (see below)

Input
DATA = 'UL. jana \tTWArdoWskIEGO 3'
Output
result: str
# Jana Twardowskiego III
The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input

2.6.20.2. Type String Clean

  • Assignment name: Type String Clean

  • Last update: 2020-10-01

  • Complexity level: easy

  • Lines of code to write: 11 lines

  • Estimated time of completion: 13 min

  • Solution: solution/type_str_clean.py

English
  1. Use data from "Input" section (see below)

  2. Expected value is Jana III Sobieskiego

  3. Use only str methods to clean each variable

  4. Discuss how to create a generic solution that fits all cases

  5. The implementation of such a generic function is covered in the Function Arguments Clean chapter

  6. Compare result with "Output" section (see below)

Input
a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III Sobi\teskiego '

a = a.replace('ul', '').title().replace('Iii', 'III').strip()
b = b
c = c
d = d
e = e
f = f
g = g
h = h
i = i

expected = 'Jana III Sobieskiego'

print(f'{a == expected}\ta = "{a}"')
print(f'{b == expected}\tb = "{b}"')
print(f'{c == expected}\tc = "{c}"')
print(f'{d == expected}\td = "{d}"')
print(f'{e == expected}\te = "{e}"')
print(f'{f == expected}\tf = "{f}"')
print(f'{g == expected}\tg = "{g}"')
print(f'{h == expected}\th = "{h}"')
print(f'{i == expected}\ti = "{i}"')
Output
True    a = "Jana III Sobieskiego"
True    b = "Jana III Sobieskiego"
True    c = "Jana III Sobieskiego"
True    d = "Jana III Sobieskiego"
True    e = "Jana III Sobieskiego"
True    f = "Jana III Sobieskiego"
True    g = "Jana III Sobieskiego"
True    h = "Jana III Sobieskiego"
True    i = "Jana III Sobieskiego"
The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input