2.6. Type Str Methods

2.6.1. Rationale

  • str is immutable

  • str methods return a new, modified str; the original is left unchanged

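# The result of replace() is discarded, so a is unchanged
a = 'Python'
a.replace('P', 'J')

print(a)  # Python

# Assign the result to a new name to keep it
a = 'Python'
b = a.replace('P', 'J')

print(a)  # Python
print(b)  # Jython

# Or rebind the same name to the new str
a = 'Python'
a = a.replace('P', 'J')

print(a)  # Jython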
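
The immutability can also be observed directly. The short sketch below tries to modify a character in place and then checks that replace() produced a separate object; the variable names are illustrative only.

```python
a = 'Python'

# In-place modification is not supported by str
try:
    a[0] = 'J'
except TypeError as error:
    message = str(error)

# replace() returns a new object; the original is left as it was
b = a.replace('P', 'J')

print(message)   # 'str' object does not support item assignment
print(a)         # Python
print(b)         # Jython
print(a is b)    # False
```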

2.6.2. Change Case

  • Unify data format before analysis

name = 'jAn TwARDowSKi III'

name.upper()       # 'JAN TWARDOWSKI III'
name.lower()       # 'jan twardowski iii'
name.title()       # 'Jan Twardowski Iii'
name.capitalize()  # 'Jan twardowski iii'

name = 'Angus MacGyver'

name.upper()       # 'ANGUS MACGYVER'
name.lower()       # 'angus macgyver'
name.title()       # 'Angus Macgyver'
name.capitalize()  # 'Angus macgyver'
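
One common reason to unify case is comparing values typed in by different people. A minimal sketch, with sample values chosen for illustration:

```python
first = 'jAn TwARDowSKi'
second = 'JAN TWARDOWSKI'

print(first == second)                   # False
print(first.lower() == second.lower())   # True

# title() capitalizes only the first letter of every word, so values
# such as 'III' or 'MacGyver' need a follow-up fix
fixed = 'jan twardowski iii'.title().replace('Iii', 'III')
print(fixed)                             # Jan Twardowski III
```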

2.6.3. Replace

name = 'Jan Twardowski Iii'

name.replace('Iii', 'III')
# 'Jan Twardowski III'
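
By default replace() changes every occurrence of the searched text. An optional third argument limits the number of replacements, and a str without any match is returned unchanged. A short sketch with an illustrative sentence:

```python
text = 'to be or not to be'

every = text.replace('to', 'TO')
first_only = text.replace('to', 'TO', 1)
no_match = text.replace('x', 'y')

print(every)        # TO be or not TO be
print(first_only)   # TO be or not to be
print(no_match)     # to be or not to be
```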

2.6.4. Strip Whitespace

name = '\tJan Twardowski    \n'

name.strip()        # 'Jan Twardowski'
name.rstrip()       # '\tJan Twardowski'
name.lstrip()       # 'Jan Twardowski    \n'
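
The strip methods only touch the ends of the str, so whitespace between words is kept. They also accept an optional argument with the characters to remove instead of whitespace. A small sketch (the price value is made up for illustration):

```python
name = '\tJan Twardowski    \n'
clean = name.strip()
print(clean)                 # Jan Twardowski

price = '***19.99***'
print(price.strip('*'))      # 19.99
print(price.lstrip('*'))     # 19.99***
print(price.rstrip('*'))     # ***19.99
```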

2.6.5. Starts or Ends With

  • Read the method names as "starts with" and "ends with"

name = 'Jan Twardowski'

name.startswith('Jan')  # True
name.endswith(';')      # False
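
Both methods are case-sensitive and also accept a tuple of alternatives, which is handy when several prefixes or suffixes are acceptable. A sketch using an illustrative address:

```python
address = 'ul. Jana III Sobieskiego'

print(address.startswith(('ul', 'os', 'al')))   # True
print(address.endswith(('ego', 'iej')))         # True
print(address.startswith('UL'))                 # False
print(address.upper().startswith('UL'))         # True
```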

2.6.6. Split by Line

DATA = """First Line
Second Line
Third Line
"""

DATA.splitlines()
# [
#   'First Line',
#   'Second Line',
#   'Third Line'
# ]
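
For comparison, splitting on '\n' leaves an empty str after the final newline, while splitlines() does not. The keepends argument preserves the line endings. A short sketch on the same data:

```python
DATA = """First Line
Second Line
Third Line
"""

by_newline = DATA.split('\n')
# ['First Line', 'Second Line', 'Third Line', '']

lines = DATA.splitlines()
# ['First Line', 'Second Line', 'Third Line']

with_ends = DATA.splitlines(keepends=True)
# ['First Line\n', 'Second Line\n', 'Third Line\n']
```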

2.6.7. Split by Character

  • With no argument, split() splits on any run of whitespace

setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']

text = 'We choose to go to the Moon'

text.split(' ')
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
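
split() also takes a maxsplit argument that limits the number of cuts, and rsplit() does the same starting from the right. A sketch, with the date value added only for illustration:

```python
line = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# Cut once on whitespace: the address, then everything else
ip, hosts = line.split(maxsplit=1)
print(ip)      # 10.13.37.1
print(hosts)   # nasa.gov esa.int roscosmos.ru

date = '2000-01-31'
print(date.split('-', 1))    # ['2000', '01-31']
print(date.rsplit('-', 1))   # ['2000-01', '31']
```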

2.6.8. Join

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
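# 'We choose to go to the Moon'

# join() accepts only str items, so other types must be converted first
setosa = [5.1, 3.5, 1.4, 0.2, 'setosa']

','.join(setosa)
# TypeError: sequence item 0: expected str instance, float found

','.join(str(x) for x in setosa)
# '5.1,3.5,1.4,0.2,setosa'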
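
split() and join() are often used together, for example to collapse irregular whitespace into single spaces. A minimal sketch on a made-up input:

```python
text = 'We   choose\tto go \n to the Moon'

words = text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

normalized = ' '.join(words)
print(normalized)           # We choose to go to the Moon

dashed = '-'.join(words)
print(dashed)               # We-choose-to-go-to-the-Moon
```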

2.6.9. Is Whitespace

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True

Figure 2.2. ISS - International Space Station. Credits: NASA/Crew of STS-132 (img: s132e012208).

2.6.10. Is Alphabetic

'hello'.isalpha()   # True
'hello1'.isalpha()  # False
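
A few edge cases are worth knowing: an empty str and a str containing spaces are not alphabetic, while non-ASCII letters are. The related methods isdigit() and isalnum() are shown for comparison; they are not covered elsewhere in this section.

```python
print(''.isalpha())                # False, empty str
print('Jan Twardowski'.isalpha())  # False, space is not a letter
print('Sobieskiego'.isalpha())     # True
print('oś'.isalpha())              # True, non-ASCII letters count

print('123'.isdigit())             # True
print('12.5'.isdigit())            # False, '.' is not a digit
print('abc123'.isalnum())          # True
```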

2.6.11. Find Sub-String Position

text = 'We choose to go to the Moon'

text.find('M')      # 23
text.find('Moo')    # 23
text.find('x')      # -1
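
find() returns the index of the first match only. An optional start index continues the search further on, rfind() searches from the right, and index() behaves like find() but raises ValueError instead of returning -1. A short sketch on the same sentence:

```python
text = 'We choose to go to the Moon'

print(text.find('to'))       # 10, first occurrence
print(text.find('to', 11))   # 16, search starts at index 11
print(text.rfind('to'))      # 16, last occurrence

try:
    text.index('x')
except ValueError:
    found = False
else:
    found = True

print(found)                 # False
```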

2.6.12. Contains

'Py' in 'Python'     # True
'Monty' in 'Python'  # False
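
The in operator is case-sensitive, so a case-insensitive check needs the case unified first. A brief sketch:

```python
text = 'Python'

print('py' in text)            # False, case-sensitive
print('py' in text.lower())    # True
print('Monty' not in text)     # True
print('' in text)              # True, empty str is always found
```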

2.6.13. Count Occurrences

text = 'Moon'

text.count('o')     # 2
text.count('Moo')   # 1
text.count('x')     # 0
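
count() is case-sensitive and counts non-overlapping matches only. A short sketch with illustrative values:

```python
print('aaaa'.count('aa'))           # 2, matches do not overlap
print('Moon'.count('m'))            # 0, case-sensitive
print('Moon'.lower().count('m'))    # 1

sentence = 'We choose to go to the Moon'
print(sentence.count('to'))         # 2
```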

2.6.14. Remove Prefix or Suffix

New in Python 3.9 (PEP 616): the str.removeprefix() and str.removesuffix() string methods
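
A minimal sketch of both methods, assuming Python 3.9 or newer; the filename is made up for illustration. The last line shows why lstrip() is not a substitute: its argument is a set of characters, not a prefix.

```python
# Requires Python 3.9+
filename = 'test_string_test.py'

print(filename.removeprefix('test_'))   # string_test.py
print(filename.removesuffix('.py'))     # test_string_test
print(filename.removeprefix('xyz'))     # test_string_test.py (no match)

# lstrip() removes any leading characters from the set {t, e, s, _},
# so it eats into the name itself
print(filename.lstrip('test_'))         # ring_test.py
```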

2.6.15. Methods Chaining

a = 'Python'
b = a.upper().replace('P', 'C').title()

print(a)            # Python
print(b)            # Cython

a = 'Python'

b = a.upper().startswith('P').replace('P', 'C')
# AttributeError: 'bool' object has no attribute 'replace'
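
Longer chains can be wrapped in parentheses and written one call per line. Methods that return bool or int, such as startswith() or count(), end the chain of str methods, so they belong at the end. A sketch with an illustrative input:

```python
a = ' python 3 '

b = (
    a.strip()             # 'python 3'
    .title()              # 'Python 3'
    .replace('3', 'III')  # 'Python III'
)

print(b)                  # Python III

# A method returning bool goes last in the chain
c = a.strip().title().startswith('P')
print(c)                  # True
```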

2.6.16. Cleaning User Input

  • A commonly quoted rule of thumb: about 80% of the work in machine learning and data science is cleaning data

2.6.16.1. Addresses

  • Are these all the same address?

  • Below is a dump of distinct records for a single address

  • Which of the records below is the true address?

'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three lowercase letters 'L'

2.6.16.2. Streets

'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'
'os'
'os.'
'Os.'
'osiedle'

'oś'
'oś.'
'Oś.'
'ośedle'
'pl'
'pl.'
'Pl.'
'plac'
'al'
'al.'
'Al.'

'aleja'
'aleia'
'alei'
'aleii'
'aleji'
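
Many of the variants above differ only in letter case and a trailing dot. As a rough sketch, lowercasing and stripping the dot collapses those; it does nothing for the full-word or misspelled forms.

```python
print('ul'.lower().rstrip('.'))     # ul
print('Ul.'.lower().rstrip('.'))    # ul
print('UL.'.lower().rstrip('.'))    # ul
print('Os.'.lower().rstrip('.'))    # os
print('Pl.'.lower().rstrip('.'))    # pl
print('Al.'.lower().rstrip('.'))    # al
```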

2.6.16.3. House and Apartment Number

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'
'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'
'180f/8f'
'180f/8'
'180/8f'
'13d bud. A'

2.6.16.4. Phone Numbers

+48 (12) 355 5678
+48 123 555 678
123 555 678

+48 12 355 5678
+48 123-555-678
+48 123 555 6789

+1 (123) 555-6789
+1 (123).555.6789

+1 800-python
+48123555678

+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,2,3
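
Separators such as spaces, dashes, dots and parentheses can be removed with chained replace() calls. This is only a sketch: it does not handle extensions ('wew. 1337', ',1,2,3'), letters ('800-python') or missing country codes.

```python
phone = '+48 123-555-678'
first = phone.replace(' ', '').replace('-', '')
print(first)      # +48123555678

phone = '+1 (123).555.6789'
second = (
    phone.replace(' ', '')
    .replace('(', '')
    .replace(')', '')
    .replace('.', '')
    .replace('-', '')
)
print(second)     # +11235556789
```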

2.6.17. Assignments

2.6.17.1. Example

  1. For the given text: UL. jana \tTWArdoWskIEGO 3

  2. Use str methods to clean the variable

  3. The expected value is Jana Twardowskiego III

Solution
expected = 'Jana Twardowskiego III'
text = 'UL. jana \tTWArdoWskIEGO 3'

text = text.upper()
text = text.replace('UL.', '')
text = text.replace('\t', '')
text = text.replace('3', 'III')
text = text.title()
text = text.replace('Iii', 'III')
text = text.strip()

print('Matched:', text == expected)
# Matched: True

print(text)
# Jana Twardowskiego III
The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input

2.6.17.2. String Cleaning

  1. Use the data from the "Input" section (see below)

  2. The expected value is Jana III Sobieskiego

  3. Use only str methods to clean each variable

  4. Discuss how to create a generic solution that fits all cases

  5. Such a generic function is implemented in the Cleaning text input chapter

  6. Compare the result with the "Output" section (see below)

Input
a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III Sobi\teskiego '

a = a.replace('ul', '').title().replace('Iii', 'III').strip()
b = b
c = c
d = d
e = e
f = f
g = g
h = h
i = i

expected = 'Jana III Sobieskiego'

print(f'{a == expected}\ta = "{a}"')
print(f'{b == expected}\tb = "{b}"')
print(f'{c == expected}\tc = "{c}"')
print(f'{d == expected}\td = "{d}"')
print(f'{e == expected}\te = "{e}"')
print(f'{f == expected}\tf = "{f}"')
print(f'{g == expected}\tg = "{g}"')
print(f'{h == expected}\th = "{h}"')
print(f'{i == expected}\ti = "{i}"')
Output
True    a = "Jana III Sobieskiego"
True    b = "Jana III Sobieskiego"
True    c = "Jana III Sobieskiego"
True    d = "Jana III Sobieskiego"
True    e = "Jana III Sobieskiego"
True    f = "Jana III Sobieskiego"
True    g = "Jana III Sobieskiego"
True    h = "Jana III Sobieskiego"
True    i = "Jana III Sobieskiego"
The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input