1. str Methods

1.1. String immutability

  • str is immutable

  • str methods create a new modified str

a = 'Python'
a.replace('P', 'J')      # the new str is created, but the result is discarded

print(a)  # Python

a = 'Python'
b = a.replace('P', 'J')  # the new str is bound to the name b

print(a)  # Python
print(b)  # Jython
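
Immutability can also be observed directly: assigning to a single character of a str fails. A minimal sketch of this, where the name `error` is used only for illustration:

```python
a = 'Python'

# A str cannot be changed in place; item assignment raises TypeError
try:
    a[0] = 'J'
except TypeError as exc:
    error = str(exc)

print(error)  # 'str' object does not support item assignment
print(a)      # Python
```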

1.2. String Arithmetic

  • f-string formatting is preferred over the + operator for concatenating strings

first_name = 'Jan'
last_name = 'Twardowski'

name = first_name + ' ' + last_name
# Jan Twardowski

'José' * 3          # JoséJoséJosé
'-' * 10            # ----------
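
For comparison with the + operator, the same full name can be built with an f-string. This is a small sketch that reuses the variable names from the example above:

```python
first_name = 'Jan'
last_name = 'Twardowski'

# Expressions inside {} are evaluated and inserted into the resulting str
name = f'{first_name} {last_name}'

print(name)  # Jan Twardowski
```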

1.3. str methods

1.3.1. Changing Character Case

  • Unify data format before analysis

name = 'jAn TwARDowSKi III'

name.upper()       # 'JAN TWARDOWSKI III'
name.lower()       # 'jan twardowski iii'
name.title()       # 'Jan Twardowski Iii'
name.capitalize()  # 'Jan twardowski iii'
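
One way to read "unify data format" is to bring both values to the same case before comparing them. A minimal sketch with two illustrative values:

```python
a = 'jAn TwARDowSKi'
b = 'JAN TWARDOWSKI'

# Direct comparison is case-sensitive
print(a == b)                  # False

# Lowercasing both sides makes the comparison case-insensitive
print(a.lower() == b.lower())  # True
```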

1.3.2. Replacing parts of the str

name = 'Jan Twardowski Iii'

name.replace('Iii', 'III')
# 'Jan Twardowski III'

1.3.3. Stripping whitespace from str

name = '\tJan Twardowski    \n'

name.strip()        # 'Jan Twardowski'
name.rstrip()       # '\tJan Twardowski'
name.lstrip()       # 'Jan Twardowski    \n'
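
Note that the strip methods only touch the ends of the str. Whitespace inside the text stays in place and can be handled with replace(). A short sketch, assuming the tab should become a single space:

```python
name = '  Jan\tTwardowski \n'

# strip() removes whitespace only at both ends
stripped = name.strip()
print(repr(stripped))  # 'Jan\tTwardowski'

# Whitespace inside the text needs replace()
cleaned = stripped.replace('\t', ' ')
print(repr(cleaned))   # 'Jan Twardowski'
```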

1.3.4. Checking if str starts or ends with value

  • Read the method names as “starts with” and “ends with”

name = 'Jan Twardowski'

name.startswith('Jan')  # True
name.endswith(';')      # False
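
Both methods also accept a tuple of values and return True if any of them matches. A minimal sketch, where the `prefixes` tuple is an illustrative assumption:

```python
prefixes = ('ul', 'os', 'al')

with_prefix = 'ul. Jana III Sobieskiego'
without_prefix = 'Jana III Sobieskiego'

# True if the str starts with any element of the tuple
print(with_prefix.startswith(prefixes))     # True
print(without_prefix.startswith(prefixes))  # False
```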

1.3.5. Splitting str

setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']

text = 'We choose to go to the Moon'

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text.split(' ')
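# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# split(' ') cuts at every single space, so repeated spaces give empty strings
text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

# split() without arguments cuts at any run of whitespace
text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']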

1.3.6. Joining str

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
# 'We choose to go to the Moon'

setosa = [5.1, 3.5, 1.4, 0.2, 'setosa']

# join() accepts only str items, so the floats are converted first
','.join(str(x) for x in setosa)
# '5.1,3.5,1.4,0.2,setosa'
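
Splitting and joining can be combined to collapse repeated whitespace into single spaces. A minimal sketch, reusing the text from the splitting example:

```python
text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# split() drops every run of whitespace, join() puts single spaces back
normalized = ' '.join(text.split())

print(normalized)  # 10.13.37.1 nasa.gov esa.int roscosmos.ru
```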

1.3.7. Checking if str contains only whitespace

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True

1.3.8. Checking if str contains only alphabet characters

'hello'.isalpha()   # True
'hello1'.isalpha()  # False

1.3.9. Finding starting position of a sub-string

text = 'We choose to go to the Moon'

text.find('M')      # 23
text.find('x')      # -1
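
Because -1 is also a valid index (the last character), the result of find() should be checked before it is used as a position. A small sketch, where the name `message` is illustrative:

```python
text = 'We choose to go to the Moon'

position = text.find('Moon')
missing = text.find('Mars')

# Compare with -1 explicitly instead of using the value as an index
if missing == -1:
    message = 'not found'
else:
    message = f'found at {missing}'

print(position)  # 23
print(message)   # not found
```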

1.3.10. Check if str is a part of another str

'th' in 'Python'     # True
'hello' in 'Python'  # False
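
The in operator is case-sensitive. A minimal sketch of combining it with lower() for a case-insensitive check:

```python
language = 'Python'

# 'py' does not match 'Py' because the case differs
print('py' in language)          # False

# Lowercasing first makes the check case-insensitive
print('py' in language.lower())  # True
```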

1.4. Chaining multiple method calls in one line

a = 'Python'
b = a.upper().replace('P', 'C').title()

print(a)            # Python
print(b)            # Cython

a = 'Python'

# startswith() returns a bool, and a bool has no str methods
b = a.upper().startswith('P').replace('P', 'C')
# AttributeError: 'bool' object has no attribute 'replace'
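
A chain is evaluated from left to right, and each method is called on the result of the previous one. The sketch below unrolls the working chain into separate steps; the names `step1`, `step2`, `step3` and `check` are illustrative:

```python
a = 'Python'

step1 = a.upper()                 # 'PYTHON'
step2 = step1.replace('P', 'C')   # 'CYTHON'
step3 = step2.title()             # 'Cython'

# startswith() ends the chain of str values, because it returns a bool
check = a.upper().startswith('P')

print(step3)        # Cython
print(type(check))  # <class 'bool'>
```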

1.5. Cleaning str from user input

  • A common rule of thumb says that about 80% of machine learning and data science work is cleaning data

1.5.1. Is this the same address?

  • This is a dump of distinct records that all describe a single address

  • Which of the records below is the correct address?

'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three lowercase letters 'L'

1.5.2. Different ways of spelling and abbreviating

'ul '
'ul. '
'ul.'
'ulica'
'Ul. '
'UL. '
'ulica '
'Ulica. '
'os. '
'ośedle'
'osiedle'
'os'
'plac '
'pl '
'al '
'al. '
'aleja '
'alei '
'aleia'
'aleii'
'aleji'

1.5.3. House number and apartment

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'

'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'

'180f/8f'
'180f/8'
'180/8f'

'13d bud. A'
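
Some of these variants can be unified with chained replace() calls. The sketch below assumes that '1/2' is the target format, which the records above do not state; the variable names are illustrative:

```python
a = '1 / 2'
b = '1 apt. 2'
c = '1 apt 2'

# Remove the spaces around the slash
a_clean = a.replace(' ', '')

# Replace the longer spelling first, then the shorter one, then drop spaces
b_clean = b.replace('apt.', '/').replace('apt', '/').replace(' ', '')
c_clean = c.replace('apt.', '/').replace('apt', '/').replace(' ', '')

print(a_clean)  # 1/2
print(b_clean)  # 1/2
print(c_clean)  # 1/2
```

Records such as '180f/8f' or '13d bud. A' carry extra information and would need separate rules.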

1.6. Assignments

1.6.1. String cleaning

  • Filename: str_cleaning.py

  • Lines of code to write: 11 lines

  • Estimated time of completion: 15 min

expected = 'Jana III Sobieskiego'

a = '  Jana III Sobieskiego '
b = 'ul Jana III SobIESkiego'
c = '\tul. Jana trzeciego Sobieskiego'
d = 'ulicaJana III Sobieskiego'
e = 'UL. JA\tNA 3 SOBIES\tKIEGO'
f = 'UL. jana III SOBiesKIEGO'
g = 'ULICA JANA III SOBIESKIEGO  '
h = 'ULICA. JANA III SOBIeskieGO'
i = ' Jana 3 Sobieskiego  '
j = 'Jana III Sobi\teskiego '
k = 'ul.Jana III Sob\n\nieskiego\n'

print(f'{a == expected}\t a: "{a}"')
print(f'{b == expected}\t b: "{b}"')
print(f'{c == expected}\t c: "{c}"')
print(f'{d == expected}\t d: "{d}"')
print(f'{e == expected}\t e: "{e}"')
print(f'{f == expected}\t f: "{f}"')
print(f'{g == expected}\t g: "{g}"')
print(f'{h == expected}\t h: "{h}"')
print(f'{i == expected}\t i: "{i}"')
print(f'{j == expected}\t j: "{j}"')
print(f'{k == expected}\t k: "{k}"')

  1. Use str methods

  2. Clean the data so that every variable has the value Jana III Sobieskiego

  3. Do not use slicing

  4. Discuss how to write a generic solution that fits all cases (the implementation will follow in the Function Basics chapter)

The whys and wherefores
  • Defining variables

  • Using print formatting

  • Reading text from the user