1. str Methods

1.1. String immutability

  • str is immutable: an existing str object cannot be changed in place

  • str methods do not modify the original; they return a new, modified str

a = 'Python'
a.replace('P', 'J')  # the returned 'Jython' is discarded

print(a)  # Python

a = 'Python'
b = a.replace('P', 'J')

print(a)  # Python
print(b)  # Jython

a = 'Python'
a = a.replace('P', 'J')

print(a)  # Jython
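
Immutability can also be observed directly. The snippet below is a minimal sketch: it tries to overwrite a single character and then confirms that replace() hands back a separate object. The variable name message is introduced only for this illustration.

```python
a = 'Python'

# Item assignment is not supported, because str is immutable
try:
    a[0] = 'J'
except TypeError as error:
    message = str(error)

print(message)  # 'str' object does not support item assignment

# A method call returns a new str and leaves the original untouched
b = a.replace('P', 'J')

print(a)  # Python
print(b)  # Jython
```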

1.2. String Arithmetic

  • The + operator concatenates str and the * operator repeats it

  • For building a str from several values, f-string formatting is preferred over +

first_name = 'Jan'
last_name = 'Twardowski'

name = first_name + ' ' + last_name
# Jan Twardowski

'Ha' * 3            # HaHaHa
'-' * 10            # ----------
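
Since f-strings are named as the preferred approach, the same name can be built that way. This is a short sketch; the age value is a made-up example used to show that an f-string converts non-str values on its own, while + would require an explicit str() call.

```python
first_name = 'Jan'
last_name = 'Twardowski'
age = 42  # hypothetical value, only for illustration

name = f'{first_name} {last_name}'
print(name)  # Jan Twardowski

# The int is converted to str automatically inside the f-string
info = f'{name}, age {age}'
print(info)  # Jan Twardowski, age 42

# The + operator needs an explicit conversion
info_plus = name + ', age ' + str(age)
print(info_plus)  # Jan Twardowski, age 42
```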

1.3. str methods

1.3.1. Changing Character Case

  • Unify data format before analysis

name = 'jAn TwARDowSKi III'

name.upper()       # 'JAN TWARDOWSKI III'
name.lower()       # 'jan twardowski iii'
name.title()       # 'Jan Twardowski Iii'
name.capitalize()  # 'Jan twardowski iii'

name = 'Angus McGyver'

name.upper()       # 'ANGUS MCGYVER'
name.lower()       # 'angus mcgyver'
name.title()       # 'Angus Mcgyver'
name.capitalize()  # 'Angus mcgyver'

1.3.2. Replacing parts of the str

name = 'Jan Twardowski Iii'

name.replace('Iii', 'III')
# 'Jan Twardowski III'

1.3.3. Stripping whitespace from str

name = '\tJan Twardowski    \n'

name.strip()        # 'Jan Twardowski'
name.rstrip()       # '\tJan Twardowski'
name.lstrip()       # 'Jan Twardowski    \n'

1.3.4. Checking if str starts or ends with value

  • Read the method names as "starts with" and "ends with"

name = 'Jan Twardowski'

name.startswith('Jan')  # True
name.endswith(';')      # False
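
Both methods also accept a tuple of candidates and return True if any of them matches. The sketch below shows this; the filename value and the result variable names are made up for the example.

```python
name = 'Jan Twardowski'

# A tuple checks several prefixes or suffixes at once
starts = name.startswith(('Jan', 'Mark'))
ends = name.endswith(('.', ';'))

print(starts)  # True
print(ends)    # False

filename = 'str_cleaning.py'  # hypothetical value
is_python = filename.endswith(('.py', '.pyw'))

print(is_python)  # True
```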

1.3.5. Splitting by character or whitespace

setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']

text = 'We choose to go to the Moon'

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text.split(' ')
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
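
When only the first field needs to be separated from the rest, split() takes an optional maxsplit argument. The sketch below assumes the line holds an address followed by hostnames; the variable names are chosen for this example only.

```python
line = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# Split only once: the first field is the address, the rest stays together
ip, hostnames = line.split(maxsplit=1)

print(ip)                 # 10.13.37.1
print(hostnames)          # nasa.gov esa.int roscosmos.ru
print(hostnames.split())  # ['nasa.gov', 'esa.int', 'roscosmos.ru']
```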

1.3.6. Splitting by line

DATA = """First Line
Second Line
Third Line
"""

DATA.splitlines()
# [
#   'First Line',
#   'Second Line',
#   'Third Line'
# ]

1.3.7. Joining str

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
# 'We choose to go to the Moon'

setosa = [5.1, 3.5, 1.4, 0.2, 'setosa']

# join() accepts only str elements, so convert each value first
','.join(str(x) for x in setosa)
# '5.1,3.5,1.4,0.2,setosa'

1.3.8. Checking if str contains only whitespace

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True

1.3.9. Checking if str contains only alphabetic characters

'hello'.isalpha()   # True
'hello1'.isalpha()  # False

1.3.10. Finding starting position of a sub-string

text = 'We choose to go to the Moon'

text.find('M')      # 23
text.find('x')      # -1
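
Because find() signals a missing sub-string with -1 rather than an error, the result should be checked before it is used. The sketch below also shows two related methods, rfind() and index(); the variable names are introduced for illustration.

```python
text = 'We choose to go to the Moon'

first = text.find('to')   # position of the first occurrence
last = text.rfind('to')   # position of the last occurrence

print(first)  # 10
print(last)   # 16

# -1 means "not found", so compare against it explicitly
position = text.find('x')
found = position != -1
print(found)  # False

# index() works like find(), but raises ValueError when nothing is found
try:
    text.index('x')
except ValueError:
    raised = True

print(raised)  # True
```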

1.3.11. Check if str is a part of another str

'th' in 'Python'     # True
'hello' in 'Python'  # False

1.3.12. Counting occurrences

text = 'Moon'

text.count('o')     # 2
text.count('Moo')   # 1
text.count('x')     # 0

1.4. Multiple method calls in one line (method chaining)

a = 'Python'
b = a.upper().replace('P', 'C').title()

print(a)            # Python
print(b)            # Cython

a = 'Python'

b = a.upper().startswith('P').replace('P', 'C')
# AttributeError: 'bool' object has no attribute 'replace'
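
A chain works as long as every step returns a str, so methods returning bool (such as startswith) can only come last. Longer chains may be easier to read when wrapped in parentheses and split across lines. The sketch below reuses the name from section 1.3.1, with extra whitespace added as an assumption.

```python
raw = '  jAn TwARDowSKi III \n'  # hypothetical input with stray whitespace

# Each step returns a new str, which the next method then works on
clean = (
    raw
    .strip()                 # 'jAn TwARDowSKi III'
    .title()                 # 'Jan Twardowski Iii'
    .replace('Iii', 'III')   # 'Jan Twardowski III'
)

print(clean)  # Jan Twardowski III

# A method returning bool is fine as the final step
is_jan = raw.strip().title().startswith('Jan')
print(is_jan)  # True
```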

1.5. Cleaning str from user input

  • A common rule of thumb says that about 80% of machine learning and data science work is cleaning data

1.5.1. Is this the same address?

  • This is a dump of distinct records that all describe a single address

  • Which of the records below is the correct address?

'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three small letters 'L'

1.5.2. Different ways of spelling and abbreviating

'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'

'os'
'os.'
'Os.'
'osiedle'

'oś'
'oś.'
'Oś.'
'ośedle'

'pl'
'pl.'
'Pl.'
'plac'

'al'
'al.'
'Al.'

'aleja'
'aleia'
'alei'
'aleji'
'aleii'
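
One possible way to reduce such variants is to unify case, drop the trailing dot, and then look the result up in a table of known abbreviations. This is only a sketch under stated assumptions: the PREFIXES mapping and the normalize_prefix helper are hypothetical names, and misspellings such as 'ośedle' or 'aleia' are deliberately left unhandled.

```python
# Hypothetical mapping of abbreviations to full forms (an assumption)
PREFIXES = {
    'ul': 'ulica',
    'os': 'osiedle',
    'pl': 'plac',
    'al': 'aleja',
}


def normalize_prefix(prefix):
    """Return the full lowercase form of a known street prefix."""
    key = prefix.strip().lower().rstrip('.')
    # Unknown keys (including misspellings) are returned unchanged
    return PREFIXES.get(key, key)


print(normalize_prefix('UL.'))    # ulica
print(normalize_prefix('Ulica'))  # ulica
print(normalize_prefix('Os.'))    # osiedle
print(normalize_prefix('aleia'))  # aleia
```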

1.5.3. House number and apartment

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'
'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'
'180f/8f'
'180f/8'
'180/8f'
'13d bud. A'

1.5.4. Phone numbers

123 555 678

+48 (12) 355 5678
+48 12 355 5678
+48 123 555 678

+48 123-555-678
+48123555678
+48 123 555 6789

+1 (123) 555-6789
+1 (123).555.6789

+1 800-python

+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,2,3
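
Several of the variants above differ only in separators. The sketch below removes them with str.maketrans() and str.translate(), two str methods this section does not otherwise cover; chained replace() calls would work as well. The set of separator characters is an assumption, and extensions ('wew.', ',1') or letters are not handled.

```python
# Characters treated as separators are an assumption for this sketch
SEPARATORS = ' -().'
table = str.maketrans('', '', SEPARATORS)  # third argument: characters to delete

a = '+48 (12) 355 5678'.translate(table)
b = '+48 123-555-678'.translate(table)
c = '+1 (123).555.6789'.translate(table)
d = '+1 800-python'.translate(table)

print(a)  # +48123555678
print(b)  # +48123555678
print(c)  # +11235556789
print(d)  # +1800python

# After removing the leading '+', a clean number contains digits only
a_valid = a.lstrip('+').isdigit()
d_valid = d.lstrip('+').isdigit()

print(a_valid)  # True
print(d_valid)  # False
```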

1.6. Assignments

1.6.1. String cleaning

  • Complexity level: easy

  • Lines of code to write: 11 lines

  • Estimated time of completion: 15 min

  • Filename: solution/str_cleaning.py

expected = 'Jana III Sobieskiego'

a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III Sobi\teskiego '

print(f'{a == expected}\t a: "{a}"')
print(f'{b == expected}\t b: "{b}"')
print(f'{c == expected}\t c: "{c}"')
print(f'{d == expected}\t d: "{d}"')
print(f'{e == expected}\t e: "{e}"')
print(f'{f == expected}\t f: "{f}"')
print(f'{g == expected}\t g: "{g}"')
print(f'{h == expected}\t h: "{h}"')
print(f'{i == expected}\t i: "{i}"')

  1. Using str methods

  2. Clean the data so that every variable has the value Jana III Sobieskiego

  3. Do not use slicing

  4. Discuss how to build a generic solution that works for all the cases (the solution will be implemented in the Function Basics chapter)

The whys and wherefores

  • Defining variables

  • Using print formatting

  • Reading text from user input