1. str Methods

1.1. String immutability

  • str is immutable

  • str methods return a new, modified str; the original is left unchanged

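# The result of replace() is discarded, so a is unchanged
a = 'Python'
a.replace('P', 'J')

print(a)  # Python

# Assign the result to a new name
a = 'Python'
b = a.replace('P', 'J')

print(a)  # Python
print(b)  # Jython

# Or rebind the same name to the new str
a = 'Python'
a = a.replace('P', 'J')

print(a)  # Jython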

1.2. String Arithmetic

  • The + operator concatenates two str and the * operator repeats a str

  • f-string formatting is the preferred way to build a str from several values

first_name = 'Jan'
last_name = 'Twardowski'

name = first_name + ' ' + last_name
# Jan Twardowski

'Ha' * 3            # HaHaHa
'-' * 10            # ----------
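The f-string form preferred above can be sketched as follows, reusing the same two variables. This is a minimal illustration, not the only way to format the name.

```python
first_name = 'Jan'
last_name = 'Twardowski'

# Expressions inside the braces are evaluated and inserted into the str
name = f'{first_name} {last_name}'

print(name)  # Jan Twardowski
```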

1.3. str methods

1.3.1. Changing Character Case

  • Unify data format before analysis

name = 'jAn TwARDowSKi III'

name.upper()       # 'JAN TWARDOWSKI III'
name.lower()       # 'jan twardowski iii'
name.title()       # 'Jan Twardowski Iii'
name.capitalize()  # 'Jan twardowski iii'

name = 'Angus McGyver'

name.upper()       # 'ANGUS MCGYVER'
name.lower()       # 'angus mcgyver'
name.title()       # 'Angus Mcgyver'
name.capitalize()  # 'Angus mcgyver'
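One reason to unify case before analysis is comparison: two records that differ only in letter case are not equal until both are converted the same way. A small sketch, with variable names chosen only for illustration:

```python
a = 'jAn TwARDowSKi'
b = 'JAN TWARDOWSKI'

# Raw comparison is case-sensitive
same_raw = a == b

# Comparison after converting both values to lower case
same_unified = a.lower() == b.lower()

print(same_raw)      # False
print(same_unified)  # True
```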

1.3.2. Replacing parts of the str

name = 'Jan Twardowski Iii'

name.replace('Iii', 'III')
# 'Jan Twardowski III'

1.3.3. Stripping whitespace from str

name = '\tJan Twardowski    \n'

name.strip()        # 'Jan Twardowski'
name.rstrip()       # '\tJan Twardowski'
name.lstrip()       # 'Jan Twardowski    \n'

1.3.4. Checking if str starts or ends with value

  • Read the method names as "starts with" and "ends with"

name = 'Jan Twardowski'

name.startswith('Jan')  # True
name.endswith(';')      # False
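Both methods also accept a tuple of candidates and return True if any of them matches. The sketch below assumes a hypothetical tuple of street prefixes; the list is illustrative, not complete.

```python
address = 'ul. Jana III Sobieskiego'

# Hypothetical prefixes, for illustration only
prefixes = ('ul.', 'ulica', 'os.')

has_prefix = address.startswith(prefixes)
ends_with_punctuation = address.endswith(('.', ';'))

print(has_prefix)             # True
print(ends_with_punctuation)  # False
```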

1.3.5. Splitting by character or whitespace

setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']

text = 'We choose to go to the Moon'

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text.split(' ')
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
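When only the first field should be separated from the rest, split() takes an optional maxsplit argument. A possible sketch for the line above, where the names ip and hostnames are assumptions made for this example:

```python
text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# Split once on whitespace: first field, then the remainder of the line
ip, rest = text.split(maxsplit=1)

# Split the remainder into separate host names
hostnames = rest.split()

print(ip)         # 10.13.37.1
print(hostnames)  # ['nasa.gov', 'esa.int', 'roscosmos.ru']
```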

1.3.6. Splitting by line

DATA = """First Line
Second Line
Third Line
"""

DATA.splitlines()
# [
#   'First Line',
#   'Second Line',
#   'Third Line'
# ]
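For comparison, splitting the same text with split('\n') leaves an empty str at the end, because the text finishes with a newline. A short sketch of the difference:

```python
DATA = """First Line
Second Line
Third Line
"""

by_newline = DATA.split('\n')
by_lines = DATA.splitlines()

print(by_newline)  # ['First Line', 'Second Line', 'Third Line', '']
print(by_lines)    # ['First Line', 'Second Line', 'Third Line']
```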

1.3.7. Joining str

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
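# 'We choose to go to the Moon'

setosa = [5.1, 3.5, 1.4, 0.2, 'setosa']

# join() accepts only str elements
','.join(setosa)
# TypeError: sequence item 0: expected str instance, float found

# Convert each element to str first
','.join(str(x) for x in setosa)
# '5.1,3.5,1.4,0.2,setosa'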

1.3.8. Checking if str contains only whitespace

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True

1.3.9. Checking if str contains only alphabet characters

'hello'.isalpha()   # True
'hello1'.isalpha()  # False
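A few edge cases are worth checking before relying on isalpha() for names: a space is not a letter, an empty str gives False, and letters outside ASCII count as alphabetic. A brief sketch:

```python
with_space = 'Jan Twardowski'.isalpha()   # the space is not a letter
single_word = 'Twardowski'.isalpha()
empty = ''.isalpha()
non_ascii = 'oś'.isalpha()                # 'ś' is a letter

print(with_space)   # False
print(single_word)  # True
print(empty)        # False
print(non_ascii)    # True
```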

1.3.10. Finding starting position of a sub-string

text = 'We choose to go to the Moon'

text.find('M')      # 23
text.find('Moo')    # 23
text.find('x')      # -1
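find() reports only the first match, while rfind() searches from the right. Because -1 signals "not found", the result is best compared with -1 explicitly. A minimal sketch:

```python
text = 'We choose to go to the Moon'

first = text.find('to')   # position of the first 'to'
last = text.rfind('to')   # position of the last 'to'

# -1 means the sub-string was not found
position = text.find('x')
found = position != -1

print(first)  # 10
print(last)   # 16
print(found)  # False
```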

1.3.11. Checking if str is a part of another str

'Py' in 'Python'     # True
'Monty' in 'Python'  # False

1.3.12. Counting occurrences

text = 'Moon'

text.count('o')     # 2
text.count('Moo')   # 1
text.count('x')     # 0

1.4. Chaining multiple methods in one line

a = 'Python'
b = a.upper().replace('P', 'C').title()

print(a)            # Python
print(b)            # Cython

a = 'Python'

# startswith() returns a bool, so the chain cannot continue with str methods
b = a.upper().startswith('P').replace('P', 'C')
# AttributeError: 'bool' object has no attribute 'replace'
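One way to avoid the error is to keep the check out of the chain, so that every method in the chain is called on a str. This is a sketch of one possible arrangement, with the intermediate name upper introduced only for illustration:

```python
a = 'Python'
upper = a.upper()

# The bool from startswith() is used for the decision, not chained further
if upper.startswith('P'):
    b = upper.replace('P', 'C')
else:
    b = upper

print(a)  # Python
print(b)  # CYTHON
```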

1.5. Cleaning str from user input

  • Cleaning data is often said to take around 80% of the work in machine learning and data science

1.5.1. Is this the same address?

  • This is a dump of distinct records that all describe a single address

  • Which of the records below is the true address?

'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three lowercase letters 'L', not the Roman numeral III

1.5.2. Different ways of spelling and abbreviating

'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'

'os'
'os.'
'Os.'
'osiedle'

'oś'
'oś.'
'Oś.'
'ośedle'

'pl'
'pl.'
'Pl.'
'plac'

'al'
'al.'
'Al.'

'aleja'
'aleia'
'alei'
'aleii'
'aleji'
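Many of these variants differ only in letter case and a trailing dot, so two str methods already reduce them considerably. The sketch below uses a hypothetical list named variants; misspellings such as 'ośedle' or 'aleia' would still need separate handling.

```python
# Hypothetical sample of the spellings listed above
variants = ['ul', 'ul.', 'Ul.', 'UL.', 'ulica', 'Ulica']

# Lower-case each value and drop a trailing dot
normalized = [v.lower().rstrip('.') for v in variants]

# Distinct spellings that remain after normalization
unique = sorted(set(normalized))

print(normalized)  # ['ul', 'ul', 'ul', 'ul', 'ulica', 'ulica']
print(unique)      # ['ul', 'ulica']
```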

1.5.3. House number and apartment

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'
'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'
'180f/8f'
'180f/8'
'180/8f'
'13d bud. A'

1.5.4. Phone numbers

123 555 678

+48 (12) 355 5678
+48 12 355 5678
+48 123 555 678

+48 123-555-678
+48123555678
+48 123 555 6789

+1 (123) 555-6789
+1 (123).555.6789

+1 800-python

+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,2,3
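Chained replace() calls can remove the most common separators so that differently formatted numbers become comparable. A minimal sketch with two of the numbers above; extensions ('wew. 1337', ',1') and letters ('800-python') are deliberately left out and would need more work.

```python
a = '+48 (12) 355 5678'
b = '+48 123-555-678'

# Remove spaces, dashes and parentheses
a_clean = a.replace(' ', '').replace('-', '').replace('(', '').replace(')', '')
b_clean = b.replace(' ', '').replace('-', '').replace('(', '').replace(')', '')

print(a_clean)             # +48123555678
print(b_clean)             # +48123555678
print(a_clean == b_clean)  # True
```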

1.6. Assignments

1.6.1. String cleaning

  • Complexity level: easy

  • Lines of code to write: 11 lines

  • Estimated time of completion: 15 min

  • Filename: solution/str_cleaning.py

English
  1. Use the input data (see below)

  2. The expected value is Jana III Sobieskiego

  3. Use only str methods to clean each variable

  4. Compare the results with the output data (see below)

  5. Discuss how to create a generic solution that fits all cases

  6. An implementation of such a generic function appears in the Function Basics chapter

Input
a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III Sobi\teskiego '

Output
expected = 'Jana III Sobieskiego'

print(f'{a == expected}\t a: "{a}"')
print(f'{b == expected}\t b: "{b}"')
print(f'{c == expected}\t c: "{c}"')
print(f'{d == expected}\t d: "{d}"')
print(f'{e == expected}\t e: "{e}"')
print(f'{f == expected}\t f: "{f}"')
print(f'{g == expected}\t g: "{g}"')
print(f'{h == expected}\t h: "{h}"')
print(f'{i == expected}\t i: "{i}"')

The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input