3.6. Type str Methods

3.6.1. String Immutability

  • str is immutable: an existing str object cannot be changed in place

  • str methods return a new, modified str and leave the original unchanged

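# The result of replace() is discarded, so nothing changes
a = 'Python'
a.replace('P', 'J')

print(a)  # Python

# The result is stored under a new name
a = 'Python'
b = a.replace('P', 'J')

print(a)  # Python
print(b)  # Jython

# The name is rebound to the new str
a = 'Python'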
a = a.replace('P', 'J')

print(a)  # Jython
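
A short sketch of what immutability means in practice: item assignment on a str raises TypeError, while a method call returns a separate object and leaves the original as it was. The variable names are illustrative only.

```python
a = 'Python'

# Item assignment is not supported, because str is immutable
try:
    a[0] = 'J'
except TypeError as error:
    message = str(error)

# A method returns a new object; the original is left untouched
b = a.replace('P', 'J')

print(message)  # 'str' object does not support item assignment
print(a)        # Python
print(b)        # Jython
print(a is b)   # False
```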

3.6.2. String Arithmetic

  • Prefer f-string formatting over concatenation with + when building a string from several values

# * repeats a string
'Ha' * 3            # HaHaHa
'-' * 10            # ----------

# + concatenates strings
first_name = 'Jan'
last_name = 'Twardowski'

first_name + ' ' + last_name
# Jan Twardowski

Listing 36. How many strings are there in memory?
first_name = 'Jan'
last_name = 'Twardowski'
age = 42

# How many strings are there in memory?
first_name + ' ' + last_name

# How many strings are there in memory?
'Hello ' + first_name + ' ' + last_name + ' ' + str(age) + '!'

# How many strings are there in memory?
f'Hello {first_name} {last_name} {age}!'
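
One way to reason about the questions above is to write out every intermediate result of the chained + by hand. Each + evaluates to a new str, so the longer expression passes through several temporary strings before the final one. This is only a sketch of the evaluation order; the step names are made up for illustration, and an interpreter is free to optimize how temporaries are actually stored.

```python
first_name = 'Jan'
last_name = 'Twardowski'
age = 42

# 'Hello ' + first_name + ' ' + last_name + ' ' + str(age) + '!'
# is evaluated left to right, one + at a time:
step1 = 'Hello ' + first_name      # 'Hello Jan'
step2 = step1 + ' '                # 'Hello Jan '
step3 = step2 + last_name          # 'Hello Jan Twardowski'
step4 = step3 + ' '                # 'Hello Jan Twardowski '
step5 = step4 + str(age)           # 'Hello Jan Twardowski 42'
step6 = step5 + '!'                # 'Hello Jan Twardowski 42!'

# The f-string describes the same result in a single expression
result = f'Hello {first_name} {last_name} {age}!'

print(step6 == result)  # True
```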

3.6.3. String Methods

3.6.3.1. Change Case

  • Unify data format before analysis

name = 'jAn TwARDowSKi III'

name.upper()       # 'JAN TWARDOWSKI III'
name.lower()       # 'jan twardowski iii'
name.title()       # 'Jan Twardowski Iii'
name.capitalize()  # 'Jan twardowski iii'

# title() lowercases every letter that does not start a word
name = 'Angus McGyver'

name.upper()       # 'ANGUS MCGYVER'
name.lower()       # 'angus mcgyver'
name.title()       # 'Angus Mcgyver'
name.capitalize()  # 'Angus mcgyver'
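
Why unifying the case matters can be sketched with a comparison: the same name typed three ways counts as three different values until the case is normalized. The records list is a made-up example.

```python
records = ['jan twardowski', 'JAN TWARDOWSKI', 'Jan Twardowski']

# Compared as-is, all three records are different strings
distinct_raw = set(records)

# After unifying the case, they collapse to a single value
distinct_clean = {record.lower() for record in records}

print(len(distinct_raw))    # 3
print(len(distinct_clean))  # 1
print(distinct_clean)       # {'jan twardowski'}
```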

3.6.3.2. Replace

name = 'Jan Twardowski Iii'

name.replace('Iii', 'III')
# 'Jan Twardowski III'
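
By default replace() substitutes every occurrence. An optional third positional argument limits the number of replacements. A minimal sketch with a made-up string:

```python
text = 'Moon Moon Moon'

# Without a limit, all occurrences are replaced
every = text.replace('Moon', 'Sun')

# The third argument limits how many occurrences are replaced
first_only = text.replace('Moon', 'Sun', 1)

print(every)       # Sun Sun Sun
print(first_only)  # Sun Moon Moon
print(text)        # Moon Moon Moon
```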

3.6.3.3. Strip Whitespace

name = '\tJan Twardowski    \n'

name.strip()        # 'Jan Twardowski'
name.rstrip()       # '\tJan Twardowski'
name.lstrip()       # 'Jan Twardowski    \n'
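
The strip methods also accept an optional argument listing the characters to remove instead of whitespace. The sketch below uses made-up values and shows a common pitfall: the argument is treated as a set of characters, not as a prefix or suffix.

```python
name = '...Jan Twardowski;;'

both = name.strip('.;')     # 'Jan Twardowski'
left = name.lstrip('.')     # 'Jan Twardowski;;'
right = name.rstrip(';')    # '...Jan Twardowski'

# Pitfall: every leading 'u', 'l', '.' and ' ' is removed,
# not just the literal prefix 'ul. '
street = 'ul. ulica'
surprise = street.lstrip('ul. ')

print(both)      # Jan Twardowski
print(surprise)  # ica
```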

3.6.3.4. Checking If Starts or Ends with Value

  • Understand this as "starts with" and "ends with"

name = 'Jan Twardowski'

name.startswith('Jan')  # True
name.endswith(';')      # False
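
Both methods also accept a tuple of candidates and return True if any of them matches. A small sketch, with an illustrative list of street prefixes:

```python
prefixes = ('ul.', 'ulica', 'os.')

with_prefix = 'ul. Jana III Sobieskiego'.startswith(prefixes)
without_prefix = 'Jana III Sobieskiego'.startswith(prefixes)

# endswith() works the same way with a tuple of suffixes
is_image = 'photo.png'.endswith(('.png', '.jpg'))

print(with_prefix)     # True
print(without_prefix)  # False
print(is_image)        # True
```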

3.6.3.5. Splitting by Line

DATA = """First Line
Second Line
Third Line
"""

DATA.splitlines()
# [
#   'First Line',
#   'Second Line',
#   'Third Line'
# ]
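
For comparison, splitting the same text on '\n' leaves an empty string at the end, because the text finishes with a newline. A short sketch of the difference:

```python
DATA = """First Line
Second Line
Third Line
"""

lines = DATA.splitlines()
parts = DATA.split('\n')

print(lines)  # ['First Line', 'Second Line', 'Third Line']
print(parts)  # ['First Line', 'Second Line', 'Third Line', '']
```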

3.6.3.6. Splitting by Character or Whitespace

setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']

text = 'We choose to go to the Moon'

text.split(' ')
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

# split(' ') splits on every single space,
# split() splits on any run of whitespace
text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
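
The number of splits can be limited with the maxsplit parameter, which helps when only the first (or last) field needs to be separated. A sketch reusing the values above; the variable names are illustrative.

```python
text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

# Split once: the IP address and everything after it
ip, names = text.split(maxsplit=1)
hosts = names.split()

# rsplit() counts the splits from the right-hand side
setosa = '5.1,3.5,1.4,0.2,setosa'
measurements, species = setosa.rsplit(',', maxsplit=1)

print(ip)            # 10.13.37.1
print(hosts)         # ['nasa.gov', 'esa.int', 'roscosmos.ru']
print(measurements)  # 5.1,3.5,1.4,0.2
print(species)       # setosa
```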

3.6.3.7. Joining with String

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
# 'We choose to go to the Moon'
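
setosa = [5.1, 3.5, 1.4, 0.2, 'setosa']

# join() accepts only str items, so each value is converted first;
# ','.join(setosa) would raise TypeError because of the floats
','.join(str(value) for value in setosa)
# '5.1,3.5,1.4,0.2,setosa'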

3.6.3.8. Checking If Contains Only Whitespace

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True
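
Because an empty string is not considered whitespace by isspace(), a check for "nothing useful was typed" is often written with strip() instead. A minimal sketch with made-up values:

```python
empty = ''
spaces = '   '
padded = ' Jan '

# isspace() is False for the empty string
empty_is_space = empty.isspace()        # False
spaces_is_space = spaces.isspace()      # True

# strip() leaves '' for both empty and whitespace-only input
empty_is_blank = empty.strip() == ''    # True
spaces_is_blank = spaces.strip() == ''  # True
padded_is_blank = padded.strip() == ''  # False
```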

3.6.3.9. Checking If Contains Only Alphabet Characters

'hello'.isalpha()   # True
'hello1'.isalpha()  # False
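
A few more cases worth knowing, sketched below: a space is not an alphabetic character, the empty string gives False, and letters outside ASCII count as alphabetic.

```python
single_word = 'Jan'.isalpha()            # True
two_words = 'Jan Twardowski'.isalpha()   # False, the space is not a letter
empty = ''.isalpha()                     # False
polish = 'oś'.isalpha()                  # True, 'ś' is a letter

print(single_word, two_words, empty, polish)
# True False False True
```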

3.6.3.10. Finding Starting Position of a Sub-string

text = 'We choose to go to the Moon'

text.find('M')      # 23
text.find('Moo')    # 23
text.find('x')      # -1
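
Related methods behave slightly differently: rfind() searches from the right, and index() works like find() but raises ValueError instead of returning -1. A short sketch on the same text:

```python
text = 'We choose to go to the Moon'

first = text.find('to')      # 10, the first occurrence
last = text.rfind('to')      # 16, the last occurrence
moon = text.index('Moon')    # 23, same result as find()

# index() raises an exception when the sub-string is missing
try:
    text.index('x')
    found = True
except ValueError:
    found = False

print(first, last, moon, found)
# 10 16 23 False
```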

3.6.3.11. Checking If Is a Part of Another String

'Py' in 'Python'     # True
'Monty' in 'Python'  # False

3.6.3.12. Counting Occurrences

text = 'Moon'

text.count('o')     # 2
text.count('Moo')   # 1
text.count('x')     # 0
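
Two details of count() are sketched below: it is case-sensitive, and it counts non-overlapping occurrences only.

```python
text = 'Moon'

upper_m = text.count('M')               # 1
lower_m = text.count('m')               # 0, count() is case-sensitive
any_case_m = text.lower().count('m')    # 1, after unifying the case

# Occurrences must not overlap: 'aaaa' holds 'aa' twice, not three times
pairs = 'aaaa'.count('aa')              # 2

print(upper_m, lower_m, any_case_m, pairs)
# 1 0 1 2
```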

3.6.4. Multiple Statements in One Line

a = 'Python'
b = a.upper().replace('P', 'C').title()

print(a)            # Python
print(b)            # Cython

# startswith() returns a bool, so a str method cannot follow it
a = 'Python'

b = a.upper().startswith('P').replace('P', 'C')
# AttributeError: 'bool' object has no attribute 'replace'
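
When a chain fails, it can help to unroll it into separate steps and look at the type of each intermediate value. A sketch of that approach; the step names are illustrative.

```python
a = 'Python'

# Each step of the working chain returns a str
step1 = a.upper()                 # 'PYTHON'
step2 = step1.replace('P', 'C')   # 'CYTHON'
step3 = step2.title()             # 'Cython'

# startswith() returns a bool, which has no str methods
check = a.upper().startswith('P')
check_type = type(check).__name__

# A method returning a bool can still be the last link of a chain
ok = a.upper().replace('P', 'C').startswith('C')

print(step3)       # Cython
print(check_type)  # bool
print(ok)          # True
```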

3.6.5. Cleaning User Input

  • A common rule of thumb: about 80% of machine learning and data science work is cleaning data

3.6.5.1. Is This the Same Address?

  • This is a dump of distinct records that all refer to a single address

  • Which of the records below is the correct address?

'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three lowercase letters 'l', not uppercase 'I'

3.6.5.2. Spelling and Abbreviations

'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'

'os'
'os.'
'Os.'
'osiedle'
'oś'
'oś.'
'Oś.'
'ośedle'

'pl'
'pl.'
'Pl.'
'plac'

'al'
'al.'
'Al.'
'aleja'
'aleia'
'alei'
'aleii'
'aleji'

3.6.5.3. House and Apartment Number

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'
'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'
'180f/8f'
'180f/8'
'180/8f'
'13d bud. A'
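
Some of the variants above can be unified with replace() alone. The sketch below assumes that 'm.', 'm', 'apt' and 'apt.' all separate the house number from the apartment number and that 'house/apartment' is the target format; both are assumptions made for illustration. Cases such as '13d bud. A' are left out.

```python
numbers = ['1/2', '1 / 2', '1/ 2', '1 /2',
           '1 m. 2', '1 m 2', '1 apt 2', '1 apt. 2']

cleaned = []

for number in numbers:
    # Longer variants first, so that ' apt. ' is not half-replaced
    number = number.replace(' apt. ', '/')
    number = number.replace(' apt ', '/')
    number = number.replace(' m. ', '/')
    number = number.replace(' m ', '/')
    # Remove the remaining spaces around the slash
    number = number.replace(' ', '')
    cleaned.append(number)

print(set(cleaned))  # {'1/2'}
```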

3.6.5.4. Phone Numbers

+48 (12) 355 5678
+48 123 555 678
123 555 678

+48 12 355 5678
+48 123-555-678
+48 123 555 6789

+1 (123) 555-6789
+1 (123).555.6789

+1 800-python
+48123555678

+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,2,3
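
A minimal sketch of unifying a few of these formats by removing separator characters with replace(). It deliberately ignores extensions ('wew.', ',1'), letters ('800-python') and missing country codes, which need separate rules; the choice of a compact '+48...' form as the target is an assumption.

```python
phones = [
    '+48 (12) 355 5678',
    '+48 12 355 5678',
    '+48 123-555-678',
    '+48123555678',
]

cleaned = []

for phone in phones:
    # Drop every character used only as a visual separator
    for separator in (' ', '(', ')', '-', '.'):
        phone = phone.replace(separator, '')
    cleaned.append(phone)

print(set(cleaned))  # {'+48123555678'}
```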

3.6.6. Assignments

3.6.6.1. Example

  • Complexity level: easy

  • Lines of code to write: 8 lines

  • Estimated time of completion: 5 min

  • Filename: solution/str_methods.py

English
  1. For the given text: UL. jana \tTWArdoWskIEGO 3

  2. Use str methods to clean the variable

  3. The expected value is Jana Twardowskiego III


Solution
expected = 'Jana Twardowskiego III'
text = 'UL. jana \tTWArdoWskIEGO 3'

text = text.upper()
text = text.replace('UL.', '')
text = text.replace('\t', '')
text = text.replace('3', 'III')
text = text.title()
text = text.replace('Iii', 'III')
text = text.strip()

print('Matched:', text == expected)
# Matched: True

print(text)
# Jana Twardowskiego III
The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input

3.6.6.2. String Cleaning

  • Complexity level: easy

  • Lines of code to write: 11 lines

  • Estimated time of completion: 15 min

  • Filename: solution/str_cleaning.py

English
  1. For the input data (see the Input section below)

  2. The expected value is Jana III Sobieskiego

  3. Use only str methods to clean each variable

  4. Compare the results with the output data (see the Output section below)

  5. Discuss how to create a generic solution that fits all cases

  6. An implementation of such a generic function will follow in the Function Definition chapter


Input
a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III Sobi\teskiego '
Output
expected = 'Jana III Sobieskiego'

print('a:', a == expected, a, sep='\t')
print('b:', b == expected, b, sep='\t')
print('c:', c == expected, c, sep='\t')
print('d:', d == expected, d, sep='\t')
print('e:', e == expected, e, sep='\t')
print('f:', f == expected, f, sep='\t')
print('g:', g == expected, g, sep='\t')
print('h:', h == expected, h, sep='\t')
print('i:', i == expected, i, sep='\t')
The whys and wherefores
  • Variable definition

  • Print formatting

  • Cleaning text input