2.6. Type Str Methods

2.6.1. String Immutability

How many strings are there in memory?

>>> firstname = 'Jan'
>>> lastname = 'Twardowski'
>>>
>>> firstname + ' ' + lastname
'Jan Twardowski'

>>> firstname = 'Jan'
>>> lastname = 'Twardowski'
>>>
>>> f'{firstname} {lastname}'
'Jan Twardowski'

>>> firstname = 'Jan'
>>> lastname = 'Twardowski'
>>> age = 42
>>>
>>> 'Hello ' + firstname + ' ' + lastname + ' ' + str(age) + '!'
'Hello Jan Twardowski 42!'

>>> firstname = 'Jan'
>>> lastname = 'Twardowski'
>>> age = 42
>>>
>>> f'Hello {firstname} {lastname} {age}!'
'Hello Jan Twardowski 42!'

../../_images/memory-str-1.png

Figure 2.2. Define str

../../_images/memory-str-2.png

Figure 2.3. Define another str with the same value

../../_images/memory-str-3.png

Figure 2.4. Define another str with different value
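
One way to explore the question above is to keep a second name bound to the original string and then concatenate. This is a minimal sketch; the name `alias` is introduced here only for illustration. It suggests that `+=` builds a new `str` object and leaves the original one untouched.

```python
firstname = 'Jan'
lastname = 'Twardowski'

alias = firstname                 # second name bound to the same object
firstname += ' ' + lastname       # builds a new str and rebinds `firstname`

print(firstname)                  # Jan Twardowski
print(alias)                      # Jan
print(firstname is alias)         # False
```

Because `alias` still refers to `'Jan'`, both strings exist in memory at the same time.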

2.6.2. Rationale

  • str is immutable

  • str methods create a new modified str

    >>> a = 'Python'
    >>> a.replace('P', 'C')
    'Cython'
    >>> print(a)
    Python
    
    >>> a = 'Python'
    >>> b = a.replace('P', 'C')
    >>>
    >>> print(a)
    Python
    >>> print(b)
    Cython
    
    >>> a = 'Python'
    >>> a = a.replace('P', 'C')
    >>>
    >>> print(a)
    Cython
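
Immutability can also be observed directly: assigning to a single character position fails. The sketch below assumes nothing beyond built-in `str` behavior and shows building a new string from a slice instead.

```python
a = 'Python'

try:
    a[0] = 'C'                    # str does not support item assignment
except TypeError as error:
    message = str(error)

b = 'C' + a[1:]                   # build a new str instead

print(message)
print(a)                          # Python
print(b)                          # Cython
```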
    

2.6.3. Strip Whitespace

>>> name = '\tAngus MacGyver    \n'
>>>
>>> name.strip()
'Angus MacGyver'
>>> name.rstrip()
'\tAngus MacGyver'
>>> name.lstrip()
'Angus MacGyver    \n'
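
The strip methods also accept an optional argument naming the characters to remove. A short sketch with made-up sample data follows; note that the argument is treated as a set of characters, not as a substring.

```python
line = '...Angus MacGyver!!!\n'

cleaned = line.strip('.!\n')      # remove any of '.', '!' and newline from both ends
left_only = line.lstrip('.')      # remove leading dots only

print(cleaned)                    # Angus MacGyver
print(repr(left_only))            # 'Angus MacGyver!!!\n'
```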

2.6.4. Change Case

  • Unify data format before analysis

    >>> name = 'Angus MacGyver III'
    >>>
    >>> name.upper()
    'ANGUS MACGYVER III'
    >>> name.lower()
    'angus macgyver iii'
    >>> name.title()
    'Angus Macgyver Iii'
    >>> name.capitalize()
    'Angus macgyver iii'
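
As an illustration of unifying data before comparison, the sketch below compares two spellings of the same name. The sample values are assumptions made for this example. `str.casefold()` is a more aggressive variant of `lower()` intended for caseless matching.

```python
first = 'ANGUS MacGyver'
second = 'angus macgyver'

raw_equal = first == second                          # False
lower_equal = first.lower() == second.lower()        # True

# casefold() also handles characters such as the German sharp s
folded_equal = 'Straße'.casefold() == 'STRASSE'.casefold()   # True
lowered_equal = 'Straße'.lower() == 'STRASSE'.lower()        # False

print(raw_equal, lower_equal, folded_equal, lowered_equal)
```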
    

2.6.5. Replace

>>> name = 'Angus MacGyver Iii'
>>>
>>> name.replace('Iii', 'III')
'Angus MacGyver III'
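
`replace()` substitutes every occurrence by default; an optional third argument limits the number of replacements. A minimal sketch:

```python
name = 'Angus MacGyver Iii'

every = name.replace('i', 'I')        # all occurrences
first_only = name.replace('i', 'I', 1)  # at most one replacement
missing = name.replace('x', 'y')      # nothing to replace, value unchanged

print(every)                          # Angus MacGyver III
print(first_only)                     # Angus MacGyver IIi
print(missing)                        # Angus MacGyver Iii
```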

2.6.6. Starts With

>>> name = 'Angus MacGyver III'
>>> name.startswith('Angus')
True

>>> PREFIX = ('vir', 'ver')
>>>
>>> 'virginica'.startswith(PREFIX)
True
>>> 'versicolor'.startswith(PREFIX)
True
>>> 'setosa'.startswith(PREFIX)
False

2.6.7. Ends With

>>> name = 'Angus MacGyver Iii'
>>>
>>> name.endswith('III')
False

>>> DOMAINS = ('@nasa.gov', '@esa.int')
>>>
>>> email = 'mark.watney@nasa.gov'
>>> email.endswith(DOMAINS)
True
>>> email = 'ivan.ivanovich@roscosmos.ru'
>>> email.endswith(DOMAINS)
False

2.6.8. Split by Line

>>> DATA = """First Line
... Second Line
... Third Line
... """
>>> DATA.splitlines()
['First Line', 'Second Line', 'Third Line']
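
For comparison, the sketch below contrasts `splitlines()` with `split('\n')` on the same data and shows the `keepends` parameter. The difference matters when the text ends with a newline.

```python
DATA = 'First Line\nSecond Line\nThird Line\n'

by_lines = DATA.splitlines()
by_newline = DATA.split('\n')               # trailing newline yields an empty item
with_ends = DATA.splitlines(keepends=True)  # keep the line terminators

print(by_lines)     # ['First Line', 'Second Line', 'Third Line']
print(by_newline)   # ['First Line', 'Second Line', 'Third Line', '']
print(with_ends)    # ['First Line\n', 'Second Line\n', 'Third Line\n']
```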

2.6.9. Split by Character

  • With no argument, split() splits on any run of whitespace

    >>> setosa = '5.1,3.5,1.4,0.2,setosa'
    >>>
    >>> setosa.split(',')
    ['5.1', '3.5', '1.4', '0.2', 'setosa']
    
    >>> text = 'We choose to go to the Moon'
    >>>
    >>> text.split(' ')
    ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
    >>> text.split()
    ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
    
    >>> text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'
    >>>
    >>> text.split(' ')
    ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']
    >>> text.split()
    ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
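
Both `split()` and `rsplit()` take an optional `maxsplit` argument. The sketch below reuses the `setosa` record to separate the measurements from the species name; the variable names are chosen for this example only.

```python
setosa = '5.1,3.5,1.4,0.2,setosa'

head = setosa.split(',', maxsplit=1)    # split once, from the left
tail = setosa.rsplit(',', maxsplit=1)   # split once, from the right

print(head)   # ['5.1', '3.5,1.4,0.2,setosa']
print(tail)   # ['5.1,3.5,1.4,0.2', 'setosa']
```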
    

2.6.10. Join by Character

>>> text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
>>> ' '.join(text)
'We choose to go to the Moon'

>>> setosa = ['5.1', '3.5', '1.4', '0.2', 'setosa']
>>> ','.join(setosa)
'5.1,3.5,1.4,0.2,setosa'

>>> crew = ['Mark Watney', 'Jan Twardowski', 'Melissa Lewis']
>>>
>>> '\n'.join(crew)
'Mark Watney\nJan Twardowski\nMelissa Lewis'

>>> TEXT = ['We choose to go to the Moon!',
...        'We choose to go to the Moon in this decade and do the other things,',
...        'not because they are easy, but because they are hard;',
...        'because that goal will serve to organize and measure the best of our energies and skills,',
...        'because that challenge is one that we are willing to accept, one we are unwilling to postpone,',
...        'and one we intend to win, and the others, too.']
...
>>> print('\n'.join(TEXT))
We choose to go to the Moon!
We choose to go to the Moon in this decade and do the other things,
not because they are easy, but because they are hard;
because that goal will serve to organize and measure the best of our energies and skills,
because that challenge is one that we are willing to accept, one we are unwilling to postpone,
and one we intend to win, and the others, too.
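
`join()` accepts only strings, so mixed data has to be converted first. The sketch below assumes a record that holds floats and a string; converting each item with `str` is one possible approach.

```python
values = [5.1, 3.5, 1.4, 0.2, 'setosa']

try:
    ','.join(values)                  # floats are not accepted
except TypeError:
    failed = True
else:
    failed = False

row = ','.join(map(str, values))      # convert every item to str first

print(failed)   # True
print(row)      # 5.1,3.5,1.4,0.2,setosa
```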

2.6.11. Is Whitespace

>>> text = ''
>>> text.isspace()
False

>>> text = ' '
>>> text.isspace()
True

>>> text = '\t'
>>> text.isspace()
True

>>> text = '\n'
>>> text.isspace()
True

../../_images/iss.jpg

Figure 2.5. ISS - International Space Station. Credits: NASA/Crew of STS-132 (img: s132e012208).

2.6.12. Is Alphabet Characters

>>> text = 'hello'
>>> text.isalpha()
True

>>> text = 'hello1'
>>> text.isalpha()
False

2.6.13. Is Numeric
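
The original gives no example for this section. The sketch below assumes the heading refers to `str.isnumeric()` and compares it with the related `isdigit()` and `isdecimal()` methods. None of them accepts a sign or a decimal point.

```python
plain = '42'.isnumeric()            # True
decimal_point = '4.2'.isnumeric()   # False, '.' is not a numeric character
negative = '-1'.isnumeric()         # False, '-' is not a numeric character
empty = ''.isnumeric()              # False

# The three methods differ for characters such as superscripts and fractions
superscript = ('²'.isdecimal(), '²'.isdigit(), '²'.isnumeric())   # (False, True, True)
fraction = ('½'.isdecimal(), '½'.isdigit(), '½'.isnumeric())      # (False, False, True)

print(plain, decimal_point, negative, empty)
print(superscript, fraction)
```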

2.6.14. Find Sub-String Position

>>> text = 'We choose to go to the Moon'
>>>
>>> text.find('M')
23
>>> text.find('Moo')
23
>>> text.find('x')
-1
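
Related lookups are sketched below: `find()` with a start position, `rfind()` for the last match, and `index()`, which raises `ValueError` instead of returning -1.

```python
text = 'We choose to go to the Moon'

first_to = text.find('to')        # 10
second_to = text.find('to', 11)   # 16, search starts at position 11
last_o = text.rfind('o')          # 25

try:
    text.index('x')               # like find(), but raises when not found
except ValueError:
    found = False
else:
    found = True

print(first_to, second_to, last_o, found)
```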

2.6.15. Contains

>>> 'Monty' in 'Python'
False
>>> 'Py' in 'Python'
True
>>> 'py' in 'Python'
False
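
Since the `in` operator is case-sensitive, a common approach to case-insensitive matching is to lower both sides first. A minimal sketch:

```python
text = 'Python'
needle = 'py'

sensitive = needle in text                      # False
insensitive = needle.lower() in text.lower()    # True

print(sensitive, insensitive)
```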

2.6.16. Count Occurrences

>>> text = 'Moon'
>>>
>>> text.count('o')
2
>>> text.count('Moo')
1
>>> text.count('x')
0
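
Two details worth keeping in mind, shown as a short sketch: `count()` counts non-overlapping occurrences and is case-sensitive.

```python
overlapping = 'aaaa'.count('aa')    # 2, matches do not overlap
lowercase_m = 'Moon'.count('m')     # 0, 'M' and 'm' are different characters

print(overlapping, lowercase_m)
```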

2.6.17. Remove Prefix or Suffix

Since Python 3.9: PEP 616 -- String methods to remove prefixes and suffixes

>>> filename = '1969-apollo11.txt'
>>>
>>> filename.removeprefix('1969-')
'apollo11.txt'
>>> filename.removesuffix('.txt')
'1969-apollo11'
>>> filename.removeprefix('1969-').removesuffix('.txt')
'apollo11'
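
The sketch below illustrates why these methods are not interchangeable with `rstrip()`: `rstrip()` treats its argument as a set of characters, while `removesuffix()` removes one exact substring. The filename is a made-up example, and Python 3.9 or newer is assumed.

```python
filename = 'test.txt'

stripped = filename.rstrip('.txt')        # strips any trailing '.', 't', 'x'
removed = filename.removesuffix('.txt')   # removes the exact suffix once

print(stripped)   # tes
print(removed)    # test
```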

2.6.18. Method Chaining

>>> a = 'Python'
>>>
>>> a = a.upper()
>>> a = a.replace('P', 'C')
>>> a = a.title()
>>>
>>> print(a)
Cython

>>> a = 'Python'
>>> a = a.upper().replace('P', 'C').title()
>>>
>>> print(a)
Cython

>>> a = 'Python'
>>> a.upper().replace('P', 'C').title()
'Cython'

How it works:

  1. a -> 'Python'

  2. 'Python'.upper() -> 'PYTHON'

  3. 'PYTHON'.replace('P', 'C') -> 'CYTHON'

  4. 'CYTHON'.title() -> 'Cython'

>>> a = 'Python'
>>> a = a.upper().startswith('P').replace('P', 'C')
Traceback (most recent call last):
AttributeError: 'bool' object has no attribute 'replace'

Note that no character, not even a space, may follow the \ line-continuation character:

>>> a = 'Python'
>>> a = a.upper() \
...      .replace('P', 'C') \
...      .title()
>>>
>>> print(a)
Cython

>>> a = 'Python'
>>> a = (a.upper()
...       .replace('P', 'C')
...       .title())
>>>
>>> print(a)
Cython

2.6.19. Cleaning User Input

  • It is often said that about 80% of machine learning and data science work is cleaning data

  • Is this the same address?

  • This is a dump of distinct records for a single address

  • Which of the records below is the true address?

Numbers:

>>> number = 1
>>> number = 1.0
>>> number = 1.00
>>>
>>> number = '1'
>>> number = '1.0'
>>> number = '1.00'
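
As a small illustration of the problem, the text forms above differ as strings but compare equal once converted. Converting with `float()` is one possible normalization, assumed here for the example.

```python
a = '1'
b = '1.0'
c = '1.00'

same_text = a == b == c                            # False, the strings differ
same_value = float(a) == float(b) == float(c)      # True after conversion
int_vs_float = 1 == 1.0                            # True

print(same_text, same_value, int_vs_float)
```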

Addresses:

>>> street = 'ul. Jana III Sobieskiego'
>>> street = 'ul Jana III Sobieskiego'
>>> street = 'ul.Jana III Sobieskiego'
>>> street = 'ulicaJana III Sobieskiego'
>>> street = 'Ul. Jana III Sobieskiego'
>>> street = 'UL. Jana III Sobieskiego'
>>> street = 'ulica Jana III Sobieskiego'
>>> street = 'Ulica. Jana III Sobieskiego'
>>>
>>> street = 'os. Jana III Sobieskiego'
>>>
>>> street = 'Jana 3 Sobieskiego'
>>> street = 'Jana 3ego Sobieskiego'
>>> street = 'Jana III Sobieskiego'
>>> street = 'Jana Iii Sobieskiego'
>>> street = 'Jana IIi Sobieskiego'
>>> street = 'Jana lll Sobieskiego'  # three lowercase letters 'l'

Address prefix (street, road, court, place, etc.):

>>> prefix = 'ul'
>>> prefix = 'ul.'
>>> prefix = 'Ul.'
>>> prefix = 'UL.'
>>> prefix = 'ulica'
>>> prefix = 'Ulica'
>>>
>>> prefix = 'os'
>>> prefix = 'os.'
>>> prefix = 'Os.'
>>> prefix = 'osiedle'
>>> prefix = 'oś'
>>> prefix = 'oś.'
>>> prefix = 'Oś.'
>>> prefix = 'ośedle'
>>>
>>> prefix = 'pl'
>>> prefix = 'pl.'
>>> prefix = 'Pl.'
>>> prefix = 'plac'
>>>
>>> prefix = 'al'
>>> prefix = 'al.'
>>> prefix = 'Al.'
>>> prefix = 'aleja'
>>> prefix = 'aleia'
>>> prefix = 'alei'
>>> prefix = 'aleii'
>>> prefix = 'aleji'

House and apartment number:

>>> address = 'Ćwiartki 3/4'
>>> address = 'Ćwiartki 3 / 4'
>>> address = 'Ćwiartki 3 m. 4'
>>> address = 'Ćwiartki 3 m 4'
>>> address = 'Brighton Beach 1st apt 2'
>>> address = 'Brighton Beach 1st apt. 2'
>>> address = 'Myśliwiecka 3/5/7'
>>>
>>> address = 'Jana Twardowskiego 180f/8f'
>>> address = 'Jana Twardowskiego 180f/8'
>>> address = 'Jana Twardowskiego 180/8f'
>>>
>>> address = 'Jana Twardowskiego III 3 m. 3'
>>> address = 'Jana Twardowskiego 13d bud. A piętro II sala 3'

Phone Numbers:

>>> phone = '+48 (12) 355 5678'
>>> phone = '+48 123 555 678'
>>>
>>> phone = '123 555 678'
>>> phone = '123555678'
>>> phone = '+48123555678'
>>> phone = '+48 12 355 5678'
>>> phone = '+48 123-555-678'
>>> phone = '+48 123 555 6789'
>>> phone = '+1 (123) 555-6789'
>>> phone = '+1 (123).555.6789'
>>>
>>> phone = '+1 800-python'
>>> phone = '+1 800-798466'
>>>
>>> phone = '+48 123 555 678 wew. 1337'
>>> phone = '+48 123555678,1'
>>> phone = '+48 123555678,1,,2'

2.6.20. Assignments

Code 2.23. Solution
"""
* Assignment: Type String Normalize
* Complexity: easy
* Lines of code: 4 lines
* Time: 8 min

English:
    1. Use data from "Given" section (see below)
    2. Use `str` methods to clean `DATA`
    3. Compare result with "Tests" section (see below)

Tests:
    >>> type(result)
    <class 'str'>
    >>> result
    'Jana Twardowskiego III'
"""


# Given
DATA = 'UL. jana \tTWArdoWskIEGO 3'


Code 2.24. Solution
"""
* Assignment: Type String Clean
* Complexity: easy
* Lines of code: 8 lines
* Time: 13 min

English:
    1. Use data from "Given" section (see below)
    2. Expected value is `Jana III Sobieskiego`
    3. Use only `str` methods to clean each variable
    4. Discuss how to create a generic solution that fits all cases
    5. An implementation of such a generic function appears in the `Function Arguments Clean` chapter
    6. Compare result with "Tests" section (see below)

Tests:
    >>> example
    'Jana Twardowskiego III'
    >>> a
    'Jana III Sobieskiego'
    >>> b
    'Jana III Sobieskiego'
    >>> c
    'Jana III Sobieskiego'
    >>> d
    'Jana III Sobieskiego'
    >>> e
    'Jana III Sobieskiego'
    >>> f
    'Jana III Sobieskiego'
    >>> g
    'Jana III Sobieskiego'
    >>> h
    'Jana III Sobieskiego'
    >>> i
    'Jana III Sobieskiego'
"""


# Given
example = 'UL. jana \tTWArdoWskIEGO 3'
a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III\tSobieskiego '

example = example.upper().replace('UL. ', '').replace('\t', '').strip().title().replace('3', 'III')