5. str

5.1. Defining str

  • " and ' work the same
name = ''
name = ""
name = 'Pan Twardowski'       # 'Pan Twardowski'
name = "Pan Twardowski"       # 'Pan Twardowski'

5.1.1. Multiline str

text = """First line
Second line
Third line
"""
# 'First line\nSecond line\nThird line\n'
text = """
    First line
    Second line
    Third line
"""
# '\n    First line\n    Second line\n    Third line\n'
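
The indentation inside the second form is part of the string. A minimal sketch of one way to remove it, assuming the standard-library function textwrap.dedent (not mentioned in the text above) is acceptable; the name clean is illustrative:

```python
from textwrap import dedent

text = """
    First line
    Second line
    Third line
"""

# dedent() removes the whitespace common to every line;
# strip() then drops the leading and trailing newline.
clean = dedent(text).strip()

print(repr(clean))
# 'First line\nSecond line\nThird line'
```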

5.2. Type casting to str

str('hello')        # 'hello'
str(1969)           # '1969'
str(13.37)          # '13.37'

5.3. Single or double quote?

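  • " and ' work the same
  • Choose one and use it consistently throughout the code
  • The Python console displays str using '
  • I use ' in this book to be consistent with Python
  • doctest compares results with the console representation, which uses single quotes, so an expected result written with double quotes fails the test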
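
A short sketch of why the console shows single quotes: it displays the result of repr(), whichever quotes were used in the source. The variable name is illustrative:

```python
name = "Pan Twardowski"   # defined with double quotes

# repr() is what the console displays; it prefers single quotes
print(repr(name))
# 'Pan Twardowski'

# When the text itself contains an apostrophe, repr() switches to double quotes
print(repr("It's"))
# "It's"
```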

5.3.1. When to use double quotes?

my_str = 'It\'s Twardowski\'s Moon.'
my_str = "It's Twardowski's Moon."

5.3.2. When to use single quotes?

  • HTML and XML use double quotes
my_str = '<a href="http://python.astrotech.io">Python and Machine Learning</a>'

5.3.3. When to use multiline str?

my_str = """My name's "José Jiménez\""""    # a " right before the closing """ must be escaped
my_str = '''My name's "José Jiménez"'''

5.4. Escape characters

5.4.1. New lines

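  • \n is a line feed (LF), the newline on Linux and macOS
  • \r\n is a carriage return followed by a line feed (CRLF), the newline on Windows

Fig. 5.1. Why do we have '\r\n' on Windows?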
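
A minimal sketch showing that both newline conventions can be handled the same way with str.splitlines(); the sample texts are made up for illustration:

```python
unix_text = 'First line\nSecond line\n'
windows_text = 'First line\r\nSecond line\r\n'

# str.splitlines() recognises both conventions and drops the line endings
unix_lines = unix_text.splitlines()
windows_lines = windows_text.splitlines()

print(unix_lines)                   # ['First line', 'Second line']
print(unix_lines == windows_lines)  # True
```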

5.4.2. Other escape characters

Tab. 5.1. Escape characters
Escape sequence   Description
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\uxxxx            Character with 16-bit hex value xxxx
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo
\xhh              Character with hex value hh (exactly two hex digits)
\x1F            # after \x go exactly two hexadecimal digits
\U0001F680      # after \U go exactly eight hexadecimal digits (\u takes four)
print('\U0001F680')     # 🚀
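
The digit counts in the table can be checked directly. A short sketch; the characters chosen here are only examples:

```python
# \x takes exactly two hex digits, \u four, \U eight
print('\x41')               # A
print('\u0105')             # ą
print('\U0001F680')         # 🚀

# Extra digits after \x are ordinary characters:
# '\x1F680' is '\x1f' followed by '6', '8', '0'
print(len('\x1F680'))       # 4

# ord() and chr() convert between a character and its code point
print(hex(ord('🚀')))       # 0x1f680
print(chr(0x1F680))         # 🚀
```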

5.5. Characters before strings

5.5.1. Format String

  • String interpolation (variable substitution)
  • Since Python 3.6
name = 'José Jiménez'

print(f'My name... {name}')
# My name... José Jiménez
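
The braces accept any expression, not only a variable name, and an optional format specification after a colon. A minimal sketch; age and height are hypothetical values, not taken from the text above:

```python
name = 'José Jiménez'
age = 42          # hypothetical value
height = 1.8      # hypothetical value

line1 = f'My name... {name}'
line2 = f'Next year I will be {age + 1}'   # any expression is allowed
line3 = f'Height: {height:.2f} m'          # format spec after the colon

print(line1)  # My name... José Jiménez
print(line2)  # Next year I will be 43
print(line3)  # Height: 1.80 m
```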

5.5.2. Unicode literals

  • In Python 3 str is Unicode
  • In Python 2 str is bytes
  • In Python 3 u'...' is kept only for compatibility with Python 2
u'zażółć gęślą jaźń'

5.5.3. Bytes literals

  • Used while reading from low level devices and drivers
  • Used in sockets and HTTP connections
  • bytes is a sequence of octets (integers between 0 and 255)
  • bytes.decode() converts bytes to str
  • str.encode() converts str to bytes
b'this is a bytes literal'
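
A sketch of the round trip between str and bytes, assuming the UTF-8 encoding (the text above does not name one):

```python
text = 'zażółć gęślą jaźń'

data = text.encode('utf-8')      # str -> bytes
print(type(data))                # <class 'bytes'>

# 17 characters, but 26 bytes: each Polish diacritic takes two bytes in UTF-8
print(len(text), len(data))      # 17 26

print(data[0])                   # 122, the integer value of b'z'
print(data.decode('utf-8'))      # zażółć gęślą jaźń
```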

5.5.4. Raw String

  • Escape sequences are not interpreted; a backslash stays a literal character
r'(?P<foo>)\n'
path = r'C:\Users\Admin\file.txt'

print(path)
# C:\Users\Admin\file.txt
path = 'C:\Users\Admin\file.txt'

print(path)
# SyntaxError: (unicode error) 'unicodeescape'
#   codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
  • Problem: \Users
  • After \U Python expects a Unicode code point written as eight hexadecimal digits
  • s is not a valid hexadecimal digit
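
The difference between a raw and a regular literal can be seen by counting characters. A short sketch with illustrative names:

```python
normal = 'a\nb'     # \n is one newline character
raw = r'a\nb'       # backslash and 'n' are two separate characters

print(len(normal))  # 3
print(len(raw))     # 4

# Without the r prefix every backslash has to be doubled
print(r'C:\Users\Admin\file.txt' == 'C:\\Users\\Admin\\file.txt')  # True
```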

5.6. String methods

5.6.1. String immutability

  • str is immutable
  • str methods create a new modified str
a = 'Python'
a.replace('P', 'J')

print(a)  # Python
a = 'Python'
b = a.replace('P', 'J')

print(a)  # Python
print(b)  # Jython
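
Immutability also means that a single character cannot be assigned in place. A minimal sketch; it uses try/except and slicing, which this chapter has not introduced:

```python
a = 'Python'

try:
    a[0] = 'J'              # in-place modification is not allowed
except TypeError as error:
    message = str(error)

print(message)
# 'str' object does not support item assignment

a = 'J' + a[1:]             # build a new str and rebind the name
print(a)  # Jython
```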

5.6.2. String Arithmetic

first_name = 'Pan'
last_name = 'Twardowski'

name = first_name + ' ' + last_name
# 'Pan Twardowski'
'José' * 3          # JoséJoséJosé
'-' * 10            # ----------

5.6.3. str.title(), str.lower(), str.upper()

  • Unify data format before analysis
name = 'pAn TwARDowSKi III'

name.upper()       # 'PAN TWARDOWSKI III'
name.lower()       # 'pan twardowski iii'
name.title()       # 'Pan Twardowski Iii'
name.capitalize()  # 'Pan twardowski iii'

5.6.4. str.replace()

name = 'Pan Twardowski Iii'

name.replace('Iii', 'III')
# 'Pan Twardowski III'

5.6.5. str.strip(), str.lstrip(), str.rstrip()

name = '\tPan Twardowski    \n'

name.strip()        # 'Pan Twardowski'
name.rstrip()       # '\tPan Twardowski'
name.lstrip()       # 'Pan Twardowski    \n'

5.6.6. str.startswith() and str.endswith()

  • Understand this as “starts with” and “ends with”
name = 'Pan Twardowski'

name.startswith('Pan')  # True
name.endswith(';')      # False
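
Both methods also accept a tuple of alternatives, and the comparison is case sensitive. A sketch using a street name from later in this chapter; the list of prefixes is an assumption for illustration:

```python
street = 'ul. Jana III Sobieskiego'

# A tuple of prefixes: True if the text starts with any of them
print(street.startswith(('ul', 'os', 'al')))   # True

# Comparison is case sensitive
print(street.startswith('UL'))                 # False
print(street.upper().startswith('UL'))         # True
```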

5.6.7. str.split()

text = 'We choose to go to the Moon'

text.split()
# ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']
text = '10.13.37.1      nasa.gov esa.int roscosmos.ru'

text.split(' ')
# ['10.13.37.1', '', '', '', '', '', 'nasa.gov', 'esa.int', 'roscosmos.ru']

text.split()
# ['10.13.37.1', 'nasa.gov', 'esa.int', 'roscosmos.ru']
setosa = '5.1,3.5,1.4,0.2,setosa'

setosa.split(',')
# ['5.1', '3.5', '1.4', '0.2', 'setosa']

5.6.8. str.join()

text = ['We', 'choose', 'to', 'go', 'to', 'the', 'Moon']

' '.join(text)
# 'We choose to go to the Moon'
  • str.join() accepts only str items; a float in the list raises TypeError
setosa = ['5.1', '3.5', '1.4', '0.2', 'setosa']

','.join(setosa)
# '5.1,3.5,1.4,0.2,setosa'
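
str.join() works only on str items. A sketch of one way to join a list that mixes numbers and text, converting each item with str() first; the names row and line are illustrative:

```python
row = [5.1, 3.5, 1.4, 0.2, 'setosa']

try:
    ','.join(row)
except TypeError as error:
    print(error)
    # sequence item 0: expected str instance, float found

# Convert each item to str before joining
line = ','.join(str(item) for item in row)
print(line)
# 5.1,3.5,1.4,0.2,setosa
```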

5.6.9. str.isspace()

''.isspace()        # False
' '.isspace()       # True
'\t'.isspace()      # True
'\n'.isspace()      # True

5.6.10. str.isalpha()

'hello'.isalpha()   # True
'hello1'.isalpha()  # False

5.6.11. str in str

'th' in 'Python'     # True
'hello' in 'Python'  # False

5.6.12. len()

len('Python')   # 6
len('')         # 0

5.6.13. Multiple method calls in one line

a = 'Python'
b = a.upper().replace('P', 'C').title()

print(a)            # Python
print(b)            # Cython
a = 'Python'

b = a.upper().startswith('P').replace('P', 'C')
# AttributeError: 'bool' object has no attribute 'replace'

5.7. Getting text from user

  • input() always returns str
  • Put a space at the end of the prompt, so the typed text does not touch it
name = input('Type your name: ')
# User inputs: Pan Twardowski

print(name)     # Pan Twardowski
type(name)      # <class 'str'>
age = input('Type your age: ')
# User inputs: 42

print(age)      # 42
type(age)       # <class 'str'>
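
Because input() returns str, a number typed by the user has to be converted before any arithmetic. A minimal sketch; the variable raw stands in for the value input() would return, so the example runs without a keyboard:

```python
raw = ' 42\n'               # stands in for: raw = input('Type your age: ')

print(raw * 2 == ' 42\n 42\n')  # True: on str, * repeats the text

age = int(raw.strip())      # remove stray whitespace, then convert
print(age)                  # 42
print(type(age))            # <class 'int'>
print(age + 1)              # 43
```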

5.8. Cleaning str from user input

  • It is often said that about 80% of machine learning and data science work is cleaning data

5.8.1. Is this the same address?

  • This is a dump of distinct records for a single address
  • Which of the records below is the true address?
'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three small letters 'L'

5.8.2. Different ways of spelling and abbreviating

'ul '
'ul. '
'ul.'
'ulica'
'Ul. '
'UL. '
'ulica '
'Ulica. '
'os. '
'ośedle'
'osiedle'
'os'
'plac '
'pl '
'al '
'al. '
'aleja '
'alei '
'aleia'
'aleii'
'aleji'

5.8.3. House number and apartment

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'

'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'

'180f/8f'
'180f/8'
'180/8f'

'13d bud. A'
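
A minimal sketch of normalizing only the slash-separated variants above by removing spaces. Records such as '1 m. 2', '1 apt 2' or '3/5/7' would need additional rules that are not covered here; all names are illustrative:

```python
records = ['1/2', '1 / 2', '1/ 2', '1 /2']

# Removing every space makes the four spellings identical
cleaned = [record.replace(' ', '') for record in records]
print(cleaned)              # ['1/2', '1/2', '1/2', '1/2']
print(set(cleaned))         # {'1/2'}

# The separator then splits house number from apartment number
house, apartment = cleaned[0].split('/')
print(house, apartment)     # 1 2
```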

5.9. Assignments

5.9.1. Emoticon print

  • Filename: types_emoticon.py
  • Lines of code to write: 4 lines
  • Estimated time of completion: 10 min
  1. Read the user's name

  2. Print hello NAME EMOTICON, where:

    • NAME is the name entered by the user
    • EMOTICON is the Unicode code point "U+1F642"
The whys and wherefores:
 
  • Defining variables
  • Using print formatting
  • Reading text from the user

5.9.2. Variables and types

  • Filename: types_str_input.py
  • Lines of code to write: 4 lines
  • Estimated time of completion: 10 min
  1. Read the user's name

  2. Using f-string formatting, print on the screen:

    '''My name... "José Jiménez".
            I'm an """astronaut!"""'''
    
  3. Note: the second line starts with a tab

  4. The value in double quotes is the text entered by the user (in the example the user typed José Jiménez)

  5. Pay attention to apostrophes, quotation marks, tabs and newlines

  6. In the string to print, do not indent with spaces or break lines with Enter; use \n and \t instead

  7. Do not use string concatenation (str + str)

The whys and wherefores:
 
  • Defining variables
  • Using print formatting
  • Reading text from the user

5.9.3. String cleaning

  • Filename: types_str_cleaning.py
  • Lines of code to write: 11 lines
  • Estimated time of completion: 15 min
  1. Clean the data below so that every variable has the value 'Jana III Sobieskiego'
  2. Discuss how to build a generic solution that fits all the cases (the implementation comes in the chapter Function Basics)
expected = 'Jana III Sobieskiego'

a = '  Jana III Sobieskiego '
b = 'ul Jana III SobIESkiego'
c = '\tul. Jana trzeciego Sobieskiego'
d = 'ulicaJana III Sobieskiego'
e = 'UL. JA\tNA 3 SOBIES\tKIEGO'
f = 'UL. jana III SOBiesKIEGO'
g = 'ULICA JANA III SOBIESKIEGO  '
h = 'ULICA. JANA III SOBIeskieGO'
i = ' Jana 3 Sobieskiego  '
j = 'Jana III\tSobieskiego '
k = 'ul.Jana III Sob\n\nieskiego\n'

print(f'{a == expected}\t a: "{a}"')
print(f'{b == expected}\t b: "{b}"')
print(f'{c == expected}\t c: "{c}"')
print(f'{d == expected}\t d: "{d}"')
print(f'{e == expected}\t e: "{e}"')
print(f'{f == expected}\t f: "{f}"')
print(f'{g == expected}\t g: "{g}"')
print(f'{h == expected}\t h: "{h}"')
print(f'{i == expected}\t i: "{i}"')
print(f'{j == expected}\t j: "{j}"')
print(f'{k == expected}\t k: "{k}"')
The whys and wherefores:
 
  • Defining variables
  • Using print formatting
  • Reading text from the user