7.6. File Encoding

7.6.1. Rationale

  • utf-8 - the most common encoding of Unicode, the international character standard (should always be used!)

  • iso-8859-1 - ISO standard for Western Europe and USA

  • iso-8859-2 - ISO standard for Central Europe (including Poland)

  • cp1250 or windows-1250 - Central European encoding on Windows (including Polish)

  • cp1251 or windows-1251 - Cyrillic encoding on Windows (including Russian)

  • cp1252 or windows-1252 - Western European encoding on Windows

  • ASCII - 7-bit encoding with 128 characters only (English letters, digits, punctuation and control characters)
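The difference between the encodings listed above is easiest to see by encoding one character with each of them. The snippet below is a minimal sketch; the sample letter and the names TEXT, ENCODINGS and result are chosen here for illustration only.

```python
TEXT = 'ą'  # LATIN SMALL LETTER A WITH OGONEK, U+0105

ENCODINGS = ['utf-8', 'iso-8859-2', 'cp1250', 'iso-8859-1', 'ascii']

result = {}
for encoding in ENCODINGS:
    try:
        result[encoding] = TEXT.encode(encoding)
    except UnicodeEncodeError:
        # The character does not exist in this character set
        result[encoding] = None

for encoding, data in result.items():
    print(f'{encoding:>10}: {data!r}')
```

The same letter takes two bytes in utf-8, one byte (with a different value) in iso-8859-2 and in cp1250, and cannot be written at all in iso-8859-1 or ASCII.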

../../_images/files-windows2000-notepad-saveas.png

Figure 7.1. Windows 2000 Notepad "Save As" window with an option to select the encoding. UTF-8 is not selected by default... Source: 1

../../_images/files-windows10-notepad-saveas.png

Figure 7.2. Windows 10 Notepad "Save As" window with an option to select the encoding. Since Windows 10 version 1903 (May 2019 Update), Notepad writes files in UTF-8 by default! Source: 2 3

../../_images/files-encoding-ascii2.jpg

Figure 7.3. ASCII table. Source: 4

../../_images/files-encoding-unicode2.png

Figure 7.4. Unicode. Source: 5

../../_images/files-encoding-unicode3.png

Figure 7.5. Unicode. Source: 6

7.6.2. UTF-8

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('Иван Иванович')
13
>>>
>>> with open(FILE, encoding='utf-8') as file:
...     print(file.read())
Иван Иванович

../../_images/files-encoding-utf.png

Figure 7.6. UTF-8. Source: 7

../../_images/files-encoding-utf2.jpg

Figure 7.7. UTF-8. Source: 8
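The 13 printed by file.write() in the example above is the number of characters written, not the number of bytes. A minimal sketch of the difference follows; it uses a temporary directory instead of /tmp so that it runs on any system, and the names TEXT, characters and size are illustrative.

```python
import os
import tempfile

TEXT = 'Иван Иванович'

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'myfile.txt')

    with open(path, mode='w', encoding='utf-8') as file:
        characters = file.write(TEXT)  # number of characters written

    size = os.path.getsize(path)       # number of bytes on disk

print(characters)  # 13
print(size)        # 25
```

Each of the 12 Cyrillic letters takes two bytes in UTF-8 and the space takes one, which gives 25 bytes for 13 characters.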

7.6.3. Unicode Encode Error

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='cp1250') as file:
...     file.write('Иван Иванович')
Traceback (most recent call last):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
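When the target encoding cannot be changed, the error can be caught or softened with the errors argument, which str.encode() and open() both accept. The following is a sketch of one possible approach, not a recommendation to lose data silently; the variable names are illustrative.

```python
text = 'Иван Иванович'

# Catch the error and inspect which part of the text failed
try:
    text.encode('cp1250')
except UnicodeEncodeError as error:
    start, end = error.start, error.end  # end is exclusive
    failed = text[start:end]

# Replace every character that cannot be encoded with '?'
replaced = text.encode('cp1250', errors='replace')

print(failed)    # Иван
print(replaced)  # b'???? ????????'
```

Only the space survives, because cp1250 has no Cyrillic letters at all.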

7.6.4. Unicode Decode Error

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE, mode='w', encoding='utf-8') as file:
...     file.write('Иван Иванович')
13
>>>
>>> with open(FILE, encoding='cp1250') as file:
...     print(file.read())
Traceback (most recent call last):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 1: character maps to <undefined>
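A decode error is the lucky case, because the problem is visible immediately. With some encodings the wrong choice raises nothing and quietly produces garbage (so-called mojibake). The sketch below works on bytes in memory rather than on a file, and its variable names are illustrative.

```python
text = 'Иван Иванович'
data = text.encode('utf-8')

# cp1250 has no character for byte 0x98, so decoding fails
try:
    data.decode('cp1250')
except UnicodeDecodeError as error:
    position = error.start
    bad_byte = data[position]

# iso-8859-1 maps every byte to some character, so decoding never fails
mojibake = data.decode('iso-8859-1')

print(position, hex(bad_byte))     # 1 0x98
print(len(text), len(mojibake))    # 13 25
print(data.decode('utf-8') == text)  # True
```

The iso-8859-1 result has 25 characters, one per byte, and none of them is a Cyrillic letter.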

7.6.5. Escape Characters

  • \r\n - line ending (CR LF) used on Windows

  • \n - line ending (LF) used everywhere else

../../_images/type-machine.jpg

Figure 7.8. Why do we have '\r\n' on Windows?
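In text mode Python translates line endings for you, and the newline argument of open() controls how. The sketch below forces Windows line endings on write and then reads the file back twice; it uses a temporary directory so it runs anywhere, and the names raw and text are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'myfile.txt')

    # newline='\r\n' writes every '\n' as '\r\n'
    with open(path, mode='w', encoding='utf-8', newline='\r\n') as file:
        file.write('first\nsecond\n')

    # Binary mode shows the bytes exactly as stored
    with open(path, mode='rb') as file:
        raw = file.read()

    # Default text mode translates '\r\n' back to '\n'
    with open(path, encoding='utf-8') as file:
        text = file.read()

print(raw)   # b'first\r\nsecond\r\n'
print(text.splitlines())  # ['first', 'second']
```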

Frequently used escape characters:

  • \n - New line (ENTER)

  • \t - Horizontal Tab (TAB)

  • \' - Single quote ' (escape in single quoted strings)

  • \" - Double quote " (escape in double quoted strings)

  • \\ - Backslash \ (indicates that this is not the start of an escape sequence)
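Each escape sequence from the list above becomes a single character in the resulting string. A short sketch follows, with a raw string (prefix r) for comparison; the Windows-style path is a made-up example.

```python
# Each escape sequence is one character in the string
newline = '\n'
tab = '\t'
quote = 'It\'s'
path = 'C:\\Users\\admin'

# A raw string keeps backslashes exactly as they are written
raw_path = r'C:\Users\admin'
raw_newline = r'\n'

print(len(newline), len(tab))  # 1 1
print(quote)                   # It's
print(path == raw_path)        # True
print(len(raw_newline))        # 2
```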

Less frequently used escape characters:

  • \a - Bell (BEL)

  • \b - Backspace (BS)

  • \f - New page (FF - Form Feed)

  • \v - Vertical Tab (VT)

  • \uF680 - Character with 16-bit (2 bytes) hex value F680

  • \U0001F680 - Character with 32-bit (4 bytes) hex value 0001F680

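  • \101 - Character with octal value 101 (letter A); one to three octal digits, there is no \o prefix

  • \x41 - Character with hex value 41 (letter A); exactly two hex digits, so '\x1F680' is the character \x1F followed by the text 680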
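The numeric escapes differ in how many digits they consume. The sketch below writes the same letter and the same emoji in several ways; the variable names are illustrative.

```python
letter_hex = '\x41'          # exactly two hex digits
letter_octal = '\101'        # one to three octal digits
letter_u16 = '\u0041'        # exactly four hex digits
rocket = '\U0001F680'        # exactly eight hex digits
rocket_named = '\N{ROCKET}'  # lookup by Unicode character name

# '\x' consumes only two digits, the rest is ordinary text
mixed = '\x1F680'

print(letter_hex, letter_octal, letter_u16)  # A A A
print(rocket == rocket_named)                # True
print(len(mixed))                            # 4
```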

Emoji:

>>> print('\U0001F680')
🚀
>>> a = '\U0001F9D1'  # 🧑
>>> b = '\U0000200D'  # zero width joiner (ZWJ), invisible
>>> c = '\U0001F680'  # 🚀
>>>
>>> astronaut = a + b + c
>>> print(astronaut)
🧑‍🚀
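Although the astronaut is displayed as one symbol, it is still a sequence of three code points. A minimal sketch that inspects it with the standard unicodedata module is shown below; the names returned depend on the Unicode version bundled with your Python, so a fallback value is used.

```python
import unicodedata

astronaut = '\U0001F9D1' + '\u200D' + '\U0001F680'

# One visible symbol, three code points
code_points = [f'U+{ord(char):04X}' for char in astronaut]
names = [unicodedata.name(char, 'UNKNOWN') for char in astronaut]

# 4 bytes + 3 bytes + 4 bytes in UTF-8
size = len(astronaut.encode('utf-8'))

print(len(astronaut))  # 3
print(code_points)     # ['U+1F9D1', 'U+200D', 'U+1F680']
print(size)            # 11
```

This is why len() on text with emoji can return more than the number of symbols visible on screen.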

More information in Builtin Printing and https://en.wikipedia.org/wiki/List_of_Unicode_characters