7.6. File Encoding

7.6.1. Rationale

  • utf-8 - the most common encoding of Unicode, the international standard (should always be used!)

  • iso-8859-1 - ISO standard for Western Europe and USA

  • iso-8859-2 - ISO standard for Central Europe (including Poland)

  • cp1250 or windows-1250 - Central European encoding on Windows (including Polish)

  • cp1251 or windows-1251 - Cyrillic encoding on Windows (including Russian)

  • cp1252 or windows-1252 - Western European encoding on Windows

  • ascii - 7-bit encoding with 128 characters only (English letters, digits, punctuation, control codes)
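To see why the choice matters, the sketch below encodes the same one-letter string with several of the encodings listed above. The sample letter and the variable names are illustrative choices, not part of the original material.

```python
# Encode the same Polish letter with several encodings and compare the bytes.
TEXT = 'ą'  # sample character chosen for illustration

encoded = {}
for name in ['utf-8', 'iso-8859-2', 'cp1250']:
    encoded[name] = TEXT.encode(name)
    print(f'{name:<12}{encoded[name]!r}')

# ASCII covers only 128 characters, so encoding 'ą' fails.
try:
    TEXT.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print('ascii can encode it:', ascii_ok)
```

UTF-8 needs two bytes for this letter, while the single-byte encodings need one. The two single-byte encodings disagree on which byte that is, so a file must be read with the encoding it was written in.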

7.6.2. UTF-8

FILE = r'/tmp/myfile.txt'

with open(FILE, mode='w', encoding='utf-8') as file:
    file.write('Иван Иванович')

with open(FILE, encoding='utf-8') as file:
    print(file.read())
# Иван Иванович
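The encoding decides which bytes end up on disk. A minimal sketch, assuming a temporary directory and a hypothetical file name instead of /tmp/myfile.txt, reads the file back in binary mode to compare the number of characters with the number of bytes:

```python
import tempfile
from pathlib import Path

TEXT = 'Иван Иванович'

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'myfile.txt'  # hypothetical file name

    with open(path, mode='w', encoding='utf-8') as file:
        file.write(TEXT)

    # Binary mode returns the raw bytes, without decoding.
    with open(path, mode='rb') as file:
        raw = file.read()

print(len(TEXT))  # number of characters
print(len(raw))   # number of bytes on disk
print(raw[:2])    # bytes that encode the first letter
```

In UTF-8 each Cyrillic letter takes two bytes and the space takes one, so the file is larger than the character count suggests.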

7.6.3. Unicode Encode Error

FILE = r'/tmp/myfile.txt'

with open(FILE, mode='w', encoding='cp1250') as file:
    file.write('Иван Иванович')
# Traceback (most recent call last):
#   ...
# UnicodeEncodeError: 'charmap' codec can't encode characters in
# position 0-3: character maps to <undefined>
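The error above can be caught or avoided. The sketch below works on strings directly with str.encode(), which raises the same exception as writing to a file, and shows the errors argument as one possible way of handling unencodable characters. The variable names are illustrative.

```python
TEXT = 'Иван Иванович'

# Strict mode (the default) raises on the first unencodable run of characters.
try:
    TEXT.encode('cp1250')
except UnicodeEncodeError as error:
    print(error.encoding, error.start, error.end)

# errors='replace' substitutes '?' for every unencodable character.
replaced = TEXT.encode('cp1250', errors='replace')
print(replaced)

# An encoding that does contain Cyrillic letters works without errors.
cyrillic = TEXT.encode('cp1251')
print(len(cyrillic))
```

The built-in open() accepts the same errors argument. Replacing characters loses data, so choosing an encoding that covers the text (preferably utf-8) is usually the better fix.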

7.6.4. Unicode Decode Error

FILE = r'/tmp/myfile.txt'

with open(FILE, mode='w', encoding='utf-8') as file:
    file.write('Иван Иванович')

with open(FILE, encoding='cp1250') as file:
    print(file.read())
# Traceback (most recent call last):
#   ...
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 1: character maps to <undefined>
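A decode error is the lucky case, because the mismatch is reported. The sketch below, which uses bytes.decode() in place of a file and illustrative variable names, shows that a wrong encoding can also succeed silently and return garbled text:

```python
raw = 'Иван'.encode('utf-8')

# errors='replace' turns undecodable bytes into U+FFFD instead of raising.
lenient = raw.decode('cp1250', errors='replace')
print(lenient)

# iso-8859-1 maps every byte to a character, so decoding never fails,
# but the result is garbled text rather than the original.
garbled = raw.decode('iso-8859-1')
print(len(garbled))

# Only the encoding used for writing recovers the text.
restored = raw.decode('utf-8')
print(restored)
```

Note that errors='replace' only hides the problem. The original text is recovered only by decoding with the encoding that was used for writing.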

7.6.5. Escape Characters

  • \r\n - line ending used on Windows

  • \n - line ending used on Unix-like systems (Linux, macOS)

Figure 7.1. Why do we have '\r\n' on Windows?
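In text mode Python translates line endings according to the newline argument of open(). A minimal sketch, assuming a temporary directory and a hypothetical file name, writes Windows-style line endings and reads them back in three different ways:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'lines.txt'  # hypothetical file name

    # newline='\r\n' translates every '\n' written to '\r\n'.
    with open(path, mode='w', encoding='utf-8', newline='\r\n') as file:
        file.write('first\nsecond\n')

    # Binary mode shows what was actually stored.
    raw = path.read_bytes()

    # Default text mode translates '\r\n' back to '\n' when reading.
    with open(path, encoding='utf-8') as file:
        translated = file.read()

    # newline='' disables translation, so '\r\n' is preserved.
    with open(path, encoding='utf-8', newline='') as file:
        untranslated = file.read()

print(raw)
print(repr(translated))
print(repr(untranslated))
```

Because of this translation, code that splits text on '\n' usually works on files written on either system, as long as the file is opened in text mode with the default newline setting.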

Table 7.1. Frequently used escape characters

Sequence    Description
\n          New line (LF - Line Feed)
\r          Carriage Return (CR)
\t          Horizontal Tab (TAB)
\'          Single quote '
\"          Double quote "
\\          Backslash \

Table 7.2. Less frequently used escape characters

Sequence      Description
\a            Bell (BEL)
\b            Backspace (BS)
\f            New page (FF - Form Feed)
\v            Vertical Tab (VT)
\uF680        Character with 16-bit (2 bytes) hex value F680
\U0001F680    Character with 32-bit (4 bytes) hex value 0001F680
\ooo          Character with octal value ooo (one to three octal digits), e.g. \101 is A
\xhh          Character with hex value hh (exactly two hex digits), e.g. \x41 is A

print('\U0001F680')     # 🚀
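The numeric escapes differ in how many digits they consume. The sketch below writes the letter A with each of them and shows what happens when too many digits follow \x. The variable names are illustrative.

```python
# Each call below prints the letter 'A' built with a different escape.
print('\x41')        # hex escape: exactly two hex digits
print('\101')        # octal escape: up to three octal digits
print('\u0041')      # 16-bit escape: exactly four hex digits
print('\U00000041')  # 32-bit escape: exactly eight hex digits

# '\x' stops after two digits; the remaining digits are ordinary text.
tricky = '\x1F680'
print(len(tricky))

# Code points above FFFF need '\U' or a Unicode character name.
rocket = '\U0001F680'
print(rocket == '\N{ROCKET}')

# A raw string keeps the backslash literally.
raw = r'\n'
print(len(raw))
```

This is why the rocket needs the eight-digit \U form: \x1F680 is the control character 1F followed by the text 680, not a single character.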