8.5. File Read

8.5.1. Rationale

  • Works with both relative and absolute paths

  • Fails when the directory containing the file cannot be accessed

  • Fails when the file itself cannot be accessed

  • Uses context manager

  • mode parameter to open() function is optional (defaults to mode='rt')
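
A minimal sketch of the points above, assuming a throwaway file created in a temporary directory (the names myfile.txt and missing.txt are illustrative only). It opens the same file through an absolute and a relative path, then shows the exception raised for a file that does not exist.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    absolute = os.path.join(directory, 'myfile.txt')

    # Prepare the file so that the example is self-contained
    with open(absolute, mode='wt') as file:
        file.write('hello\n')

    # Absolute path: starts at the filesystem root
    with open(absolute) as file:
        from_absolute = file.read()

    # Relative path: resolved against the current working directory
    # (assumes the temporary directory is on the same drive as the cwd)
    relative = os.path.relpath(absolute)
    with open(relative) as file:
        from_relative = file.read()

    # A missing file raises FileNotFoundError; a file or directory
    # without the required permissions raises PermissionError instead
    try:
        open(os.path.join(directory, 'missing.txt'))
    except FileNotFoundError as error:
        outcome = type(error).__name__

print(from_absolute == from_relative)  # True
print(outcome)                         # FileNotFoundError
```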

8.5.2. Read From File

  • Always remember to close the file when you are done with it

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> file = open(FILE)
    >>> data = file.read()
    >>> file.close()
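
If read() raises an exception, the close() call in the example above is never reached. One possible way to guard against that without a context manager is a try/finally block. The sketch below assumes a temporary file with illustrative content.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, 'myfile.txt')  # hypothetical path

    # Prepare the file so that the example is self-contained
    file = open(FILE, mode='wt')
    file.write('hello\n')
    file.close()

    file = open(FILE)
    try:
        data = file.read()
    finally:
        # Runs whether or not read() raised an exception
        file.close()

print(data.strip())  # hello
print(file.closed)   # True
```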
    

8.5.3. Read Using Context Manager

  • Context managers use with ... as ...: syntax

  • It closes the file automatically when the block exits (on dedent), even if an exception was raised inside the block

  • Using context manager is best practice

  • More information in Protocol Context Manager

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     data = file.read()
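
The automatic closing can be observed through the closed attribute of the file object. The following sketch assumes a temporary file with illustrative content and a simulated failure inside the block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, 'myfile.txt')  # hypothetical path

    with open(FILE, mode='wt') as file:
        file.write('hello\n')

    with open(FILE) as file:
        data = file.read()
        open_inside = not file.closed  # still open inside the block
    closed_after = file.closed         # closed right after the block

    # The file is closed even when the block is left by an exception
    try:
        with open(FILE) as file:
            raise RuntimeError('simulated failure')
    except RuntimeError:
        closed_after_error = file.closed

print(open_inside, closed_after, closed_after_error)  # True True True
```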
    

8.5.4. Read File at Once

  • Note that the whole file must fit into memory

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     data = file.read()
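
When a file is too large to read at once, one alternative is to pass a size to read() and process the file in chunks. This is only a sketch: CHUNK_SIZE is an illustrative name and is set to a tiny value so that the effect is visible. It requires Python 3.8+ for the := operator.

```python
import os
import tempfile

CHUNK_SIZE = 4  # characters per read; real code would use a larger value

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, 'myfile.txt')  # hypothetical path

    with open(FILE, mode='wt') as file:
        file.write('abcdefghij')

    chunks = []
    with open(FILE) as file:
        # read(n) returns at most n characters and '' at end of file
        while chunk := file.read(CHUNK_SIZE):
            chunks.append(chunk)

print(chunks)  # ['abcd', 'efgh', 'ij']
```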
    

8.5.5. Read File as List of Lines

  • Note that the whole file must fit into memory

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     data = file.readlines()
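
Each element returned by readlines() keeps its trailing newline character. The sketch below, using a temporary file with illustrative content, compares readlines() with two ways of getting lines without the newline.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, 'myfile.txt')  # hypothetical path

    with open(FILE, mode='wt') as file:
        file.write('first\nsecond\nthird\n')

    # readlines() keeps the newline at the end of each line
    with open(FILE) as file:
        with_newlines = file.readlines()

    # read().splitlines() drops the line endings
    with open(FILE) as file:
        without_newlines = file.read().splitlines()

    # Stripping each line gives the same result
    with open(FILE) as file:
        stripped = [line.rstrip('\n') for line in file.readlines()]

print(with_newlines)     # ['first\n', 'second\n', 'third\n']
print(without_newlines)  # ['first', 'second', 'third']
```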
    

Read selected lines from file; the slice [1:30] skips the first line and returns lines 2 through 30 (at most 29 lines):

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE) as file:
...     lines = file.readlines()[1:30]

Print selected lines from file (the slice [1:30] covers lines 2 through 30):

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE) as file:
...     for line in file.readlines()[1:30]:
...         print(line)
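
Slicing the result of readlines() still loads the whole file first. As an alternative sketch (not the approach used in the examples above), itertools.islice can take the same range directly from the file object and stops reading after the last requested line. The file content here is generated for illustration.

```python
import os
import tempfile
from itertools import islice

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, 'myfile.txt')  # hypothetical path

    # 50 numbered lines: 'line 0', 'line 1', ...
    with open(FILE, mode='wt') as file:
        file.writelines(f'line {i}\n' for i in range(50))

    # Same range as readlines()[1:30], but read lazily
    with open(FILE) as file:
        selected = list(islice(file, 1, 30))

    with open(FILE) as file:
        sliced = file.readlines()[1:30]

print(len(selected))          # 29
print(selected[0].strip())    # line 1
print(selected[-1].strip())   # line 29
```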

Read whole file and split by lines, separate header from content:

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE) as file:  # doctest: +SKIP
...     header, *content = file.readlines()
...
...     for line in content:
...         print(line)

8.5.6. Reading File as Generator

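  • Iterate over the file object to read it line by line, without loading the whole file into memory

  • In these examples, file is a lazy iterator (it behaves like a generator and yields one line at a time)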

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     for line in file:
    ...         print(line)
    
    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     header = file.readline()
    ...
    ...     for line in file:
    ...         print(line)
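
The iterator behaviour can be checked directly. In this sketch (temporary file, illustrative content) next() consumes the header, and the loop that follows continues from the second line because the file object remembers its position.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, 'myfile.txt')  # hypothetical path

    with open(FILE, mode='wt') as file:
        file.write('header\nrow1\nrow2\n')

    with open(FILE) as file:
        # A file object is its own iterator
        is_own_iterator = iter(file) is file

        # next() reads exactly one line, here the header
        header = next(file).rstrip('\n')

        # Iteration resumes after the header
        rows = [line.rstrip('\n') for line in file]

print(is_own_iterator)  # True
print(header)           # header
print(rows)             # ['row1', 'row2']
```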
    

8.5.7. Examples

>>> def isnumeric(x):
...     try:
...         float(x)
...         return True
...     except ValueError:
...         return False
>>>
>>>
>>> def clean(line):
...     line = line.strip().split(',')
...     line = map(lambda x: float(x) if isnumeric(x) else x, line)
...     return tuple(line)
>>>
>>>
>>> with open(FILE) as file:
...     header = clean(file.readline())
...
...     for line in file:
...         line = clean(line)
...         print(line)
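
Sum all numeric values in the file; the total stays 0 when the file contains no numeric fields:

>>> total = 0
>>>
>>> with open(FILE) as file:
...     for line in file:
...         values = clean(line)
...         total += sum(x for x in values if isinstance(x, float))
>>>
>>> print(total)
0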

8.5.8. Assignments

Code 8.9. Solution
"""
* Assignment: File Read Str
* Complexity: easy
* Lines of code: 2 lines
* Time: 3 min

English:
    1. Use data from "Given" section (see below)
    2. Write `DATA` to file `FILE`
    3. Read `FILE` to `result: str`
    4. Print `result`
    5. Compare result with "Tests" section (see below)

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Zapisz `DATA` do pliku `FILE`
    3. Wczytaj `FILE` do `result: str`
    4. Wypisz `result`
    5. Porównaj wyniki z sekcją "Tests" (patrz poniżej)

Tests:
    >>> import sys
    >>> sys.tracebacklimit = 0

    >>> assert type(result) is str

    >>> result
    'hello'

    >>> from os import remove
    >>> remove(FILE)
"""


# Given
FILE = r'_temporary.txt'
DATA = 'hello'

with open(FILE, mode='wt') as file:
    file.write(DATA)


Code 8.10. Solution
"""
* Assignment: File Read Multiline
* Complexity: easy
* Lines of code: 3 lines
* Time: 3 min

English:
    1. Use data from "Given" section (see below)
    2. Write `DATA` to file `FILE`
    3. Read `FILE` to `result: list[str]`
    4. Print `result`
    5. Compare result with "Tests" section (see below)

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Zapisz `DATA` do pliku `FILE`
    3. Wczytaj `FILE` do `result: list[str]`
    4. Wypisz `result`
    5. Porównaj wyniki z sekcją "Tests" (patrz poniżej)

Tests:
    >>> import sys
    >>> sys.tracebacklimit = 0

    >>> assert type(result) is list
    >>> assert all(type(x) is str for x in result)

    >>> result
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

    >>> from os import remove
    >>> remove(FILE)
"""


# Given
FILE = r'_temporary.txt'
DATA = 'sepal_length\nsepal_width\npetal_length\npetal_width\nspecies\n'

with open(FILE, mode='wt') as file:
    file.write(DATA)


Code 8.11. Solution
"""
* Assignment: File Read CSV
* Complexity: easy
* Lines of code: 15 lines
* Time: 8 min

English:
    1. Use data from "Given" section (see below)
    2. Write `DATA` to file `FILE`
    3. Read `FILE`
    4. Separate header from data
    5. Write header (first line) to `header`
    6. Read file and for each line:
        a. Strip whitespace
        b. Split line by comma `,`
        c. Convert measurements to `tuple[float]`
        d. Append measurements to `features`
        e. Append species name to `labels`
    7. Print `header`, `features` and `labels`
    8. Compare result with "Tests" section (see below)

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Zapisz `DATA` do pliku `FILE`
    3. Wczytaj `FILE`
    4. Odseparuj nagłówek od danych
    5. Zapisz nagłówek (pierwsza linia) do `header`
    6. Zaczytaj plik i dla każdej linii:
        a. Usuń białe znaki z początku i końca linii
        b. Podziel linię po przecinku `,`
        c. Przekonwertuj pomiary do `tuple[float]`
        d. Dodaj pomiary do `features`
        e. Dodaj gatunek do `labels`
    7. Wyświetl `header`, `features` i `labels`
    8. Porównaj wyniki z sekcją "Tests" (patrz poniżej)

Hints:
    * `tuple(float(x) for x in X)`

Tests:
    >>> header
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
    >>> features  # doctest: +NORMALIZE_WHITESPACE
    [(5.4, 3.9, 1.3, 0.4),
     (5.9, 3.0, 5.1, 1.8),
     (6.0, 3.4, 4.5, 1.6),
     (7.3, 2.9, 6.3, 1.8),
     (5.6, 2.5, 3.9, 1.1),
     (5.4, 3.9, 1.3, 0.4)]
    >>> labels
    ['setosa', 'virginica', 'versicolor', 'virginica', 'versicolor', 'setosa']
    >>> from os import remove
    >>> remove(FILE)
"""


# Given
FILE = r'_temporary.csv'

DATA = """sepal_length,sepal_width,petal_length,petal_width,species
5.4,3.9,1.3,0.4,setosa
5.9,3.0,5.1,1.8,virginica
6.0,3.4,4.5,1.6,versicolor
7.3,2.9,6.3,1.8,virginica
5.6,2.5,3.9,1.1,versicolor
5.4,3.9,1.3,0.4,setosa
"""

header = []
features = []
labels = []

with open(FILE, mode='w') as file:
    file.write(DATA)
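
The steps of this assignment can be sketched as follows. This is one possible approach, not necessarily the intended solution. It writes the data to a temporary directory instead of `_temporary.csv` and uses a shortened three-row `DATA` for brevity.

```python
import os
import tempfile

DATA = """sepal_length,sepal_width,petal_length,petal_width,species
5.4,3.9,1.3,0.4,setosa
5.9,3.0,5.1,1.8,virginica
6.0,3.4,4.5,1.6,versicolor
"""

features = []
labels = []

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, '_temporary.csv')

    with open(FILE, mode='w') as file:
        file.write(DATA)

    with open(FILE) as file:
        # The first line is the header
        header = file.readline().strip().split(',')

        # The remaining lines hold measurements followed by a species name
        for line in file:
            *measurements, species = line.strip().split(',')
            features.append(tuple(float(x) for x in measurements))
            labels.append(species)

print(header)
print(features)
print(labels)
```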


Code 8.12. Solution
"""
* Assignment: File Read Dict
* Complexity: medium
* Lines of code: 10 lines
* Time: 8 min

English:
    1. Use data from "Given" section (see below)
    2. Write `DATA` to file `FILE`
    3. Read `FILE` and for each line:
        a. Remove leading and trailing whitespace
        b. Skip line if it is empty
        c. Split line by whitespace
        d. Separate IP address and hostnames
        e. Append IP address and hostnames to `result`
    4. Merge hostnames for the same IP
    5. Compare result with "Tests" section (see below)

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Zapisz `DATA` do pliku `FILE`
    3. Wczytaj `FILE` i dla każdej linii:
        a. Usuń białe znaki na początku i końcu linii
        b. Pomiń linię, jeżeli jest pusta
        c. Podziel linię po białych znakach
        d. Odseparuj adres IP i nazwy hostów
        e. Dodaj adres IP i nazwy hostów do `result`
    4. Scal nazwy hostów dla tego samego IP
    5. Porównaj wyniki z sekcją "Tests" (patrz poniżej)

Hints:
    * `str.isspace()`
    * `str.split()`

Tests:
    >>> result  # doctest: +NORMALIZE_WHITESPACE
    {'127.0.0.1': ['localhost'],
     '10.13.37.1': ['nasa.gov', 'esa.int', 'roscosmos.ru'],
     '255.255.255.255': ['broadcasthost'],
     '::1': ['localhost']}
    >>> from os import remove
    >>> remove(FILE)
"""


# Given
FILE = r'_temporary.txt'

DATA = """127.0.0.1       localhost
10.13.37.1      nasa.gov esa.int roscosmos.ru
255.255.255.255 broadcasthost
::1             localhost
"""

with open(FILE, mode='w') as file:
    file.write(DATA)


result = {}
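
A possible sketch of the steps above, assuming a temporary directory instead of `_temporary.txt`. It uses `dict.setdefault()` to merge hostnames under the same IP address. This is one of several reasonable approaches.

```python
import os
import tempfile

DATA = """127.0.0.1       localhost
10.13.37.1      nasa.gov esa.int roscosmos.ru
255.255.255.255 broadcasthost
::1             localhost
"""

result = {}

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, '_temporary.txt')

    with open(FILE, mode='w') as file:
        file.write(DATA)

    with open(FILE) as file:
        for line in file:
            line = line.strip()

            # Skip empty lines
            if not line:
                continue

            # First field is the IP address, the rest are hostnames
            ip, *hostnames = line.split()

            # Create the list on first use, then extend it
            result.setdefault(ip, []).extend(hostnames)

print(result)
```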


Code 8.13. Solution
"""
* Assignment: File Read List of Dicts
* Complexity: hard
* Lines of code: 19 lines
* Time: 21 min

English:
    1. Use data from "Given" section (see below)
    2. Read file and for each line:
        a. Skip line if it's empty, is whitespace or starts with comment `#`
        b. Remove leading and trailing whitespace
        c. Split line by whitespace
        d. Separate IP address and hostnames
        e. Use one line `if` to check whether dot `.` is in the IP address
        f. If the dot is present, the protocol is IPv4, otherwise IPv6
        g. Append IP address and hostnames to `result`
    3. Merge hostnames for the same IP
    4. Compare result with "Tests" section (see below)

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Przeczytaj plik i dla każdej linii:
        a. Pomiń linię jeżeli jest pusta, jest białym znakiem lub zaczyna się od komentarza `#`
        b. Usuń białe znaki na początku i końcu linii
        c. Podziel linię po białych znakach
        d. Odseparuj adres IP i nazwy hostów
        e. Wykorzystaj jednolinikowego `if` do sprawdzenia czy jest kropka `.` w adresie IP
        f. Jeżeli jest obecna to protokół  jest IPv4, w przeciwnym przypadku IPv6
        g. Dodaj adres IP i nazwy hostów do `result`
    3. Scal nazwy hostów dla tego samego IP
    4. Porównaj wyniki z sekcją "Tests" (patrz poniżej)

Hints:
    * `str.split()` - without an argument
    * `len(line) == 0`
    * `line.startswith('#')`
    * `protocol = 'IPv4' if '.' in ip else 'IPv6'`

Tests:
    >>> result  # doctest: +NORMALIZE_WHITESPACE
    [{'ip': '127.0.0.1', 'hostnames': ['localhost', 'astromatt'], 'protocol': 'IPv4'},
     {'ip': '10.13.37.1', 'hostnames': ['nasa.gov', 'esa.int', 'roscosmos.ru'], 'protocol': 'IPv4'},
     {'ip': '255.255.255.255', 'hostnames': ['broadcasthost'], 'protocol': 'IPv4'},
     {'ip': '::1', 'hostnames': ['localhost'], 'protocol': 'IPv6'}]
    >>> from os import remove
    >>> remove(FILE)
"""


# Given
FILE = r'_temporary.txt'

DATA = """
##
# `/etc/hosts` structure:
#   - IPv4 or IPv6
#   - Hostnames
 ##

127.0.0.1       localhost
127.0.0.1       astromatt
10.13.37.1      nasa.gov esa.int roscosmos.ru
255.255.255.255 broadcasthost
::1             localhost
"""

with open(FILE, mode='w') as file:
    file.write(DATA)


result: list
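
One way to sketch the steps of this assignment is shown below, assuming a temporary directory instead of `_temporary.txt`. The line is stripped before the comment check, so the indented ` ##` line in `DATA` is skipped as well. The `for ... else` loop is just one way of merging hostnames for a repeated IP address.

```python
import os
import tempfile

DATA = """
##
# `/etc/hosts` structure:
#   - IPv4 or IPv6
#   - Hostnames
 ##

127.0.0.1       localhost
127.0.0.1       astromatt
10.13.37.1      nasa.gov esa.int roscosmos.ru
255.255.255.255 broadcasthost
::1             localhost
"""

result = []

with tempfile.TemporaryDirectory() as directory:
    FILE = os.path.join(directory, '_temporary.txt')

    with open(FILE, mode='w') as file:
        file.write(DATA)

    with open(FILE) as file:
        for line in file:
            line = line.strip()

            # Skip empty lines and comments
            if len(line) == 0 or line.startswith('#'):
                continue

            ip, *hostnames = line.split()

            # Merge with an existing record for the same IP, if any
            for record in result:
                if record['ip'] == ip:
                    record['hostnames'].extend(hostnames)
                    break
            else:
                # No record found: create a new one
                result.append({
                    'ip': ip,
                    'hostnames': hostnames,
                    'protocol': 'IPv4' if '.' in ip else 'IPv6',
                })

print(result)
```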

Code 8.14. Solution
"""
* Assignment: File Read Passwd
* Complexity: hard
* Lines of code: 100 lines
* Time: 55 min

English:
    1. Use data from "Given" section (see below)
    2. Save listings content to files:
        a. `etc-passwd.txt`
        b. `etc-shadow.txt`
        c. `etc-group.txt`
    3. Copy also comments and empty lines
    4. Parse files and convert them to `result: list[dict]`
    5. Return list of users with `UID` greater than or equal to 1000
    6. User dict should contain data collected from all files
    7. Compare result with "Tests" section (see below)

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Zapisz treści listingów do plików:
        a. `etc-passwd.txt`
        b. `etc-shadow.txt`
        c. `etc-group.txt`
    3. Skopiuj również komentarze i puste linie
    4. Sparsuj plik i przedstaw go w formacie `result: list[dict]`
    5. Zwróć listę użytkowników, których `UID` jest większy lub równy 1000
    6. Dict użytkownika powinien zawierać dane z wszystkich plików
    7. Porównaj wyniki z sekcją "Tests" (patrz poniżej)

Tests:
    >>> result  # doctest: +NORMALIZE_WHITESPACE
    [{'username': 'watney',
      'uid': 1000,
      'gid': 1000,
      'home': '/home/watney',
      'shell': '/bin/bash',
      'algorithm': None,
      'password': None,
      'groups': ['astronauts', 'mars'],
      'last_changed': datetime.date(2015, 4, 25),
      'locked': True},
     {'username': 'twardowski',
      'uid': 1001,
      'gid': 1001,
      'home': '/home/twardowski',
      'shell': '/bin/bash',
      'algorithm': 'SHA-512',
      'password': 'tgfvvFWJJ5FKmoXiP5rXWOjwoEBOEoAuBi3EphRbJqqjWYvhEM2wa67L9XgQ7W591FxUNklkDIQsk4kijuhE50',
      'groups': ['astronauts', 'sysadmin', 'moon'],
      'last_changed': datetime.date(2015, 7, 16),
      'locked': False},
     {'username': 'ivanovic',
      'uid': 1002,
      'gid': 1002,
      'home': '/home/ivanovic',
      'shell': '/bin/bash',
      'algorithm': 'MD5',
      'password': 'SWlkjRWexrXYgc98F.',
      'groups': ['astronauts', 'sysadmin'],
      'last_changed': datetime.date(2005, 2, 11),
      'locked': False}]
"""


# Given
from datetime import date
from os.path import dirname, join

BASE_DIR = dirname(__file__)
FILE_GROUP = join(BASE_DIR, '../data/etc-group.txt')
FILE_SHADOW = join(BASE_DIR, '../data/etc-shadow.txt')
FILE_PASSWD = join(BASE_DIR, '../data/etc-passwd.txt')

SECOND = 1
MINUTE = 60 * SECOND
HOUR = 60 * MINUTE
DAY = 24 * HOUR

ALGORITHMS = {
    '1': 'MD5',
    '2a': 'Blowfish',
    '2y': 'Blowfish',
    '5': 'SHA-256',
    '6': 'SHA-512',
}

result: list


Code 8.15. /etc/passwd
##
# `/etc/passwd` structure:
#   - Username
#   - Password: `x` indicates that shadow passwords are used
#   - UID: User ID number
#   - GID: User's group ID number
#   - GECOS: Full name of the user
#   - Home directory
#   - Login shell
##

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
watney:x:1000:1000:Mark Watney:/home/watney:/bin/bash
twardowski:x:1001:1001:Jan Twardowski:/home/twardowski:/bin/bash
ivanovic:x:1002:1002:Ivan Ivanovic:/home/ivanovic:/bin/bash

Code 8.16. /etc/shadow
##
# `/etc/shadow` structure
#   - Username: from `/etc/passwd`
#   - Password
#   - Last Password Change: Days since 1970-01-01
#   - Minimum days between password changes: 0 - changed at any time
#   - Password validity: Days after which password must be changed, 99999 - many, many years
#   - Warning threshold: Days to warn user of an expiring password, 7 - full week
#   - Account inactive: Days after password expires and account is disabled
#   - Time since account is disabled: Days since 1970-01-01
#   - A reserved field for possible future use
#
# Password field (split by `$`):
#   - algorithm
#   - salt
#   - password hash
#
# Password algorithms:
#   - `1` - MD5
#   - `2a` - Blowfish
#   - `2y` - Blowfish
#   - `5` - SHA-256
#   - `6` - SHA-512
#
# Password special chars:
#   - ` ` (blank entry) - password is not required to log in
#   - `*` (asterisk) - account is disabled, cannot be unlocked, no password has ever been set
#   - `!` (exclamation mark) - account is locked, can be unlocked, no password has ever been set
#   - `!<password_hash>` - account is locked, can be unlocked, but password is set
#   - `!!` (two exclamation marks) - account created, waiting for initial password to be set by admin
##

root:$6$Ke02nYgo.9v0SF4p$hjztYvo/M4buqO4oBX8KZTftjCn6fE4cV5o/I95QPekeQpITwFTRbDUBYBLIUx2mhorQoj9bLN8v.w6btE9xy1:16431:0:99999:7:::
adm:$6$5H0QpwprRiJQR19Y$bXGOh7dIfOWpUb/Tuqr7yQVCqL3UkrJns9.7msfvMg4ZO/PsFC5Tbt32PXAw9qRFEBs1254aLimFeNM8YsYOv.:16431:0:99999:7:::
watney:!!:16550::::::
twardowski:$6$P9zn0KwR$tgfvvFWJJ5FKmoXiP5rXWOjwoEBOEoAuBi3EphRbJqqjWYvhEM2wa67L9XgQ7W591FxUNklkDIQsk4kijuhE50:16632:0:99999:7:::
ivanovic:$1$.QKDPc5E$SWlkjRWexrXYgc98F.:12825:0:90:5:30:13096:
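
A single `/etc/shadow` entry can be parsed along the lines described in the comments above. The sketch below is an assumption about how this might be done, not the reference solution. The helper name `parse_shadow` and the `EPOCH` constant are hypothetical, and the date is computed with `timedelta` rather than the `DAY` constant from the "Given" section.

```python
from datetime import date, timedelta

ALGORITHMS = {
    '1': 'MD5',
    '2a': 'Blowfish',
    '2y': 'Blowfish',
    '5': 'SHA-256',
    '6': 'SHA-512',
}

EPOCH = date(1970, 1, 1)


def parse_shadow(line):
    """Parse one /etc/shadow entry (hypothetical helper)."""
    username, password, last_changed, *_ = line.strip().split(':')

    # A leading `!` marks the account as locked
    locked = password.startswith('!')

    if '$' in password:
        # Password field has the form: $algorithm$salt$hash
        _, algorithm, salt, password = password.lstrip('!').split('$', maxsplit=3)
        algorithm = ALGORITHMS.get(algorithm)
    else:
        # Special values such as `!!` or `*` carry no hash
        algorithm = password = None

    return {
        'username': username,
        'algorithm': algorithm,
        'password': password,
        # Assumes the field is present; it counts days since 1970-01-01
        'last_changed': EPOCH + timedelta(days=int(last_changed)),
        'locked': locked,
    }


ivanovic = parse_shadow('ivanovic:$1$.QKDPc5E$SWlkjRWexrXYgc98F.:12825:0:90:5:30:13096:')
watney = parse_shadow('watney:!!:16550::::::')

print(ivanovic['algorithm'], ivanovic['last_changed'])  # MD5 2005-02-11
print(watney['locked'], watney['last_changed'])         # True 2015-04-25
```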

Code 8.17. /etc/group
##
# `/etc/group` structure
#   - Group Name: from `/etc/passwd`
#   - Group Password: `x` indicates that shadow passwords are used
#   - GID: Group ID
#   - Members: usernames from `/etc/passwd`
##

root::0:root
other::1:
bin::2:root,bin,daemon
sys::3:root,bin,sys,adm
adm::4:root,adm,daemon
mail::6:root
astronauts::10:twardowski,watney,ivanovic
daemon::12:root,daemon
sysadmin::14:twardowski,ivanovic
mars::1000:watney
moon::1001:twardowski
nobody::60001:
noaccess::60002:
nogroup::65534:
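
Group membership per user can be collected from these entries roughly as follows. This is a sketch under the assumption that the listing is available as a string. The names `GROUP` and `groups` are illustrative, and only a subset of the entries is used; in the assignment the lines would come from the group file instead.

```python
GROUP = """##
# `/etc/group` structure
##

root::0:root
other::1:
astronauts::10:twardowski,watney,ivanovic
sysadmin::14:twardowski,ivanovic
mars::1000:watney
moon::1001:twardowski
nogroup::65534:
"""

groups = {}  # username -> list of group names

for line in GROUP.splitlines():
    line = line.strip()

    # Skip empty lines and comments
    if not line or line.startswith('#'):
        continue

    name, password, gid, members = line.split(':')

    # Members are comma-separated; the field may be empty
    for member in members.split(','):
        if member:
            groups.setdefault(member, []).append(name)

print(groups['watney'])      # ['astronauts', 'mars']
print(groups['twardowski'])  # ['astronauts', 'sysadmin', 'moon']
print(groups['ivanovic'])    # ['astronauts', 'sysadmin']
```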