2.7. Pandas Read HTML

  • `pd.read_html()` accepts a URL as well as a file path
  • It returns a list of `DataFrame` objects, one per parsed table
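To illustrate the file-path case without network access, here is a minimal sketch that writes a small table to a temporary file and reads it back. The file name `crew.html` and its rows are hypothetical sample data, not taken from the course files, and the sketch assumes an HTML parser such as `lxml` is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample markup, used only for illustration
HTML = """
<table>
  <tr><th>Name</th><th>Agency</th></tr>
  <tr><td>Alice</td><td>ESA</td></tr>
  <tr><td>Bob</td><td>NASA</td></tr>
</table>
"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'crew.html'
    path.write_text(HTML, encoding='utf-8')
    # A local path is passed the same way a URL would be
    tables = pd.read_html(str(path))

# read_html() always returns a list, even for a single table
df = tables[0]
```

Replacing `str(path)` with an `https://...` address should behave the same way, as long as the server allows the request.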

2.7.1. Setup

>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)

2.7.2. Read HTML

>>> DATA = 'https://python3.info/_static/apollo11.html'
>>>
>>> tables = pd.read_html(DATA)
>>> df = tables[0]
>>>
>>> df.head(n=5)
                                                   0                 1          2            3
0                                              Event  GET  (hhh:mm:ss)  GMT  Time    GMT  Date
1                       Terminal countdown  started.        -028:00:00   21:00:00  14 Jul 1969
2              Scheduled 11-hour hold  at T-9 hours.        -009:00:00   16:00:00  15 Jul 1969
3                   Countdown resumed at  T-9 hours.        -009:00:00   03:00:00  16 Jul 1969
4  Scheduled 1-hour  32-minute hold at T-3 hours ...        -003:30:00   08:30:00  16 Jul 1969
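In the output above the header row (`Event`, `GET (hhh:mm:ss)`, ...) ends up as data row 0 and the columns are labelled 0 to 3. One possible remedy is the `header=0` argument, which promotes the first row to column names. The sketch below uses a small inline copy of the first rows rather than the live page. The markup is an assumption about how the page is built (plain `<td>` cells, no `<th>`), and a parser such as `lxml` is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Assumed markup: a cut-down stand-in for the Apollo 11 timeline table
HTML = """
<table>
  <tr><td>Event</td><td>GET (hhh:mm:ss)</td><td>GMT Time</td><td>GMT Date</td></tr>
  <tr><td>Terminal countdown started.</td><td>-028:00:00</td><td>21:00:00</td><td>14 Jul 1969</td></tr>
  <tr><td>Scheduled 11-hour hold at T-9 hours.</td><td>-009:00:00</td><td>16:00:00</td><td>15 Jul 1969</td></tr>
</table>
"""

# Default call: the header row is treated as ordinary data
raw = pd.read_html(StringIO(HTML))[0]

# header=0: the first row becomes the column names
named = pd.read_html(StringIO(HTML), header=0)[0]
```

With `header=0` the frame has one row fewer, and columns can be selected by name, for example `named['GMT Date']`.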

2.7.3. User Agent

>>> from io import StringIO
>>>
>>> import requests
>>>
>>>
>>> DATA = 'https://python3.info/_static/apollo11.html'
>>> USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' \
...              '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
>>>
>>> resp = requests.get(DATA, headers={'User-Agent': USER_AGENT})
>>> # Wrap the markup in StringIO: passing literal HTML is deprecated since pandas 2.1
>>> tables = pd.read_html(StringIO(resp.text))
>>> df = tables[0]
>>>
>>> df.head(n=5)
                                                   0                 1          2            3
0                                              Event  GET  (hhh:mm:ss)  GMT  Time    GMT  Date
1                       Terminal countdown  started.        -028:00:00   21:00:00  14 Jul 1969
2              Scheduled 11-hour hold  at T-9 hours.        -009:00:00   16:00:00  15 Jul 1969
3                   Countdown resumed at  T-9 hours.        -009:00:00   03:00:00  16 Jul 1969
4  Scheduled 1-hour  32-minute hold at T-3 hours ...        -003:30:00   08:30:00  16 Jul 1969
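The download-then-parse pattern above can be wrapped in a small helper. This is only a sketch: the function name `read_html_tables` and the `timeout` default are illustrative choices, not part of pandas or requests. It adds `raise_for_status()` so that an HTTP error (for example 403 when the server rejects the client) surfaces as an exception instead of being parsed as if it were the page.

```python
from io import StringIO

import pandas as pd
import requests


def read_html_tables(url, user_agent, timeout=10.0):
    """Download `url` with a custom User-Agent and parse its tables.

    Hypothetical helper, written for illustration only.
    """
    resp = requests.get(
        url,
        headers={'User-Agent': user_agent},
        timeout=timeout,
    )
    # Fail loudly on 4xx/5xx instead of parsing an error page
    resp.raise_for_status()
    # read_html() expects a path, URL or file-like object, hence StringIO
    return pd.read_html(StringIO(resp.text))
```

A call such as `read_html_tables(DATA, USER_AGENT)[0]` should then yield the same frame as in the example above.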

2.7.4. Assignments

Code 2.34. Solution
"""
* Assignment: Pandas Read HTML
* Complexity: easy
* Lines of code: 2 lines
* Time: 5 min

English:
    1. Read data from `DATA` as `data: pd.DataFrame`
    2. Define `result` with active European Space Agency astronauts
    3. Run doctests - all must succeed

Hints:
    * `pip install --upgrade lxml`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result['Name']
    0    Samantha Cristoforetti
    1           Alexander Gerst
    2          Andreas Mogensen
    3            Luca Parmitano
    4             Timothy Peake
    5            Thomas Pesquet
    6           Matthias Maurer
    Name: Name, dtype: object
"""

import pandas as pd


DATA = 'https://python3.info/_static/european-astronaut-corps.html'


# Read DATA, select active ESA astronauts
# type: pd.DataFrame
result = ...