2.7. Pandas Read HTML
`pd.read_html()` accepts URLs as well as file paths. It returns a list of `DataFrame` objects, one per table found in the document.
2.7.1. SetUp
>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)
2.7.2. Read HTML
>>> DATA = 'https://python3.info/_static/apollo11.html'
>>>
>>> tables = pd.read_html(DATA)
>>> df = tables[0]
>>>
>>> df.head(n=5)
                                                  0                1         2            3
0                                             Event  GET (hhh:mm:ss)  GMT Time     GMT Date
1                       Terminal countdown started.       -028:00:00  21:00:00  14 Jul 1969
2              Scheduled 11-hour hold at T-9 hours.       -009:00:00  16:00:00  15 Jul 1969
3                   Countdown resumed at T-9 hours.       -009:00:00  03:00:00  16 Jul 1969
4  Scheduled 1-hour 32-minute hold at T-3 hours ...       -003:30:00  08:30:00  16 Jul 1969
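In the output above the header row (`Event`, `GET (hhh:mm:ss)`, ...) ends up as data row `0`, and the columns are only numbered. A minimal sketch of one way to handle this, assuming the page marks its header cells with plain `<td>` tags: pass `header=0` so that the first row becomes the column labels. The inline `HTML` table below is a shortened, hypothetical stand-in for the real page, and the example assumes `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline table: every cell is a <td>, so there is no header markup
HTML = """
<table>
  <tr><td>Event</td><td>GMT Time</td></tr>
  <tr><td>Terminal countdown started.</td><td>21:00:00</td></tr>
  <tr><td>Countdown resumed at T-9 hours.</td><td>03:00:00</td></tr>
</table>
"""

# Without `header`, the first row stays in the data and columns are numbered
raw = pd.read_html(StringIO(HTML))[0]

# With `header=0`, the first row is promoted to column labels
df = pd.read_html(StringIO(HTML), header=0)[0]

print(raw.columns.tolist())  # [0, 1]
print(df.columns.tolist())   # ['Event', 'GMT Time']
```

The same keyword should work with a URL, for example `pd.read_html(DATA, header=0)`.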
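2.7.3. User Agent
Some servers reject requests that do not carry a browser-like `User-Agent` header. In that case, download the page with `requests` and pass the response body to `pd.read_html()`.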
>>> from io import BytesIO
>>>
>>> import requests
>>>
>>>
>>> DATA = 'https://python3.info/_static/apollo11.html'
>>> USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' \
... '(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
>>>
>>> resp = requests.get(DATA, headers={'User-Agent': USER_AGENT})
>>> # Wrap the body in BytesIO: passing literal HTML is deprecated since pandas 2.1
>>> tables = pd.read_html(BytesIO(resp.content))
>>> df = tables[0]
>>>
>>> df.head(n=5)
                                                  0                1         2            3
0                                             Event  GET (hhh:mm:ss)  GMT Time     GMT Date
1                       Terminal countdown started.       -028:00:00  21:00:00  14 Jul 1969
2              Scheduled 11-hour hold at T-9 hours.       -009:00:00  16:00:00  15 Jul 1969
3                   Countdown resumed at T-9 hours.       -009:00:00  03:00:00  16 Jul 1969
4  Scheduled 1-hour 32-minute hold at T-3 hours ...       -003:30:00  08:30:00  16 Jul 1969
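If `requests` is not installed, the same idea can be sketched with the standard library only. The helper names `build_request` and `read_html_tables`, the shortened `USER_AGENT` value, and the 10-second timeout are assumptions made for this illustration; they are not part of pandas or of the original example.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd

# Illustrative value; any browser-like string can be used here
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries a custom User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def read_html_tables(url, user_agent=USER_AGENT, timeout=10):
    """Download `url` with a custom User-Agent and parse all of its tables."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = resp.headers.get_content_charset() or 'utf-8'
        html = resp.read().decode(charset)
    # StringIO avoids passing literal HTML, which recent pandas deprecates
    return pd.read_html(StringIO(html))
```

Usage would then look like `tables = read_html_tables(DATA)`, followed by `df = tables[0]` as before.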
2.7.4. Assignments
"""
* Assignment: Pandas Read HTML
* Complexity: easy
* Lines of code: 2 lines
* Time: 5 min
English:
1. Read data from `DATA` as `data: pd.DataFrame`
2. Define `result` with active European Space Agency astronauts
3. Run doctests - all must succeed
Hints:
* `pip install --upgrade lxml`
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result['Name']
0    Samantha Cristoforetti
1           Alexander Gerst
2          Andreas Mogensen
3            Luca Parmitano
4             Timothy Peake
5            Thomas Pesquet
6           Matthias Maurer
Name: Name, dtype: object
"""
import pandas as pd
DATA = 'https://python3.info/_static/european-astronaut-corps.html'
# Read DATA, select active ESA astronauts
# type: pd.DataFrame
result = ...