2.5. Pandas Read HTML

2.5.1. Rationale

  • `pd.read_html()` accepts URLs as well as file paths
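
As a minimal sketch of what `pd.read_html()` gives back, the snippet below parses an in-memory HTML string instead of a URL or a file path. The `HTML` constant is an assumption made for illustration; its rows are copied from the crew table shown in section 2.5.2. The call returns a list with one `DataFrame` per `<table>` element found, and it needs an HTML parser such as `lxml` to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample page; the rows come from the crew table in 2.5.2
HTML = """
<table>
  <thead>
    <tr><th>Crew</th><th>Role</th><th>Astronaut</th></tr>
  </thead>
  <tbody>
    <tr><td>Prime</td><td>CDR</td><td>Neil Armstrong</td></tr>
    <tr><td>Prime</td><td>LMP</td><td>Buzz Aldrin</td></tr>
    <tr><td>Prime</td><td>CMP</td><td>Michael Collins</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, even when the page holds a single table
dfs = pd.read_html(StringIO(HTML))

len(dfs)
# 1

list(dfs[0].columns)
# ['Crew', 'Role', 'Astronaut']
```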

2.5.2. Read HTML

import pandas as pd

DATA = 'https://python.astrotech.io/numerical-analysis/pandas/df-create.html'

# This server rejects requests that lack a browser-like User-Agent
pd.read_html(DATA)
# Traceback (most recent call last):
# urllib.error.HTTPError: HTTP Error 403: Forbidden
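
# Workaround: download the page with an explicit User-Agent, then parse it
from io import StringIO

import requests

DATA = 'https://python.astrotech.io/numerical-analysis/pandas/df-create.html'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

resp = requests.get(DATA, headers={'User-Agent': USER_AGENT})
# pandas 2.1+ deprecates passing literal HTML, so wrap the text in StringIO
dfs = pd.read_html(StringIO(resp.text))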

dfs[0]
#      Crew Role        Astronaut
# 0   Prime  CDR   Neil Armstrong
# 1   Prime  LMP      Buzz Aldrin
# 2   Prime  CMP  Michael Collins
# 3  Backup  CDR     James Lovell
# 4  Backup  LMP   William Anders
# 5  Backup  CMP       Fred Haise
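
The download-then-parse pattern above can be wrapped in a small helper. This is only a sketch: the function name `read_html_tables` and the `timeout` default are assumptions, not part of pandas or requests. It adds `raise_for_status()` so that a 403 or 404 response surfaces as a `requests` exception instead of being parsed as if it were the page.

```python
from io import StringIO

import pandas as pd
import requests

# Same User-Agent string as in the example above
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'


def read_html_tables(url, user_agent=USER_AGENT, timeout=10.0):
    """Hypothetical helper: fetch `url` and return all tables found in it."""
    resp = requests.get(url, headers={'User-Agent': user_agent}, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses
    resp.raise_for_status()
    # Wrap the decoded page in StringIO so pandas treats it as a file-like object
    return pd.read_html(StringIO(resp.text))


# Usage (performs a network request):
# dfs = read_html_tables('https://python.astrotech.io/numerical-analysis/pandas/df-create.html')
```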

2.5.3. Assignments

Code 2.51. Solution
"""
* Assignment: Pandas Read HTML
* Complexity: easy
* Lines of code: 2 lines
* Time: 5 min

English:
    1. Read data from `DATA` as `data: pd.DataFrame`
    2. Define `result` with active European Space Agency astronauts
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `data: pd.DataFrame`
    2. Zdefiniuj `result` z aktywnymi astronautami Europejskiej Agencji Kosmicznej
    3. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `pip install --upgrade lxml`
    * 3rd table

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> type(result) is pd.DataFrame
    True
    >>> len(result) > 0
    True
    >>> result['Name']
    0    Samantha Cristoforetti
    1           Alexander Gerst
    2          Andreas Mogensen
    3            Luca Parmitano
    4             Timothy Peake
    5            Thomas Pesquet
    6           Matthias Maurer
    Name: Name, dtype: object
"""

import pandas as pd

# DATA = 'https://en.wikipedia.org/wiki/European_Astronaut_Corps'
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/html/european-astronaut-corps.html'
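
The "3rd table" hint relies on the fact that `pd.read_html()` returns tables in document order, so the third table sits at index 2. The sketch below illustrates this on a made-up page rather than on `DATA`: the `HTML` constant is an assumption, built by splitting the crew table from section 2.5.2 into three tables by role. It also shows the `match` parameter, which keeps only tables containing text that matches the given string or regular expression.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with three tables (crew from 2.5.2, grouped by role)
HTML = """
<table>
  <thead><tr><th>Crew</th><th>Role</th><th>Astronaut</th></tr></thead>
  <tbody>
    <tr><td>Prime</td><td>CDR</td><td>Neil Armstrong</td></tr>
    <tr><td>Backup</td><td>CDR</td><td>James Lovell</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Crew</th><th>Role</th><th>Astronaut</th></tr></thead>
  <tbody>
    <tr><td>Prime</td><td>LMP</td><td>Buzz Aldrin</td></tr>
    <tr><td>Backup</td><td>LMP</td><td>William Anders</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Crew</th><th>Role</th><th>Astronaut</th></tr></thead>
  <tbody>
    <tr><td>Prime</td><td>CMP</td><td>Michael Collins</td></tr>
    <tr><td>Backup</td><td>CMP</td><td>Fred Haise</td></tr>
  </tbody>
</table>
"""

# Tables come back in document order; index 2 is the third table
tables = pd.read_html(StringIO(HTML))
third = tables[2]

third['Astronaut'].tolist()
# ['Michael Collins', 'Fred Haise']

# Alternative: select tables by their content instead of their position
matched = pd.read_html(StringIO(HTML), match='Michael Collins')

len(matched)
# 1
```

Selecting by `match` tends to be more robust than a fixed index when the page layout may change, at the cost of depending on a specific cell value.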