3.15. HTML Scraping

Requests-HTML: https://github.com/psf/requests-html
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Scrapy: https://scrapy.org/

3.15.1. `BeautifulSoup`

3.15.2. Example usage

https://github.com/AstroMatt/thesis-masters-aerospace/blob/master/src/worldspaceflight-astronaut-bios.py

3.15.3. Install

$ pip install BeautifulSoup4

3.15.4. Parser

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	`BeautifulSoup(markup, "html.parser")`	Batteries included; decent speed; tolerant (as of Python 2.7.3 and 3.2.2)	Not very tolerant (before Python 2.7.3 or 3.2.2)
lxml's HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; tolerant	External C dependency
lxml's XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only currently supported XML parser	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely tolerant; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
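
Because `lxml` and `html5lib` are external dependencies, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible approach, not an official recipe: the helper name `make_soup` and the preference order are assumptions made for illustration. It relies on `bs4.FeatureNotFound`, which BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first parser that is installed.

    Returns a (soup, parser_name) tuple. The helper and the order
    of preference are hypothetical, chosen only for this example.
    """
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed, try the next one
            continue
    raise RuntimeError('None of the requested parsers is available')


# Deliberately sloppy markup: the paragraphs are never closed
soup, used = make_soup('<p>Unclosed paragraph<p>Second')

print(used)
print(len(soup.find_all('p')))
```

Different parsers may build different trees from invalid markup, so results for broken pages can vary with the parser that ends up being used.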

3.15.5. Open

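from bs4 import BeautifulSoup

# Parse a local file using Python's built-in parser
with open("index.html") as file:
    html = BeautifulSoup(file, 'html.parser')

# find() returns None when nothing matches, so check before removing
menubox = html.find(id='menubox')
if menubox is not None:
    menubox.decompose()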
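
To try `decompose()` without a file on disk, the same steps can be run on an inline string. This is a minimal sketch; the markup is made up for illustration, and only the `menubox` id comes from the example above.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the contents of index.html
markup = '<body><div id="menubox">Menu</div><p>Content</p></body>'
html = BeautifulSoup(markup, 'html.parser')

# find() returns None when no element matches
menubox = html.find(id='menubox')
if menubox is not None:
    # decompose() removes the element and its children from the tree
    menubox.decompose()

remaining = str(html)
print(remaining)
```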

3.15.6. Basic Usage

from bs4 import BeautifulSoup


html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
"""

html = BeautifulSoup(html_doc, 'html.parser')

print(html.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="https://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="https://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="https://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

html.title              # <title>The Dormouse's story</title>
html.title.name         # 'title'
html.title.string       # "The Dormouse's story"
html.title.parent.name  # 'head'
html.p                  # <p class="title"><b>The Dormouse's story</b></p>
html.p['class']         # ['title']
html.a                  # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>

html.find_all('a')
# [<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]

html.find(id="link3")
# <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>
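
Besides tag names and `id`, `find()` and `find_all()` can filter on other attributes. The sketch below uses a shortened copy of the document above; the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

html = BeautifulSoup(html_doc, 'html.parser')

# class is a Python keyword, so the keyword argument is spelled class_
sisters = html.find_all('a', class_='sister')
names = [a.get_text() for a in sisters]

# Multi-valued attributes such as class are returned as a list
title_classes = html.find('p', class_='title')['class']

# Any attribute can be matched through the attrs dictionary
lacie = html.find('a', attrs={'href': 'https://example.com/lacie'})

print(names)
print(title_classes)
print(lacie['id'])
```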

3.15.7. Iterating over items

for link in html.find_all('a'):
    print(link.get('href'))

# https://example.com/elsie
# https://example.com/lacie
# https://example.com/tillie
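
Real pages often contain anchors without an `href` and links that are relative. The sketch below shows one way to handle both while iterating. The markup and `BASE_URL` are made up for illustration; `urljoin` comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://example.com/'  # hypothetical address of the scraped page

# Made-up markup: one absolute path, one anchor without href, one relative link
markup = '<a href="/elsie">Elsie</a> <a name="top">Top</a> <a href="lacie">Lacie</a>'
html = BeautifulSoup(markup, 'html.parser')

links = []
for link in html.find_all('a'):
    href = link.get('href')  # get() returns None when the attribute is missing
    if href is None:
        continue
    # Resolve relative addresses against the page address
    links.append((link.get_text(), urljoin(BASE_URL, href)))

print(links)
```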

3.15.8. Getting Page Text

html.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
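
The plain `get_text()` call keeps all whitespace from the source. When that is not wanted, the `separator` and `strip` arguments, or the `stripped_strings` generator, may be more convenient. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

markup = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">  Once upon a time  </p>"""

html = BeautifulSoup(markup, 'html.parser')

# strip=True trims every text fragment and drops whitespace-only ones,
# separator is placed between the remaining fragments
text = html.get_text(separator='\n', strip=True)

# stripped_strings yields the same trimmed fragments one by one
fragments = list(html.stripped_strings)

print(text)
print(fragments)
```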

3.15.9. Assignments

3.15.9.1. Scraping Iris

Assignment: Scraping Iris
Complexity: medium
Lines of code: 20 lines
Time: 21 min

English:

Download the Iris dataset from https://python3.info/_static/iris-dirty.csv and process it using BeautifulSoup4.
Parse the HTML code to clean the data.
Delete the first (header) row.
Name the columns: sepal_length, sepal_width, petal_length, petal_width, species
Display the data as a list of dicts whose keys are the column names.
Run doctests; all must succeed.
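
As a starting point only, the sketch below shows one way to strip markup from delimited text and build a list of dicts. The sample data is made up: the actual layout of `iris-dirty.csv` is not shown in this section and may differ, so the parsing steps are assumptions rather than a solution.

```python
import csv

from bs4 import BeautifulSoup

COLUMNS = ('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species')

# Made-up sample; the real file may be structured differently
sample = """sepal length,sepal width,petal length,petal width,species
<b>5.1</b>,3.5,1.4,0.2,<i>setosa</i>
7.0,3.2,<span>4.7</span>,1.4,versicolor
"""

# get_text() drops the tags and keeps only the text between them
text = BeautifulSoup(sample, 'html.parser').get_text()

# Skip the first (header) row
lines = text.strip().splitlines()[1:]

result = []
for row in csv.reader(lines):
    measurements = [float(value) for value in row[:4]]
    result.append(dict(zip(COLUMNS, measurements + [row[4]])))

print(result)
```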

3.15.9.2. Scraping EVA

Assignment: Scraping EVA
Complexity: medium
Lines of code: 100 lines
Time: 21 min

English:

Based on the given URLs:
Scrape the page using BeautifulSoup4.
Prepare a CSV file with data about spacewalks.
Try to do the same using pandas.read_html():
1. Passing the fourth URL as the argument
2. Passing a partially parsed page, e.g. an extracted table
Run doctests; all must succeed.
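
The table-to-CSV step could be sketched as below. The table, its column names and its values are placeholders invented for illustration, since the target pages are not listed here. The output goes to an in-memory buffer so the example runs without touching the disk.

```python
import csv
import io

from bs4 import BeautifulSoup

# Placeholder table; the real page layout is an assumption
sample = """
<table>
  <tr><th>name</th><th>duration</th></tr>
  <tr><td>Spacewalk A</td><td>1:30</td></tr>
  <tr><td>Spacewalk B</td><td>2:45</td></tr>
</table>
"""

html = BeautifulSoup(sample, 'html.parser')
table = html.find('table')

# Header cells use <th>, data cells use <td>
header = [th.get_text(strip=True) for th in table.find_all('th')]
records = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in table.find_all('tr')
    if tr.find('td') is not None  # skip the header row
]

# Write to an in-memory buffer; a real script would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(records)
csv_text = buffer.getvalue()

print(csv_text)
```

For the pandas variant, `pandas.read_html()` is documented to accept a URL or a file-like object, so wrapping `str(table)` in `io.StringIO` is one way to pass an extracted table.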

