3.15. HTML Scraping
Requests HTML https://github.com/psf/requests-html
BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Scrapy https://scrapy.org/
3.15.1. BeautifulSoup
3.15.2. Example usage
3.15.3. Install
$ pip install BeautifulSoup4
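3.15.4. Parser
Parser | Typical usage | Advantages | Disadvantages
Python's html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed | Not as fast as lxml; less lenient than html5lib
lxml's HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency
lxml's XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency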
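Since the available parsers depend on what is installed, one possible approach is to try a faster parser first and fall back to the built-in one. The sketch below assumes this fallback order is acceptable for your data; `make_soup` and its `preferred` parameter are hypothetical names introduced only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser from `preferred` that is installed.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # The requested parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.p.get_text())
```

Different parsers may build slightly different trees from invalid markup, so it can be worth pinning one parser explicitly when results must be reproducible.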
3.15.5. Open
from bs4 import BeautifulSoup

with open("index.html") as file:
    html = BeautifulSoup(file, 'html.parser')

# Remove the element with id="menubox" (and its children) from the tree
html.find(id='menubox').decompose()
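Note that `find()` returns `None` when nothing matches, so calling `.decompose()` on the result fails if the page has no such element. A minimal guarded sketch, using an in-memory string instead of a file (the `page` markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the contents of index.html
page = '<div id="menubox">Menu</div><p>Content</p>'
soup = BeautifulSoup(page, "html.parser")

menu = soup.find(id="menubox")
if menu is not None:
    # Remove the element only when it is actually present
    menu.decompose()

cleaned = str(soup)
print(cleaned)
```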
3.15.6. Basic Usage
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
html = BeautifulSoup(html_doc, 'html.parser')
print(html.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="https://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="https://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="https://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
html.title # <title>The Dormouse's story</title>
html.title.name # 'title'
html.title.string # "The Dormouse's story"
html.title.parent.name # 'head'
html.p # <p class="title"><b>The Dormouse's story</b></p>
html.p['class'] # ['title']
html.a # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
html.find_all('a')
# [<a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>]
html.find(id="link3")
# <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>
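Beyond attribute-style access, `find_all()` can filter by CSS class through the `class_` keyword. The following sketch uses a shortened, made-up version of the document above; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document
doc = """
<p class="story">
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# class_ (with a trailing underscore) avoids the reserved word `class`
sisters = soup.find_all("a", class_="sister")
names = [a.get_text() for a in sisters]
ids = [a["id"] for a in sisters]

# find() returns None instead of raising when nothing matches
missing = soup.find(id="link9")

print(names, ids, missing)
```

Because `class` is a multi-valued attribute, `soup.p["class"]` yields a list, while single-valued attributes such as `id` yield a plain string.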
3.15.7. Iterating over items
for link in html.find_all('a'):
    print(link.get('href'))
# https://example.com/elsie
# https://example.com/lacie
# https://example.com/tillie
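The loop above uses `.get('href')`, which returns `None` for anchors without that attribute, whereas `link['href']` would raise `KeyError`. A small sketch of both behaviours, on a made-up snippet that includes an anchor with no `href`:

```python
from bs4 import BeautifulSoup

# Made-up snippet; the middle anchor deliberately has no href
snippet = """
<a href="https://example.com/elsie" id="link1">Elsie</a>
<a id="anchor-only">No target</a>
<a href="https://example.com/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# .get() yields None where the attribute is absent
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only tags that carry an href attribute
with_target = {a.get_text(): a["href"] for a in soup.find_all("a", href=True)}

print(all_hrefs)
print(with_target)
```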
3.15.8. Getting Page Text
print(html.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
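The raw text keeps the whitespace of the original markup. If cleaner output is wanted, `get_text()` accepts `separator` and `strip` arguments, and `stripped_strings` yields the individual text fragments. A brief sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment with stray whitespace around the text
fragment = "<div>\n<h1> Title </h1>\n<p>Once upon a time <b>three</b> sisters</p>\n</div>"
soup = BeautifulSoup(fragment, "html.parser")

# Strip every text fragment and join the non-empty ones with a single space
flat = soup.get_text(separator=" ", strip=True)

# The same fragments as a list, without joining
pieces = list(soup.stripped_strings)

print(flat)
print(pieces)
```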
3.15.9. Assignments
3.15.9.1. Scraping Iris
Assignment: Scraping Iris
Complexity: medium
Lines of code: 20 lines
Time: 21 min
- English:
Using BeautifulSoup4, download the Iris dataset from the page https://github.com/AstroMatt/book-python/blob/master/numerical-analysis/data/iris-dirty.csv
Clean the data by parsing the HTML code.
Remove the first (header) row.
Name the columns: Sepal length, Sepal width, Petal length, Petal width, Species
Display the data as a list of dicts, with the column names as keys.
Run doctests - all must succeed
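As a starting point, the row-to-dict step might look roughly like the sketch below. It does not download anything: `SAMPLE` is a made-up table fragment and `parse_table` is a hypothetical helper, so the markup of the real page and the cleaning its dirty values need are assumptions left to the reader.

```python
from bs4 import BeautifulSoup

COLUMNS = ["Sepal length", "Sepal width", "Petal length", "Petal width", "Species"]

# Made-up fragment; the real page's markup is likely to differ
SAMPLE = """
<table>
<tr><td>sepal_length</td><td>sepal_width</td><td>petal_length</td><td>petal_width</td><td>species</td></tr>
<tr><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>
<tr><td>7.0</td><td>3.2</td><td>4.7</td><td>1.4</td><td>versicolor</td></tr>
</table>
"""


def parse_table(markup):
    """Return table rows as a list of dicts keyed by COLUMNS (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for tr in soup.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != len(COLUMNS):
            continue  # ignore rows that do not have exactly five cells
        *measurements, species = cells
        values = [float(value) for value in measurements] + [species]
        records.append(dict(zip(COLUMNS, values)))
    return records


result = parse_table(SAMPLE)
print(result[0])
```

The `float()` conversion raises `ValueError` on malformed numbers, which is one place where the cleaning logic for the dirty dataset could be added.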
3.15.9.2. Scraping EVA
Assignment: Scraping EVA
Complexity: medium
Lines of code: 100 lines
Time: 21 min
- English:
Based on the given URLs:
Scrape the page using BeautifulSoup4
Prepare a CSV file with data on spacewalks
Try to do the same using pandas.read_html():
Passing the fourth URL as the parameter
For a partially parsed page, e.g. an extracted table
Run doctests - all must succeed
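For the CSV step, the standard library's `csv.DictWriter` is one option once the scraped rows are available as dicts. The sketch below writes to an in-memory buffer; `to_csv` is a hypothetical helper and the rows are placeholders, not real spacewalk data.

```python
import csv
import io


def to_csv(records):
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    if not records:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first record
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder rows for illustration only
rows = [
    {"mission": "Example-1", "duration_min": 12},
    {"mission": "Example-2", "duration_min": 24},
]

csv_text = to_csv(rows)
print(csv_text)
```

To write a file instead, the same writer can be given a handle opened with `open(path, "w", newline="")`.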