2.10. Pandas Read XML¶
File paths works also with URLs
io.StringIO Converts
str
to File-like objectpd.read_xml()
2.10.1. SetUp¶
>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)
>>>
>>>
>>> DATA = """<?xml version="1.0"?>
... <catalog>
... <book id="bk101">
... <author>Gambardella, Matthew</author>
... <title>XML Developer's Guide</title>
... <genre>Computer</genre>
... <price>44.95</price>
... <publish_date>2000-10-01</publish_date>
... <description>An in-depth look at creating applications
... with XML.</description>
... </book>
... <book id="bk102">
... <author>Ralls, Kim</author>
... <title>Midnight Rain</title>
... <genre>Fantasy</genre>
... <price>5.95</price>
... <publish_date>2000-12-16</publish_date>
... <description>A former architect battles corporate zombies,
... an evil sorceress, and her own childhood to become queen
... of the world.</description>
... </book>
... <book id="bk103">
... <author>Corets, Eva</author>
... <title>Maeve Ascendant</title>
... <genre>Fantasy</genre>
... <price>5.95</price>
... <publish_date>2000-11-17</publish_date>
... <description>After the collapse of a nanotechnology
... society in England, the young survivors lay the
... foundation for a new society.</description>
... </book>
... </catalog>
... """
>>>
2.10.2. Read XML¶
>>> pd.read_xml(DATA)
id author title genre price publish_date description
0 bk101 Gambardella, Matthew XML Developer's Guide Computer 44.95 2000-10-01 An in-depth look at creating applications\n ...
1 bk102 Ralls, Kim Midnight Rain Fantasy 5.95 2000-12-16 A former architect battles corporate zombies,\...
2 bk103 Corets, Eva Maeve Ascendant Fantasy 5.95 2000-11-17 After the collapse of a nanotechnology\n ...
2.10.3. XML and XSLT¶
>>> from io import StringIO
>>> from lxml.etree import XML, XSLT, parse
>>>
>>>
>>> TEMPLATE = """
... <html xsl:version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
... <table>
... <thead>
... <tr>
... <th>Id</th>
... <th>Author</th>
... <th>Title</th>
... <th>Genre</th>
... <th>Price</th>
... <th>Publish Date</th>
... <th>Description</th>
... </tr>
... </thead>
... <tbody>
...
... <xsl:for-each select="catalog/book">
... <tr>
... <td><xsl:value-of select="@id"/></td>
... <td><xsl:value-of select="author"/></td>
... <td><xsl:value-of select="title"/></td>
... <td><xsl:value-of select="genre"/></td>
... <td><xsl:value-of select="price"/></td>
... <td><xsl:value-of select="publish_date"/></td>
... <td><xsl:value-of select="description"/></td>
... </tr>
... </xsl:for-each>
...
... </tbody>
... </table>
... </html>
... """
>>>
>>> transform = XSLT(XML(TEMPLATE))
>>> data = parse(StringIO(DATA))
>>> html = str(transform(data))
>>> dfs = pd.read_html(html)
>>> result = dfs[0]
>>>
>>> result
Id Author Title Genre Price Publish Date Description
0 bk101 Gambardella, Matthew XML Developer's Guide Computer 44.95 2000-10-01 An in-depth look at creating applications wit...
1 bk102 Ralls, Kim Midnight Rain Fantasy 5.95 2000-12-16 A former architect battles corporate zombies, ...
2 bk103 Corets, Eva Maeve Ascendant Fantasy 5.95 2000-11-17 After the collapse of a nanotechnology societ...
>>>
>>> type(result) is pd.DataFrame
True
>>>
>>> len(result) > 0
True
>>>
>>> result.columns
Index(['Id', 'Author', 'Title', 'Genre', 'Price', 'Publish Date', 'Description'], dtype='object')
>>>
>>> result['Title']
0 XML Developer's Guide
1 Midnight Rain
2 Maeve Ascendant
Name: Title, dtype: object
2.10.4. Assignments¶
"""
* Assignment: Pandas Read XSLT Plants
* Complexity: medium
* Lines of code: 4 lines
* Time: 5 min
English:
1. Read data from `DATA` as `result: pd.DataFrame`
2. Use XSLT transformation
3. Make sure that columns and indexes are named properly
4. Calculate average cost of flower
5. Run doctests - all must succeed
Polish:
1. Wczytaj dane z `DATA` jako `result: pd.DataFrame`
2. Użyj transformaty XSLT
3. Upewnij się, że nazwy kolumn i indeks są dobrze ustawione
4. Wylicz średni koszt kwiatów
5. Uruchom doctesty - wszystkie muszą się powieść
Hints:
* `pip install --upgrade lxml`
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` has invalid type, should be `pd.DataFrame`'
>>> result
English Name Latin Name Cost
0 Bloodroot Sanguinaria canadensis $2.44
1 Columbine Aquilegia canadensis $9.37
2 Marsh Marigold Caltha palustris $6.81
3 Cowslip Caltha palustris $9.90
"""
import pandas as pd
from io import StringIO
from lxml.etree import XML, XSLT, parse
DATA = """
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Sunny</LIGHT>
<PRICE>$6.81</PRICE>
<AVAILABILITY>051799</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Cowslip</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.90</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
</CATALOG>
"""
TEMPLATE = """
<html xsl:version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<table>
<thead>
<tr>
<th>English Name</th>
<th>Latin Name</th>
<th>Cost</th>
</tr>
</thead>
<xsl:for-each select="CATALOG/PLANT">
<tr>
<td><xsl:value-of select="COMMON"/></td>
<td><xsl:value-of select="BOTANICAL"/></td>
<td><xsl:value-of select="PRICE"/></td>
</tr>
</xsl:for-each>
</table>
</html>
"""
# XSLT transformed DATA
# type: pd.DataFrame
result = ...