2.8. Pandas Read XML

2.8.1. Rationale

  • File paths works also with URLs

  • io.StringIO Converts str to File-like object

2.8.2. XML and XSLT

from io import StringIO
from lxml.etree import XML, XSLT, parse
import pandas as pd

DATA = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
</catalog>
"""

TEMPLATE = """
    <html xsl:version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <table>
            <thead>
                <tr>
                    <th>Id</th>
                    <th>Author</th>
                    <th>Title</th>
                    <th>Genre</th>
                    <th>Price</th>
                    <th>Publish Date</th>
                    <th>Description</th>
                </tr>
            </thead>
            <tbody>

                <xsl:for-each select="catalog/book">
                    <tr>
                        <td><xsl:value-of select="@id"/></td>
                        <td><xsl:value-of select="author"/></td>
                        <td><xsl:value-of select="title"/></td>
                        <td><xsl:value-of select="genre"/></td>
                        <td><xsl:value-of select="price"/></td>
                        <td><xsl:value-of select="publish_date"/></td>
                        <td><xsl:value-of select="description"/></td>
                    </tr>
                </xsl:for-each>

            </tbody>
        </table>
    </html>
"""

transform = XSLT(XML(TEMPLATE))
data = parse(StringIO(DATA))
html = str(transform(data))
dfs = pd.read_html(html)
result = dfs[0]

result
# [      Id  ...                                        Description
# 0  bk101  ...  An in-depth look at creating applications  wit...
# 1  bk102  ...  A former architect battles corporate zombies, ...
# 2  bk103  ...  After the collapse of a nanotechnology  societ...
# [3 rows x 7 columns]]

type(result) is pd.DataFrame
# True

len(result) > 0
# True

result.columns
# Index(['Id', 'Author', 'Title', 'Genre', 'Price', 'Publish Date',
#        'Description'],
#       dtype='object')

result['Title']
# 0    XML Developer's Guide
# 1            Midnight Rain
# 2          Maeve Ascendant
# Name: Title, dtype: object

2.8.3. Assignments

Code 2.52. Solution
"""
* Assignment: Pandas Read XSLT Plants
* Complexity: medium
* Lines of code: 4 lines
* Time: 5 min

English:
    1. Read data from `DATA` as `result: pd.DataFrame`
    2. Use XSLT transformation
    3. Make sure that columns and indexes are named properly
    4. Calculate average cost of flower
    5. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `result: pd.DataFrame`
    2. Użyj transformaty XSLT
    3. Upewnij się, że nazwy kolumn i indeks są dobrze ustawione
    4. Wylicz średni koszt kwiatów
    5. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `pip install --upgrade lxml`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> type(result) is pd.DataFrame
    True
    >>> len(result) > 0
    True
    >>> result
         English Name              Latin Name   Cost
    0       Bloodroot  Sanguinaria canadensis  $2.44
    1       Columbine    Aquilegia canadensis  $9.37
    2  Marsh Marigold        Caltha palustris  $6.81
    3         Cowslip        Caltha palustris  $9.90
"""

import pandas as pd
from io import StringIO
from lxml.etree import XML, XSLT, parse

DATA = """
    <CATALOG>
        <PLANT>
            <COMMON>Bloodroot</COMMON>
            <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
            <ZONE>4</ZONE>
            <LIGHT>Mostly Shady</LIGHT>
            <PRICE>$2.44</PRICE>
            <AVAILABILITY>031599</AVAILABILITY>
        </PLANT>
        <PLANT>
            <COMMON>Columbine</COMMON>
            <BOTANICAL>Aquilegia canadensis</BOTANICAL>
            <ZONE>3</ZONE>
            <LIGHT>Mostly Shady</LIGHT>
            <PRICE>$9.37</PRICE>
            <AVAILABILITY>030699</AVAILABILITY>
        </PLANT>
        <PLANT>
            <COMMON>Marsh Marigold</COMMON>
            <BOTANICAL>Caltha palustris</BOTANICAL>
            <ZONE>4</ZONE>
            <LIGHT>Mostly Sunny</LIGHT>
            <PRICE>$6.81</PRICE>
            <AVAILABILITY>051799</AVAILABILITY>
        </PLANT>
        <PLANT>
            <COMMON>Cowslip</COMMON>
            <BOTANICAL>Caltha palustris</BOTANICAL>
            <ZONE>4</ZONE>
            <LIGHT>Mostly Shady</LIGHT>
            <PRICE>$9.90</PRICE>
            <AVAILABILITY>030699</AVAILABILITY>
        </PLANT>
    </CATALOG>
"""

TEMPLATE = """
    <html xsl:version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <table>
            <thead>
                <tr>
                    <th>English Name</th>
                    <th>Latin Name</th>
                    <th>Cost</th>
                </tr>
            </thead>

            <xsl:for-each select="CATALOG/PLANT">
                <tr>
                    <td><xsl:value-of select="COMMON"/></td>
                    <td><xsl:value-of select="BOTANICAL"/></td>
                    <td><xsl:value-of select="PRICE"/></td>
                </tr>
            </xsl:for-each>

        </table>
    </html>
"""