XML/HTML examples

HTML

Python has numerous libraries for reading and writing data in the ubiquitous HTML and XML formats, for example lxml, Beautiful Soup and html5lib. lxml is generally much faster, whereas the other libraries cope better with malformed HTML or XML files.

pandas has a built-in function, read_html, which uses lxml, html5lib and Beautiful Soup to automatically parse tables from HTML files into DataFrame objects. These libraries are not installed together with pandas and have to be installed separately. With Spack you can provide lxml, Beautiful Soup and html5lib in your kernel:

$ spack env activate python-311
$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib

Alternatively, you can install the libraries with other package managers, for example:

$ pipenv install lxml beautifulsoup4 html5lib

To show how this works, I use an HTML file from Wikipedia that gives an overview of different serialisation formats.

[1]:
import pandas as pd


tables = pd.read_html("https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats")

The pandas.read_html function has a number of options, but by default it looks for and tries to parse all table data contained in <table> tags. The result is a list of DataFrame objects:

[2]:
len(tables)
[2]:
3
[3]:
formats = tables[0]

formats.head()
[3]:
Name Creator-maintainer Based on Standardized?[definition needed] Specification Binary? Human-readable? Supports references?e Schema-IDL? Standard APIs Supports zero-copy operations
0 Apache Avro Apache Software Foundation No Apache Avro™ Specification Yes Partialg Built-in C, C#, C++, Java, PHP, Python, Ruby
1 Apache Parquet Apache Software Foundation No Apache Parquet Yes No No Java, Python, C++ No
2 ASN.1 ISO, IEC, ITU-T Yes ISO/IEC 8824 / ITU-T X.680 (syntax) and ISO/IE... BER, DER, PER, OER, or custom via ECN XER, JER, GSER, or custom via ECN Yesf Built-in OER
3 Bencode Bram Cohen (creator) BitTorrent, Inc. (maintai... De facto as BEP Part of BitTorrent protocol specification Except numbers and delimiters, being ASCII No No No No No
4 Binn Bernardo Ramos JSON (loosely) No Binn Specification Yes No No No No Yes
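As a minimal sketch of two of the read_html options, the following example parses a small, made-up HTML snippet instead of the Wikipedia page. The table ids and contents are assumptions for illustration only. attrs selects tables by their HTML attributes, and match keeps only those tables whose text matches a string or regular expression. As above, a parser such as lxml has to be installed.

```python
from io import StringIO

import pandas as pd


# Hypothetical HTML with two tables; ids and contents are invented
html = """
<table id="formats">
  <tr><th>Name</th><th>Binary?</th></tr>
  <tr><td>JSON</td><td>No</td></tr>
  <tr><td>CBOR</td><td>Yes</td></tr>
</table>
<table id="other">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
"""

# Without options, every <table> element is parsed
all_tables = pd.read_html(StringIO(html))

# Only tables whose attributes match the given dict
by_id = pd.read_html(StringIO(html), attrs={"id": "formats"})

# Only tables that contain text matching the string or regex
by_text = pd.read_html(StringIO(html), match="CBOR")

print(len(all_tables), len(by_id), len(by_text))
```

Restricting the result this way can be more robust than relying on a list index such as tables[0], because the position of a table may change when the page is edited.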

With the formats DataFrame we can now do some data cleansing and analysis, such as counting how often each value occurs in the Schema-IDL? column:

[4]:
formats["Schema-IDL?"].value_counts()
[4]:
Schema-IDL?
No                                                                                         15
Yes                                                                                         5
Built-in                                                                                    4
Schema WD                                                                                   1
Partial (Kwalify, Rx, built-in language type-defs)                                          1
XML schema, RELAX NG                                                                        1
WSDL, XML schema                                                                            1
Partial (JSON Schema Proposal, other JSON schemas/IDLs)                                     1
?                                                                                           1
Ion schema                                                                                  1
Partial (JSON Schema Proposal, ASN.1 with JER, Kwalify, Rx, Itemscript Schema), JSON-LD     1
—                                                                                           1
XML schema                                                                                  1
XML Schema                                                                                  1
Partial (Signature strings)                                                                 1
CDDL                                                                                        1
Schema-IDL?                                                                                 1
Name: count, dtype: int64
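The counts above hint at some cleansing steps: the column name itself appears as a value (a repeated header row), placeholders such as ? and — carry no information, and XML schema is counted separately from XML Schema. The following is only a sketch of one possible clean-up. It uses a small hand-written Series with sample values rather than the real table, and which values count as placeholders is an assumption.

```python
import pandas as pd


# Sample values modelled on the output above, not the full column
raw = pd.Series(
    ["No", "XML schema", "XML Schema", "Schema-IDL?", "Yes", "No", "—", "?"],
    name="Schema-IDL?",
)

# Drop repeated header rows, where the value equals the column name
cleaned = raw[raw != raw.name]

# Drop placeholder values (assumed here to mean "unknown")
cleaned = cleaned[~cleaned.isin(["—", "?"])]

# Normalise whitespace and case so that spelling variants are merged
cleaned = cleaned.str.strip().str.lower()

counts = cleaned.value_counts()
print(counts)
```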

XML

pandas has a function read_xml, which makes reading XML files very easy:

[5]:
pd.read_xml("books.xml")
[5]:
   id             title language        author       license        date
0   1     Python basics       en  Veit Schiele  BSD-3-Clause  2021-10-28
1   2  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause  2019-06-27
2   3  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause  2020-10-26
3   4    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause  2020-04-13
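The contents of books.xml are not shown here. The following sketch therefore uses an inline XML snippet whose structure is an assumption modelled on the output above: each book element has an id attribute and one child element per remaining column. read_xml turns both attributes and child elements into columns. The xpath option selects the elements that become rows, and parser="etree" uses the standard library instead of the default lxml parser.

```python
from io import StringIO

import pandas as pd


# Assumed structure of the file; only two of the books are included
xml = """<books>
  <book id="1">
    <title>Python basics</title>
    <language>en</language>
  </book>
  <book id="2">
    <title>Jupyter Tutorial</title>
    <language>en</language>
  </book>
</books>"""

# Each element matched by xpath becomes one row
df = pd.read_xml(StringIO(xml), xpath=".//book", parser="etree")

print(df)
```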

lxml

Alternatively, you can first parse the XML file with lxml.objectify and then build the DataFrame yourself. The getroot method returns a reference to the root node of the XML file:

[6]:
from lxml import objectify


# Pass the path directly so that no file handle is left open
parsed = objectify.parse("books.xml")
root = parsed.getroot()
[7]:
books = []

for element in root.book:
    data = {}
    # iterchildren() replaces the deprecated getchildren()
    for child in element.iterchildren():
        data[child.tag] = child.pyval
    books.append(data)
[8]:
pd.DataFrame(books)
[8]:
              title language        author       license        date
0     Python basics       en  Veit Schiele  BSD-3-Clause  2021-10-28
1  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause  2019-06-27
2  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause  2020-10-26
3    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause  2020-04-13
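Unlike the read_xml result, this DataFrame has no id column, because the loop only visits child elements and not attributes. Assuming that id is stored as an attribute of each book element (the file itself is not shown, so the inline XML below is a made-up stand-in), a sketch that also collects the attribute could look like this:

```python
from lxml import objectify


# Hypothetical stand-in for books.xml with an id attribute per book
xml = b"""<books>
  <book id="1">
    <title>Python basics</title>
    <language>en</language>
    <date>2021-10-28</date>
  </book>
  <book id="2">
    <title>Jupyter Tutorial</title>
    <language>de</language>
    <date>2020-10-26</date>
  </book>
</books>"""

root = objectify.fromstring(xml)

books = []
for element in root.book:
    # Attributes are read with get(); they are always strings
    data = {"id": int(element.get("id"))}
    for child in element.iterchildren():
        data[child.tag] = child.pyval
    books.append(data)

print(books)
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as before.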