XML/HTML examples¶

HTML¶

Python has numerous libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples are lxml, Beautiful Soup and html5lib. While lxml is generally comparatively much faster, the other libraries are better at handling corrupted HTML or XML files.

pandas has a built-in function, read_html, which uses libraries like lxml, html5lib and Beautiful Soup to automatically parse tables from HTML files as DataFrame objects. These have to be installed additionally. With Spack you can provide lxml, BeautifulSoup and html5lib in your kernel:

$ spack env activate python-311
$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib

Alternatively, you can install BeautifulSoup with other package managers, for example

$ uv add lxml beautifulsoup4 html5lib

To show how this works, I use an HTML file from Wikipedia that gives an overview of different serialisation formats.

[1]:

import pandas as pd


tables = pd.read_html(
    "https://docs.python.org/3/library/xml.dom.html",
)

The pandas.read_html function has a number of options, but by default it looks for and tries to parse all table data contained in <table> tags. The result is a list of DataFrame objects:

[2]:

len(tables)

[2]:

[3]:

xml_idl = tables[2]

xml_idl.head()

[3]:

	IDL Type	Python Type
0	boolean	bool or int
1	int	int
2	long int	int
3	unsigned int	int
4	DOMString	str or bytes

From here we can do some data cleansing and analysis, such as the number of different Python types:

[4]:

xml_idl["Python Type"].value_counts()

[4]:

Python Type
int             3
bool or int     1
str or bytes    1
Name: count, dtype: int64

XML¶

pandas has a function read_xml, which makes reading XML files very easy:

[5]:

pd.read_xml("books.xml")

[5]:

	id	title	language	author	license	date
0	1	Python basics	en	Veit Schiele	BSD-3-Clause	2021-10-28
1	2	Jupyter Tutorial	en	Veit Schiele	BSD-3-Clause	2019-06-27
2	3	Jupyter Tutorial	de	Veit Schiele	BSD-3-Clause	2020-10-26
3	4	PyViz Tutorial	en	Veit Schiele	BSD-3-Clause	2020-04-13

`lxml`¶

Alternatively, lxml.objectify can be used first to parse XML files. In doing so, we get a reference to the root node of the XML file with getroot:

[6]:

from pathlib import Path

from lxml import objectify

parsed = objectify.parse(Path.open("books.xml"))
root = parsed.getroot()

[7]:

books = []

for element in root.book:
    data = {}
    for child in element.getchildren():
        data[child.tag] = child.pyval
    books.append(data)

[8]:

pd.DataFrame(books)

[8]:

	title	language	author	license	date
0	Python basics	en	Veit Schiele	BSD-3-Clause	2021-10-28
1	Jupyter Tutorial	en	Veit Schiele	BSD-3-Clause	2019-06-27
2	Jupyter Tutorial	de	Veit Schiele	BSD-3-Clause	2020-10-26
3	PyViz Tutorial	en	Veit Schiele	BSD-3-Clause	2020-04-13

XML/HTML examples¶

HTML¶

XML¶

lxml¶

`lxml`¶