XML/HTML examples#
HTML#
Python has numerous libraries for reading and writing data in the ubiquitous HTML and XML formats, for example lxml, Beautiful Soup and html5lib. lxml is generally much faster, while the other libraries are better at handling malformed HTML or XML files.
pandas has a built-in function, read_html, which uses libraries like lxml, html5lib and Beautiful Soup to automatically parse tables from HTML files as DataFrame objects. These libraries have to be installed separately. With Spack you can provide lxml, Beautiful Soup and html5lib in your kernel:
$ spack env activate python-311
$ spack install py-lxml py-beautifulsoup4~html5lib~lxml py-html5lib
Alternatively, you can install these libraries with other package managers, for example:
$ pipenv install lxml beautifulsoup4 html5lib
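After installation, it can be useful to check which of the optional parsers the current kernel can actually import. The following is a minimal sketch that uses only the standard library; note that the import name of Beautiful Soup is bs4, not the package name beautifulsoup4. The variable names are chosen for illustration only.

```python
from importlib.util import find_spec

# Import names of the three parser libraries (beautifulsoup4 is imported as bs4)
parsers = ("lxml", "bs4", "html5lib")

# find_spec returns None if a module cannot be found in the current environment
available = {name: find_spec(name) is not None for name in parsers}

for name, found in available.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```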
To show how this works, I use an HTML file from Wikipedia that gives an overview of different serialisation formats.
[1]:
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats")
The pandas.read_html function has a number of options, but by default it looks for and tries to parse all table data contained in <table> tags. The result is a list of DataFrame objects:
[2]:
len(tables)
[2]:
3
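One of the options of read_html is match, which keeps only those tables whose text matches a string or regular expression. The sketch below uses a small, made-up HTML snippet instead of the Wikipedia page so that it runs offline; the table contents are purely illustrative, and one of the parser libraries mentioned above must be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML document with two tables (illustrative data only)
html = """
<html><body>
<table>
  <tr><th>Format</th><th>Binary</th></tr>
  <tr><td>JSON</td><td>No</td></tr>
  <tr><td>Avro</td><td>Yes</td></tr>
</table>
<table>
  <tr><th>Library</th><th>Speed</th></tr>
  <tr><td>lxml</td><td>fast</td></tr>
</table>
</body></html>
"""

# Without options, every <table> element becomes a DataFrame
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are returned
matched = pd.read_html(StringIO(html), match="Library")

print(len(all_tables), len(matched))
print(matched[0])
```

Wrapping the markup in StringIO avoids the deprecation warning that newer pandas versions emit when literal HTML is passed as a plain string.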
[3]:
formats = tables[0]
formats.head()
[3]:
| | Name | Creator-maintainer | Based on | Standardized?[definition needed] | Specification | Binary? | Human-readable? | Supports references?e | Schema-IDL? | Standard APIs | Supports zero-copy operations |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Apache Avro | Apache Software Foundation | — | No | Apache Avro™ Specification | Yes | Partialg | — | Built-in | C, C#, C++, Java, PHP, Python, Ruby | — |
1 | Apache Parquet | Apache Software Foundation | — | No | Apache Parquet | Yes | No | No | — | Java, Python, C++ | No |
2 | ASN.1 | ISO, IEC, ITU-T | — | Yes | ISO/IEC 8824 / ITU-T X.680 (syntax) and ISO/IE... | BER, DER, PER, OER, or custom via ECN | XER, JER, GSER, or custom via ECN | Yesf | Built-in | — | OER |
3 | Bencode | Bram Cohen (creator) BitTorrent, Inc. (maintai... | — | De facto as BEP | Part of BitTorrent protocol specification | Except numbers and delimiters, being ASCII | No | No | No | No | No |
4 | Binn | Bernardo Ramos | JSON (loosely) | No | Binn Specification | Yes | No | No | No | No | Yes |
From here we can do some data cleansing and analysis, such as counting how often each Schema-IDL value occurs:
[4]:
formats["Schema-IDL?"].value_counts()
[4]:
Schema-IDL?
No 15
Yes 5
Built-in 4
Schema WD 1
Partial (Kwalify, Rx, built-in language type-defs) 1
XML schema, RELAX NG 1
WSDL, XML schema 1
Partial (JSON Schema Proposal, other JSON schemas/IDLs) 1
? 1
Ion schema 1
Partial (JSON Schema Proposal, ASN.1 with JER, Kwalify, Rx, Itemscript Schema), JSON-LD 1
— 1
XML schema 1
XML Schema 1
Partial (Signature strings) 1
CDDL 1
Schema-IDL? 1
Name: count, dtype: int64
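The counts above still contain a header row that is repeated inside the table body, placeholder values and spelling variants such as XML schema and XML Schema. A possible cleansing step is sketched below on a small hand-made Series rather than on the real column; the values are an assumed excerpt, and which placeholders should count as missing is a judgment call.

```python
import pandas as pd

# Hand-made excerpt of the "Schema-IDL?" column (assumed values, for illustration)
raw = pd.Series(
    ["No", "Yes", "XML schema", "XML Schema", "Schema-IDL?", "\u2014", "No", "?"],
    name="Schema-IDL?",
)

# Drop the header row that is repeated inside the table body
body = raw[raw != raw.name]

# Treat the dash placeholder and "?" as missing, then unify the spelling
cleaned = body.mask(body.isin(["\u2014", "?"])).str.lower()

# value_counts ignores missing values by default
counts = cleaned.value_counts()
print(counts)
```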
XML#
pandas has a function, read_xml, which makes reading XML files very easy:
[5]:
pd.read_xml("books.xml")
[5]:
| | id | title | language | author | license | date |
---|---|---|---|---|---|---|
0 | 1 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | 2 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | 3 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
3 | 4 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
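The content of books.xml is not shown here. The sketch below assumes a structure in which each book element carries id as an attribute and the remaining fields as child elements, reduced to two records and two fields. It uses parser="etree" so that it also runs without lxml, and elems_only=True to show how attributes can be left out.

```python
from io import StringIO

import pandas as pd

# Assumed structure of books.xml, shortened for illustration
xml = """
<books>
  <book id="1">
    <title>Python basics</title>
    <language>en</language>
  </book>
  <book id="3">
    <title>Jupyter Tutorial</title>
    <language>de</language>
  </book>
</books>
"""

# By default, attributes and child elements both become columns
books = pd.read_xml(StringIO(xml), parser="etree")

# elems_only=True ignores attributes such as id
elements_only = pd.read_xml(StringIO(xml), parser="etree", elems_only=True)

print(books)
print(elements_only)
```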
lxml#
Alternatively, lxml.objectify can be used first to parse XML files. In doing so, we get a reference to the root node of the XML file with getroot:
[6]:
from lxml import objectify

# Passing the file name lets lxml open and close the file itself
parsed = objectify.parse("books.xml")
root = parsed.getroot()
[7]:
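books = []
# Iterating over root.book yields every <book> element below the root
for element in root.book:
    data = {}
    # Map each child tag to its value converted to a Python type
    for child in element.iterchildren():
        data[child.tag] = child.pyval
    books.append(data)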
[8]:
pd.DataFrame(books)
[8]:
| | title | language | author | license | date |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
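Unlike the read_xml result, this DataFrame has no id column, presumably because id is stored as an attribute of each book element and the loop only visits child elements. The following sketch shows one way to include attributes as well; it assumes that structure and uses a shortened inline document instead of books.xml.

```python
import pandas as pd
from lxml import objectify

# Assumed structure of books.xml, shortened for illustration
xml = b"""
<books>
  <book id="1">
    <title>Python basics</title>
    <language>en</language>
  </book>
  <book id="3">
    <title>Jupyter Tutorial</title>
    <language>de</language>
  </book>
</books>
"""

root = objectify.fromstring(xml)

records = []
for book in root.book:
    # Start with the attributes of <book>; their values are always strings
    record = dict(book.attrib)
    # Then add the child elements, converted to Python types by objectify
    for child in book.iterchildren():
        record[child.tag] = child.pyval
    records.append(record)

df = pd.DataFrame(records)
print(df)
```

Attribute values stay strings here, so a column such as id would still need an explicit conversion, for example with astype(int).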