BeautifulSoup
[1]:
import requests
url = "https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte"
r = requests.get(url)
Install:
With Spack you can make BeautifulSoup available in your kernel:
$ spack env activate python-311
$ spack install py-beautifulsoup4~html5lib~lxml
Alternatively, you can install BeautifulSoup with other package managers, for example:
$ pipenv install beautifulsoup4
With r.content we can output the HTML of the page (as bytes).
Next, we have to parse this content into a Python representation of the page with BeautifulSoup:
[2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, "html.parser")
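To illustrate what this Python representation looks like without making a network request, here is a minimal sketch on a hand-written HTML fragment. The fragment and the names html and soup_demo are made up for illustration; they stand in for r.content and soup from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content: requests returns the body as bytes.
html = (
    b"<html><head><title>Demo</title></head>"
    b"<body><h1>Streets</h1><p>First</p><p>Second</p></body></html>"
)

# Parse the bytes with Python's built-in HTML parser.
soup_demo = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes; find_all returns every match.
print(soup_demo.title.string)
print(soup_demo.h1.get_text())
print(len(soup_demo.find_all("p")))
```

Attribute access such as soup_demo.title returns only the first matching tag, whereas find_all collects all of them.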
To structure the code, we create a new function get_dom (Document Object Model) that includes all the previous code:
[3]:
def get_dom(url):
    # Fetch the page and fail early on HTTP errors (4xx/5xx)
    r = requests.get(url)
    r.raise_for_status()
    return BeautifulSoup(r.content, "html.parser")
Individual elements can be filtered out, for example, via CSS selectors. You can determine these on a website, for example, by right-clicking on one of the table cells in the first column of the table in Firefox. In the Inspector that now opens, you can right-click the element again and then select Copy → CSS Selector. The clipboard will then contain, for example, table.wikitable:nth-child(13) > tbody:nth-child(2) > tr:nth-child(1). We now clean up this CSS selector, as we do not want to restrict the match to a table.wikitable that is the 13th child element of its parent or to a tbody that is the 2nd child element of the table, but only to the 1st column within tbody.
Finally, with limit=3 in this notebook, we only display the first three results as an example:
[4]:
links = soup.select(
    "table.wikitable > tbody > tr > td:nth-child(1) > a", limit=3
)
print(links)
[<a href="/wiki/Ackerstra%C3%9Fe_(Berlin)" title="Ackerstraße (Berlin)">Ackerstraße</a>, <a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>, <a href="/wiki/Almstadtstra%C3%9Fe" title="Almstadtstraße">Almstadtstraße</a>]
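The effect of the cleaned-up selector and of limit can also be checked offline. The following is a sketch on an invented table fragment (the rows, link targets and the names demo, all_links and first_two are assumptions for illustration, not taken from the Wikipedia page). Note that html.parser does not insert a tbody element the way browsers do, so the fragment contains one explicitly.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the structure of the Wikipedia table.
html = """
<table class="wikitable">
  <tbody>
    <tr><td><a href="/wiki/A">A Street</a></td><td><a href="/wiki/X">other</a></td></tr>
    <tr><td><a href="/wiki/B">B Street</a></td><td>text</td></tr>
    <tr><td><a href="/wiki/C">C Street</a></td><td>text</td></tr>
  </tbody>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# Same selector as above: links in the first cell of every row.
selector = "table.wikitable > tbody > tr > td:nth-child(1) > a"

all_links = demo.select(selector)           # every match
first_two = demo.select(selector, limit=2)  # stop after two matches

print([a.text for a in all_links])
print([a.text for a in first_two])
```

The link in the second column is skipped because td:nth-child(1) only matches a td that is the first child element of its row.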
However, we do not want the entire HTML link, but only its text content:
[5]:
for content in links:
    print(content.text)
Ackerstraße
Alexanderplatz
Almstadtstraße
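If the link target is needed in addition to the text, the href attribute can be read with dictionary-style access. The sketch below reuses one of the links from the output above as a stand-alone fragment; combining it with urljoin from the standard library to obtain an absolute URL is an assumption about what you might want to do next, not part of the original notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL of the page the relative links were found on.
base_url = "https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte"

# One link as it appears in the output above.
html = '<a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>'
link = BeautifulSoup(html, "html.parser").a

name = link.get_text(strip=True)          # text content without surrounding whitespace
target = urljoin(base_url, link["href"])  # resolve the relative href against the page URL

print(name, target)
```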
See also