BeautifulSoup#

[1]:
import requests


url = "https://de.wikipedia.org/wiki/Liste_der_Stra%C3%9Fen_und_Pl%C3%A4tze_in_Berlin-Mitte"
r = requests.get(url)
  1. Install:

    With Spack you can make BeautifulSoup available in your kernel:

    $ spack env activate python-311
    $ spack install py-beautifulsoup4~html5lib~lxml
    

    Alternatively, you can install BeautifulSoup with other package managers, for example

    $ pipenv install beautifulsoup4
    
  1. With r.content we can output the HTML of the page.

  1. Next, we have to decompose this string into a Python representation of the page with BeautifulSoup:

[2]:
from bs4 import BeautifulSoup


soup = BeautifulSoup(r.content, "html.parser")
  1. To structure the code, we create a new function get_dom (Document Object Model) that includes all the previous code:

[3]:
def get_dom(url):
    r = request.get(url)
    r.raise_for_status()
    return BeautifulSoup(r.content, "html.parser")

Filtering out individual elements can be done, for example, via CSS selectors. These can be determined in a website, for example, by right-clicking on one of the table cells in the first column of the table in Firefox. In the Inspector that now opens, you can right-click the element again and then select Copy → CSS Selector. The clipboard will then contain, for example, table.wikitable:nth-child(13) > tbody:nth-child(2) > tr:nth-child(1). We now clean up this CSS selector, as we do not want to filter for the 13th child element of the table.wikitable or the 2nd child element in tbody, but only for the 1st column within tbody.

Finally, with limit=3 in this notebook, we only display the first three results as an example:

[4]:
links = soup.select(
    "table.wikitable > tbody > tr > td:nth-child(1) > a", limit=3
)

print(links)
[<a href="/wiki/Ackerstra%C3%9Fe_(Berlin)" title="Ackerstraße (Berlin)">Ackerstraße</a>, <a href="/wiki/Alexanderplatz" title="Alexanderplatz">Alexanderplatz</a>, <a href="/wiki/Almstadtstra%C3%9Fe" title="Almstadtstraße">Almstadtstraße</a>]

However, we do not want the entire HTML link, but only its text content:

[5]:
for content in links:
    print(content.text)
Ackerstraße
Alexanderplatz
Almstadtstraße