Multi-processing example#

We’ll start with code that is clear, simple, and executed top-down. It’s easy to develop and incrementally testable:

[1]:
from multiprocessing.pool import ThreadPool as Pool

import requests


sites = [
    "https://github.com/veit/jupyter-tutorial/",
    "https://jupyter-tutorial.readthedocs.io/en/latest/",
    "https://github.com/veit/pyviz-tutorial/",
    "https://pyviz-tutorial.readthedocs.io/de/latest/",
    "https://cusy.io/en",
]


def sitesize(url):
    with requests.get(url) as u:
        return url, len(u.content)


pool = Pool(10)
for result in pool.imap_unordered(sitesize, sites):
    print(result)
('https://cusy.io/en', 36389)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 40884)
('https://github.com/veit/jupyter-tutorial/', 236862)
('https://github.com/veit/pyviz-tutorial/', 213124)
('https://pyviz-tutorial.readthedocs.io/de/latest/', 32803)

Note

A good development strategy is to use map to test your code in a single process and thread before moving to multiprocessing.

Note

To better assess when ThreadPool and when a process-based Pool should be used, here are some rules of thumb:

  • For CPU-heavy jobs, multiprocessing.pool.Pool should be used. Usually we start here with twice the number of CPU cores for the pool size, but at least 4.

  • For I/O-heavy jobs, multiprocessing.pool.ThreadPool should be used. Usually we start here with five times the number of CPU cores for the pool size.

  • If we use Python 3 and do not need an interface identical to pool, we use concurrent.futures.ThreadPoolExecutor instead of multiprocessing.pool.ThreadPool; it has a simpler interface and was designed for threads from the start. Since it returns instances of concurrent.futures.Future, it is compatible with many other libraries, including asyncio.

  • For CPU- and I/O-heavy jobs, we prefer multiprocessing.Pool because it provides better process isolation.
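The concurrent.futures alternative mentioned above can be sketched as follows. To keep the sketch runnable without network access, a hypothetical stand-in replaces the sitesize function from the cells above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

sites = [
    "https://github.com/veit/jupyter-tutorial/",
    "https://cusy.io/en",
]


# Hypothetical stand-in for sitesize, so the sketch needs no network access
def sitesize(url):
    return url, len(url)


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns concurrent.futures.Future instances;
    # as_completed() yields them in completion order, like imap_unordered
    futures = [executor.submit(sitesize, url) for url in sites]
    for future in as_completed(futures):
        print(future.result())
```

Because the results are Future instances, the same code can later be combined with asyncio or other Future-aware libraries without changing the worker function.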

[2]:
from multiprocessing.pool import ThreadPool as Pool

import requests


sites = [
    "https://github.com/veit/jupyter-tutorial/",
    "https://jupyter-tutorial.readthedocs.io/en/latest/",
    "https://github.com/veit/pyviz-tutorial/",
    "https://pyviz-tutorial.readthedocs.io/de/latest/",
    "https://cusy.io/en",
]


def sitesize(url):
    with requests.get(url) as u:
        return url, len(u.content)


for result in map(sitesize, sites):
    print(result)
('https://github.com/veit/jupyter-tutorial/', 236862)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 40884)
('https://github.com/veit/pyviz-tutorial/', 213124)
('https://pyviz-tutorial.readthedocs.io/de/latest/', 32803)
('https://cusy.io/en', 36389)

What can be parallelised?#

Amdahl’s law#

The increase in speed is mainly limited by the sequential part of the problem, since its execution time cannot be reduced by parallelisation. In addition, parallelisation creates additional costs, such as for communication and synchronisation of the processes.
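Amdahl's law can be made concrete with a small calculation. The parallel fraction p = 0.9 and n = 4 workers below are illustrative values, not measurements from our example:

```python
def amdahl_speedup(p, n):
    """Maximum speedup when a fraction p of the work runs on n parallel workers."""
    return 1 / ((1 - p) + p / n)


# With a 10 % serial part, four workers give barely a threefold speedup,
# and even infinitely many workers could never exceed 10×.
print(round(amdahl_speedup(0.9, 4), 2))  # → 3.08
```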

In our example, the following tasks can only be processed serially:

  • UDP DNS request for the URL

  • UDP DNS response

  • Socket from the OS

  • Establishing the TCP connection

  • Sending the HTTP request for the root resource

  • Waiting for the HTTP response

  • Counting characters on the site

[3]:
from multiprocessing.pool import ThreadPool as Pool

import requests


sites = [
    "https://github.com/veit/jupyter-tutorial/",
    "https://jupyter-tutorial.readthedocs.io/en/latest/",
    "https://github.com/veit/pyviz-tutorial/",
    "https://pyviz-tutorial.readthedocs.io/de/latest/",
    "https://cusy.io/en",
]


def sitesize(url):
    with requests.get(url, stream=True) as u:
        return url, len(u.content)


pool = Pool(4)
for result in pool.imap_unordered(sitesize, sites):
    print(result)
('https://github.com/veit/jupyter-tutorial/', 236862)
('https://github.com/veit/pyviz-tutorial/', 213124)
('https://pyviz-tutorial.readthedocs.io/de/latest/', 32803)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 40884)
('https://cusy.io/en', 36389)

Note

imap_unordered is used to improve responsiveness. However, this only works because the function returns its argument together with the result as a tuple, so each result can still be matched to its input.
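The pattern of returning the argument alongside the result can be sketched without network access. The work function below is a hypothetical stand-in for sitesize:

```python
from multiprocessing.pool import ThreadPool as Pool


# Hypothetical stand-in for sitesize: returning the argument as part of
# the result lets us match outputs to inputs despite arbitrary ordering
def work(n):
    return n, n * n


with Pool(4) as pool:
    # imap_unordered yields results in completion order, not input order;
    # collecting the (argument, result) tuples into a dict restores the mapping
    results = dict(pool.imap_unordered(work, [1, 2, 3, 4]))

print(results[3])  # → 9
```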

Tips#

  • Don’t make too many trips back and forth

    If you get too many small iterable results, this is a good indicator of too many round trips, as in:

    >>> def sitesize(url, start):
    ...     req = urllib.request.Request(url)
    ...     req.add_header("Range", "bytes=%d-%d" % (start, start + 1000))
    ...     with urllib.request.urlopen(req) as u:
    ...         block = u.read()
    ...     return url, len(block)
    
  • Make relevant progress on every trip

    Once a trip has been made, you should make significant progress and not get bogged down in tiny steps. The following example illustrates intermediate steps that are too small:

    >>> def sitesize(url, results):
    ...     with requests.get(url, stream=True) as u:
    ...         for line in u.iter_lines():
    ...             results.put((url, len(line)))
    
  • Don’t send or receive too much data

    The following example unnecessarily transfers the full page content when only its length is needed:

    >>> def sitesize(url):
    ...     with requests.get(url) as u:
    ...         return url, u.content