Multi-processing example#
We’ll start with code that is clear, simple, and executed top-down. It’s easy to develop and incrementally testable:
[1]:
from multiprocessing.pool import ThreadPool as Pool
import requests
sites = [
"https://github.com/veit/jupyter-tutorial/",
"https://jupyter-tutorial.readthedocs.io/en/latest/",
"https://github.com/veit/pyviz-tutorial/",
"https://pyviz-tutorial.readthedocs.io/de/latest/",
"https://cusy.io/en",
]
def sitesize(url):
    with requests.get(url) as u:
        return url, len(u.content)

pool = Pool(10)
for result in pool.imap_unordered(sitesize, sites):
    print(result)
('https://cusy.io/en', 36389)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 40884)
('https://github.com/veit/jupyter-tutorial/', 236862)
('https://github.com/veit/pyviz-tutorial/', 213124)
('https://pyviz-tutorial.readthedocs.io/de/latest/', 32803)
Note
A good development strategy is to use map, to test your code in a single process and thread before moving to multi-processing.
Note

In order to better assess when ThreadPool and when process Pool should be used, here are some rules of thumb:

- For CPU-heavy jobs, multiprocessing.pool.Pool should be used. Usually we start here with twice the number of CPU cores for the pool size, but at least 4.
- For I/O-heavy jobs, multiprocessing.pool.ThreadPool should be used. Usually we start here with five times the number of CPU cores for the pool size.
- If we use Python 3 and do not need an interface identical to pool, we use concurrent.futures.Executor instead of multiprocessing.pool.ThreadPool; it has a simpler interface and was designed for threads from the start. Since it returns instances of concurrent.futures.Future, it is compatible with many other libraries, including asyncio. A minimal sketch follows this note.
- For jobs that are both CPU- and I/O-heavy, we prefer multiprocessing.Pool because it provides better process isolation.
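A minimal sketch of the concurrent.futures variant mentioned above, reusing the same sitesize function as in the surrounding cells (the pool size of 10 is simply the value from cell [1]):

from concurrent.futures import ThreadPoolExecutor

import requests

sites = [
    "https://github.com/veit/jupyter-tutorial/",
    "https://jupyter-tutorial.readthedocs.io/en/latest/",
]

def sitesize(url):
    with requests.get(url) as u:
        return url, len(u.content)

# The executor cleans up its worker threads when the with block exits.
with ThreadPoolExecutor(max_workers=10) as executor:
    # map() yields results in input order; concurrent.futures.as_completed
    # over submitted futures would behave like imap_unordered instead.
    for result in executor.map(sitesize, sites):
        print(result)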
[2]:
from multiprocessing.pool import ThreadPool as Pool
import requests
sites = [
"https://github.com/veit/jupyter-tutorial/",
"https://jupyter-tutorial.readthedocs.io/en/latest/",
"https://github.com/veit/pyviz-tutorial/",
"https://pyviz-tutorial.readthedocs.io/de/latest/",
"https://cusy.io/en",
]
def sitesize(url):
    with requests.get(url) as u:
        return url, len(u.content)

for result in map(sitesize, sites):
    print(result)
('https://github.com/veit/jupyter-tutorial/', 236862)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 40884)
('https://github.com/veit/pyviz-tutorial/', 213124)
('https://pyviz-tutorial.readthedocs.io/de/latest/', 32803)
('https://cusy.io/en', 36389)
What can be parallelised?#
Amdahl’s law#
The increase in speed is mainly limited by the sequential part of the problem, since its execution time cannot be reduced by parallelisation. Parallelisation also incurs additional costs, for example for communication and synchronisation between the processes.
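A minimal sketch of the resulting bound; the 10% serial fraction below is an illustrative assumption, not a measurement of our example:

# Amdahl's law: with a serial fraction s and n parallel workers,
# the achievable speedup is bounded by 1 / (s + (1 - s) / n).
def amdahl_speedup(s, n):
    return 1 / (s + (1 - s) / n)

# An assumed 10% serial share caps ten workers at roughly 5.3x speedup,
# and even infinitely many workers at 10x.
print(amdahl_speedup(0.1, 10))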
In our example, the following tasks can only be processed serially:
- UDP DNS request for the URL
- UDP DNS response
- Obtaining a socket from the OS
- Establishing the TCP connection
- Sending the HTTP request for the root resource
- Waiting for the TCP response
- Counting the characters on the site
[3]:
from multiprocessing.pool import ThreadPool as Pool
import requests
sites = [
"https://github.com/veit/jupyter-tutorial/",
"https://jupyter-tutorial.readthedocs.io/en/latest/",
"https://github.com/veit/pyviz-tutorial/",
"https://pyviz-tutorial.readthedocs.io/de/latest/",
"https://cusy.io/en",
]
def sitesize(url):
    with requests.get(url, stream=True) as u:
        return url, len(u.content)

pool = Pool(4)
for result in pool.imap_unordered(sitesize, sites):
    print(result)
('https://github.com/veit/jupyter-tutorial/', 236862)
('https://github.com/veit/pyviz-tutorial/', 213124)
('https://pyviz-tutorial.readthedocs.io/de/latest/', 32803)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 40884)
('https://cusy.io/en', 36389)
Note
imap_unordered is used to improve responsiveness: each result is printed as soon as it arrives, not in input order. However, this is only possible because the function returns the argument and the result as a tuple, so every size can still be matched to its URL, as the sketch below shows.
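A minimal sketch, reusing the sitesize function, sites list, and pool from cell [3]: collecting the tuples into a dict restores the association between input and result, whatever the completion order:

# Keys are the URLs, values the sizes, regardless of which request
# finished first.
sizes = dict(pool.imap_unordered(sitesize, sites))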
Tips#
Don’t make too many trips back and forth
If you get back a large number of small, iterable results, this is a good indicator of too many trips, such as in

>>> import urllib.request
>>> def sitesize(url, start):
...     req = urllib.request.Request(url)
...     req.add_header("Range", "bytes=%d-%d" % (start, start + 1000))
...     u = urllib.request.urlopen(req)
...     block = u.read()
...     return url, len(block)
Make relevant progress on every trip
Once a task reaches a worker, it should make significant progress rather than getting bogged down in many tiny steps. The following example illustrates intermediate steps that are too small:

>>> def sitesize(url, results):
...     with requests.get(url, stream=True) as u:
...         for line in u.iter_lines():
...             results.put((url, len(line)))
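A sketch of the coarser-grained counterpart, assuming results is a queue as above: accumulate within the worker and put a single result per site.

>>> def sitesize(url, results):
...     size = 0
...     with requests.get(url, stream=True) as u:
...         for line in u.iter_lines():  # still streams the body
...             size += len(line)
...     results.put((url, size))  # one trip per site, not per line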
Don’t send or receive too much data
The following example unnecessarily increases the amount of data:
>>> def sitesize(url):
...     with requests.get(url) as u:
...         return url, u.content
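Better, as in the cells above, is to count in the worker and send back only the length:

>>> def sitesize(url):
...     with requests.get(url) as u:
...         return url, len(u.content)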