Performance#
Python can be used to write and test code quickly because it is an interpreted, dynamically typed language. However, these are also the reasons it is slow when performing simple tasks repeatedly, for example in loops.
When developing code, there can often be tradeoffs between different implementations. However, at the beginning of the development of an algorithm, it is usually counterproductive to worry about the efficiency of the code.
«We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.»[1]
k-Means example#
In the following, I show examples of the k-means algorithm to form a previously known number of groups from a set of objects. This can be achieved in the following three steps:
1. Choose the first k elements as cluster centres.
2. Assign each new element to the cluster with the least increase in variance.
3. Update the cluster centre.
Steps 2 and 3 are repeated until the assignments no longer change.
A possible implementation with pure Python could look like this:
# SPDX-FileCopyrightText: 2021 Veit Schiele
#
# SPDX-License-Identifier: BSD-3-Clause


def dist(x, y):
    """Calculate the squared distance"""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def find_labels(points, centers):
    """Assign points to a cluster."""
    labels = []
    for point in points:
        distances = [dist(point, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels


def compute_centers(points, labels):
    """Calculate the cluster centres."""
    n_centers = len(set(labels))
    n_dims = len(points[0])

    centers = [[0 for i in range(n_dims)] for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]

    for label, point in zip(labels, points):
        counts[label] += 1
        centers[label] = [a + b for a, b in zip(centers[label], point)]

    return [[x / count for x in center] for center, count in zip(centers, counts)]


def kmeans(points, n_clusters):
    """Calculates the cluster centres repeatedly until nothing changes."""
    # Step 1: take the first n_clusters points as initial centres
    centers = points[:n_clusters].tolist()

    while True:
        old_centers = centers
        labels = find_labels(points, centers)
        centers = compute_centers(points, labels)
        if centers == old_centers:
            break

    return labels
We can create sample data with:
from sklearn.datasets import make_blobs

points, labels_true = make_blobs(
    n_samples=1000, centers=3, random_state=0, cluster_std=0.60
)
And finally, we can perform the calculation with:
kmeans(points, 10)
Performance measurements#
Once you have worked with your code, it can be useful to examine its efficiency more closely. The IPython Profiler or scalene can be used for this.
See also
airspeed velocity (asv) is a tool for benchmarking Python packages during their lifetime. Runtime, memory consumption and even user-defined values can be recorded and displayed in an interactive web frontend.
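For a rough first measurement, the standard library modules timeit and cProfile can stand in for the tools named above. The following is a minimal sketch under assumptions: the random sample data is made up for this illustration, and dist and find_labels simply repeat the pure Python versions from the k-means example so that the block runs on its own.

```python
import cProfile
import io
import pstats
import random
import timeit


def dist(x, y):
    """Calculate the squared distance"""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def find_labels(points, centers):
    """Assign points to a cluster."""
    labels = []
    for point in points:
        distances = [dist(point, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical sample data: 1000 random 2-D points, the first 10 as centres
random.seed(0)
points = [[random.random(), random.random()] for _ in range(1000)]
centers = points[:10]

# Wall-clock time: best of three runs with five calls each
runs = timeit.repeat(lambda: find_labels(points, centers), number=5, repeat=3)
print(f"find_labels: {min(runs) / 5 * 1000:.2f} ms per call")

# Function-level profile: which calls account for the time?
profiler = cProfile.Profile()
profiler.enable()
labels = find_labels(points, centers)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The timing answers how slow a call is, while the profile shows where the time goes; in this sketch most of it should end up in the many small calls to dist.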
Search for existing implementations#
You should not try to reinvent the wheel: If there are existing implementations, you should use them. There are even two implementations for the k-means algorithm:

from sklearn.cluster import KMeans

KMeans(10).fit_predict(points)

from dask_ml.cluster import KMeans

KMeans(10).fit(points).predict(points)
The best that could be said against these existing solutions is that they could create a considerable overhead in your project if you are not already using scikit-learn or Dask-ML elsewhere. In the following, I will therefore show you further possibilities to optimise your own code.
Find anti-patterns#
You can then use perflint to search your code for the most common performance anti-patterns in Python.
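To give an idea of what such an anti-pattern can look like, the sketch below contrasts a loop that recomputes a loop-invariant value and appends element by element with a version that hoists the invariant and uses a list comprehension. The function names and data are made up for this illustration, and whether perflint reports exactly these lines is an assumption that depends on its version and configuration.

```python
import math


def scale_slow(values, factor):
    """Anti-pattern: loop-invariant work is repeated in every iteration."""
    result = []
    for value in values:
        norm = math.sqrt(factor)  # same result in every iteration
        result.append(value / norm)
    return result


def scale_fast(values, factor):
    """Compute the invariant once and build the list in one expression."""
    norm = math.sqrt(factor)
    return [value / norm for value in values]


values = [1.0, 2.0, 3.0]
print(scale_fast(values, 4.0))
```

Both functions return the same list; the second one avoids the repeated call to math.sqrt and the repeated lookup of result.append.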
Vectorisations with NumPy#
NumPy moves repetitive operations into a statically typed compiled layer, combining the fast development time of Python with the fast execution time of C. You may be able to use Universal functions (ufunc), vectorisation and Indexing and slicing in all combinations to move repetitive operations into compiled code to avoid slow loops.
With NumPy we can do without some loops:
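# SPDX-FileCopyrightText: 2021 Veit Schiele
#
# SPDX-License-Identifier: BSD-3-Clause

import numpy as np


def find_labels(points, centers):
    """Assign points to a cluster."""
    # Broadcasting yields an array of shape (n_points, n_centers, n_dims)
    diff = points[:, None, :] - centers
    # Sum over the last axis: squared distance of every point to every centre
    distances = (diff**2).sum(-1)
    return distances.argmin(-1)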
The advantages of NumPy are that the Python overhead only occurs per array and not per array element. However, because NumPy uses a specific language for array operations, it also requires a different mindset when writing code. Finally, the batch operations can also lead to excessive memory consumption.
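The memory cost of such batch operations can be made concrete with a small sketch. It assumes 1000 two-dimensional points and 10 centres, as in the k-means example; the random data and the variable names are only for illustration. Broadcasting creates an intermediate array with one entry per point, centre and dimension, which a plain loop would never hold in memory at once.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))  # hypothetical sample data
centers = points[:10]

# (1000, 1, 2) - (10, 2) broadcasts to (1000, 10, 2)
diff = points[:, None, :] - centers
distances = (diff**2).sum(-1)  # shape (1000, 10)
labels = distances.argmin(-1)  # shape (1000,)

print(diff.shape, distances.shape, labels.shape)
print(f"intermediate array: {diff.nbytes} bytes, input: {points.nbytes} bytes")
```

The intermediate array grows with the product of the number of points, centres and dimensions, so for large inputs it can be worth processing the points in chunks.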
Special data structures#
pandas for SQL-like group operations and aggregation. This way you can also bypass the loop in the compute_centers method:
    import pandas as pd
scipy.spatial for spatial queries like distances, nearest neighbours, k-means etc. Our find_labels method can then be written more specifically:
    from scipy.spatial import cKDTree
scipy.sparse provides sparse matrices for 2-dimensional structured data.
Sparse for N-dimensional structured data.
scipy.sparse.csgraph for graph-like problems, for example searching for shortest paths.
Xarray for grouping across multiple dimensions.
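The two rewrites mentioned in the list above, compute_centers with pandas and find_labels with scipy.spatial, could look roughly as follows. This is a sketch under assumptions: it relies on DataFrame.groupby and cKDTree.query, the four sample points are made up, and it is not necessarily the code the original example used.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def find_labels(points, centers):
    """Assign points to the nearest centre with a k-d tree lookup."""
    distances, labels = cKDTree(centers).query(points, k=1)
    return labels


def compute_centers(points, labels):
    """Calculate the cluster centres as the mean of each label group."""
    return pd.DataFrame(points).groupby(labels).mean().to_numpy()


# Hypothetical sample data: two well-separated groups of points
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers = points[[0, 2]]

labels = find_labels(points, centers)
new_centers = compute_centers(points, labels)
print(labels)  # index of the nearest centre for each point
print(new_centers)  # mean of the points in each group
```

The k-d tree avoids computing every point-to-centre distance explicitly, and the group-by replaces the explicit accumulation loop over labels and points.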
Select compiler#
Faster CPython#
At PyCon US in May 2021, Guido van Rossum presented Faster CPython, a project that aims to double the speed of Python 3.11. The cooperation with the other Python core developers is regulated in PEP 659 – Specializing Adaptive Interpreter. There is also an open issue tracker and various tools for collecting bytecode statistics. CPU-intensive Python code in particular is likely to benefit from the changes; code already written in C, I/O-heavy processes and multi-threaded code, on the other hand, are unlikely to benefit.
If you do not want to wait with your project for the final release of Python 3.11, which is expected on 24 October 2022, you can also have a look at the following Python interpreters:
Cython#
For intensive numerical operations, Python can be very slow, even if you have avoided all anti-patterns and used vectorisations with NumPy. In this case, translating code into Cython can be helpful. Unfortunately, the code often has to be restructured and thus increases in complexity. Explicit type annotations and the provision of code also become more cumbersome.
Our example could then look like this:
# SPDX-FileCopyrightText: 2021 Veit Schiele
#
# SPDX-License-Identifier: BSD-3-Clause

cimport numpy as np

import numpy as np


cdef double dist(double[:] x, double[:] y):
    """Calculate the squared distance"""
    cdef double dist = 0
    for i in range(len(x)):
        dist += (x[i] - y[i]) ** 2
    return dist


def find_labels(double[:, :] points, double[:, :] centers):
    """Assign points to a cluster."""
    cdef int n_points = points.shape[0]
    cdef int n_centers = centers.shape[0]
    cdef double[:] labels = np.zeros(n_points)
    cdef double distance, nearest_distance
    cdef int nearest_index

    for i in range(n_points):
        nearest_distance = np.inf
        for j in range(n_centers):
            distance = dist(points[i], centers[j])
            if distance < nearest_distance:
                # Remember the closest centre seen so far
                nearest_distance = distance
                nearest_index = j
        labels[i] = nearest_index

    return np.asarray(labels)
Numba#
Numba is a JIT compiler that translates mainly scientific Python and NumPy code into fast machine code, for example:
# SPDX-FileCopyrightText: 2021 Veit Schiele
#
# SPDX-License-Identifier: BSD-3-Clause

import numba
import numpy as np


@numba.jit(nopython=True)
def dist(x, y):
    """Calculate the squared distance"""
    dist = 0
    for i in range(len(x)):
        dist += (x[i] - y[i]) ** 2
    return dist


@numba.jit(nopython=True)
def find_labels(points, centers):
    """Assign points to a cluster."""
    labels = []
    for i in range(len(points)):
        # Start the search for the nearest centre anew for every point
        min_dist = np.inf
        min_label = 0
        for j in range(len(centers)):
            distance = dist(points[i], centers[j])
            if distance < min_dist:
                min_dist = distance
                min_label = j
        labels.append(min_label)
    return labels
However, Numba requires LLVM and some Python constructs are not supported.
Task planner#
ipyparallel, Dask and Ray can distribute tasks in a cluster. In doing so, they have different focuses:
ipyparallel simply integrates into a JupyterHub.
Dask imitates pandas, NumPy iterators, Toolz and PySpark when it distributes their tasks.
Ray provides a simple, universal API for building distributed applications.
    RLlib will scale reinforcement learning in particular.
    A backend for joblib supports distributed scikit-learn programs.
    XGBoost-Ray is a backend for distributed XGBoost.
    LightGBM-Ray is a backend for distributed LightGBM.
    Collective Communication Lib offers a set of native collective primitives for Gloo and the NVIDIA Collective Communication Library (NCCL).
Our example could look like this with Dask:
# SPDX-FileCopyrightText: 2021 Veit Schiele
#
# SPDX-License-Identifier: BSD-3-Clause

import numpy as np

from dask import array as da
from dask import dataframe as dd


def find_labels(points, centers):
    """Assign points to a cluster."""
    diff = points[:, None, :] - centers
    distances = (diff**2).sum(-1)
    return distances.argmin(-1)


def compute_centers(points, labels):
    """Calculate the cluster centres."""
    points_df = dd.from_dask_array(points)
    labels_df = dd.from_dask_array(labels)
    return points_df.groupby(labels_df).mean()


def kmeans(points, n_clusters):
    """Calculates the cluster centres repeatedly until nothing changes."""
    # Step 1: take the first n_clusters points as initial centres
    centers = points[:n_clusters]
    points = da.from_array(points, 1000)

    while True:
        old_centers = centers
        labels = find_labels(points, da.from_array(centers, 5))
        centers = compute_centers(points, labels)
        centers = centers.compute().values

        if np.all(centers == old_centers):
            break

    return labels.compute()
Multithreading, Multiprocessing and Async#
After a brief overview, three examples of threading, multiprocessing and async illustrate the rules and best practices.
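As a small foretaste, the sketch below runs a simulated I/O-bound task in a thread pool from the standard library. The function fetch and its sleep time are hypothetical stand-ins for real network or disk access. CPU-bound work such as the k-means loops above would typically need processes rather than threads to run in parallel on a standard CPython build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(item):
    """Hypothetical I/O-bound task: wait, then return a result."""
    time.sleep(0.1)  # stands in for a network or disk request
    return item * 2


items = list(range(8))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # map() returns the results in the order of the inputs
    results = list(executor.map(fetch, items))
elapsed = time.perf_counter() - start

print(results)
print(f"{elapsed:.2f} s with threads, about {0.1 * len(items):.1f} s sequentially")
```

Because the threads spend their time waiting rather than computing, the eight tasks overlap and the total run time stays close to that of a single task.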