Performance

Python can be used to write and test code quickly because it is an interpreted language that types dynamically. However, these are also the reasons it is slow when performing simple tasks repeatedly, for example in loops.

When developing code, there can often be trade-offs between different implementations. However, at the beginning of the development of an algorithm, it is usually counterproductive to worry about the efficiency of the code.

«We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.»[1]

k-Means example

In the following, I will provide examples of the k-means algorithm algorithm, which is used to form a predefined number of clusters from a set of objects. This can be achieved using MacQueen’s algorithm in the following three steps:

  1. Choose the first k elements as cluster centres

  2. Assign each new element to the cluster with the least increase in variance.

  3. Update the cluster centre

Steps 2 and 3 are repeated until the assignments no longer change.

A possible implementation with pure Python could look like this:

py_kmeans.py
def dist(x, y):
    """Calculate the distance"""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def find_labels(points, centers):
    """Assign points to a cluster."""
    labels = []
    for point in points:
        distances = [dist(point, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels


def compute_centers(points, labels):
    """Calculate the cluster centres."""
    n_centers = len(set(labels))
    n_dims = len(points[0])

    centers = [[0 for i in range(n_dims)] for j in range(n_centers)]
    counts = [0 for j in range(n_centers)]

    for label, point in zip(labels, points):
        counts[label] += 1
        centers[label] = [a + b for a, b in zip(centers[label], point)]

    return [[x / count for x in center] for center, count in zip(centers, counts)]


def kmeans(points, n_clusters):
    """Calculates the cluster centres repeatedly until nothing changes."""
    centers = points[-n_clusters:].tolist()
    while True:
        old_centers = centers
        labels = find_labels(points, centers)
        centers = compute_centers(points, labels)
        if centers == old_centers:
            break
    return labels

We can create sample data with:

from sklearn.datasets import make_blobs


points, labels_true = make_blobs(
    n_samples=1000, centers=3, random_state=0, cluster_std=0.60
)

And finally, we can perform the calculation with:

kmeans(points, 10)

Performance measurements

Once you have worked with your code, it can be useful to examine its efficiency more closely. cProfile, iPython Profiler, scalene, tprof or Memray can be used for this. So far, I usually carry out the following steps:

cProfile, iPython Profiler, scalene or tprof can be used for this. So far, I usually carry out the following steps:

  1. I profile the entire programme with cProfile or py-spy to find slow functions.

  2. If necessary, I can use the line_profiler to identify the slow sections within the function

  3. If the slow function is computationally intensive, I try one of the following optimisations; however, if the application is data-intensive (dictionaries, strings, I/O), I take a closer look at the architecture.

  4. Then I optimise a slow function.

  5. Finally, I create a new profile and filter out the result of my optimised version so that I can compare the results.

Added in version Python3.15: PEP 799 will provide a special profiling module that organises the profiling tools integrated in Python under a uniform namespace. This module contains:

profiling.tracing

deterministic function call tracing, which has been moved from cProfile.

profiling.sampling

the new statistical sampling profiler Tachyon.

See also

1. Search for existing implementations

You should not try to reinvent the wheel: If there are existing implementations, you should use them. There are even two implementations for the k-means algorithm:

The best that could be said against these existing solutions is that they could create a considerable overhead in your project if you are not already using scikit-learn or Dask-ML elsewhere. In the following, I will therefore show you further possibilities to optimise your own code.

2. Find anti-patterns

Then you can use perflint to search your code for the most common performance anti-patterns in Python.

See also

3. Vectorisations with NumPy

NumPy moves repetitive operations into a statically typed compiled layer, combining the fast development time of Python with the fast execution time of C.

Version

Spectral-norm

vs 3.14x

CPython 3.14 – Basis

14,046ms

NumPy

27ms

520x

You may be able to use Universal functions (ufunc), vectorisation, Indexing and slicing in various combinations to move repetitive operations into compiled code and thus avoid slow loops, for example:

np_kmeans.py
import numpy as np


def find_labels(points, centers):
    """Assign points to a cluster."""
    diff = points[:, None, :] - centers
    distances = (diff**2).sum(-1)
    return distances.argmin(1)

The advantages of NumPy are that the Python overhead only occurs per array and not per array element. However, because NumPy uses a specific language for array operations, it also requires a different mindset when writing code. Finally, the batch operations can also lead to excessive memory consumption.

See also

4. Special data structures

pandas
for SQL-like Group operations and

Aggregation.

This way you can also bypass the loop in the compute_centers method:

pd_kmeans.py
import numpy as np
import pandas as pd


def compute_centers(points, labels):
    """Calculate the cluster centres."""
    df = pd.DataFrame(points)
    return df.groupby(labels).mean().values
scipy.spatial

for spatial queries like distances, nearest neighbours, k-Means etc.

Our find_labels method can then be written more specifically:

sp_kmeans.py
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def find_labels(points, centers):
    """Assign points to a cluster."""
    distances, labels = cKDTree(centers).query(points, 1)
    return labels
scipy.sparse

sparse matrices for 2-dimensional structured data.

Sparse

for N-diemensional structured data.

scipy.sparse.csgraph

for graph-like problems, for example searching for shortest paths.

Xarray

for grouping across multiple dimensions.

5. Select compiler

Faster CPython

At PyCon US in May 2021, Guido van Rossum presented Faster CPython, a project that aims to double the speed of Python 3.11. The cooperation with the other Python core developers is regulated in PEP 659 – Specializing Adaptive Interpreter. There is also an open issue tracker and various tools for collecting bytecode statistics. CPU-intensive Python code in particular is likely to benefit from the changes; code already written in C, I/O-heavy processes and multithreaded code, on the other hand, are unlikely to benefit.

And indeed, the cPython versions have become significantly more efficient since then:

Version

CPython 3.10.4

1.422x

CPython 3.12.0

1.093x

CPython 3.13.0

1.024x

CPython 3.15.0a0 – Basis

Free-threaded Python was also included in another comparison:

Version

N-body

vs 3.14

Spectral-norm

vs 3.14x

CPython 3.10

1,663ms

0.75x

16,826ms

0.83x

CPython 3.11

1,200ms

1.04x

13,430ms

1.05x

CPython 3.13

1,134ms

1.10x

13,637ms

1.03x

CPython 3.14 – Basis

1,242ms

14,046ms

CPython 3.14t

1,513ms

0.82x

14,551ms

0.97x

– Surce: The Optimization Ladder

If you don’t want to wait with your project until the release of Python 3.11 in the final version probably on 24 October 2022, you can also have a look at the following Python interpreters:

Python JIT compiler

Added in version Python3.15: The JIT compiler in Python 3.15 shows an average performance increase of 3–4% compared to the standard CPython interpreter, with performance measurements varying between -20 and over 100% on x86-64 Linux and AArch64 macOS systems.

Version

CPython 3.15.0a0 (JIT)

1.001x

CPython 3.15.0a0 – Basis

Cython

For intensive numerical operations, Python can be very slow, even if you have avoided all anti-patterns and used vectorisations with NumPy. In this case, translating code into Cython can be helpful.

Version

N-body

vs 3.14

Spectral-norm

vs 3.14x

CPython 3.14 – Basis

1,242ms

14,046ms

Cython

10ms

124x

142ms

99x

Unfortunately, the code often has to be restructured and thus increases in complexity. Explicit type annotations and the provision of code also become more cumbersome.

Our example could then look like this:

cy_kmeans.pyx
cimport numpy as np
import numpy as np


cdef double dist(double[:] x, double[:] y):
    """Calculate the distance"""
    cdef double dist = 0
    for i in range(len(x)):
        dist += (x[i] - y[i]) ** 2
    return dist

def find_labels(double[:, :] points, double[:, :] centers):
    """Assign points to a cluster."""
    cdef int n_points = points.shape[0]
    cdef int n_centers = centers.shape[0]
    cdef double[:] labels = np.zeros(n_points)
    cdef double distance, nearest_distance
    cdef int nearest_index

    for i in range(n_points):
        nearest_distance = np.inf
        for j in range(n_centers):
            distance = dist(points[i], centers[j])
            if distance < nearest_distance:
                nearest_distance = distance
                nearest_index = j
        labels[i] = nearest_index
    return np.asarray(labels)

See also

Numba

Numba is a JIT compiler that translates mainly scientific Python and NumPy code into fast machine code, for example:

nb_kmeans.py
import numba


@numba.jit(nopython=True)
def dist(x, y):
    """Calculate the distance"""
    dist = 0
    for i in range(len(x)):
        dist += (x[i] - y[i]) ** 2
    return dist


@numba.jit(nopython=True)
def find_labels(points, centers):
    """Assign points to a cluster."""
    labels = []
    min_dist = np.inf
    min_label = 0
    for i in range(len(points)):
        for j in range(len(centers)):
            distance = dist(points[i], centers[j])
            if distance < min_dist:
                min_dist, min_label = distance, j
        labels.append(min_label)
    return labels

However, Numba requires LLVM and some Python constructs are not supported.

Version

N-body

vs 3.14

Spectral-norm

vs 3.14x

CPython 3.14 – Basis

1,242ms

14,046ms

Numba

22ms

56x

104ms

135x

6. Task planner

ipyparallel, Dask and Ray can distribute tasks in a cluster. In doing so, they have different focuses:

Our example could look like this with Dask:

ds_kmeans.py
import numpy as np
from dask import array as da
from dask import dataframe as dd


def find_labels(points, centers):
    """Assign points to a cluster."""
    diff = points[:, None, :] - centers
    distances = (diff**2).sum(-1)
    return distances.argmin(1)


def compute_centers(points, labels):
    """Calculate the cluster centres."""
    points_df = dd.from_dask_array(points)
    labels_df = dd.from_dask_array(labels)
    return points_df.groupby(labels_df).mean()


def kmeans(points, n_clusters):
    """Calculates the cluster centres repeatedly until nothing changes."""
    centers = points[-n_clusters:]
    points = da.from_array(points, 1000)
    while True:
        old_centers = centers
        labels = find_labels(points, da.from_array(centers, 5))
        centers = compute_centers(points, labels)
        centers = centers.compute().values
        if np.all(centers == old_centers):
            break
    return labels.compute()

7. Multithreading, Multiprocessing and Async

After a brief overview, three examples of threading, multiprocessing and async illustrate the rules and best practices.