Manipulation of strings

pandas lets you concisely apply Python’s string methods and regular expressions to whole arrays of data.

See also:

string
re

Vectorised string functions in pandas

Cleaning up a cluttered dataset for analysis often requires a lot of string manipulation. To make matters worse, a column containing strings sometimes has missing data:

[1]:

import numpy as np
import pandas as pd


addresses = {
    "Veit": np.nan,
    "Veit Schiele": "veit.schiele@cusy.io",
    "cusy GmbH": "info@cusy.io",
}
addresses = pd.Series(addresses)

addresses

[1]:

Veit                             NaN
Veit Schiele    veit.schiele@cusy.io
cusy GmbH               info@cusy.io
dtype: object

[2]:

addresses.isna()

[2]:

Veit             True
Veit Schiele    False
cusy GmbH       False
dtype: bool

You can apply string and regular expression methods to each value by passing a lambda or other function to Series.map, but this raises an error as soon as the function calls a string method on an NA value. To deal with this, Series provides array-oriented string methods that skip NA values and pass them through unchanged. They are accessed via the str attribute of a Series; for example, we could use str.contains to check whether each email address contains veit:

[3]:

addresses.str.contains("veit")

[3]:

Veit              NaN
Veit Schiele     True
cusy GmbH       False
dtype: object
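The difference between map and the str accessor can be sketched as follows. This is a minimal illustration that rebuilds the addresses series from above; the names map_failed, upper and mask are only chosen for this example. It also uses the na parameter of str.contains, which replaces missing results with a fixed value so that the result can serve as a boolean mask:

```python
import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# NaN is a float, so calling a string method on it raises AttributeError
try:
    addresses.map(lambda x: x.upper())
    map_failed = False
except AttributeError:
    map_failed = True

# The str accessor passes the missing value through instead of failing
upper = addresses.str.upper()

# na=False treats missing values as "no match", giving a usable boolean mask
mask = addresses.str.contains("veit", na=False)
selected = addresses[mask]
```

Without na=False, the NaN in the result would prevent it from being used directly for boolean indexing.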

Regular expressions can also be used, along with flags from the re module such as re.IGNORECASE:

[4]:

import re

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]

matches

[4]:

Veit                                 NaN
Veit Schiele    (veit.schiele, cusy, io)
cusy GmbH               (info, cusy, io)
dtype: object
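It may help to look at the intermediate result of str.findall before .str[0] is applied. The following sketch uses a small, made-up series s (not part of the data above): each element becomes a list of tuples, one tuple per match, and .str[0] then picks the first match, or a missing value if there is none:

```python
import re

import pandas as pd

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# Hypothetical sample data: one address, one string without a match, one missing value
s = pd.Series(["info@cusy.io", "no address here", None])

# One list per element; each list holds one tuple of groups per match
found = s.str.findall(pattern, flags=re.IGNORECASE)

# First match of each element; empty lists and missing values become NA
first = found.str[0]
```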

There are two ways to retrieve an element from each entry in a vectorised manner: either use str.get or index into the str attribute:

[5]:

matches.str.get(1)

[5]:

Veit             NaN
Veit Schiele    cusy
cusy GmbH       cusy
dtype: object
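The index variant gives the same result as str.get. A short sketch, in which matches is rebuilt by hand so that the example is self-contained:

```python
import numpy as np
import pandas as pd

# Same content as the matches series above, written out explicitly
matches = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": ("veit.schiele", "cusy", "io"),
        "cusy GmbH": ("info", "cusy", "io"),
    }
)

by_get = matches.str.get(1)  # method call
by_index = matches.str[1]  # index syntax on the str attribute
```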

Similarly, you can slice strings with this syntax:

[6]:

addresses.str[:5]

[6]:

Veit              NaN
Veit Schiele    veit.
cusy GmbH       info@
dtype: object
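The slice syntax corresponds to the str.slice method, and negative indices work as they do for ordinary Python strings. A brief sketch, again rebuilding addresses; the names first_five and last_two are only illustrative:

```python
import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# Equivalent to addresses.str[:5]
first_five = addresses.str.slice(0, 5)

# Negative indices count from the end of each string
last_two = addresses.str[-2:]
```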

The pandas.Series.str.extract method returns the captured groups of a regular expression as a DataFrame:

[7]:

addresses.str.extract(pattern, flags=re.IGNORECASE)

[7]:

	0	1	2
Veit	NaN	NaN	NaN
Veit Schiele	veit.schiele	cusy	io
cusy GmbH	info	cusy	io
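If the groups in the regular expression are named, str.extract uses these names as column labels instead of 0, 1 and 2. The following sketch assumes the same pattern as above with added group names; user, domain and tld are names chosen for this example:

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# Same pattern as before, but each group carries a name via (?P<name>...)
named_pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<tld>[A-Z]{2,4})"
)

# The resulting DataFrame has the columns user, domain and tld
parts = addresses.str.extract(named_pattern, flags=re.IGNORECASE)
```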

More vectorised pandas string methods:

Method	Description
`cat`	concatenates strings element by element with optional delimiter
`contains`	returns a boolean array indicating whether each string contains a pattern/regex
`count`	counts occurrences of the pattern
`extract`	uses a regular expression with groups to extract one or more strings from a set of strings; the result is a DataFrame with one column per group
`endswith`	equivalent to `x.endswith(pattern)` for each element
`startswith`	equivalent to `x.startswith(pattern)` for each element
`findall`	computes list of all occurrences of pattern/regex for each string
`get`	indexes into each element (retrieves the `i`-th element)
`isalnum`	Equivalent to built-in `str.isalnum`
`isalpha`	Equivalent to built-in `str.isalpha`
`isdecimal`	Equivalent to built-in `str.isdecimal`
`isdigit`	Equivalent to built-in `str.isdigit`
`islower`	Equivalent to built-in `str.islower`
`isnumeric`	Equivalent to built-in `str.isnumeric`
`isupper`	Equivalent to built-in `str.isupper`
`join`	joins strings in each element of the series with the passed separator character
`len`	calculates the length of each string
`lower`, `upper`	converts case; equivalent to `x.lower()` or `x.upper()` for each element
`match`	uses `re.match` with the passed regular expression for each element, returning `True` or `False` depending on whether it matches
`pad`	inserts spaces on the left, right or both sides of strings
`center`	Equivalent to `pad(side='both')`
`repeat`	duplicates values (for example, `s.str.repeat(3)` is equivalent to `x * 3` for each string)
`replace`	replaces occurrences of a pattern/regex with another string
`slice`	slices each string in the series
`split`	splits strings using delimiters or regular expressions
`strip`	removes whitespace, including line breaks, from both sides
`rstrip`	removes whitespace from the right side
`lstrip`	removes whitespace from the left side
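Several of these methods are typically chained when cleaning up a column. The following is a minimal sketch with made-up input data; raw and the other variable names are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical messy input with stray whitespace, mixed case and a missing value
raw = pd.Series(["  Veit.Schiele@Cusy.IO \n", "INFO@cusy.io", np.nan])

# strip removes surrounding whitespace including the line break, lower normalises case
clean = raw.str.strip().str.lower()

# split returns a list per element, get picks the part after the @
domains = clean.str.split("@").str.get(1)

# len counts the characters of each cleaned string
lengths = clean.str.len()

# replace with regex=False substitutes a literal string
readable = clean.str.replace("@", " at ", regex=False)

# startswith returns a boolean per element, NA for missing values
is_info = clean.str.startswith("info")
```

Missing values are passed through at every step, so no separate NA handling is needed in the chain.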