Manipulation of strings

pandas lets you apply Python’s string methods and regular expressions concisely to whole arrays of data.

Vectorised string functions in pandas

Cleaning up a cluttered dataset for analysis often requires a lot of string manipulation. To make matters worse, a column containing strings sometimes has missing data:

[1]:
import numpy as np
import pandas as pd


addresses = {
    "Veit": np.nan,
    "Veit Schiele": "veit.schiele@cusy.io",
    "cusy GmbH": "info@cusy.io",
}
addresses = pd.Series(addresses)

addresses
[1]:
Veit                             NaN
Veit Schiele    veit.schiele@cusy.io
cusy GmbH               info@cusy.io
dtype: object
[2]:
addresses.isna()
[2]:
Veit             True
Veit Schiele    False
cusy GmbH       False
dtype: bool

You can apply string and regular expression methods to each value with Series.map by passing a lambda or another function, but this fails for NA values. To deal with this, Series has array-oriented string methods that skip NA values and pass them through to the result. They are accessed via the str attribute of the Series; for example, str.contains checks whether each email address contains veit:

[3]:
addresses.str.contains("veit")
[3]:
Veit              NaN
Veit Schiele     True
cusy GmbH       False
dtype: object
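
The difference between map and the str accessor can be sketched as follows. This is a minimal illustration that rebuilds the addresses Series from above; the names map_failed, mask and veit_addresses are introduced only for this example, and na=False is one possible way to obtain a clean boolean mask.

```python
import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# A plain Python string test applied with map fails on the missing value,
# because the missing value is not a string and cannot be searched
try:
    addresses.map(lambda x: "veit" in x)
    map_failed = False
except TypeError:
    map_failed = True

# The str accessor handles the missing value; na=False replaces it with False
# so that the result can be used directly as a boolean mask
mask = addresses.str.contains("veit", na=False)
veit_addresses = addresses[mask]
```

With na=False the result contains only True and False, so it can be used for boolean indexing without an extra fillna step.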

Regular expressions can also be used, together with re flags such as re.IGNORECASE:

[4]:
import re


pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]

matches
[4]:
Veit                                 NaN
Veit Schiele    (veit.schiele, cusy, io)
cusy GmbH               (info, cusy, io)
dtype: object
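
To see why .str[0] is applied in the cell above, it may help to look at the intermediate result of str.findall. The following sketch assumes the same addresses and pattern; the Series two_addresses with its example.org addresses is hypothetical and only serves to show an element with more than one match.

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# findall returns a list of matches for each element; because the pattern
# has three capture groups, every match is a tuple of three strings
all_matches = addresses.str.findall(pattern, flags=re.IGNORECASE)

# .str[0] selects the first match from each list and keeps missing values
first_match = all_matches.str[0]

# Hypothetical element containing two addresses: the list has two tuples
two_addresses = pd.Series(["a@example.org, b@example.org"])
both = two_addresses.str.findall(pattern, flags=re.IGNORECASE)[0]
```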

There are two ways to retrieve an element from each entry in a vectorised way: use str.get or index into the str attribute:

[5]:
matches.str.get(1)
[5]:
Veit             NaN
Veit Schiele    cusy
cusy GmbH       cusy
dtype: object
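
As a small sketch of the two access styles, the following example recomputes matches from the addresses and compares str.get with indexing. The variable names by_get, by_index, last and out_of_range are chosen for illustration only.

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]

# Both spellings select the second element of each tuple
by_get = matches.str.get(1)
by_index = matches.str[1]

# Negative positions count from the end of each tuple
last = matches.str.get(-1)

# A position outside the tuple yields a missing value instead of an error
out_of_range = matches.str.get(5)
```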

Similarly, you can also slice strings with this syntax:

[6]:
addresses.str[:5]
[6]:
Veit              NaN
Veit Schiele    veit.
cusy GmbH       info@
dtype: object
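
Slicing by position works when the interesting part has a fixed length. For email addresses, splitting on the @ sign is often more useful. The following sketch shows both variants on the addresses from above; local_part and domain are names introduced for this example.

```python
import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# Index syntax and str.slice express the same positional slice
first_five = addresses.str[:5]
same_slice = addresses.str.slice(stop=5)

# Splitting on "@" gives a list per element; .str[0] and .str[-1]
# then pick the part before and after the separator
local_part = addresses.str.split("@").str[0]
domain = addresses.str.split("@").str[-1]
```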

The pandas.Series.str.extract method returns the captured groups of a regular expression as a DataFrame:

[7]:
addresses.str.extract(pattern, flags=re.IGNORECASE)
[7]:
                         0     1    2
Veit                   NaN   NaN  NaN
Veit Schiele  veit.schiele  cusy   io
cusy GmbH             info  cusy   io
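
If the numbered columns are hard to read, named groups in the regular expression can provide column names. The following is a sketch under the assumption that the group names user, domain and suffix are acceptable labels; they are not part of the original pattern.

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# Same pattern as above, but every group has a (hypothetical) name
named_pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
)

# The group names become the column names of the resulting DataFrame
parts = addresses.str.extract(named_pattern, flags=re.IGNORECASE)
```

Rows without a match, such as the missing address, contain only NaN.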

More vectorised pandas string methods:

| Method | Description |
| --- | --- |
| cat | concatenates strings element by element with an optional delimiter |
| contains | returns a boolean array indicating whether each string contains a pattern/regex |
| count | counts occurrences of the pattern |
| extract | uses a regular expression with groups to extract one or more strings from a Series of strings; the result is a DataFrame with one column per group |
| endswith | equivalent to x.endswith(pattern) for each element |
| startswith | equivalent to x.startswith(pattern) for each element |
| findall | computes a list of all occurrences of the pattern/regex for each string |
| get | retrieves the i-th element of each entry |
| isalnum | equivalent to built-in str.isalnum |
| isalpha | equivalent to built-in str.isalpha |
| isdecimal | equivalent to built-in str.isdecimal |
| isdigit | equivalent to built-in str.isdigit |
| islower | equivalent to built-in str.islower |
| isnumeric | equivalent to built-in str.isnumeric |
| isupper | equivalent to built-in str.isupper |
| join | joins the strings in each element of the Series with the passed separator |
| len | calculates the length of each string |
| lower, upper | converts case; equivalent to x.lower() or x.upper() for each element |
| match | uses re.match with the passed regular expression on each element and returns True or False depending on whether it matches |
| pad | adds whitespace on the left, right or both sides of strings |
| center | equivalent to pad(side='both') |
| repeat | duplicates values (for example, s.str.repeat(3) is equivalent to x * 3 for each string) |
| replace | replaces occurrences of a pattern/regex with another string |
| slice | slices each string in the Series |
| split | splits strings on a delimiter or regular expression |
| strip | trims whitespace, including line breaks, on both sides |
| rstrip | trims whitespace on the right side |
| lstrip | trims whitespace on the left side |
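
Several of these methods can be chained because each one returns a new Series. The following sketch cleans up a small, hypothetical Series of badly formatted addresses; the data in raw and all variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with stray whitespace, mixed case and a missing value
raw = pd.Series(["  Veit.Schiele@CUSY.io\n", "INFO@cusy.io ", np.nan])

# strip removes surrounding whitespace including the line break,
# lower normalises the case; the missing value is passed through
cleaned = raw.str.strip().str.lower()

# len counts the characters of each cleaned string
lengths = cleaned.str.len()

# replace with regex=False substitutes a literal string
readable = cleaned.str.replace("@", " at ", regex=False)

# startswith with na=False gives a boolean mask without missing values
is_info = cleaned.str.startswith("info", na=False)
```

Because the missing value is carried along in every step, the cleaned Series keeps the same length and index as the raw data.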