Manipulation of strings
pandas lets you apply Python’s string methods and regular expressions concisely to whole arrays of data.
Vectorised string functions in pandas
Cleaning up a cluttered dataset for analysis often requires a lot of string manipulation. To make matters worse, a column containing strings sometimes has missing data:
[1]:
import numpy as np
import pandas as pd
addresses = {
"Veit": np.nan,
"Veit Schiele": "veit.schiele@cusy.io",
"cusy GmbH": "info@cusy.io",
}
addresses = pd.Series(addresses)
addresses
[1]:
Veit NaN
Veit Schiele veit.schiele@cusy.io
cusy GmbH info@cusy.io
dtype: object
[2]:
addresses.isna()
[2]:
Veit True
Veit Schiele False
cusy GmbH False
dtype: bool
You can apply string and regular expression methods to each value by passing a lambda or other function to Series.map, but this fails for NA values. To deal with this, Series has array-oriented methods for string operations that skip NA values and pass them through unchanged. These are accessed via the Series’ str attribute; for example, we could use str.contains to check whether each email address contains veit:
[3]:
addresses.str.contains("veit")
[3]:
Veit NaN
Veit Schiele True
cusy GmbH False
dtype: object
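As a minimal sketch of the difference described above, the following compares a plain lambda passed to map with the str accessor. The names map_failed, mapped, mask and selected are only illustrative; na_action="ignore" and na=False are existing pandas parameters that are not used elsewhere in this section.

```python
import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# A plain function receives the float NaN and raises a TypeError
try:
    addresses.map(lambda x: "veit" in x)
    map_failed = False
except TypeError:
    map_failed = True

# na_action="ignore" tells map to leave missing values untouched
mapped = addresses.map(lambda x: "veit" in x, na_action="ignore")

# na=False replaces missing results, so the output can be used as a boolean mask
mask = addresses.str.contains("veit", na=False)
selected = addresses[mask]
```

Filling missing results with False is a design choice: it treats an unknown address as a non-match, which is usually what you want when filtering but hides the fact that the value was missing.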
Regular expressions can also be used, along with options such as re.IGNORECASE:
[4]:
import re
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches
[4]:
Veit NaN
Veit Schiele (veit.schiele, cusy, io)
cusy GmbH (info, cusy, io)
dtype: object
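To illustrate what the flag changes, the sketch below validates each address against the same pattern with and without re.IGNORECASE. It assumes that str.fullmatch (available in recent pandas versions) is an acceptable way to test whole strings; the names strict and relaxed are illustrative.

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# The character classes only list upper-case letters,
# so lower-case addresses do not match without the flag
strict = addresses.str.fullmatch(pattern, na=False)

# With re.IGNORECASE the same pattern accepts lower-case letters as well
relaxed = addresses.str.fullmatch(pattern, flags=re.IGNORECASE, na=False)
```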
There are several ways to retrieve an element from each entry in a vectorised way: either use str.get or index into the str attribute:
[5]:
matches.str.get(1)
[5]:
Veit NaN
Veit Schiele cusy
cusy GmbH cusy
dtype: object
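The two access styles can be compared directly. This is a small sketch rather than part of the original notebook; by_get, by_index and out_of_range are illustrative names.

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
matches = addresses.str.findall(pattern, flags=re.IGNORECASE).str[0]

# Both forms pick the second item (the domain) from each tuple
by_get = matches.str.get(1)
by_index = matches.str[1]

# A position that does not exist yields a missing value instead of an error
out_of_range = matches.str.get(5)
```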
Similarly, you can slice strings with this syntax:
[6]:
addresses.str[:5]
[6]:
Veit NaN
Veit Schiele veit.
cusy GmbH info@
dtype: object
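The slice syntax has a method counterpart, and a fixed position is often less useful than splitting at a delimiter. The following sketch shows both; first_five, same and local_part are illustrative names.

```python
import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# Slicing via the index syntax and via the str.slice method is equivalent
first_five = addresses.str[:5]
same = addresses.str.slice(stop=5)

# Splitting at "@" and taking the first piece yields the local part of each address
local_part = addresses.str.split("@").str[0]
```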
The pandas.Series.str.extract method returns the captured groups of a regular expression as a DataFrame:
[7]:
addresses.str.extract(pattern, flags=re.IGNORECASE)
[7]:
| | 0 | 1 | 2 |
|---|---|---|---|
| Veit | NaN | NaN | NaN |
| Veit Schiele | veit.schiele | cusy | io |
| cusy GmbH | info | cusy | io |
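If the numbered columns are hard to read, named groups in the regular expression become the column names of the resulting DataFrame. The sketch below assumes the group names user, domain and suffix, which are chosen here for illustration and do not appear in the original pattern.

```python
import re

import numpy as np
import pandas as pd

addresses = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    }
)

# Same pattern as before, but each group carries a (hypothetical) name
named_pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
)

# The group names are used as column labels
parts = addresses.str.extract(named_pattern, flags=re.IGNORECASE)
```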
More vectorised pandas string methods:
Method | Description |
---|---|
cat | concatenates strings element by element with an optional delimiter |
contains | returns a boolean array indicating whether each string contains a pattern/regex |
count | counts occurrences of the pattern |
extract | uses a regular expression with groups to extract one or more strings from a set of strings; the result is a DataFrame with one column per group |
endswith | equivalent to x.endswith(pattern) for each element |
startswith | equivalent to x.startswith(pattern) for each element |
findall | computes a list of all occurrences of a pattern/regex for each string |
get | indexes into each element (gets the i-th element) |
isalnum | equivalent to built-in str.isalnum |
isalpha | equivalent to built-in str.isalpha |
isdecimal | equivalent to built-in str.isdecimal |
isdigit | equivalent to built-in str.isdigit |
islower | equivalent to built-in str.islower |
isnumeric | equivalent to built-in str.isnumeric |
isupper | equivalent to built-in str.isupper |
join | joins the strings in each element of the Series with the passed separator |
len | calculates the length of each string |
lower, upper | converts case; equivalent to x.lower() or x.upper() for each element |
match | uses re.match with the passed regular expression on each element and returns True or False depending on whether it matches |
pad | adds whitespace on the left, right or both sides of strings |
center | equivalent to pad(side="both") |
repeat | duplicates values (for example, s.str.repeat(3) is equivalent to x * 3 for each string) |
replace | replaces occurrences of a pattern/regex with another string |
slice | slices each string in the Series |
split | splits strings using delimiters or regular expressions |
strip | strips whitespace, including line breaks, from both sides |
rstrip | strips whitespace from the right side |
lstrip | strips whitespace from the left side |
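A few of these methods are typically chained when cleaning a column. The following is a minimal sketch on a made-up Series; raw and the other variable names are hypothetical and only serve to show how strip, lower, len, replace, split and startswith behave, including on a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: stray whitespace, mixed case and a missing value
raw = pd.Series(["  Veit.Schiele@cusy.io ", "INFO@cusy.io\n", np.nan])

# Remove surrounding whitespace (including the line break) and normalise case
cleaned = raw.str.strip().str.lower()

# Length of each cleaned string; the missing value stays missing
lengths = cleaned.str.len()

# Literal (non-regex) replacement removes the domain
users = cleaned.str.replace("@cusy.io", "", regex=False)

# expand=True returns the pieces as DataFrame columns
parts = cleaned.str.split("@", expand=True)

# Element-wise equivalent of str.startswith
is_info = cleaned.str.startswith("info")
```

Every step passes the missing value through unchanged, so the chain does not need a separate check for NA.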