Select and filter data#

Indexing series (obj[...]) works analogously to indexing NumPy arrays, except that you can use index values of the series instead of just integers. Here are some examples:

[1]:

import numpy as np
import pandas as pd

[2]:

idx = pd.date_range("2022-02-02", periods=7)
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7), index=idx)

[3]:

s

[3]:

2022-02-02    0.002127
2022-02-03    1.655759
2022-02-04   -1.552128
2022-02-05   -1.581026
2022-02-06   -0.992316
2022-02-07    1.490786
2022-02-08   -1.542455
Freq: D, dtype: float64

[4]:

s["2022-02-03"]

[4]:

1.655759430268265

[5]:

s[1]

[5]:

1.655759430268265

[6]:

 s[2:4]

[6]:

2022-02-04   -1.552128
2022-02-05   -1.581026
Freq: D, dtype: float64

[7]:

s[["2022-02-04", "2022-02-03", "2022-02-02"]]

[7]:

2022-02-04   -1.552128
2022-02-03    1.655759
2022-02-02    0.002127
dtype: float64

[8]:

s[[1, 3]]

[8]:

2022-02-03    1.655759
2022-02-05   -1.581026
Freq: 2D, dtype: float64

[9]:

s[s > 0]

[9]:

2022-02-02    0.002127
2022-02-03    1.655759
2022-02-07    1.490786
dtype: float64


While you can select data by label in this way, the preferred method for selecting index values is the loc operator:

[10]:

s.loc[["2022-02-04", "2022-02-03", "2022-02-02"]]

[10]:

2022-02-04   -1.552128
2022-02-03    1.655759
2022-02-02    0.002127
dtype: float64


The reason for the preference for loc is the different treatment of integers when indexing with []. In regular []-based indexing, integers are treated as labels if the index contains integers, so the behaviour varies depending on the data type of the index. In our example, the expression s.loc[[3, 2, 1]] will fail because the index does not contain integers:

[11]:

s.loc[[3, 2, 1]]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 s.loc[[3, 2, 1]]

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexing.py:1103, in _LocationIndexer.__getitem__(self, key)
1100 axis = self.axis or 0
1102 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1103 return self._getitem_axis(maybe_callable, axis=axis)

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexing.py:1332, in _LocIndexer._getitem_axis(self, key, axis)
1329     if hasattr(key, "ndim") and key.ndim > 1:
1330         raise ValueError("Cannot index with multidimensional key")
-> 1332     return self._getitem_iterable(key, axis=axis)
1334 # nested tuple slicing
1335 if is_nested_tuple(key, labels):

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexing.py:1272, in _LocIndexer._getitem_iterable(self, key, axis)
1269 self._validate_key(key, axis)
1271 # A collection of keys
-> 1272 keyarr, indexer = self._get_listlike_indexer(key, axis)
1273 return self.obj._reindex_with_indexers(
1274     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
1275 )

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexing.py:1462, in _LocIndexer._get_listlike_indexer(self, key, axis)
1459 ax = self.obj._get_axis(axis)
1460 axis_name = self.obj._get_axis_name(axis)
-> 1462 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
1464 return keyarr, indexer

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexes/base.py:5877, in Index._get_indexer_strict(self, key, axis_name)
5874 else:
5875     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5877 self._raise_if_missing(keyarr, indexer, axis_name)
5879 keyarr = self.take(indexer)
5880 if isinstance(key, Index):
5881     # GH 42790 - Preserve name from an Index

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexes/base.py:5938, in Index._raise_if_missing(self, key, indexer, axis_name)
5936     if use_interval_msg:
5937         key = list(key)
-> 5938     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
5941 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([3, 2, 1], dtype='int64')] are in the [index]"


While the loc operator exclusively indexes labels, the iloc operator exclusively indexes with integers:

[12]:

s.iloc[[3, 2, 1]]

[12]:

2022-02-05   -1.581026
2022-02-04   -1.552128
2022-02-03    1.655759
Freq: -1D, dtype: float64


You can also slice with labels, but this works differently from normal Python slicing because the endpoint is included:

[13]:

s.loc["2022-02-03":"2022-02-04"]

[13]:

2022-02-03    1.655759
2022-02-04   -1.552128
Freq: D, dtype: float64


Setting with these methods changes the corresponding section of the row:

[14]:

s.loc["2022-02-03":"2022-02-04"] = 0

s

[14]:

2022-02-02    0.002127
2022-02-03    0.000000
2022-02-04    0.000000
2022-02-05   -1.581026
2022-02-06   -0.992316
2022-02-07    1.490786
2022-02-08   -1.542455
Freq: D, dtype: float64


Indexing in a DataFrame is used to retrieve one or more columns with either a single value or a sequence:

[15]:

data = {
"Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
"Decimal": [0, 1, 2, 3, 4, 5],
"Octal": ["001", "002", "003", "004", "004", "005"],
"Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
}

df = pd.DataFrame(data)
df = pd.DataFrame(data, columns=["Decimal", "Octal", "Key"], index=df["Code"])

df

[15]:

Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B
U+0003 3 004 Ctrl-C
U+0004 4 004 Ctrl-D
U+0005 5 005 Ctrl-E
[16]:

df["Key"]

[16]:

Code
U+0000       NUL
U+0001    Ctrl-A
U+0002    Ctrl-B
U+0003    Ctrl-C
U+0004    Ctrl-D
U+0005    Ctrl-E
Name: Key, dtype: object

[17]:

df[["Decimal", "Key"]]

[17]:

Decimal Key
Code
U+0000 0 NUL
U+0001 1 Ctrl-A
U+0002 2 Ctrl-B
U+0003 3 Ctrl-C
U+0004 4 Ctrl-D
U+0005 5 Ctrl-E
[18]:

df[:2]

[18]:

Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
[19]:

df[df["Decimal"] > 2]

[19]:

Decimal Octal Key
Code
U+0003 3 004 Ctrl-C
U+0004 4 004 Ctrl-D
U+0005 5 005 Ctrl-E

The line selection syntax df[:2] is provided for convenience. Passing a single item or a list to the [] operator selects columns.

Another use case is indexing with a Boolean DataFrame, which is generated by a scalar comparison, for example:

[19]:

df["Decimal"] > 2

[19]:

Code
U+0000    False
U+0001    False
U+0002    False
U+0003     True
U+0004     True
U+0005     True
Name: Decimal, dtype: bool

[20]:

df[df["Decimal"] > 2] = "NA"

df

[20]:

Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B
U+0003 NA NA NA
U+0004 NA NA NA
U+0005 NA NA NA

Like Series, DataFrame has special operators loc and iloc for label-based and integer indexing respectively. Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (loc) or integers (iloc).

[21]:

df.loc["U+0002", ["Decimal", "Key"]]

[21]:

Decimal         2
Key        Ctrl-B
Name: U+0002, dtype: object

[22]:

df.iloc[[2], [1, 2]]

[22]:

Octal Key
Code
U+0002 003 Ctrl-B
[23]:

df.iloc[[0, 1], [1, 2]]

[23]:

Octal Key
Code
U+0000 001 NUL
U+0001 002 Ctrl-A

Both indexing functions work with slices in addition to individual labels or lists of labels:

[24]:

df.loc[:"U+0003", "Key"]

[24]:

Code
U+0000       NUL
U+0001    Ctrl-A
U+0002    Ctrl-B
U+0003        NA
Name: Key, dtype: object

[25]:

df.iloc[:3, :3]

[25]:

Decimal Octal Key
Code
U+0000 0 001 NUL
U+0001 1 002 Ctrl-A
U+0002 2 003 Ctrl-B

So there are many ways to select and rearrange the data contained in a pandas object. In the following, I put together a brief summary of most of these possibilities for DataFrames:

Type

Note

df[LABEL]

selects a single column or a sequence of columns from the DataFrame

df.loc[LABEL]

selects a single row or a subset of rows from the DataFrame by label

df.loc[:, LABEL]

selects a single column or a subset of columns from the DataFrame by Label

df.loc[LABEL1, LABEL2]

selects both rows and columns by label

df.iloc[INTEGER]

selects a single row or a subset of rows from the DataFrame by integer position

df.iloc[INTEGER1, INTEGER2]

selects a single column or a subset of columns by integer position

df.at[LABEL1, LABEL2]

selects a single value by row and column label

df.iat[INTEGER1, INTEGER2]

selects a scalar value by row and column position (integers)

reindex NEW_INDEX

selects rows or columns by label

get_value, set_value

deprecated since version 0.21.0: use .at[] or .iat[] instead.