Selecting and filtering data

Indexing series (obj[...]) works similarly to indexing NumPy arrays, except that you can use index values of the series instead of just integers. Here are some examples:

[1]:
import numpy as np
import pandas as pd
[2]:
idx = pd.date_range("2022-02-02", periods=7)
s = pd.Series(np.random.randn(7), index=idx)
[3]:
s
[3]:
2022-02-02   -1.740150
2022-02-03    0.784229
2022-02-04    0.359960
2022-02-05   -0.308411
2022-02-06   -0.376975
2022-02-07   -0.644732
2022-02-08   -0.477871
Freq: D, dtype: float64
[4]:
s["2022-02-03"]
[4]:
np.float64(0.7842287488698951)
[5]:
s.iloc[1]
[5]:
np.float64(0.7842287488698951)
[6]:
s[2:4]
[6]:
2022-02-04    0.359960
2022-02-05   -0.308411
Freq: D, dtype: float64
[7]:
s[["2022-02-04", "2022-02-03", "2022-02-02"]]
[7]:
2022-02-04    0.359960
2022-02-03    0.784229
2022-02-02   -1.740150
dtype: float64
[8]:
s.iloc[[1, 3]]
[8]:
2022-02-03    0.784229
2022-02-05   -0.308411
Freq: 2D, dtype: float64
[9]:
s[s > 0]
[9]:
2022-02-03    0.784229
2022-02-04    0.359960
Freq: D, dtype: float64

Although [] lets you select data by label, the preferred way to select by index label is the loc operator:

[10]:
s.loc[["2022-02-04", "2022-02-03", "2022-02-02"]]
[10]:
2022-02-04    0.359960
2022-02-03    0.784229
2022-02-02   -1.740150
dtype: float64

The reason for preferring loc lies in how [] treats integers. In regular []-based indexing, integers are treated as labels if the index contains integers, so the behaviour depends on the data type of the index. loc, by contrast, always interprets its argument as labels. In our example, the expression s.loc[[3, 2, 1]] fails because the index does not contain integers:

[11]:
s.loc[[3, 2, 1]]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 s.loc[[3, 2, 1]]

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
   1417     if hasattr(key, "ndim") and key.ndim > 1:
   1418         raise ValueError("Cannot index with multidimensional key")
-> 1420     return self._getitem_iterable(key, axis=axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
   1357 self._validate_key(key, axis)
   1359 # A collection of keys
-> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1361 return self.obj._reindex_with_indexers(
   1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1363 )

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6249, in Index._raise_if_missing(self, key, indexer, axis_name)
   6247 if nmissing:
   6248     if nmissing == len(indexer):
-> 6249         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6252     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([3, 2, 1], dtype='int64')] are in the [index]"
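
The different treatment of integers can be sketched with a small series whose index does contain integers. The series `s_int` and its values are made up for illustration and are not part of the example above:

```python
import pandas as pd

# Hypothetical series with an integer index that is deliberately out of order
s_int = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

# With an integer index, [] interprets the integer as a label ...
via_brackets = s_int[0]

# ... just as loc does
via_loc = s_int.loc[0]

# iloc ignores the labels and always counts positions
via_iloc = s_int.iloc[0]

print(via_brackets, via_loc, via_iloc)
```

Here [] and loc both return the value stored under the label 0, while iloc returns the first element. Spelling out loc or iloc makes the intent visible regardless of the index type.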

While the loc operator indexes exclusively by label, the iloc operator indexes exclusively by integer position:

[12]:
s.iloc[[3, 2, 1]]
[12]:
2022-02-05   -0.308411
2022-02-04    0.359960
2022-02-03    0.784229
Freq: -1D, dtype: float64

You can also slice with labels, but this works differently from normal Python slicing because the endpoint is inclusive:

[13]:
s.loc["2022-02-03":"2022-02-04"]
[13]:
2022-02-03    0.784229
2022-02-04    0.359960
Freq: D, dtype: float64
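
To make the inclusive endpoint concrete, the following sketch compares a label slice with a positional slice. It assumes a stand-in series `s_demo` with deterministic values in place of the random numbers used above:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-02-02", periods=7)
# Deterministic stand-in values (0.0 to 6.0) instead of random numbers
s_demo = pd.Series(np.arange(7.0), index=idx)

# Label slice: both endpoints are part of the result
label_slice = s_demo.loc["2022-02-03":"2022-02-05"]

# Positional slice: the stop position is excluded, as in plain Python
position_slice = s_demo.iloc[1:3]

print(len(label_slice), len(position_slice))
```

The label slice covers three days, whereas the positional slice 1:3 covers only two elements.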

Assigning values with these methods modifies the corresponding section of the series:

[14]:
s.loc["2022-02-03":"2022-02-04"] = 0

s
[14]:
2022-02-02   -1.740150
2022-02-03    0.000000
2022-02-04    0.000000
2022-02-05   -0.308411
2022-02-06   -0.376975
2022-02-07   -0.644732
2022-02-08   -0.477871
Freq: D, dtype: float64
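
Assignment works with the other selection forms as well. The following is a minimal sketch on a made-up series `s_demo`. It takes a copy first because the assignment changes the series in place:

```python
import pandas as pd

# Hypothetical series, not part of the example above
s_demo = pd.Series([-1.5, 0.5, 2.0, -0.25], index=["a", "b", "c", "d"])

# Keep an independent copy, since the assignments below modify s_demo in place
backup = s_demo.copy()

# Assign through a Boolean mask: every negative value becomes 0.0
s_demo.loc[s_demo < 0] = 0.0

# Assign by position: overwrite the last element
s_demo.iloc[-1] = 9.0

print(s_demo.tolist())
print(backup.tolist())
```

The copy keeps the original values, so the change can be compared or undone.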

Indexing into a DataFrame retrieves one or more columns, using either a single value or a sequence:

[15]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Decimal": [0, 1, 2, 3, 4, 5],
    "Octal": ["001", "002", "003", "004", "004", "005"],
    "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
}

# Build the DataFrame and use the "Code" column as the row index
df = pd.DataFrame(data).set_index("Code")

df
[15]:
        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
U+0002        2    003  Ctrl-B
U+0003        3    004  Ctrl-C
U+0004        4    004  Ctrl-D
U+0005        5    005  Ctrl-E
[16]:
df["Key"]
[16]:
Code
U+0000       NUL
U+0001    Ctrl-A
U+0002    Ctrl-B
U+0003    Ctrl-C
U+0004    Ctrl-D
U+0005    Ctrl-E
Name: Key, dtype: object
[17]:
df[["Decimal", "Key"]]
[17]:
        Decimal     Key
Code
U+0000        0     NUL
U+0001        1  Ctrl-A
U+0002        2  Ctrl-B
U+0003        3  Ctrl-C
U+0004        4  Ctrl-D
U+0005        5  Ctrl-E

The row selection syntax df[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.

[18]:
df[:2]
[18]:
        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
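
Because [] selects rows for a slice but columns for a label, the explicit positional form can be easier to read. A minimal sketch on a reduced stand-in frame `df_demo` (an assumption, not the full table above):

```python
import pandas as pd

# Reduced stand-in for the DataFrame used above
df_demo = pd.DataFrame(
    {"Decimal": [0, 1, 2], "Key": ["NUL", "Ctrl-A", "Ctrl-B"]},
    index=pd.Index(["U+0000", "U+0001", "U+0002"], name="Code"),
)

# A slice inside [] selects rows ...
rows_by_slice = df_demo[:2]

# ... and is equivalent to the explicit positional form
rows_by_iloc = df_demo.iloc[:2]

# A single label inside [] selects a column
column = df_demo["Key"]

print(rows_by_slice.equals(rows_by_iloc))
```

Both row selections return the same two rows, so iloc can be used wherever the shorthand might be mistaken for a column lookup.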

Another use case is indexing with a Boolean Series, such as one produced by a scalar comparison:

[19]:
df["Decimal"] > 2
[19]:
Code
U+0000    False
U+0001    False
U+0002    False
U+0003     True
U+0004     True
U+0005     True
Name: Decimal, dtype: bool
[20]:
df[df["Decimal"] > 2]
[20]:
        Decimal  Octal     Key
Code
U+0003        3    004  Ctrl-C
U+0004        4    004  Ctrl-D
U+0005        5    005  Ctrl-E

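You can also combine such Boolean Series using the bitwise operators & (and) and | (or). Each comparison must be wrapped in parentheses because these operators bind more tightly than comparisons: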

[21]:
df[(df["Decimal"] > 2) & (df["Decimal"] < 5)]
[21]:
        Decimal  Octal     Key
Code
U+0003        3    004  Ctrl-C
U+0004        4    004  Ctrl-D
[22]:
df[(df["Decimal"] < 3) | (df["Decimal"] > 4)]
[22]:
        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
U+0002        2    003  Ctrl-B
U+0005        5    005  Ctrl-E
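
A Boolean mask can also be stored in a variable, inverted with ~, and passed to loc together with a column label. The following sketch assumes a reduced stand-in frame `df_demo` with only two of the columns:

```python
import pandas as pd

# Reduced stand-in for the DataFrame used above
df_demo = pd.DataFrame(
    {
        "Decimal": [0, 1, 2, 3, 4, 5],
        "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
    },
    index=pd.Index([f"U+000{i}" for i in range(6)], name="Code"),
)

# Store the combined condition once so that it can be reused
mask = (df_demo["Decimal"] > 2) & (df_demo["Decimal"] < 5)

# ~ inverts the mask: all rows that do NOT satisfy the condition
inverted = df_demo[~mask]

# loc accepts the mask for the rows and a label for the column
keys = df_demo.loc[mask, "Key"]

print(list(inverted.index))
print(keys.tolist())
```

Naming the mask keeps longer conditions readable and avoids evaluating the same comparison twice.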

Like Series, DataFrame also has special operators loc and iloc for label-based and integer indexing, respectively. Since DataFrame is two-dimensional, you can select a subset of rows and columns using NumPy-like notation by using either axis labels (loc) or integers (iloc).

[23]:
df.loc["U+0002", ["Decimal", "Key"]]
[23]:
Decimal         2
Key        Ctrl-B
Name: U+0002, dtype: object
[24]:
df.iloc[[2], [1, 2]]
[24]:
        Octal     Key
Code
U+0002    003  Ctrl-B
[25]:
df.iloc[[0, 1], [1, 2]]
[25]:
        Octal     Key
Code
U+0000    001     NUL
U+0001    002  Ctrl-A
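
The examples above return different types depending on whether a scalar or a list is passed as the row selector. A short sketch of this, again on an assumed stand-in frame `df_demo`:

```python
import pandas as pd

# Reduced stand-in for the DataFrame used above
df_demo = pd.DataFrame(
    {"Decimal": [0, 1, 2], "Key": ["NUL", "Ctrl-A", "Ctrl-B"]},
    index=pd.Index(["U+0000", "U+0001", "U+0002"], name="Code"),
)

# Scalar row label + list of columns -> Series (the row is reduced away)
as_series = df_demo.loc["U+0002", ["Decimal", "Key"]]

# List with one row label + list of columns -> DataFrame with one row
as_frame = df_demo.loc[["U+0002"], ["Decimal", "Key"]]

# Scalar row label + scalar column label -> single value
as_scalar = df_demo.loc["U+0002", "Key"]

print(type(as_series).__name__, type(as_frame).__name__, as_scalar)
```

Wrapping a single label in a list is a simple way to keep the result two-dimensional when later code expects a DataFrame.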

Both indexers also accept slices, in addition to single labels or positions and lists of them:

[26]:
df.loc[:"U+0003", "Key"]
[26]:
Code
U+0000       NUL
U+0001    Ctrl-A
U+0002    Ctrl-B
U+0003    Ctrl-C
Name: Key, dtype: object
[27]:
df.iloc[:3, :3]
[27]:
        Decimal  Octal     Key
Code
U+0000        0    001     NUL
U+0001        1    002  Ctrl-A
U+0002        2    003  Ctrl-B

There are many ways to select and rearrange the data contained in a pandas object. Below is a brief summary of most of these options for DataFrames:

Type                          Note
----------------------------  ------------------------------------------------------------
df[LABEL]                     selects a single column or a sequence of columns from the DataFrame
df.loc[LABEL]                 selects a single row or a subset of rows from the DataFrame by label
df.loc[:, LABEL]              selects a single column or a subset of columns from the DataFrame by label
df.loc[LABEL1, LABEL2]        selects both rows and columns by label
df.iloc[INTEGER]              selects a single row or a subset of rows from the DataFrame by integer position
df.iloc[:, INTEGER]           selects a single column or a subset of columns by integer position
df.iloc[INTEGER1, INTEGER2]   selects both rows and columns by integer position
df.at[LABEL1, LABEL2]         selects a single scalar value by row and column label
df.iat[INTEGER1, INTEGER2]    selects a single scalar value by row and column position (integers)
df.reindex(NEW_INDEX)         selects rows or columns by label
get_value, set_value          deprecated since version 0.21.0 and removed in pandas 1.0; use .at[] or .iat[] instead
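
The scalar accessors and reindex are the only entries in this summary that are not demonstrated above. The following minimal sketch illustrates them on an assumed stand-in frame `df_demo`. The label "U+0009" is deliberately absent from the index:

```python
import pandas as pd

# Reduced stand-in for the DataFrame used above
df_demo = pd.DataFrame(
    {"Decimal": [0, 1, 2], "Key": ["NUL", "Ctrl-A", "Ctrl-B"]},
    index=pd.Index(["U+0000", "U+0001", "U+0002"], name="Code"),
)

# at: single value by row label and column label
value_by_label = df_demo.at["U+0001", "Key"]

# iat: single value by row position and column position
value_by_position = df_demo.iat[1, 1]

# reindex: select and reorder rows by label; unlike loc, a label that is
# missing from the index does not raise but yields a row of missing values
reindexed = df_demo.reindex(["U+0002", "U+0000", "U+0009"])

print(value_by_label, value_by_position)
print(reindexed)
```

In contrast to loc, reindex tolerates unknown labels, which is convenient for aligning data to a fixed set of labels but can hide typos.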