Indexing¶

Index objects¶

The index objects of pandas are responsible for the axis labels and other metadata, such as the axis name. Any array or other sequence of labels you use when constructing a series or DataFrame is internally converted into an index:

[1]:

import pandas as pd

obj = pd.Series(range(7), index=pd.date_range("2022-02-02", periods=7))

[2]:

obj.index

[2]:

DatetimeIndex(['2022-02-02', '2022-02-03', '2022-02-04', '2022-02-05',
               '2022-02-06', '2022-02-07', '2022-02-08'],
              dtype='datetime64[ns]', freq='D')

[3]:

obj.index[3:]

[3]:

DatetimeIndex(['2022-02-05', '2022-02-06', '2022-02-07', '2022-02-08'], dtype='datetime64[ns]', freq='D')

Index objects are immutable and therefore cannot be changed by the user:

[4]:

obj.index[1] = "2022-02-03"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 obj.index[1] = "2022-02-03"

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:5371, in Index.__setitem__(self, key, value)
   5369 @final
   5370 def __setitem__(self, key, value) -> None:
-> 5371     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

Immutability makes the sharing of index objects in data structures more secure:

[5]:

import numpy as np

labels = pd.Index(np.arange(3))

labels

[5]:

Index([0, 1, 2], dtype='int64')

[6]:

rng = np.random.default_rng()
obj2 = pd.Series(rng.normal(size=3), index=labels)

[7]:

obj2

[7]:

0    1.473255
1   -0.203160
2   -1.430757
dtype: float64

[8]:

obj2.index is labels

[8]:

True

To be similar to an array, an index also behaves like a fixed-size set:

[9]:

data1 = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Decimal": [0, 1, 2, 3, 4, 5],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df1 = pd.DataFrame(data1)

[10]:

df1

[10]:

	Code	Decimal	Octal
0	U+0000	0	001
1	U+0001	1	002
2	U+0002	2	003
3	U+0003	3	004
4	U+0004	4	004
5	U+0005	5	005

[11]:

df1.columns

[11]:

Index(['Code', 'Decimal', 'Octal'], dtype='object')

[12]:

"Code" in df1.columns

[12]:

True

[13]:

"Key" in df1.columns

[13]:

False

Axis indices with double labels¶

Unlike Python sets, a pandas index can contain duplicate labels:

[14]:

data2 = {
    "Code": ["U+0006", "U+0007"],
    "Decimal": [6, 7],
    "Octal": ["006", "007"],
}
df2 = pd.DataFrame(data2)
df12 = pd.concat([df1, df2])

df12

[14]:

	Code	Decimal	Octal
0	U+0000	0	001
1	U+0001	1	002
2	U+0002	2	003
3	U+0003	3	004
4	U+0004	4	004
5	U+0005	5	005
0	U+0006	6	006
1	U+0007	7	007

When selecting duplicate labels, all occurrences of the label in question are selected:

[15]:

df12.loc[1]

[15]:

	Code	Decimal	Octal
1	U+0001	1	002
1	U+0007	7	007

[16]:

df12.loc[2]

[16]:

Code       U+0002
Decimal         2
Octal         003
Name: 2, dtype: object

Data selection is one of the main points that behaves differently with duplicates. Indexing a label with multiple entries results in a series, while single entries result in a scalar value. This can complicate your code because the output type of indexing can vary depending on whether a label is repeated or not. In addition, many pandas functions, such as reindex, require labels to be unique. You can use the is_unique property of the index to determine whether its labels are unique or not:

[17]:

df12.index.is_unique

[17]:

False

To avoid duplicate labels, you can use ignore_index=True, for example:

[18]:

df12 = pd.concat([df1, df2], ignore_index=True)

df12

[18]:

	Code	Decimal	Octal
0	U+0000	0	001
1	U+0001	1	002
2	U+0002	2	003
3	U+0003	3	004
4	U+0004	4	004
5	U+0005	5	005
6	U+0006	6	006
7	U+0007	7	007

Some index methods and properties¶

Each index has a number of set logic methods and properties that answer other general questions about the data it contains. The following are some useful methods and properties:

Method	Description
concat	concatenates additional index objects, creating a new index
Index.difference	calculates the difference of two sets as an index
Index.intersection	calculates the intersection
Index.union	calculates the union set
Index.isin	computes a boolean array indicating whether each value is contained in the passed collection
Index.delete	computes a new index by deleting the element in index `i`
Index.drop	computes a new index by deleting the passed values
Index.insert	insert computes new index by inserting the element in index `i`
Index.is_monotonic_increasing	is_monotonic returns `True` if each element is greater than or equal to the previous element
Index.is_unique	is_unique returns `True` if the index does not contain duplicate values
Index.unique	calculates the array of unique values in the index

Re-indexing with `Index.reindex`¶

An important method for Pandas objects is Index.reindex, which can be used to create a new object with rearranged values that match the new index. Consider, for example:

[19]:

obj = pd.Series(range(7), index=pd.date_range("2022-02-02", periods=7))

[20]:

obj

[20]:

2022-02-02    0
2022-02-03    1
2022-02-04    2
2022-02-05    3
2022-02-06    4
2022-02-07    5
2022-02-08    6
Freq: D, dtype: int64

[21]:

new_index = pd.date_range("2022-02-03", periods=7)

[22]:

obj.reindex(new_index)

[22]:

2022-02-03    1.0
2022-02-04    2.0
2022-02-05    3.0
2022-02-06    4.0
2022-02-07    5.0
2022-02-08    6.0
2022-02-09    NaN
Freq: D, dtype: float64

Index.reindex creates a new index and re-indexes the DataFrame. By default, values in the new index for which there are no corresponding records in the DataFrame become NaN.

For ordered data such as time series, it may be desirable to interpolate or fill values during reindexing. The method option allows this with a method like ffill that fills the values forward:

[23]:

obj.reindex(new_index, method="ffill")

[23]:

2022-02-03    1
2022-02-04    2
2022-02-05    3
2022-02-06    4
2022-02-07    5
2022-02-08    6
2022-02-09    6
Freq: D, dtype: int64

For a DataFrame, reindex can change either the (row) index, the columns or both. If only a sequence is passed, the rows in the result are re-indexed:

[24]:

df1.reindex(range(7))

[24]:

	Code	Decimal	Octal
0	U+0000	0.0	001
1	U+0001	1.0	002
2	U+0002	2.0	003
3	U+0003	3.0	004
4	U+0004	4.0	004
5	U+0005	5.0	005
6	NaN	NaN	NaN

The columns can be re-indexed with the keyword columns:

[25]:

encoding = ["Octal", "Code", "Description"]

df1.reindex(columns=encoding)

[25]:

	Octal	Code	Description
0	001	U+0000	NaN
1	002	U+0001	NaN
2	003	U+0002	NaN
3	004	U+0003	NaN
4	004	U+0004	NaN
5	005	U+0005	NaN

Arguments of the function `Index.reindex`¶

Argument	Description
`labels`	New sequence to be used as index. Can be an index instance or another sequence-like Python data structure. An index is used exactly as it is, without being copied.
`axis`	The new axis to index, either `index` (rows) or `columns`. The default is `index`. You can alternatively use `reindex(index=new_labels)` or `reindex(columns=new_labels)`.
`method`	Interpolation method; `ffill` fills forwards, while `bfill` fills backwards.
`fill_value`	Substitute value to be used when missing data is inserted by re-indexing. Uses `fill_value='missing'` (the default behaviour) if the missing labels in the result are to have zero values.
`limit`	When filling forward or backward, the maximum number of elements to fill.
`tolerance`	When filling forward or backward, the maximum size of the gap to be filled for inexact matches.
`level`	Match single index at `MultiIndex` level; otherwise select subset.
`copy`	If `True`, the underlying data is always copied, even if the new index matches the old index; if `False`, the data is not copied if the indices are equivalent.

Rename axis indices¶

The axis labels can be converted by a function or mapping to create new, differently labelled objects. You can also change the axes in place without creating a new data structure. Here is a simple example:

[26]:

df3 = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["Deutsch", "English", "Français"],
    columns=[1, 2, 3, 4],
)

df3

[26]:

	1	2	3	4
Deutsch	0	1	2	3
English	4	5	6	7
Français	8	9	10	11

Rename axis indices with `Index.map`¶

The axis labels can be converted by a function or Index.map to create new, differently labeled objects. You can also change the axes in place without creating a new data structure. Here is a simple example:

[27]:

def transform(x):
    return x[:2].upper()


df3.index.map(transform)

[27]:

Index(['DE', 'EN', 'FR'], dtype='object')

You can assign the index and change the DataFrame on the spot:

[28]:

df3.index = df3.index.map(transform)

df3

[28]:

	1	2	3	4
DE	0	1	2	3
EN	4	5	6	7
FR	8	9	10	11

Rename axis indices with `Index.rename`¶

If you want to create a converted version of your dataset without changing the original, you can use Index.rename:

[29]:

df3.rename(index=str.lower)

[29]:

	1	2	3	4
de	0	1	2	3
en	4	5	6	7
fr	8	9	10	11

In particular, Index.rename can be used in conjunction with a dict-like object that provides new values for a subset of the axis labels:

[30]:

df3.rename(
    index={"DE": "BE", "EN": "DE", "FR": "EN"},
    columns={1: 0, 2: 1, 3: 2, 4: 3},
    inplace=True,
)

df3

[30]:

	0	1	2	3
BE	0	1	2	3
DE	4	5	6	7
EN	8	9	10	11

Index.rename saves you from manually copying the DataFrame and assigning its index and column attributes. If you want to change a data set on the spot, also pass inplace=True:

[31]:

df3.rename(
    index={"DE": "BE", "EN": "DE", "FR": "EN"},
    columns={1: 0, 2: 1, 3: 2, 4: 3},
    inplace=True,
)

df3

[31]:

	0	0	1	2
BE	0	1	2	3
BE	4	5	6	7
DE	8	9	10	11

Hierarchical Indexing¶

Hierarchical indexing is an important feature of pandas that allows you to have multiple index levels on one axis. This gives you the opportunity to work with higher dimensional data in a lower dimensional form.

Let’s start with a simple example: Let’s create a series of lists as an index:

[32]:

hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "de", "en"],
    ],
)

hits

[32]:

Jupyter Tutorial  de    83080
                  en    20336
PyViz Tutorial    de    11376
Python Basics     de     1228
                  en      468
dtype: int64

What you see is a graphical view of a series with a pandas.MultiIndex. The gaps in the index display mean that the label above it is to be used.

[33]:

hits.index

[33]:

MultiIndex([('Jupyter Tutorial', 'de'),
            ('Jupyter Tutorial', 'en'),
            (  'PyViz Tutorial', 'de'),
            (   'Python Basics', 'de'),
            (   'Python Basics', 'en')],
           )

With a hierarchically indexed object, so-called partial indexing is possible, with which you can select subsets of the data:

[34]:

hits["Jupyter Tutorial"]

[34]:

de    83080
en    20336
dtype: int64

[35]:

hits["Jupyter Tutorial":"Python Basics"]

[35]:

Jupyter Tutorial  de    83080
                  en    20336
PyViz Tutorial    de    11376
Python Basics     de     1228
                  en      468
dtype: int64

[36]:

hits.loc[["Jupyter Tutorial", "Python Basics"]]

[36]:

Jupyter Tutorial  de    83080
                  en    20336
Python Basics     de     1228
                  en      468
dtype: int64

The selection is even possible from an inner level. In the following I select all values with the value 1 from the second index level:

[37]:

hits.loc[:, "de"]

[37]:

Jupyter Tutorial    83080
PyViz Tutorial      11376
Python Basics        1228
dtype: int64

View vs. copy¶

In Pandas, whether you get a view or not depends on the structure and data types of the original DataFrame – and whether changes made to a view are propagated back to the original DataFrame.

**See also:** * [Returning a view versus a copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) * [Views and Copies in pandas](https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html)

`stack` and `unstack`¶

Hierarchical indexing plays an important role in data reshaping and group-based operations such as forming a pivot table. For example, you can reorder this data into a DataFrame using the pandas.Series.unstack method:

[38]:

hits.unstack()

[38]:

	de	en
Jupyter Tutorial	83080.0	20336.0
PyViz Tutorial	11376.0	NaN
Python Basics	1228.0	468.0

The reverse operation of unstack is stack:

[39]:

hits.unstack().stack()

[39]:

Jupyter Tutorial  de    83080.0
                  en    20336.0
PyViz Tutorial    de    11376.0
Python Basics     de     1228.0
                  en      468.0
dtype: float64

In a DataFrame, each axis can have a hierarchical index:

[40]:

version_hits = [
    [19651, 0, 30134, 0, 33295, 0],
    [4722, 1825, 3497, 2576, 4009, 3707],
    [2573, 0, 4873, 0, 3930, 0],
    [525, 0, 427, 0, 276, 0],
    [157, 0, 85, 0, 226, 0],
]

df = pd.DataFrame(
    version_hits,
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "de", "en"],
    ],
    columns=[
        ["12/2021", "12/2021", "01/2022", "01/2022", "02/2022", "02/2022"],
        ["latest", "stable", "latest", "stable", "latest", "stable"],
    ],
)

df

[40]:

		12/2021		01/2022		02/2022
		latest	stable	latest	stable	latest	stable
Jupyter Tutorial	de	19651	0	30134	0	33295	0
Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
PyViz Tutorial	de	2573	0	4873	0	3930	0
Python Basics	de	525	0	427	0	276	0
Python Basics	en	157	0	85	0	226	0

The hierarchy levels can have names (as strings or any Python objects). If this is the case, they are displayed in the console output:

[41]:

df.index.names = ["Title", "Language"]
df.columns.names = ["Month", "Version"]

df

[41]:

	Month	12/2021		01/2022		02/2022
	Version	latest	stable	latest	stable	latest	stable
Title	Language
Jupyter Tutorial	de	19651	0	30134	0	33295	0
Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
PyViz Tutorial	de	2573	0	4873	0	3930	0
Python Basics	de	525	0	427	0	276	0
Python Basics	en	157	0	85	0	226	0

Warning:

Make sure that the index names Month and Version are not part of the row names (of the df.index values).

With the partial column indexing you can select column groups in a similar way:

[42]:

df["12/2021"]

[42]:

	Version	latest	stable
Title	Language
Jupyter Tutorial	de	19651	0
Jupyter Tutorial	en	4722	1825
PyViz Tutorial	de	2573	0
Python Basics	de	525	0
Python Basics	en	157	0

With MultiIndex.from_arrays, a MultiIndex can be created itself and then reused; the columns in the preceding DataFrame with level names could be created in this way:

[43]:

pd.MultiIndex.from_arrays(
    [
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "de", "en"],
    ],
    names=["Title", "Language"],
)

[43]:

MultiIndex([('Jupyter Tutorial', 'de'),
            ('Jupyter Tutorial', 'en'),
            (  'PyViz Tutorial', 'de'),
            (   'Python Basics', 'de'),
            (   'Python Basics', 'en')],
           names=['Title', 'Language'])

Rearranging and Sorting Levels¶

There may be times when you want to rearrange the order of the levels on an axis or sort the data by the values in a particular level. The function DataFrame.swaplevel takes two level numbers or names and returns a new object in which the levels are swapped (but the data remains unchanged):

[44]:

df.swaplevel("Language", "Title")

[44]:

	Month	12/2021		01/2022		02/2022
	Version	latest	stable	latest	stable	latest	stable
Language	Title
de	Jupyter Tutorial	19651	0	30134	0	33295	0
en	Jupyter Tutorial	4722	1825	3497	2576	4009	3707
de	PyViz Tutorial	2573	0	4873	0	3930	0
de	Python Basics	525	0	427	0	276	0
en	Python Basics	157	0	85	0	226	0

DataFrame.sort_index, on the other hand, sorts the data only by the values in a single level. When swapping levels, it is not uncommon to also use sort_index so that the result is lexicographically sorted by the specified level:

[45]:

df.sort_index(level=0)

[45]:

	Month	12/2021		01/2022		02/2022
	Version	latest	stable	latest	stable	latest	stable
Title	Language
Jupyter Tutorial	de	19651	0	30134	0	33295	0
Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
PyViz Tutorial	de	2573	0	4873	0	3930	0
Python Basics	de	525	0	427	0	276	0
Python Basics	en	157	0	85	0	226	0

However, the PyViz Tutorial will now be sorted before the Python Basics, as all upper case letters appear before lower case letters in this sorting. To avoid this, you can use the following lambda function:

[46]:

df.sort_index(level=0, key=lambda x: x.str.lower())

[46]:

	Month	12/2021		01/2022		02/2022
	Version	latest	stable	latest	stable	latest	stable
Title	Language
Jupyter Tutorial	de	19651	0	30134	0	33295	0
Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
Python Basics	de	525	0	427	0	276	0
Python Basics	en	157	0	85	0	226	0
PyViz Tutorial	de	2573	0	4873	0	3930	0

[47]:

df.swaplevel(0, 1).sort_index(level=0)

[47]:

	Month	12/2021		01/2022		02/2022
	Version	latest	stable	latest	stable	latest	stable
Language	Title
de	Jupyter Tutorial	19651	0	30134	0	33295	0
	PyViz Tutorial	2573	0	4873	0	3930	0
	Python Basics	525	0	427	0	276	0
en	Jupyter Tutorial	4722	1825	3497	2576	4009	3707
en	Python Basics	157	0	85	0	226	0

Note:

Data selection performance is much better for hierarchically indexed objects if the index is sorted lexicographically, starting with the outermost level, i.e. the result of calling sort_index(level=0) or sort_index().

Summary statistics by level¶

Many descriptive and summary statistics for DataFrame and Series have a level option that allows you to specify the level by which you can aggregate on a particular axis. Consider the DataFrame above; we can aggregate either the rows or the columns by level as follows:

[48]:

df.groupby(level="Language").sum()

[48]:

Month	12/2021		01/2022		02/2022
Version	latest	stable	latest	stable	latest	stable
Language
de	22749	0	35434	0	37501	0
en	4879	1825	3582	2576	4235	3707

[49]:

dft = df.T.groupby(level="Month").sum()

dft.T

[49]:

	Month	01/2022	02/2022	12/2021
Title	Language
Jupyter Tutorial	de	30134	33295	19651
Jupyter Tutorial	en	6073	7716	6547
PyViz Tutorial	de	4873	3930	2573
Python Basics	de	427	276	525
Python Basics	en	85	226	157

Internally, pandas’ DataFrame.groupby machinery is used for this purpose, which is explained in more detail in Group Operations.

Indexing with the columns of a DataFrame¶

It is not uncommon to use one or more columns of a DataFrame as a row index; alternatively, you can move the row index into the columns of the DataFrame. Here is an example DataFrame:

[50]:

data = [
    ["Jupyter Tutorial", "de", 19651, 0, 30134, 0, 33295, 0],
    ["Jupyter Tutorial", "en", 4722, 1825, 3497, 2576, 4009, 3707],
    ["PyViz Tutorial", "de", 2573, 0, 4873, 0, 3930, 0],
    ["Python Basics", "de", 525, 0, 427, 0, 276, 0],
    ["Python Basics", "en", 157, 0, 85, 0, 226, 0],
]

df = pd.DataFrame(data)

df

[50]:

	0	1	2	3	4	5	6	7
0	Jupyter Tutorial	de	19651	0	30134	0	33295	0
1	Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
2	PyViz Tutorial	de	2573	0	4873	0	3930	0
3	Python Basics	de	525	0	427	0	276	0
4	Python Basics	en	157	0	85	0	226	0

The function pandas.DataFrame.set_index creates a new DataFrame that uses one or more of its columns as an index:

[51]:

df2 = df.set_index([0, 1])

df2

[51]:

		2	3	4	5	6	7
0	1
Jupyter Tutorial	de	19651	0	30134	0	33295	0
Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
PyViz Tutorial	de	2573	0	4873	0	3930	0
Python Basics	de	525	0	427	0	276	0
Python Basics	en	157	0	85	0	226	0

By default, the columns are removed from the DataFrame, but you can also leave them in by passing drop=False to set_index:

[52]:

df.set_index([0, 1], drop=False)

[52]:

		0	1	2	3	4	5	6	7
0	1
Jupyter Tutorial	de	Jupyter Tutorial	de	19651	0	30134	0	33295	0
Jupyter Tutorial	en	Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
PyViz Tutorial	de	PyViz Tutorial	de	2573	0	4873	0	3930	0
Python Basics	de	Python Basics	de	525	0	427	0	276	0
Python Basics	en	Python Basics	en	157	0	85	0	226	0

DataFrame.reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

[53]:

df2.reset_index()

[53]:

	0	1	2	3	4	5	6	7
0	Jupyter Tutorial	de	19651	0	30134	0	33295	0
1	Jupyter Tutorial	en	4722	1825	3497	2576	4009	3707
2	PyViz Tutorial	de	2573	0	4873	0	3930	0
3	Python Basics	de	525	0	427	0	276	0
4	Python Basics	en	157	0	85	0	226	0