Indexing

Index objects

The index objects of pandas are responsible for the axis labels and other metadata, such as the axis name. Any array or other sequence of labels you use when constructing a series or DataFrame is internally converted into an index:

[1]:
import pandas as pd


obj = pd.Series(range(7), index=pd.date_range("2022-02-02", periods=7))
[2]:
obj.index
[2]:
DatetimeIndex(['2022-02-02', '2022-02-03', '2022-02-04', '2022-02-05',
               '2022-02-06', '2022-02-07', '2022-02-08'],
              dtype='datetime64[ns]', freq='D')
[3]:
obj.index[3:]
[3]:
DatetimeIndex(['2022-02-05', '2022-02-06', '2022-02-07', '2022-02-08'], dtype='datetime64[ns]', freq='D')

Index objects are immutable and therefore cannot be changed by the user:

[4]:
obj.index[1] = "2022-02-03"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 obj.index[1] = "2022-02-03"

File ~/.local/share/virtualenvs/python-311-6zxVKbDJ/lib/python3.11/site-packages/pandas/core/indexes/base.py:5157, in Index.__setitem__(self, key, value)
   5155 @final
   5156 def __setitem__(self, key, value):
-> 5157     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

Immutability makes the sharing of index objects in data structures more secure:

[5]:
import numpy as np


labels = pd.Index(np.arange(3))

labels
[5]:
Index([0, 1, 2], dtype='int64')
[6]:
rng = np.random.default_rng()
obj2 = pd.Series(rng.normal(size=3),index=labels)
[7]:
obj2
[7]:
0    0.515353
1    1.153708
2   -1.776476
dtype: float64
[8]:
obj2.index is labels
[8]:
True

To be similar to an array, an index also behaves like a fixed-size set:

[9]:
data1 = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Decimal": [0, 1, 2, 3, 4, 5],
    "Octal": ["001", "002", "003", "004", "004", "005"],
}
df1 = pd.DataFrame(data1)
[10]:
df1
[10]:
Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
[11]:
df1.columns
[11]:
Index(['Code', 'Decimal', 'Octal'], dtype='object')
[12]:
"Code" in df1.columns
[12]:
True
[13]:
"Key" in df1.columns
[13]:
False

Axis indices with double labels

Unlike Python sets, a pandas index can contain duplicate labels:

[14]:
data2 = {
    "Code": ["U+0006", "U+0007"],
    "Decimal": [6, 7],
    "Octal": ["006", "007"],
}
df2 = pd.DataFrame(data2)
df12 = pd.concat([df1, df2])

df12
[14]:
Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
0 U+0006 6 006
1 U+0007 7 007

When selecting duplicate labels, all occurrences of the label in question are selected:

[15]:
df12.loc[1]
[15]:
Code Decimal Octal
1 U+0001 1 002
1 U+0007 7 007
[16]:
df12.loc[2]
[16]:
Code       U+0002
Decimal         2
Octal         003
Name: 2, dtype: object

Data selection is one of the main points that behaves differently with duplicates. Indexing a label with multiple entries results in a series, while single entries result in a scalar value. This can complicate your code because the output type of indexing can vary depending on whether a label is repeated or not. In addition, many pandas functions, such as reindex, require labels to be unique. You can use the is_unique property of the index to determine whether its labels are unique or not:

[17]:
df12.index.is_unique
[17]:
False

To avoid duplicate labels, you can use ignore_index=True, for example:

[18]:
df12 = pd.concat([df1, df2], ignore_index=True)

df12
[18]:
Code Decimal Octal
0 U+0000 0 001
1 U+0001 1 002
2 U+0002 2 003
3 U+0003 3 004
4 U+0004 4 004
5 U+0005 5 005
6 U+0006 6 006
7 U+0007 7 007

Some index methods and properties

Each index has a number of set logic methods and properties that answer other general questions about the data it contains. The following are some useful methods and properties:

Method

Description

concat

concatenates additional index objects, creating a new index

Index.difference

calculates the difference of two sets as an index

Index.intersection

calculates the intersection

Index.union

calculates the union set

Index.isin

computes a boolean array indicating whether each value is contained in the passed collection

Index.delete

computes a new index by deleting the element in index i

Index.drop

computes a new index by deleting the passed values

Index.insert

insert computes new index by inserting the element in index i

Index.is_monotonic_increasing

is_monotonic returns True if each element is greater than or equal to the previous element

Index.is_unique

is_unique returns True if the index does not contain duplicate values

Index.unique

calculates the array of unique values in the index

Re-indexing with Index.reindex

An important method for Pandas objects is Index.reindex, which can be used to create a new object with rearranged values that match the new index. Consider, for example:

[19]:
obj = pd.Series(range(7), index=pd.date_range("2022-02-02", periods=7))
[20]:
obj
[20]:
2022-02-02    0
2022-02-03    1
2022-02-04    2
2022-02-05    3
2022-02-06    4
2022-02-07    5
2022-02-08    6
Freq: D, dtype: int64
[21]:
new_index = pd.date_range("2022-02-03", periods=7)
[22]:
obj.reindex(new_index)
[22]:
2022-02-03    1.0
2022-02-04    2.0
2022-02-05    3.0
2022-02-06    4.0
2022-02-07    5.0
2022-02-08    6.0
2022-02-09    NaN
Freq: D, dtype: float64

Index.reindex creates a new index and re-indexes the DataFrame. By default, values in the new index for which there are no corresponding records in the DataFrame become NaN.

For ordered data such as time series, it may be desirable to interpolate or fill values during reindexing. The method option allows this with a method like ffill that fills the values forward:

[23]:
obj.reindex(new_index, method="ffill")
[23]:
2022-02-03    1
2022-02-04    2
2022-02-05    3
2022-02-06    4
2022-02-07    5
2022-02-08    6
2022-02-09    6
Freq: D, dtype: int64

For a DataFrame, reindex can change either the (row) index, the columns or both. If only a sequence is passed, the rows in the result are re-indexed:

[24]:
df1.reindex(range(7))
[24]:
Code Decimal Octal
0 U+0000 0.0 001
1 U+0001 1.0 002
2 U+0002 2.0 003
3 U+0003 3.0 004
4 U+0004 4.0 004
5 U+0005 5.0 005
6 NaN NaN NaN

The columns can be re-indexed with the keyword columns:

[25]:
encoding = ["Octal", "Code", "Description"]

df1.reindex(columns=encoding)
[25]:
Octal Code Description
0 001 U+0000 NaN
1 002 U+0001 NaN
2 003 U+0002 NaN
3 004 U+0003 NaN
4 004 U+0004 NaN
5 005 U+0005 NaN

Arguments of the function Index.reindex

Argument

Description

labels

New sequence to be used as index. Can be an index instance or another sequence-like Python data structure. An index is used exactly as it is, without being copied.

axis

The new axis to index, either index (rows) or columns. The default is index. You can alternatively use reindex(index=new_labels) or reindex(columns=new_labels).

method

Interpolation method; ffill fills forwards, while bfill fills backwards.

fill_value

Substitute value to be used when missing data is inserted by re-indexing. Uses fill_value='missing' (the default behaviour) if the missing labels in the result are to have zero values.

limit

When filling forward or backward, the maximum number of elements to fill.

tolerance

When filling forward or backward, the maximum size of the gap to be filled for inexact matches.

level

Match single index at MultiIndex level; otherwise select subset.

copy

If True, the underlying data is always copied, even if the new index matches the old index; if False, the data is not copied if the indices are equivalent.

Rename axis indices

The axis labels can be converted by a function or mapping to create new, differently labelled objects. You can also change the axes in place without creating a new data structure. Here is a simple example:

[26]:
df3 = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["Deutsch", "English", "Français"],
    columns=[1, 2, 3, 4],
)

df3
[26]:
1 2 3 4
Deutsch 0 1 2 3
English 4 5 6 7
Français 8 9 10 11

Rename axis indices with Index.map

The axis labels can be converted by a function or Index.map to create new, differently labeled objects. You can also change the axes in place without creating a new data structure. Here is a simple example:

[27]:
transform = lambda x: x[:2].upper()

df3.index.map(transform)
[27]:
Index(['DE', 'EN', 'FR'], dtype='object')

You can assign the index and change the DataFrame on the spot:

[28]:
df3.index = df3.index.map(transform)

df3
[28]:
1 2 3 4
DE 0 1 2 3
EN 4 5 6 7
FR 8 9 10 11

Rename axis indices with Index.rename

If you want to create a converted version of your dataset without changing the original, you can use Index.rename:

[29]:
df3.rename(index=str.lower)
[29]:
1 2 3 4
de 0 1 2 3
en 4 5 6 7
fr 8 9 10 11

In particular, Index.rename can be used in conjunction with a dict-like object that provides new values for a subset of the axis labels:

[30]:
df3.rename(
    index={"DE": "BE", "EN": "DE", "FR": "EN"},
    columns={1: 0, 2: 1, 3: 2, 4: 3},
    inplace=True,
)

df3
[30]:
0 1 2 3
BE 0 1 2 3
DE 4 5 6 7
EN 8 9 10 11

Index.rename saves you from manually copying the DataFrame and assigning its index and column attributes. If you want to change a data set on the spot, also pass inplace=True:

[31]:
df3.rename(
    index={"DE": "BE", "EN": "DE", "FR": "EN"},
    columns={1: 0, 2: 1, 3: 2, 4: 3},
    inplace=True,
)

df3
[31]:
0 0 1 2
BE 0 1 2 3
BE 4 5 6 7
DE 8 9 10 11

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that allows you to have multiple index levels on one axis. This gives you the opportunity to work with higher dimensional data in a lower dimensional form.

Let’s start with a simple example: Let’s create a series of lists as an index:

[32]:
hits = pd.Series(
    [83080, 20336, 11376, 1228, 468],
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "de", "en"],
    ],
)

hits
[32]:
Jupyter Tutorial  de    83080
                  en    20336
PyViz Tutorial    de    11376
Python Basics     de     1228
                  en      468
dtype: int64

What you see is a graphical view of a series with a pandas.MultiIndex. The gaps in the index display mean that the label above it is to be used.

[33]:
hits.index
[33]:
MultiIndex([('Jupyter Tutorial', 'de'),
            ('Jupyter Tutorial', 'en'),
            (  'PyViz Tutorial', 'de'),
            (   'Python Basics', 'de'),
            (   'Python Basics', 'en')],
           )

With a hierarchically indexed object, so-called partial indexing is possible, with which you can select subsets of the data:

[34]:
hits["Jupyter Tutorial"]
[34]:
de    83080
en    20336
dtype: int64
[35]:
hits["Jupyter Tutorial":"Python Basics"]
[35]:
Jupyter Tutorial  de    83080
                  en    20336
PyViz Tutorial    de    11376
Python Basics     de     1228
                  en      468
dtype: int64
[36]:
hits.loc[["Jupyter Tutorial", "Python Basics"]]
[36]:
Jupyter Tutorial  de    83080
                  en    20336
Python Basics     de     1228
                  en      468
dtype: int64

The selection is even possible from an inner level. In the following I select all values with the value 1 from the second index level:

[37]:
hits.loc[:, "de"]
[37]:
Jupyter Tutorial    83080
PyViz Tutorial      11376
Python Basics        1228
dtype: int64

View vs. copy

In Pandas, whether you get a view or not depends on the structure and data types of the original DataFrame – and whether changes made to a view are propagated back to the original DataFrame.

**See also:** * [Returning a view versus a copy](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) * [Views and Copies in pandas](https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html)

stack and unstack

Hierarchical indexing plays an important role in data reshaping and group-based operations such as forming a pivot table. For example, you can reorder this data into a DataFrame using the pandas.Series.unstack method:

[38]:
hits.unstack()
[38]:
de en
Jupyter Tutorial 83080.0 20336.0
PyViz Tutorial 11376.0 NaN
Python Basics 1228.0 468.0

The reverse operation of unstack is stack:

[39]:
hits.unstack().stack()
[39]:
Jupyter Tutorial  de    83080.0
                  en    20336.0
PyViz Tutorial    de    11376.0
Python Basics     de     1228.0
                  en      468.0
dtype: float64

In a DataFrame, each axis can have a hierarchical index:

[40]:
version_hits = [
    [19651, 0, 30134, 0, 33295, 0],
    [4722, 1825, 3497, 2576, 4009, 3707],
    [2573, 0, 4873, 0, 3930, 0],
    [525, 0, 427, 0, 276, 0],
    [157, 0, 85, 0, 226, 0],
]

df = pd.DataFrame(
    version_hits,
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "de", "en"],
    ],
    columns=[
        ["12/2021", "12/2021", "01/2022", "01/2022", "02/2022", "02/2022"],
        ["latest", "stable", "latest", "stable", "latest", "stable"],
    ],
)

df
[40]:
12/2021 01/2022 02/2022
latest stable latest stable latest stable
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

The hierarchy levels can have names (as strings or any Python objects). If this is the case, they are displayed in the console output:

[41]:
df.index.names = ["Title", "Language"]
df.columns.names = ["Month", "Version"]

df
[41]:
Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

Warning:

Make sure that the index names Month and Version are not part of the row names (of the df.index values).

With the partial column indexing you can select column groups in a similar way:

[42]:
df["12/2021"]
[42]:
Version latest stable
Title Language
Jupyter Tutorial de 19651 0
en 4722 1825
PyViz Tutorial de 2573 0
Python Basics de 525 0
en 157 0

With MultiIndex.from_arrays, a MultiIndex can be created itself and then reused; the columns in the preceding DataFrame with level names could be created in this way:

[43]:
pd.MultiIndex.from_arrays(
    [
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "de", "en"],
    ],
    names=["Title", "Language"],
)
[43]:
MultiIndex([('Jupyter Tutorial', 'de'),
            ('Jupyter Tutorial', 'en'),
            (  'PyViz Tutorial', 'de'),
            (   'Python Basics', 'de'),
            (   'Python Basics', 'en')],
           names=['Title', 'Language'])

Rearranging and Sorting Levels

There may be times when you want to rearrange the order of the levels on an axis or sort the data by the values in a particular level. The function DataFrame.swaplevel takes two level numbers or names and returns a new object in which the levels are swapped (but the data remains unchanged):

[44]:
df.swaplevel("Language", "Title")
[44]:
Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language Title
de Jupyter Tutorial 19651 0 30134 0 33295 0
en Jupyter Tutorial 4722 1825 3497 2576 4009 3707
de PyViz Tutorial 2573 0 4873 0 3930 0
Python Basics 525 0 427 0 276 0
en Python Basics 157 0 85 0 226 0

DataFrame.sort_index, on the other hand, sorts the data only by the values in a single level. When swapping levels, it is not uncommon to also use sort_index so that the result is lexicographically sorted by the specified level:

[45]:
df.sort_index(level=0)
[45]:
Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

However, the PyViz Tutorial will now be sorted before the Python Basics, as all upper case letters appear before lower case letters in this sorting. To avoid this, you can use the following lambda function:

[46]:
df.sort_index(level=0, key=lambda x: x.str.lower())
[46]:
Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Title Language
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0
PyViz Tutorial de 2573 0 4873 0 3930 0
[47]:
df.swaplevel(0, 1).sort_index(level=0)
[47]:
Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language Title
de Jupyter Tutorial 19651 0 30134 0 33295 0
PyViz Tutorial 2573 0 4873 0 3930 0
Python Basics 525 0 427 0 276 0
en Jupyter Tutorial 4722 1825 3497 2576 4009 3707
Python Basics 157 0 85 0 226 0

Note:

Data selection performance is much better for hierarchically indexed objects if the index is sorted lexicographically, starting with the outermost level, i.e. the result of calling sort_index(level=0) or sort_index().

Summary statistics by level

Many descriptive and summary statistics for DataFrame and Series have a level option that allows you to specify the level by which you can aggregate on a particular axis. Consider the DataFrame above; we can aggregate either the rows or the columns by level as follows:

[48]:
df.groupby(level="Language").sum()
[48]:
Month 12/2021 01/2022 02/2022
Version latest stable latest stable latest stable
Language
de 22749 0 35434 0 37501 0
en 4879 1825 3582 2576 4235 3707
[49]:
df.groupby(level="Month", axis=1).sum()
[49]:
Month 01/2022 02/2022 12/2021
Title Language
Jupyter Tutorial de 30134 33295 19651
en 6073 7716 6547
PyViz Tutorial de 4873 3930 2573
Python Basics de 427 276 525
en 85 226 157

Internally, pandas’ DataFrame.groupby machinery is used for this purpose, which is explained in more detail in Group Operations.

Indexing with the columns of a DataFrame

It is not uncommon to use one or more columns of a DataFrame as a row index; alternatively, you can move the row index into the columns of the DataFrame. Here is an example DataFrame:

[50]:
data = [
    ["Jupyter Tutorial", "de", 19651, 0, 30134, 0, 33295, 0],
    ["Jupyter Tutorial", "en", 4722, 1825, 3497, 2576, 4009, 3707],
    ["PyViz Tutorial", "de", 2573, 0, 4873, 0, 3930, 0],
    ["Python Basics", "de", 525, 0, 427, 0, 276, 0],
    ["Python Basics", "en", 157, 0, 85, 0, 226, 0],
]

df = pd.DataFrame(data)

df
[50]:
0 1 2 3 4 5 6 7
0 Jupyter Tutorial de 19651 0 30134 0 33295 0
1 Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
2 PyViz Tutorial de 2573 0 4873 0 3930 0
3 Python Basics de 525 0 427 0 276 0
4 Python Basics en 157 0 85 0 226 0

The function pandas.DataFrame.set_index creates a new DataFrame that uses one or more of its columns as an index:

[51]:
df2 = df.set_index([0, 1])

df2
[51]:
2 3 4 5 6 7
0 1
Jupyter Tutorial de 19651 0 30134 0 33295 0
en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de 525 0 427 0 276 0
en 157 0 85 0 226 0

By default, the columns are removed from the DataFrame, but you can also leave them in by passing drop=False to set_index:

[52]:
df.set_index([0, 1], drop=False)
[52]:
0 1 2 3 4 5 6 7
0 1
Jupyter Tutorial de Jupyter Tutorial de 19651 0 30134 0 33295 0
en Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
PyViz Tutorial de PyViz Tutorial de 2573 0 4873 0 3930 0
Python Basics de Python Basics de 525 0 427 0 276 0
en Python Basics en 157 0 85 0 226 0

DataFrame.reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

[53]:
df2.reset_index()
[53]:
0 1 2 3 4 5 6 7
0 Jupyter Tutorial de 19651 0 30134 0 33295 0
1 Jupyter Tutorial en 4722 1825 3497 2576 4009 3707
2 PyViz Tutorial de 2573 0 4873 0 3930 0
3 Python Basics de 525 0 427 0 276 0
4 Python Basics en 157 0 85 0 226 0