Introduction to the data structures of pandas

To get started with pandas, you should first familiarise yourself with the two most important data structures Series and DataFrame.

Series

A series is a one-dimensional array-like object containing a sequence of values (of similar types to the NumPy types) and an associated array of data labels called an index. The simplest series is formed from just an array of data:

[1]:
import numpy as np
import pandas as pd
[2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s
[2]:
0    0.415497
1    3.102087
2   -0.332863
3   -0.135429
4    0.471112
5    0.173483
6   -0.487151
dtype: float64

The string representation of an interactively displayed series shows the index on the left and the values on the right. Since we have not specified an index for the data, a default index is created consisting of the integers 0 to N - 1 (where N is the length of the data). You can get the array representation and the index object of the series via their pandas.Series.array and pandas.Series.index attributes respectively:

[3]:
s.array
[3]:
<PandasArray>
[  0.4154969051865909,    3.102087203833539,  -0.3328632996406089,
 -0.13542859429409687,   0.4711123318607415,   0.1734826179409076,
 -0.48715121240065956]
Length: 7, dtype: float64
[4]:
s.index
[4]:
RangeIndex(start=0, stop=7, step=1)

Often you will want to create an index that identifies each data point with a label:

[5]:
idx = pd.date_range("2022-01-31", periods=7)

s2 = pd.Series(rng.normal(size=7), index=idx)
[6]:
s2
[6]:
2022-01-31    0.434474
2022-02-01   -1.696645
2022-02-02   -1.180240
2022-02-03   -0.205702
2022-02-04   -0.426140
2022-02-05   -0.123695
2022-02-06    1.071786
Freq: D, dtype: float64

Compared to NumPy arrays, you can use labels in the index if you want to select individual values or a group of values:

[7]:
s2["2022-02-02"]
[7]:
-1.1802398819304771
[8]:
s2[["2022-02-02", "2022-02-03", "2022-02-04"]]
[8]:
2022-02-02   -1.180240
2022-02-03   -0.205702
2022-02-04   -0.426140
dtype: float64

Here ['2022-02-02', '2022-02-03', '2022-02-04'] is interpreted as a list of indices, even if it contains strings instead of integers.

When using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication or applying mathematical functions, the link between index and value is preserved:

[9]:
s2[s2 > 0]
[9]:
2022-01-31    0.434474
2022-02-06    1.071786
dtype: float64
[10]:
s2**2
[10]:
2022-01-31    0.188768
2022-02-01    2.878604
2022-02-02    1.392966
2022-02-03    0.042313
2022-02-04    0.181595
2022-02-05    0.015301
2022-02-06    1.148725
Freq: D, dtype: float64
[11]:
np.exp(s2)
[11]:
2022-01-31    1.544151
2022-02-01    0.183297
2022-02-02    0.307205
2022-02-03    0.814076
2022-02-04    0.653025
2022-02-05    0.883649
2022-02-06    2.920591
Freq: D, dtype: float64

You can also think of a series as a fixed-length ordered dict, since it is an assignment of index values to data values. It can be used in many contexts where you could use a dict:

[12]:
"2022-02-02" in s2
[12]:
True
[13]:
"2022-02-09" in s2
[13]:
False

Missing data

I will use NA and null synonymously to indicate missing data. The functions isna and notna in pandas should be used to identify missing data:

[14]:
pd.isna(s2)
[14]:
2022-01-31    False
2022-02-01    False
2022-02-02    False
2022-02-03    False
2022-02-04    False
2022-02-05    False
2022-02-06    False
Freq: D, dtype: bool
[15]:
pd.notna(s2)
[15]:
2022-01-31    True
2022-02-01    True
2022-02-02    True
2022-02-03    True
2022-02-04    True
2022-02-05    True
2022-02-06    True
Freq: D, dtype: bool

Series also has these as instance methods:

[16]:
s2.isna()
[16]:
2022-01-31    False
2022-02-01    False
2022-02-02    False
2022-02-03    False
2022-02-04    False
2022-02-05    False
2022-02-06    False
Freq: D, dtype: bool

Dealing with missing data is discussed in more detail in the section Managing missing data with pandas.

A useful feature of Series for many applications is the automatic alignment by index labels in arithmetic operations:

[17]:
idx = pd.date_range("2022-02-01", periods=7)

s3 = pd.Series(rng.normal(size=7), index=idx)
[18]:
s2, s3
[18]:
(2022-01-31    0.434474
 2022-02-01   -1.696645
 2022-02-02   -1.180240
 2022-02-03   -0.205702
 2022-02-04   -0.426140
 2022-02-05   -0.123695
 2022-02-06    1.071786
 Freq: D, dtype: float64,
 2022-02-01   -0.105019
 2022-02-02    0.156524
 2022-02-03    0.191187
 2022-02-04    0.002915
 2022-02-05    0.274354
 2022-02-06   -0.991969
 2022-02-07   -0.087003
 Freq: D, dtype: float64)
[19]:
s2 + s3
[19]:
2022-01-31         NaN
2022-02-01   -1.801664
2022-02-02   -1.023716
2022-02-03   -0.014515
2022-02-04   -0.423225
2022-02-05    0.150659
2022-02-06    0.079817
2022-02-07         NaN
Freq: D, dtype: float64

If you have experience with SQL, this is similar to a JOIN operation.

Both the Series object itself and its index have a name attribute that can be integrated into other areas of the pandas functionality:

[20]:
s3.name = "floats"
s3.index.name = "date"

s3
[20]:
date
2022-02-01   -0.105019
2022-02-02    0.156524
2022-02-03    0.191187
2022-02-04    0.002915
2022-02-05    0.274354
2022-02-06   -0.991969
2022-02-07   -0.087003
Freq: D, Name: floats, dtype: float64

DataFrame

A DataFrame represents a rectangular data table and contains an ordered, named collection of columns, each of which can have a different value type. The DataFrame has both a row index and a column index.

Note:

Although a DataFrame is two-dimensional, you can also use it to represent higher-dimensional data in a table format with hierarchical indexing using join, combine and Reshaping.

[21]:
data = {
    "Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
    "Decimal": [0, 1, 2, 3, 4, 5],
    "Octal": ["001", "002", "003", "004", "004", "005"],
    "Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
}

df = pd.DataFrame(data)

df
[21]:
Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D
5 U+0005 5 005 Ctrl-E

For large DataFrames, the head method selects only the first five rows:

[22]:
df.head()
[22]:
Code Decimal Octal Key
0 U+0000 0 001 NUL
1 U+0001 1 002 Ctrl-A
2 U+0002 2 003 Ctrl-B
3 U+0003 3 004 Ctrl-C
4 U+0004 4 004 Ctrl-D

You can also specify columns and their order:

[23]:
pd.DataFrame(data, columns=["Code", "Key"])
[23]:
Code Key
0 U+0000 NUL
1 U+0001 Ctrl-A
2 U+0002 Ctrl-B
3 U+0003 Ctrl-C
4 U+0004 Ctrl-D
5 U+0005 Ctrl-E

If you want to pass a column that is not contained in the dict, it will appear without values in the result:

[24]:
df2 = pd.DataFrame(
    data, columns=["Code", "Decimal", "Octal", "Description", "Key"]
)

df2
[24]:
Code Decimal Octal Description Key
0 U+0000 0 001 NaN NUL
1 U+0001 1 002 NaN Ctrl-A
2 U+0002 2 003 NaN Ctrl-B
3 U+0003 3 004 NaN Ctrl-C
4 U+0004 4 004 NaN Ctrl-D
5 U+0005 5 005 NaN Ctrl-E

You can retrieve a column in a DataFrame with a dict-like notation:

[25]:
df["Code"]
[25]:
0    U+0000
1    U+0001
2    U+0002
3    U+0003
4    U+0004
5    U+0005
Name: Code, dtype: object

This way you can also make a column the index:

[26]:
df2 = pd.DataFrame(
    data, columns=["Decimal", "Octal", "Description", "Key"], index=df["Code"]
)

df2
[26]:
Decimal Octal Description Key
Code
U+0000 0 001 NaN NUL
U+0001 1 002 NaN Ctrl-A
U+0002 2 003 NaN Ctrl-B
U+0003 3 004 NaN Ctrl-C
U+0004 4 004 NaN Ctrl-D
U+0005 5 005 NaN Ctrl-E

Rows can be retrieved by position or name with the pandas.DataFrame.loc attribute:

[27]:
df2.loc["U+0001"]
[27]:
Decimal             1
Octal             002
Description       NaN
Key            Ctrl-A
Name: U+0001, dtype: object

Column values can be changed by assignment. For example, a scalar value or an array of values could be assigned to the empty Description column:

[28]:
df2["Description"] = [
    "Null character",
    "Start of Heading",
    "Start of Text",
    "End-of-text character",
    "End-of-transmission character",
    "Enquiry character",
]

df2
[28]:
Decimal Octal Description Key
Code
U+0000 0 001 Null character NUL
U+0001 1 002 Start of Heading Ctrl-A
U+0002 2 003 Start of Text Ctrl-B
U+0003 3 004 End-of-text character Ctrl-C
U+0004 4 004 End-of-transmission character Ctrl-D
U+0005 5 005 Enquiry character Ctrl-E

Assigning a non-existing column creates a new column.

Columns can be removed with pandas.DataFrame.drop and displayed with pandas.DataFrame.columns:

[29]:
df3 = df2.drop(columns=["Decimal", "Octal"])
[30]:
df2.columns
[30]:
Index(['Decimal', 'Octal', 'Description', 'Key'], dtype='object')
[31]:
df3.columns
[31]:
Index(['Description', 'Key'], dtype='object')

Another common form of data is nested dict of dicts:

[32]:
u = {
    "U+0006": {
        "Decimal": "6",
        "Octal": "006",
        "Description": "Acknowledge character",
        "Key": "Ctrl-F",
    },
    "U+0007": {
        "Decimal": "7",
        "Octal": "007",
        "Description": "Bell character",
        "Key": "Ctrl-G",
    },
}

df4 = pd.DataFrame(u)

df4
[32]:
U+0006 U+0007
Decimal 6 7
Octal 006 007
Description Acknowledge character Bell character
Key Ctrl-F Ctrl-G

You can transpose the DataFrame, i.e. swap the rows and columns, with a similar syntax to a NumPy array:

[33]:
df4.T
[33]:
Decimal Octal Description Key
U+0006 6 006 Acknowledge character Ctrl-F
U+0007 7 007 Bell character Ctrl-G

Warning:

Note that when transposing, the data types of the columns are discarded if the columns do not all have the same data type, so when transposing and then transposing back, the previous type information may be lost. In this case, the columns become arrays of pure Python objects.

The keys in the inner dicts are combined to form the index in the result. This is not the case when an explicit index is specified:

[34]:
df5 = pd.DataFrame(u, index=["Decimal", "Octal", "Key"])
df5
[34]:
U+0006 U+0007
Decimal 6 7
Octal 006 007
Key Ctrl-F Ctrl-G