Introduction to the data structures of pandas¶
To get started with pandas, you should first familiarise yourself with the two most important data structures Series and DataFrame.
Series¶
A series is a one-dimensional array-like object containing a sequence of values (of similar types to the NumPy types) and an associated array of data labels called an index. The simplest series is formed from just an array of data:
[1]:
import numpy as np
import pandas as pd
[2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s
[2]:
0 0.756390
1 -0.720152
2 -1.241521
3 0.008288
4 -1.020880
5 -0.669150
6 -0.959491
dtype: float64
The string representation of an interactively displayed series shows the index on the left and the values on the right. Since we have not specified an index for the data, a default index is created consisting of the integers 0 to N - 1 (where N is the length of the data). You can get the array representation and the index object of the series via their pandas.Series.array and
pandas.Series.index attributes respectively:
[3]:
s.array
[3]:
<NumpyExtensionArray>
[ np.float64(0.7563904474621019), np.float64(-0.720151807332795),
np.float64(-1.2415207045203973), np.float64(0.008287570553159707),
np.float64(-1.0208804470233657), np.float64(-0.6691500152824282),
np.float64(-0.9594909838421385)]
Length: 7, dtype: float64
[4]:
s.index
[4]:
RangeIndex(start=0, stop=7, step=1)
Often you will want to create an index that identifies each data point with a label:
[5]:
idx = pd.date_range("2022-01-31", periods=7)
s2 = pd.Series(rng.normal(size=7), index=idx)
[6]:
s2
[6]:
2022-01-31 0.181385
2022-02-01 0.044108
2022-02-02 -1.537827
2022-02-03 1.172051
2022-02-04 -1.792041
2022-02-05 0.335298
2022-02-06 1.388968
Freq: D, dtype: float64
See also:
Compared to NumPy arrays, you can use labels in the index if you want to select individual values or a group of values:
[7]:
s2["2022-02-02"]
[7]:
np.float64(-1.5378272926290917)
[8]:
s2[["2022-02-02", "2022-02-03", "2022-02-04"]]
[8]:
2022-02-02 -1.537827
2022-02-03 1.172051
2022-02-04 -1.792041
dtype: float64
Here ['2022-02-02', '2022-02-03', '2022-02-04'] is interpreted as a list of indices, even if it contains strings instead of integers.
When using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication or applying mathematical functions, the link between index and value is preserved:
[9]:
s2[s2 > 0]
[9]:
2022-01-31 0.181385
2022-02-01 0.044108
2022-02-03 1.172051
2022-02-05 0.335298
2022-02-06 1.388968
dtype: float64
[10]:
s2**2
[10]:
2022-01-31 0.032900
2022-02-01 0.001945
2022-02-02 2.364913
2022-02-03 1.373704
2022-02-04 3.211413
2022-02-05 0.112424
2022-02-06 1.929233
Freq: D, dtype: float64
[11]:
np.exp(s2)
[11]:
2022-01-31 1.198876
2022-02-01 1.045095
2022-02-02 0.214847
2022-02-03 3.228609
2022-02-04 0.166620
2022-02-05 1.398356
2022-02-06 4.010711
Freq: D, dtype: float64
You can also think of a series as a fixed-length ordered dict, since it is an assignment of index values to data values. It can be used in many contexts where you could use a dict:
[12]:
"2022-02-02" in s2
[12]:
True
[13]:
"2022-02-09" in s2
[13]:
False
Missing data¶
I will use NA and null synonymously to indicate missing data. The functions isna and notna in pandas should be used to identify missing data:
[14]:
pd.isna(s2)
[14]:
2022-01-31 False
2022-02-01 False
2022-02-02 False
2022-02-03 False
2022-02-04 False
2022-02-05 False
2022-02-06 False
Freq: D, dtype: bool
[15]:
pd.notna(s2)
[15]:
2022-01-31 True
2022-02-01 True
2022-02-02 True
2022-02-03 True
2022-02-04 True
2022-02-05 True
2022-02-06 True
Freq: D, dtype: bool
Series also has these as instance methods:
[16]:
s2.isna()
[16]:
2022-01-31 False
2022-02-01 False
2022-02-02 False
2022-02-03 False
2022-02-04 False
2022-02-05 False
2022-02-06 False
Freq: D, dtype: bool
Dealing with missing data is discussed in more detail in the section Managing missing data with pandas.
A useful feature of Series for many applications is the automatic alignment by index labels in arithmetic operations:
[17]:
idx = pd.date_range("2022-02-01", periods=7)
s3 = pd.Series(rng.normal(size=7), index=idx)
[18]:
s2, s3
[18]:
(2022-01-31 0.181385
2022-02-01 0.044108
2022-02-02 -1.537827
2022-02-03 1.172051
2022-02-04 -1.792041
2022-02-05 0.335298
2022-02-06 1.388968
Freq: D, dtype: float64,
2022-02-01 -0.559119
2022-02-02 0.397340
2022-02-03 0.259035
2022-02-04 -0.138873
2022-02-05 1.520708
2022-02-06 -0.456184
2022-02-07 -0.438187
Freq: D, dtype: float64)
[19]:
s2 + s3
[19]:
2022-01-31 NaN
2022-02-01 -0.515012
2022-02-02 -1.140487
2022-02-03 1.431087
2022-02-04 -1.930914
2022-02-05 1.856006
2022-02-06 0.932785
2022-02-07 NaN
Freq: D, dtype: float64
If you have experience with SQL, this is similar to a JOIN operation.
Both the Series object itself and its index have a name attribute that can be integrated into other areas of the pandas functionality:
[20]:
s3.name = "floats"
s3.index.name = "date"
s3
[20]:
date
2022-02-01 -0.559119
2022-02-02 0.397340
2022-02-03 0.259035
2022-02-04 -0.138873
2022-02-05 1.520708
2022-02-06 -0.456184
2022-02-07 -0.438187
Freq: D, Name: floats, dtype: float64
DataFrame¶
A DataFrame represents a rectangular data table and contains an ordered, named collection of columns, each of which can have a different value type. The DataFrame has both a row index and a column index.
Note:
Although a DataFrame is two-dimensional, you can also use it to represent higher-dimensional data in a table format with hierarchical indexing using join, combine and Reshaping.
[21]:
data = {
"Code": ["U+0000", "U+0001", "U+0002", "U+0003", "U+0004", "U+0005"],
"Decimal": [0, 1, 2, 3, 4, 5],
"Octal": ["001", "002", "003", "004", "004", "005"],
"Key": ["NUL", "Ctrl-A", "Ctrl-B", "Ctrl-C", "Ctrl-D", "Ctrl-E"],
}
df = pd.DataFrame(data)
df
[21]:
| Code | Decimal | Octal | Key | |
|---|---|---|---|---|
| 0 | U+0000 | 0 | 001 | NUL |
| 1 | U+0001 | 1 | 002 | Ctrl-A |
| 2 | U+0002 | 2 | 003 | Ctrl-B |
| 3 | U+0003 | 3 | 004 | Ctrl-C |
| 4 | U+0004 | 4 | 004 | Ctrl-D |
| 5 | U+0005 | 5 | 005 | Ctrl-E |
For large DataFrames, the head method selects only the first five rows:
[22]:
df.head()
[22]:
| Code | Decimal | Octal | Key | |
|---|---|---|---|---|
| 0 | U+0000 | 0 | 001 | NUL |
| 1 | U+0001 | 1 | 002 | Ctrl-A |
| 2 | U+0002 | 2 | 003 | Ctrl-B |
| 3 | U+0003 | 3 | 004 | Ctrl-C |
| 4 | U+0004 | 4 | 004 | Ctrl-D |
You can also specify columns and their order:
[23]:
pd.DataFrame(data, columns=["Code", "Key"])
[23]:
| Code | Key | |
|---|---|---|
| 0 | U+0000 | NUL |
| 1 | U+0001 | Ctrl-A |
| 2 | U+0002 | Ctrl-B |
| 3 | U+0003 | Ctrl-C |
| 4 | U+0004 | Ctrl-D |
| 5 | U+0005 | Ctrl-E |
If you want to pass a column that is not contained in the dict, it will appear without values in the result:
[24]:
df2 = pd.DataFrame(
data, columns=["Code", "Decimal", "Octal", "Description", "Key"]
)
df2
[24]:
| Code | Decimal | Octal | Description | Key | |
|---|---|---|---|---|---|
| 0 | U+0000 | 0 | 001 | NaN | NUL |
| 1 | U+0001 | 1 | 002 | NaN | Ctrl-A |
| 2 | U+0002 | 2 | 003 | NaN | Ctrl-B |
| 3 | U+0003 | 3 | 004 | NaN | Ctrl-C |
| 4 | U+0004 | 4 | 004 | NaN | Ctrl-D |
| 5 | U+0005 | 5 | 005 | NaN | Ctrl-E |
You can retrieve a column in a DataFrame with a dict-like notation:
[25]:
df["Code"]
[25]:
0 U+0000
1 U+0001
2 U+0002
3 U+0003
4 U+0004
5 U+0005
Name: Code, dtype: object
This way you can also make a column the index:
[26]:
df2 = pd.DataFrame(
data, columns=["Decimal", "Octal", "Description", "Key"], index=df["Code"]
)
df2
[26]:
| Decimal | Octal | Description | Key | |
|---|---|---|---|---|
| Code | ||||
| U+0000 | 0 | 001 | NaN | NUL |
| U+0001 | 1 | 002 | NaN | Ctrl-A |
| U+0002 | 2 | 003 | NaN | Ctrl-B |
| U+0003 | 3 | 004 | NaN | Ctrl-C |
| U+0004 | 4 | 004 | NaN | Ctrl-D |
| U+0005 | 5 | 005 | NaN | Ctrl-E |
Rows can be retrieved by position or name with the pandas.DataFrame.loc attribute:
[27]:
df2.loc["U+0001"]
[27]:
Decimal 1
Octal 002
Description NaN
Key Ctrl-A
Name: U+0001, dtype: object
Column values can be changed by assignment. For example, a scalar value or an array of values could be assigned to the empty Description column:
[28]:
df2["Description"] = [
"Null character",
"Start of Heading",
"Start of Text",
"End-of-text character",
"End-of-transmission character",
"Enquiry character",
]
df2
[28]:
| Decimal | Octal | Description | Key | |
|---|---|---|---|---|
| Code | ||||
| U+0000 | 0 | 001 | Null character | NUL |
| U+0001 | 1 | 002 | Start of Heading | Ctrl-A |
| U+0002 | 2 | 003 | Start of Text | Ctrl-B |
| U+0003 | 3 | 004 | End-of-text character | Ctrl-C |
| U+0004 | 4 | 004 | End-of-transmission character | Ctrl-D |
| U+0005 | 5 | 005 | Enquiry character | Ctrl-E |
Assigning a non-existing column creates a new column.
Columns can be removed with pandas.DataFrame.drop and displayed with pandas.DataFrame.columns:
[29]:
df3 = df2.drop(columns=["Decimal", "Octal"])
[30]:
df2.columns
[30]:
Index(['Decimal', 'Octal', 'Description', 'Key'], dtype='object')
[31]:
df3.columns
[31]:
Index(['Description', 'Key'], dtype='object')
Another common form of data is nested dict of dicts:
[32]:
u = {
"U+0006": {
"Decimal": "6",
"Octal": "006",
"Description": "Acknowledge character",
"Key": "Ctrl-F",
},
"U+0007": {
"Decimal": "7",
"Octal": "007",
"Description": "Bell character",
"Key": "Ctrl-G",
},
}
df4 = pd.DataFrame(u)
df4
[32]:
| U+0006 | U+0007 | |
|---|---|---|
| Decimal | 6 | 7 |
| Octal | 006 | 007 |
| Description | Acknowledge character | Bell character |
| Key | Ctrl-F | Ctrl-G |
You can transpose the DataFrame, i.e. swap the rows and columns, with a similar syntax to a NumPy array:
[33]:
df4.T
[33]:
| Decimal | Octal | Description | Key | |
|---|---|---|---|---|
| U+0006 | 6 | 006 | Acknowledge character | Ctrl-F |
| U+0007 | 7 | 007 | Bell character | Ctrl-G |
Warning:
Note that when transposing, the data types of the columns are discarded if the columns do not all have the same data type, so when transposing and then transposing back, the previous type information may be lost. In this case, the columns become arrays of pure Python objects.
The keys in the inner dicts are combined to form the index in the result. This is not the case when an explicit index is specified:
[34]:
df5 = pd.DataFrame(u, index=["Decimal", "Octal", "Key"])
df5
[34]:
| U+0006 | U+0007 | |
|---|---|---|
| Decimal | 6 | 7 |
| Octal | 006 | 007 |
| Key | Ctrl-F | Ctrl-G |