Convert dtype#

Sometimes the pandas data types are not a good fit. This can be due to serialisation formats that do not contain type information, for example. Sometimes, however, you should change the type deliberately to gain better performance, more manipulation options, or lower memory requirements. In the following examples, we make different conversions of a Series:

[1]:
import numpy as np
import pandas as pd
[2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
[3]:
s
[3]:
0    0.330852
1    0.703606
2   -0.773463
3   -0.814739
4    1.677859
5    0.992312
6    0.175644
dtype: float64

Automatic conversion#

pandas.Series.convert_dtypes tries to convert a Series to a type that supports NA. In the case of our Series, the type is changed from float64 to Float64:

[4]:
s.convert_dtypes()
[4]:
0    0.330852
1    0.703606
2   -0.773463
3   -0.814739
4    1.677859
5    0.992312
6    0.175644
dtype: Float64

Unfortunately, however, convert_dtypes offers little control over which data type a Series is converted to. Therefore, I prefer pandas.Series.astype:

[5]:
s.astype("Float32")
[5]:
0    0.330852
1    0.703606
2   -0.773463
3   -0.814739
4    1.677859
5    0.992312
6    0.175644
dtype: Float32

Using the correct type can save memory. The default numeric data types are 8 bytes wide, for example int64 or float64. If a narrower type suffices, switching to it significantly reduces memory consumption, allowing you to process more data. You can use NumPy to check the limits of integer and float types:

[6]:
np.iinfo("int64")
[6]:
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
[7]:
np.finfo("float32")
[7]:
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
[8]:
np.finfo("float64")
[8]:
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
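As a sketch of how these limits can be used in practice (the sample values and variable names are our own, not from the notebook), you can check that all values fit into a narrower type before downcasting, or let pandas.to_numeric choose the smallest safe type for you:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, -2.25, 3.75])

# Verify that the largest magnitude fits into float32 before downcasting
assert s.abs().max() <= np.finfo("float32").max

s32 = s.astype("float32")
print(s32.dtype)  # float32

# Alternatively, pd.to_numeric with downcast="float" picks the smallest
# float type that can hold the values without overflow
s_small = pd.to_numeric(s, downcast="float")
print(s_small.dtype)  # float32
```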

Memory usage#

To calculate the memory consumption of a Series, you can use pandas.Series.nbytes, which reports the memory used by the data alone. pandas.Series.memory_usage also includes the memory of the index. With deep=True, you can additionally determine the memory consumption at system level, including the Python objects referenced by an object-dtype Series.

[9]:
s.nbytes
[9]:
56
[10]:
s.astype("Float32").nbytes
[10]:
35
[11]:
s.memory_usage()
[11]:
188
[12]:
s.astype("Float32").memory_usage()
[12]:
167
[13]:
s.memory_usage(deep=True)
[13]:
188

String and category types#

The pandas.Series.astype method can also convert numeric series into strings if you pass str. Note the dtype in the following example:

[14]:
s.astype(str)
[14]:
0    0.33085233447486595
1     0.7036061214691522
2    -0.7734631836438829
3    -0.8147390382513203
4     1.6778586038914356
5     0.9923123929031976
6    0.17564372049973478
dtype: object
[15]:
s.astype(str).memory_usage()
[15]:
188
[16]:
s.astype(str).memory_usage(deep=True)
[16]:
661

To convert to a categorical type, you can pass 'category' as the type:

[17]:
s.astype(str).astype("category")
[17]:
0    0.33085233447486595
1     0.7036061214691522
2    -0.7734631836438829
3    -0.8147390382513203
4     1.6778586038914356
5     0.9923123929031976
6    0.17564372049973478
dtype: category
Categories (7, object): ['-0.7734631836438829', '-0.8147390382513203', '0.17564372049973478', '0.33085233447486595', '0.7036061214691522', '0.9923123929031976', '1.6778586038914356']

A categorical Series is useful for string data and can lead to large memory savings. When converting to categorical data, pandas no longer stores a separate Python string for each value; repeated values are stored only once. You still have all the features of the str accessor, but you save a lot of memory when there are many duplicate values, and you gain performance because fewer string operations are needed.

[18]:
s.astype("category").memory_usage(deep=True)
[18]:
495
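To see these savings more clearly, here is a small sketch (the colour data is invented for illustration) comparing an object Series with many repeated strings against its categorical counterpart:

```python
import pandas as pd

# Invented example data: many repeated string values
colours = pd.Series(["red", "green", "blue"] * 1_000)

as_object = colours.memory_usage(deep=True)
as_category = colours.astype("category").memory_usage(deep=True)

# Repeated values are stored only once, so the categorical version is smaller
print(as_object, as_category)
```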

Ordered categories#

To create ordered categories, you need to define your own pandas.CategoricalDtype:

[19]:
from pandas.api.types import CategoricalDtype


sorted_values = pd.Series(sorted(set(s)))  # avoid shadowing the built-in sorted
cat_dtype = CategoricalDtype(categories=sorted_values, ordered=True)

s.astype(cat_dtype)
[19]:
0    0.330852
1    0.703606
2   -0.773463
3   -0.814739
4    1.677859
5    0.992312
6    0.175644
dtype: category
Categories (7, float64): [-0.814739 < -0.773463 < 0.175644 < 0.330852 < 0.703606 < 0.992312 < 1.677859]
[20]:
s.astype(cat_dtype).memory_usage(deep=True)
[20]:
495
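Ordered categories also enable comparisons and min/max directly on the categorical values. A minimal sketch with invented t-shirt sizes:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented example: sizes with a meaningful order
size_type = CategoricalDtype(categories=["S", "M", "L"], ordered=True)
sizes = pd.Series(["M", "S", "L", "M"]).astype(size_type)

print(sizes.max())   # 'L'
print(sizes > "S")   # element-wise comparison using the category order
```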

The following table lists the types you can pass to astype.

Data type                    Description

str, 'str'                   convert to Python str (object dtype)
'string'                     convert to the pandas string extension type with pandas.NA
int, 'int', 'int64'          convert to NumPy int64
'int32', 'uint32'            convert to NumPy int32 or uint32
'Int64'                      convert to the pandas Int64 extension type with pandas.NA
float, 'float', 'float64'    convert to NumPy float64
'category'                   convert to CategoricalDtype with pandas.NA
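The difference between str and 'string' matters for missing values. A small sketch (example data invented): astype(str) turns None into the literal string 'None', while the 'string' extension type keeps it as pandas.NA:

```python
import pandas as pd

raw = pd.Series(["a", None, "c"])

# astype(str) produces an object Series of Python strings, including 'None'
as_str = raw.astype(str)
print(as_str.tolist())  # ['a', 'None', 'c']

# astype("string") uses the pandas string extension type and keeps pd.NA
as_string = raw.astype("string")
print(as_string.isna().sum())  # 1
```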

Conversion to other data types#

The pandas.Series.to_numpy method (or the pandas.Series.values property) gives us a NumPy array of the values, and pandas.Series.to_list returns a Python list of values. Why would you want to do this? Usually you do not: pandas objects are more user-friendly, the code is easier to read, and Python lists are much slower to process. However, other libraries sometimes require plain NumPy arrays or lists. With pandas.Series.to_frame you can create a DataFrame with a single column, if necessary:

[21]:
s.to_frame()
[21]:
0
0 0.330852
1 0.703606
2 -0.773463
3 -0.814739
4 1.677859
5 0.992312
6 0.175644
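A short sketch of the to_numpy and to_list conversions mentioned above (the sample values are invented):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.33, 0.7, -0.77])

arr = s.to_numpy()  # NumPy array of the values
lst = s.to_list()   # plain Python list

print(type(arr), type(lst))
```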

The function pandas.to_datetime can also be useful for converting values in pandas to dates and times.
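A minimal sketch of pandas.to_datetime (the example strings are invented), using errors="coerce" so that unparseable entries become NaT instead of raising an error:

```python
import pandas as pd

dates = pd.Series(["2021-03-01", "2021-03-02", "not a date"])
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)          # datetime64[ns]
print(parsed.isna().sum())   # 1 — the unparseable entry became NaT
```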