Convert dtype

Sometimes the pandas data types do not fit the data well. This can happen, for example, when a serialisation format does not carry type information. In other cases you may want to change the type for better performance: more ways to manipulate the data or lower memory requirements. In the following examples, we will apply different conversions to a Series:

[1]:
import numpy as np
import pandas as pd
[2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
[3]:
s
[3]:
0   -0.438454
1    1.494909
2   -0.103494
3   -1.029965
4    0.995336
5    0.743191
6   -0.214517
dtype: float64

Automatic conversion

pandas.Series.convert_dtypes tries to convert a Series to a type that supports NA. In the case of our Series, the type is changed from float64 to Float64:

[4]:
s.convert_dtypes()
[4]:
0   -0.438454
1    1.494909
2   -0.103494
3   -1.029965
4    0.995336
5    0.743191
6   -0.214517
dtype: Float64
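
The same method also handles other kinds of data. The following is a minimal sketch with made-up values (the names numbers, flags and names are only illustrative) that shows which NA-supporting types convert_dtypes is expected to choose when a Series contains missing values:

```python
import pandas as pd

# Illustrative data, not taken from the Series above
numbers = pd.Series([1, 2, None])        # float64, because of the missing value
flags = pd.Series([True, False, None])   # object, because of the missing value
names = pd.Series(["a", "b", None])

converted_numbers = numbers.convert_dtypes()
converted_flags = flags.convert_dtypes()
converted_names = names.convert_dtypes()

# Each result uses an extension type that can hold missing values
print(converted_numbers.dtype)
print(converted_flags.dtype)
print(converted_names.dtype)
```

Whole-number floats become a nullable integer type here, which may or may not be what you want.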

Unfortunately, convert_dtypes gives me little control over which data type the values are converted to. Therefore, I prefer pandas.Series.astype:

[5]:
s.astype("Float32")
[5]:
0   -0.438454
1    1.494909
2   -0.103494
3   -1.029965
4    0.995336
5    0.743191
6   -0.214517
dtype: Float32

However, if the Series contains values that cannot be converted, an error is raised:

[6]:
n = pd.Series([np.random.randint(127), np.nan, np.random.randint(127)])
n
[6]:
0    121.0
1      NaN
2    125.0
dtype: float64
[7]:
n.astype('int8')
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[7], line 1
----> 1 n.astype('int8')

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
   6637     results = [
   6638         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6639     ]
   6641 else:
   6642     # else, only a single dtype is given
-> 6643     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6644     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6645     return res.__finalize__(self, method="astype")

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
    427 elif using_copy_on_write():
    428     copy = False
--> 430 return self.apply(
    431     "astype",
    432     dtype=dtype,
    433     copy=copy,
    434     errors=errors,
    435     using_cow=using_copy_on_write(),
    436 )

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361         applied = b.apply(f, **kwargs)
    362     else:
--> 363         applied = getattr(b, f)(**kwargs)
    364     result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
    755         raise ValueError("Can not squeeze with more than one column.")
    756     values = values[0, :]  # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    760 new_values = maybe_coerce_values(new_values)
    762 refs = None

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
    234     dtype = dtype.numpy_dtype
    236 try:
--> 237     new_values = astype_array(values, dtype, copy=copy)
    238 except (ValueError, TypeError):
    239     # e.g. _astype_nansafe can fail on object-dtype of strings
    240     #  trying to convert to float
    241     if errors == "ignore":

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
    179     values = values.astype(dtype, copy=copy)
    181 else:
--> 182     values = _astype_nansafe(values, dtype, copy=copy)
    184 # in pandas we don't store numpy str dtypes, so convert to object
    185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:101, in _astype_nansafe(arr, dtype, copy, skipna)
     96     return lib.ensure_string_array(
     97         arr, skipna=skipna, convert_na_value=False
     98     ).reshape(shape)
    100 elif np.issubdtype(arr.dtype, np.floating) and dtype.kind in "iu":
--> 101     return _astype_float_to_int_nansafe(arr, dtype, copy)
    103 elif arr.dtype == object:
    104     # if we have a datetime/timedelta array of objects
    105     # then coerce to datetime64[ns] and use DatetimeArray.astype
    107     if lib.is_np_dtype(dtype, "M"):

File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:145, in _astype_float_to_int_nansafe(values, dtype, copy)
    141 """
    142 astype with a check preventing converting NaN to an meaningless integer value.
    143 """
    144 if not np.isfinite(values).all():
--> 145     raise IntCastingNaNError(
    146         "Cannot convert non-finite values (NA or inf) to integer"
    147     )
    148 if dtype.kind == "u":
    149     # GH#45151
    150     if not (values >= 0).all():

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

Errors such as IntCastingNaNError can be suppressed with errors="ignore"; in that case, the Series is returned with its original data type:

[8]:
n.astype('int8', errors="ignore")
[8]:
0    121.0
1      NaN
2    125.0
dtype: float64
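
If the integer type matters more than keeping NumPy's int8, one alternative is the nullable pandas type Int8, which can hold missing values. This is a sketch with fixed values instead of random ones; the names values and nullable are illustrative:

```python
import numpy as np
import pandas as pd

# float64, because NaN cannot be stored in a NumPy integer array
values = pd.Series([121, np.nan, 125])

# The capitalised "Int8" is the nullable pandas extension type
nullable = values.astype("Int8")
print(nullable)
```

The missing value is kept as pandas.NA instead of raising an error or silently leaving the Series as float64.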

Using the correct type can save memory. The default numeric types, int64 and float64, are 8 bytes wide. If you can use a narrower type, memory consumption drops significantly, allowing you to process more data. You can use NumPy to check the limits of integer and float types:

[9]:
np.iinfo("int8")
[9]:
iinfo(min=-128, max=127, dtype=int8)
[10]:
np.iinfo("int64")
[10]:
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
[11]:
np.finfo("float32")
[11]:
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
[12]:
np.finfo("float64")
[12]:
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
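
These limits can be used to check whether a narrower type is safe before converting. The following sketch assumes a small integer Series named counts (illustrative) and compares a manual check against pandas.to_numeric with downcast="integer", which picks the smallest fitting integer type:

```python
import numpy as np
import pandas as pd

counts = pd.Series([1, 100, 127], dtype="int64")

# Manual check against the limits reported by np.iinfo
limits = np.iinfo("int8")
fits_int8 = bool(counts.min() >= limits.min and counts.max() <= limits.max)
small = counts.astype("int8") if fits_int8 else counts

# Let pandas choose the smallest integer type that holds all values
auto = pd.to_numeric(counts, downcast="integer")

print(fits_int8, small.dtype, auto.dtype)
```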

Memory usage

To calculate the memory consumption of the Series, you can use pandas.Series.nbytes to determine the memory used by the data. pandas.Series.memory_usage additionally includes the memory used by the index. With deep=True, pandas also inspects object values, for example Python strings, and reports their memory consumption at system level.

[13]:
s.nbytes
[13]:
56
[14]:
s.astype("Float32").nbytes
[14]:
35
[15]:
s.memory_usage()
[15]:
188
[16]:
s.astype("Float32").memory_usage()
[16]:
167
[17]:
s.memory_usage(deep=True)
[17]:
188
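
To make the relationship between these numbers easier to see, here is a small sketch with a larger, illustrative Series (values and narrow are assumed names). It uses the index parameter of memory_usage to separate data memory from index memory:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.zeros(1000))     # float64: 8 bytes per value
narrow = values.astype("float32")      # float32: 4 bytes per value

print(values.nbytes)
print(narrow.nbytes)

# Without the index, memory_usage reports the same as nbytes
data_bytes = values.memory_usage(index=False)

# The difference is the memory attributed to the index
index_bytes = values.memory_usage() - data_bytes
print(data_bytes, index_bytes)
```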

String and category types

The pandas.Series.astype method can also convert numeric series into strings if you pass str. Note the dtype in the following example:

[18]:
s.astype(str)
[18]:
0     -0.4384542761183468
1      1.4949094487463122
2    -0.10349429095988272
3      -1.029965136707806
4       0.995336334832013
5      0.7431905015299632
6    -0.21451741758934953
dtype: object
[19]:
s.astype(str).memory_usage()
[19]:
188
[20]:
s.astype(str).memory_usage(deep=True)
[20]:
605

To convert to a categorical type, you can pass 'category' as the type:

[21]:
s.astype(str).astype("category")
[21]:
0     -0.4384542761183468
1      1.4949094487463122
2    -0.10349429095988272
3      -1.029965136707806
4       0.995336334832013
5      0.7431905015299632
6    -0.21451741758934953
dtype: category
Categories (7, object): ['-0.10349429095988272', '-0.21451741758934953', '-0.4384542761183468', '-1.029965136707806', '0.7431905015299632', '0.995336334832013', '1.4949094487463122']

A categorical Series is useful for string data and can lead to large memory savings. When converting to categorical data, pandas no longer stores a Python string for each value; instead, each distinct value is stored only once and repeated values refer to it. You still have all the features of the str attribute, but you save a lot of memory when there are many duplicate values, and performance improves because fewer string operations are needed.

[22]:
s.astype("category").memory_usage(deep=True)
[22]:
495
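
The values in our Series are all distinct, so the effect of duplicates is not visible there. The following sketch uses an illustrative Series of repeated colour names (colours and as_category are assumed names) to compare the memory of Python strings with that of a categorical type:

```python
import pandas as pd

# 3000 values, but only three distinct strings
colours = pd.Series(["red", "green", "blue"] * 1000, dtype=object)
as_category = colours.astype("category")

object_bytes = colours.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# The str accessor is still available on the categorical Series
upper = as_category.str.upper()
print(upper.head(3))
```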

Ordered categories

To create ordered categories, you need to define your own pandas.CategoricalDtype:

[23]:
from pandas.api.types import CategoricalDtype


# Use a name that does not shadow the built-in sorted function
sorted_values = pd.Series(sorted(set(s)))
cat_dtype = CategoricalDtype(categories=sorted_values, ordered=True)

s.astype(cat_dtype)
[23]:
0   -0.438454
1    1.494909
2   -0.103494
3   -1.029965
4    0.995336
5    0.743191
6   -0.214517
dtype: category
Categories (7, float64): [-1.029965 < -0.438454 < -0.214517 < -0.103494 < 0.743191 < 0.995336 < 1.494909]
[24]:
s.astype(cat_dtype).memory_usage(deep=True)
[24]:
495
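
Ordered categories are mainly useful because they allow comparisons and sorting by category order rather than alphabetically. A minimal sketch with illustrative size labels (size_dtype and sizes are assumed names):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_dtype = CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)
sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_dtype)

# Comparisons follow the category order, not the alphabet
bigger_than_small = sizes > "small"
smallest = sizes.min()
in_order = sizes.sort_values()

print(bigger_than_small.tolist())
print(smallest)
print(in_order.tolist())
```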

The following table lists the types you can pass to astype.

Data type                    Description
---------------------------  ----------------------------------------------
str, 'str'                   convert to Python string
'string'                     convert to pandas string with pandas.NA
int, 'int', 'int64'          convert to NumPy int64
'int32', 'uint32'            convert to NumPy int32 or uint32
'Int64'                      convert to pandas Int64 with pandas.NA
float, 'float', 'float64'    convert to NumPy float64
'category'                   convert to CategoricalDtype (missing values
                             are shown as NaN)
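
As a small illustration of how two of these types treat missing values, the following sketch converts an illustrative Series (raw is an assumed name) to 'string' and to 'category':

```python
import pandas as pd

raw = pd.Series(["a", None, "c"], dtype=object)

# 'string' is the pandas string type; missing values become pandas.NA
text = raw.astype("string")

# 'category' keeps the missing value outside the list of categories
categorical = raw.astype("category")

print(text)
print(categorical)
```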

Conversion to other data types

The pandas.Series.to_numpy method or the pandas.Series.values property gives us a NumPy array of values, and pandas.Series.to_list returns a Python list of values. You will rarely need these conversions: pandas objects are usually much more user-friendly and the code is easier to read, and Python lists are much slower to process. With pandas.Series.to_frame you can create a DataFrame with a single column, if necessary:

[25]:
s.to_frame()
[25]:
          0
0 -0.438454
1  1.494909
2 -0.103494
3 -1.029965
4  0.995336
5  0.743191
6 -0.214517
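
The conversions mentioned above can be sketched side by side. The values and the column name measurement are illustrative; to_frame accepts a name argument for the resulting column:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.5, 2.5, 3.5])

array = values.to_numpy()                     # NumPy array
items = values.to_list()                      # plain Python list
frame = values.to_frame(name="measurement")   # DataFrame with one named column

print(type(array).__name__, type(items).__name__)
print(frame)
```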

The function pandas.to_datetime can also be useful to convert values in pandas to date and time.
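
A minimal sketch of such a conversion, assuming ISO-formatted date strings in an illustrative Series named raw. With errors="coerce", values that cannot be parsed become NaT instead of raising an error:

```python
import pandas as pd

raw = pd.Series(["2024-01-15", "2024-02-29", "not a date"])

# An explicit format avoids guessing; unparsable values become NaT
dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(dates)
print(dates.dt.year)
```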