Convert dtype
Sometimes the pandas data types do not fit the data well, for example because a serialisation format carries no type information. In other cases you change the type deliberately, either to gain more ways of manipulating the data or to reduce memory requirements. The following examples apply different conversions to a Series:
[1]:
import numpy as np
import pandas as pd
[2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
[3]:
s
[3]:
0 -0.438454
1 1.494909
2 -0.103494
3 -1.029965
4 0.995336
5 0.743191
6 -0.214517
dtype: float64
Automatic conversion
pandas.Series.convert_dtypes tries to convert a Series to a type that supports NA. In the case of our Series, the type is changed from float64 to Float64:
[4]:
s.convert_dtypes()
[4]:
0 -0.438454
1 1.494909
2 -0.103494
3 -1.029965
4 0.995336
5 0.743191
6 -0.214517
dtype: Float64
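The target type is chosen by pandas, not by you. As a minimal sketch of this, the following example uses a hypothetical Series named counts that is not part of the data above. Because every non-missing value is a whole number, convert_dtypes should choose the nullable Int64 here rather than Float64:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: whole numbers stored as float64 because of the NaN
counts = pd.Series([1.0, 2.0, np.nan])

converted = counts.convert_dtypes()
print(counts.dtype)     # dtype before the conversion
print(converted.dtype)  # dtype chosen by convert_dtypes
print(converted)
```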
Unfortunately, convert_dtypes gives me little control over the resulting data type. I therefore prefer pandas.Series.astype:
[5]:
s.astype("Float32")
[5]:
0 -0.438454
1 1.494909
2 -0.103494
3 -1.029965
4 0.995336
5 0.743191
6 -0.214517
dtype: Float32
However, if the Series contains values that cannot be converted, an error is raised:
[6]:
n = pd.Series([rng.integers(127), np.nan, rng.integers(127)])
n
[6]:
0 121.0
1 NaN
2 125.0
dtype: float64
[7]:
n.astype('int8')
---------------------------------------------------------------------------
IntCastingNaNError Traceback (most recent call last)
Cell In[7], line 1
----> 1 n.astype('int8')
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/generic.py:6643, in NDFrame.astype(self, dtype, copy, errors)
6637 results = [
6638 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6639 ]
6641 else:
6642 # else, only a single dtype is given
-> 6643 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6644 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6645 return res.__finalize__(self, method="astype")
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/internals/managers.py:430, in BaseBlockManager.astype(self, dtype, copy, errors)
427 elif using_copy_on_write():
428 copy = False
--> 430 return self.apply(
431 "astype",
432 dtype=dtype,
433 copy=copy,
434 errors=errors,
435 using_cow=using_copy_on_write(),
436 )
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
361 applied = b.apply(f, **kwargs)
362 else:
--> 363 applied = getattr(b, f)(**kwargs)
364 result_blocks = extend_blocks(applied, result_blocks)
366 out = type(self).from_blocks(result_blocks, self.axes)
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
179 values = values.astype(dtype, copy=copy)
181 else:
--> 182 values = _astype_nansafe(values, dtype, copy=copy)
184 # in pandas we don't store numpy str dtypes, so convert to object
185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:101, in _astype_nansafe(arr, dtype, copy, skipna)
96 return lib.ensure_string_array(
97 arr, skipna=skipna, convert_na_value=False
98 ).reshape(shape)
100 elif np.issubdtype(arr.dtype, np.floating) and dtype.kind in "iu":
--> 101 return _astype_float_to_int_nansafe(arr, dtype, copy)
103 elif arr.dtype == object:
104 # if we have a datetime/timedelta array of objects
105 # then coerce to datetime64[ns] and use DatetimeArray.astype
107 if lib.is_np_dtype(dtype, "M"):
File ~/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/pandas/core/dtypes/astype.py:145, in _astype_float_to_int_nansafe(values, dtype, copy)
141 """
142 astype with a check preventing converting NaN to an meaningless integer value.
143 """
144 if not np.isfinite(values).all():
--> 145 raise IntCastingNaNError(
146 "Cannot convert non-finite values (NA or inf) to integer"
147 )
148 if dtype.kind == "u":
149 # GH#45151
150 if not (values >= 0).all():
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
With errors="ignore", astype suppresses errors such as IntCastingNaNError and returns the Series with its original data type:
[8]:
n.astype('int8', errors="ignore")
[8]:
0 121.0
1 NaN
2 125.0
dtype: float64
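If you want a narrow integer type despite missing values, the nullable integer types are one alternative worth considering. The following sketch assumes a hypothetical Series m with the same shape as n and uses the capitalised "Int8", which represents missing values as pandas.NA:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for n: whole numbers plus one missing value
m = pd.Series([121.0, np.nan, 125.0])

# "Int8" (capital I) is the nullable pandas integer type, unlike NumPy's "int8"
nullable = m.astype("Int8")
print(nullable)
```

This only works as shown because the non-missing values are whole numbers that fit into eight bits.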
Using the correct type can save memory. The default numeric types, int64 and float64, are 8 bytes wide. If a narrower type is sufficient, it significantly reduces memory consumption and allows you to process more data. You can use NumPy to check the limits of integer and float types:
[9]:
np.iinfo("int8")
[9]:
iinfo(min=-128, max=127, dtype=int8)
[10]:
np.iinfo("int64")
[10]:
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
[11]:
np.finfo("float32")
[11]:
finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
[12]:
np.finfo("float64")
[12]:
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
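These limits can also be checked programmatically. The following is only a sketch: smallest_int_dtype is a hypothetical helper, not a pandas or NumPy function, and it only considers signed integer types:

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer dtype that holds all values."""
    lowest, highest = values.min(), values.max()
    for name in ("int8", "int16", "int32", "int64"):
        limits = np.iinfo(name)
        if limits.min <= lowest and highest <= limits.max:
            return name
    raise OverflowError("values do not fit into int64")


# Hypothetical example data
small = pd.Series([0, 100, 127])
large = pd.Series([0, 40_000])

print(smallest_int_dtype(small))
print(smallest_int_dtype(large))
print(small.astype(smallest_int_dtype(small)).dtype)
```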
Memory usage
To calculate the memory consumption of a Series, you can use pandas.Series.nbytes, which returns the memory used by the data alone. pandas.Series.memory_usage additionally includes the memory of the index. With deep=True, pandas also inspects object dtypes and counts the memory of the Python objects they reference, for example strings.
[13]:
s.nbytes
[13]:
56
[14]:
s.astype("Float32").nbytes
[14]:
35
[15]:
s.memory_usage()
[15]:
188
[16]:
s.astype("Float32").memory_usage()
[16]:
167
[17]:
s.memory_usage(deep=True)
[17]:
188
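The relationship between these numbers can be sketched with a larger, hypothetical Series named values. With plain NumPy float types, each value takes a fixed number of bytes, and memory_usage(index=False) should match nbytes:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 1000 zeros, float64 by default
values = pd.Series(np.zeros(1_000))

data_bytes = values.nbytes                        # 1000 values * 8 bytes
narrow_bytes = values.astype("float32").nbytes    # 1000 values * 4 bytes
without_index = values.memory_usage(index=False)  # data only
with_index = values.memory_usage()                # data plus index

print(data_bytes, narrow_bytes, without_index, with_index)
```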
String and category types
The pandas.Series.astype method can also convert numeric series into strings if you pass str. Note the dtype in the following example:
[18]:
s.astype(str)
[18]:
0 -0.4384542761183468
1 1.4949094487463122
2 -0.10349429095988272
3 -1.029965136707806
4 0.995336334832013
5 0.7431905015299632
6 -0.21451741758934953
dtype: object
[19]:
s.astype(str).memory_usage()
[19]:
188
[20]:
s.astype(str).memory_usage(deep=True)
[20]:
605
To convert to a categorical type, you can pass 'category' as the type:
[21]:
s.astype(str).astype("category")
[21]:
0 -0.4384542761183468
1 1.4949094487463122
2 -0.10349429095988272
3 -1.029965136707806
4 0.995336334832013
5 0.7431905015299632
6 -0.21451741758934953
dtype: category
Categories (7, object): ['-0.10349429095988272', '-0.21451741758934953', '-0.4384542761183468', '-1.029965136707806', '0.7431905015299632', '0.995336334832013', '1.4949094487463122']
A categorical Series is useful for string data and can lead to large memory savings. After the conversion, pandas no longer stores a Python string for each value. Instead, it stores each distinct value once and refers to it by an integer code, so repeated values are not duplicated. You still have all the features of the str attribute, but with many duplicate values you save a lot of memory and gain performance because fewer string operations are needed.
[22]:
s.astype("category").memory_usage(deep=True)
[22]:
495
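The savings only show up when values repeat. The following sketch assumes a hypothetical Series named cities with three distinct strings repeated 1000 times each and compares the memory before and after the conversion:

```python
import pandas as pd

# Hypothetical example data: 3000 strings, but only three distinct values
cities = pd.Series(["Berlin", "Hamburg", "Munich"] * 1_000)
as_category = cities.astype("category")

plain_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
print(as_category.cat.categories.tolist())  # each distinct value is stored once
print(as_category.cat.codes.dtype)          # small integer codes point to the categories
```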
Ordered categories
To create ordered categories, you need to define your own pandas.CategoricalDtype:
[23]:
from pandas.api.types import CategoricalDtype
# Use a separate name so that the built-in sorted() is not overwritten
sorted_values = pd.Series(sorted(set(s)))
cat_dtype = CategoricalDtype(categories=sorted_values, ordered=True)
s.astype(cat_dtype)
[23]:
0 -0.438454
1 1.494909
2 -0.103494
3 -1.029965
4 0.995336
5 0.743191
6 -0.214517
dtype: category
Categories (7, float64): [-1.029965 < -0.438454 < -0.214517 < -0.103494 < 0.743191 < 0.995336 < 1.494909]
[24]:
s.astype(cat_dtype).memory_usage(deep=True)
[24]:
495
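What the ordering makes possible is easier to see with labels than with random floats. The following sketch uses hypothetical size labels. The names size_dtype and sizes are assumptions for illustration only:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical categories, listed from smallest to largest
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large"]).astype(size_dtype)

print(sizes.max())                   # largest category according to the defined order
print((sizes > "small").tolist())    # comparisons follow the category order
print(sizes.sort_values().tolist())  # sorting follows the category order, not the alphabet
```

Without ordered=True, comparisons such as sizes > "small" would not be defined for a categorical Series.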
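The following table lists the types you can pass to astype.
| Data type | Description |
|---|---|
| str, 'str' | convert to Python string |
| 'string' | convert to pandas string with pandas.NA |
| 'int64' | convert to NumPy int64 |
| 'int32' | convert to NumPy int32 |
| 'Int64' | convert to pandas Int64 |
| float, 'float64' | convert to floats |
| 'category' | convert to CategoricalDtype |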
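As a quick check of how such type specifiers behave, the following sketch converts a small hypothetical Series named values with several of them and collects the resulting dtypes. The selection of specifiers is an assumption for illustration and is not exhaustive:

```python
import pandas as pd

# Hypothetical example data
values = pd.Series([1, 2, 3])

# Map each type specifier to the name of the dtype that astype produces
resulting = {
    spec: str(values.astype(spec).dtype)
    for spec in ["int32", "Int64", "float64", "string", "category"]
}

for spec, dtype_name in resulting.items():
    print(f"{spec!r:12} -> {dtype_name}")
```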
Conversion to other data types
The pandas.Series.to_numpy method or the pandas.Series.values property returns a NumPy array of the values, and pandas.Series.to_list returns a Python list of the values. You will rarely need either: pandas objects are usually much more user-friendly, the code is easier to read, and Python lists are much slower to process. They are mainly useful when you have to pass the data to code that does not accept pandas objects. With pandas.Series.to_frame you can create a DataFrame with a single column, if necessary:
[25]:
s.to_frame()
[25]:
| | 0 |
|---|---|
| 0 | -0.438454 |
| 1 | 1.494909 |
| 2 | -0.103494 |
| 3 | -1.029965 |
| 4 | 0.995336 |
| 5 | 0.743191 |
| 6 | -0.214517 |
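The three conversions can be compared side by side in a short sketch. The Series values and the column name "value" are hypothetical and only serve as an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
values = pd.Series([1.5, 2.5, 3.5])

array = values.to_numpy()              # NumPy array of the values
items = values.to_list()               # plain Python list of the values
frame = values.to_frame(name="value")  # DataFrame with one named column

print(type(array).__name__, array.dtype)
print(items)
print(frame.columns.tolist(), frame.shape)
```

Passing name to to_frame avoids the default column label 0 shown above.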
The function pandas.to_datetime can also be useful to convert values in pandas to date and time.
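A minimal sketch of such a conversion, assuming a hypothetical Series named raw with ISO-formatted date strings and one invalid entry. With errors="coerce", values that cannot be parsed become NaT instead of raising an error:

```python
import pandas as pd

# Hypothetical example data: two valid dates and one value that is not a date
raw = pd.Series(["2024-01-31", "2024-02-29", "not a date"])

dates = pd.to_datetime(raw, errors="coerce")

print(dates)
print(dates.isna().tolist())      # marks the entry that could not be parsed
print(dates.dt.day_name().tolist())
```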