{ "cells": [ { "cell_type": "markdown", "id": "73bd3476", "metadata": {}, "source": [ "# Convert `dtype`\n", "\n", "Sometimes the pandas data types do not fit really well. This can be due to serialisation formats that do not contain type information, for example. However, sometimes you should also change the type to achieve better performance – either more manipulation possibilities or less memory requirements. In the following examples, we will make different conversions of a `Series`:" ] }, { "cell_type": "code", "execution_count": 1, "id": "69abe2d3", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "c2a35bac", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng()\n", "s = pd.Series(rng.normal(size=7))" ] }, { "cell_type": "code", "execution_count": 3, "id": "98ffcd09", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 0.684464\n", "1 -0.521004\n", "2 0.520805\n", "3 0.086337\n", "4 -1.427918\n", "5 0.096508\n", "6 -0.131822\n", "dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "markdown", "id": "ab32810f", "metadata": {}, "source": [ "## Automatic conversion\n", "\n", "[pandas.Series.convert_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.Series.convert_dtypes.html) tries to convert a `Series` to a type that supports `NA`. In the case of our `Series`, the type is changed from `float64` to `Float64`:" ] }, { "cell_type": "code", "execution_count": 4, "id": "c9738255", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.684464\n", "1 -0.521004\n", "2 0.520805\n", "3 0.086337\n", "4 -1.427918\n", "5 0.096508\n", "6 -0.131822\n", "dtype: Float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.convert_dtypes()" ] }, { "cell_type": "markdown", "id": "70ef14c8", "metadata": {}, "source": [ "Unfortunately, however, with `convert_dtypes` I have little control over what data type is converted to. Therefore, I prefer [pandas.Series.astype](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html):" ] }, { "cell_type": "code", "execution_count": 5, "id": "4c6de2eb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.684464\n", "1 -0.521004\n", "2 0.520805\n", "3 0.086337\n", "4 -1.427918\n", "5 0.096508\n", "6 -0.131822\n", "dtype: Float32" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(\"Float32\")" ] }, { "cell_type": "markdown", "id": "b725610a", "metadata": {}, "source": [ "Using the correct type can save memory. The usual data type is 8 bytes wide, for example `int64` or `float64`. If you can use a narrower type, this will significantly reduce memory consumption, allowing you to process more data. You can use NumPy to check the limits of integer and float types:" ] }, { "cell_type": "code", "execution_count": 6, "id": "f9e051ef", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.iinfo(\"int64\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "add9af88", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.finfo(\"float32\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "1a5e1e44", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.finfo(\"float64\")" ] }, { "cell_type": "markdown", "id": "ea015364", "metadata": {}, "source": [ "## Memory usage\n", "\n", "To calculate the memory consumption of the `Series`, you can use [pandas.Series.nbytes](https://pandas.pydata.org/docs/reference/api/pandas.Series.nbytes.html) to determine the memory used by the data. [pandas.Series.memory_usage](https://pandas.pydata.org/docs/reference/api/pandas.Series.memory_usage.html) also records the index memory and the data type. With `deep=True` you can also determine the memory consumption at system level." ] }, { "cell_type": "code", "execution_count": 9, "id": "e17aa411", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "56" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.nbytes" ] }, { "cell_type": "code", "execution_count": 10, "id": "2a73df20", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "35" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(\"Float32\").nbytes" ] }, { "cell_type": "code", "execution_count": 11, "id": "3b3c53d5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "188" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.memory_usage()" ] }, { "cell_type": "code", "execution_count": 12, "id": "f1add8c4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "167" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(\"Float32\").memory_usage()" ] }, { "cell_type": "code", "execution_count": 13, "id": "91108fce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "188" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "id": "7859b4e2", "metadata": {}, "source": [ "## String and category types\n", "\n", "The [pandas.Series.astype](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) method can also convert numeric series into strings if you pass `str`. Note the `dtype` in the following example:" ] }, { "cell_type": "code", "execution_count": 14, "id": "98fff9a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.6844642165885412\n", "1 -0.5210038291133791\n", "2 0.5208054433837618\n", "3 0.08633690916390123\n", "4 -1.4279177804659373\n", "5 0.0965079370150347\n", "6 -0.1318224826843188\n", "dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(str)" ] }, { "cell_type": "code", "execution_count": 15, "id": "0c4a2c37", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "188" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(str).memory_usage()" ] }, { "cell_type": "code", "execution_count": 16, "id": "f72c5dec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "605" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(str).memory_usage(deep=True)" ] }, { "cell_type": "markdown", "id": "aee2d501", "metadata": {}, "source": [ "To convert to a categorical type, you can pass `'category'` as the type:" ] }, { "cell_type": "code", "execution_count": 17, "id": "c1dcfcfb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.6844642165885412\n", "1 -0.5210038291133791\n", "2 0.5208054433837618\n", "3 0.08633690916390123\n", "4 -1.4279177804659373\n", "5 0.0965079370150347\n", "6 -0.1318224826843188\n", "dtype: category\n", "Categories (7, object): ['-0.1318224826843188', '-0.5210038291133791', '-1.4279177804659373', '0.08633690916390123', '0.0965079370150347', '0.5208054433837618', '0.6844642165885412']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(str).astype(\"category\")" ] }, { "cell_type": "markdown", "id": "332aecb0", "metadata": {}, "source": [ "A categorical `Series` is useful for string data and can lead to large memory savings. This is because when converting to categorical data, pandas no longer uses Python strings for each value, but repeating values are not duplicated. You still have all the features of the `str` attribute, but you save a lot of memory when you have a lot of duplicate values and you increase performance because you don’t have to do as many string operations." ] }, { "cell_type": "code", "execution_count": 18, "id": "65a2aecb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "495" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(\"category\").memory_usage(deep=True)" ] }, { "cell_type": "markdown", "id": "14ebb412", "metadata": {}, "source": [ "## Ordered categories\n", "\n", "To create ordered categories, you need to define your own [pandas.CategoricalDtype](https://pandas.pydata.org/docs/reference/api/pandas.CategoricalDtype.html):" ] }, { "cell_type": "code", "execution_count": 19, "id": "fe7c824b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.684464\n", "1 -0.521004\n", "2 0.520805\n", "3 0.086337\n", "4 -1.427918\n", "5 0.096508\n", "6 -0.131822\n", "dtype: category\n", "Categories (7, float64): [-1.427918 < -0.521004 < -0.131822 < 0.086337 < 0.096508 < 0.520805 < 0.684464]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas.api.types import CategoricalDtype\n", "\n", "\n", "sorted = pd.Series(sorted(set(s)))\n", "cat_dtype = CategoricalDtype(categories=sorted, ordered=True)\n", "\n", "s.astype(cat_dtype)" ] }, { "cell_type": "code", "execution_count": 20, "id": "ac5641a5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "495" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.astype(cat_dtype).memory_usage(deep=True)" ] }, { "cell_type": "markdown", "id": "cac146b8", "metadata": {}, "source": [ "The following table lists the types you can pass to `astype`.\n", "\n", "Data type | Description\n", ":-------- | :----------\n", "`str`, `'str'` | convert to Python string\n", "`'string'` | convert to Pandas string with `pandas.NA`\n", "`int`, `'int'`, `'int64'` | convert to NumPy `int64`\n", "`'int32'`, `'uint32'` | convert to NumPy `int32`\n", "`'Int64'` | convert to pandas `Int64` with `pandas.NA`\n", "`float`, `'float'`, `'float64'` | convert to floats\n", "`'category'` | convert to `CategoricalDtype` with `pandas.NA`" ] }, { "cell_type": "markdown", "id": "5bab7d2e", "metadata": {}, "source": [ "## Conversion to other data types\n", "\n", "The [pandas.Series.to_numpy](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_numpy.html) method or the [pandas.Series.values](https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html) property gives us a NumPy array of values, and [pandas.Series.to_list](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html) returns a Python list of values. Why would you want to do this? pandas objects are usually much more user-friendly and the code is easier to read. Also, python lists will be much slower to process. With [pandas.Series.to_frame](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html) you can create a DataFrame with a single column, if necessary:" ] }, { "cell_type": "code", "execution_count": 21, "id": "de6e2fa8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
00.684464
1-0.521004
20.520805
30.086337
4-1.427918
50.096508
6-0.131822
\n", "
" ], "text/plain": [ " 0\n", "0 0.684464\n", "1 -0.521004\n", "2 0.520805\n", "3 0.086337\n", "4 -1.427918\n", "5 0.096508\n", "6 -0.131822" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.to_frame()" ] }, { "cell_type": "markdown", "id": "eaff05be", "metadata": {}, "source": [ "The function [pandas.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) can also be useful to convert values in pandas to date and time." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }