{ "cells": [ { "cell_type": "markdown", "id": "7543a862", "metadata": {}, "source": [ "# Select and filter data\n", "\n", "Indexing series `(obj[...])` works analogously to indexing NumPy arrays, except that you can use index values of the series instead of just integers. Here are some examples:" ] }, { "cell_type": "code", "execution_count": 1, "id": "5359176d", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "339875bb", "metadata": {}, "outputs": [], "source": [ "idx = pd.date_range(\"2022-02-02\", periods=7)\n", "rng = np.random.default_rng()\n", "s = pd.Series(rng.normal(size=7), index=idx)" ] }, { "cell_type": "code", "execution_count": 3, "id": "ac21c059", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-02 0.852492\n", "2022-02-03 2.073257\n", "2022-02-04 -0.391827\n", "2022-02-05 -1.288372\n", "2022-02-06 0.238623\n", "2022-02-07 1.752573\n", "2022-02-08 -0.155582\n", "Freq: D, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "code", "execution_count": 4, "id": "642b2864", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(2.073256834089322)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[\"2022-02-03\"]" ] }, { "cell_type": "code", "execution_count": 5, "id": "e1b90b3e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(2.073256834089322)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.iloc[1]" ] }, { "cell_type": "code", "execution_count": 6, "id": "ccc10714", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-04 -0.391827\n", "2022-02-05 -1.288372\n", "Freq: D, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ " s[2:4]" ] }, { "cell_type": "code", "execution_count": 7, "id": "0d737dcc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-04 -0.391827\n", "2022-02-03 2.073257\n", "2022-02-02 0.852492\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[[\"2022-02-04\", \"2022-02-03\", \"2022-02-02\"]]" ] }, { "cell_type": "code", "execution_count": 8, "id": "003e3a80", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-03 2.073257\n", "2022-02-05 -1.288372\n", "Freq: 2D, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.iloc[[1, 3]]" ] }, { "cell_type": "code", "execution_count": 9, "id": "83376cb4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-02 0.852492\n", "2022-02-03 2.073257\n", "2022-02-06 0.238623\n", "2022-02-07 1.752573\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[s > 0]" ] }, { "cell_type": "markdown", "id": "295647c8", "metadata": {}, "source": [ "While you can select data by label in this way, the preferred method for selecting index values is the `loc` operator:" ] }, { "cell_type": "code", "execution_count": 10, "id": "be9a6c3a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-04 -0.391827\n", "2022-02-03 2.073257\n", "2022-02-02 0.852492\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.loc[[\"2022-02-04\", \"2022-02-03\", \"2022-02-02\"]]" ] }, { "cell_type": "markdown", "id": "50cc31af", "metadata": {}, "source": [ "The reason for the preference for `loc` is the different treatment of integers when indexing with `[]`. In regular `[]`-based indexing, integers are treated as labels if the index contains integers, so the behaviour varies depending on the data type of the index. In our example, the expression `s.loc[[3, 2, 1]]` will fail because the index does not contain integers:" ] }, { "cell_type": "code", "execution_count": 11, "id": "94057262", "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "\"None of [Index([3, 2, 1], dtype='int64')] are in the [index]\"", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43ms\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mloc\u001b[49m\u001b[43m[\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m3\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1191\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1189\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m com\u001b[38;5;241m.\u001b[39mapply_if_callable(key, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj)\n\u001b[1;32m 1190\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_deprecated_callable_usage(key, maybe_callable)\n\u001b[0;32m-> 1191\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmaybe_callable\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1420\u001b[0m, in \u001b[0;36m_LocIndexer._getitem_axis\u001b[0;34m(self, key, axis)\u001b[0m\n\u001b[1;32m 1417\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(key, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mndim\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m key\u001b[38;5;241m.\u001b[39mndim \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 1418\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot index with multidimensional key\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m-> 1420\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_iterable\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1422\u001b[0m \u001b[38;5;66;03m# nested tuple slicing\u001b[39;00m\n\u001b[1;32m 1423\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_nested_tuple(key, labels):\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1360\u001b[0m, in \u001b[0;36m_LocIndexer._getitem_iterable\u001b[0;34m(self, key, axis)\u001b[0m\n\u001b[1;32m 1357\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_key(key, axis)\n\u001b[1;32m 1359\u001b[0m \u001b[38;5;66;03m# A collection of keys\u001b[39;00m\n\u001b[0;32m-> 1360\u001b[0m keyarr, indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_listlike_indexer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1361\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_reindex_with_indexers(\n\u001b[1;32m 1362\u001b[0m {axis: [keyarr, indexer]}, copy\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m, allow_dups\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m 1363\u001b[0m )\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/pandas/core/indexing.py:1558\u001b[0m, in \u001b[0;36m_LocIndexer._get_listlike_indexer\u001b[0;34m(self, key, axis)\u001b[0m\n\u001b[1;32m 1555\u001b[0m ax \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_axis(axis)\n\u001b[1;32m 1556\u001b[0m axis_name \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_axis_name(axis)\n\u001b[0;32m-> 1558\u001b[0m keyarr, indexer \u001b[38;5;241m=\u001b[39m \u001b[43max\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_indexer_strict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis_name\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1560\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m keyarr, indexer\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6200\u001b[0m, in \u001b[0;36mIndex._get_indexer_strict\u001b[0;34m(self, key, axis_name)\u001b[0m\n\u001b[1;32m 6197\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 6198\u001b[0m keyarr, indexer, new_indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_reindex_non_unique(keyarr)\n\u001b[0;32m-> 6200\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raise_if_missing\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkeyarr\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindexer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis_name\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 6202\u001b[0m keyarr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtake(indexer)\n\u001b[1;32m 6203\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(key, Index):\n\u001b[1;32m 6204\u001b[0m \u001b[38;5;66;03m# GH 42790 - Preserve name from an Index\u001b[39;00m\n", "File \u001b[0;32m~/sandbox/py313/.venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6249\u001b[0m, in \u001b[0;36mIndex._raise_if_missing\u001b[0;34m(self, key, indexer, axis_name)\u001b[0m\n\u001b[1;32m 6247\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m nmissing:\n\u001b[1;32m 6248\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m nmissing \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mlen\u001b[39m(indexer):\n\u001b[0;32m-> 6249\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNone of [\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mkey\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m] are in the [\u001b[39m\u001b[38;5;132;01m{\u001b[39;00maxis_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m]\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 6251\u001b[0m not_found \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(ensure_index(key)[missing_mask\u001b[38;5;241m.\u001b[39mnonzero()[\u001b[38;5;241m0\u001b[39m]]\u001b[38;5;241m.\u001b[39munique())\n\u001b[1;32m 6252\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mnot_found\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m not in index\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", "\u001b[0;31mKeyError\u001b[0m: \"None of [Index([3, 2, 1], dtype='int64')] are in the [index]\"" ] } ], "source": [ "s.loc[[3, 2, 1]]" ] }, { "cell_type": "markdown", "id": "390c0838", "metadata": {}, "source": [ "While the `loc` operator exclusively indexes labels, the `iloc` operator exclusively indexes with integers:" ] }, { "cell_type": "code", "execution_count": 12, "id": "263b4ddf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-05 -1.288372\n", "2022-02-04 -0.391827\n", "2022-02-03 2.073257\n", "Freq: -1D, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.iloc[[3, 2, 1]]" ] }, { "cell_type": "markdown", "id": "dbe97dd5", "metadata": {}, "source": [ "You can also slice with labels, but this works differently from normal Python slicing because the endpoint is included:" ] }, { "cell_type": "code", "execution_count": 13, "id": "3d21b345", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-03 2.073257\n", "2022-02-04 -0.391827\n", "Freq: D, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.loc[\"2022-02-03\":\"2022-02-04\"]" ] }, { "cell_type": "markdown", "id": "99deca60", "metadata": {}, "source": [ "Setting with these methods changes the corresponding section of the row:" ] }, { "cell_type": "code", "execution_count": 14, "id": "18c03d20", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-02-02 0.852492\n", "2022-02-03 0.000000\n", "2022-02-04 0.000000\n", "2022-02-05 -1.288372\n", "2022-02-06 0.238623\n", "2022-02-07 1.752573\n", "2022-02-08 -0.155582\n", "Freq: D, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.loc[\"2022-02-03\":\"2022-02-04\"] = 0\n", "\n", "s" ] }, { "cell_type": "markdown", "id": "09a14ccb", "metadata": {}, "source": [ "Indexing in a DataFrame is used to retrieve one or more columns with either a single value or a sequence:" ] }, { "cell_type": "code", "execution_count": 15, "id": "3f40a46f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DecimalOctalKey
Code
U+00000001NUL
U+00011002Ctrl-A
U+00022003Ctrl-B
U+00033004Ctrl-C
U+00044004Ctrl-D
U+00055005Ctrl-E
\n", "
" ], "text/plain": [ " Decimal Octal Key\n", "Code \n", "U+0000 0 001 NUL\n", "U+0001 1 002 Ctrl-A\n", "U+0002 2 003 Ctrl-B\n", "U+0003 3 004 Ctrl-C\n", "U+0004 4 004 Ctrl-D\n", "U+0005 5 005 Ctrl-E" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {\n", " \"Code\": [\"U+0000\", \"U+0001\", \"U+0002\", \"U+0003\", \"U+0004\", \"U+0005\"],\n", " \"Decimal\": [0, 1, 2, 3, 4, 5],\n", " \"Octal\": [\"001\", \"002\", \"003\", \"004\", \"004\", \"005\"],\n", " \"Key\": [\"NUL\", \"Ctrl-A\", \"Ctrl-B\", \"Ctrl-C\", \"Ctrl-D\", \"Ctrl-E\"],\n", "}\n", "\n", "df = pd.DataFrame(data)\n", "df = pd.DataFrame(data, columns=[\"Decimal\", \"Octal\", \"Key\"], index=df[\"Code\"])\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 16, "id": "ff8881ff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Code\n", "U+0000 NUL\n", "U+0001 Ctrl-A\n", "U+0002 Ctrl-B\n", "U+0003 Ctrl-C\n", "U+0004 Ctrl-D\n", "U+0005 Ctrl-E\n", "Name: Key, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Key\"]" ] }, { "cell_type": "code", "execution_count": 17, "id": "6701d732", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DecimalKey
Code
U+00000NUL
U+00011Ctrl-A
U+00022Ctrl-B
U+00033Ctrl-C
U+00044Ctrl-D
U+00055Ctrl-E
\n", "
" ], "text/plain": [ " Decimal Key\n", "Code \n", "U+0000 0 NUL\n", "U+0001 1 Ctrl-A\n", "U+0002 2 Ctrl-B\n", "U+0003 3 Ctrl-C\n", "U+0004 4 Ctrl-D\n", "U+0005 5 Ctrl-E" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[[\"Decimal\", \"Key\"]]" ] }, { "cell_type": "code", "execution_count": 18, "id": "e6af9bbe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DecimalOctalKey
Code
U+00000001NUL
U+00011002Ctrl-A
\n", "
" ], "text/plain": [ " Decimal Octal Key\n", "Code \n", "U+0000 0 001 NUL\n", "U+0001 1 002 Ctrl-A" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[:2]" ] }, { "cell_type": "code", "execution_count": 19, "id": "9d0deda5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DecimalOctalKey
Code
U+00033004Ctrl-C
U+00044004Ctrl-D
U+00055005Ctrl-E
\n", "
" ], "text/plain": [ " Decimal Octal Key\n", "Code \n", "U+0003 3 004 Ctrl-C\n", "U+0004 4 004 Ctrl-D\n", "U+0005 5 005 Ctrl-E" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Decimal\"] > 2]" ] }, { "cell_type": "markdown", "id": "cbd2174b", "metadata": {}, "source": [ "The line selection syntax `df[:2]` is provided for convenience. Passing a single item or a list to the `[]` operator selects columns." ] }, { "cell_type": "markdown", "id": "0c4c64e0", "metadata": {}, "source": [ "Another use case is indexing with a Boolean DataFrame, which is generated by a scalar comparison, for example:" ] }, { "cell_type": "code", "execution_count": 20, "id": "b839beb6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Code\n", "U+0000 False\n", "U+0001 False\n", "U+0002 False\n", "U+0003 True\n", "U+0004 True\n", "U+0005 True\n", "Name: Decimal, dtype: bool" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Decimal\"] > 2" ] }, { "cell_type": "code", "execution_count": 21, "id": "0b7eac34-07cd-4551-bcef-58e2a18e5112", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/hk/s8m0bblj0g10hw885gld52mc0000gn/T/ipykernel_90739/3186310461.py:1: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'NA' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n", " df[df[\"Decimal\"] > 2] = \"NA\"\n" ] } ], "source": [ "df[df[\"Decimal\"] > 2] = \"NA\"" ] }, { "cell_type": "markdown", "id": "f93d6130", "metadata": {}, "source": [ "Like Series, DataFrame has special operators `loc` and `iloc` for label-based and integer indexing respectively. Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`)." ] }, { "cell_type": "code", "execution_count": 22, "id": "435668de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Decimal 2\n", "Key Ctrl-B\n", "Name: U+0002, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\"U+0002\", [\"Decimal\", \"Key\"]]" ] }, { "cell_type": "code", "execution_count": 23, "id": "8483cc54", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OctalKey
Code
U+0002003Ctrl-B
\n", "
" ], "text/plain": [ " Octal Key\n", "Code \n", "U+0002 003 Ctrl-B" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[[2], [1, 2]]" ] }, { "cell_type": "code", "execution_count": 24, "id": "b4cf28cd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OctalKey
Code
U+0000001NUL
U+0001002Ctrl-A
\n", "
" ], "text/plain": [ " Octal Key\n", "Code \n", "U+0000 001 NUL\n", "U+0001 002 Ctrl-A" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[[0, 1], [1, 2]]" ] }, { "cell_type": "markdown", "id": "5f407c49", "metadata": {}, "source": [ "Both indexing functions work with slices in addition to individual labels or lists of labels:" ] }, { "cell_type": "code", "execution_count": 25, "id": "90ca1775", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Code\n", "U+0000 NUL\n", "U+0001 Ctrl-A\n", "U+0002 Ctrl-B\n", "U+0003 NA\n", "Name: Key, dtype: object" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:\"U+0003\", \"Key\"]" ] }, { "cell_type": "code", "execution_count": 26, "id": "d9e78d4a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DecimalOctalKey
Code
U+00000001NUL
U+00011002Ctrl-A
U+00022003Ctrl-B
\n", "
" ], "text/plain": [ " Decimal Octal Key\n", "Code \n", "U+0000 0 001 NUL\n", "U+0001 1 002 Ctrl-A\n", "U+0002 2 003 Ctrl-B" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:3, :3]" ] }, { "cell_type": "markdown", "id": "934c211c", "metadata": {}, "source": [ "So there are many ways to select and rearrange the data contained in a pandas object. In the following, I put together a brief summary of most of these possibilities for DataFrames:\n", "\n", "Type | Note\n", ":--- | :---\n", "`df[LABEL]` | selects a single column or a sequence of columns from the DataFrame\n", "`df.loc[LABEL]` | selects a single row or a subset of rows from the DataFrame by label\n", "`df.loc[:, LABEL]` | selects a single column or a subset of columns from the DataFrame by Label\n", "`df.loc[LABEL1, LABEL2]` | selects both rows and columns by label\n", "`df.iloc[INTEGER]` | selects a single row or a subset of rows from the DataFrame by integer position\n", "`df.iloc[INTEGER1, INTEGER2]` | selects a single column or a subset of columns by integer position\n", "`df.at[LABEL1, LABEL2]` | selects a single value by row and column label\n", "`df.iat[INTEGER1, INTEGER2]` | selects a scalar value by row and column position (integers)\n", "`reindex NEW_INDEX` | selects rows or columns by label\n", "`get_value`, `set_value` | deprecated since version 0.21.0: use `.at[]` or `.iat[]` instead." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }