{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5553e7a5",
   "metadata": {},
   "source": [
    "# CSV example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "06a5d3d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b682a0d",
   "metadata": {},
   "source": [
    "After importing pandas, we first read a CSV file with `read_csv`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "de19c7bd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Python basics</th>\n",
       "      <th>en</th>\n",
       "      <th>Veit Schiele</th>\n",
       "      <th>BSD-3-Clause</th>\n",
       "      <th>2021-10-28</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Python basics  en  Veit Schiele  BSD-3-Clause  2021-10-28\n",
       "0  Jupyter Tutorial  en  Veit Schiele  BSD-3-Clause  2019-06-27\n",
       "1  Jupyter Tutorial  de  Veit Schiele  BSD-3-Clause  2020-10-26\n",
       "2    PyViz Tutorial  en  Veit Schiele  BSD-3-Clause  2020-04-13"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c87274f",
   "metadata": {},
   "source": [
    "As you can see, this file has no header row, so pandas has used the first data row as the column names. To give the DataFrame a proper header, you have several options. You can let pandas assign default column names with `header=None`, or you can define the names yourself with `names`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a46a20c2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  0   1             2             3           4\n",
       "0     Python basics  en  Veit Schiele  BSD-3-Clause  2021-10-28\n",
       "1  Jupyter Tutorial  en  Veit Schiele  BSD-3-Clause  2019-06-27\n",
       "2  Jupyter Tutorial  de  Veit Schiele  BSD-3-Clause  2020-10-26\n",
       "3    PyViz Tutorial  en  Veit Schiele  BSD-3-Clause  2020-04-13"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    header=None,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "46b04f42",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>Authors</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Title Language       Authors       License Publication date\n",
       "0     Python basics       en  Veit Schiele  BSD-3-Clause       2021-10-28\n",
       "1  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause       2019-06-27\n",
       "2  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause       2020-10-26\n",
       "3    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause       2020-04-13"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    names=[\"Title\", \"Language\", \"Authors\", \"License\", \"Publication date\"],\n",
    ")"
   ]
  },
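  {
   "cell_type": "markdown",
   "id": "7a1e0c01",
   "metadata": {},
   "source": [
    "The three calls above can also be compared without network access. The following is a minimal sketch that assumes a small in-memory sample (the hypothetical variable `csv_text`, wrapped in `io.StringIO`) in place of the `books.csv` URL; it counts the rows and lists the column labels that each variant produces:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e0c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical two-row sample without a header row (stand-in for books.csv)\n",
    "csv_text = (\n",
    "    \"Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28\\n\"\n",
    "    \"Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27\\n\"\n",
    ")\n",
    "columns = [\"Title\", \"Language\", \"Authors\", \"License\", \"Publication date\"]\n",
    "\n",
    "# Default call: the first data row is consumed as the header\n",
    "default_df = pd.read_csv(io.StringIO(csv_text))\n",
    "# header=None: pandas assigns integer column labels and keeps every row\n",
    "numbered_df = pd.read_csv(io.StringIO(csv_text), header=None)\n",
    "# names=...: explicit column labels, every row is kept as data\n",
    "named_df = pd.read_csv(io.StringIO(csv_text), names=columns)\n",
    "\n",
    "print(len(default_df), len(numbered_df), len(named_df))\n",
    "print(list(numbered_df.columns))\n",
    "print(list(named_df.columns))"
   ]
  },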
  {
   "cell_type": "markdown",
   "id": "d50206d4",
   "metadata": {},
   "source": [
    "Suppose you want the `Authors` column to be the index of the returned DataFrame. With the `index_col` argument, you can specify this column either by its zero-based position, `2`, or by its name, `Authors`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "15179ece",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Authors</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Veit Schiele</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Veit Schiele</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Veit Schiele</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Veit Schiele</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         Title Language       License Publication date\n",
       "Authors                                                               \n",
       "Veit Schiele     Python basics       en  BSD-3-Clause       2021-10-28\n",
       "Veit Schiele  Jupyter Tutorial       en  BSD-3-Clause       2019-06-27\n",
       "Veit Schiele  Jupyter Tutorial       de  BSD-3-Clause       2020-10-26\n",
       "Veit Schiele    PyViz Tutorial       en  BSD-3-Clause       2020-04-13"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    index_col=[\"Authors\"],\n",
    "    names=[\"Title\", \"Language\", \"Authors\", \"License\", \"Publication date\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "696f87c9",
   "metadata": {},
   "source": [
    "If you want to build a hierarchical index from several columns, pass a list of column numbers or names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fd3e9130",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Language</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Authors</th>\n",
       "      <th>Title</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Veit Schiele</th>\n",
       "      <th>Python basics</th>\n",
       "      <td>en</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jupyter Tutorial</th>\n",
       "      <td>en</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jupyter Tutorial</th>\n",
       "      <td>de</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PyViz Tutorial</th>\n",
       "      <td>en</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              Language       License Publication date\n",
       "Authors      Title                                                   \n",
       "Veit Schiele Python basics          en  BSD-3-Clause       2021-10-28\n",
       "             Jupyter Tutorial       en  BSD-3-Clause       2019-06-27\n",
       "             Jupyter Tutorial       de  BSD-3-Clause       2020-10-26\n",
       "             PyViz Tutorial         en  BSD-3-Clause       2020-04-13"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    index_col=[2, 0],\n",
    "    names=[\"Title\", \"Language\", \"Authors\", \"License\", \"Publication date\"],\n",
    ")"
   ]
  },
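  {
   "cell_type": "markdown",
   "id": "7a1e0c03",
   "metadata": {},
   "source": [
    "To check that a position and a name select the same index column, the following minimal sketch again assumes a hypothetical in-memory sample (`csv_text`) instead of the `books.csv` URL. It builds the index once by position and once by name, and then creates a hierarchical index from two column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e0c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample without a header row (stand-in for books.csv)\n",
    "csv_text = (\n",
    "    \"Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28\\n\"\n",
    "    \"Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26\\n\"\n",
    ")\n",
    "columns = [\"Title\", \"Language\", \"Authors\", \"License\", \"Publication date\"]\n",
    "\n",
    "# Authors is the third column, so its zero-based position is 2\n",
    "by_position = pd.read_csv(io.StringIO(csv_text), names=columns, index_col=2)\n",
    "by_name = pd.read_csv(io.StringIO(csv_text), names=columns, index_col=\"Authors\")\n",
    "print(by_position.equals(by_name))\n",
    "\n",
    "# Two columns produce a hierarchical index (MultiIndex)\n",
    "hierarchical = pd.read_csv(\n",
    "    io.StringIO(csv_text), names=columns, index_col=[\"Authors\", \"Title\"]\n",
    ")\n",
    "print(list(hierarchical.index.names))"
   ]
  },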
  {
   "cell_type": "markdown",
   "id": "3d5b438d",
   "metadata": {},
   "source": [
    "In some cases, a table does not have a fixed separator, but uses several spaces or some other pattern to separate fields. Suppose a file looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "47d8f436",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['   Title             Language  Authors       License       Publication date\\n',\n",
       " '1  Python basics     en        Veit Schiele  BSD-3-Clause  2021-10-28\\n',\n",
       " '2  Jupyter Tutorial  en        Veit Schiele  BSD-3-Clause  2019-06-27\\n',\n",
       " '3  Jupyter Tutorial  de        Veit Schiele  BSD-3-Clause  2020-10-26\\n',\n",
       " '4  PyViz Tutorial    en        Veit Schiele  BSD-3-Clause  2020-04-13\\n']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(open(\"books.txt\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b46eb37",
   "metadata": {},
   "source": [
    "In such cases, you can pass a regular expression as the separator to `read_csv`. Here, the fields are separated by two or more whitespace characters, which the regular expression `\\s\\s+` expresses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2fa3eb87",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>Authors</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Title Language       Authors       License Publication date\n",
       "1     Python basics       en  Veit Schiele  BSD-3-Clause       2021-10-28\n",
       "2  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause       2019-06-27\n",
       "3  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause       2020-10-26\n",
       "4    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause       2020-04-13"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"books.txt\", sep=r\"\\s\\s+\", engine=\"python\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41021e4d",
   "metadata": {},
   "source": [
    "Because the header row contains one fewer field than the data rows, `read_csv` infers that the first column should be the index of the DataFrame."
   ]
  },
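  {
   "cell_type": "markdown",
   "id": "7a1e0c05",
   "metadata": {},
   "source": [
    "The same parsing can be tried without a local `books.txt`. This minimal sketch assumes a hypothetical in-memory excerpt (`text`) in the layout shown above and prints the column labels and the inferred index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e0c06",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical excerpt in the same layout as books.txt\n",
    "text = (\n",
    "    \"   Title             Language  Authors       License       Publication date\\n\"\n",
    "    \"1  Python basics     en        Veit Schiele  BSD-3-Clause  2021-10-28\\n\"\n",
    "    \"2  Jupyter Tutorial  de        Veit Schiele  BSD-3-Clause  2020-10-26\\n\"\n",
    ")\n",
    "\n",
    "# Runs of two or more whitespace characters separate the fields;\n",
    "# single spaces, as in \"Python basics\", stay inside a value\n",
    "df = pd.read_csv(io.StringIO(text), sep=r\"\\s\\s+\", engine=\"python\")\n",
    "\n",
    "print(list(df.columns))\n",
    "print(list(df.index))"
   ]
  },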
  {
   "cell_type": "markdown",
   "id": "8cef6718",
   "metadata": {},
   "source": [
    "The parser functions have many additional arguments that help you handle the wide variety of irregular file formats that occur. For example, you can skip individual lines of a file with `skiprows`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1f849c65",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>Authors</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Title Language       Authors       License Publication date\n",
       "0     Python basics       en  Veit Schiele  BSD-3-Clause       2021-10-28\n",
       "1  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause       2019-06-27\n",
       "2    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause       2020-04-13"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    skiprows=[2],\n",
    "    names=[\"Title\", \"Language\", \"Authors\", \"License\", \"Publication date\"],\n",
    ")"
   ]
  },
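  {
   "cell_type": "markdown",
   "id": "7a1e0c07",
   "metadata": {},
   "source": [
    "`skiprows` accepts either an integer or a list of line numbers. The following minimal sketch assumes a hypothetical in-memory sample (`csv_text`) with two leading comment lines, which are not part of `books.csv`, to show both forms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e0c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample: two comment lines precede the data\n",
    "csv_text = (\n",
    "    \"# exported from a catalogue\\n\"\n",
    "    \"# fields: title, language\\n\"\n",
    "    \"Python basics,en\\n\"\n",
    "    \"Jupyter Tutorial,en\\n\"\n",
    "    \"Jupyter Tutorial,de\\n\"\n",
    ")\n",
    "columns = [\"Title\", \"Language\"]\n",
    "\n",
    "# An integer skips that many lines at the start of the file\n",
    "first_two = pd.read_csv(io.StringIO(csv_text), skiprows=2, names=columns)\n",
    "# A list skips exactly the given zero-based line numbers\n",
    "selected = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 1, 4], names=columns)\n",
    "\n",
    "print(len(first_two), len(selected))"
   ]
  },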
  {
   "cell_type": "markdown",
   "id": "8163765c",
   "metadata": {},
   "source": [
    "Dealing with missing values is an important and often complicated part of parsing data. Missing data is usually either not present (empty string) or indicated by a placeholder. By default, pandas uses a number of common placeholders, such as `NA` and `NULL`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b7060c22",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>Authors</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "      <th>doi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2021-10-28</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2019-06-27</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-10-26</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>BSD-3-Clause</td>\n",
       "      <td>2020-04-13</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Title Language       Authors       License Publication date  doi\n",
       "0     Python basics       en  Veit Schiele  BSD-3-Clause       2021-10-28  NaN\n",
       "1  Jupyter Tutorial       en  Veit Schiele  BSD-3-Clause       2019-06-27  NaN\n",
       "2  Jupyter Tutorial       de  Veit Schiele  BSD-3-Clause       2020-10-26  NaN\n",
       "3    PyViz Tutorial       en  Veit Schiele  BSD-3-Clause       2020-04-13  NaN"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    names=[\n",
    "        \"Title\",\n",
    "        \"Language\",\n",
    "        \"Authors\",\n",
    "        \"License\",\n",
    "        \"Publication date\",\n",
    "        \"doi\",\n",
    "    ],\n",
    ")\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "0b3dc1bc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>Authors</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "      <th>doi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Title  Language  Authors  License  Publication date   doi\n",
       "0  False     False    False    False             False  True\n",
       "1  False     False    False    False             False  True\n",
       "2  False     False    False    False             False  True\n",
       "3  False     False    False    False             False  True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isna()"
   ]
  },
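  {
   "cell_type": "markdown",
   "id": "7a1e0c09",
   "metadata": {},
   "source": [
    "In the example above, the `doi` values are missing because the file has no sixth field. To see the default placeholders at work, the following minimal sketch assumes a hypothetical in-memory sample (`csv_text`) that marks a missing DOI in three different ways. It also uses the `keep_default_na` argument, which is not covered elsewhere in this notebook, to switch the default placeholders off:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e0c0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample with three ways of marking a missing DOI\n",
    "csv_text = (\n",
    "    \"Title,doi\\n\"\n",
    "    \"Python basics,NA\\n\"\n",
    "    \"Jupyter Tutorial,NULL\\n\"\n",
    "    \"PyViz Tutorial,\\n\"\n",
    ")\n",
    "\n",
    "# By default, NA, NULL and the empty field are all read as missing values\n",
    "df = pd.read_csv(io.StringIO(csv_text))\n",
    "print(df[\"doi\"].isna().all())\n",
    "\n",
    "# keep_default_na=False keeps the raw strings instead\n",
    "raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)\n",
    "print(raw[\"doi\"].tolist())"
   ]
  },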
  {
   "cell_type": "markdown",
   "id": "ec5602d1",
   "metadata": {},
   "source": [
    "The `na_values` option accepts a list or set of strings that should additionally be treated as missing values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "eb355d44",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Title</th>\n",
       "      <th>Language</th>\n",
       "      <th>Authors</th>\n",
       "      <th>License</th>\n",
       "      <th>Publication date</th>\n",
       "      <th>doi</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Python basics</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-10-28</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2019-06-27</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jupyter Tutorial</td>\n",
       "      <td>de</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2020-10-26</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PyViz Tutorial</td>\n",
       "      <td>en</td>\n",
       "      <td>Veit Schiele</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2020-04-13</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Title Language       Authors  License Publication date  doi\n",
       "0     Python basics       en  Veit Schiele      NaN       2021-10-28  NaN\n",
       "1  Jupyter Tutorial       en  Veit Schiele      NaN       2019-06-27  NaN\n",
       "2  Jupyter Tutorial       de  Veit Schiele      NaN       2020-10-26  NaN\n",
       "3    PyViz Tutorial       en  Veit Schiele      NaN       2020-04-13  NaN"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\n",
    "    \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n",
    "    na_values=[\"BSD-3-Clause\"],\n",
    "    names=[\n",
    "        \"Title\",\n",
    "        \"Language\",\n",
    "        \"Authors\",\n",
    "        \"License\",\n",
    "        \"Publication date\",\n",
    "        \"doi\",\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47c9eecb",
   "metadata": {},
   "source": [
    "The most frequently used arguments of the `read_csv` function:\n",
    "\n",
    "Argument | Description\n",
    ":------- | :----------\n",
    "`filepath_or_buffer` | String specifying the location in the file system, a URL or a file-like object\n",
    "`sep` or `delimiter` | String or regular expression to separate the fields in each row\n",
    "`header` | Row number to be used for the column names; default is the first row (`0`) unless `names` is given; should be `None` if there is no header row\n",
    "`index_col` | Column numbers or names to be used as row index in the result; can be a single name/number or a list of them for a hierarchical index\n",
    "`names` | List of column names\n",
    "`skiprows` | Number of rows to be ignored at the beginning of the file or list of row numbers starting at `0` to be skipped\n",
    "`na_values` | Sequence of values to be replaced by NA\n",
    "`comment` | Character that marks the rest of a line as a comment\n",
    "`parse_dates` | Attempt to parse data as datetime; defaults to `False`. If `True`, attempts to parse the index. Otherwise, a list of column numbers or names to parse can be specified. If the list element is a tuple or a list, multiple columns are combined and converted to a date, for example if the date and time are split between two columns (combining columns is deprecated since pandas 2.2)\n",
    "`keep_date_col` | If columns are combined to parse the date, the original columns are kept; default: `False` (deprecated since pandas 2.2)\n",
    "`converters` | Dict mapping column numbers or names to functions, for example `{'Title': f}` would apply the function `f` to all values in the column `Title`\n",
    "`dayfirst` | Treat as an international format when parsing potentially ambiguous dates, for example `28/6/2021` → 28 June 2021; `False` by default\n",
    "`date_parser` | Function to use for parsing dates (deprecated since pandas 2.0 in favour of `date_format`)\n",
    "`nrows` | Number of lines to read from the beginning of the file\n",
    "`iterator` | Return a `TextFileReader` object to read the file piece by piece; this object can also be used with the `with` statement\n",
    "`chunksize` | Number of rows per data block when iterating\n",
    "`skipfooter` | Number of lines to be ignored at the end of the file (not supported by the C engine)\n",
    "`verbose` | Outputs various information about the parser output, for example the number of missing values in non-numeric columns (deprecated since pandas 2.2)\n",
    "`encoding` | Text encoding for Unicode, for example `utf-8` for UTF-8 encoded text\n",
    "`squeeze` | If the parsed data contains only one column, a Series is returned (removed in pandas 2.0; call `.squeeze(\"columns\")` on the result instead)\n",
    "`thousands` | Separator for thousands, for example `,` or `.`"
   ]
  },
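  {
   "cell_type": "markdown",
   "id": "7a1e4c02",
   "metadata": {},
   "source": [
    "Several of these arguments can be combined in one call. The following is only a sketch: the semicolon-separated text, the `missing` marker and the `io.StringIO` buffer are made up for illustration and are not part of the data used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e4c03",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical sample: no header row, \";\" as separator, one comment line\n",
    "raw = io.StringIO(\n",
    "    \"# books without DOI\\n\"\n",
    "    \"Python basics;en;2021-10-28;missing\\n\"\n",
    "    \"Jupyter Tutorial;de;2020-10-26;missing\\n\"\n",
    ")\n",
    "\n",
    "books = pd.read_csv(\n",
    "    raw,\n",
    "    sep=\";\",\n",
    "    header=None,  # the sample has no header row ...\n",
    "    names=[\"Title\", \"Language\", \"Publication date\", \"doi\"],  # ... so we name the columns\n",
    "    comment=\"#\",  # skip everything after a \"#\"\n",
    "    na_values=[\"missing\"],  # treat this marker as NA\n",
    "    parse_dates=[\"Publication date\"],  # convert this column to datetime\n",
    ")\n",
    "books.dtypes"
   ]
  },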
  {
   "cell_type": "markdown",
   "id": "951388e8",
   "metadata": {},
   "source": [
    "## Reading in text files piece by piece\n",
    "\n",
    "If you want to process very large files, you can also read in only a small part of a file or iterate through smaller parts of a file.\n",
    "\n",
    "Before we look at a large file, we reduce the number of lines displayed with `options.display.max_rows`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "e61b2021",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.options.display.max_rows = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "29ff1efa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Mon.</th>\n",
       "      <th>Tues.</th>\n",
       "      <th>Wed.</th>\n",
       "      <th>Thurs.</th>\n",
       "      <th>Fri.</th>\n",
       "      <th>Sat.</th>\n",
       "      <th>Sun.</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1996-01-01</td>\n",
       "      <td>0.129453</td>\n",
       "      <td>-0.023836</td>\n",
       "      <td>1.121460</td>\n",
       "      <td>1.698286</td>\n",
       "      <td>-0.598506</td>\n",
       "      <td>1.042221</td>\n",
       "      <td>-0.726412</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1996-01-02</td>\n",
       "      <td>-0.094021</td>\n",
       "      <td>-0.727942</td>\n",
       "      <td>0.698641</td>\n",
       "      <td>-1.198040</td>\n",
       "      <td>1.927505</td>\n",
       "      <td>1.147445</td>\n",
       "      <td>-1.134103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1996-01-03</td>\n",
       "      <td>-0.560857</td>\n",
       "      <td>0.145222</td>\n",
       "      <td>-0.990202</td>\n",
       "      <td>1.200214</td>\n",
       "      <td>0.717339</td>\n",
       "      <td>1.117095</td>\n",
       "      <td>-1.793565</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1996-01-04</td>\n",
       "      <td>-0.169755</td>\n",
       "      <td>-0.677391</td>\n",
       "      <td>-1.533519</td>\n",
       "      <td>-0.343477</td>\n",
       "      <td>-0.109705</td>\n",
       "      <td>1.038236</td>\n",
       "      <td>-0.799088</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1996-01-05</td>\n",
       "      <td>1.344705</td>\n",
       "      <td>-1.817261</td>\n",
       "      <td>0.460991</td>\n",
       "      <td>-0.839633</td>\n",
       "      <td>0.265814</td>\n",
       "      <td>0.477659</td>\n",
       "      <td>0.636383</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9127</th>\n",
       "      <td>2020-12-27</td>\n",
       "      <td>-0.881800</td>\n",
       "      <td>-0.074270</td>\n",
       "      <td>-0.351769</td>\n",
       "      <td>1.381641</td>\n",
       "      <td>-0.049548</td>\n",
       "      <td>1.664180</td>\n",
       "      <td>-1.032204</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9128</th>\n",
       "      <td>2020-12-28</td>\n",
       "      <td>-0.143386</td>\n",
       "      <td>0.198217</td>\n",
       "      <td>-1.243861</td>\n",
       "      <td>1.196576</td>\n",
       "      <td>1.338166</td>\n",
       "      <td>-0.212333</td>\n",
       "      <td>-0.023131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9129</th>\n",
       "      <td>2020-12-29</td>\n",
       "      <td>0.398787</td>\n",
       "      <td>-0.848786</td>\n",
       "      <td>1.791707</td>\n",
       "      <td>-1.167592</td>\n",
       "      <td>-0.033881</td>\n",
       "      <td>-0.285559</td>\n",
       "      <td>-0.323477</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9130</th>\n",
       "      <td>2020-12-30</td>\n",
       "      <td>0.587846</td>\n",
       "      <td>0.411580</td>\n",
       "      <td>1.150380</td>\n",
       "      <td>0.444638</td>\n",
       "      <td>-1.093577</td>\n",
       "      <td>0.605456</td>\n",
       "      <td>1.463345</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9131</th>\n",
       "      <td>2020-12-31</td>\n",
       "      <td>0.736350</td>\n",
       "      <td>0.436292</td>\n",
       "      <td>-0.260171</td>\n",
       "      <td>-0.066066</td>\n",
       "      <td>-0.328324</td>\n",
       "      <td>-0.586792</td>\n",
       "      <td>-1.204582</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>9132 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            Date      Mon.     Tues.      Wed.    Thurs.      Fri.      Sat.  \\\n",
       "0     1996-01-01  0.129453 -0.023836  1.121460  1.698286 -0.598506  1.042221   \n",
       "1     1996-01-02 -0.094021 -0.727942  0.698641 -1.198040  1.927505  1.147445   \n",
       "2     1996-01-03 -0.560857  0.145222 -0.990202  1.200214  0.717339  1.117095   \n",
       "3     1996-01-04 -0.169755 -0.677391 -1.533519 -0.343477 -0.109705  1.038236   \n",
       "4     1996-01-05  1.344705 -1.817261  0.460991 -0.839633  0.265814  0.477659   \n",
       "...          ...       ...       ...       ...       ...       ...       ...   \n",
       "9127  2020-12-27 -0.881800 -0.074270 -0.351769  1.381641 -0.049548  1.664180   \n",
       "9128  2020-12-28 -0.143386  0.198217 -1.243861  1.196576  1.338166 -0.212333   \n",
       "9129  2020-12-29  0.398787 -0.848786  1.791707 -1.167592 -0.033881 -0.285559   \n",
       "9130  2020-12-30  0.587846  0.411580  1.150380  0.444638 -1.093577  0.605456   \n",
       "9131  2020-12-31  0.736350  0.436292 -0.260171 -0.066066 -0.328324 -0.586792   \n",
       "\n",
       "          Sun.  \n",
       "0    -0.726412  \n",
       "1    -1.134103  \n",
       "2    -1.793565  \n",
       "3    -0.799088  \n",
       "4     0.636383  \n",
       "...        ...  \n",
       "9127 -1.032204  \n",
       "9128 -0.023131  \n",
       "9129 -0.323477  \n",
       "9130  1.463345  \n",
       "9131 -1.204582  \n",
       "\n",
       "[9132 rows x 8 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"example.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64e596d1",
   "metadata": {},
   "source": [
    "If you only want to read a small number of lines (without reading the whole file), you can specify this with `nrows`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "01856e17",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Mon.</th>\n",
       "      <th>Tues.</th>\n",
       "      <th>Wed.</th>\n",
       "      <th>Thurs.</th>\n",
       "      <th>Fri.</th>\n",
       "      <th>Sat.</th>\n",
       "      <th>Sun.</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1996-01-01</td>\n",
       "      <td>0.129453</td>\n",
       "      <td>-0.023836</td>\n",
       "      <td>1.121460</td>\n",
       "      <td>1.698286</td>\n",
       "      <td>-0.598506</td>\n",
       "      <td>1.042221</td>\n",
       "      <td>-0.726412</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1996-01-02</td>\n",
       "      <td>-0.094021</td>\n",
       "      <td>-0.727942</td>\n",
       "      <td>0.698641</td>\n",
       "      <td>-1.198040</td>\n",
       "      <td>1.927505</td>\n",
       "      <td>1.147445</td>\n",
       "      <td>-1.134103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1996-01-03</td>\n",
       "      <td>-0.560857</td>\n",
       "      <td>0.145222</td>\n",
       "      <td>-0.990202</td>\n",
       "      <td>1.200214</td>\n",
       "      <td>0.717339</td>\n",
       "      <td>1.117095</td>\n",
       "      <td>-1.793565</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1996-01-04</td>\n",
       "      <td>-0.169755</td>\n",
       "      <td>-0.677391</td>\n",
       "      <td>-1.533519</td>\n",
       "      <td>-0.343477</td>\n",
       "      <td>-0.109705</td>\n",
       "      <td>1.038236</td>\n",
       "      <td>-0.799088</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1996-01-05</td>\n",
       "      <td>1.344705</td>\n",
       "      <td>-1.817261</td>\n",
       "      <td>0.460991</td>\n",
       "      <td>-0.839633</td>\n",
       "      <td>0.265814</td>\n",
       "      <td>0.477659</td>\n",
       "      <td>0.636383</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1996-01-06</td>\n",
       "      <td>-0.354445</td>\n",
       "      <td>-0.065182</td>\n",
       "      <td>-1.244963</td>\n",
       "      <td>-0.559732</td>\n",
       "      <td>0.042362</td>\n",
       "      <td>-0.303712</td>\n",
       "      <td>0.067632</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1996-01-07</td>\n",
       "      <td>1.460922</td>\n",
       "      <td>0.164412</td>\n",
       "      <td>0.883960</td>\n",
       "      <td>-0.833642</td>\n",
       "      <td>0.001582</td>\n",
       "      <td>1.138469</td>\n",
       "      <td>0.561618</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Date      Mon.     Tues.      Wed.    Thurs.      Fri.      Sat.  \\\n",
       "0  1996-01-01  0.129453 -0.023836  1.121460  1.698286 -0.598506  1.042221   \n",
       "1  1996-01-02 -0.094021 -0.727942  0.698641 -1.198040  1.927505  1.147445   \n",
       "2  1996-01-03 -0.560857  0.145222 -0.990202  1.200214  0.717339  1.117095   \n",
       "3  1996-01-04 -0.169755 -0.677391 -1.533519 -0.343477 -0.109705  1.038236   \n",
       "4  1996-01-05  1.344705 -1.817261  0.460991 -0.839633  0.265814  0.477659   \n",
       "5  1996-01-06 -0.354445 -0.065182 -1.244963 -0.559732  0.042362 -0.303712   \n",
       "6  1996-01-07  1.460922  0.164412  0.883960 -0.833642  0.001582  1.138469   \n",
       "\n",
       "       Sun.  \n",
       "0 -0.726412  \n",
       "1 -1.134103  \n",
       "2 -1.793565  \n",
       "3 -0.799088  \n",
       "4  0.636383  \n",
       "5  0.067632  \n",
       "6  0.561618  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"example.csv\", nrows=7)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "124f2a6a",
   "metadata": {},
   "source": [
    "To read a file piece by piece, you can specify the number of lines with `chunksize`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ce309f8c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.io.parsers.readers.TextFileReader at 0x137d11220>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(\"example.csv\", chunksize=1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c682a02",
   "metadata": {},
   "source": [
    "The `TextFileReader` object returned by `read_csv` allows iteration over parts of the file according to the `chunksize`. For example, we can iterate over the `example.csv` file and aggregate the number of values in the `Date` column as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c11aa475",
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks = pd.read_csv(\"example.csv\", chunksize=1000)\n",
    "\n",
    "# Running total of how often each date occurs, summed over all chunks\n",
    "counts = pd.Series([], dtype=\"float64\")\n",
    "for chunk in chunks:\n",
    "    counts = counts.add(chunk[\"Date\"].value_counts(), fill_value=0)\n",
    "\n",
    "sorted_values = counts.sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "59657986",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Date\n",
       "2020-08-22    1.0\n",
       "2020-11-13    1.0\n",
       "2020-11-27    1.0\n",
       "2020-11-26    1.0\n",
       "2020-11-25    1.0\n",
       "2020-11-24    1.0\n",
       "2020-11-23    1.0\n",
       "2020-11-22    1.0\n",
       "2020-11-21    1.0\n",
       "2020-11-20    1.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted_values[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "915c3c62",
   "metadata": {},
   "source": [
    "`TextFileReader` also has a `get_chunk` method that allows you to read pieces of any size."
   ]
  },
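  {
   "cell_type": "markdown",
   "id": "7a1e4c04",
   "metadata": {},
   "source": [
    "A minimal sketch of `get_chunk`, assuming a small made-up table in an `io.StringIO` buffer instead of `example.csv`. With `iterator=True`, `read_csv` returns the `TextFileReader` without fixing a chunk size in advance, so each call can ask for a different number of rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e4c05",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical data: a header line followed by ten numbered rows\n",
    "raw = io.StringIO(\"n,square\\n\" + \"\".join(f\"{i},{i * i}\\n\" for i in range(10)))\n",
    "\n",
    "with pd.read_csv(raw, iterator=True) as reader:\n",
    "    first = reader.get_chunk(3)  # the first three rows\n",
    "    second = reader.get_chunk(5)  # the next five rows\n",
    "\n",
    "len(first), len(second)"
   ]
  },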
  {
   "cell_type": "markdown",
   "id": "d48072ce",
   "metadata": {},
   "source": [
    "## Write DataFrame and Series as a CSV file\n",
    "\n",
    "Data can also be exported in a comma-separated format. With the method `pandas.DataFrame.to_csv` we can write the data into a comma-separated file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "d88a8271",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv(\"out.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b41a72b7",
   "metadata": {},
   "source": [
    "Other delimiters can be used as well. In the following example we also write to `sys.stdout`, so that the result is printed to the console rather than written to a file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "ec24d29c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "f1271ae2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|Title|Language|Authors|License|Publication date|doi\n",
      "0|Python basics|en|Veit Schiele|BSD-3-Clause|2021-10-28|\n",
      "1|Jupyter Tutorial|en|Veit Schiele|BSD-3-Clause|2019-06-27|\n",
      "2|Jupyter Tutorial|de|Veit Schiele|BSD-3-Clause|2020-10-26|\n",
      "3|PyViz Tutorial|en|Veit Schiele|BSD-3-Clause|2020-04-13|\n"
     ]
    }
   ],
   "source": [
    "df.to_csv(sys.stdout, sep=\"|\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2edd8f82",
   "metadata": {},
   "source": [
    "Missing values appear in the output as empty strings. You may want to mark them with a different placeholder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "455145fe",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ",Title,Language,Authors,License,Publication date,doi\n",
      "0,Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,NaN\n",
      "1,Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,NaN\n",
      "2,Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,NaN\n",
      "3,PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,NaN\n"
     ]
    }
   ],
   "source": [
    "df.to_csv(sys.stdout, na_rep=\"NaN\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57531679",
   "metadata": {},
   "source": [
    "If no other options are given, both the row and column labels are written. Both can be deactivated:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "b599ee64",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,\n",
      "Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,\n",
      "Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,\n",
      "PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,\n"
     ]
    }
   ],
   "source": [
    "df.to_csv(sys.stdout, index=False, header=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "436ca565",
   "metadata": {},
   "source": [
    "You can also write only a subset of the columns, in an order of your choosing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "3625a0fa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Title,Language,Authors,Publication date\n",
      "Python basics,en,Veit Schiele,2021-10-28\n",
      "Jupyter Tutorial,en,Veit Schiele,2019-06-27\n",
      "Jupyter Tutorial,de,Veit Schiele,2020-10-26\n",
      "PyViz Tutorial,en,Veit Schiele,2020-04-13\n"
     ]
    }
   ],
   "source": [
    "df.to_csv(\n",
    "    sys.stdout,\n",
    "    index=False,\n",
    "    columns=[\"Title\", \"Language\", \"Authors\", \"Publication date\"],\n",
    ")"
   ]
  },
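  {
   "cell_type": "markdown",
   "id": "7a1e4c06",
   "metadata": {},
   "source": [
    "A `Series` has a `to_csv` method as well. The following sketch uses a made-up `Series`; the index labels and the header `id` for the index column are only chosen for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e4c07",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical Series; its name becomes the column header\n",
    "languages = pd.Series([\"en\", \"en\", \"de\"], index=[\"a\", \"b\", \"c\"], name=\"Language\")\n",
    "\n",
    "# The index is written as the first column, labelled via index_label\n",
    "languages.to_csv(sys.stdout, index_label=\"id\")"
   ]
  },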
  {
   "cell_type": "markdown",
   "id": "3791334b",
   "metadata": {},
   "source": [
    "## Working with the csv module of Python\n",
    "\n",
    "Most forms of tabular data can be loaded using functions such as `pandas.read_csv`. However, in some cases manual processing may be required. It is not uncommon to receive a file with one or more malformed rows that cause `read_csv` to fail. For any file with a single-character delimiter, you can use Python's built-in [csv](https://docs.python.org/3/library/csv.html) module. To use it, pass an open file or file-like object to `csv.reader`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "bd1ba571",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']\n",
      "['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']\n",
      "['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']\n",
      "['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']\n",
      "['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']\n"
     ]
    }
   ],
   "source": [
    "import csv\n",
    "\n",
    "\n",
    "# newline=\"\" lets the csv module handle line endings itself\n",
    "with open(\"out.csv\", newline=\"\") as f:\n",
    "    reader = csv.reader(f)\n",
    "    for line in reader:\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bc7ee20",
   "metadata": {},
   "source": [
    "### Dialects\n",
    "\n",
    "CSV files come in many different variants. Python's csv module already ships with three dialects:\n",
    "\n",
    "Parameter | excel | excel-tab | unix\n",
    ":--- | :--- | :--- | :---\n",
    "`delimiter` | `','` | `'\\t'` | `','`\n",
    "`quotechar` | `'\"'` | `'\"'` | `'\"'`\n",
    "`doublequote` | `True` | `True` | `True`\n",
    "`skipinitialspace` | `False` | `False` | `False`\n",
    "`lineterminator` | `'\\r\\n'` | `'\\r\\n'` | `'\\n'`\n",
    "`quoting` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_ALL`\n",
    "`escapechar` | `None` | `None` | `None`\n",
    "\n",
    "You can also define your own format with a different delimiter, a different quoting convention or a different line terminator. Registering your own dialect is recommended for this. The options of `csv.register_dialect` are:\n",
    "\n",
    "Argument | Description\n",
    ":------- | :----------\n",
    "`delimiter` | One-character string to separate fields; default value is `,`.\n",
    "`lineterminator` | Line terminator for writing; default value is `\\r\\n`. The reader ignores this and recognises cross-platform line delimiters.\n",
    "`quotechar` | Quote character for fields that contain special characters (such as the delimiter); default is `\"`.\n",
    "`quoting` | Quoting convention. Options include `csv.QUOTE_ALL` – quote all fields, `csv.QUOTE_MINIMAL` – quote only fields with special characters like the delimiter, `csv.QUOTE_NONNUMERIC` – quote all non-numeric fields, and `csv.QUOTE_NONE` – no quotes. The default value is `QUOTE_MINIMAL`.\n",
    "`skipinitialspace` | Ignore spaces after each delimiter; default is `False`.\n",
    "`doublequote` | How to handle the quote character inside a field: if `True` (the default), it is doubled; if `False`, it is prefixed with `escapechar`.\n",
    "`escapechar` | One-character string used to escape the delimiter when `quoting` is set to `csv.QUOTE_NONE`; disabled by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "c6d73a1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "csv.register_dialect(\n",
    "    \"my_csv_dialect\",\n",
    "    lineterminator=\"\\n\",\n",
    "    delimiter=\",\",\n",
    "    quotechar=\"'\",\n",
    "    quoting=csv.QUOTE_MINIMAL,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2633ad3b",
   "metadata": {},
   "source": [
    "Now the CSV file can be opened with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "85ac6d66",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']\n",
      "['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']\n",
      "['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']\n",
      "['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']\n",
      "['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']\n"
     ]
    }
   ],
   "source": [
    "with open(\"out.csv\") as f:\n",
    "    reader = csv.reader(f, \"my_csv_dialect\")\n",
    "    for line in reader:\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "610e6ddf",
   "metadata": {},
   "source": [
    "We can then build a dict of data columns with a [dict comprehension](https://peps.python.org/pep-0274/): `zip(*values)` transposes the rows in `values` into columns, and a second [zip](https://docs.python.org/3/library/functions.html#zip) pairs each column with its header. Note that this requires a lot of memory for large files, as all rows are read and converted into columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "341af079",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'': ('0', '1', '2', '3'),\n",
       " 'Title': ('Python basics',\n",
       "  'Jupyter Tutorial',\n",
       "  'Jupyter Tutorial',\n",
       "  'PyViz Tutorial'),\n",
       " 'Language': ('en', 'en', 'de', 'en'),\n",
       " 'Authors': ('Veit Schiele', 'Veit Schiele', 'Veit Schiele', 'Veit Schiele'),\n",
       " 'License': ('BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause'),\n",
       " 'Publication date': ('2021-10-28', '2019-06-27', '2020-10-26', '2020-04-13'),\n",
       " 'doi': ('', '', '', '')}"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(\"out.csv\") as f:\n",
    "    reader = csv.reader(f, \"my_csv_dialect\")\n",
    "    lines = list(reader)\n",
    "    header, values = lines[0], lines[1:]\n",
    "    data_dict = {h: v for h, v in zip(header, zip(*values))}\n",
    "\n",
    "data_dict"
   ]
  },
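  {
   "cell_type": "markdown",
   "id": "7a1e4c08",
   "metadata": {},
   "source": [
    "If the rows should stay rows, `csv.DictReader` from the same module may be an alternative: it uses the first line as keys and yields one dict per row, so the file does not have to be transposed in memory. The following is only a sketch with made-up data in an `io.StringIO` buffer standing in for an open file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1e4c09",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import io\n",
    "\n",
    "\n",
    "# Hypothetical CSV text; the first line provides the dict keys\n",
    "f = io.StringIO(\"Title,Language\\nPython basics,en\\nJupyter Tutorial,de\\n\")\n",
    "\n",
    "# Each row is returned as a dict keyed by the header fields\n",
    "for row in csv.DictReader(f):\n",
    "    print(row[\"Title\"], row[\"Language\"])"
   ]
  },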
  {
   "cell_type": "markdown",
   "id": "52db491d",
   "metadata": {},
   "source": [
    "To write delimited files manually, you can use `csv.writer`. It accepts an open, writable file object and the same dialect and format options as `csv.reader`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "69f3c21a",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"new.csv\", \"w\", newline=\"\") as f:\n",
    "    writer = csv.writer(f, \"my_csv_dialect\")\n",
    "    writer.writerow((\"\", \"Title\", \"Language\", \"Authors\"))\n",
    "    writer.writerow((\"1\", \"Python basics\", \"en\", \"Veit Schiele\"))\n",
    "    writer.writerow((\"2\", \"Jupyter Tutorial\", \"en\", \"Veit Schiele\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "ff5b4f67",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[',Title,Language,Authors\\n',\n",
       " '1,Python basics,en,Veit Schiele\\n',\n",
       " '2,Jupyter Tutorial,en,Veit Schiele\\n']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(open(\"new.csv\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.13 Kernel",
   "language": "python",
   "name": "python313"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}