CSV example#
[1]:
import pandas as pd
After importing pandas, we first read a CSV file with `read_csv`:
[2]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv"
)
[2]:
Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 | |
---|---|---|---|---|---|
0 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
1 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
2 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
As you can see, this file has no header. To give the DataFrame a header, you have several options. You can allow pandas to assign default column names, or you can define the names yourself:
[3]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    header=None,
)
[3]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
[4]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[4]:
Title | Language | Authors | License | Publication date | |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
Suppose you want the `Authors` column to be the index of the returned DataFrame. You can specify the column either by its position (index 2) or by its name `Authors`, using the `index_col` argument:
[5]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=["Authors"],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[5]:
Title | Language | License | Publication date | |
---|---|---|---|---|
Authors | ||||
Veit Schiele | Python basics | en | BSD-3-Clause | 2021-10-28 |
Veit Schiele | Jupyter Tutorial | en | BSD-3-Clause | 2019-06-27 |
Veit Schiele | Jupyter Tutorial | de | BSD-3-Clause | 2020-10-26 |
Veit Schiele | PyViz Tutorial | en | BSD-3-Clause | 2020-04-13 |
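Equivalently, you can pass the column position instead of the name — a quick sketch that returns the same DataFrame:

pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=2,  # position of the Authors column
    names=["Title", "Language", "Authors", "License", "Publication date"],
)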
In case you want to build a hierarchical index from several columns, pass a list of column numbers or names:
[6]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=[2, 0],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[6]:
Language | License | Publication date | ||
---|---|---|---|---|
Authors | Title | |||
Veit Schiele | Python basics | en | BSD-3-Clause | 2021-10-28 |
Jupyter Tutorial | en | BSD-3-Clause | 2019-06-27 | |
Jupyter Tutorial | de | BSD-3-Clause | 2020-10-26 | |
PyViz Tutorial | en | BSD-3-Clause | 2020-04-13 |
In some cases, a table does not have a fixed separator, but uses several spaces or some other pattern to separate fields. Suppose a file looks like this:
[7]:
list(open("books.txt"))
[7]:
[' Title Language Authors License Publication date\n',
'1 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n',
'2 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n',
'3 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n',
'4 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13\n']
In such cases, you can pass a regular expression as the separator for `read_csv`. A run of two or more whitespace characters can be expressed by the regular expression `\s\s+`, so we get:
[8]:
pd.read_csv("books.txt", sep="\s\s+", engine="python")
[8]:
Title | Language | Authors | License | Publication date | |
---|---|---|---|---|---|
1 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
2 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
3 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
4 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
Since there is one column name fewer than the number of data columns, `read_csv` infers that the first column should be the index of the DataFrame.
The parser functions have many additional arguments that help you handle the wide variety of exceptional file formats that occur. For example, you can skip individual lines of a file with `skiprows`:
[9]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    skiprows=[2],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[9]:
Title | Language | Authors | License | Publication date | |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
Dealing with missing values is an important and often complicated part of parsing data. Missing data is usually either not present (empty string) or indicated by a placeholder. By default, pandas uses a number of common placeholders, such as `NA` and `NULL`:
[10]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    names=[
        "Title",
        "Language",
        "Authors",
        "License",
        "Publication date",
        "doi",
    ],
)

df
[10]:
Title | Language | Authors | License | Publication date | doi | |
---|---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 | NaN |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 | NaN |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 | NaN |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 | NaN |
[11]:
df.isna()
[11]:
Title | Language | Authors | License | Publication date | doi | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | False | False | True |
2 | False | False | False | False | False | True |
3 | False | False | False | False | False | True |
The `na_values` option can take a list or other sequence of strings that should be treated as missing values:
[12]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    na_values=["BSD-3-Clause"],
    names=[
        "Title",
        "Language",
        "Authors",
        "License",
        "Publication date",
        "doi",
    ],
)
[12]:
Title | Language | Authors | License | Publication date | doi | |
---|---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | NaN | 2021-10-28 | NaN |
1 | Jupyter Tutorial | en | Veit Schiele | NaN | 2019-06-27 | NaN |
2 | Jupyter Tutorial | de | Veit Schiele | NaN | 2020-10-26 | NaN |
3 | PyViz Tutorial | en | Veit Schiele | NaN | 2020-04-13 | NaN |
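`na_values` can also be given as a dict that restricts the placeholders to particular columns — a quick sketch, assuming the same file and column names:

pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    na_values={"License": ["BSD-3-Clause"]},  # only matches in the License column
    names=["Title", "Language", "Authors", "License", "Publication date", "doi"],
)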
The most frequently used arguments of the `read_csv` function:

Argument | Description
---|---
`filepath_or_buffer` | String specifying the location in the file system, a URL or a file-like object
`sep` or `delimiter` | String or regular expression to separate the fields in each row
`header` | Row number to be used as column names; defaults to the first row; pass `None` if there is no header row
`index_col` | Column numbers or names to be used as row index in the result; can be a single name/number or a list of them for a hierarchical index
`names` | List of column names
`skiprows` | Number of rows to be ignored at the beginning of the file, or list of row numbers (starting at `0`) to skip
`na_values` | Sequence of values to be replaced by `NA`
`comment` | Character to separate comments from the end of the line
`parse_dates` | Attempt to parse data to `datetime`; defaults to `False`
`keep_date_col` | If columns are combined to parse the date, the combined columns are kept; default: `False`
`converters` | Dict mapping column numbers or names to functions, for example `{"foo": f}` applies the function `f` to all values in the `foo` column
`dayfirst` | Treat as international format when parsing potentially ambiguous dates, for example `7/6/2012` → 7 June 2012; default: `False`
`date_parser` | Function to use for parsing dates
`nrows` | Number of lines to read from the beginning of the file
`iterator` | Return a `TextFileReader` object for reading the file piece by piece
`chunksize` | For the iteration, the size of the data blocks
`skipfooter` | Number of lines to be ignored at the end of the file
`verbose` | Output various information about the parser output, for example the number of missing values in non-numeric columns
`encoding` | Text encoding for Unicode, for example `"utf-8"`
`squeeze` | If the parsed data contains only one column, return a `Series`
`thousands` | Separator for thousands, for example `","` or `"."`
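To illustrate a few of these arguments together, a minimal sketch; the inline CSV text and `io.StringIO` stand in for a real file:

import io

data = """# a comment line that the parser skips
Title,Downloads
Python basics,"1,499"
Jupyter Tutorial,"2,350"
"""

pd.read_csv(
    io.StringIO(data),
    comment="#",    # ignore everything after the comment character
    thousands=",",  # parse "1,499" as the number 1499
)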
Reading in text files piece by piece#
If you want to process very large files, you can also read in only a small part of a file or iterate through smaller parts of a file.
Before we look at a large file, we reduce the number of lines displayed with `pd.options.display.max_rows`:
[13]:
pd.options.display.max_rows = 10
[14]:
pd.read_csv("example.csv")
[14]:
Date | Mon. | Tues. | Wed. | Thurs. | Fri. | Sat. | Sun. | |
---|---|---|---|---|---|---|---|---|
0 | 1996-01-01 | 0.129453 | -0.023836 | 1.121460 | 1.698286 | -0.598506 | 1.042221 | -0.726412 |
1 | 1996-01-02 | -0.094021 | -0.727942 | 0.698641 | -1.198040 | 1.927505 | 1.147445 | -1.134103 |
2 | 1996-01-03 | -0.560857 | 0.145222 | -0.990202 | 1.200214 | 0.717339 | 1.117095 | -1.793565 |
3 | 1996-01-04 | -0.169755 | -0.677391 | -1.533519 | -0.343477 | -0.109705 | 1.038236 | -0.799088 |
4 | 1996-01-05 | 1.344705 | -1.817261 | 0.460991 | -0.839633 | 0.265814 | 0.477659 | 0.636383 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
9127 | 2020-12-27 | -0.881800 | -0.074270 | -0.351769 | 1.381641 | -0.049548 | 1.664180 | -1.032204 |
9128 | 2020-12-28 | -0.143386 | 0.198217 | -1.243861 | 1.196576 | 1.338166 | -0.212333 | -0.023131 |
9129 | 2020-12-29 | 0.398787 | -0.848786 | 1.791707 | -1.167592 | -0.033881 | -0.285559 | -0.323477 |
9130 | 2020-12-30 | 0.587846 | 0.411580 | 1.150380 | 0.444638 | -1.093577 | 0.605456 | 1.463345 |
9131 | 2020-12-31 | 0.736350 | 0.436292 | -0.260171 | -0.066066 | -0.328324 | -0.586792 | -1.204582 |
9132 rows × 8 columns
If you only want to read a small number of lines (without reading the whole file), you can specify this with `nrows`:
[15]:
pd.read_csv("example.csv", nrows=7)
[15]:
Date | Mon. | Tues. | Wed. | Thurs. | Fri. | Sat. | Sun. | |
---|---|---|---|---|---|---|---|---|
0 | 1996-01-01 | 0.129453 | -0.023836 | 1.121460 | 1.698286 | -0.598506 | 1.042221 | -0.726412 |
1 | 1996-01-02 | -0.094021 | -0.727942 | 0.698641 | -1.198040 | 1.927505 | 1.147445 | -1.134103 |
2 | 1996-01-03 | -0.560857 | 0.145222 | -0.990202 | 1.200214 | 0.717339 | 1.117095 | -1.793565 |
3 | 1996-01-04 | -0.169755 | -0.677391 | -1.533519 | -0.343477 | -0.109705 | 1.038236 | -0.799088 |
4 | 1996-01-05 | 1.344705 | -1.817261 | 0.460991 | -0.839633 | 0.265814 | 0.477659 | 0.636383 |
5 | 1996-01-06 | -0.354445 | -0.065182 | -1.244963 | -0.559732 | 0.042362 | -0.303712 | 0.067632 |
6 | 1996-01-07 | 1.460922 | 0.164412 | 0.883960 | -0.833642 | 0.001582 | 1.138469 | 0.561618 |
To read a file piece by piece, you can specify the number of lines with `chunksize`:
[16]:
pd.read_csv("example.csv", chunksize=1000)
[16]:
<pandas.io.parsers.readers.TextFileReader at 0x11fa09110>
The `TextFileReader` object returned by `read_csv` allows iteration over parts of the file according to the `chunksize`. For example, we can iterate over the `example.csv` file and aggregate the number of values in the `Date` column as follows:
[17]:
chunks = pd.read_csv("example.csv", chunksize=1000)

serie = pd.Series([], dtype="float64")
# Accumulate the value counts of the Date column across all chunks
for chunk in chunks:
    serie = serie.add(chunk["Date"].value_counts(), fill_value=0)

sorted_values = serie.sort_values(ascending=False)
[18]:
sorted_values[:10]
[18]:
Date
2020-08-22 1.0
2020-09-07 1.0
2020-08-24 1.0
2020-08-25 1.0
2020-08-26 1.0
2020-08-27 1.0
2020-08-28 1.0
2020-08-29 1.0
2020-08-30 1.0
2020-08-31 1.0
dtype: float64
`TextFileReader` also has a `get_chunk` method that allows you to read pieces of any size.
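A minimal sketch of `get_chunk`, assuming the reader is created with `iterator=True` so that no fixed chunk size is set:

reader = pd.read_csv("example.csv", iterator=True)
first_rows = reader.get_chunk(5)  # DataFrame with the next five rows
more_rows = reader.get_chunk(3)   # the following three rows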
Write DataFrame and Series as a CSV file#
Data can also be exported in a comma-separated format. With the method `pandas.DataFrame.to_csv` we can write the data to a comma-separated file:
[19]:
df.to_csv("out.csv")
Of course, other delimiters can also be used, for example to write to `sys.stdout`, so that the text result is output on the console and not to a file:
[20]:
import sys
[21]:
df.to_csv(sys.stdout, sep="|")
|Title|Language|Authors|License|Publication date|doi
0|Python basics|en|Veit Schiele|BSD-3-Clause|2021-10-28|
1|Jupyter Tutorial|en|Veit Schiele|BSD-3-Clause|2019-06-27|
2|Jupyter Tutorial|de|Veit Schiele|BSD-3-Clause|2020-10-26|
3|PyViz Tutorial|en|Veit Schiele|BSD-3-Clause|2020-04-13|
Missing values appear in the output as empty strings. You may want to mark them with a different placeholder:
[22]:
df.to_csv(sys.stdout, na_rep="NaN")
,Title,Language,Authors,License,Publication date,doi
0,Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,NaN
1,Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,NaN
2,Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,NaN
3,PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,NaN
If no other options are given, both the row and column labels are written. Both can be deactivated:
[23]:
df.to_csv(sys.stdout, index=False, header=False)
Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,
Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,
Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,
PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,
You can also write only a subset of the columns, in an order of your choosing:
[24]:
df.to_csv(
    sys.stdout,
    index=False,
    columns=["Title", "Language", "Authors", "Publication date"],
)
Title,Language,Authors,Publication date
Python basics,en,Veit Schiele,2021-10-28
Jupyter Tutorial,en,Veit Schiele,2019-06-27
Jupyter Tutorial,de,Veit Schiele,2020-10-26
PyViz Tutorial,en,Veit Schiele,2020-04-13
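A `Series` has a `to_csv` method of its own, so it can be written the same way — a quick sketch with made-up data:

s = pd.Series([1, 2, 3], name="Count")
s.to_csv(sys.stdout)  # the index is written as the first column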
Working with Python’s csv module#
Most forms of table data can be loaded using functions such as `pandas.read_csv`. In some cases, however, manual processing may be required. It is not uncommon to receive a file with one or more malformed rows that cause `read_csv` to fail. For any file with a single-character delimiter, you can use Python’s built-in csv module. To use it, pass an open file or file-like object to `csv.reader`:
[25]:
import csv

f = open("out.csv")
reader = csv.reader(f)
for line in reader:
    print(line)
['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']
['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']
['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']
['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']
['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']
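As an aside, newer pandas versions (≥ 1.3) can also skip malformed rows directly with the `on_bad_lines` argument — a hedged sketch, assuming a hypothetical `broken.csv` containing rows with too many fields:

pd.read_csv("broken.csv", on_bad_lines="skip")  # drop bad rows instead of raising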
Dialects#
CSV files come in many different variants. The Python csv module already ships with three different dialects:
Parameter | excel | excel-tab | unix
---|---|---|---
`delimiter` | `','` | `'\t'` | `','`
`quotechar` | `'"'` | `'"'` | `'"'`
`doublequote` | `True` | `True` | `True`
`skipinitialspace` | `False` | `False` | `False`
`lineterminator` | `'\r\n'` | `'\r\n'` | `'\n'`
`quoting` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_ALL`
`escapechar` | `None` | `None` | `None`
You can also use it to define your own format with a different separator, a different string convention or a different end-of-line character. Registering your own dialect is recommended for this. The possible options and arguments of `csv.register_dialect` are:
Argument | Description
---|---
`delimiter` | One-character string to separate fields; default value is `','`
`lineterminator` | Line terminator for writing; default value is `'\r\n'`; the reader ignores this and recognises cross-platform line terminators
`quotechar` | Quotation character for fields with special characters (like a separator); default is `'"'`
`quoting` | Quoting convention; options include `csv.QUOTE_ALL` (quote all fields), `csv.QUOTE_MINIMAL` (only fields with special characters such as the delimiter), `csv.QUOTE_NONNUMERIC` and `csv.QUOTE_NONE` (no quoting); default is `csv.QUOTE_MINIMAL`
`skipinitialspace` | Ignore whitespace after each delimiter; default is `False`
`doublequote` | How to handle the quotation character within a field; if `True`, it is doubled; default is `True`
`escapechar` | String to escape the delimiter when `quoting` is set to `csv.QUOTE_NONE`; deactivated by default
[26]:
csv.register_dialect(
    "my_csv_dialect",
    lineterminator="\n",
    delimiter=",",
    quotechar="'",
    quoting=csv.QUOTE_MINIMAL,
)
Now the CSV file can be opened with:
[27]:
with open("out.csv") as f:
reader = csv.reader(f, "my_csv_dialect")
for line in reader:
print(line)
['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']
['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']
['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']
['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']
['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']
Then we can create a dict of data columns by using a dict comprehension and `zip(*values)` to transpose the rows in `values` into columns. Note that this requires a lot of memory for large files, as the rows are converted into columns:
[28]:
with open("out.csv") as f:
reader = csv.reader(f, "my_csv_dialect")
lines = list(reader)
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
[28]:
{'': ('0', '1', '2', '3'),
'Title': ('Python basics',
'Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial'),
'Language': ('en', 'en', 'de', 'en'),
'Authors': ('Veit Schiele', 'Veit Schiele', 'Veit Schiele', 'Veit Schiele'),
'License': ('BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause'),
'Publication date': ('2021-10-28', '2019-06-27', '2020-10-26', '2020-04-13'),
'doi': ('', '', '', '')}
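The csv module also offers `csv.DictReader`, which yields each row as a dict keyed by the header — a quick sketch using the dialect registered above:

with open("out.csv") as f:
    reader = csv.DictReader(f, dialect="my_csv_dialect")
    for row in reader:
        print(row["Title"], row["Publication date"])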
To write delimited files manually, you can use `csv.writer`. It accepts an open, writable file object and the same dialect and format options as `csv.reader`:
[29]:
with open("new.csv", "w") as f:
writer = csv.writer(f, "my_csv_dialect")
writer.writerow(("", "Titel", "Sprache", "Autor*innen"))
writer.writerow(("1", "Python basics", "en", "Veit Schiele"))
writer.writerow(("2", "Jupyter Tutorial", "en", "Veit Schiele"))
[30]:
list(open("new.csv"))
[30]:
[',Titel,Sprache,Autor*innen\n',
'1,Python basics,en,Veit Schiele\n',
'2,Jupyter Tutorial,en,Veit Schiele\n']
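Analogously, `csv.DictWriter` writes rows from dicts, with `fieldnames` determining the column order — a minimal sketch:

with open("new.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=["Titel", "Sprache"], dialect="my_csv_dialect")
    writer.writeheader()  # write the fieldnames as the header row
    writer.writerow({"Titel": "PyViz Tutorial", "Sprache": "en"})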