CSV example

[1]:
import pandas as pd

After importing pandas, we first read a csv file with read_csv:

[2]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv"
)
[2]:
Python basics en Veit Schiele BSD-3-Clause 2021-10-28
0 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27
1 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26
2 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13

As you can see, this file has no header. To give the DataFrame a header, you have several options. You can allow pandas to assign default column names, or you can define the names yourself:

[3]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    header=None,
)
[3]:
0 1 2 3 4
0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28
1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27
2 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26
3 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13
[4]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[4]:
Title Language Authors License Publication date
0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28
1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27
2 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26
3 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13

Suppose you want the Authors column to be the index of the returned DataFrame. You can either specify that you want the column at index 3 or with the name Authors by using the argument index_col:

[5]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=["Authors"],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[5]:
Title Language License Publication date
Authors
Veit Schiele Python basics en BSD-3-Clause 2021-10-28
Veit Schiele Jupyter Tutorial en BSD-3-Clause 2019-06-27
Veit Schiele Jupyter Tutorial de BSD-3-Clause 2020-10-26
Veit Schiele PyViz Tutorial en BSD-3-Clause 2020-04-13

In case you want to build a hierarchical index from several columns, pass a list of column numbers or names:

[6]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=[2, 0],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[6]:
Language License Publication date
Authors Title
Veit Schiele Python basics en BSD-3-Clause 2021-10-28
Jupyter Tutorial en BSD-3-Clause 2019-06-27
Jupyter Tutorial de BSD-3-Clause 2020-10-26
PyViz Tutorial en BSD-3-Clause 2020-04-13

In some cases, a table does not have a fixed separator, but uses several spaces or some other pattern to separate fields. Suppose a file looks like this:

[7]:
list(open("books.txt"))
[7]:
['   Title             Language  Authors       License       Publication date\n',
 '1  Python basics     en        Veit Schiele  BSD-3-Clause  2021-10-28\n',
 '2  Jupyter Tutorial  en        Veit Schiele  BSD-3-Clause  2019-06-27\n',
 '3  Jupyter Tutorial  de        Veit Schiele  BSD-3-Clause  2020-10-26\n',
 '4  PyViz Tutorial    en        Veit Schiele  BSD-3-Clause  2020-04-13\n']

In such cases, you can pass a regular expression as a separator for read_csv. This can be expressed by the regular expression \s\s+, so then we have:

[8]:
pd.read_csv("books.txt", sep="\s\s+", engine="python")
[8]:
Title Language Authors License Publication date
1 Python basics en Veit Schiele BSD-3-Clause 2021-10-28
2 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27
3 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26
4 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13

Since there was one column name less than the number of data rows, read_csv infers that in this case the first column should be the index of the DataFrame.

The parser functions have many additional arguments that help you handle the wide variety of exception file formats that occur. For example, you can skip individual lines of a file with skiprows:

[9]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    skiprows=[2],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[9]:
Title Language Authors License Publication date
0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28
1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27
2 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13

Dealing with missing values is an important and often complicated part of parsing data. Missing data is usually either not present (empty string) or indicated by a placeholder. By default, pandas uses a number of common placeholders, such as NA and NULL:

[10]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    names=[
        "Title",
        "Language",
        "Authors",
        "License",
        "Publication date",
        "doi",
    ],
)

df
[10]:
Title Language Authors License Publication date doi
0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28 NaN
1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27 NaN
2 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26 NaN
3 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13 NaN
[11]:
df.isna()
[11]:
Title Language Authors License Publication date doi
0 False False False False False True
1 False False False False False True
2 False False False False False True
3 False False False False False True

The na_values option can take either a list or a series of strings to account for missing values:

[12]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    na_values=["BSD-3-Clause"],
    names=[
        "Title",
        "Language",
        "Authors",
        "License",
        "Publication date",
        "doi",
    ],
)
[12]:
Title Language Authors License Publication date doi
0 Python basics en Veit Schiele NaN 2021-10-28 NaN
1 Jupyter Tutorial en Veit Schiele NaN 2019-06-27 NaN
2 Jupyter Tutorial de Veit Schiele NaN 2020-10-26 NaN
3 PyViz Tutorial en Veit Schiele NaN 2020-04-13 NaN

The most frequent arguments of the function read_csv:

Argument

Description

path

String specifying the location in the file system, a URL or a file-like object

sep or delimiter

String or regular expression to separate the fields in each row

header

Row number to be used as column name; default is 0, i.e. the first row, but should be None if there is no header row

index_col

Row numbers or names to be used as row index in the result; can be a single name/number or a list of them for a hierarchical index

names

List of column names

skiprows

Number of rows to be ignored at the beginning of the file or list of row numbers starting at 0 to be skipped

na_values

sequence of values to be replaced by NA

comment

character to separate comments from the end of the line

parse_dates

Attempt to parse data with datetime; defaults to False. If True, attempts to parse all columns. Otherwise, a list of column numbers or names to parse can be specified. If the list element is a tuple or a list, multiple columns are combined and converted to a date, for example if the date and time are split between two columns

keep_date_col

if columns are combined to parse the date, the combined columns are kept; default: False

converters

Dict containing the column number of names mapped to functions, for example {'Titel': f} would apply the function f to all values in the column Title

dayfirst

treat as an international format when parsing potentially ambiguous dates, for example 28/6/202128. Juni 2021; False by default

date_parser

function to use for parsing dates

nrows

Number of lines to read from the beginning of the file.

iterator

Return a TextFileReader object to read the file piece by piece; this object can also be used with the with statement

chunksize

For the iteration, the size of the data blocks.

skip_footer

number of lines to be ignored at the end of the file

verbose

outputs various information about the parser output, for example the number of missing values in non-numeric columns

encoding

Text encoding for Unicode, for example utf-8 for UTF-8 encoded text

squeeze

if the parsed data contains only one column, a Series is returned

thousands

Separator for thousands, for example , or .

Reading in text files piece by piece

If you want to process very large files, you can also read in only a small part of a file or iterate through smaller parts of a file.

Before we look at a large file, we reduce the number of lines displayed with options.display.max_rows:

[13]:
pd.options.display.max_rows = 10
[14]:
pd.read_csv("example.csv")
[14]:
Date Mon. Tues. Wed. Thurs. Fri. Sat. Sun.
0 1996-01-01 0.129453 -0.023836 1.121460 1.698286 -0.598506 1.042221 -0.726412
1 1996-01-02 -0.094021 -0.727942 0.698641 -1.198040 1.927505 1.147445 -1.134103
2 1996-01-03 -0.560857 0.145222 -0.990202 1.200214 0.717339 1.117095 -1.793565
3 1996-01-04 -0.169755 -0.677391 -1.533519 -0.343477 -0.109705 1.038236 -0.799088
4 1996-01-05 1.344705 -1.817261 0.460991 -0.839633 0.265814 0.477659 0.636383
... ... ... ... ... ... ... ... ...
9127 2020-12-27 -0.881800 -0.074270 -0.351769 1.381641 -0.049548 1.664180 -1.032204
9128 2020-12-28 -0.143386 0.198217 -1.243861 1.196576 1.338166 -0.212333 -0.023131
9129 2020-12-29 0.398787 -0.848786 1.791707 -1.167592 -0.033881 -0.285559 -0.323477
9130 2020-12-30 0.587846 0.411580 1.150380 0.444638 -1.093577 0.605456 1.463345
9131 2020-12-31 0.736350 0.436292 -0.260171 -0.066066 -0.328324 -0.586792 -1.204582

9132 rows × 8 columns

If you only want to read a small number of lines (without reading the whole file), you can specify this with nrows:

[15]:
pd.read_csv("example.csv", nrows=7)
[15]:
Date Mon. Tues. Wed. Thurs. Fri. Sat. Sun.
0 1996-01-01 0.129453 -0.023836 1.121460 1.698286 -0.598506 1.042221 -0.726412
1 1996-01-02 -0.094021 -0.727942 0.698641 -1.198040 1.927505 1.147445 -1.134103
2 1996-01-03 -0.560857 0.145222 -0.990202 1.200214 0.717339 1.117095 -1.793565
3 1996-01-04 -0.169755 -0.677391 -1.533519 -0.343477 -0.109705 1.038236 -0.799088
4 1996-01-05 1.344705 -1.817261 0.460991 -0.839633 0.265814 0.477659 0.636383
5 1996-01-06 -0.354445 -0.065182 -1.244963 -0.559732 0.042362 -0.303712 0.067632
6 1996-01-07 1.460922 0.164412 0.883960 -0.833642 0.001582 1.138469 0.561618

To read a file piece by piece, you can specify the number of lines with chunksize:

[16]:
pd.read_csv("example.csv", chunksize=1000)
[16]:
<pandas.io.parsers.readers.TextFileReader at 0x11fa09110>

The TextFileReader object returned by read_csv allows iteration over parts of the file according to the chunksize. For example, we can iterate over the example.csv file and aggregate the number of values in the Date column as follows:

[17]:
chunks = pd.read_csv("example.csv", chunksize=1000)

serie = pd.Series([], dtype="float64")
for chunk in chunks:
    values = serie.add(chunk["Date"].value_counts(), fill_value=0)

sorted_values = values.sort_values(ascending=False)
[18]:
sorted_values[:10]
[18]:
Date
2020-08-22    1.0
2020-09-07    1.0
2020-08-24    1.0
2020-08-25    1.0
2020-08-26    1.0
2020-08-27    1.0
2020-08-28    1.0
2020-08-29    1.0
2020-08-30    1.0
2020-08-31    1.0
dtype: float64

TextFileReader also has a get_chunk method that allows you to read pieces of any size.

Write DataFrame and Series as a CSV file

Data can also be exported in a comma-separated format. With the method pandas.DataFrame.to_csv we can write the data into a comma-separated file:

[19]:
df.to_csv("out.csv")

Of course, other delimiters can also be used, for example to write to sys.stdout, so that the text result is output on the console and not in a file:

[20]:
import sys
[21]:
df.to_csv(sys.stdout, sep="|")
|Title|Language|Authors|License|Publication date|doi
0|Python basics|en|Veit Schiele|BSD-3-Clause|2021-10-28|
1|Jupyter Tutorial|en|Veit Schiele|BSD-3-Clause|2019-06-27|
2|Jupyter Tutorial|de|Veit Schiele|BSD-3-Clause|2020-10-26|
3|PyViz Tutorial|en|Veit Schiele|BSD-3-Clause|2020-04-13|

Missing values appear in the output as empty strings. You may want to mark them with a different placeholder:

[22]:
df.to_csv(sys.stdout, na_rep="NaN")
,Title,Language,Authors,License,Publication date,doi
0,Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,NaN
1,Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,NaN
2,Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,NaN
3,PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,NaN

If no other options are given, both the row and column labels are written. Both can be deactivated:

[23]:
df.to_csv(sys.stdout, index=False, header=False)
Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,
Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,
Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,
PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,

You can also write only a subset of the columns, in an order of your choosing:

[24]:
df.to_csv(
    sys.stdout,
    index=False,
    columns=["Title", "Language", "Authors", "Publication date"],
)
Title,Language,Authors,Publication date
Python basics,en,Veit Schiele,2021-10-28
Jupyter Tutorial,en,Veit Schiele,2019-06-27
Jupyter Tutorial,de,Veit Schiele,2020-10-26
PyViz Tutorial,en,Veit Schiele,2020-04-13

Working with the csv module of Python

Most forms of table data can be loaded using functions such as pandas.read_csv. However, in some cases manual processing may be required. It is not uncommon to receive a file with one or more incorrect rows that cause read_csv to fail. For any file with a single-digit delimiter, you can use Python’s built-in csv module. To use it, pass an open file or file-like object to csv.reader:

[25]:
import csv


f = open("out.csv")
reader = csv.reader(f)

for line in reader:
    print(line)
['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']
['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']
['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']
['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']
['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']

Dialekte

csv-Dateien gibt es in vielen verschiedenen Varianten. Das Python csv-Modul kommt bereits mit drei verschiedenen Dialekten:

Parameter

excel

excel-tab

unix

delimiter

','

'\\t'

','

quotechar

'\"'

'\"'

'\"'

doublequote

True

True

True

skipinitialspace

False

False

False

lineterminator

'\\r\\n'

'\\r\\n'

'\\n'

quoting

csv.QUOTE_MINIMAL

csv.QUOTE_MINIMAL

csv.QUOTE_ALL

escapechar

None

None

None

You can also use it to define your own format with a different separator, a different string convention or a different end-of-line character. Registering your own dialect is recommended for this. Possible options and functions of csv.register_dialect are:

Argument

Description

delimiter

One-character string to separate fields; default value is ,.

lineterminator

Line terminator for writing; default value is \r\n. Reader ignores this and recognises cross-platform line delimiters.

quotechar

Quotation marks for fields with special characters (like a separator); default is ".

quoting

Quoting convention. Options include csv.QUOTE_ALL – quote all fields, csv.QUOTE_MINIMAL – quote only fields with special characters like the delimiter, csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE – no quotes. The default value is QUOTE_MINIMAL.

skipinitialspace

Ignore spaces after each delimiter; default is False.

doublequote

if True, quotes are doubled within a field.

escapechar

String to bypass the delimiter when quoting is set to csv.QUOTE_NONE; default is disabled.

[26]:
csv.register_dialect(
    "my_csv_dialect",
    lineterminator="\n",
    delimiter=",",
    quotechar="'",
    quoting=csv.QUOTE_MINIMAL,
)

Now the CSV file can be opened with:

[27]:
with open("out.csv") as f:
    reader = csv.reader(f, "my_csv_dialect")
    for line in reader:
        print(line)
['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']
['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']
['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']
['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']
['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']

Then we can create a Dict with data columns by using Dict Comprehensions and iterating over the values from values with zip. Note that this requires a lot of storage space for large files, as the rows are converted into columns:

[28]:
with open("out.csv") as f:
    reader = csv.reader(f, "my_csv_dialect")
    lines = list(reader)
    header, values = lines[0], lines[1:]
    data_dict = {h: v for h, v in zip(header, zip(*values))}

data_dict
[28]:
{'': ('0', '1', '2', '3'),
 'Title': ('Python basics',
  'Jupyter Tutorial',
  'Jupyter Tutorial',
  'PyViz Tutorial'),
 'Language': ('en', 'en', 'de', 'en'),
 'Authors': ('Veit Schiele', 'Veit Schiele', 'Veit Schiele', 'Veit Schiele'),
 'License': ('BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause'),
 'Publication date': ('2021-10-28', '2019-06-27', '2020-10-26', '2020-04-13'),
 'doi': ('', '', '', '')}

To write files with separators manually, you can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:

[29]:
with open("new.csv", "w") as f:
    writer = csv.writer(f, "my_csv_dialect")
    writer.writerow(("", "Titel", "Sprache", "Autor*innen"))
    writer.writerow(("1", "Python basics", "en", "Veit Schiele"))
    writer.writerow(("2", "Jupyter Tutorial", "en", "Veit Schiele"))
[30]:
list(open("new.csv"))
[30]:
[',Titel,Sprache,Autor*innen\n',
 '1,Python basics,en,Veit Schiele\n',
 '2,Jupyter Tutorial,en,Veit Schiele\n']