CSV example#
[1]:
import pandas as pd
After importing pandas, we first read a CSV file with `read_csv`:
[2]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv"
)
[2]:
Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 | |
---|---|---|---|---|---|
0 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
1 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
2 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
As you can see, this file has no header. To give the DataFrame a header, you have several options. You can allow pandas to assign default column names, or you can define the names yourself:
[3]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    header=None,
)
[3]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
[4]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[4]:
Title | Language | Authors | License | Publication date | |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
Suppose you want the `Authors` column to be the index of the returned DataFrame. You can specify the column either by its position (index 2) or by its name `Authors`, using the `index_col` argument:
[5]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=["Authors"],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[5]:
Title | Language | License | Publication date | |
---|---|---|---|---|
Authors | ||||
Veit Schiele | Python basics | en | BSD-3-Clause | 2021-10-28 |
Veit Schiele | Jupyter Tutorial | en | BSD-3-Clause | 2019-06-27 |
Veit Schiele | Jupyter Tutorial | de | BSD-3-Clause | 2020-10-26 |
Veit Schiele | PyViz Tutorial | en | BSD-3-Clause | 2020-04-13 |
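Equivalently, you can pass the column position instead of the name — a quick sketch that returns the same DataFrame:

pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=2,  # position of the Authors column
    names=["Title", "Language", "Authors", "License", "Publication date"],
)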
In case you want to build a hierarchical index from several columns, pass a list of column numbers or names:
[6]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    index_col=[2, 0],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[6]:
Language | License | Publication date | ||
---|---|---|---|---|
Authors | Title | |||
Veit Schiele | Python basics | en | BSD-3-Clause | 2021-10-28 |
Jupyter Tutorial | en | BSD-3-Clause | 2019-06-27 | |
Jupyter Tutorial | de | BSD-3-Clause | 2020-10-26 | |
PyViz Tutorial | en | BSD-3-Clause | 2020-04-13 |
In some cases, a table does not have a fixed separator, but uses several spaces or some other pattern to separate fields. Suppose a file looks like this:
[7]:
list(open("books.txt"))
[7]:
[' Title Language Authors License Publication date\n',
'1 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n',
'2 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n',
'3 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n',
'4 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13\n']
In such cases, you can pass a regular expression as the separator for `read_csv`. A run of two or more whitespace characters can be expressed by the regular expression `\s\s+`, so we get:
[8]:
pd.read_csv("books.txt", sep="\s\s+", engine="python")
[8]:
Title | Language | Authors | License | Publication date | |
---|---|---|---|---|---|
1 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
2 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
3 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 |
4 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
Since there is one column name fewer than the number of data columns, `read_csv` infers that the first column should be the index of the DataFrame.
The parser functions have many additional arguments that help you handle the wide variety of exceptional file formats that occur. For example, you can skip individual lines of a file with `skiprows`:
[9]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    skiprows=[2],
    names=["Title", "Language", "Authors", "License", "Publication date"],
)
[9]:
Title | Language | Authors | License | Publication date | |
---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 |
2 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 |
Dealing with missing values is an important and often complicated part of parsing data. Missing data is usually either not present (empty string) or indicated by a placeholder. By default, pandas uses a number of common placeholders, such as `NA` and `NULL`:
[10]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    names=[
        "Title",
        "Language",
        "Authors",
        "License",
        "Publication date",
        "doi",
    ],
)

df
[10]:
Title | Language | Authors | License | Publication date | doi | |
---|---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | BSD-3-Clause | 2021-10-28 | NaN |
1 | Jupyter Tutorial | en | Veit Schiele | BSD-3-Clause | 2019-06-27 | NaN |
2 | Jupyter Tutorial | de | Veit Schiele | BSD-3-Clause | 2020-10-26 | NaN |
3 | PyViz Tutorial | en | Veit Schiele | BSD-3-Clause | 2020-04-13 | NaN |
[11]:
df.isna()
[11]:
Title | Language | Authors | License | Publication date | doi | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | False | False | True |
2 | False | False | False | False | False | True |
3 | False | False | False | False | False | True |
The `na_values` option can take a list or other sequence of strings that should be treated as missing values:
[12]:
pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    na_values=["BSD-3-Clause"],
    names=[
        "Title",
        "Language",
        "Authors",
        "License",
        "Publication date",
        "doi",
    ],
)
[12]:
Title | Language | Authors | License | Publication date | doi | |
---|---|---|---|---|---|---|
0 | Python basics | en | Veit Schiele | NaN | 2021-10-28 | NaN |
1 | Jupyter Tutorial | en | Veit Schiele | NaN | 2019-06-27 | NaN |
2 | Jupyter Tutorial | de | Veit Schiele | NaN | 2020-10-26 | NaN |
3 | PyViz Tutorial | en | Veit Schiele | NaN | 2020-04-13 | NaN |
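`na_values` can also be given as a dict that restricts the placeholders to particular columns — a quick sketch, assuming the same file and column names:

pd.read_csv(
    "https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv",
    na_values={"License": ["BSD-3-Clause"]},  # only matches in the License column
    names=["Title", "Language", "Authors", "License", "Publication date", "doi"],
)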
The most frequently used arguments of the `read_csv` function:

Argument | Description
---|---
`filepath_or_buffer` | String specifying the location in the file system, a URL or a file-like object
`sep` or `delimiter` | String or regular expression to separate the fields in each row
`header` | Row number to be used as column names; defaults to the first row; pass `None` if there is no header row
`index_col` | Column numbers or names to be used as row index in the result; can be a single name/number or a list of them for a hierarchical index
`names` | List of column names
`skiprows` | Number of rows to be ignored at the beginning of the file, or list of row numbers (starting at `0`) to skip
`na_values` | Sequence of values to be replaced by `NA`
`comment` | Character to separate comments from the end of the line
`parse_dates` | Attempt to parse data to `datetime`; defaults to `False`
`keep_date_col` | If columns are combined to parse the date, the combined columns are kept; default: `False`
`converters` | Dict mapping column numbers or names to functions, for example `{"foo": f}` applies the function `f` to all values in the `foo` column
`dayfirst` | Treat as international format when parsing potentially ambiguous dates, for example `7/6/2012` → 7 June 2012; default: `False`
`date_parser` | Function to use for parsing dates
`nrows` | Number of lines to read from the beginning of the file
`iterator` | Return a `TextFileReader` object for reading the file piece by piece
`chunksize` | For the iteration, the size of the data blocks
`skipfooter` | Number of lines to be ignored at the end of the file
`verbose` | Output various information about the parser output, for example the number of missing values in non-numeric columns
`encoding` | Text encoding for Unicode, for example `"utf-8"`
`squeeze` | If the parsed data contains only one column, return a `Series`
`thousands` | Separator for thousands, for example `","` or `"."`
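To illustrate a few of these arguments together, a minimal sketch; the inline CSV text and `io.StringIO` stand in for a real file:

import io

data = """# a comment line that the parser skips
Title,Downloads
Python basics,"1,499"
Jupyter Tutorial,"2,350"
"""

pd.read_csv(
    io.StringIO(data),
    comment="#",    # ignore everything after the comment character
    thousands=",",  # parse "1,499" as the number 1499
)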
Reading in text files piece by piece#
If you want to process very large files, you can also read in only a small part of a file or iterate through smaller parts of a file.
Before we look at a large file, we reduce the number of lines displayed with `pd.options.display.max_rows`:
[13]:
pd.options.display.max_rows = 10
[14]:
pd.read_csv("example.csv")
[14]:
Date | Mon. | Tues. | Wed. | Thurs. | Fri. | Sat. | Sun. | |
---|---|---|---|---|---|---|---|---|
0 | 1996-01-01 | 0.129453 | -0.023836 | 1.121460 | 1.698286 | -0.598506 | 1.042221 | -0.726412 |
1 | 1996-01-02 | -0.094021 | -0.727942 | 0.698641 | -1.198040 | 1.927505 | 1.147445 | -1.134103 |
2 | 1996-01-03 | -0.560857 | 0.145222 | -0.990202 | 1.200214 | 0.717339 | 1.117095 | -1.793565 |
3 | 1996-01-04 | -0.169755 | -0.677391 | -1.533519 | -0.343477 | -0.109705 | 1.038236 | -0.799088 |
4 | 1996-01-05 | 1.344705 | -1.817261 | 0.460991 | -0.839633 | 0.265814 | 0.477659 | 0.636383 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
9127 | 2020-12-27 | -0.881800 | -0.074270 | -0.351769 | 1.381641 | -0.049548 | 1.664180 | -1.032204 |
9128 | 2020-12-28 | -0.143386 | 0.198217 | -1.243861 | 1.196576 | 1.338166 | -0.212333 | -0.023131 |
9129 | 2020-12-29 | 0.398787 | -0.848786 | 1.791707 | -1.167592 | -0.033881 | -0.285559 | -0.323477 |
9130 | 2020-12-30 | 0.587846 | 0.411580 | 1.150380 | 0.444638 | -1.093577 | 0.605456 | 1.463345 |
9131 | 2020-12-31 | 0.736350 | 0.436292 | -0.260171 | -0.066066 | -0.328324 | -0.586792 | -1.204582 |
9132 rows × 8 columns
If you only want to read a small number of lines (without reading the whole file), you can specify this with `nrows`:
[15]:
pd.read_csv("example.csv", nrows=7)
[15]:
Date | Mon. | Tues. | Wed. | Thurs. | Fri. | Sat. | Sun. | |
---|---|---|---|---|---|---|---|---|
0 | 1996-01-01 | 0.129453 | -0.023836 | 1.121460 | 1.698286 | -0.598506 | 1.042221 | -0.726412 |
1 | 1996-01-02 | -0.094021 | -0.727942 | 0.698641 | -1.198040 | 1.927505 | 1.147445 | -1.134103 |
2 | 1996-01-03 | -0.560857 | 0.145222 | -0.990202 | 1.200214 | 0.717339 | 1.117095 | -1.793565 |
3 | 1996-01-04 | -0.169755 | -0.677391 | -1.533519 | -0.343477 | -0.109705 | 1.038236 | -0.799088 |
4 | 1996-01-05 | 1.344705 | -1.817261 | 0.460991 | -0.839633 | 0.265814 | 0.477659 | 0.636383 |
5 | 1996-01-06 | -0.354445 | -0.065182 | -1.244963 | -0.559732 | 0.042362 | -0.303712 | 0.067632 |
6 | 1996-01-07 | 1.460922 | 0.164412 | 0.883960 | -0.833642 | 0.001582 | 1.138469 | 0.561618 |
To read a file piece by piece, you can specify the number of lines with `chunksize`:
[16]:
pd.read_csv("example.csv", chunksize=1000)
[16]:
<pandas.io.parsers.readers.TextFileReader at 0x11fa09110>
The `TextFileReader` object returned by `read_csv` allows iteration over parts of the file according to the `chunksize`. For example, we can iterate over the `example.csv` file and aggregate the number of values in the `Date` column as follows:
[17]:
chunks = pd.read_csv("example.csv", chunksize=1000)

serie = pd.Series([], dtype="float64")
# Accumulate the value counts of the Date column across all chunks
for chunk in chunks:
    serie = serie.add(chunk["Date"].value_counts(), fill_value=0)

sorted_values = serie.sort_values(ascending=False)
[18]:
sorted_values[:10]
[18]:
Date
2020-08-22 1.0
2020-09-07 1.0
2020-08-24 1.0
2020-08-25 1.0
2020-08-26 1.0
2020-08-27 1.0
2020-08-28 1.0
2020-08-29 1.0
2020-08-30 1.0
2020-08-31 1.0
dtype: float64
`TextFileReader` also has a `get_chunk` method that allows you to read pieces of any size.
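A minimal sketch of `get_chunk`, assuming the reader is created with `iterator=True` so that no fixed chunk size is set:

reader = pd.read_csv("example.csv", iterator=True)
first_rows = reader.get_chunk(5)  # DataFrame with the next five rows
more_rows = reader.get_chunk(3)   # the following three rows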
Write DataFrame and Series as a CSV file#
Data can also be exported in a comma-separated format. With the method `pandas.DataFrame.to_csv` we can write the data to a comma-separated file:
[19]:
df.to_csv("out.csv")
Of course, other delimiters can also be used, for example to write to `sys.stdout`, so that the text result is output on the console and not to a file:
[20]:
import sys
[21]:
df.to_csv(sys.stdout, sep="|")
|Title|Language|Authors|License|Publication date|doi
0|Python basics|en|Veit Schiele|BSD-3-Clause|2021-10-28|
1|Jupyter Tutorial|en|Veit Schiele|BSD-3-Clause|2019-06-27|
2|Jupyter Tutorial|de|Veit Schiele|BSD-3-Clause|2020-10-26|
3|PyViz Tutorial|en|Veit Schiele|BSD-3-Clause|2020-04-13|
Missing values appear in the output as empty strings. You may want to mark them with a different placeholder:
[22]:
df.to_csv(sys.stdout, na_rep="NaN")
,Title,Language,Authors,License,Publication date,doi
0,Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,NaN
1,Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,NaN
2,Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,NaN
3,PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,NaN
If no other options are given, both the row and column labels are written. Both can be deactivated:
[23]:
df.to_csv(sys.stdout, index=False, header=False)
Python basics,en,Veit Schiele,BSD-3-Clause,2021-10-28,
Jupyter Tutorial,en,Veit Schiele,BSD-3-Clause,2019-06-27,
Jupyter Tutorial,de,Veit Schiele,BSD-3-Clause,2020-10-26,
PyViz Tutorial,en,Veit Schiele,BSD-3-Clause,2020-04-13,
You can also write only a subset of the columns, in an order of your choosing:
[24]:
df.to_csv(
    sys.stdout,
    index=False,
    columns=["Title", "Language", "Authors", "Publication date"],
)
Title,Language,Authors,Publication date
Python basics,en,Veit Schiele,2021-10-28
Jupyter Tutorial,en,Veit Schiele,2019-06-27
Jupyter Tutorial,de,Veit Schiele,2020-10-26
PyViz Tutorial,en,Veit Schiele,2020-04-13
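A `Series` has a `to_csv` method of its own, so it can be written the same way — a quick sketch with made-up data:

s = pd.Series([1, 2, 3], name="Count")
s.to_csv(sys.stdout)  # the index is written as the first column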
Working with Python’s csv module#
Most forms of table data can be loaded using functions such as `pandas.read_csv`. In some cases, however, manual processing may be required. It is not uncommon to receive a file with one or more malformed rows that cause `read_csv` to fail. For any file with a single-character delimiter, you can use Python’s built-in csv module. To use it, pass an open file or file-like object to `csv.reader`:
[25]:
import csv

f = open("out.csv")
reader = csv.reader(f)
for line in reader:
    print(line)
['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']
['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']
['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']
['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']
['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']
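As an aside, newer pandas versions (≥ 1.3) can also skip malformed rows directly with the `on_bad_lines` argument — a hedged sketch, assuming a hypothetical `broken.csv` containing rows with too many fields:

pd.read_csv("broken.csv", on_bad_lines="skip")  # drop bad rows instead of raising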
Dialects#
CSV files come in many different variants. The Python csv module already ships with three different dialects:
Parameter | excel | excel-tab | unix
---|---|---|---
`delimiter` | `','` | `'\t'` | `','`
`quotechar` | `'"'` | `'"'` | `'"'`
`doublequote` | `True` | `True` | `True`
`skipinitialspace` | `False` | `False` | `False`
`lineterminator` | `'\r\n'` | `'\r\n'` | `'\n'`
`quoting` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_ALL`
`escapechar` | `None` | `None` | `None`
You can also use it to define your own format with a different separator, a different string convention or a different end-of-line character. Registering your own dialect is recommended for this. The possible options and arguments of `csv.register_dialect` are:
Argument | Description
---|---
`delimiter` | One-character string to separate fields; default value is `','`
`lineterminator` | Line terminator for writing; default value is `'\r\n'`; the reader ignores this and recognises cross-platform line terminators
`quotechar` | Quotation character for fields with special characters (like a separator); default is `'"'`
`quoting` | Quoting convention; options include `csv.QUOTE_ALL` (quote all fields), `csv.QUOTE_MINIMAL` (only fields with special characters such as the delimiter), `csv.QUOTE_NONNUMERIC` and `csv.QUOTE_NONE` (no quoting); default is `csv.QUOTE_MINIMAL`
`skipinitialspace` | Ignore whitespace after each delimiter; default is `False`
`doublequote` | How to handle the quotation character within a field; if `True`, it is doubled; default is `True`
`escapechar` | String to escape the delimiter when `quoting` is set to `csv.QUOTE_NONE`; deactivated by default
[26]:
csv.register_dialect(
    "my_csv_dialect",
    lineterminator="\n",
    delimiter=",",
    quotechar="'",
    quoting=csv.QUOTE_MINIMAL,
)
Now the CSV file can be opened with:
[27]:
with open("out.csv") as f:
reader = csv.reader(f, "my_csv_dialect")
for line in reader:
print(line)
['', 'Title', 'Language', 'Authors', 'License', 'Publication date', 'doi']
['0', 'Python basics', 'en', 'Veit Schiele', 'BSD-3-Clause', '2021-10-28', '']
['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2019-06-27', '']
['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', 'BSD-3-Clause', '2020-10-26', '']
['3', 'PyViz Tutorial', 'en', 'Veit Schiele', 'BSD-3-Clause', '2020-04-13', '']
Then we can create a dict of data columns by using a dict comprehension and `zip(*values)` to transpose the rows in `values` into columns. Note that this requires a lot of memory for large files, as the rows are converted into columns:
[28]:
with open("out.csv") as f:
reader = csv.reader(f, "my_csv_dialect")
lines = list(reader)
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
[28]:
{'': ('0', '1', '2', '3'),
'Title': ('Python basics',
'Jupyter Tutorial',
'Jupyter Tutorial',
'PyViz Tutorial'),
'Language': ('en', 'en', 'de', 'en'),
'Authors': ('Veit Schiele', 'Veit Schiele', 'Veit Schiele', 'Veit Schiele'),
'License': ('BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause', 'BSD-3-Clause'),
'Publication date': ('2021-10-28', '2019-06-27', '2020-10-26', '2020-04-13'),
'doi': ('', '', '', '')}
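The csv module also offers `csv.DictReader`, which yields each row as a dict keyed by the header — a quick sketch using the dialect registered above:

with open("out.csv") as f:
    reader = csv.DictReader(f, dialect="my_csv_dialect")
    for row in reader:
        print(row["Title"], row["Publication date"])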
To write delimited files manually, you can use `csv.writer`. It accepts an open, writable file object and the same dialect and format options as `csv.reader`:
[29]:
with open("new.csv", "w") as f:
writer = csv.writer(f, "my_csv_dialect")
writer.writerow(("", "Titel", "Sprache", "Autor*innen"))
writer.writerow(("1", "Python basics", "en", "Veit Schiele"))
writer.writerow(("2", "Jupyter Tutorial", "en", "Veit Schiele"))
[30]:
list(open("new.csv"))
[30]:
[',Titel,Sprache,Autor*innen\n',
'1,Python basics,en,Veit Schiele\n',
'2,Jupyter Tutorial,en,Veit Schiele\n']
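Analogously, `csv.DictWriter` writes rows from dicts, with `fieldnames` determining the column order — a minimal sketch:

with open("new.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=["Titel", "Sprache"], dialect="my_csv_dialect")
    writer.writeheader()  # write the fieldnames as the header row
    writer.writerow({"Titel": "PyViz Tutorial", "Sprache": "en"})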