pandas IO tools
pandas has a number of functions for reading tabular data as DataFrame objects, including:
Function | Description |
---|---|
`read_csv` | loads CSV data from a file, URL or file-like object; usually a comma is used as separator |
`read_fwf` | loads data in fixed-width column format (fwf) |
`read_clipboard` | reads data from the clipboard and passes it to `read_csv` |
`read_excel` | reads table data from an Excel XLS or XLSX file |
`read_hdf` | reads HDF5 files |
`read_html` | reads all tables from the specified HTML document |
`read_json` | reads data from a JSON file |
`read_feather` | reads the Feather binary file format |
`read_orc` | reads Apache ORC binary data |
`read_parquet` | reads the Apache Parquet binary file format |
`read_pickle` | reads any object stored in Python pickle format |
`read_sas` | reads a SAS data set |
`read_spss` | reads a data file created by SPSS |
`read_sql` | reads the results of an SQL query (with SQLAlchemy) as a pandas DataFrame |
`read_sql_table` | reads an entire SQL table (with SQLAlchemy) as a pandas DataFrame (corresponds to a query that selects everything in this table) |
`read_stata` | reads a data set from the Stata file format |
See also
- pandas I/O API
The pandas I/O API is a collection of `reader` functions that return a pandas object. In most cases, corresponding `writer` methods are also available.
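As a minimal sketch of how a reader function and its writer counterpart fit together, the following round trip writes a small DataFrame to CSV text in memory with `to_csv` and reads it back with `read_csv`. The column names and values are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Small illustrative DataFrame (names and values are invented for this sketch).
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

# Writer method: serialise the DataFrame to CSV text in an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reader function: rewind the buffer and parse the text back into a DataFrame.
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

Passing `index=False` keeps the row index out of the CSV text, so the reader does not have to be told which column holds it.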
First, I will give an overview of some of these functions that are designed to convert text and Excel data into a pandas DataFrame, namely those for CSV, JSON and Excel. The optional arguments for these functions can be divided into the following categories:
- Indexing
Whether one or more columns should index the returned DataFrame, and whether the column names should be taken from the file, from the arguments you specify, or not at all.
- Type inference and data conversion
This includes custom value conversions and custom lists of missing-value markers.
- Date and time parsing
This includes the ability to combine date and time information spread across multiple columns into a single column in the result.
- Iteration
Support for iteration over parts of very large files.
- Problems with unclean data
Skipping rows, footers, comments or other minor details such as numeric data with thousands separated by commas.
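The categories above can be illustrated with a short `read_csv` sketch. The CSV text, column names and missing-value markers below are hypothetical and only serve the example. The date and time columns are combined after loading with `pd.to_datetime`, which is a choice made for this sketch rather than the only way to do it.

```python
import io

import pandas as pd

# Hypothetical messy CSV text, invented for this sketch.
raw = """# exported by a made-up tool
id,date,time,amount,status
1,2024-01-01,08:30,"1,200",ok
2,2024-01-02,09:15,"3,450",unknown
3,2024-01-03,10:00,missing,ok
"""

df = pd.read_csv(
    io.StringIO(raw),
    comment="#",                       # unclean data: ignore comment lines
    index_col="id",                    # indexing: use the id column as index
    thousands=",",                     # unclean data: parse "1,200" as 1200
    na_values=["missing", "unknown"],  # conversion: extra missing-value markers
)

# Date and time parsing: merge the two text columns into one datetime column.
df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"])

# Iteration: read the same data in chunks of at most two rows each.
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(raw), comment="#", chunksize=2)
]

print(df)
print(chunk_sizes)
```

With `chunksize` set, `read_csv` returns an iterator of DataFrames instead of a single DataFrame, so a very large file never has to fit into memory at once.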
Since data can be very messy in the real world, some of the data loading
functions (especially `read_csv`) have accumulated a long list of optional
arguments over time. The online documentation for pandas contains many examples
of each function.
Some of these functions, like `pandas.read_csv`, perform type inference because
the data types of the columns are not part of the data format. This means that
you don’t necessarily have to specify which columns are numeric, integer, boolean
or string. With other data formats such as HDF5, ORC and Parquet, however, the
data type information is already embedded in the format.
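To make the type inference concrete, here is a small sketch that reads a CSV text without any type hints and then reads it again with an explicit `dtype` for one column. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text; nothing in it declares the column types.
raw = """count,price,active,label
1,1.5,True,x
2,2.5,False,y
"""

# Let read_csv infer a type for each column from the values it finds.
inferred = pd.read_csv(io.StringIO(raw))
print(inferred.dtypes)

# Override the inference for a single column with the dtype argument.
explicit = pd.read_csv(io.StringIO(raw), dtype={"count": "float64"})
print(explicit.dtypes)
```

Columns that are not listed in `dtype` are still inferred, so the override can stay limited to the columns where the inferred type is not the one you want.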