Intake for data engineers
Intake supports data engineers in providing data: specifying data sources, distributing the data, parameterising user options, and so on. This makes it easier for data scientists to access the data afterwards, because the available options are already specified in the catalog.
[1]:
import hvplot.pandas
import intake
intake.output_notebook()
Intake data sets are loaded with so-called drivers. Some come with the Intake package, while others have to be installed as separate plug-ins. You can display the available drivers as follows:
[2]:
list(intake.registry)
[2]:
['parquet',
'alias',
'catalog',
'csv',
'intake_remote',
'json',
'jsonl',
'ndzarr',
'numpy',
'textfiles',
'tiled',
'tiled_cat',
'yaml_file_cat',
'yaml_files_cat',
'zarr_cat']
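Drivers that do not appear in this list usually come from separate plug-in packages, which have to be installed first. A minimal sketch (the plug-in name intake-parquet is just one example of such a package):
!python -m pip install intake-parquet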
Each of these drivers is assigned an intake.open_* function. It is also possible to refer to drivers by the fully qualified name (e.g. package.submodule.DriverClass). In the following example, however, we will focus on the csv driver that is included in the standard Intake installation.
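For the csv driver, these two ways of creating a source are equivalent; a minimal sketch (the file name data.csv is only a placeholder):
from intake.source.csv import CSVSource

source_a = intake.open_csv("data.csv")    # short form via the generated open_* function
source_b = CSVSource(urlpath="data.csv")  # the same driver addressed by its class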
In general, the first step in writing a catalog entry is to use the appropriate open_* function to create a DataSource object:
[3]:
source = intake.open_csv(
"https://timeseries.weebly.com/uploads/" "2/1/0/8/21086414/sea_ice.csv"
)
The call above has now created a DataSource object, but has not yet checked whether the data can actually be accessed. To test whether loading really succeeds, the source itself can be opened (source.discover) or read (source.read):
[4]:
source.discover()
[4]:
{'dtype': {'Time': 'object', 'Arctic': 'float64', 'Antarctica': 'float64'},
'shape': (None, 3),
'npartitions': 1,
'metadata': {}}
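The first entry of shape is None because discover() only determines the schema and metadata; the number of rows only becomes known once the data has actually been read. A small sketch of the difference, reusing the source object from above:
info = source.discover()
print(info["shape"])        # (None, 3): the number of rows is not yet known
print(source.read().shape)  # after a full read, the row count is filled in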
[5]:
df = source.read()
df.head()
[5]:
|   | Time    | Arctic | Antarctica |
|---|---------|--------|------------|
| 0 | 1990M01 | 12.72  | 3.27       |
| 1 | 1990M02 | 13.33  | 2.15       |
| 2 | 1990M03 | 13.44  | 2.71       |
| 3 | 1990M04 | 12.16  | 5.10       |
| 4 | 1990M05 | 10.84  | 7.37       |
After we have confirmed that the data can be loaded as desired, we want to explore the data visually:
[6]:
df.hvplot(
kind="line", x="Time", y=["Arctic", "Antarctica"], width=700, height=500
)
[6]:
We can now load the source correctly and also produce a graphical view for exploring the data. This recipe can be displayed in YAML syntax with:
[7]:
print(source.yaml())
sources:
csv:
args:
urlpath: https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv
description: ''
driver: intake.source.csv.CSVSource
metadata: {}
Finally, we can create a YAML file containing this recipe, together with an additional description and the plot we have just tested:
[8]:
%%writefile sea.yaml
sources:
sea_ice:
args:
urlpath: "https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv"
description: "Polar sea ice cover"
driver: csv
metadata:
plots:
basic:
kind: line
x: Time
y: [Arctic, Antarctica]
width: 700
height: 500
Overwriting sea.yaml
To check that the YAML file also works, we can load it again and try to work with it:
[9]:
cat = intake.open_catalog("sea.yaml")
[10]:
cat.sea_ice.plot.basic()
[10]:
The catalog appears to work and can now be shared. The easiest way to share an Intake catalog is to put it in a place where your target audience can read it. Since this tutorial is stored in a Git repository, that place can simply be the URL of the file in the repository. All you have to share with your users is the URL of the catalog. You can try this yourself with:
[11]:
cat = intake.open_catalog(
"https://raw.githubusercontent.com/veit/Python4DataScience/main/docs/data-processing/intake/sea.yaml"
)
[12]:
cat.sea_ice.read().head()
[12]:
|   | Time    | Arctic | Antarctica |
|---|---------|--------|------------|
| 0 | 1990M01 | 12.72  | 3.27       |
| 1 | 1990M02 | 13.33  | 2.15       |
| 2 | 1990M03 | 13.44  | 2.71       |
| 3 | 1990M04 | 12.16  | 5.10       |
| 4 | 1990M05 | 10.84  | 7.37       |
Note
This catalog is also a DataSource instance, which means you can refer to it from other catalogs and thus build a hierarchy of data sources. For example, you could have a main catalog that references several other catalogs, each with entries of a certain type, and the whole thing could be searched, for example, with the Intake GUI. In this way, the overall collection of data gets a structure that makes it easier to navigate to the right data set. You can even have separate hierarchies that reference the same data. A sketch of such a main catalog follows at the end of this section.
[13]:
print(cat.yaml())
sources:
sea:
args:
path: https://raw.githubusercontent.com/veit/Python4DataScience/main/docs/data-processing/intake/sea.yaml
description: ''
driver: intake.catalog.local.YAMLFileCatalog
metadata: {}
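Such a main catalog is itself just a catalog file whose entries use a catalog driver. The following is only a minimal sketch; main.yaml and the second entry are invented purely for illustration:
%%writefile main.yaml
sources:
  sea:
    description: "Sea ice catalog from this tutorial"
    driver: yaml_file_cat
    args:
      path: "sea.yaml"
  more_data:
    description: "A further, purely hypothetical sub-catalog"
    driver: yaml_file_cat
    args:
      path: "https://example.com/more.yaml"
After opening it with intake.open_catalog("main.yaml"), the data set from this tutorial would then be reachable as, for example, cat.sea.sea_ice.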