Intake for data engineers

Intake supports data engineers in providing data: specifying data sources, distributing the data, parameterising user options, and so on. This makes it easier for data scientists to access the data afterwards, because the available options are already specified in the catalog.

[1]:
import hvplot.pandas
import intake


intake.output_notebook()

Intake data sets are loaded with so-called drivers. Some ship with the intake package, while others have to be installed as plug-ins. You can display the available drivers as follows:

[2]:
list(intake.registry)
[2]:
['parquet',
 'alias',
 'catalog',
 'csv',
 'intake_remote',
 'json',
 'jsonl',
 'ndzarr',
 'numpy',
 'textfiles',
 'tiled',
 'tiled_cat',
 'yaml_file_cat',
 'yaml_files_cat',
 'zarr_cat']

Each of these drivers has a corresponding intake.open_* function. Drivers can also be referred to by their fully qualified name (e.g. package.submodule.DriverClass). In the following example, however, we will focus on the csv driver, which is included in the standard Intake installation.
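
For example, the csv driver from the list above corresponds to intake.open_csv; a minimal sketch (assuming the registry can be indexed by driver name, as the list above suggests) that simply prints what both refer to:

import intake

# The "csv" registry entry and the intake.open_csv function refer to the
# same driver class (intake.source.csv.CSVSource, which also appears in
# the YAML output further below).
print(intake.registry["csv"])
print(intake.open_csv)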

In general, the first step in writing a catalog entry is to use the appropriate open_* function to create a DataSource object:

[3]:
source = intake.open_csv(
    "https://timeseries.weebly.com/uploads/" "2/1/0/8/21086414/sea_ice.csv"
)

The call above has created a DataSource object, but it has not yet checked whether the data can actually be accessed. To test whether loading really works, we can inspect the source (source.discover()) or read it (source.read()):

[4]:
source.discover()
[4]:
{'dtype': {'Time': 'object', 'Arctic': 'float64', 'Antarctica': 'float64'},
 'shape': (None, 3),
 'npartitions': 1,
 'metadata': {}}
[5]:
df = source.read()

df.head()
[5]:
      Time  Arctic  Antarctica
0  1990M01   12.72        3.27
1  1990M02   13.33        2.15
2  1990M03   13.44        2.71
3  1990M04   12.16        5.10
4  1990M05   10.84        7.37
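
The Time column is stored as plain strings in the form 1990M01 (dtype object above). If a genuine time axis is needed, the column can be parsed first; a small sketch, with the format string assumed from the values shown above:

import pandas as pd

# Optional: parse the "1990M01"-style strings into datetimes
# (format assumed from the values above; the plot below also works with
# the unparsed strings)
df["Time"] = pd.to_datetime(df["Time"], format="%YM%m")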

After we have determined that the data can be loaded as desired, we want to explore the data visually:

[6]:
df.hvplot(
    kind="line", x="Time", y=["Arctic", "Antarctica"], width=700, height=500
)
[6]:

We can now load the source correctly and also get a plot for exploring the data. This recipe can be expressed in YAML syntax with:

[7]:
print(source.yaml())
sources:
  csv:
    args:
      urlpath: https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

Finally, we can create a YAML file that contains this recipe together with an additional description and the plot we just tested:

[8]:
%%writefile sea.yaml
sources:
    sea_ice:
      args:
        urlpath: "https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv"
      description: "Polar sea ice cover"
      driver: csv
      metadata:
        plots:
          basic:
            kind: line
            x: Time
            y: [Arctic, Antarctica]
            width: 700
            height: 500
Overwriting sea.yaml

To check that the YAML file also works, we can load it and try to work with it:

[9]:
cat = intake.open_catalog("sea.yaml")
[10]:
cat.sea_ice.plot.basic()
[10]:

The catalog appears to be functional and can now be shared. The easiest way to share an Intake catalog is to place it somewhere your target audience can read it. Since this tutorial is stored in a Git repository, that can simply be the URL of the file in the repository; all you have to share with your users is that URL. You can try this yourself with:

[11]:
cat = intake.open_catalog(
    "https://raw.githubusercontent.com/veit/Python4DataScience/main/docs/data-processing/intake/sea.yaml"
)
[12]:
cat.sea_ice.read().head()
[12]:
      Time  Arctic  Antarctica
0  1990M01   12.72        3.27
1  1990M02   13.33        2.15
2  1990M03   13.44        2.71
3  1990M04   12.16        5.10
4  1990M05   10.84        7.37
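
Users who only receive this URL can explore the catalog in the same way; a brief sketch using the cat object opened above:

# List the entries of the shared catalog and inspect one of them
print(list(cat))        # the catalog contains the single entry 'sea_ice'
cat.sea_ice.discover()  # dtypes, shape and metadata, as above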

Note

This catalog is also a DataSource instance, i.e. you can reference it from other catalogs and thus build a hierarchy of data sources. For example, you can have a master or main catalog that references several other catalogs, each with entries of a certain type, and the whole thing can then be searched, for example with the Intake-GUI. This gives the overall data inventory a structure that makes it easier to navigate to the right data set. You can even have separate hierarchies that reference the same data.

[13]:
print(cat.yaml())
sources:
  sea:
    args:
      path: https://raw.githubusercontent.com/veit/Python4DataScience/main/docs/data-processing/intake/sea.yaml
    description: ''
    driver: intake.catalog.local.YAMLFileCatalog
    metadata: {}
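
The entry printed above could, for example, be placed in such a master catalog. A minimal sketch, assuming a hypothetical file name master.yaml and using the yaml_file_cat short name for the intake.catalog.local.YAMLFileCatalog driver shown above:

%%writefile master.yaml
sources:
    sea:
      description: "Catalog with polar sea ice data"
      # short name for intake.catalog.local.YAMLFileCatalog
      driver: yaml_file_cat
      args:
        path: "https://raw.githubusercontent.com/veit/Python4DataScience/main/docs/data-processing/intake/sea.yaml"

After opening it with intake.open_catalog("master.yaml"), the nested catalog should be reachable as cat.sea and the data set itself as cat.sea.sea_ice.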