Pipelines¶

Connecting code and data¶

Commands such as dvc add, dvc push, and dvc pull can be executed independently of changes in the Git repository and therefore only provide the basis for managing large amounts of data and models. To achieve truly reproducible results, code and data must be connected.

Connecting Git and DVC — Design: André Henze, Berlin¶

With dvc stage, you can create individual processing stages, each of which is described by a source code file managed with Git, as well as other dependencies and output data. All stages together then form the DVC pipeline.

In our example dvc-example, the first stage is to split the data into training and test data:

$ uv run dvc stage add \
    -n prepare \
    -p prepare.seed,prepare.split \
    -d src/dvc_example/prepare.py -d data/data.xml \
    -o data/prepared \
    uv run python src/dvc_example/prepare.py data/data.xml

-n

specifies the name of the processing stage.

-p

specifies the parameters from the params.yaml file to be used for this stage.

See also

Parameterisation

-d

specifies dependencies for the reproducible command.

When dvc repro is called to reproduce the results next time, DVC checks these dependencies and decides whether they are up to date or need to be re-executed to obtain more recent results.

-o

specifies the output file or output directory.

The generated dvc.yaml file then looks like this:

stages:
  prepare:
    cmd: uv run python src/dvc_example/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/dvc_example/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared

If you now call uv run dvc repro, the files test.tsv and train.tsv will be created in data/prepared, and dvc.lock will be written. The directory structure will then look like this:

├── .dvc
├── .dvcignore
├── .git
├── .gitignore
├── .pre-commit-config.yaml
├── .python-version
├── .venv
├── README.md
├── data
│   ├── .gitignore
│   ├── data.xml
│   ├── data.xml.dvc
│   └── prepared
│       ├── test.tsv
│       └── train.tsv
├── dvc.lock
├── dvc.yaml
├── params.yaml
├── pyproject.toml
├── src
│   └── dvc_example
│       ├── __init__.py
│       └── prepare.py
└── uv.lock