Pipelines
Connecting code and data
Commands such as dvc add, dvc push, and dvc pull can be executed
independently of changes in the Git repository and therefore only provide the
basis for managing large amounts of data and models. To achieve truly
reproducible results, code and data must be connected.
Design: André Henze, Berlin
With dvc stage add, you can create individual processing stages, each of which
is described by a source code file managed with Git, as well as other
dependencies and output data. All stages together then form the DVC pipeline.
In our example dvc-example, the first stage is to split the data into training and test data:
$ uv run dvc stage add \
    -n prepare \
    -p prepare.seed,prepare.split \
    -d src/dvc_example/prepare.py -d data/data.xml \
    -o data/prepared \
    uv run python src/dvc_example/prepare.py data/data.xml
-n specifies the name of the processing stage.
-p specifies the parameters from the params.yaml file to be used for this stage.
-d specifies dependencies for the reproducible command. The next time dvc repro is called to reproduce the results, DVC checks these dependencies and decides whether the stage is still up to date or needs to be re-executed to obtain current results.
-o specifies the output file or output directory.
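The dependency check that dvc repro performs can be pictured as a comparison of content hashes. The following is only a simplified sketch of that idea, not DVC's actual implementation: the function names content_hash and stage_is_up_to_date are hypothetical, SHA-256 is an arbitrary choice for illustration, and the recorded mapping stands in for what DVC stores after a successful run.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a hex digest of the file content (SHA-256 chosen for illustration)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_up_to_date(deps: list[Path], recorded: dict[str, str]) -> bool:
    """Compare the current dependency hashes with those recorded after the last run."""
    current = {str(dep): content_hash(dep) for dep in deps}
    return current == recorded
```

Under these assumptions, a stage would only need to be re-executed when at least one dependency, such as data/data.xml or src/dvc_example/prepare.py, has changed since the hashes were recorded.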
The generated dvc.yaml file then looks like this:
stages:
  prepare:
    cmd: uv run python src/dvc_example/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/dvc_example/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
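The stage definition fixes the interface of prepare.py: it receives data/data.xml as an argument, depends on prepare.seed and prepare.split, and writes to data/prepared. The script itself is not shown here, so the following is merely a sketch of what its splitting logic might look like. The function name split_records is hypothetical, and interpreting prepare.split as the fraction of records that goes into the test set is an assumption.

```python
import random


def split_records(
    records: list[str], split: float, seed: int
) -> tuple[list[str], list[str]]:
    """Shuffle records reproducibly and return (train, test).

    Assumption: `split` is the fraction of records assigned to the test set.
    """
    if not 0.0 <= split <= 1.0:
        raise ValueError("split must be between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * split)
    return shuffled[n_test:], shuffled[:n_test]
```

In the stage, the two values would come from params.yaml, and the returned lists would be written to train.tsv and test.tsv in data/prepared. Because the seed is a tracked parameter, the same input and the same parameters yield the same split.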
If you now call uv run dvc repro, the files test.tsv and
train.tsv will be created in data/prepared, and dvc.lock
will be written. The directory structure will then look like this:
├── .dvc
├── .dvcignore
├── .git
├── .gitignore
├── .pre-commit-config.yaml
├── .python-version
├── .venv
├── README.md
├── data
│   ├── .gitignore
│   ├── data.xml
│   ├── data.xml.dvc
│   └── prepared
│       ├── test.tsv
│       └── train.tsv
├── dvc.lock
├── dvc.yaml
├── params.yaml
├── pyproject.toml
├── src
│   └── dvc_example
│       ├── __init__.py
│       └── prepare.py
└── uv.lock
The new and modified programme and configuration files should now be managed in Git:
$ git add data/.gitignore dvc.lock dvc.yaml src/dvc_example/prepare.py
$ git commit -m ':sparkles: Add prepare step'
Hint
DVC knows from the pipeline definition that the files in
data/prepared were generated in the prepare stage. You can
regenerate them at any time with uv run dvc repro or restore them from the
cache with uv run dvc checkout.