Create a project#

DVC can be easily initialised with:

$ mkdir -p dvc-example/data
$ cd dvc-example
$ git init
$ dvc init
$ git add .dvc
$ git commit -m "Initialise DVC"

dvc init
    creates a directory .dvc/ containing config, .gitignore and the cache directory.
git commit
    puts .dvc/config and .dvc/.gitignore under version control.

Configure#

Before DVC is used, a remote storage should be set up. It should be accessible to everyone who needs access to the data or the model, similar to a Git server. Often this is simply an NFS mount, which can be integrated as follows, for example:

$ sudo mkdir -p /var/dvc-storage
$ dvc remote add -d local /var/dvc-storage
Setting 'local' as a default remote.
$ git commit .dvc/config -m "Configure local remote"
[master efaeb84] Configure local remote
 1 file changed, 4 insertions(+)

-d, --default
    Makes this remote the default remote storage
local
    Name of the remote location
/var/dvc-storage
    URL of the remote location

Other protocols are also supported; they are given as a prefix to the path, for example ssh:, hdfs: and https:.

Further remote storage locations can be added in the same way, for example with:

$ dvc remote add webserver https://dvc.example.org/myproject

The associated configuration file .dvc/config looks like this:

['remote "local"']
url = /var/dvc-storage
[core]
remote = local
['remote "webserver"']
url = https://dvc.example.org/myproject
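
The layout of this file can be explored with standard tools. The following is a minimal sketch that assumes the config has exactly the shape shown above; the function name list_remotes and the temporary file are hypothetical and not part of DVC. In everyday use, dvc remote list gives a similar overview.

```shell
#!/bin/sh
# Sketch: list remote names and URLs from a DVC-style config file.
# The content is copied from the example above into a temporary file.
config=$(mktemp)
cat > "$config" <<'EOF'
['remote "local"']
url = /var/dvc-storage
[core]
remote = local
['remote "webserver"']
url = https://dvc.example.org/myproject
EOF

# list_remotes is a hypothetical helper, not a DVC command.
# It remembers the NAME of each ['remote "NAME"'] section header and prints
# it together with the url value found inside that section.
list_remotes() {
    awk -F'"' '
        /^\[.remote "/ { name = $2; next }
        /^\[/          { name = ""; next }
        name != "" && /^[ \t]*url[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            print name "\t" $0
        }
    ' "$1"
}

remotes=$(list_remotes "$config")
printf '%s\n' "$remotes"
```

Each output line holds the remote name and its URL, separated by a tab. Sections that are not remotes, such as [core], are skipped.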

Add data and directories#

With DVC you can store and version files, ML models, directories and intermediate results alongside Git without having to check the file contents into Git:

$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
    -o data/data.xml
$ dvc add data/data.xml

This adds the file data/data.xml to data/.gitignore and writes the metadata to data/data.xml.dvc. Further information on the format of *.dvc files can be found under DVC-File Format.
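
To give a rough idea of why Git only needs the small *.dvc file, the following sketch builds a comparable pointer record by hand. It is an illustration only and does not call DVC: the field names md5, size and path are an assumption and vary between DVC versions, the file content is a stand-in, and md5sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Illustration only: mimic the kind of pointer record that `dvc add` writes.
# Field names (md5, size, path) are assumptions; real *.dvc files may differ.
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
printf 'hello\n' > "$workdir/data/data.xml"    # stand-in for the real data file

file="$workdir/data/data.xml"
md5=$(md5sum "$file" | cut -d' ' -f1)          # content hash identifies this version
size=$(wc -c < "$file" | tr -d ' ')            # size in bytes

# The record stays a few lines long no matter how large the data file is.
pointer=$(printf 'outs:\n- md5: %s\n  size: %s\n  path: %s\n' \
    "$md5" "$size" "$(basename "$file")")
printf '%s\n' "$pointer"
```

Because the record changes whenever the file content changes, committing it to Git is enough to pin a specific version of the data.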

In order to be able to manage different versions of your project data with Git, you only have to add the DVC file:

$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "Add raw data to project"

Store and retrieve data#

The data can be copied from the working directory of your Git repository to the remote storage space with

$ dvc push

If you want to retrieve more recent data, you can do so with

$ dvc pull
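
The round trip can be tried without a server. The following is a sketch that assumes dvc and git are installed; it uses two temporary directories (a throwaway project and a throwaway local remote, both hypothetical) and skips itself if either tool is missing.

```shell
#!/bin/sh
# Round-trip sketch: push a file to a throwaway local remote, delete it
# locally, then pull it back. All paths are temporary and hypothetical.
result=skipped
if command -v dvc >/dev/null 2>&1 && command -v git >/dev/null 2>&1; then
    project=$(mktemp -d)
    storage=$(mktemp -d)
    cd "$project" || exit 1

    git init -q
    dvc init -q
    dvc remote add -q -d local "$storage"

    mkdir -p data
    printf '<data/>\n' > data/data.xml     # stand-in for real project data
    dvc add -q data/data.xml
    dvc push -q                            # copy the data to the remote

    rm -rf data/data.xml .dvc/cache        # simulate a fresh clone without data
    dvc pull -q                            # fetch the data back from the remote

    if [ -f data/data.xml ]; then result=ok; else result=failed; fi
fi
echo "round trip: $result"
```

Using a temporary directory as the remote avoids the sudo step and keeps the experiment away from any real storage location.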

Import and update#

You can also import data and models from another project with the command dvc import, for example:

$ dvc import https://github.com/iterative/dataset-registry get-started/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data.xml'

This downloads the file from the dataset-registry into the current working directory, adds it to .gitignore and creates data.xml.dvc.

With dvc update we can refresh these data sources before reproducing a pipeline that depends on them, for example:

$ dvc update data.xml.dvc
Stage 'data.xml.dvc' didn't change.
Saving information to 'data.xml.dvc'.