Create a project

DVC can be easily initialised with:

$ mkdir -p dvc-example/data
$ cd dvc-example
$ git init
$ dvc init
$ git add .dvc
$ git commit -m "Initialise DVC"
dvc init
    Creates the directory .dvc/ with config, .gitignore and the cache directory.

git commit
    Puts .dvc/config and .dvc/.gitignore under version control.
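Scripts that rely on DVC can check up front whether a directory has already been initialised. The following is a minimal sketch that only tests for the .dvc/config file mentioned above; the function name is_dvc_project is a hypothetical helper, not part of DVC.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): succeeds if DIR contains the
# .dvc/config file that `dvc init` creates. DIR defaults to the current directory.
is_dvc_project() {
    dir="${1:-.}"
    [ -f "$dir/.dvc/config" ]
}

if is_dvc_project .; then
    echo "DVC is initialised here"
else
    echo "No .dvc/config found; run 'dvc init' first"
fi
```

The check is deliberately simple: it does not verify that the configuration is valid, only that the file exists.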


Before DVC is used, a remote storage should be set up. It should be accessible to everyone who needs access to the data or the model, much like a Git server. Often, however, it is simply an NFS mount, which can be integrated as follows, for example:

$ sudo mkdir -p /var/dvc-storage
$ dvc remote add -d local /var/dvc-storage
Setting 'local' as a default remote.
$ git commit .dvc/config -m "Configure local remote"
[master efaeb84] Configure local remote
 1 file changed, 4 insertions(+)
-d, --default
    Sets this remote as the default remote storage.

local
    Name of the remote location

/var/dvc-storage
    URL of the remote location
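Because the remote must be accessible to everyone who works with the data, it can help to check the storage directory before relying on it. A directory created with sudo mkdir is typically owned by root, so other users may not be able to write to it. The sketch below assumes a directory-based remote such as /var/dvc-storage; the function name storage_is_usable is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if DIR exists and the current user may write to it.
storage_is_usable() {
    dir="$1"
    [ -d "$dir" ] && [ -w "$dir" ]
}

if storage_is_usable /var/dvc-storage; then
    echo "/var/dvc-storage is ready to be used as a DVC remote"
else
    echo "/var/dvc-storage is missing or not writable for $(id -un)" >&2
fi
```

If the check fails, adjusting the ownership or group permissions of the directory is usually the next step; which policy is appropriate depends on your environment.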

Other protocols are also supported; they are specified as a prefix to the path, including ssh:, hdfs: and https:.

Another remote data storage can simply be added, for example with:

$ dvc remote add webserver

The associated configuration file .dvc/config looks like this:

['remote "local"']
url = /var/dvc-storage
[core]
remote = local
['remote "webserver"']
url =

Add data and directories

With DVC you can store and version files, ML models, directories and intermediate results alongside Git without having to check the file contents into Git:

$ dvc get get-started/data.xml \
    -o data/data.xml
$ dvc add data/data.xml

This adds the file data/data.xml to data/.gitignore and writes the metadata to data/data.xml.dvc. Further information on the format of *.dvc files can be found under DVC-File Format.

To manage different versions of your project data with Git, you only have to add the .dvc file and the .gitignore file:

$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "Add raw data to project"
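To confirm that a file is versioned through DVC rather than checked into Git directly, you can test for the two effects of dvc add described above. This is a minimal sketch that assumes it is run inside the Git repository; is_dvc_tracked is a hypothetical helper name, while git check-ignore is a standard Git command.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if FILE has a companion FILE.dvc metadata file
# and Git ignores FILE itself, which is the state `dvc add FILE` leaves behind.
is_dvc_tracked() {
    file="$1"
    [ -f "$file.dvc" ] && git check-ignore -q "$file" 2>/dev/null
}

if is_dvc_tracked data/data.xml; then
    echo "data/data.xml is versioned through its .dvc file"
else
    echo "data/data.xml is not tracked by DVC (or not ignored by Git)"
fi
```

Using git check-ignore instead of searching .gitignore by hand means the check does not depend on the exact pattern written to the ignore file.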

Store and retrieve data

The data can be copied from the working directory of your Git repository to the remote storage with

$ dvc push

If you want to retrieve more recent data, you can do so with

$ dvc pull

Import and update

You can also import data and models from another project with the command dvc import, for example:

$ dvc import  get-started/data.xml
Importing 'get-started/data.xml (' -> 'data.xml'

This loads the file from the dataset-registry into the current working directory, adds it to .gitignore and creates data.xml.dvc.

With dvc update we can refresh these data sources before we reproduce a pipeline that depends on them, for example:

$ dvc update data.xml.dvc
Stage 'data.xml.dvc' didn't change.
Saving information to 'data.xml.dvc'.