Create a project
DVC can be easily initialised with:
$ mkdir -p dvc-example/data
$ cd dvc-example
$ git init
$ dvc init
$ git add .dvc
$ git commit -m "Initialise DVC"
dvc init
    creates a directory .dvc/ with config, .gitignore and cache directory.
git commit
    puts .dvc/config and .dvc/.gitignore under version control.
Configure
Before DVC is used, a remote storage should be set up. It should be accessible to everyone who needs access to the data or the model. This is similar to using a Git server. Often, however, it is simply an NFS mount, which can be integrated as follows, for example:
$ sudo mkdir -p /var/dvc-storage
$ dvc remote add -d local /var/dvc-storage
Setting 'local' as a default remote.
$ git commit .dvc/config -m "Configure local remote"
[master efaeb84] Configure local remote
1 file changed, 4 insertions(+)
-d, --default
    Sets this remote as the default remote storage
local
    Name of the remote location
/var/dvc-storage
    URL of the remote location
In addition, other protocols are supported, which are prefixed to the path, including ssh:, hdfs: and https:.
Another remote data storage can simply be added, for example with:
$ dvc remote add webserver https://dvc.example.org/myproject
The associated configuration file .dvc/config looks like this:
['remote "local"']
url = /var/dvc-storage
[core]
remote = local
['remote "webserver"']
url = https://dvc.example.org/myproject
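The default remote is the value of the remote key in the [core] section. The following is a minimal sketch of reading it with awk. It assumes the example configuration shown above and writes it to a temporary file as a stand-in for .dvc/config; the variable names are purely illustrative.

```shell
# Write the example configuration to a temporary file
# (a stand-in for .dvc/config, so no DVC project is needed)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
['remote "local"']
url = /var/dvc-storage
[core]
remote = local
['remote "webserver"']
url = https://dvc.example.org/myproject
EOF

# Track whether the current line belongs to the [core] section and
# print the value of its "remote" key
default_remote=$(awk '
    /^\[/ { in_core = ($0 == "[core]"); next }
    in_core && $1 == "remote" { print $3 }
' "$cfg")

echo "default remote: $default_remote"
```

In a real project, dvc remote default should report the same name without any parsing.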
Add data and directories
With DVC you can save and version files, ML models, directories and intermediate results with Git without having to check the file content into Git:
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
-o data/data.xml
$ dvc add data/data.xml
This adds the file data/data.xml to data/.gitignore and writes the meta information to data/data.xml.dvc. Further information on the format of *.dvc files can be found under DVC-File Format.
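Whether Git really ignores a tracked data file can be checked with git check-ignore. The sketch below does this in a throwaway repository. It does not call DVC: the .gitignore entry is written by hand, on the assumption that dvc add writes one anchored entry per tracked file, and the file content is a placeholder.

```shell
# Throwaway repository in a temporary directory (purely illustrative)
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir -p data
echo "<root/>" > data/data.xml

# Stand-in for the entry that `dvc add data/data.xml` is assumed to write
echo "/data.xml" > data/.gitignore

# Ask Git which rule, if any, causes the data file to be ignored
ignored_by=$(git check-ignore -v data/data.xml)
echo "$ignored_by"
```

The output names the .gitignore file and the line of the matching rule, which helps when a data file unexpectedly shows up in git status.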
In order to be able to manage different versions of your project data with Git, you only have to add the DVC file:
$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "Add raw data to project"
Store and retrieve data
The data can be copied from the working directory of your Git repository to the remote storage space with
$ dvc push
If you want to retrieve more recent data, you can do so with
$ dvc pull
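The interplay of dvc push and dvc pull can be tried out without a server. The following sketch runs a round trip through a throwaway local remote. All directories, the file data/sample.txt and its content are hypothetical; the block is skipped if dvc or git is not installed.

```shell
work=$(mktemp -d)      # hypothetical project directory
remote=$(mktemp -d)    # hypothetical remote storage directory
cd "$work"
roundtrip="skipped"

if command -v dvc >/dev/null 2>&1 && command -v git >/dev/null 2>&1; then
    git init -q
    dvc init -q
    dvc remote add -q -d local "$remote"

    mkdir -p data
    echo "example payload" > data/sample.txt
    dvc add -q data/sample.txt     # writes data/sample.txt.dvc
    dvc push -q                    # uploads the cached file to the remote

    # Simulate a fresh checkout: drop the local cache and the working copy
    rm -rf .dvc/cache data/sample.txt
    dvc pull -q                    # restores data/sample.txt from the remote

    roundtrip=$(cat data/sample.txt)
fi

echo "$roundtrip"
```

Because the data file itself never enters Git, the pull only works as long as data/sample.txt.dvc is present; in a shared project this file is what gets committed.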
Import and update
You can also import data and models from another project with the command dvc import, for example:
$ dvc import https://github.com/iterative/dataset-registry get-started/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data.xml'
This loads the file from the dataset-registry into the current working directory, adds it to .gitignore and creates data.xml.dvc.
With dvc update we can update these data sources before we reproduce a pipeline that depends on them, for example:
$ dvc update data.xml.dvc
Stage 'data.xml.dvc' didn't change.
Saving information to 'data.xml.dvc'.