Create a project#
DVC can be easily initialised with:
$ mkdir -p dvc-example/data
$ cd dvc-example
$ git init
$ dvc init
$ git add .dvc
$ git commit -m "Initialise DVC"
dvc init creates a directory .dvc/. The subsequent git commit puts its tracked contents, such as .dvc/.gitignore, under version control.
Before DVC is used, a remote storage should also be set up. It should be accessible to everyone who needs access to the data or the model, much like a Git server. Often, however, it is simply an NFS mount, which can be configured as follows, for example:
$ sudo mkdir -p /var/dvc-storage
$ dvc remote add -d local /var/dvc-storage
Setting 'local' as a default remote.
$ git commit .dvc/config -m "Configure local remote"
[master efaeb84] Configure local remote
 1 file changed, 4 insertions(+)
-d
    Makes this remote the default remote storage.
local
    Name of the remote.
/var/dvc-storage
    URL of the remote, here a local path.
In addition, other protocols are supported; the protocol is prefixed to the path, for example https://.
Another remote data storage can simply be added, for example with:
$ dvc remote add webserver https://dvc.example.org/myproject
The associated configuration file .dvc/config looks like this:
['remote "local"']
    url = /var/dvc-storage
[core]
    remote = local
['remote "webserver"']
    url = https://dvc.example.org/myproject
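Since .dvc/config uses an INI-like syntax, the configured remotes can also be read programmatically. The following is a minimal sketch using only Python's standard library configparser. DVC itself may parse the file differently, and the helper name list_remotes is hypothetical:

```python
import configparser
import re

# Example content mirroring the .dvc/config shown above.
CONFIG_TEXT = """\
['remote "local"']
    url = /var/dvc-storage
[core]
    remote = local
['remote "webserver"']
    url = https://dvc.example.org/myproject
"""

# Section names look like: 'remote "local"' (including the single quotes).
REMOTE_SECTION = re.compile(r"""^'remote "(?P<name>.+)"'$""")


def list_remotes(config_text):
    """Return (remotes, default) parsed from the text of a .dvc/config file.

    Hypothetical helper for illustration; not part of DVC.
    """
    # Disable interpolation so that '%' characters in URLs are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    remotes = {}
    for section in parser.sections():
        match = REMOTE_SECTION.match(section)
        if match:
            remotes[match.group("name")] = parser[section].get("url")

    # The default remote is stored under [core]; it may be absent.
    default = parser.get("core", "remote", fallback=None)
    return remotes, default


if __name__ == "__main__":
    remotes, default = list_remotes(CONFIG_TEXT)
    for name, url in remotes.items():
        marker = " (default)" if name == default else ""
        print(f"{name}: {url}{marker}")
```

For day-to-day use, dvc remote list remains the more direct way to inspect the configured remotes.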
Add data and directories#
With DVC you can save and version files, ML models, directories and intermediate results with Git without having to check the file content into Git:
$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
    -o data/data.xml
$ dvc add data/data.xml
This adds the file data/data.xml to data/.gitignore and writes the meta information to data/data.xml.dvc. Further information on the format of *.dvc files can be found under DVC-File Format.
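The central piece of meta information in a .dvc file is a checksum of the tracked file's content. As a rough sketch of the idea, assuming an MD5 hash over the raw bytes (the DVC version in use may hash differently, for example by normalising line endings), such a checksum can be computed with Python's standard library. file_md5 is a hypothetical helper, not a DVC function:

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use.

    Hypothetical helper for illustration; not part of DVC.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python file_md5.py data/data.xml
    print(file_md5(sys.argv[1]))
```

If the assumption holds, the printed value should match the md5 entry recorded in data/data.xml.dvc. Because only this small text file is committed, Git can version the data without storing its content.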
In order to manage different versions of your project data with Git, you only have to add the .dvc file and the accompanying .gitignore:
$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "Add raw data to project"
Store and retrieve data#
The data can be copied from the working directory of your Git repository to the remote storage space with
$ dvc push
If you want to retrieve more recent data, you can do so with
$ dvc pull
Import and update#
You can also import data and models from another project with the dvc import command, for example:
$ dvc import https://github.com/iterative/dataset-registry get-started/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data.xml'
This loads the file from the dataset-registry into the current working directory, adds it to .gitignore and creates data.xml.dvc. With dvc update we can update these data sources before we reproduce a pipeline that depends on them, for example:
$ dvc update data.xml.dvc
Stage 'data.xml.dvc' didn't change.
Saving information to 'data.xml.dvc'.