Parameterisation

In the next phase of our example, we parameterise the processing and create the file params.yaml with the following content:

max_features: 6000
ngram_range:
  lo: 1
  hi: 2

To read the parameters, the option -p <filename>:<params_list> must be added to the dvc run command; in our example:

$ dvc run -n featurise -d src/featurisation.py -d data/splitted \
    -p params.yaml:max_features,ngram_range.lo,ngram_range.hi -o data/features \
    python src/featurisation.py data/splitted data/features

This adds to the dvc.yaml file:

featurise:
  cmd: python src/featurisation.py data/splitted data/features
  deps:
  - data/splitted
  - src/featurisation.py
  params:
  - max_features
  - ngram_range.lo
  - ngram_range.hi
  outs:
  - data/features
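
The dotted entries under params address nested keys in params.yaml. As a rough illustration of that addressing (a sketch only, not DVC's actual implementation), the hypothetical helper get_param below walks a nested dictionary that mirrors the params.yaml shown earlier. In a real script the dictionary would presumably come from a YAML parser rather than a literal.

```python
from functools import reduce

# Literal mirror of params.yaml; in src/featurisation.py this would
# normally be loaded with a YAML parser instead (assumption).
params = {"max_features": 6000, "ngram_range": {"lo": 1, "hi": 2}}


def get_param(tree, dotted_key):
    """Resolve a dotted key such as 'ngram_range.lo' in a nested dict."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), tree)


# The three names passed to -p in the example above.
tracked = ["max_features", "ngram_range.lo", "ngram_range.hi"]
values = {key: get_param(params, key) for key in tracked}
print(values)
```

A key that does not exist raises KeyError here, which is one simple way to notice a mistyped parameter name early.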

So that this phase can be repeated, the MD5 hash values and parameter values are stored in the file dvc.lock:

featurise:
  cmd: python src/featurisation.py data/splitted data/features
  deps:
  - path: data/splitted
    md5: 1ce9051bf386e57c03fe779d476d93e7.dir
  - path: src/featurisation.py
    md5: a56570e715e39134adb4fdc779296373
  params:
    params.yaml:
      max_features: 6000
      ngram_range.hi: 2
      ngram_range.lo: 1
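
Because the parameter values are recorded next to the hashes, a later change in params.yaml can be detected by comparing it with this record. The following is a minimal sketch of such a comparison, assuming the flat dotted layout shown above. The names flatten and changed_params, as well as the sample values, are hypothetical and do not reflect DVC internals.

```python
def flatten(tree, prefix=""):
    """Flatten a nested dict into {'a.b': value} form."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def changed_params(locked, current):
    """Return the tracked keys whose current value differs from the locked one."""
    flat = flatten(current)
    return sorted(key for key, value in locked.items() if flat.get(key) != value)


# Values as recorded for params.yaml in dvc.lock (hypothetical snapshot).
locked = {"max_features": 6000, "ngram_range.lo": 1, "ngram_range.hi": 2}
# params.yaml after a hypothetical edit raising the upper n-gram bound.
current = {"max_features": 6000, "ngram_range": {"lo": 1, "hi": 3}}

print(changed_params(locked, current))
```

Only the tracked keys are compared, so additional entries in params.yaml that were not listed with -p would not show up as changes in this sketch.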

Finally, dvc.lock, dvc.yaml and data/.gitignore need to be updated in the Git repository:

$ git add dvc.lock dvc.yaml data/.gitignore

See also