3.3. Train a neural network

$ python3 blueoil/cmd/main.py train -c config/test.py

  main.py train [OPTIONS]

  -c, --config TEXT         Path of config file.  [required]
  -e, --experiment_id TEXT  ID of this training.
  --profile_step            Train step for memory and time profile.
  --help                    Show this message and exit.

The python3 blueoil/cmd/main.py train command runs the actual training.

Before running python3 blueoil/cmd/main.py train, make sure you’ve already put training/test data in the proper location, as defined in the configuration file.

To stop training, press Ctrl + C or kill the blueoil train processes. You can restart training from the saved checkpoints by passing the ID of an existing experiment with -e / --experiment_id.
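As a rough illustration of the restart workflow, the sketch below assembles the train command line with and without an experiment ID. The helper name build_train_command and the ID "my_experiment" are hypothetical and exist only for this example; the flags -c and -e are the ones listed in the option summary above.

```python
import shlex


def build_train_command(config_path, experiment_id=None):
    """Build the argument list for the train command (hypothetical helper)."""
    cmd = ["python3", "blueoil/cmd/main.py", "train", "-c", config_path]
    if experiment_id is not None:
        # Reusing the ID of an existing experiment restarts from its checkpoints.
        cmd += ["-e", experiment_id]
    return cmd


# First run: only the config file is given.
print(shlex.join(build_train_command("config/test.py")))

# Restart: pass the same ID again ("my_experiment" is a placeholder).
print(shlex.join(build_train_command("config/test.py", "my_experiment")))
```

The resulting list can be handed to subprocess.run if you prefer to launch training from a script rather than from the shell.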

3.3.1. Training on GPUs

Blueoil supports training on CUDA-enabled GPUs. To train on a GPU, specify the GPU ID to use with the environment variable CUDA_VISIBLE_DEVICES. For example, to use GPU ID 0, set the environment variable as CUDA_VISIBLE_DEVICES="0".

Blueoil also supports multi-GPU training using Horovod. For example, to use GPU ID 0 and GPU ID 1, set the environment variable as CUDA_VISIBLE_DEVICES="0,1". Internally, Blueoil determines the number of GPUs by counting the “,” characters in CUDA_VISIBLE_DEVICES, so the value must be a plain comma-separated list of IDs. (Hence, CUDA_VISIBLE_DEVICES='0,,1,,2,,3' would confuse Blueoil.) If the counted number of GPUs is greater than 1, Blueoil automatically prepends the horovodrun command to enable multi-GPU training.
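The comma-counting rule can be sketched as follows. This is a minimal illustration, not Blueoil's actual code: the function names are hypothetical, and both the treatment of an empty value and the use of "horovodrun -np N" as the prefix are assumptions about how such a launcher might be invoked.

```python
def count_gpus(visible_devices):
    """Estimate the GPU count as (number of commas + 1).

    Assumption: an empty or unset value is treated as zero GPUs.
    """
    if not visible_devices:
        return 0
    return visible_devices.count(",") + 1


def wrap_for_multi_gpu(train_cmd, visible_devices):
    """Prepend horovodrun when more than one GPU is counted (assumed form)."""
    num_gpus = count_gpus(visible_devices)
    if num_gpus > 1:
        # -np sets the number of processes horovodrun starts.
        return ["horovodrun", "-np", str(num_gpus)] + train_cmd
    return train_cmd


train_cmd = ["python3", "blueoil/cmd/main.py", "train", "-c", "config/test.py"]
print(count_gpus("0"))            # 1
print(count_gpus("0,1"))          # 2
print(count_gpus("0,,1,,2,,3"))   # 7, although only four IDs are listed
print(wrap_for_multi_gpu(train_cmd, "0,1"))
```

The third call shows why doubled commas are a problem: the count no longer matches the number of GPU IDs actually listed.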