This folder contains actively maintained examples of use of 🤗 Optimum Habana for various ML tasks.
Other examples from the 🤗 Transformers library can be adapted the same way to enable deployment on Gaudi processors. This simply consists of:
- replacing the `Trainer` from 🤗 Transformers with the `GaudiTrainer` from 🤗 Optimum Habana,
- replacing the `TrainingArguments` from 🤗 Transformers with the `GaudiTrainingArguments` from 🤗 Optimum Habana.
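For instance, here is a minimal sketch of this substitution (the model, datasets and argument values are placeholders, and Gaudi-specific settings such as `gaudi_config_name` are shown with assumed values to adapt to your setup):

```python
from transformers import AutoModelForSequenceClassification
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# Placeholder values: adapt the model, datasets and Gaudi configuration to your use case.
training_args = GaudiTrainingArguments(
    output_dir="/tmp/output",
    use_habana=True,        # run training on HPU
    use_lazy_mode=True,     # use lazy execution mode on Gaudi
    gaudi_config_name="Habana/bert-base-uncased",  # Gaudi configuration from the Hub (assumed)
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# train_dataset / eval_dataset are assumed to be already tokenized datasets.
trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```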
All the PyTorch training scripts in this repository work out of the box with distributed training.
To launch a script on n HPUs belonging to a single Gaudi server, use the following command:
```bash
python gaudi_spawn.py \
    --world_size number_of_hpu_you_have --use_mpi \
    path_to_script.py --args1 --args2 ... --argsN
```
where `--argX` is an argument of the script to run in a distributed way.
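For example, a hypothetical launch of the question-answering example on the 8 HPUs of a single server could look as follows (the script path and its arguments are placeholders; depending on the example, additional Gaudi-specific arguments such as `--gaudi_config_name` may be required):

```bash
python gaudi_spawn.py \
    --world_size 8 --use_mpi \
    question-answering/run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad \
    --do_train \
    --output_dir /tmp/squad_output \
    [...]
```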
All the PyTorch training scripts in this repository work out of the box with DeepSpeed. To launch one of them on n HPUs, use the following command:
```bash
python gaudi_spawn.py \
    --world_size number_of_hpu_you_have --use_deepspeed \
    path_to_script.py --args1 --args2 ... --argsN \
    --deepspeed path_to_my_deepspeed_config
```
where `--argX` is an argument of the script to run with DeepSpeed.
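The DeepSpeed configuration is a JSON file. Here is an illustrative sketch of what it could contain (the values are assumptions to adapt to your workload, not a recommended configuration):

```json
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "contiguous_gradients": false
    }
}
```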
All the PyTorch training scripts in this repository work out of the box on several Gaudi instances. To launch one of them on n nodes, use the following command:
```bash
python gaudi_spawn.py \
    --hostfile path_to_my_hostfile --use_deepspeed \
    path_to_my_script.py --args1 --args2 ... --argsN \
    --deepspeed path_to_my_deepspeed_config
```
where `--argX` is an argument of the script to run with DeepSpeed and `--hostfile` is a file specifying the addresses and the number of devices to use for each node, such as:
```
ip_1 slots=8
ip_2 slots=8
...
ip_n slots=8
```
You can find more information about multi-node training in the documentation and in the `multi-node-training` folder, where a Dockerfile is provided to easily set up your environment.
If a model also has TensorFlow or Flax checkpoints, you can load them instead of a PyTorch checkpoint by specifying `from_tf=True` or `from_flax=True` in the model instantiation. You can try it for SQuAD here or MRPC here.
You can check if a model has such checkpoints on the Hub. You can also specify a URL or a path to a TensorFlow/Flax checkpoint in `model_args.model_name_or_path`.
Resuming from a checkpoint will only work with a PyTorch checkpoint.
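As an illustrative sketch, inside a script the model instantiation could look like this (the checkpoint name is a placeholder):

```python
from transformers import AutoModelForQuestionAnswering

# Load TensorFlow weights into a PyTorch model; the repository name is an assumption.
model = AutoModelForQuestionAnswering.from_pretrained(
    "my-username/my-tf-checkpoint",
    from_tf=True,   # use from_flax=True for a Flax checkpoint instead
)
```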
Most examples are equipped with a mechanism to truncate the number of dataset samples to the desired length. This is useful for debugging purposes, for example to quickly check that all stages of the programs can complete, before running the same setup on the full dataset which may take hours to complete.
For example here is how to truncate all three splits to just 50 samples each:
```bash
examples/pytorch/question-answering/run_squad.py \
    --max_train_samples 50 \
    --max_eval_samples 50 \
    --max_predict_samples 50 \
    [...]
```
You can resume training from a previous checkpoint like this:
- Pass `--output_dir previous_output_dir` without `--overwrite_output_dir` to resume training from the latest checkpoint in `output_dir` (what you would use if the training was interrupted, for instance).
- Pass `--resume_from_checkpoint path_to_a_specific_checkpoint` to resume training from that checkpoint folder.
Should you want to turn an example into a notebook where you'd no longer have access to the command line, 🤗 `GaudiTrainer` supports resuming from a checkpoint via `trainer.train(resume_from_checkpoint)`.
- If `resume_from_checkpoint` is `True`, it will look for the last checkpoint in the value of `output_dir` passed via `TrainingArguments`.
- If `resume_from_checkpoint` is a path to a specific checkpoint, it will use that saved checkpoint folder to resume the training.
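For example, in a notebook (the checkpoint path is a placeholder):

```python
# Resume from the last checkpoint found in the output_dir passed via the training arguments
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint folder (the path is an assumption)
trainer.train(resume_from_checkpoint="/tmp/output/checkpoint-500")
```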
All the example scripts support the automatic upload of your final model to the Model Hub by adding a `--push_to_hub` argument. It will then create a repository named with your username followed by the name of the folder you are using as `output_dir`: for instance, `"sgugger/test-mrpc"` if your username is `sgugger` and you are working in the folder `~/tmp/test-mrpc`.
To specify a given repository name, use the `--hub_model_id` argument. You will need to specify the whole repository name (including your username), for instance `--hub_model_id sgugger/finetuned-bert-mrpc`. To upload to an organization you are a member of, just use the name of that organization instead of your username: `--hub_model_id huggingface/finetuned-bert-mrpc`.
A few notes on this integration:
- You will need to be logged in to the Hugging Face website locally for it to work. The easiest way to achieve this is to run `huggingface-cli login` and then type your username and password when prompted. You can also pass along your authentication token with the `--hub_token` argument.
- The `output_dir` you pick will either need to be a new folder or a local clone of the remote repository you are using.