# Speculative Decoding
## Introduction
[Speculative Decoding (Ref: arXiv:2211.17192v2)](https://arxiv.org/abs/2211.17192) is now available and can be tried out via:
```bash
python ./examples/speculative_inference.py \
    --model <your_model_name_or_path> \
    --draft_model <your_draft_model_name_or_path> \
    --temperature <your_temperature> \
    --gamma <your_gamma> \
    --max_new_tokens <your_max_new_tokens> \
    --gpu <your_gpu_id>
```
For example:
```bash
python ./examples/speculative_inference.py \
    --model gpt2-xl \
    --draft_model gpt2 \
    --temperature 0.3 \
    --gamma 5 \
    --max_new_tokens 512 \
    --gpu 0
```
Another example:
```bash
python ./examples/speculative_inference.py \
    --model /home/eric/Documents/models/gpt2-xl \
    --draft_model /home/eric/Documents/models/gpt2 \
    --temperature 0 \
    --gamma 3 \
    --max_new_tokens 1024 \
    --gpu 7
```
## Parameter Instructions
`model`, `draft_model`
- Hugging Face model name or locally cached model path.
- Currently, only Hugging Face decoder-only models are supported.
- `model` refers to the target model in the paper, i.e., the large model you want to accelerate.
- `draft_model` refers to the draft model in the paper.
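
Both are loaded like any other Hugging Face causal LM. A minimal sketch, assuming the `transformers` library (the actual loading code in `examples/speculative_inference.py` may differ):

```python
# Sketch: loading a target/draft pair of decoder-only models.
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2-xl and gpt2 share a tokenizer/vocabulary, which the token-level
# accept/reject step of speculative decoding relies on.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()
```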

`temperature`
- Temperature for sampling. When `temperature <= 1e-6`, argmax (greedy) sampling is used instead, as sketched below.
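
A sketch of that rule (an illustration, not the script's exact code):

```python
# Sketch: deriving the next-token distribution from logits.
import torch
import torch.nn.functional as F

def next_token_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    if temperature <= 1e-6:
        # One-hot on the argmax token, i.e., greedy decoding.
        return F.one_hot(logits.argmax(dim=-1), logits.shape[-1]).float()
    return F.softmax(logits / temperature, dim=-1)
```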

`gamma`
- Number of tokens the draft model generates at each step, before the target model verifies them in a single forward pass. See the paper for more details, and the end-to-end sketch at the end of this section.

`max_new_tokens`
- Maximum number of tokens that speculative inference will generate.
- TODO: currently, speculative decoding always generates exactly `max_new_tokens` tokens. We will add a `stop_token` in the future.

`gpu`
- GPU id. Currently, speculative inference only supports a single GPU.

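To make the roles of `gamma` and `temperature` concrete, here is a minimal, self-contained sketch of one speculative decoding step as described in the paper. It is an illustration only, not the implementation in `examples/speculative_inference.py` (no KV cache or batching, for clarity):

```python
# Sketch of one speculative decoding step (arXiv:2211.17192). Illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_probs(logits, temperature):
    # Same helper as in the `temperature` snippet above.
    if temperature <= 1e-6:
        return F.one_hot(logits.argmax(dim=-1), logits.shape[-1]).float()
    return F.softmax(logits / temperature, dim=-1)

@torch.no_grad()
def speculative_step(target, draft, ids, gamma, temperature):
    """Draft `gamma` tokens, then verify them with one target forward pass.

    Returns the newly generated token ids (between 1 and gamma + 1 of them).
    """
    n = ids.shape[1]
    # 1) The draft model proposes `gamma` tokens autoregressively (cheap forwards).
    q = []  # draft distribution at each proposed position
    for _ in range(gamma):
        probs = next_token_probs(draft(ids).logits[0, -1], temperature)
        q.append(probs)
        ids = torch.cat([ids, torch.multinomial(probs, 1).view(1, 1)], dim=-1)
    # 2) A single target forward pass scores the prefix plus all proposals.
    logits = target(ids).logits[0]
    p = [next_token_probs(logits[n - 1 + i], temperature) for i in range(gamma + 1)]
    # 3) Accept proposal i with probability min(1, p_i(x) / q_i(x)).
    out = []
    for i in range(gamma):
        x = ids[0, n + i]
        if torch.rand(()) < (p[i][x] / q[i][x]).clamp(max=1.0):
            out.append(x.item())
        else:
            # First rejection: resample from norm(max(0, p - q)) and stop.
            residual = (p[i] - q[i]).clamp(min=0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            return out
    # 4) All proposals accepted: sample one extra token from the target.
    out.append(torch.multinomial(p[gamma], 1).item())
    return out

# Usage with the gpt2-xl / gpt2 pairing from the examples above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids
start = ids.shape[1]
while ids.shape[1] - start < 32:  # max_new_tokens (may overshoot by <= gamma)
    new = speculative_step(target, draft, ids, gamma=3, temperature=0.3)
    ids = torch.cat([ids, torch.tensor([new])], dim=-1)
print(tokenizer.decode(ids[0]))
```

Each step costs `gamma` cheap draft forward passes plus one target forward pass, and yields between 1 and `gamma + 1` tokens; the speedup comes from the target model verifying several tokens per forward pass instead of generating one at a time.
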
## Experiments
We tested speculative inference using the first 100 inputs of the Alpaca test dataset as prompts. With `model=gpt2-xl`, `draft_model=gpt2`, `temperature=0.`, and `max_new_tokens=512`, we observed the following acceleration:

|`gamma`|speedup (inference time)|speedup (number of forward passes)|
|--|--|--|
|1|1.75x|1.96x|
|2|2.29x|2.89x|
|3|2.71x|3.77x|
|4|3.06x|4.63x|
|5|3.35x|5.44x|
|6|3.65x|6.23x|
|7|3.82x|6.94x|
|8|3.96x|7.64x|
|9|4.05x|8.33x|
|10|4.14x|9.00x|

Note that the speedup may be overestimated: when `temperature=0`, `gpt2-xl` and `gpt2` tend to generate duplicated tokens as the generation grows longer, which makes the target model more likely to accept the draft model's output.
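
For reference, Equation 1 of the paper predicts that, with per-token acceptance rate `alpha`, each target forward pass yields `(1 - alpha^(gamma+1)) / (1 - alpha)` tokens in expectation. A quick sanity check against the "number of forward passes" column above (a sketch; `alpha` is inferred from the table, not measured directly):

```python
# Expected tokens generated per target forward pass (Eq. 1, arXiv:2211.17192).
def expected_tokens_per_forward(alpha: float, gamma: int) -> float:
    # Geometric series 1 + alpha + ... + alpha^gamma.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# alpha ~ 0.96 reproduces the measured column closely -- a very high
# acceptance rate, consistent with the duplicated-token caveat above:
for gamma in range(1, 11):
    print(gamma, round(expected_tokens_per_forward(0.96, gamma), 2))
# 1 1.96, 2 2.88, 3 3.77, ..., 10 9.04
```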