Commit dcda53b (2 parents: 4d124d6 + 9f0e7ed)

Merge pull request #641 from wheresmyhair/main: add readme for speculative decoding

# Speculative Decoding

## Introduction

[Speculative Decoding (Ref: arXiv:2211.17192v2)](https://arxiv.org/abs/2211.17192) is now available and can be tried via:

```bash
python ./examples/speculative_inference.py \
    --model <your_model_name_or_path> \
    --draft_model <your_draft_model_name_or_path> \
    --temperature <your_temperature> \
    --gamma <your_gamma> \
    --max_new_tokens <your_max_new_tokens> \
    --gpu <your_gpu_id>
```
For example,

```bash
python ./examples/speculative_inference.py \
    --model gpt2-xl \
    --draft_model gpt2 \
    --temperature 0.3 \
    --gamma 5 \
    --max_new_tokens 512 \
    --gpu 0
```
Another example,

```bash
python ./examples/speculative_inference.py \
    --model /home/eric/Documents/models/gpt2-xl \
    --draft_model /home/eric/Documents/models/gpt2 \
    --temperature 0 \
    --gamma 3 \
    --max_new_tokens 1024 \
    --gpu 7
```
## Parameter Instructions

`model`, `draft_model`
- Hugging Face model name or locally cached model path.
- Currently only Hugging Face decoder-only models are supported.
- `model` refers to the target model in the paper (i.e., the large model you want to accelerate).
- `draft_model` refers to the draft model in the paper.
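Both flags accept either form. As a hypothetical illustration (the helper below is not part of this repository), the name-or-path distinction can be resolved by checking whether the argument is a local directory:

```python
import os

def resolve_model_source(name_or_path):
    # Hypothetical helper, not repository code: a locally cached model is a
    # directory on disk (e.g. /home/eric/Documents/models/gpt2), while
    # anything else is treated as a Hugging Face Hub model name.
    if os.path.isdir(name_or_path):
        return ("local", os.path.abspath(name_or_path))
    return ("hub", name_or_path)
```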

`temperature`
- Temperature for sampling. When `temperature <= 1e-6`, argmax (greedy) sampling is used instead.
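As a minimal sketch of that threshold behavior (illustrative only, not the repository's implementation):

```python
import math
import random

def sample_from_logits(logits, temperature, rng=None):
    # Illustrative sketch: mirrors the documented CLI behavior, where a
    # temperature at or below 1e-6 falls back to argmax (greedy) decoding.
    if temperature <= 1e-6:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random(0)
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```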

`gamma`
- Number of tokens the draft model generates at each step. See the paper for more details.
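At a high level, each step drafts `gamma` tokens cheaply and lets the target model verify them. The toy sketch below is a greedy (`temperature=0`) simplification using stand-in callables instead of real models; the actual method in the paper verifies via rejection sampling on the two distributions:

```python
def speculative_step(prefix, draft_next, target_next, gamma):
    # Toy greedy sketch of one speculative decoding step.
    # draft_next / target_next are hypothetical stand-ins that map a token
    # sequence to its next token (playing the draft and target models).
    drafted = []
    seq = list(prefix)
    for _ in range(gamma):          # draft model proposes gamma tokens
        tok = draft_next(seq)
        drafted.append(tok)
        seq.append(tok)
    seq = list(prefix)
    for tok in drafted:             # target model verifies each proposal
        if target_next(seq) == tok:
            seq.append(tok)         # accepted: keep the drafted token
        else:
            break                   # first mismatch: discard the rest
    # The target model then contributes one token of its own, so every
    # step makes progress even if no drafted token is accepted.
    seq.append(target_next(seq))
    return seq
```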

`max_new_tokens`
- Maximum number of tokens that speculative inference will generate.
- TODO: currently speculative decoding always generates `max_new_tokens` tokens. We will add a `stop_token` in the future.

`gpu`
- GPU ID; currently speculative inference only supports a single GPU.
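One common way a single-GPU script honors such a flag (a sketch, not necessarily how `speculative_inference.py` does it) is to pin the process to one device via `CUDA_VISIBLE_DEVICES` before any CUDA-using library initializes:

```python
import os

def select_gpu(gpu_id):
    # Hypothetical sketch: restrict the process to a single GPU. This must
    # run before torch/CUDA initialization to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```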

## Experiments

We tested speculative inference using the first 100 inputs from the Alpaca test dataset as prompts. With `model=gpt2-xl`, `draft_model=gpt2`, `temperature=0.`, and `max_new_tokens=512`, we observed the following speedups:

|gamma|speedup (inference time)|speedup (number of forward passes)|
|--|--|--|
|1|1.75x|1.96x|
|2|2.29x|2.89x|
|3|2.71x|3.77x|
|4|3.06x|4.63x|
|5|3.35x|5.44x|
|6|3.65x|6.23x|
|7|3.82x|6.94x|
|8|3.96x|7.64x|
|9|4.05x|8.33x|
|10|4.14x|9.00x|

Note that the speedup may be overestimated: when `temperature=0`, `gpt2-xl` and `gpt2` tend to generate duplicated tokens as the generation grows longer, which makes the target model more likely to accept the draft model's output.
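The forward-pass column can be related to the paper's analysis: if each drafted token is accepted with rate α (assumed i.i.d., as in the paper's Equation 1), the expected number of tokens produced per target-model forward pass is (1 − α^(γ+1)) / (1 − α). A small helper to explore how this scales with γ:

```python
def expected_tokens_per_target_forward(alpha, gamma):
    # Expected tokens generated per target-model forward pass under an
    # i.i.d. per-token acceptance rate alpha (Eq. 1 of arXiv:2211.17192).
    if alpha >= 1.0:
        return gamma + 1.0  # every drafted token accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

For a fixed α the returned value grows with γ but saturates at 1 / (1 − α), which is consistent with the table's diminishing wall-clock gains at large γ.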
