Add speculative decoding (#1120)
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use SamplingContext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
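The core idea behind the prompt lookup decoding draft model added here is that draft tokens can be proposed without a second model: match the trailing n-gram of the context against earlier occurrences in the same context, and speculate the tokens that followed the match. A minimal sketch of that lookup, assuming a 1-D token array as in the commit; the function name and parameters here are illustrative, not the library's exact API:

```python
import numpy as np

def find_candidate_pred_tokens(input_ids: np.ndarray,
                               max_ngram_size: int = 3,
                               num_pred_tokens: int = 10) -> np.ndarray:
    """Propose draft tokens by matching the context's trailing n-gram
    against an earlier occurrence in the context (prompt lookup decoding)."""
    input_length = len(input_ids)
    # Prefer the longest n-gram match, falling back to shorter ones.
    for ngram_size in range(min(max_ngram_size, input_length - 1), 0, -1):
        ngram = input_ids[-ngram_size:]  # includes the last token
        # Scan backwards for an earlier occurrence of this n-gram,
        # excluding the trailing n-gram itself.
        for start in range(input_length - ngram_size - 1, -1, -1):
            if np.array_equal(input_ids[start:start + ngram_size], ngram):
                follow_start = start + ngram_size
                follow_end = min(follow_start + num_pred_tokens, input_length)
                if follow_end > follow_start:
                    # The tokens that followed the match become draft candidates.
                    return input_ids[follow_start:follow_end]
    # No match: propose nothing and let normal decoding proceed.
    return np.empty(0, dtype=input_ids.dtype)
```

Note that the n-gram deliberately ends at the most recent token, which is the behavior restored by the "last token was not used for ngram prediction" fix in the list above.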
Author: Andrei
Commit: fb762a60411f53278454b4e9888c5bd9712d3779
Parent: 71e3e4c
Committed by: GitHub <noreply@github.com>
Date: 1/31/2024, 7:08:14 PM