Add speculative decoding (#1120)
* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use SamplingContext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
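The core idea behind the prompt lookup decoding draft model added here is that draft tokens can be proposed without a second model: match the trailing n-gram of the context against earlier occurrences in the same context, and speculate the tokens that followed the match. A minimal sketch of that lookup, assuming a 1-D token array as in the commit; the function name and parameters here are illustrative, not the library's exact API:

```python
import numpy as np

def find_candidate_pred_tokens(input_ids: np.ndarray,
                               max_ngram_size: int = 3,
                               num_pred_tokens: int = 10) -> np.ndarray:
    """Propose draft tokens by matching the context's trailing n-gram
    against an earlier occurrence in the context (prompt lookup decoding)."""
    input_length = len(input_ids)
    # Prefer the longest n-gram match, falling back to shorter ones.
    for ngram_size in range(min(max_ngram_size, input_length - 1), 0, -1):
        ngram = input_ids[-ngram_size:]  # includes the last token
        # Scan backwards for an earlier occurrence of this n-gram,
        # excluding the trailing n-gram itself.
        for start in range(input_length - ngram_size - 1, -1, -1):
            if np.array_equal(input_ids[start:start + ngram_size], ngram):
                follow_start = start + ngram_size
                follow_end = min(follow_start + num_pred_tokens, input_length)
                if follow_end > follow_start:
                    # The tokens that followed the match become draft candidates.
                    return input_ids[follow_start:follow_end]
    # No match: propose nothing and let normal decoding proceed.
    return np.empty(0, dtype=input_ids.dtype)
```

Note that the n-gram deliberately ends at the most recent token, which is the behavior restored by the "last token was not used for ngram prediction" fix in the list above.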
Author: Andrei
Commit: fb762a60411f53278454b4e9888c5bd9712d3779
Parent: 71e3e4c
Committed by: GitHub <noreply@github.com>
Date: 1/31/2024, 7:08:14 PM