Add option to disable duplicates in topk #464
Open
+32
−18
The current implementation of the optimized topp/topk calculation for the scalar case handles duplicates that fall outside the kth border. Unfortunately, analyzing duplicates requires a synchronization with the CPU, which makes multi-step scheduling useless when combined with topp/topk.
This PR adds an option to skip duplicate handling with `VLLM_HANDLE_TOPK_DUPLICATES` (default `True`). When duplicate handling is disabled through this variable, the synchronization with the CPU is avoided. The PR also removes the synchronization that was previously done in Sampler, by saving the scalar values of `top_k` and `top_p`. This should give a performance gain for all benchmarks that use these sampling parameters, especially together with multi-step scheduling.

Disabling duplicate handling may cause small accuracy differences, so the best solution would be to handle duplicates without synchronizing with the CPU. However, this is not a trivial problem, so I will try to provide such a solution later.
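
To illustrate the trade-off described above, here is a minimal pure-Python sketch, not the code from this PR. It assumes that "handling duplicates" means keeping every logit tied with the kth value, so the number of kept tokens depends on the data (which on an accelerator would typically require reading a value back to the host). The helper names `handle_duplicates_enabled` and `top_k_keep_mask`, and the way the environment variable is parsed, are hypothetical.

```python
import os


def handle_duplicates_enabled(env=os.environ):
    # Hypothetical parsing: the PR only states the variable name and its
    # default (True); the accepted spellings here are an assumption.
    value = env.get("VLLM_HANDLE_TOPK_DUPLICATES", "true")
    return value.strip().lower() in ("1", "true")


def top_k_keep_mask(logits, k, handle_duplicates):
    """Return a list of booleans marking which logits survive top-k."""
    if handle_duplicates:
        # Keep everything tied with the kth largest value. The number of
        # survivors is data-dependent, which is what forces a sync.
        kth_value = sorted(logits, reverse=True)[k - 1]
        return [x >= kth_value for x in logits]
    # Keep exactly k entries by sorted position; ties beyond the kth
    # border are dropped, so the output shape is known in advance.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(logits))]


logits = [3.0, 1.0, 2.0, 2.0, 0.5]
with_dups = top_k_keep_mask(logits, k=2, handle_duplicates=True)
without_dups = top_k_keep_mask(logits, k=2, handle_duplicates=False)
print(with_dups)     # three tokens kept: the tie at 2.0 crosses the border
print(without_dups)  # exactly two tokens kept
```

In this example the two modes differ only in whether the second `2.0` survives, which is the kind of small accuracy difference the description mentions.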