
Edge Case: Occasional response ending prematurely while not reaching max tokens #2169

Open
jorgecolonconsulting opened this issue Oct 27, 2024 · 0 comments

Issue

I hit a weird bug where the architect command was ending way before reaching the max tokens set in .aider.model.metadata.json and explicitly supported by the provider. I would get the following error:

Model together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo has hit a token limit!
Token counts below are approximate.

Input tokens: ~4,688 of 32,768
Output tokens: ~2,022 of 32,768
Total tokens: ~6,710 of 32,768

I modified aider to output debug info from litellm and realized that the response was coming back with finish_reason: 'length'. Although it isn't documented in the Together AI docs, it looks like the default max_tokens is 2048 when the request doesn't set one. If I change max_tokens to 32768 I get a different error, something like inputs tokens + max_new_tokens must be <= 32769. If I set max_tokens: 20000 under extra_params in .aider.model.settings.yml, everything works just fine and I get past 2048 tokens without a problem. I believe I've had this problem in the past with another model; I just don't remember which one.
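
To double-check, hitting the model directly through litellm should show the same cutoff (a rough sketch, assuming TOGETHERAI_API_KEY is set in the environment; the prompt is just a placeholder):

import litellm

# No max_tokens in the request, so the provider's default (apparently 2048
# for Together AI) caps the completion.
resp = litellm.completion(
    model="together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain TCP congestion control in exhaustive detail."}],
)

print(resp.choices[0].finish_reason)   # 'length' when the completion gets cut off
print(resp.usage.completion_tokens)    # should land near 2048 if the default is the culprit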

I found that base_coder has the following:

except FinishReasonLength:
    # We hit the output limit!
    if not self.main_model.info.get("supports_assistant_prefill"):
        exhausted = True
        break

# ...

if exhausted:
    self.show_exhausted_error()
    self.num_exhausted_context_windows += 1
    return

# ...

def show_exhausted_error(self):
    # ...
    res.append(f"Model {self.main_model.name} has hit a token limit!")
    res.append("Token counts below are approximate.")

Setting a fixed max_tokens works for this specific case, but it seems like a more dynamic approach would be ideal, where max_tokens shrinks based on the amount of tokenized input. I'm not sure that's the best approach though, since it's more common for models to have a large context window but a fixed max completion size of 4096 or 8192 tokens. Perhaps a use_max_dynamic_tokens setting in .aider.model.settings.yml that requires context_size?
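
Roughly what I'm picturing (just a sketch; dynamic_max_tokens and reserve are made-up names here, not existing aider or litellm settings):

def dynamic_max_tokens(input_tokens: int, context_size: int, reserve: int = 256) -> int:
    # Cap the completion so input + output stays inside the context window,
    # keeping a small reserve for approximate token counting.
    return max(context_size - input_tokens - reserve, 1)

# With the numbers from the error above: 32768 - 4688 - 256 = 27824,
# which is well past the provider's 2048 default.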

I'm willing to do the work for this; I just need direction on what to look at, since I'm not familiar with the aider or litellm codebases.

Version and model info

Aider v0.60.1
Model: together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo
.aider.model.settings.yml:

- name: "together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo"
  edit_format: "diff"
  use_repo_map: true
  examples_as_sys_msg: true
  reminder: "sys"
  streaming: false

  # extra_params:
  #   max_tokens: 20000 # workaround to get around the issue

.aider.model.metadata.json:

"together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo": {
    "max_tokens": 32768,
    "max_input_tokens": 32768,
    "max_output_tokens": 32768,
    "input_cost_per_token": 0.0000012,
    "output_cost_per_token": 0.0000012,
    "litellm_provider": "together_ai",
    "mode": "chat",
    "supports_function_calling": false,
    "supports_tool_choice": false
}

Rough idea of tokens:

$ 0.0022    1,792 system messages
$ 0.0008      698 repository map                      use --map-tokens to resize
$ 0.0040    3,337 beagledocs/scrape.py                /drop to remove
$ 0.0004      373 .ai/codex/codex.md (read-only)      /drop to remove
$ 0.0001       52 .ai/rules/bug-review.md (read-only) /drop to remove
==================
$ 0.0075    6,252 tokens total
           26,516 tokens remaining in context window
           32,768 tokens max context window size