initial grok #169
Conversation
a45d444 to 4095db0 (Compare)
Posted a comment here; it doesn't show up on the main page.
Let's chat at the meeting.
I did say I didn't mind if we did this, but looking at it now, all the args are just values from the config anyway. I feel like separating them just makes our code harder to follow.
Not a big deal either way, but I'd personally prefer we just drop this commit. @KyleHerndon what do you think?
I think this is the right way to set things up. The weak argument is that llama.cpp has a similar setup, and we're referencing them for accuracy baselines. The better argument goes something like:
The KV cache is not ontologically related to the config; it just (currently) exclusively uses parameters from it. In the near future we will want things like sharding, at which point the KV cache might need additional args (number of devices, for example, which is not a ModelConfig parameter, but something like an ExecutionConfig parameter).
This change doesn't feel related to grok so much as a code refactor, which I think is fine to include in a bigger patch for efficiency but might make things harder to follow when looking at commit history. Otherwise, I find the code just as easy to follow, if not slightly easier, because I think it is better organized.
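A minimal sketch of the setup being discussed (all names here are hypothetical, not the actual sharktank classes): the cache constructor takes plain values rather than the config object, so a future ExecutionConfig-style parameter such as a device count can be added without touching the model config at all.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical stand-in for the model hyperparameter config.
    block_count: int = 32
    attn_head_count: int = 8
    attn_head_dim: int = 128


class KVCache:
    # Explicit args: the cache does not depend on ModelConfig, so
    # non-model parameters (e.g. device_count for sharding) can be
    # added here later without widening the config's responsibilities.
    def __init__(self, *, block_count: int, head_count: int,
                 head_dim: int, device_count: int = 1):
        self.block_count = block_count
        self.head_count = head_count
        self.head_dim = head_dim
        self.device_count = device_count


config = ModelConfig()
cache = KVCache(
    block_count=config.block_count,
    head_count=config.attn_head_count,
    head_dim=config.attn_head_dim,
)
```

The trade-off debated above is exactly this: the call site repeats the config fields, but the cache class itself stays decoupled from where those values come from.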
I don't mind either way, but do agree with Kyle that this keeps it better organized. We can consult Rob or Stella to see if there is a better way to do this.
Not necessary; if you two agree, that's more than enough for me.
Looks good to me. @KyleHerndon @archana-ramalingam any final changes you think are needed?
```diff
@@ -233,7 +233,6 @@ def main():
    device = torch.device(args.device) if args.device else None
    activation_dtype = getattr(torch, args.activation_dtype)
    attention_dtype = getattr(torch, args.attention_dtype)
```
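For context, the diff above uses `getattr(torch, name)` to turn a dtype string from the CLI into a `torch.dtype`. A small sketch of that lookup pattern with validation (`resolve_dtype` is a hypothetical helper, not sharktank API, demonstrated on a stand-in namespace so the example runs without torch installed):

```python
import types


def resolve_dtype(module, name):
    # Hypothetical helper: look up a dtype attribute by name,
    # e.g. resolve_dtype(torch, "float16") -> torch.float16.
    if name is None:
        return None
    dtype = getattr(module, name, None)
    if dtype is None:
        raise ValueError(f"unknown dtype name: {name!r}")
    return dtype


# Stand-in for the torch module, so the sketch is self-contained.
fake_torch = types.SimpleNamespace(float16="f16", float32="f32")
print(resolve_dtype(fake_torch, "float16"))  # f16
```

The explicit `None` check is the main difference from a bare `getattr`: a typo in the CLI flag fails with a clear error instead of an `AttributeError` deep in model setup.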
nice catch
I tested the Llama and Grok models again; LGTM.
@dan-garvey, here is the error: this fails

```python
output = weight * to(output, weight.dtype)
```

but this is fine:

```python
output = elementwise(torch.mul, weight, to(output, weight.dtype))
```

When calling `weight * …`, Python dispatches to

```python
def __mul__(self, rhs):
    from ..ops import elementwise
    return elementwise(torch.mul, self, rhs)
```

Instead of the operator, calling

```python
output = weight.my_mul(to(output, weight.dtype))
```

with

```python
class InferenceTensor(ABC):
    ...
    def my_mul(self, rhs):
        from ..ops import elementwise
        return elementwise(torch.mul, self, rhs)
```

works. This is essentially the same code, but PyTorch does something special about binary operators. `+` also suffers from the same problem. Also, when running with env vars …, the tracer reports … It is doing something special about …
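For background (this illustrates general Python semantics only, not the PyTorch internals suspected above): binary operators go through the operator protocol, where `a * b` is looked up as `type(a).__mul__` and can fall back to the reflected `__rmul__` via `NotImplemented`, whereas a named method like `my_mul` is an ordinary attribute call with no such indirection. `Scaled` here is a made-up class for illustration.

```python
class Scaled:
    def __init__(self, v):
        self.v = v

    def __mul__(self, rhs):
        # Invoked for `self * rhs`. Returning NotImplemented tells Python
        # to try the reflected operand (rhs.__rmul__) before raising.
        if not isinstance(rhs, Scaled):
            return NotImplemented
        return Scaled(self.v * rhs.v)

    def my_mul(self, rhs):
        # Plain method call: no operator-protocol indirection at all.
        return Scaled(self.v * rhs.v)


a, b = Scaled(3), Scaled(4)
print((a * b).v, a.my_mul(b).v)  # 12 12
```

Tensor libraries can additionally intercept operator dispatch (e.g. PyTorch's `__torch_function__` protocol), which is one place where `a * b` and `a.my_mul(b)` can diverge even when the method bodies are identical.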
I opened an issue with PyTorch. I think that may be a bug there.
Initial grok work, also does some refactoring