
Adaptive lora #66

Open · hage1005 wants to merge 12 commits into main
Conversation

hage1005 (Collaborator) commented Dec 27, 2023

Add an adaptive_threshold option to dynamically determine the rank.
Experiment results with MNIST are shown below.
To use this feature, set compression_ratio_by_covariance or compression_ratio_by_memory in config.yaml:

lora:
  init: pca
  compression_ratio_by_covariance: 0.8

This will determine the rank needed for PCA to explain 80% of the covariance.
or

lora:
  init: pca
  compression_ratio_by_memory: 0.8

This will determine the rank that compresses gradient memory to 80%.
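For intuition, here is a rough sketch of how a coverage-based rank could be computed from a layer's gradient covariance. This is illustrative only; the function name select_rank_by_coverage is hypothetical and not part of this PR.

import torch

def select_rank_by_coverage(covariance: torch.Tensor, ratio: float = 0.8) -> int:
    # Eigenvalues of the (symmetric) gradient covariance, sorted in descending order.
    eigvals = torch.linalg.eigvalsh(covariance).flip(0).clamp(min=0.0)
    # Fraction of total variance explained by the top-k eigenvalues, for every k.
    explained = torch.cumsum(eigvals, dim=0) / eigvals.sum()
    # Smallest rank whose leading eigenvalues reach the requested coverage.
    rank = int(torch.searchsorted(explained, torch.tensor(ratio)).item()) + 1
    return min(rank, eigvals.numel())

With ratio=0.8 this picks, per layer, the smallest rank explaining 80% of the covariance, which matches the intent of the compression_ratio_by_covariance setting above.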

[Figures: MNIST experiment results]

sangkeun00 (Collaborator) commented:
Thanks for this feature and your analysis! This is a great initial effort. However, there are quite a few things to figure out before I can merge this PR. For example:

  • Budget-based vs coverage-based: In your implementation, you fix the covariance coverage and determine the rank for each layer accordingly. However, this could lead to a low compression ratio. Instead, we can think about a budget-based approach, where we fix the number of parameters to be tracked and set each layer's rank accordingly based on the covariance distribution.
  • Compatibility with analog.initialize_from_log(): If the rank is determined adaptively for each layer, we have to save this rank structure somewhere so that we can recover the exact LoRA structure when initializing from log.

hage1005 (Collaborator, Author) commented Dec 30, 2023

  • Budget-based vs coverage-based: In your implementation, you fix the covariance coverage and determine the rank for each layer accordingly.

Yeah, this indeed might lead to a low compression ratio, although for the MNIST case the compression ratio seems reasonable. But regarding "fix the number of parameters": how do we determine this number from the percentage covariance threshold? Should we put all singular values across layers together and sort them?
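One way to make the "sort all singular values across layers" idea concrete, as a rough sketch only (allocate_ranks_by_budget and the covariances mapping are hypothetical names, not from this PR):

import torch

def allocate_ranks_by_budget(covariances: dict, total_rank_budget: int) -> dict:
    # Pool eigenvalues from every layer's gradient covariance into one global list.
    pooled = []
    for name, cov in covariances.items():
        eigvals = torch.linalg.eigvalsh(cov).flip(0).clamp(min=0.0)
        pooled.extend((float(v), name) for v in eigvals)
    # Spend the global rank budget on the largest eigenvalues, wherever they occur.
    pooled.sort(reverse=True)
    ranks = {name: 0 for name in covariances}
    for _, name in pooled[:total_rank_budget]:
        ranks[name] += 1
    return ranks

A parameter budget (rather than a rank budget) would additionally have to weight each selection by the per-rank parameter cost of its layer.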

  • Compatibility with analog.initialize_from_log(): If the rank is determined adaptively for each layer, we have to save this rank structure somewhere so that we can recover the exact LoRA structure when initializing from log.

Thanks for catching this!

hage1005 self-assigned this on Dec 30, 2023
sangkeun00 (Collaborator) commented:

I don't have a concrete answer to the first question at the moment, and I believe this is largely a research question (which is exciting). Much of the literature on communication-efficient distributed training, which also does gradient compression, applies different compression ratios across layers. You can probably review that work, try/develop new ideas, and find the one that works best. Once we have this, we can merge this PR!

sangkeun00 (Collaborator) commented:

Also, we can think about using different ranks for the forward and backward passes. From the implementation perspective, you could allow users to pass a tuple of (rank_fwd, rank_bwd) for this. If a user passes an integer value, we use it to set both rank_fwd and rank_bwd. This is somewhat similar to setting kernel_size or stride in nn.Conv in PyTorch.
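A minimal sketch of the int-or-tuple handling being suggested here, mirroring how PyTorch normalizes kernel_size; the helper name _parse_rank is hypothetical.

from typing import Tuple, Union

def _parse_rank(rank: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    # A single integer applies to both passes; a pair is (rank_fwd, rank_bwd).
    if isinstance(rank, int):
        return rank, rank
    rank_fwd, rank_bwd = rank
    return rank_fwd, rank_bwd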

hage1005 (Collaborator, Author) commented Jan 8, 2024

Instead of explicitly saving the rank information, I recover it from the shape of the LoRA weight matrix. Let me know if this seems okay!
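For illustration, recovering per-layer ranks from saved weight shapes might look roughly like the sketch below; the key-naming convention ("lora_A" holding a (rank, in_features) matrix) is an assumption here, not necessarily what this PR uses.

import torch

def recover_lora_ranks(state_dict: dict) -> dict:
    # Infer each layer's rank from the saved LoRA down-projection weight shapes.
    # Assumes keys containing "lora_A" hold (rank, in_features) matrices.
    return {
        name: weight.shape[0]
        for name, weight in state_dict.items()
        if "lora_A" in name
    }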

sangkeun00 (Collaborator) left a comment:


Also, would you be able to come up with some basic unit tests for lora? This would be massively helpful for future development!
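As a starting point, such a test might look roughly like this; it reuses the hypothetical select_rank_by_coverage helper sketched earlier in this thread and is not tied to the actual package layout.

import unittest
import torch

# Assumes select_rank_by_coverage (sketched earlier) is importable from wherever it lives.

class TestAdaptiveLoraRank(unittest.TestCase):
    def test_rank_grows_with_coverage_ratio(self):
        # Diagonal covariance with a known eigenvalue spectrum.
        cov = torch.diag(torch.tensor([4.0, 2.0, 1.0, 0.5]))
        low = select_rank_by_coverage(cov, ratio=0.5)
        high = select_rank_by_coverage(cov, ratio=0.9)
        self.assertLessEqual(low, high)
        self.assertLessEqual(high, cov.shape[0])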

@@ -97,11 +97,16 @@ def _sanity_check(self):
)
self._log["grad"] = True

def eval(self):
def eval(self, log="grad"):

Instead of having "grad" as a default value, what do you think about having None as a default value, and when it's None we set it to "grad" with a warning message like:

def eval(self, log=None):
    if log is None:
        get_logger().warning("we automatically set 'log' to 'grad'. if this is not a desired behavior, please explicitly set your 'log' value.")
        log = "grad"

    if isinstance(log, str):
        ...

Comment on lines +90 to +91
if __name__ == "__main__":
unittest.main()

Just a suggestion: I am not sure if these two lines are necessary; let's try to follow the format of the other tests!

Comment on lines +82 to +83
print(if_scores)
# torch.save(if_scores, f"{os.path.dirname(os.path.abspath(__file__))}/data/if_analog_lora.pt")
eatpk (Collaborator) commented Feb 4, 2024

Let's remove the comments! Also, if we are going to print the scores (or anything else) in the test, I would love to see it in a formatted way!

Comment on lines +7 to +15
def construct_mlp(num_inputs=784, num_classes=10):
return torch.nn.Sequential(
nn.Flatten(),
nn.Linear(num_inputs, 4, bias=False),
nn.ReLU(),
nn.Linear(4, 2, bias=False),
nn.ReLU(),
nn.Linear(2, num_classes, bias=False),
)

I think this can only be used for MNIST data; I am not sure it can be a general util function. Let me know your thoughts!

if_scores = if_scores.numpy().tolist()
torch.save(if_scores, "if_analog_pca.pt")
if_scores = if_scores.numpy().tolist()[0]
torch.save(if_scores, f"if_analog_pca.pt")

nit: f-string unnecessary.

@@ -91,6 +89,6 @@

# Save
if_scores = if_scores.numpy().tolist()[0]
torch.save(if_scores, "if_analog_scheduler.pt")
torch.save(if_scores, f"if_analog_scheduler.pt")

ditto.
