Match GPTQ state dict #2188

Closed

rahul-tuli wants to merge 3 commits from the match-quant-state-dict-to-gptq branch

Conversation

@rahul-tuli (Member) commented Mar 19, 2024

Conversion script:

from sparseml.transformers.utils.vllm_export_helpers import export_vllm_checkpoint
from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer

path = "/home/rahul/projects/sparseml/local/local_output/sparsegpt-autogptq-emulation-checkpoint/stage_compression"
sparse_gpt_model = SparseAutoModelForCausalLM.from_pretrained(path)
tokenizer = SparseAutoTokenizer.from_pretrained(path)

export_vllm_checkpoint(
    model=sparse_gpt_model,
    tokenizer=tokenizer,
)

Output:

2024-03-21 01:58:33 sparseml.pytorch.model_load.helpers INFO     Reloaded model state after SparseML recipe structure modifications from /home/rahul/projects/sparseml/local/local_output/sparsegpt-autogptq-emulation-checkpoint/stage_compression
2024-03-21 01:58:33 __main__     INFO     Adding exllama quantization info to config
2024-03-21 01:58:33 __main__     INFO     Translating state dict to exllama format.
2024-03-21 01:58:33 sparseml.transformers.utils.transformations INFO     Applying transformation: TRANSFORM_NAMES
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Transformation: TRANSFORM_NAMES complete
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Applying transformation: ADD_TENSORS
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Transformation: ADD_TENSORS complete
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Applying transformation: TRANSFORM_TENSORS
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Transformation: TRANSFORM_TENSORS complete
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Applying transformation: REMOVE_UNWANTED_TENSORS
2024-03-21 02:00:46 sparseml.transformers.utils.transformations INFO     Transformation: REMOVE_UNWANTED_TENSORS complete
2024-03-21 02:00:50 __main__     INFO     Model and config saved to /nm/drive0/rahul/projects/sparseml/exllama_model
2024-03-21 02:00:50 __main__     INFO     tokenizer saved to /nm/drive0/rahul/projects/sparseml/exllama_model
$ tree ./exllama_model 
./exllama_model
├── config.json
├── generation_config.json
├── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json

0 directories, 6 files

config.json

{
  "_name_or_path": "/home/rahul/projects/sparseml/local/local_output/sparsegpt-autogptq-emulation-checkpoint/stage_compression",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "desc_act": false,
    "group_size": -1,
    "is_marlin_format": false,
    "quant_method": "gptq",
    "sym": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "1.7.0.20240321",
  "use_cache": true,
  "vocab_size": 32000
}

Usage script (requires vLLM):

import argparse
from vllm import LLM, SamplingParams


parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str)

args = parser.parse_args()


prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=1, max_tokens=100)

# Create an LLM.
llm = LLM(args.model)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"\nGenerated text: {prompt}{generated_text}\n")


@rahul-tuli changed the title from "Add Translation Structure" to "Match GPTQ state dict" on Mar 19, 2024
@rahul-tuli force-pushed the match-quant-state-dict-to-gptq branch from d442b66 to 29f83bb on March 21, 2024 02:18
@rahul-tuli marked this pull request as ready for review on March 21, 2024 02:22
@rahul-tuli force-pushed the match-quant-state-dict-to-gptq branch from 8227bfe to 1b1567d on March 26, 2024 14:21
    return wrapper


def is_quantization_target(key: str) -> bool:

Contributor:
Should move this file to gptq_helpers.py, or make sure these functions are named with gptq specifically, since these assumptions are specific to how this algorithm is applied, not to all quantization.
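
For illustration, a minimal sketch of a GPTQ-specific target check; the is_gptq_quantization_target name and the key patterns below are assumptions, not the PR's actual implementation.

import re

# Hypothetical sketch: treat only the attention/MLP projection weights that
# GPTQ rewrites as quantization targets, rather than every quantized tensor.
_GPTQ_TARGET_PATTERN = re.compile(
    r"\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)\.weight$"
)


def is_gptq_quantization_target(key: str) -> bool:
    """Return True if a state dict key points at a GPTQ-packed linear weight."""
    return bool(_GPTQ_TARGET_PATTERN.search(key))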

def _log_call(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        _LOGGER.info("Applying transformation: %s", func.__name__.upper())

Contributor:
Let's move this to debug; users won't necessarily know the internal transformation names.
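
A minimal sketch of the suggested change, assuming the decorator body is otherwise unchanged (the excerpt above only shows its first lines):

import functools
import logging

_LOGGER = logging.getLogger(__name__)


def _log_call(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # DEBUG keeps internal transformation names out of normal user output
        _LOGGER.debug("Applying transformation: %s", func.__name__.upper())
        return_value = func(*args, **kwargs)
        _LOGGER.debug("Transformation: %s complete", func.__name__.upper())
        return return_value

    return wrapper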

intweight = []
infeatures = weight.shape[1]
for idx in range(infeatures):
    intweight.append(

Contributor:
After we fix the accuracy issue, let's see what we can do to speed this up, or at least time it. With grouping, vectorizing might be tricky, but we could at least pre-allocate the final tensor.
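
For illustration, a sketch of the pre-allocation idea under per-channel (group_size = -1) quantization; the function name and the exact rounding expression are assumptions, not the PR's code:

import torch


def pack_intweight(weight: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    """Build the integer weight into a pre-allocated tensor instead of a growing list."""
    outfeatures, infeatures = weight.shape
    # Pre-allocate the destination tensor rather than appending per column.
    intweight = torch.empty((infeatures, outfeatures), dtype=torch.int32)
    for idx in range(infeatures):
        # round each input column to its integer representation in place
        intweight[idx] = torch.round(
            (weight[:, idx] + scale * zero) / scale
        ).to(torch.int32)
    return intweight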

Contributor:
Could maybe try moving the model to GPU before running the transformations (i.e. model.to("cuda:0")).
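
A minimal sketch of that suggestion applied to the conversion script above; the device choice and CPU fallback are assumptions:

import torch

# run the packing transformations on GPU when available, otherwise stay on CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
sparse_gpt_model = sparse_gpt_model.to(device)

export_vllm_checkpoint(
    model=sparse_gpt_model,
    tokenizer=tokenizer,
)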

- Reshape the zero points tensor to [1, x] of type int32 and fill with zeros
  (it is assumed that quantization was symmetric)

:param state_dict: The state_dict to be transformed

Contributor:
Specify that the keys should already have been updated at this point.
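
For illustration, a sketch of the zero-points tensor described in the docstring, assuming the standard GPTQ packing of 32 // bits zero points per int32 value (the helper name is hypothetical):

import torch


def make_symmetric_qzeros(num_columns: int, bits: int = 4) -> torch.Tensor:
    """Build a [1, x] int32 zero-points tensor filled with zeros (symmetric case)."""
    # each int32 packs 32 // bits zero points, so x = num_columns * bits // 32
    packed_columns = num_columns * bits // 32
    return torch.zeros((1, packed_columns), dtype=torch.int32)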

bfineran previously approved these changes Mar 28, 2024

Commits:
- remove src. from imports
- Update names
- Some Cleanup
- Add docstring to QuantizationConfig
@rahul-tuli (Member, Author):
Closing as this is not needed now!

@rahul-tuli closed this Oct 22, 2024