Bump 3.4.2 (OpenNMT#2493)
* v3.4.2
vince62s authored Oct 20, 2023
1 parent cb35810 commit 9942ecd
Showing 8 changed files with 44 additions and 35 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,15 @@

## [Unreleased]

## [3.4.2](https://github.com/OpenNMT/OpenNMT-py/tree/3.4.2) (2023-10-20)

* torch 2.1 (scaled_dot_product improvements)
* Mistral 7B sliding window
* Speed-up inference
* flash attention 2 (with sliding window) >= v2.3.1
* use FusedRMSNorm from apex if available
* fixed attn_debug

## [3.4.1](https://github.com/OpenNMT/OpenNMT-py/tree/3.4.1) (2023-09-26)

* bug fixes
49 changes: 28 additions & 21 deletions README.md
@@ -25,11 +25,12 @@ Otherwise you can just have a look at the [Quickstart](https://opennmt.net/OpenN
----
## New:

* Special note on Pytorch v2: up to v2.0.1, dynamic shapes are not handled properly, hence torch.compile() will not work with OpenNMT-py. We have tested the nightly build (in May) and it works with a small gain. The next version will be 2.1.
* LLM support with converters for: Llama, OpenLlama, Redpajama, MPT-7B, Falcon.
* You will need PyTorch v2, preferably v2.1, which fixes some `scaled_dot_product_attention` issues
* LLM support with converters for: Llama (+ Mistral), OpenLlama, Redpajama, MPT-7B, Falcon.
* Support for 8bit and 4bit quantization along with LoRA adapters, with or without checkpointing (a minimal LoRA sketch follows this list).
* You can finetune 7B and 13B models on a single RTX 24GB with 4-bit quantization.
* Inference can be forced in 4/8bit using the same layer quantization as in finetuning.
* Tensor parallelism when the model does not fit on one GPU's memory (both training and inference)
* Once your model is finetuned you can run inference either with OpenNMT-py or faster with CTranslate2.
* MMLU evaluation script, see results [here](https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/readme.md)
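
The quantization and LoRA bullets above describe parameter-efficient finetuning. As a rough illustration (a minimal sketch in plain PyTorch, not OpenNMT-py's actual implementation; the class and parameter names are ours), a LoRA adapter keeps the base linear weights frozen and learns only a low-rank update:

```python
# Minimal LoRA sketch (illustration only, not OpenNMT-py's implementation):
# a frozen base Linear plus a trainable low-rank update scaled by alpha / r.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen (and possibly quantized) base weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # base output + low-rank correction B(Ax), scaled
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
out = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # torch.Size([4, 512]) 8192
```

Only the low-rank factors are trained, which is what allows 7B/13B finetuning to fit on a single 24GB GPU once the frozen weights are also quantized to 4 bits.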

@@ -55,7 +56,7 @@ If you used previous versions of OpenNMT-py, you can check the [Changelog](https
OpenNMT-py requires:

- Python >= 3.8
- PyTorch >= 2.0 <2.1
- PyTorch >= 2.0 <2.2

Install `OpenNMT-py` from `pip`:
```bash
pip install OpenNMT-py
```

@@ -77,11 +78,24 @@ Note: if you encounter a `MemoryError` during installation, try to use `pip` wit

```bash
pip install -r requirements.opt.txt
```

Special note on flash attention support:
## Manual installation of some dependencies

Apex is highly recommended for fast performance (especially the legacy fusedadam optimizer and FusedRMSNorm).

```shell
git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --no-build-isolation --config-settings --build-option="--cpp_ext --cuda_ext --deprecated_fused_adam --xentropy --fast_multihead_attn" ./
cd ..
```

Flash attention:

As of Oct. 2023, flash attention 1 has been upstreamed to PyTorch v2, but it is recommended to use flash attention 2 with v2.3.1 for sliding window attention support.

When using regular `position_encoding=True` or Rotary with `max_relative_positions=-1`, OpenNMT-py will try to use an optimized dot-product path.

if you want to use [flash attention 2](https://github.com/Dao-AILab/flash-attention#installation-and-features) then you need to manually install it first:
if you want to use [flash attention](https://github.com/Dao-AILab/flash-attention#installation-and-features) then you need to manually install it first:

```bash
pip install flash-attn --no-build-isolation
```

@@ -91,7 +105,7 @@ if flash attention 2 is not installed, then we will use `F.scaled_dot_product_at

When using `max_relative_positions > 0` or Alibi `max_relative_positions=-2`, OpenNMT-py will use its legacy code for matrix multiplications.

flash attention is a bit faster and saves some GPU memory.
flash attention and `F.scaled_dot_product_attention` are a bit faster and save some GPU memory.
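
The behaviour described above (use flash attention 2 when available, otherwise fall back to `F.scaled_dot_product_attention`) can be illustrated with a small, self-contained sketch. This is not OpenNMT-py's actual code path; it only relies on the documented interfaces of the two libraries (`flash_attn_func` from flash-attn 2 takes `(batch, seq, heads, dim)` tensors, SDPA takes `(batch, heads, seq, dim)`):

```python
# Illustrative sketch: prefer flash-attn 2 when installed and usable,
# otherwise fall back to PyTorch's fused scaled_dot_product_attention.
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # flash-attn 2.x
    HAS_FLASH2 = True
except ImportError:
    HAS_FLASH2 = False


def attention(q, k, v, causal=True):
    """q, k, v: (batch, heads, seq, dim) tensors."""
    if HAS_FLASH2 and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash-attn expects (batch, seq, heads, dim)
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=causal
        )
        return out.transpose(1, 2)
    # PyTorch fallback, still a fused kernel on recent GPUs
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


q = k = v = torch.randn(2, 8, 16, 64)  # CPU float32 exercises the fallback path
print(attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```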

## Documentation & FAQs

@@ -106,28 +120,21 @@ Project was incubated by Systran and Harvard NLP in 2016 in Lua and ported to Py

Current maintainers (since 2018):

[François Hernandez](https://github.com/francoishernandez) and Ubiqus Team.
[François Hernandez](https://github.com/francoishernandez)
[Vincent Nguyen](https://github.com/vince62s) (Seedfall)

## Citation

If you are using OpenNMT-py for academic work, please cite the initial [system demonstration paper](https://www.aclweb.org/anthology/P17-4012) published in ACL 2017:

```
@inproceedings{klein-etal-2017-opennmt,
title = "{O}pen{NMT}: Open-Source Toolkit for Neural Machine Translation",
author = "Klein, Guillaume and
Kim, Yoon and
Deng, Yuntian and
Senellart, Jean and
Rush, Alexander",
booktitle = "Proceedings of {ACL} 2017, System Demonstrations",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P17-4012",
pages = "67--72",
@misc{klein2018opennmt,
title={OpenNMT: Neural Machine Translation Toolkit},
author={Guillaume Klein and Yoon Kim and Yuntian Deng and Vincent Nguyen and Jean Senellart and Alexander M. Rush},
year={2018},
eprint={1805.11462},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

2 changes: 1 addition & 1 deletion onmt/__init__.py
@@ -21,4 +21,4 @@
onmt.modules,
]

__version__ = "3.4.1"
__version__ = "3.4.2"
6 changes: 0 additions & 6 deletions onmt/bin/translate.py
@@ -9,8 +9,6 @@
from onmt.utils.parse import ArgumentParser
from onmt.utils.misc import use_gpu, set_random_seed

# import cProfile


def translate(opt):
ArgumentParser.validate_translate_opts(opt)
@@ -50,13 +48,9 @@ def _get_parser():


def main():
# profile = cProfile.Profile()
# profile.enable()
parser = _get_parser()
opt = parser.parse_args()
translate(opt)
# profile.disable()
# profile.print_stats(sort="cumulative")


if __name__ == "__main__":
2 changes: 1 addition & 1 deletion onmt/inputters/dynamic_iterator.py
@@ -351,7 +351,7 @@ def __iter__(self):

class OnDeviceDatasetIter:
def __init__(self, data_iter, device):
self.data_iter = iter(data_iter)
self.data_iter = data_iter
self.device = device

def __iter__(self):
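
For context on the hunk above: `OnDeviceDatasetIter` wraps a batch iterator and moves each batch to the target device as it is consumed. A rough, standalone sketch of that pattern is below (illustrative only; the dict-of-tensors batch format is an assumption, not onmt's actual batch structure). One plausible effect of storing the iterable itself instead of calling `iter()` in `__init__` is that each call to `__iter__` can start a fresh pass over the underlying data:

```python
# Illustrative sketch of a device-moving iterator wrapper (not onmt's class).
import torch


class OnDeviceIter:
    def __init__(self, data_iter, device):
        self.data_iter = data_iter          # keep the iterable, not iter(...)
        self.device = torch.device(device)

    def __iter__(self):
        for batch in self.data_iter:
            # move every tensor field of the batch onto the target device
            yield {
                k: v.to(self.device) if torch.is_tensor(v) else v
                for k, v in batch.items()
            }


batches = [{"src": torch.randint(0, 100, (8, 16)), "srclen": torch.full((8,), 16)}]
for batch in OnDeviceIter(batches, "cpu"):
    print(batch["src"].device)  # cpu
```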
2 changes: 1 addition & 1 deletion onmt/modules/multi_headed_attn.py
@@ -487,7 +487,7 @@ def forward(
).transpose(1, 2)
else:
with torch.backends.cuda.sdp_kernel(
enable_flash=False, enable_math=False, enable_mem_efficient=True
enable_flash=False, enable_math=True, enable_mem_efficient=True
):
attn_output = F.scaled_dot_product_attention(
query,
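
The hunk above adds the math backend to the allowed set inside `torch.backends.cuda.sdp_kernel`. That context manager restricts which implementations `F.scaled_dot_product_attention` may dispatch to, so enabling math gives SDPA a guaranteed fallback when the fused memory-efficient kernel rejects a given input. A standalone illustration of the same API (not the surrounding onmt code):

```python
# Backend selection for scaled_dot_product_attention (torch 2.0/2.1 API).
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 8, 16, 64)  # (batch, heads, seq, dim)

if torch.cuda.is_available():
    q, k, v = q.cuda(), k.cuda(), v.cuda()
    # Disallow the flash kernel; allow memory-efficient with math as a fallback.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=True, enable_mem_efficient=True
    ):
        out = F.scaled_dot_product_attention(q, k, v)
else:
    # On CPU, SDPA simply uses its math implementation.
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([2, 8, 16, 64])
```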
3 changes: 1 addition & 2 deletions requirements.opt.txt
@@ -1,9 +1,8 @@
pyrouge
git+https://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070
sentencepiece>=0.1.94,<0.1.98
subword-nmt>=0.3.7
rapidfuzz
scipy
bitsandbytes>=0.39.0
bitsandbytes>=0.39.1
safetensors
spacy
6 changes: 3 additions & 3 deletions setup.py
@@ -11,7 +11,7 @@
description="A python implementation of OpenNMT",
long_description=long_description,
long_description_content_type="text/markdown",
version="3.4.1",
version="3.4.2",
packages=find_packages(),
project_urls={
"Documentation": "http://opennmt.net/OpenNMT-py/",
@@ -21,9 +21,9 @@
},
python_requires=">=3.8",
install_requires=[
"torch>=2.0,<2.1",
"torch>=2.0.1,<2.2",
"configargparse",
"ctranslate2>=3.2,<4",
"ctranslate2>=3.17,<4",
"tensorboard>=2.3",
"flask",
"waitress",
