Bump 3.4.2 (OpenNMT#2493)
* v3.4.2
vince62s authored Oct 20, 2023
1 parent cb35810 commit 9942ecd
Showing 8 changed files with 44 additions and 35 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,15 @@

## [Unreleased]

## [3.4.2](https://github.com/OpenNMT/OpenNMT-py/tree/3.4.2) (2023-10-20)

* torch 2.1 (scaled_dot_product improvements)
* Mistral 7B sliding window
* Speed-up inference
* flash attention 2 (with sliding window) >= v2.3.1
* use FusedRMSNorm from apex if available
* fixed attn_debug

## [3.4.1](https://github.com/OpenNMT/OpenNMT-py/tree/3.4.1) (2023-09-26)

* bug fixes
49 changes: 28 additions & 21 deletions README.md
@@ -25,11 +25,12 @@ Otherwise you can just have a look at the [Quickstart](https://opennmt.net/OpenN
----
## New:

* Special note on Pytorch v2: up to v2.0.1, dynamic shapes are not handled properly, hence torch.compile() will not work with OpenNMT-py. We have tested the nightly build (in May) and it works with a small gain. The next version will be 2.1.
* LLM support with converters for: Llama, OpenLlama, Redpajama, MPT-7B, Falcon.
* You will need PyTorch v2, preferably v2.1, which fixes some `scaled_dot_product_attention` issues
* LLM support with converters for: Llama (+ Mistral), OpenLlama, Redpajama, MPT-7B, Falcon.
* Support for 8bit and 4bit quantization along with LoRA adapters, with or without checkpointing (a minimal LoRA sketch follows this list).
* You can finetune 7B and 13B models on a single RTX 24GB with 4-bit quantization.
* Inference can be forced in 4/8bit using the same layer quantization as in finetuning.
* Tensor parallelism when the model does not fit on one GPU's memory (both training and inference)
* Once your model is finetuned you can run inference either with OpenNMT-py or faster with CTranslate2.
* MMLU evaluation script, see results [here](https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/readme.md)
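
The quantization and LoRA bullets above describe parameter-efficient finetuning. As a rough illustration (a minimal sketch in plain PyTorch, not OpenNMT-py's actual implementation; the class and parameter names are ours), a LoRA adapter keeps the base linear weights frozen and learns only a low-rank update:

```python
# Minimal LoRA sketch (illustration only, not OpenNMT-py's implementation):
# a frozen base Linear plus a trainable low-rank update scaled by alpha / r.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen (and possibly quantized) base weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # base output + low-rank correction B(Ax), scaled
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
out = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # torch.Size([4, 512]) 8192
```

Only the low-rank factors are trained, which is what allows 7B/13B finetuning to fit on a single 24GB GPU once the frozen weights are also quantized to 4 bits.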

@@ -55,7 +56,7 @@ If you used previous versions of OpenNMT-py, you can check the [Changelog](https
OpenNMT-py requires:

- Python >= 3.8
- PyTorch >= 2.0 <2.1
- PyTorch >= 2.0 <2.2

Install `OpenNMT-py` from `pip`:
```bash
pip install OpenNMT-py
```

@@ -77,11 +78,24 @@ Note: if you encounter a `MemoryError` during installation, try to use `pip` wit

```bash
pip install -r requirements.opt.txt
```

Special note on flash attention support:
## Manual installation of some dependencies

Apex is highly recommended for fast performance (especially the legacy fusedadam optimizer and FusedRMSNorm).

```shell
git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --no-build-isolation --config-settings --build-option="--cpp_ext --cuda_ext --deprecated_fused_adam --xentropy --fast_multihead_attn" ./
cd ..
```

Flash attention:

As of Oct. 2023, flash attention 1 has been upstreamed to PyTorch v2, but it is recommended to use flash attention 2 with v2.3.1 for sliding window attention support.

When using regular `position_encoding=True` or Rotary with `max_relative_positions=-1`, OpenNMT-py will try to use an optimized dot-product path.

if you want to use [flash attention 2](https://github.com/Dao-AILab/flash-attention#installation-and-features) then you need to manually install it first:
if you want to use [flash attention](https://github.com/Dao-AILab/flash-attention#installation-and-features) then you need to manually install it first:

```bash
pip install flash-attn --no-build-isolation
```

@@ -91,7 +105,7 @@ if flash attention 2 is not installed, then we will use `F.scaled_dot_product_at

When using `max_relative_positions > 0` or Alibi `max_relative_positions=-2`, OpenNMT-py will use its legacy code for matrix multiplications.

flash attention is a bit faster and saves some GPU memory.
flash attention and `F.scaled_dot_product_attention` are a bit faster and save some GPU memory.
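
The behaviour described above (use flash attention 2 when available, otherwise fall back to `F.scaled_dot_product_attention`) can be illustrated with a small, self-contained sketch. This is not OpenNMT-py's actual code path; it only relies on the documented interfaces of the two libraries (`flash_attn_func` from flash-attn 2 takes `(batch, seq, heads, dim)` tensors, SDPA takes `(batch, heads, seq, dim)`):

```python
# Illustrative sketch: prefer flash-attn 2 when installed and usable,
# otherwise fall back to PyTorch's fused scaled_dot_product_attention.
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # flash-attn 2.x
    HAS_FLASH2 = True
except ImportError:
    HAS_FLASH2 = False


def attention(q, k, v, causal=True):
    """q, k, v: (batch, heads, seq, dim) tensors."""
    if HAS_FLASH2 and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash-attn expects (batch, seq, heads, dim)
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=causal
        )
        return out.transpose(1, 2)
    # PyTorch fallback, still a fused kernel on recent GPUs
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


q = k = v = torch.randn(2, 8, 16, 64)  # CPU float32 exercises the fallback path
print(attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```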

## Documentation & FAQs

@@ -106,28 +120,21 @@ Project was incubated by Systran and Harvard NLP in 2016 in Lua and ported to Py

Current maintainers (since 2018):

[François Hernandez](https://github.com/francoishernandez) and Ubiqus Team.
[François Hernandez](https://github.com/francoishernandez)
[Vincent Nguyen](https://github.com/vince62s) (Seedfall)

## Citation

If you are using OpenNMT-py for academic work, please cite the initial [system demonstration paper](https://www.aclweb.org/anthology/P17-4012) published in ACL 2017:

```
@inproceedings{klein-etal-2017-opennmt,
title = "{O}pen{NMT}: Open-Source Toolkit for Neural Machine Translation",
author = "Klein, Guillaume and
Kim, Yoon and
Deng, Yuntian and
Senellart, Jean and
Rush, Alexander",
booktitle = "Proceedings of {ACL} 2017, System Demonstrations",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P17-4012",
pages = "67--72",
@misc{klein2018opennmt,
title={OpenNMT: Neural Machine Translation Toolkit},
author={Guillaume Klein and Yoon Kim and Yuntian Deng and Vincent Nguyen and Jean Senellart and Alexander M. Rush},
year={2018},
eprint={1805.11462},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

2 changes: 1 addition & 1 deletion onmt/__init__.py
@@ -21,4 +21,4 @@
onmt.modules,
]

__version__ = "3.4.1"
__version__ = "3.4.2"
6 changes: 0 additions & 6 deletions onmt/bin/translate.py
@@ -9,8 +9,6 @@
from onmt.utils.parse import ArgumentParser
from onmt.utils.misc import use_gpu, set_random_seed

# import cProfile


def translate(opt):
ArgumentParser.validate_translate_opts(opt)
@@ -50,13 +48,9 @@ def _get_parser():


def main():
# profile = cProfile.Profile()
# profile.enable()
parser = _get_parser()
opt = parser.parse_args()
translate(opt)
# profile.disable()
# profile.print_stats(sort="cumulative")


if __name__ == "__main__":
2 changes: 1 addition & 1 deletion onmt/inputters/dynamic_iterator.py
@@ -351,7 +351,7 @@ def __iter__(self):

class OnDeviceDatasetIter:
def __init__(self, data_iter, device):
self.data_iter = iter(data_iter)
self.data_iter = data_iter
self.device = device

def __iter__(self):
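
For context on the hunk above: `OnDeviceDatasetIter` wraps a batch iterator and moves each batch to the target device as it is consumed. A rough, standalone sketch of that pattern is below (illustrative only; the dict-of-tensors batch format is an assumption, not onmt's actual batch structure). One plausible effect of storing the iterable itself instead of calling `iter()` in `__init__` is that each call to `__iter__` can start a fresh pass over the underlying data:

```python
# Illustrative sketch of a device-moving iterator wrapper (not onmt's class).
import torch


class OnDeviceIter:
    def __init__(self, data_iter, device):
        self.data_iter = data_iter          # keep the iterable, not iter(...)
        self.device = torch.device(device)

    def __iter__(self):
        for batch in self.data_iter:
            # move every tensor field of the batch onto the target device
            yield {
                k: v.to(self.device) if torch.is_tensor(v) else v
                for k, v in batch.items()
            }


batches = [{"src": torch.randint(0, 100, (8, 16)), "srclen": torch.full((8,), 16)}]
for batch in OnDeviceIter(batches, "cpu"):
    print(batch["src"].device)  # cpu
```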
2 changes: 1 addition & 1 deletion onmt/modules/multi_headed_attn.py
@@ -487,7 +487,7 @@ def forward(
).transpose(1, 2)
else:
with torch.backends.cuda.sdp_kernel(
enable_flash=False, enable_math=False, enable_mem_efficient=True
enable_flash=False, enable_math=True, enable_mem_efficient=True
):
attn_output = F.scaled_dot_product_attention(
query,
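
The hunk above adds the math backend to the allowed set inside `torch.backends.cuda.sdp_kernel`. That context manager restricts which implementations `F.scaled_dot_product_attention` may dispatch to, so enabling math gives SDPA a guaranteed fallback when the fused memory-efficient kernel rejects a given input. A standalone illustration of the same API (not the surrounding onmt code):

```python
# Backend selection for scaled_dot_product_attention (torch 2.0/2.1 API).
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 8, 16, 64)  # (batch, heads, seq, dim)

if torch.cuda.is_available():
    q, k, v = q.cuda(), k.cuda(), v.cuda()
    # Disallow the flash kernel; allow memory-efficient with math as a fallback.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=True, enable_mem_efficient=True
    ):
        out = F.scaled_dot_product_attention(q, k, v)
else:
    # On CPU, SDPA simply uses its math implementation.
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([2, 8, 16, 64])
```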
3 changes: 1 addition & 2 deletions requirements.opt.txt
@@ -1,9 +1,8 @@
pyrouge
git+https://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070
sentencepiece>=0.1.94,<0.1.98
subword-nmt>=0.3.7
rapidfuzz
scipy
bitsandbytes>=0.39.0
bitsandbytes>=0.39.1
safetensors
spacy
6 changes: 3 additions & 3 deletions setup.py
@@ -11,7 +11,7 @@
description="A python implementation of OpenNMT",
long_description=long_description,
long_description_content_type="text/markdown",
version="3.4.1",
version="3.4.2",
packages=find_packages(),
project_urls={
"Documentation": "http://opennmt.net/OpenNMT-py/",
@@ -21,9 +21,9 @@
},
python_requires=">=3.8",
install_requires=[
"torch>=2.0,<2.1",
"torch>=2.0.1,<2.2",
"configargparse",
"ctranslate2>=3.2,<4",
"ctranslate2>=3.17,<4",
"tensorboard>=2.3",
"flask",
"waitress",
