- Below is a list of resources I find useful for building Large Language Models (LLMs) in Japanese.
- The list is not complete; there are other relevant resources that are not yet listed.
- Some resources include my comments.
- General
- Pre-training datasets
- Downstream tasks
- Tokenization
- Models
- Model Architecture
- Training
- Evaluation
- A Survey of Large Language Models [arXiv]
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling [arXiv]
- Of note: Pile-CC
- The Pile contains about 0.07% Japanese text (approx. 900M characters), estimated from character types.
- RedPajama-Data [github]
- However, the Pile still seems better than RedPajama-Data - cf. https://twitter.com/BlancheMinerva/status/1652899628356960256?s=20
- Better Question-Answering Models on a Budget [arXiv]
- Having briefly checked it, it looks interesting, but they only compare their models against OPT?
- Word segmentation by MeCab+UniDic + subword tokenization by SentencePiece
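A minimal sketch of that two-stage pipeline, assuming the `fugashi` MeCab wrapper (with `unidic-lite`) and the `sentencepiece` package; the toy corpus, file names, and vocabulary settings are placeholders, not settings from any of the listed models.

```python
# Sketch: MeCab+UniDic word segmentation followed by SentencePiece subword training.
# Assumes `fugashi` (with `unidic-lite`) and `sentencepiece` are installed.
import fugashi
import sentencepiece as spm

tagger = fugashi.Tagger()  # MeCab with a UniDic-based dictionary

def pre_segment(text: str) -> str:
    # Word segmentation: join MeCab surface forms with spaces
    return " ".join(word.surface for word in tagger(text))

# Toy corpus for illustration only; real pipelines train on a large corpus file.
sentences = [
    "吾輩は猫である。名前はまだ無い。",
    "どこで生れたかとんと見当がつかぬ。",
]
with open("corpus_segmented.txt", "w", encoding="utf-8") as f:
    for s in sentences:
        f.write(pre_segment(s) + "\n")

# Subword tokenization: train a SentencePiece model on the pre-segmented text.
spm.SentencePieceTrainer.train(
    input="corpus_segmented.txt",
    model_prefix="ja_subword",
    vocab_size=100,              # placeholder; real vocabularies are far larger
    character_coverage=0.9995,   # common setting for Japanese
    hard_vocab_limit=False,      # treat vocab_size as a soft limit for this toy corpus
)

sp = spm.SentencePieceProcessor(model_file="ja_subword.model")
print(sp.encode(pre_segment("吾輩は猫である。"), out_type=str))
```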
- 【インターンレポート】6.7B日本語モデルに対するLoRAチューニング ([Intern Report] LoRA tuning of a 6.7B Japanese model)
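As a rough illustration of the LoRA setup discussed there (not the report's actual code), the sketch below attaches LoRA adapters to a Hugging Face causal LM with `peft`; the model name and `target_modules` are assumptions and depend on the architecture.

```python
# Sketch: attaching LoRA adapters to a Japanese causal LM with `transformers` + `peft`.
# The model name and target_modules are placeholders (GPT-NeoX-style models use a
# fused "query_key_value" projection; other architectures differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "rinna/japanese-gpt-neox-3.6b"  # placeholder Japanese causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)  # for preparing training data
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,                                 # low-rank dimension
    lora_alpha=16,                       # scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection(s) to adapt
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```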
- Self-Instruct: Aligning Language Models with Self-Generated Instructions [arXiv]
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [arXiv] - Related tweet
- Learning to summarize from human feedback [arXiv]
- Training language models to follow instructions with human feedback [arXiv]
- RLHF works because it rates full sentences
- https://twitter.com/savvyRL/status/1651255588813443073?s=20
- https://twitter.com/mr_bay_area/status/1651594421551644678?s=20
- Sequence Level Training with Recurrent Neural Networks [arXiv]
- EleutherAI/lm-evaluation-harness: v0.3.0
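A quick way to try the harness from Python (a v0.3.x-style sketch; the checkpoint, task list, and `limit` are placeholders for a smoke test, not a recommended evaluation setup; swap in a Japanese checkpoint and the tasks you actually need).

```python
# Sketch: running a small evaluation with lm-evaluation-harness's Python API (v0.3.x style).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",               # Hugging Face causal LM backend
    model_args="pretrained=gpt2",    # placeholder; replace with your checkpoint
    tasks=["hellaswag"],             # placeholder task list
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
    limit=100,                       # evaluate only a subset for a quick check
)
print(results["results"])
```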
- japanese-toxic-dataset
- Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings
- ChatGPTに共通テスト(旧センター試験)を解かせてみた (Having ChatGPT solve the Common Test for University Admissions, formerly the Center Test) - Related tweet
- Performance report on rinna/japanese-gpt-1b fine-tuned on a Japanese (translated) version of the Dolly dataset [tweet]
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond [arXiv]