diff --git a/README.md b/README.md
index aeec4ef70..b4d372356 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,8 @@ The goal is to offer high-level APIs for developers to quickly get started in wo
 - [Scaling transforms from laptop to cluster](#laptop_cluster)
 - [Repository Use and Navigation](doc/repo.md)
 - [How to Contribute](CONTRIBUTING.md)
-- [Papers and Talks](#talks_papers)
+- [Talks and Papers](#talks_papers)
+- [Citations](#citations)
 
 ## 📖 About
 
@@ -131,7 +132,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al
 | **Data Ingestion** | | | | |
 | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
-| [HTML to Parquet](transforms/universal/html2parquet/python/README.md) | :white_check_mark: | | | |
+| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | |
 | **Universal (Code & Language)** | | | | |
 | [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
@@ -220,3 +221,23 @@ You can run transforms via docker image or using virtual environments. This [doc
 5. Talk on "Hands on session for fine tuning LLMs" [Video](https://www.youtube.com/watch?v=VEHIA3E64DM)
 6. Talk on "Build your own data preparation module using data-prep-kit" [Video](https://www.youtube.com/watch?v=0WUMG6HIgMg)
 
+## Citations
+
+If you use Data Prep Kit in your research, please cite our paper:
+
+```bash
+@misc{wood2024dataprepkitgettingdataready,
+      title={Data-Prep-Kit: getting your data ready for LLM application development},
+      author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
+      and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
+      and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
+      and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
+      and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
+      and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
+      year={2024},
+      eprint={2409.18164},
+      archivePrefix={arXiv},
+      primaryClass={cs.AI},
+      url={https://arxiv.org/abs/2409.18164},
+}
+```
\ No newline at end of file