What's Changed
- Restructure the repository to distinguish/separate runtime libraries by @daw3rd in #140
- Move transform code into ray subdirectory - towards splitting transform runtimes. by @daw3rd in #143
- restore lost transforms/universal/noop/ray content by @daw3rd in #144
- New Readme file created for memory and endurance tests by @shahrokhDaijavad in #145
- LAB to Kit by @shahrokhDaijavad in #147
- Update ray/README.md by @eltociear in #148
- kfp multi jobs by @blublinsky in #142
- small fix in the init file by @blublinsky in #150
- rename make targets to be ray-specific by @daw3rd in #146
- Naming, docs and fix for recent binary file processing changes by @daw3rd in #153
- bug fixes by @blublinsky in #155
- Binary by @blublinsky in #141
- update kfp image version by @roytman in #159
- Update README.md for Broken links by @shahrokhDaijavad in #160
- adding multi_launcher tests by @blublinsky in #164
- Enable kfp in GH action for testing workflows by @revit13 in #149
- Fix paths in examples scripts. by @revit13 in #180
- Fail workflow if input size is empty. by @revit13 in #181
- library versions update by @blublinsky in #186
- Handle empty input parameter. by @Mohammad-nassar10 in #158
- Moving kfp workflows transform_workflows to transform directory. by @revit13 in #151
- update KFP docs by @roytman in #189
- Dev2 by @roytman in #191
- Modified ingress config (#130) by @D-Sai-Venkatesh in #156
- fixed flush in transform_file_processor.py by @blublinsky in #190
- added PLI related language extensions by @jitendrasinghibm in #177
- more fixes to the transform file processor by @blublinsky in #195
- Spark runtime by @cmadam in #183
- Fix white check marks in top readme. by @daw3rd in #199
- Minor fixes to kind/README.md. by @revit13 in #208
- Add utils functions to kfp support lib. by @Mohammad-nassar10 in #209
- Add Super pipeline for code transforms. by @revit13 in #172
- Tutorial README files fixes by @shahrokhDaijavad in #214
- Added copyright to the Spark files by @cmadam in #207
- Fix dependabot alert on tqdm in fdedup. by @daw3rd in #218
- Update filter_local.py by @shahrokhDaijavad in #217
- Split data-processing-lib/ray into python and ray. by @daw3rd in #213
- Enhanced the default 'make clean' rule to delete python leftovers and… by @daw3rd in #219
- small fixes by @roytman in #220
- Fixes after testing. by @revit13 in #223
- Change kfp_v1_workflow_support. by @revit13 in #227
- Split noop ray transform into ray and python runtimes. by @daw3rd in #221
- Fix tqdm security issue in ededup by @daw3rd in #224
- Transform project conventions doc and makefile fix… by @daw3rd in #229
- Fixes after testing. by @revit13 in #232
- Runtime reorg by @daw3rd in #230
- Auto generate kfp pipelines. by @Mohammad-nassar10 in #193
- ingest to parquet rewrite by @blublinsky in #231
- KFPv2 support step 1 by @roytman in #226
- Rename of ingest_2_parquet file. by @daw3rd in #241
- Make all top level make targets pass w/o error by @daw3rd in #247
- Readme, pyproject metadata and makefile fixes in noop and filter. by @daw3rd in #240
- add retries counter to data processing by @blublinsky in #245
- Initial split of tokenization transform into ray and python by @daw3rd in #243
- add language identification transform module by @dtsuzuku-ibm in #256
- small changes to get ready for pdf by @blublinsky in #261
- Combine the common KFP support code in a shared library by @roytman in #253
- Fix tasks tags in kfp workflows. by @revit13 in #236
- Adjust ingest_2_parquet workflow. by @revit13 in #248
- Repo Root README and CONTRIBUTING clarifications by @shahrokhDaijavad in #264
- add build-language job to build-images workflow by @dtsuzuku-ibm in #268
- remove the artifactory settings by @roytman in #280
- update docs for KFPv2 by @roytman in #279
- Enhancing some README files by @shahrokhDaijavad in #278
- extended logging to print % and number processed files by @blublinsky in #272
- Updated transform readmes to reference correct runtime when describing cli params. by @daw3rd in #284
- Update advanced-transform-tutorial.md by @shahrokhDaijavad in #287
- add test-language job by @dtsuzuku-ibm in #286
- Change execution log file name. by @Mohammad-nassar10 in #251
- Update tests for KFP v2. by @revit13 in #255
- remove entire pipeline timeouts by @roytman in #270
- Randomly choose workflow to run in GH action. by @revit13 in #281
- Change the docker user to root by @takuyagt in #291
- Initial version of profiler by @blublinsky in #269
- Minimum explanation for VS Code by @shahrokhDaijavad in #290
- move logger to ensure Ray logging is correct by @blublinsky in #301
- Use dpk user for malware python image by @takuyagt in #304
- Move hack dirs to scripts dir by @revit13 in #295
- Fix issue #274 for venv corruption via make -n venv by @daw3rd in #302
- Installation of minio added to the transform README files by @shahrokhDaijavad in #303
- Minor fixes to profiler workflow by @revit13 in #308
- Ray version update by @blublinsky in #305
- update notebook by @shivdeep-singh-ibm in #310
- Split code quality, malware and proglang select transforms into python and ray. by @daw3rd in #288
- renaming of ingest_2_parquet by @blublinsky in #316
- move transform exceptions doc out of ray runtime to overview by @daw3rd in #319
- Inputcode2parquet rename by @daw3rd in #320
- fault tolerance by @blublinsky in #321
- Makefile rules updates by @revit13 in #323
- updated pyarrow version by @blublinsky in #325
- Fix make run-cli-sample for code2parquet by @daw3rd in #328
- Updated generate (simple pipeline) pipeline by @D-Sai-Venkatesh in #311
- Some new thoughts on cutting a release, especially scripts/release.sh by @daw3rd in #309
- Corrected Readme to update file path, added more detail signoff steps by @santoshborse in #330
- improve doc on transform design/expectations by @daw3rd in #331
- fix a typo by @roytman in #333
- Improvements to code2parquet transform by @daw3rd in #329
- implementing missing pyproject on transforms by @blublinsky in #327
- add new params to lang_id to store the results of language identification by @dtsuzuku-ibm in #322
- change content column name used in wf script by @dtsuzuku-ibm in #340
- small bug fixes by @blublinsky in #342
- fix typos by @roytman in #341
- update top readme table of transforms by @daw3rd in #344
- update the kfp release process by @roytman in #338
- remove globals in ray transforms that should instead be references to the python transform globals by @daw3rd in #336
- Update K8s cluster deployment by @revit13 in #334
- Fix Instruction to create NOOP transformer by @santoshborse in #346
- Add workflow-build target by @revit13 in #348
- Update readme to point to new code2parquet transform by @Bytes-Explorer in #349
- add new release docs and stop publishing in script for 0.2.0 by @daw3rd in #337
- Add ingest2parquet step to superpipeline. by @Mohammad-nassar10 in #273
- code2parquet fixes on domain/snapshot and document_id by @daw3rd in #347
- add kfp_ray README files by @roytman in #351
- Changes in code2parquet, ingest2parquet, and advance tutorial readmes. by @daw3rd in #352
- disable debug flag, by default, in release-branch.sh by @daw3rd in #353
- Update release-branch script to not verify commits to avoid failures by @daw3rd in #354
- fix 'make set-versions' for doc_id, e/fdedup transforms by @daw3rd in #360
- Fixing Tutorial to run Python and Ray versions correctly by @santoshborse in #359
- added resize by @blublinsky in #350
- define required transform() method as abstract to AbstractTableTransform by @daw3rd in #358
- Remove requirements.txt from filter and doc_id spark transforms by @daw3rd in #357
- Turn off publishing of transform wheels and fix release script to not commit to main branch by @daw3rd in #363
- Fix kfp-data-processing tag and resize workflow. by @revit13 in #366
- Fix set-versions target in kfp/kfp_ray_components. by @revit13 in #368
- Add releases/** to branch to run on by @daw3rd in #369
- apply fix needed for apparent docker/git action problem when running… by @daw3rd in #375
- Mods to github workflows/actions by @daw3rd in #376
- fix relative input path by @roytman in #387
- Update Code Quality KFP_RAY README.md by @Param-S in #396
- Add github action to push docker images with latest tag. by @revit13 in #370
- A couple of wrong links in README file and a typo by @shahrokhDaijavad in #398
- Update examples notebook. by @shivdeep-singh-ibm in #401
- fix stats dictionary settings in fdedup by @daw3rd in #397
- fix release script to not modify main branch and add better logging by @daw3rd in #393
- Doc updates, especially new quickstart by @daw3rd in #402
- Document processing of local data using python transform image by @daw3rd in #394
- Enables configuration of makefile to use src or pypi for data-prep-kit library dependencies by @daw3rd in #405
- Readme changes by @daw3rd in #407
- Documentation updates by @Bytes-Explorer in #408
- add checks for expected arguments in .make.defaults and .make.transforms. by @daw3rd in #410
- Readme typo and hyperlinks by @shahrokhDaijavad in #411
- Added documentation on how to use data-access-factory by @santoshborse in #415
- fix broken link to DAF in library overview by @daw3rd in #417
- Fixed link and typos by @shahrokhDaijavad in #423
- Fix get config parameter function. by @Mohammad-nassar10 in #427
- Remove creation of clusterrolebinding in kubeflow installation. by @revit13 in #412
- Add repo level ordering tf by @shivdeep-singh-ibm in #377
- Fix Dockerfiles to doc the copied main() src file. remove RUNTIME from Makefiles by @daw3rd in #440
- update repo-level transform to notebook by @shivdeep-singh-ibm in #435
- apply ZSTD compression by @yuanchi2807 in #441
- Fix abstract_test binary file comparison to allow a percentage diff on parquet binaries by @daw3rd in #443
- docs: update README.md by @eltociear in #442
- Convert info to debug to reduce the verbosity by @santoshborse in #444
- add doc_quality transform by @dtsuzuku-ibm in #282
- stop installing extra dependencies in building doc quality python image by @dtsuzuku-ibm in #445
- Add pdf2parquet transform by @dolfim-ibm in #416
- do not publish-dist in make publish by @dtsuzuku-ibm in #453
- adding unrecoverable exception by @blublinsky in #452
- Add repo_name to code2parquet transform and ingest2parquet tool by @sapthasurendran in #428
- fix path in notebook by @shivdeep-singh-ibm in #446
- added ededup python implementation by @blublinsky in #436
- Fix to ingest2parquet by @sapthasurendran in #457
- Fix local run of noop by @dolfim-ibm in #454
- Fixes some bad assert statements and adds fuzzy floating point comparison 2 by @daw3rd in #456
- Try and make all transform Dockerfiles use a common set of first statements by @daw3rd in #458
- fix broken install links in repo.md by @daw3rd in #460
- remove comments including internal url by @dtsuzuku-ibm in #462
- Update Docling and improve Enum for content type by @dolfim-ibm in #463
- New transformer for license and copyright header removal by @ykalathiya in #332
- Super pipeline generator by @Mohammad-nassar10 in #233
- pdf2parquet updates by @dolfim-ibm in #469
- Fix image name in header_cleanser_wf.py by @revit13 in #475
- Remove Lakehouse from comments. by @revit13 in #478
- added dedup percentage to python implementation by @blublinsky in #476
- update documentation of output columns annotated by doc quality transform by @dtsuzuku-ibm in #479
- minor changes by @shivdeep-singh-ibm in #480
- Language notebook for RAG ingestion w/ chunker and encoder transforms. by @dolfim-ibm in #461
- Makefile and release documentations by @daw3rd in #477
- Minor fixes to language transforms. by @revit13 in #486
- Add KFP prefix to DataAccess object created from kfp component. by @revit13 in #487
- update dependency to remove conflict with code_quality by @touma-I in #490
- Minor change to header_cleanser workflow by @revit13 in #491
- initial implementation of the simple python APIs for invoking transforms by @blublinsky in #413
- Allow users to add args to the .make.defaults image building target and pip installs by @daw3rd in #489
- Enhancements to the Table of the root README file by @shahrokhDaijavad in #495
- Adding support for python multiprocessing pool by @blublinsky in #492
- Additional changes to two README files by @shahrokhDaijavad in #498
- fix lib doc .py links and update resize readme by @daw3rd in #499
- Update docling version and tests by @dolfim-ibm in #504
- fix lang_id, use parameters related to output columns by @dtsuzuku-ibm in #500
- fix issue 481 - double logging by @daw3rd in #505
- clear error output from notebook cell by @shivdeep-singh-ibm in #506
- Html2parquet changes made by @sungeunan-ibm in #496
- updated multiprocessing doc, moved test by @blublinsky in #507
- Further enhancements to the root README file by @shahrokhDaijavad in #510
- Pii transform by @SowmyaLR in #471
- Super pipeline KFPv2. by @Mohammad-nassar10 in #488
- Update README.md by @shahrokhDaijavad in #519
- Add pipeline for repo_level_order transform. by @Mohammad-nassar10 in #512
- Add language transform workflows to CI/CD tests by @roytman in #521
- fix fdedup workflow error by @roytman in #523
- doc id python by @blublinsky in #509
- Add PII to the root README by @shahrokhDaijavad in #527
- incremental ededup by @blublinsky in #502
- Add new notebook by @shivdeep-singh-ibm in #473
- prevent kfp test for transforms that do not support it by @roytman in #530
- Update README.md fix typos by @santoshborse in #537
- Don't use pip cache in Dockerfile and fix exception reporting msg by @daw3rd in #539
- fix colab links in notebooks by @shivdeep-singh-ibm in #542
- Cleanup PII transform and other CICD related failure by @touma-I in #538
- update sample notebook by @sapthasurendran in #547
- pdf2parquet updates by @dolfim-ibm in #528
- Fix make set-versions target by @daw3rd in #548
- Added instructions for using pip install by @touma-I in #545
- AI Alliance RAG Demo by @touma-I in #526
- Improve repo level ordering tf by @shivdeep-singh-ibm in #434
- Use cpu-only version of torch on linux builds by @dolfim-ibm in #558
- Make it easier to get started by @Bytes-Explorer in #561
- Improve noop test templates for reuse. by @daw3rd in #555
- Fix docs and mkdocs documentation by @shivdeep-singh-ibm in #562
- Organise examples by use cases by @Bytes-Explorer in #563
- add KFP_BLACK_LIST by @roytman in #560
- Getting started instructions and code tweak by @sujee in #566
- Fix parameters type for pii transform pipeline. by @Mohammad-nassar10 in #522
- disable publish-image rule for pii_redactor to allow merge to pass by @daw3rd in #570
- Getting started 2 : Added a colab notebook, updated for local data. by @sujee in #572
- Custom column validator for pdf2parquet by @dolfim-ibm in #577
- kfp enhancement with new parameters by @blublinsky in #580
- fixed paths in README by @sujee in #581
- update docling dependencies to newer versions by @dolfim-ibm in #584
- Alternate spark runtime implementation by @blublinsky in #406
- Html2parquet Makefile added by @sungeunan-ibm in #524
- disable test workflow when no code files change by @daw3rd in #589
- doc_chunk updates and new parameters by @dolfim-ibm in #591
- Minor fix to workflow-manual-run.yml by @revit13 in #594
- fix: pin all docling deps for more stability by @dolfim-ibm in #596
- refactoring of data access code by @blublinsky in #592
- doc_id and source_doc_id params in doc_chunk by @dolfim-ibm in #598
- Update root README in order to try DPK faster by @shahrokhDaijavad in #593
- tips for running on google colab by @sujee in #587
- Change some release documentation by @daw3rd in #599
- Start triggering testing at finer granularity in the repo by @daw3rd in #595
- Disable ci/cd spark image build when transform does not implement spark by @daw3rd in #604
- Add generator to noop pipeline. by @Mohammad-nassar10 in #610
- Add delay before deleting ray cluster in kfp component. by @revit13 in #612
- Add a link to the Google Colab Tips file from the root README by @shahrokhDaijavad in #601
- log error in case of exception by @shivdeep-singh-ibm in #617
- Build transforms wheel by @touma-I in #493
- Pending version change from 0.2.1 to 0.2.2.dev0 by @touma-I in #621
- The link to this example was broken by @shahrokhDaijavad in #623
- adding exception handling for transforms creation by @blublinsky in #616
- align docling versions among transforms by @dolfim-ibm in #629
- Release documentation and notes by @daw3rd in #624
- set spark base image tag to that from .make.versions by @daw3rd in #626
- html2parquet Moved to language folder by @sungeunan-ibm in #615
- Define spark version in .make.versions by @daw3rd in #632
- Adding Resize Spark by @blublinsky in #630
- Add nodes toleration to Ray pods by @revit13 in #627
- add license check transform by @shivdeep-singh-ibm in #257
- Update README.md to add the DPK arXiv paper by @shahrokhDaijavad in #647
- expanding profiler runtimes support by @blublinsky in #631
- Fix kfp pipelines testing in github workflow. by @revit13 in #611
- Test and address conflicts when using the transforms package in a language application with specific requirements.txt file by @touma-I in #640
- Add Licence transforms to Code Superpipeline. by @Mohammad-nassar10 in #400
- The table of transforms was duplicated by @daw3rd in #649
- adding hap transform by @ian-cho in #638
- Fix makefile publish target for license select. by @shivdeep-singh-ibm in #656
- Updated RAG example for DPK version 0.2.1 by @sujee in #655
- changed doc_text into contents in related files by @ian-cho in #663
- Update HAP README.md by @ian-cho in #661
- Change the calculation of the desired ray actors by @revit13 in #654
- Various fixes to workflows, especially kfp by @daw3rd in #664
- Add Tolerations and node selector to KFP pods by @revit13 in #643
- adding execution_stats to python metadata by @blublinsky in #650
- add dpk_connector to dpk by @hmtbr in #637
- support python 3.12 by @roytman in #619
- feat: #641 - Extend document chunker transform to support fixed-size token window chunker with overlap by @juancappi in #642
- Fix kfp workflows which were created with a bad template by @daw3rd in #672
- Added ray version of the html2parquet transform by @sungeunan-ibm in #666
- Update README.md now that we have Ray version of html2parquet. Also added the Citations section. by @shahrokhDaijavad in #680
- Move transform version numbers out of .make.versions to transform-specific file by @daw3rd in #657
- Update README.md by @dnielsen in #688
- Minor change to kfp workflow. by @revit13 in #684
- code license failed to apply module cli params by @Param-S in #692
- add MultiLock class by @daw3rd in #693
- Fuzzy dedup modifications by @cmadam in #687
- Added ray-based version of hap transform by @ian-cho in #685
- Update README.md to add Ray version of HAP to the table by @shahrokhDaijavad in #700
- Single Packages (0.2.2.dev1) for data-prep-toolkit and data-prep-toolkit-transforms with python3.12 by @touma-I in #682
- Fix workflow failures due to usage of pip by @revit13 in #704
- Fixing cicd failures based on ubuntu-latest changing by @daw3rd in #706
- Add KFP workflow for HAP. by @revit13 in #701
- KFP for HAP to be added to the table in README by @shahrokhDaijavad in #708
- Update test-kfp.yml workflow. by @revit13 in #702
- Change the release script and documentation to bump the minor version instead of micro version. by @daw3rd in #698
- Fix missing dev1 in README.md pip install command by @touma-I in #709
- added folder_transform by @blublinsky in #691
- Kfp workflow for html2parquet by @revit13 in #694
- workflow to publish data-connector-lib by @hmtbr in #712
- Transform for Code Profiling by @pankajskku in #646
- Added Trafilatura parameters for heading, table, image extraction by @sungeunan-ibm in #707
- Update Run_your_first_transform_colab.ipynb by @shahrokhDaijavad in #711
- Adding a resources page by @sujee in #696
- Update README.md by @pankajskku in #714
- Update Docling to 1.20.0 by @dolfim-ibm in #723
- Fix 'IndexError: list index out of range' in header_cleanser by @takuyagt in #720
- Multiple fixes for semantic order transform by @shivdeep-singh-ibm in #726
- Intro example 1 by @sujee in #718
- fix link to pdf2parquet readme.md by @touma-I in #727
- implement subdomain focus feature in data-prep-connector by @hmtbr in #725
- add license/copyright as appropriate to .py files and add check-licensing.sh script by @daw3rd in #715
- Update README.md of the intro example for the typo by @shahrokhDaijavad in #730
- docs: update README.md by @eltociear in #728
- Update release number following release cutoff for http connector by @touma-I in #733
New Contributors
- @eltociear made their first contribution in #148
- @Mohammad-nassar10 made their first contribution in #158
- @D-Sai-Venkatesh made their first contribution in #156
- @jitendrasinghibm made their first contribution in #177
- @cmadam made their first contribution in #183
- @dtsuzuku-ibm made their first contribution in #256
- @takuyagt made their first contribution in #291
- @santoshborse made their first contribution in #330
- @yuanchi2807 made their first contribution in #441
- @dolfim-ibm made their first contribution in #416
- @ykalathiya made their first contribution in #332
- @sungeunan-ibm made their first contribution in #496
- @SowmyaLR made their first contribution in #471
- @sujee made their first contribution in #566
- @ian-cho made their first contribution in #638
- @hmtbr made their first contribution in #637
- @juancappi made their first contribution in #642
- @dnielsen made their first contribution in #688
- @pankajskku made their first contribution in #646
Full Changelog: v0.1.0-dpk...v0.2.2-connector