Project Restructuring and Modularity Improvements #3

CreativeSelf0 · 2024-08-08T22:37:10Z

Project Restructuring and Modularity Improvements

Overview

This pull request implements a significant restructuring of the CATT (Character-based Arabic Tashkeel Transformer) project to improve modularity, maintainability, and ease of use. The changes focus on reorganizing the codebase, introducing a new API module, and standardizing the project structure.

Key Changes

1. Project Structure Reorganization

Moved core CATT functionality into a dedicated catt/ package
Separated data handling, models, and utilities into distinct submodules
Created a new api/ directory for API-related functionality
Added configs/, dataset/, docs/, models/, scripts/, and tests/ directories for better organization

2. API Implementation

Introduced a new api/ module with FastAPI integration
Created main.py, models.py, and catt_service.py for API functionality
Implemented a /tashkeel endpoint for diacritization requests
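
As a rough illustration of how a client might call the new endpoint, the sketch below builds a JSON request for /tashkeel using only the Python standard library. The host, port, HTTP method (POST), and the `text` field name are all assumptions made for this example; the real route and request schema should be checked in api/main.py and api/models.py.

```python
import json
import urllib.request

# Assumed values for illustration only; they are not taken from this PR.
BASE_URL = "http://localhost:8000"  # hypothetical host and port


def build_tashkeel_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /tashkeel endpoint."""
    # "text" is a hypothetical field name for the undiacritized input.
    payload = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/tashkeel",
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_tashkeel_request("السلام عليكم")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```

Building the request separately from sending it keeps the example runnable without a live server.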

3. Modularity Improvements

Refactored encoder and decoder models into separate files
Created model_types.py for better model type management
Moved utility functions into a dedicated utils/ submodule
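
To make the idea of model type management concrete, here is a minimal sketch of what such a module could look like. The `ModelType` class, its member names, the `"ed"`/`"eo"` values, and the `parse_model_type` helper are hypothetical and are not copied from model_types.py.

```python
from enum import Enum


class ModelType(str, Enum):
    """Hypothetical enumeration of the two CATT model variants."""

    ENCODER_DECODER = "ed"
    ENCODER_ONLY = "eo"


def parse_model_type(value: str) -> ModelType:
    """Convert a user-supplied string such as ' ED ' into a ModelType."""
    try:
        return ModelType(value.strip().lower())
    except ValueError:
        valid = ", ".join(member.value for member in ModelType)
        raise ValueError(
            f"Unknown model type {value!r}; expected one of: {valid}"
        ) from None
```

An enumeration like this lets callers compare against named members instead of scattering string literals through the code.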

4. Configuration Management

Added configs/ directory with separate configuration files for Encoder-Decoder and Encoder-Only models
Introduced a Sample_config.yaml for easier customization
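
One way the per-model configuration files could be selected at runtime is sketched below. The file names come from the configs/ directory in this PR; the `"ed"`/`"eo"` keys and the `config_path_for` helper are assumptions for illustration only.

```python
from pathlib import Path

# File names are taken from the configs/ directory listed in this PR;
# the "ed"/"eo" keys are shorthand assumed for this example.
CONFIG_FILES = {
    "ed": "EncoderDecoder_config.yaml",
    "eo": "EncoderOnly_config.yaml",
}


def config_path_for(model_kind: str, config_dir: Path = Path("configs")) -> Path:
    """Return the path of the configuration file for the given model kind."""
    try:
        return config_dir / CONFIG_FILES[model_kind]
    except KeyError:
        raise ValueError(
            f"No configuration known for model kind {model_kind!r}"
        ) from None
```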

5. Dependency Management

Added pyproject.toml and poetry.lock for better dependency management using Poetry

6. Documentation and Testing

Created a docs/ directory for future documentation
Added a tests/ directory for unit tests (to be implemented)

7. Simplified Prediction and Training Scripts

Consolidated predict_ed.py and predict_eo.py into a single predict_catt.py
Refactored train_catt.py for improved clarity and consistency
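
The consolidation of the two prediction scripts can be pictured as a single command-line entry point that takes the model type as an argument. The sketch below only shows that idea; the flag names `--model-type` and `--checkpoint` are hypothetical and may not match the actual interface of predict_catt.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for one prediction script covering both model types."""
    parser = argparse.ArgumentParser(
        description="Diacritize Arabic text with a CATT model."
    )
    # Flag names below are hypothetical, not taken from predict_catt.py.
    parser.add_argument(
        "--model-type",
        choices=["ed", "eo"],
        required=True,
        help="ed = Encoder-Decoder, eo = Encoder-Only",
    )
    parser.add_argument(
        "--checkpoint",
        required=True,
        help="path to a .pt checkpoint file",
    )
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(
    ["--model-type", "eo", "--checkpoint", "models/best_eo_mlm_ns_epoch_193.pt"]
)
```

With one parser, the choice between the Encoder-Decoder and Encoder-Only paths becomes a branch on `args.model_type` rather than a choice between two scripts.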

Before and After Structure Comparison

Before

└── catt
    ├── benchmarking
    │   ├── all_models_CATT_data
    │   ├── all_models_WikiNews_data
    │   ├── eo_ed_mlm_ns
    │   │   ├── catt_data
    │   │   └── wikinews_data
    │   ├── run_compute_der_all_CATT_benchmark.sh
    │   ├── run_compute_der_all_WikiNews_benchmark.sh
    │   ├── run_compute_der_eo_ed_mlm_ns_long_training.sh
    │   └── run_compute_der_eo_ed_mlm_ns_short_training.sh
    ├── bw2ar.py
    ├── compute_der.py
    ├── ed_pl.py
    ├── ed.py
    ├── eo_pl.py
    ├── eo.py
    ├── LICENSE
    ├── predict_ed.py
    ├── predict_eo.py
    ├── README.md
    ├── tashkeel_dataset.py
    ├── tashkeel_tokenizer.py
    ├── train_catt.py
    ├── transformer.py
    ├── utils.py
    └── xer.py

After

├── api
│   ├── catt_service.py
│   ├── __init__.py
│   ├── main.py
│   └── models.py
├── benchmarking
│   ├── all_models_CATT_data
│   ├── all_models_WikiNews_data
│   ├── eo_ed_mlm_ns
│   │   ├── catt_data
│   │   └── wikinews_data
│   ├── run_compute_der_all_CATT_benchmark.sh
│   ├── run_compute_der_all_WikiNews_benchmark.sh
│   ├── run_compute_der_eo_ed_mlm_ns_long_training.sh
│   └── run_compute_der_eo_ed_mlm_ns_short_training.sh
├── catt
│   ├── data
│   │   ├── __init__.py
│   │   ├── tashkeel_dataset.py
│   │   └── tashkeel_tokenizer.py
│   ├── __init__.py
│   ├── models
│   │   ├── encoder_decoder.py
│   │   ├── encoder_only.py
│   │   ├── __init__.py
│   │   ├── model_types.py
│   │   └── transformer.py
│   └── utils
│       ├── arabic_utils.py
│       ├── bw2ar.py
│       ├── __init__.py
│       └── xer.py
├── compute_der.py
├── configs
│   ├── EncoderDecoder_config.yaml
│   ├── EncoderOnly_config.yaml
│   └── Sample_config.yaml
├── dataset
│   ├── test
│   ├── train
│   └── val
├── docs
├── LICENSE
├── models
│   ├── best_ed_mlm_ns_epoch_178.pt
│   └── best_eo_mlm_ns_epoch_193.pt
├── poetry.lock
├── predict_catt.py
├── pyproject.toml
├── README.md
├── scripts
├── tests
└── train_catt.py

Benefits

Improved code organization and maintainability
Better separation of concerns
Project structure that is easier to navigate and understand
Simplified prediction and training processes
Added API functionality for easier integration
Improved dependency management with Poetry

These changes lay the groundwork for easier collaboration, maintenance, and future improvements to the CATT project.

enhance code readability/maintainability by adding a modular approach |… add api support (0159a50)
