Project Restructuring and Modularity Improvements #3

Open
CreativeSelf0 wants to merge 1 commit into main

Project Restructuring and Modularity Improvements

Overview

This pull request restructures the CATT (Character-based Arabic Tashkeel Transformer) project to improve modularity, maintainability, and ease of use. It reorganizes the codebase, introduces a new API module, and standardizes the project layout.

Key Changes

1. Project Structure Reorganization

  • Moved core CATT functionality into a dedicated catt/ package
  • Separated data handling, models, and utilities into distinct submodules
  • Created a new api/ directory for API-related functionality
  • Added configs/, dataset/, docs/, models/, scripts/, and tests/ directories for better organization

2. API Implementation

  • Introduced a new api/ module with FastAPI integration
  • Created main.py, models.py, and catt_service.py for API functionality
  • Implemented a /tashkeel endpoint for diacritization requests
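
To illustrate how a client might call the new endpoint, here is a minimal sketch using only the standard library. The server address, the HTTP method, and the "text" field name are assumptions made for illustration; the actual request schema is whatever api/models.py defines.

```python
import json
import urllib.request

# Assumptions: the server address and the "text" field are placeholders,
# not taken from this pull request.
BASE_URL = "http://localhost:8000"


def build_tashkeel_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /tashkeel endpoint."""
    # ensure_ascii=False keeps the Arabic text readable in the payload.
    payload = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/tashkeel",
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def diacritize(text: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_tashkeel_request(text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending makes the payload easy to inspect without a running server.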

3. Modularity Improvements

  • Refactored encoder and decoder models into separate files
  • Created model_types.py for better model type management
  • Moved utility functions into a dedicated utils/ submodule
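
As a rough idea of what model type management could look like, the sketch below uses an enum to name the two architectures. The class name, member names, and the short codes "ed" and "eo" are hypothetical (the codes are borrowed from the old script names); the real model_types.py may be organized differently. Only the module paths come from the new layout.

```python
from enum import Enum


class ModelType(Enum):
    """Hypothetical enum; the real model_types.py may differ."""

    ENCODER_DECODER = "ed"
    ENCODER_ONLY = "eo"


# Module paths follow the directory tree proposed in this pull request.
MODEL_MODULES = {
    ModelType.ENCODER_DECODER: "catt.models.encoder_decoder",
    ModelType.ENCODER_ONLY: "catt.models.encoder_only",
}


def parse_model_type(name: str) -> ModelType:
    """Convert a user-supplied string such as "ed" into a ModelType."""
    try:
        return ModelType(name.strip().lower())
    except ValueError:
        valid = ", ".join(member.value for member in ModelType)
        raise ValueError(
            f"Unknown model type {name!r}; expected one of: {valid}"
        ) from None
```

Keeping the valid values in one enum means the scripts and the API can share a single validation path instead of comparing strings in several places.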

4. Configuration Management

  • Added configs/ directory with separate configuration files for Encoder-Decoder and Encoder-Only models
  • Introduced a Sample_config.yaml for easier customization
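
One way the per-model configuration files might be selected is sketched below. The file names come from this pull request; the function name, the "ed"/"eo" keys, and the override parameter are assumptions for illustration. Parsing the YAML itself is left out because it would need a third-party library.

```python
from pathlib import Path
from typing import Optional

CONFIG_DIR = Path("configs")

# File names as listed in this pull request; the "ed"/"eo" keys are assumed.
CONFIG_FILES = {
    "ed": "EncoderDecoder_config.yaml",
    "eo": "EncoderOnly_config.yaml",
}


def resolve_config_path(model_type: str, override: Optional[str] = None) -> Path:
    """Return the config file for a model type, or a user-supplied override.

    An override would typically be an edited copy of Sample_config.yaml.
    """
    if override is not None:
        return Path(override)
    if model_type not in CONFIG_FILES:
        raise KeyError(f"No default config for model type {model_type!r}")
    return CONFIG_DIR / CONFIG_FILES[model_type]
```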

5. Dependency Management

  • Added pyproject.toml and poetry.lock for better dependency management using Poetry

6. Documentation and Testing

  • Created a docs/ directory for future documentation
  • Added a tests/ directory for unit tests (to be implemented)

7. Simplified Prediction and Training Scripts

  • Consolidated predict_ed.py and predict_eo.py into a single predict_catt.py
  • Refactored train_catt.py for improved clarity and consistency
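
A consolidated prediction script needs some way to choose between the two model variants. The sketch below shows one possible command-line interface; every flag name here is hypothetical and may not match the actual predict_catt.py.

```python
import argparse

# Hypothetical flag names; the real predict_catt.py interface may differ.
MODEL_TYPES = {"ed": "encoder-decoder", "eo": "encoder-only"}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser covering what predict_ed.py and predict_eo.py did."""
    parser = argparse.ArgumentParser(
        prog="predict_catt.py",
        description="Diacritize Arabic text with either CATT model variant.",
    )
    parser.add_argument(
        "--model-type",
        choices=sorted(MODEL_TYPES),
        required=True,
        help="ed = encoder-decoder, eo = encoder-only",
    )
    parser.add_argument("--checkpoint", required=True, help="Path to a .pt checkpoint")
    parser.add_argument("--input", required=True, help="Path to the input text file")
    return parser


def describe_run(args: argparse.Namespace) -> str:
    """Summarize the selected run; a real script would load the model here."""
    variant = MODEL_TYPES[args.model_type]
    return f"Running {variant} model from {args.checkpoint} on {args.input}"
```

With this interface, an invocation could pass --model-type eo together with models/best_eo_mlm_ns_epoch_193.pt as the checkpoint.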

Before and After Structure Comparison

Before

└── catt
    ├── benchmarking
    │   ├── all_models_CATT_data
    │   ├── all_models_WikiNews_data
    │   ├── eo_ed_mlm_ns
    │   │   ├── catt_data
    │   │   └── wikinews_data
    │   ├── run_compute_der_all_CATT_benchmark.sh
    │   ├── run_compute_der_all_WikiNews_benchmark.sh
    │   ├── run_compute_der_eo_ed_mlm_ns_long_training.sh
    │   └── run_compute_der_eo_ed_mlm_ns_short_training.sh
    ├── bw2ar.py
    ├── compute_der.py
    ├── ed_pl.py
    ├── ed.py
    ├── eo_pl.py
    ├── eo.py
    ├── LICENSE
    ├── predict_ed.py
    ├── predict_eo.py
    ├── README.md
    ├── tashkeel_dataset.py
    ├── tashkeel_tokenizer.py
    ├── train_catt.py
    ├── transformer.py
    ├── utils.py
    └── xer.py

After

├── api
│   ├── catt_service.py
│   ├── __init__.py
│   ├── main.py
│   └── models.py
├── benchmarking
│   ├── all_models_CATT_data
│   ├── all_models_WikiNews_data
│   ├── eo_ed_mlm_ns
│   │   ├── catt_data
│   │   └── wikinews_data
│   ├── run_compute_der_all_CATT_benchmark.sh
│   ├── run_compute_der_all_WikiNews_benchmark.sh
│   ├── run_compute_der_eo_ed_mlm_ns_long_training.sh
│   └── run_compute_der_eo_ed_mlm_ns_short_training.sh
├── catt
│   ├── data
│   │   ├── __init__.py
│   │   ├── tashkeel_dataset.py
│   │   └── tashkeel_tokenizer.py
│   ├── __init__.py
│   ├── models
│   │   ├── encoder_decoder.py
│   │   ├── encoder_only.py
│   │   ├── __init__.py
│   │   ├── model_types.py
│   │   └── transformer.py
│   └── utils
│       ├── arabic_utils.py
│       ├── bw2ar.py
│       ├── __init__.py
│       └── xer.py
├── compute_der.py
├── configs
│   ├── EncoderDecoder_config.yaml
│   ├── EncoderOnly_config.yaml
│   └── Sample_config.yaml
├── dataset
│   ├── test
│   ├── train
│   └── val
├── docs
├── LICENSE
├── models
│   ├── best_ed_mlm_ns_epoch_178.pt
│   └── best_eo_mlm_ns_epoch_193.pt
├── poetry.lock
├── predict_catt.py
├── pyproject.toml
├── README.md
├── scripts
├── tests
└── train_catt.py
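
Code that imported the old top-level modules will need new import paths. The helper below is a small sketch of that migration, covering only the modules whose file names are unchanged between the two trees above. The function name is hypothetical, and it handles only simple single-module import lines.

```python
import re

# Old flat module name -> new dotted path, read off the two trees above.
# Renamed files (for example ed.py or utils.py) are deliberately left out
# because their new names are not a one-to-one match.
MOVED_MODULES = {
    "tashkeel_dataset": "catt.data.tashkeel_dataset",
    "tashkeel_tokenizer": "catt.data.tashkeel_tokenizer",
    "transformer": "catt.models.transformer",
    "bw2ar": "catt.utils.bw2ar",
    "xer": "catt.utils.xer",
}

_IMPORT_RE = re.compile(r"^(\s*)(from|import)\s+([A-Za-z_]\w*)\b(.*)$")


def rewrite_import(line: str) -> str:
    """Rewrite one import line (without trailing newline) for a moved module."""
    match = _IMPORT_RE.match(line)
    if not match:
        return line
    indent, keyword, module, rest = match.groups()
    new_module = MOVED_MODULES.get(module)
    if new_module is None:
        return line
    if keyword == "from":
        return f"{indent}from {new_module}{rest}"
    # Leave multi-module or dotted `import a, b` / `import a.b` lines untouched.
    if rest.lstrip().startswith((",", ".")):
        return line
    # `import xer` becomes `from catt.utils import xer`, keeping the short name.
    package, _, name = new_module.rpartition(".")
    return f"{indent}from {package} import {name}{rest}"
```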

Benefits

  • Improved code organization and maintainability
  • Better separation of concerns
  • Project structure that is easier to navigate and understand
  • Simplified prediction and training processes
  • Added API functionality for easier integration
  • Improved dependency management with Poetry

These changes lay the groundwork for easier collaboration, maintenance, and future improvements to the CATT project.
