UCI download/process software repository

Open source Python repository for downloading, processing, folding and describing supervised machine learning datasets from UCI and other raw repositories

This GitHub repository is a set of scripts for downloading supervised machine learning datasets from the UCI Machine Learning Repository and processing them into a common format. It began as a fork of the Julia repository JackDunnNZ/uci-data, from which the configuration files were extracted. The UCI ML repository is a useful source of machine learning datasets for testing and benchmarking, but the format of its datasets is not consistent. Each new dataset therefore has to be read differently, which takes effort before it can be used.

The main goal of this repository is to process the datasets into a format that PyRidge can read, where each row of the final data is as follows:

attribute_1 attribute_2 ... attribute_n class

This makes it easy to switch out datasets in ML problems, which is great when automating things.
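As a minimal sketch of how a file in this format could be consumed, the snippet below splits each row into its attributes and its target. It assumes whitespace-separated values, numeric attributes, and the target in the last column, as in the row layout above. The function name `split_rows` and the sample values are hypothetical and are not part of this repository.

```python
from typing import Iterable, List, Tuple


def split_rows(lines: Iterable[str]) -> Tuple[List[List[float]], List[str]]:
    """Split rows of 'attribute_1 ... attribute_n class' into (X, y).

    Assumption: values are separated by whitespace and the last value
    on each row is the class (or regression target), kept as a string.
    """
    features, targets = [], []
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        features.append([float(v) for v in values[:-1]])
        targets.append(values[-1])
    return features, targets


# Hypothetical sample rows, for illustration only
sample = ["5.1 3.5 1.4 0.2 0", "6.2 2.9 4.3 1.3 1", ""]
X, y = split_rows(sample)
```

Because every dataset shares the same layout, the same reader can be reused when switching datasets; only the input file changes.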

Converting to common format

The datasets are not checked in to git in order to minimise the size of the repository and to avoid rehosting the data. As such, the script downloads any missing datasets directly from UCI as it runs.
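The "download only what is missing" idea can be sketched as follows. This is an illustration under stated assumptions, not the logic of download_data.py: the names `fetch_url` and `ensure_dataset` are hypothetical, and the real script may handle archives, retries, or the remove_older option differently.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_url(url: str) -> bytes:
    """Download the raw bytes behind a URL."""
    with urlopen(url) as response:
        return response.read()


def ensure_dataset(url: str, target: Path,
                   fetch: Callable[[str], bytes] = fetch_url) -> bool:
    """Download `url` to `target` only if `target` does not exist yet.

    Returns True if a download happened, False if the file was already there.
    """
    if target.exists():
        return False
    # Create the destination folder on first use
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(url))
    return True
```

Passing the downloader in as the `fetch` argument is a design choice made for this sketch: it lets the function be exercised without network access.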

Running the code

There are two ways of running the code. The easy but opaque way is to first run the install_requirements.sh script using bash:

bash install_requirements.sh

This installs the Python 3 requirements from requirements.txt. The packages required by this library are:

numpy
pandas
sklearn
rarfile
PyLaTeX

After that, run the main script:

bash script.sh

However, it is recommended to use a Python 3 virtual environment. The requirements listed above must be installed in that virtual environment. Then run the scripts in the main directory:

python download_data.py
python process_data.py
python fold_data.py
python describe_data.py

The data will be downloaded, processed, k-folded and described, in that order. Customizable parameters, such as folders to process and number of folds, are found in parameter_config.ini:

[DOWNLOAD]
config_folders = datafiles/regression,datafiles/classification
raw_folder = raw_data
remove_older = True

[PROCESS]
config_folders = datafiles/regression,datafiles/classification
processed_folder = processed_data
remove_older = True

[FOLD]
processed_folders = processed_data/regression,processed_data/classification
data_folder = data
remove_older = True
n_fold = 10

[DESCRIBE]
data_folders = data/regression,data/classification
description_folder = description
remove_older = True
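A file with this layout can be read with Python's standard `configparser` module. The sketch below parses the [FOLD] section shown above; the helper name `read_fold_settings` is hypothetical, and splitting folder lists on commas is an assumption based on the values shown, not a description of how the repository's scripts read the file.

```python
import configparser


def read_fold_settings(config_text: str) -> dict:
    """Parse the [FOLD] section of an INI-style configuration string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    fold = parser["FOLD"]
    return {
        # Assumption: folder lists are comma-separated
        "processed_folders": [
            folder.strip() for folder in fold["processed_folders"].split(",")
        ],
        "data_folder": fold["data_folder"],
        "remove_older": fold.getboolean("remove_older"),
        "n_fold": fold.getint("n_fold"),
    }


EXAMPLE = """
[FOLD]
processed_folders = processed_data/regression,processed_data/classification
data_folder = data
remove_older = True
n_fold = 10
"""

settings = read_fold_settings(EXAMPLE)
```

To read from disk instead of a string, `parser.read("parameter_config.ini")` could replace the `read_string` call.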

Citation policy

Perales-González, Carlos, (2020). UCI download-process, v1.3, GitHub repository, https://github.com/cperales/uci-download-process

@misc{UCI-download-process,
  author = {Perales-González, Carlos},
  title = {UCI download/process},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cperales/uci-download-process}},
  tag = {1.3}
}

