Dataprep

Dataprep is short for "Data Preparation". It covers the collection and pre-processing stages of a conventional data mining pipeline. Upstream of Dataprep are multiple data sources, often unstructured and/or heterogeneous. Downstream are the data processing and mining modules that take structured data as input and produce visuals, insights or intermediate datasets for further analysis.

Dataprep generally involves data cleaning, normalisation and joining. From a database perspective, it resembles a series of ETL (extract, transform, load) processes. As DMaaS (Data Mining as a Service) and AIaaS (Artificial Intelligence as a Service) become widely available and accurate enough for practical applications, the difficulty now lies in preparing a dataset that is suitable for those algorithms. Besides writing your own scripts, the following pointers may be useful:

The Quartz bad data guide is a well-developed guide for people who are starting to handle data sourcing and cleaning.
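
For those writing their own scripts, the cleaning, normalisation and joining steps described earlier can be sketched roughly as follows. This is a minimal illustration in plain Python, not a recommended implementation: the sample records, field names and helper functions (`clean_customers`, `parse_amount`, `join_orders`) are all hypothetical.

```python
# Hypothetical records from two heterogeneous sources (illustrative data only).
customers = [
    {"id": "1", "name": "  Alice ", "country": "uk"},
    {"id": "2", "name": "Bob", "country": "UK"},
    {"id": "2", "name": "Bob", "country": "UK"},   # exact duplicate
    {"id": "", "name": "Carol", "country": "us"},  # missing key
]
orders = [
    {"customer_id": 1, "amount": "10.50"},
    {"customer_id": 2, "amount": "n/a"},           # unparseable amount
    {"customer_id": 2, "amount": "4"},
]


def clean_customers(rows):
    """Cleaning and normalisation: drop rows without an id, trim names,
    upper-case country codes, cast ids to int and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["id"].strip():
            continue  # a row without a key cannot be joined later
        record = (int(row["id"]), row["name"].strip(), row["country"].strip().upper())
        if record in seen:
            continue
        seen.add(record)
        cleaned.append({"id": record[0], "name": record[1], "country": record[2]})
    return cleaned


def parse_amount(text):
    """Return the amount as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def join_orders(customer_rows, order_rows):
    """Joining: inner join on customer id, skipping unparseable amounts."""
    by_id = {c["id"]: c for c in customer_rows}
    joined = []
    for order in order_rows:
        customer = by_id.get(order["customer_id"])
        amount = parse_amount(order["amount"])
        if customer is None or amount is None:
            continue
        joined.append(
            {"name": customer["name"], "country": customer["country"], "amount": amount}
        )
    return joined


prepared = join_orders(clean_customers(customers), orders)
```

The resulting `prepared` list is the kind of structured input a downstream mining module would expect. In practice, silently dropping bad rows as this sketch does is a design choice worth revisiting, since logging or quarantining them is often safer.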