Dataprep is short for "Data Preparation"; it covers the collection and pre-processing stages of a conventional data mining pipeline. Upstream of Dataprep are multiple, possibly unstructured and/or heterogeneous, data sources. Downstream of Dataprep are data processing/mining modules that take structured data as input and produce visuals, insights, or intermediate datasets for further analysis.
Dataprep generally involves data cleaning, normalisation and joining; from a DB perspective, it resembles a series of ETL processes. As DMaaS (Data Mining as a Service) and AIaaS (Artificial Intelligence as a Service) become widely available and accurate enough for practical applications, the difficulty now lies in preparing a dataset that is suitable for those algorithms. Besides writing your own scripts (a minimal sketch of the cleaning/normalisation/joining steps follows the list below), the following may be useful pointers:
- Google Cloud Dataprep, released in 2017: https://cloud.google.com/dataprep/
- Tableau, in its recent "BI Trends 2018" report, implied (as the 11th trend) that it would enrich/extend Tableau with Dataprep capabilities: https://www.tableau.com/reports/business-intelligence-trends
- OpenRefine, a classic data cleaning tool, can address a good part of Dataprep problems: http://openrefine.org/
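
To make the cleaning, normalisation and joining steps mentioned above concrete, here is a minimal sketch using pandas. The file names and column names (customers.csv, orders.csv, customer_id, amount, signup_date) are hypothetical placeholders, not taken from any of the tools listed; a real pipeline would add more checks at each stage.

```python
import pandas as pd

# Collect: read two (possibly heterogeneous) sources into DataFrames.
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Clean: drop exact duplicates and rows missing the join key or key measure.
customers = customers.drop_duplicates().dropna(subset=["customer_id"])
orders = orders.dropna(subset=["customer_id", "amount"])

# Normalise: unify types so the columns are comparable across sources.
customers["signup_date"] = pd.to_datetime(customers["signup_date"], errors="coerce")
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")

# Join: produce one structured table for the downstream mining/AI module.
prepared = orders.merge(customers, on="customer_id", how="left")
prepared.to_csv("prepared_dataset.csv", index=False)
```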
The Quartz bad data guide is a well-developed guide for people who are starting to handle data sourcing and cleaning.