A great deal of work in data mining projects is spent on data munging. Common data problems that can cause models and predictions to be inaccurate are listed below, along with their symptoms and potential solutions.
- Label, Segment, Featurize: a cross domain framework for prediction engineering
- Introduction to Data Mining - chapter 2
- Introduction to Data Mining - chapter 2 notes
- DataPreperation Library src
- view notebook example
- view notebook html

*Not all of the functions have been rigorously tested for all use cases and may have bugs (email [email protected] if you find one).
Problem | Symptoms | Solution |
---|---|---|
Incomplete data | Useless models and meaningless results. | Get more data. Get better data. Design of Experiment approaches. |
Biased Data | Biased models and biased, inaccurate results. | Get more data. Get better data. Design of Experiment approaches. |
Wide Data | Long, intolerable compute times. Meaningless results due to curse of dimensionality. | Feature selection. Feature extraction. L1 Regularization. |
Sparse data✝ | Long, intolerable compute times. Meaningless results due to curse of dimensionality. | Feature extraction. Appropriate data representation, e.g. COO, CSR. Appropriate algorithm selection, e.g. factorization machines. |
Imbalanced Target Variable | Single class model predictions. Biased model predictions. | Proportional Oversampling. Inverse prior probability weighting. Mixture models, e.g. zero-inflated regression methods. |
Outliers | Biased models and biased, inaccurate results. Unstable parameter estimates and rule generation. Unreliable out-of-domain predictions. | Discretization. Winsorizing. Appropriate algorithm selection, e.g. Huber loss functions. |
Missing Values | Information loss. Biased models and biased, inaccurate results. | Imputation. Discretization. Appropriate algorithm selection, e.g. Tree-based models, naive Bayes classification. |
Character Variables✝ | Information loss. Biased models and biased, inaccurate results. Computational errors. | Encoding. Appropriate algorithm selection, e.g. Tree-based models, naive Bayes classification. |
High Cardinality Categorical Variables | Over-fit models and inaccurate results. Long, intolerable compute times. Unreliable out-of-domain predictions. | Target Encoding (categorical) or variants, e.g. perturbed rate-by-level or Weight of Evidence. Target Encoding (numeric) or variants, e.g. average-, median-, or BLUP-by-level. Discretization. Embedding approaches, e.g. entity embedding neural networks, factorization machines. |
Disparate Variable Scales | Unreliable parameter estimates, biased models, and biased, inaccurate results. | Standardization. Appropriate algorithm selection, e.g. Tree-based models. |
Strong Multicollinearity (correlation) | Unstable parameter estimates, unstable rule generation, and unstable predictions. | Feature selection. Feature extraction. L2 Regularization. |
Dirty Data | Information loss. Biased models and biased, inaccurate results. Long, intolerable compute times. Unstable parameter estimates and rule generation. Unreliable out-of-domain predictions. | Combination of solution strategies. |
✝ In some cases this is not a problem at all. Some algorithms and software packages handle this automatically and elegantly ... some don't.
Incomplete Data
When a data set simply does not contain information about the phenomenon of interest. There is no analytical remedy for incomplete data. You must collect more and better data, and probably dispose of the original incomplete set.
Biased Data
When a data set contains information about the phenomenon of interest, but that information is consistently and systematically wrong. There is no analytical remedy for biased data. You must collect more and better data, and probably dispose of the original biased set.
Feature selection - view notebook
Finding the best subset of original variables from a data set, typically by measuring each original variable's relationship to the target variable and keeping the subset of original variables with the strongest relationships to the target. Feature selection decreases the impact of the curse of dimensionality and usually increases the signal-to-noise ratio in a data set, resulting in faster training times and more accurate models. Because feature selection uses original variables from a data set, its results are usually more interpretable than those of feature extraction techniques.
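As a minimal sketch (using scikit-learn rather than the library above; the data and column names are toy examples, not the notebook's), a filter-style selection that scores each original variable against the target and keeps the strongest few:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 20 original variables, only a few of which are informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'x{i}' for i in range(20)])

# Score each original variable against the target and keep the 5 strongest
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)
selected_cols = X.columns[selector.get_support()]
print(list(selected_cols))  # a subset of the original, interpretable variables
```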
Feature extraction - view notebook
Combining the original variables in a data set into a new, smaller set of more representative variables, very often using unsupervised learning methods. Feature extraction may also be referred to as 'dimension reduction'. Feature extraction is the unsupervised analog of feature selection, i.e. it tends to decrease the impact of the curse of dimensionality and usually increases the signal-to-noise ratio in a data set. Feature extraction techniques combine the original variables in the data set in complex ways, usually creating uninterpretable new variables.
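A minimal sketch of unsupervised extraction with principal component analysis via scikit-learn (toy data; not necessarily the linked notebook's approach):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy wide data set: 50 original variables
X, _ = make_classification(n_samples=500, n_features=50, random_state=0)

# Unsupervised extraction: combine 50 original variables into 10 new ones
pca = PCA(n_components=10, random_state=0)
X_extracted = pca.fit_transform(StandardScaler().fit_transform(X))

print(X_extracted.shape)                    # (500, 10)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```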
Oversampling - view notebook
Taking all the rows containing rare events in a data set and increasing them in proportion to the number of rows not containing rare events. 'Undersampling' is the opposite and equally valid approach, where the rows not containing rare events are decreased in proportion to the number of rows containing rare events. With rare events, models will often find that the most accurate possible outcome is to predict that the rare event never happens. Both oversampling and undersampling artificially inflate the relative frequency of rare events, which helps models learn to predict them.
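A minimal pandas sketch of both approaches, assuming a toy data frame with an illustrative binary `target` column:

```python
import pandas as pd

# Toy imbalanced data: roughly 1 rare event per 20 rows
df = pd.DataFrame({'x': range(200),
                   'target': [1 if i % 20 == 0 else 0 for i in range(200)]})

rare = df[df['target'] == 1]
common = df[df['target'] == 0]

# Oversample: repeat the rare rows (with replacement) until the classes match
rare_upsampled = rare.sample(n=len(common), replace=True, random_state=0)
oversampled = pd.concat([common, rare_upsampled]).sample(frac=1, random_state=0)

# Undersample: the opposite approach -- draw fewer common rows instead
common_downsampled = common.sample(n=len(rare), random_state=0)
undersampled = pd.concat([common_downsampled, rare]).sample(frac=1, random_state=0)

print(oversampled['target'].value_counts())
print(undersampled['target'].value_counts())
```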
Encoding - view notebook
Changing the representation of a variable. Very often in data mining applications, categorical character variables are encoded to numeric variables so they can be used with algorithms that cannot accept character or categorical inputs.
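A minimal pandas sketch of two common encodings, using a toy data frame with an illustrative character column:

```python
import pandas as pd

# Toy data with a character/categorical input
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red'],
                   'y': [1.0, 0.5, 0.7, 0.4, 1.2]})

# One-hot (dummy) encoding: one numeric 0/1 column per level
dummies = pd.get_dummies(df['color'], prefix='color')
encoded = pd.concat([df.drop(columns='color'), dummies], axis=1)
print(encoded)

# Label encoding: a single integer code per level (fine for tree-based models)
df['color_code'] = df['color'].astype('category').cat.codes
```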
Target Encoding (Categorical) - view notebook
An encoding method for changing categorical variables into numeric variables when the target is a binary categorical variable. Particularly helpful when a categorical variable has many levels.
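A minimal rate-by-level sketch in pandas (toy data and illustrative column names; the linked notebook may differ, e.g. by perturbing the rates):

```python
import pandas as pd

# Toy data: a categorical input and a binary target
df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA', 'LA', 'SF', 'SF', 'NY'],
                   'target': [1, 0, 1, 1, 0, 0, 0, 1]})

# Rate-by-level: replace each level with the mean of the binary target,
# i.e. the event rate observed for that level in the training data
rate_by_level = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(rate_by_level)

# New or unseen levels get the overall event rate as a fallback
overall_rate = df['target'].mean()
new_data = pd.Series(['LA', 'CHI'])
encoded_new = new_data.map(rate_by_level).fillna(overall_rate)
print(df, encoded_new, sep='\n')
```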
Target Encoding (Numeric) - view notebook
An encoding method for changing categorical variables into numeric variables when the target is a numeric variable. Particularly helpful when a categorical variable has many levels.
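A minimal mean- and median-by-level sketch in pandas (toy data and illustrative column names; the linked notebook may differ):

```python
import pandas as pd

# Toy data: a categorical input and a numeric target
df = pd.DataFrame({'make': ['ford', 'ford', 'bmw', 'bmw', 'kia', 'kia'],
                   'price': [22000., 25000., 48000., 52000., 17000., 19000.]})

# Mean-by-level: replace each level with the average target for that level
mean_by_level = df.groupby('make')['price'].mean()
df['make_mean_encoded'] = df['make'].map(mean_by_level)

# Median-by-level is a common variant that is more robust to outliers
median_by_level = df.groupby('make')['price'].median()
df['make_median_encoded'] = df['make'].map(median_by_level)
print(df)
```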
Discretization - view notebook
Changing a numeric variable into an ordinal or nominal categorical variable based on value ranges of the original numeric variable. Discretization can also be referred to as 'binning'. Discretization has many benefits:
- When restricted to using linear models, binning helps introduce nonlinearity because each bin in a variable gets its own parameter.
- Binning smooths complex signals in training data, often decreasing overfitting.
- Binning deals with missing values elegantly by assigning them to their own bin.
- Binning handles outliers elegantly by assigning all outlying values, in training and new data, to the 'high' or 'low' bin. (Outliers damage predictive models that seek to minimize squared error because they create disproportionately large, i.e. squared, residuals, which optimization routines will try to minimize at the expense of minimizing the error for more reliable data points.)
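A minimal pandas sketch of the last two points above, assuming quartile-based bins learned on toy training data and then reused on new data:

```python
import numpy as np
import pandas as pd

# Toy numeric variable with missing values and an outlier
x = pd.Series([1.2, 3.4, 2.2, np.nan, 150.0, 4.1, 2.9, np.nan, 3.3])

# Learn quartile-based bin edges on the training data
binned, edges = pd.qcut(x, q=4, labels=['low', 'mid_low', 'mid_high', 'high'],
                        retbins=True)

# Missing values get their own bin instead of being dropped or imputed
binned = binned.cat.add_categories('missing').fillna('missing')

# Reuse the learned edges on new data; open-ended outer edges send any
# outlying new values into the 'low' or 'high' bin
new_x = pd.Series([-50.0, 2.5, 999.0])
open_edges = np.concatenate([[-np.inf], edges[1:-1], [np.inf]])
new_binned = pd.cut(new_x, bins=open_edges,
                    labels=['low', 'mid_low', 'mid_high', 'high'])
print(binned.tolist(), new_binned.tolist(), sep='\n')
```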
Winsorizing - view notebook
Removing outliers in a variable's values and replacing them with more central values of that variable. (Outliers damage predictive models that seek to minimize squared error because they create disproportionately large, i.e. squared, residuals, which optimization routines will try to minimize at the expense of minimizing the error for more reliable data points.)
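A minimal pandas sketch, winsorizing a toy variable at the 5th and 95th percentiles (the cutoffs are an illustrative choice):

```python
import pandas as pd

# Toy variable with a couple of extreme outliers
x = pd.Series([2.1, 2.5, 3.0, 2.8, 95.0, 2.9, 3.2, -40.0, 2.7])

# Winsorize at the 5th and 95th percentiles: outlying values are replaced
# with the nearest of these more central cutoff values
lower, upper = x.quantile([0.05, 0.95])
x_winsorized = x.clip(lower=lower, upper=upper)

print(pd.concat([x.rename('raw'), x_winsorized.rename('winsorized')], axis=1))
```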
Imputation - view notebook
Replacing missing data with an appropriate, non-missing value. In predictive modeling, imputation should be used with care: missingness is often predictive, and imputation changes the distribution of the input variable learned by the model.
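A minimal pandas sketch combining a missingness indicator with median imputation (toy data, illustrative column name):

```python
import numpy as np
import pandas as pd

# Toy input variable with missing values
df = pd.DataFrame({'income': [52000., np.nan, 61000., 48000., np.nan, 75000.]})

# Keep a missingness indicator first -- missingness is often predictive
df['income_missing'] = df['income'].isna().astype(int)

# Impute with the median; note this still changes the distribution of the
# input variable that the model will learn from
df['income_imputed'] = df['income'].fillna(df['income'].median())
print(df)
```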
Standardization - view notebook
Enforcing similar scales on a set of variables. For distance-based algorithms (e.g. k-means) and algorithms that use gradient-based methods to estimate model parameters (e.g. regression, artificial neural networks), variables must be on similar scales, or variables with large values will incorrectly dominate the training process.
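A minimal sketch with scikit-learn's StandardScaler on toy data (illustrative column names), alongside the equivalent by-hand calculation:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with two inputs on very different scales
df = pd.DataFrame({'age': [23., 45., 31., 58., 39.],
                   'income': [48000., 150000., 62000., 91000., 73000.]})

# Standardize to zero mean and unit variance so neither variable dominates
# distance calculations or gradient-based training
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(standardized.round(2))

# Equivalent by hand: (x - mean) / standard deviation (population std, ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)
print(manual.round(2))
```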