Add option for training fold "blocks" to avoid over-fitting #6
I've implemented this here: https://github.com/dtpc/landshark/tree/feature/6-fold-blocks. It does not account for the distribution of training points over the area, so it will likely result in folds of unequal size. Another approach I think would be useful is grouping based on some other training point property (e.g. https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data). Implementing this would require some more structural changes to the code, though. Currently the target HDF5 file only contains …
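For reference, a minimal sketch of what grouped splitting could look like using scikit-learn's GroupKFold, as in the linked docs; the data and the per-point "group" property here are made up for illustration and are not part of landshark:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # covariates (made up)
y = rng.normal(size=100)                 # targets (made up)
groups = rng.integers(0, 10, size=100)   # hypothetical per-point property

# GroupKFold keeps all points with the same group label on the same
# side of each train/test split.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])
```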
Oh yeah? Do you mean that when we select data randomly for our train/test folds, we can get an underestimate of the true error because our test points are often close to the training points? Or, by doing this, are we testing whether our model generalizes well away from the training data?
The latter, although I think "away from the training data" may not be that far in some cases. Typically the training data is heavily biased: sparse overall but often locally dense. I think this can lead to learning very localised models, especially if the targets are highly correlated spatially. In the extreme case, if neighbouring pixels (and target values) are more or less identical, then the model could potentially just learn the input (this is even more of an issue if we have training points located within the same pixel). This would be an accurate model, but probably not a very useful one to generate a predictive map from. So I think there is a need for different ways of splitting train/test data to encourage a more general model during model selection.
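As a toy illustration of that leakage (my construction, not from this thread): a 1-nearest-neighbour model trained directly on point coordinates against a spatially smooth target scores far better under random folds than under block-wise folds, purely because random test points sit right next to training points:

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(500, 2))            # point locations
y = np.sin(xy[:, 0] / 10) + np.sin(xy[:, 1] / 10)  # smooth spatial target
y += 0.05 * rng.normal(size=500)

model = KNeighborsRegressor(n_neighbors=1)

# Random folds: test points have near-identical training neighbours.
random_r2 = cross_val_score(model, xy, y,
                            cv=KFold(5, shuffle=True, random_state=0))

# Block folds: 25x25 blocks, whole blocks held out together.
blocks = (xy[:, 0] // 25).astype(int) * 4 + (xy[:, 1] // 25).astype(int)
blocked_r2 = cross_val_score(model, xy, y, groups=blocks,
                             cv=GroupKFold(n_splits=5))

print(random_r2.mean(), blocked_r2.mean())  # blocked score is typically lower
```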
Yeah, agreed - a few more splitting methods would be useful.
Yes, this is definitely not intended as a solution for covariate shift. I guess it just provides more flexibility around model selection/evaluation.
Models can overfit when training samples are spatially adjacent.
A way to mitigate this is to select a pixel block size when extracting training folds, so that all pixels in the same local block are assigned to the same fold.
During cross-validation/model selection, the model is then encouraged to predict well outside the areas local to the training data.
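A minimal sketch of that block assignment, assuming integer pixel coordinates are available for each training point; the function name and signature are hypothetical, not landshark's actual interface:

```python
import numpy as np

def block_folds(px, py, block_px, n_folds, seed=0):
    """Assign folds so that points in the same block_px x block_px
    pixel block always land in the same fold."""
    # Points sharing a block share the (bx, by) block coordinates.
    bx = (px // block_px).astype(np.int64)
    by = (py // block_px).astype(np.int64)
    block_id = (bx << 32) ^ by  # one id per block
    uniq, inv = np.unique(block_id, return_inverse=True)
    # Assign each *block* (not each point) a random fold. As noted in
    # the comments above, this ignores the point distribution, so the
    # folds can be unequal in size.
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_folds, size=uniq.size)[inv]

px = np.array([3, 4, 120, 121])
py = np.array([7, 8, 40, 41])
print(block_folds(px, py, block_px=100, n_folds=5))
# points 0 and 1 share a block, so share a fold; likewise points 2 and 3
```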