Add option for training fold "blocks" to avoid over-fitting #6
I've implemented this here: https://github.com/dtpc/landshark/tree/feature/6-fold-blocks. It does not account for the distribution of training points over the area, so it will likely result in folds of unequal size. Another approach I think would be useful is grouping based on some other training point property (e.g. https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data). Implementing this would require some more structural changes to the code, though. Currently the target HDF5 file only contains …
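For reference, a minimal sketch of what grouped splitting could look like using scikit-learn's GroupKFold, as in the linked docs; the data and the per-point "group" property here are made up for illustration and are not part of landshark:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # covariates (made up)
y = rng.normal(size=100)                 # targets (made up)
groups = rng.integers(0, 10, size=100)   # hypothetical per-point property

# GroupKFold keeps all points with the same group label on the same
# side of each train/test split.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])
```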
Oh yeah? Do you mean that when we select data randomly for our train/test folds, we can get an underestimate of the true error because our test points are often close to the training points? Or, by doing this, are we testing whether our model generalizes well away from the training data?
The latter, although I think "away from the training data" may not be that far in some cases. Typically the training data is heavily biased: sparse overall but often locally dense. I think this can lead to learning very localised models, especially if the targets are highly correlated spatially. In the extreme case, if neighbouring pixels (and target values) are more or less identical, then the model could potentially just learn the input (this is even more of an issue if we have training points located within the same pixel). This would be an accurate model, but probably not a very useful one to generate a predictive map from. So I think there is a need for different ways of splitting train/test data to encourage a more general model during model selection.
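As a toy illustration of that leakage (my construction, not from this thread): a 1-nearest-neighbour model trained directly on point coordinates against a spatially smooth target scores far better under random folds than under block-wise folds, purely because random test points sit right next to training points:

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(500, 2))            # point locations
y = np.sin(xy[:, 0] / 10) + np.sin(xy[:, 1] / 10)  # smooth spatial target
y += 0.05 * rng.normal(size=500)

model = KNeighborsRegressor(n_neighbors=1)

# Random folds: test points have near-identical training neighbours.
random_r2 = cross_val_score(model, xy, y,
                            cv=KFold(5, shuffle=True, random_state=0))

# Block folds: 25x25 blocks, whole blocks held out together.
blocks = (xy[:, 0] // 25).astype(int) * 4 + (xy[:, 1] // 25).astype(int)
blocked_r2 = cross_val_score(model, xy, y, groups=blocks,
                             cv=GroupKFold(n_splits=5))

print(random_r2.mean(), blocked_r2.mean())  # blocked score is typically lower
```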
Yeah, agreed - a few more splitting methods would be useful.
Yes, this is definitely not intended as a solution for covariate shift. I guess it just provides more flexibility around model selection/evaluation.
Models can overfit when training samples are spatially adjacent.
A way to mitigate this is to select a pixel block size when extracting training folds, so that all pixels in the same local block are assigned to the same fold.
During cross-validation/model selection, the model is then encouraged to predict well outside the areas local to the training data.
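A minimal sketch of that block assignment, assuming integer pixel coordinates are available for each training point; the function name and signature are hypothetical, not landshark's actual interface:

```python
import numpy as np

def block_folds(px, py, block_px, n_folds, seed=0):
    """Assign folds so that points in the same block_px x block_px
    pixel block always land in the same fold."""
    # Points sharing a block share the (bx, by) block coordinates.
    bx = (px // block_px).astype(np.int64)
    by = (py // block_px).astype(np.int64)
    block_id = (bx << 32) ^ by  # one id per block
    uniq, inv = np.unique(block_id, return_inverse=True)
    # Assign each *block* (not each point) a random fold. As noted in
    # the comments above, this ignores the point distribution, so the
    # folds can be unequal in size.
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_folds, size=uniq.size)[inv]

px = np.array([3, 4, 120, 121])
py = np.array([7, 8, 40, 41])
print(block_folds(px, py, block_px=100, n_folds=5))
# points 0 and 1 share a block, so share a fold; likewise points 2 and 3
```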