Oversampling and undersampling #661

Closed
ablaom opened this issue Sep 21, 2020 · 8 comments
Labels: design discussion

Comments

@ablaom
Member

ablaom commented Sep 21, 2020

https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#over-sampling

edit (July 2023): An updated version of the POC below appears later in this thread.

This is just to kick off a discussion. I see oversampling/undersampling as transformers plus model wrappers. Here's a rough POC for this:

using MLJ, TableOperations, Tables
import MLJBase, DataFrames


## A QUICK AND DIRTY OVERSAMPLER FOR ILLUSTRATION

mutable struct NaiveOversampler <: Static
    feature::Symbol
    minority_class
    multiplier::Int
end

function MLJBase.transform(s::NaiveOversampler, verbosity, data)
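    # gather the minority-class rows and append (multiplier - 1) extra copies of them: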
    mask(row) = Tables.getcolumn(row, s.feature) == s.minority_class
    Xminority_rows = data |> TableOperations.filter(mask) |> collect
    extra_rows = vcat((Xminority_rows for i in 1:(s.multiplier - 1))...)
    vcat(collect(Tables.rows(data)), extra_rows) |>
        Tables.materializer(data)
end

# demonstration:
data = (x1=1:4, x2=5:8, target=coerce([true, false, true, true], Multiclass))
julia> data |> pretty
┌───────┬───────┬───────────────────────────────┐
│ x1    │ x2    │ target                        │
│ Int64 │ Int64 │ CategoricalValue{Bool,UInt32} │
│ Count │ Count │ Multiclass{2}                 │
├───────┼───────┼───────────────────────────────┤
│ 1     │ 5     │ 1                             │
│ 2     │ 6     │ 0                             │
│ 3     │ 7     │ 1                             │
│ 4     │ 8     │ 1                             │
└───────┴───────┴───────────────────────────────┘

naive = NaiveOversampler(:target, false, 3)
mach = machine(naive) # static transformers have no training arguments
julia> transform(mach, data) |> pretty
┌───────┬───────┬───────────────────────────────┐
│ x1    │ x2    │ target                        │
│ Int64 │ Int64 │ CategoricalValue{Bool,UInt32} │
│ Count │ Count │ Multiclass{2}                 │
├───────┼───────┼───────────────────────────────┤
│ 1     │ 5     │ 1                             │
│ 2     │ 6     │ 0                             │
│ 3     │ 7     │ 1                             │
│ 4     │ 8     │ 1                             │
│ 2     │ 6     │ 0                             │
│ 2     │ 6     │ 0                             │
└───────┴───────┴───────────────────────────────┘


## HELPERS TO ADJOIN THE TARGET TO, AND SPLIT IT OFF FROM, A TABLE OF FEATURES

function adjoin_target(y, X)
    X1 = Tables.columntable(X)
    return merge(X1, (target=y,)) |> Tables.materializer(X)
end
split(data) = unpack(data, ==(:target))

# demonstration:
y, X = split(data)
@assert adjoin_target(y, X) == data


## COMPOSITE FOR WRAPPING A CLASSIFIER WITH OVERSAMPLING

# models for the learning network:
naive = NaiveOversampler(:target, false, 2)
tree = (@load DecisionTreeClassifier pkg=DecisionTree)()

# the learning network:
X = source() 
y = source()
data = @node adjoin_target(y, X)
mach1 = machine(naive)
data_over = transform(mach1, data)
yX_over = @node split(data_over)
y_over = @node first(yX_over)
X_over = @node last(yX_over)
mach2 = machine(tree, X_over, y_over)
yhat = predict(mach2, X) # *not* `predict(mach2, X_over)`

# the learning network machine:
mach = machine(Probabilistic(), X, y; predict=yhat)

# exporting the network as a new composite type:
@from_network mach begin
    mutable struct OversampledModel
        over_sampler=naive
        classifier=tree
    end
end

# demonstration:
X, y = make_moons(10)
train, test = partition(eachindex(y), 0.6)
forest = (@load RandomForestClassifier pkg=DecisionTree)()
model = OversampledModel(over_sampler=NaiveOversampler(:target, 0, 2),
                         classifier=forest)
mach = machine(model, X, y)
fit!(mach, rows=train)
predict(mach, rows=test)

cc @DilumAluthge

ablaom added the design discussion label on Sep 21, 2020
@ablaom
Member Author

ablaom commented Sep 21, 2020

One drawback is that model.over_sampler.feature is exposed to the user but shouldn't be altered (it should always be :target).
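
One way around this, sketched below (not part of the POC above, and the type name is made up for illustration): drop the feature field and hard-wire the :target column name inside transform, so there is nothing for the user to alter. This reuses the same imports as the POC.

# hypothetical variant of the oversampler with the column name hard-wired:
mutable struct FixedTargetOversampler <: Static
    minority_class
    multiplier::Int
end

function MLJBase.transform(s::FixedTargetOversampler, verbosity, data)
    # the target column name is fixed here instead of being an exposed field:
    mask(row) = Tables.getcolumn(row, :target) == s.minority_class
    minority_rows = data |> TableOperations.filter(mask) |> collect
    extra_rows = vcat((minority_rows for i in 1:(s.multiplier - 1))...)
    vcat(collect(Tables.rows(data)), extra_rows) |>
        Tables.materializer(data)
end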

@Moelf

Moelf commented Aug 3, 2021

Any movement on this? This sounds like a data-preparation utility that could be provided by a third-party ML utility package.

@DilumAluthge
Member

DilumAluthge commented Aug 3, 2021

We have some functionality in ClassImbalance.jl, but there is not yet an MLJ interface for that package. Help is welcome.

@Moelf

Moelf commented Aug 4, 2021

bcbi/ClassImbalance.jl#85

That issue is a show stopper.

@DilumAluthge
Member

Yeah, the package needs to be updated and modernized.

@rikhuijzer
Member

I've implemented SMOTE in Resample.jl. It has a very basic API, but it is built with speed in mind and uses the Tables interface.

@ablaom
Member Author

ablaom commented Jul 25, 2023

It's been a while since I posted the above POC. Here's an updated version, based on more recent versions of the packages, with some other mild changes. You'll need MLJBase >= 0.21.12 and MLJDecisionTreeInterface in your environment.

using MLJ, Tables
import MLJBase, StatsBase


## A QUICK AND DIRTY OVERSAMPLER FOR ILLUSTRATION

mutable struct NaiveOversampler <: Static
    ratio::Float64
end
NaiveOversampler(; ratio=1.0) = NaiveOversampler(ratio)

function MLJBase.transform(oversampler::NaiveOversampler, verbosity, X, y)
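    # tally class frequencies and identify the minority and dominant classes: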
    d = StatsBase.countmap(y)
    counts = sort(collect(d), by=pair->last(pair))
    minority_class = first(counts) |> first
    dominant_class = last(counts) |> first
    nextras = max(
        0,
        round(Int, oversampler.ratio*d[dominant_class] - d[minority_class]),
    )
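    # sample that many minority-class rows, with replacement, and append them to the originals: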
    all_indices = eachindex(y)
    minority_indices = all_indices[y .== minority_class]
    extra_indices = rand(minority_indices, nextras)
    over_indices = vcat(all_indices, extra_indices)
    Xover = Tables.subset(X, over_indices) |> Tables.materializer(X)
    yover = y[over_indices]
    return Xover, yover
end


# demonstration:
X = (x1=1:4, x2=5:8)
y = coerce([true, false, true, true], Multiclass)
StatsBase.countmap(y)
# Dict{CategoricalArrays.CategoricalValue{Bool, UInt32}, Int64} with 2 entries:
#   false => 1
#   true  => 3

naive = NaiveOversampler()
mach = machine(naive) # static transformers have no training arguments
Xover, yover = transform(mach, X, y)
StatsBase.countmap(yover)
# Dict{CategoricalArrays.CategoricalValue{Bool, UInt32}, Int64} with 2 entries:
#   false => 3
#   true  => 3

## COMPOSITE FOR WRAPPING A CLASSIFIER WITH OVERSAMPLING

# default component models for the wrapper:
naive = NaiveOversampler()
dummy = ConstantClassifier()

# we restrict wrapping to `Probabilistic` models and so use
# `ProbabilisticNetworkComposite` for the "exported" learning network type:
struct BalancedModel <: ProbabilisticNetworkComposite
    model::Probabilistic
    balancer  # oversampler or undersampler
end
BalancedModel(; model=dummy, balancer=naive) =
    BalancedModel(model, balancer)
BalancedModel(model; kwargs...) = BalancedModel(; model, kwargs...)

function MLJBase.prefit(over_sampled_model::BalancedModel, verbosity, _X, _y)

    # the learning network:
    X = source(_X)
    y = source(_y)
    mach1 = machine(:balancer) # `Static`, so no training arguments here
    data =  transform(mach1, X, y)
    # `first` and `last` are overloaded for nodes, so we can do:
    X_over = first(data)
    y_over = last(data)
    # we use the oversampled data for training:
    mach2 = machine(:model, X_over, y_over)
    # but consume new production data from the source:
    yhat = predict(mach2, X)

    # return the learning network interface:
    return (; predict=yhat)

end

## DEMONSTRATION

# synthesize some imbalanced data:
Xraw, yraw = make_moons(1000);
for_deletion = eachindex(yraw)[yraw .== 0][1:400]
to_keep = setdiff(eachindex(yraw), for_deletion)
X = Tables.rowtable(Xraw)[to_keep]
y = coerce(yraw[to_keep], OrderedFactor)

train, test = partition(eachindex(y), 0.6)
model = (@load DecisionTreeClassifier pkg=DecisionTree)()
balanced_model = BalancedModel(model)
# BalancedModel(
#   model = DecisionTreeClassifier(
#         max_depth = -1,
#         min_samples_leaf = 1,
#         min_samples_split = 2,
#         min_purity_increase = 0.0,
#         n_subfeatures = 0,
#         post_prune = false,
#         merge_purity_threshold = 1.0,
#         display_depth = 5,
#         feature_importance = :impurity,
#         rng = Random._GLOBAL_RNG()),
#   balancer = NaiveOversampler(
#         ratio = 1.0))

mach = machine(balanced_model, X, y)
fit!(mach, rows=train)
predict(mach, rows=test[1:3])
# 3-element UnivariateFiniteVector{OrderedFactor{2}, String, UInt32, Float64}:
#  UnivariateFinite{OrderedFactor{2}}(0=>1.0, 1=>0.0)
#  UnivariateFinite{OrderedFactor{2}}(0=>0.0, 1=>1.0)
#  UnivariateFinite{OrderedFactor{2}}(0=>0.0, 1=>1.0)
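
Not in the POC itself, but as a quick sanity check one could compare the wrapped and unwrapped models with MLJ's evaluate. balanced_accuracy is one reasonable measure here; exact scores will vary, because the oversampling is random.

# sketch of a comparison one might run:
evaluate(model, X, y, resampling=CV(nfolds=6), measure=balanced_accuracy)
evaluate(balanced_model, X, y, resampling=CV(nfolds=6), measure=balanced_accuracy)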

@ablaom
Member Author

ablaom commented Oct 6, 2023

A large number of oversampling/undersampling strategies, with MLJ interfaces, are now provided by Imbalance.jl, and a wrapper, BalancedModel(model, ....), allowing insertion into supervised learning pipelines, is provided by MLJBalancing.jl.
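
For reference, usage looks roughly like the following. This is a sketch only: the keyword names (e.g. balancer1) are quoted from memory, so consult the MLJBalancing.jl and Imbalance.jl docs for the exact API.

using MLJ, MLJBalancing
SMOTE = @load SMOTE pkg=Imbalance
Tree = @load DecisionTreeClassifier pkg=DecisionTree
X, y = make_moons(200)  # balanced data, so this is purely an API illustration
balanced = BalancedModel(Tree(), balancer1=SMOTE())
mach = machine(balanced, X, y) |> fit!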

Closing as complete.

cc @EssamWissam

ablaom closed this as completed on Oct 6, 2023