Patch release (#60)
tlienart authored Nov 25, 2019
1 parent 2339424 commit f114943
Showing 11 changed files with 629 additions and 547 deletions.
8 changes: 5 additions & 3 deletions Project.toml
@@ -1,22 +1,24 @@
name = "ScientificTypes"
uuid = "321657f4-b219-11e9-178b-2701a2544e81"
authors = ["Anthony D. Blaom <[email protected]>"]
version = "0.2.5"
version = "0.2.6"

tlienart Nov 25, 2019

Author Collaborator

@JuliaRegistrator register

Release note:

  • Enable in-place coercion for DataFrames, with coerce!(df, ...) (#50)
  • Support coercion from categorical to count/continuous respecting order if there is one (#53)
  • Add elscitype functionality which complements that of scitype_union (#59)

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[compat]
CategoricalArrays = "^0.7"
CategoricalArrays = "^0.7.3"
ColorTypes = "^0.8"
Tables = "^0.2"
julia = "1"

[extras]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Random", "Test"]
test = ["Random", "Test", "CSV", "DataFrames"]
19 changes: 11 additions & 8 deletions docs/src/index.md
@@ -100,19 +100,15 @@ schema(Xfixed).scitypes

Note that, as it encountered missing values in `height` it coerced the type to `Union{Missing,Continuous}`.

Finally there is a `coerce!` method that does in-place coercion provided the data structure allows it (at the moment only `DataFrames.DataFrame` is supported).
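
As a rough illustration of the in-place variant described above, the sketch below assumes that `DataFrames` and `ScientificTypes` are both installed and that the default *mlj* convention is active. The column names are made up for the example.

```julia
using DataFrames, ScientificTypes

# Illustrative data; the column names are arbitrary.
df = DataFrame(height=[152, 148, 163], rating=[1, 5, 2])

# Mutates `df` rather than returning a coerced copy.
coerce!(df, :height => Continuous, :rating => OrderedFactor)

# Inspect the resulting scitypes.
schema(df).scitypes
```

Passing any other table type to `coerce!` is expected to raise an `ArgumentError`, in which case the non-mutating `coerce` is the fallback.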

## Notes

- We regard the built-in julia type `Missing` as a scientific type. The new scientific types introduced in the current package are rooted in the abstract type `Found` (see tree above) and we export the alias `Scientific = Union{Missing, Found}`.

- `Finite{N}`, `Muliticlass{N}` and `OrderedFactor{N}` are all parameterized by the number of levels `N`. We export the alias `Binary = Finite{2}`.

- `Image{W,H}`, `GrayImage{W,H}` and `ColorImage{W,H}` are all parameterized by the image width and height dimensions, `(W, H)`.

- We regard the built-in Julia type `Missing` as a scientific type. The new scientific types introduced in the current package are rooted in the abstract type `Found` (see tree above) and we export the alias `Scientific = Union{Missing, Found}`.
- `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all parametrised by the number of levels `N`. We export the alias `Binary = Finite{2}`.
- `Image{W,H}`, `GrayImage{W,H}` and `ColorImage{W,H}` are all parametrised by the image width and height dimensions, `(W, H)`.
- The function `scitype` has the fallback value `Unknown`.

- Since Tables is an optional dependency, the `scitype` of a [`Tables.jl`](https://github.com/JuliaData/Tables.jl) supported table is `Unknown` unless Tables has been imported.

- Developers can define their own conventions using the code in `src/conventions/mlj/` as a template. The active convention is controlled by the value of `ScientificTypes.CONVENTION[1]`.


@@ -282,6 +278,13 @@ It is important to note that the order in which the rules are specified matters;
autotype(X; rules=(:few_to_finite,))
```

Finally, you can also use the following shorthands:

```julia
autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))
```

### Available rules

Rule symbol | scitype suggestion
190 changes: 19 additions & 171 deletions src/ScientificTypes.jl
@@ -4,7 +4,7 @@ export Scientific, Found, Unknown, Finite, Infinite
export OrderedFactor, Multiclass, Count, Continuous
export Binary, Table
export ColorImage, GrayImage
export scitype, scitype_union, scitypes, coerce, schema
export scitype, scitype_union, scitypes, elscitype, coerce, coerce!, schema
export mlj
export autotype

@@ -107,183 +107,31 @@ function Table(Ts...)
return Table{<:Union{[AbstractVector{<:T} for T in Ts]...}}
end


# ## THE SCITYPE FUNCTION

"""
scitype(X)
The scientific type that `X` may represent.
"""
scitype(X) = scitype(X, Val(convention()))
scitype(X, C) = scitype(X, C, Val(trait(X)))
scitype(X, C, ::Val{:other}) = Unknown

scitype(::Missing) = Missing

# ## CONVENIENCE METHOD FOR UNIONS OVER ELEMENTS

"""
scitype_union(A)
Return the type union, over all elements `x` generated by the iterable
`A`, of `scitype(x)`.
See also `scitype`.
"""
scitype_union(A) = reduce((a,b)->Union{a,b}, (scitype(el) for el in A))
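
The reduction used by `scitype_union` can be tried out without the package. The sketch below is only an analogy: it substitutes `typeof` for `scitype`, and `type_union` is a hypothetical name, so it builds a union of machine types rather than of scientific types.

```julia
# Hypothetical analogue of `scitype_union`, using `typeof` in place of `scitype`.
type_union(A) = reduce((a, b) -> Union{a, b}, (typeof(el) for el in A))

A = Any[1, 2.0, missing, 3]
T = type_union(A)   # union of the element types actually present in `A`
```

Because `reduce` is called without an `init` value, an empty iterable raises an error in this sketch.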


# ## SCITYPES OF TUPLES

scitype(t::Tuple, ::Val) = Tuple{scitype.(t)...}


# ## SCITYPES OF ARRAYS

"""
ScientificTypes.Scitype(::Type, C::Val)
Method for implementers of a convention to enable speed-up of scitype
evaluations for large arrays.
In general, one cannot infer the scitype of an object of type
`AbstractArray{T, N}` from the machine type alone. For example, this
never holds in the *mlj* convention for a categorical array, or in the
following examples: `X=Any[1, 2, 3]` and `X=Union{Missing,Int64}[1, 2,
3]`.
Nevertheless, for some *restricted* machine types `U`, the statement
`typeof(X) <: AbstractArray{T, N}` for some `T<:U` already allows one
to deduce that `scitype(X) = AbstractArray{S,N}`, where `S` is determined
by `U` alone. This is the case in the *mlj* convention, for example,
if `U = Integer`, in which case `S = Count`. If one explicitly declares
ScientificTypes.Scitype(::Type{<:U}, ::Val{:convention}) = S
in such cases, then ScientificTypes ensures a considerable speed-up in
the computation of `scitype(X)`. There is also a partial speed-up for
the case that `T <: Union{U, Missing}`.
For example, in *mlj* one has `Scitype(::Type{<:Integer}) = Count`.
"""
Scitype(::Type, C::Val) = nothing
Scitype(::Type{Any}, C::Val) = nothing # b/s `Any` isa `Union{<:Any, Missing}`

# For all such `T` we can also get almost the same speed-up in the case that
# `T` is replaced by `Union{T, Missing}`, which we detect by wrapping
# the answer:

Scitype(MT::Type{Union{T, Missing}}, C::Val) where T = Val(Scitype(T, C))

# For example, in *mlj* convention, Scitype(::Integer) = Count

const Arr{T,N} = AbstractArray{T,N}

# the dispatcher:
scitype(A::Arr{T}, C) where T = scitype(A, C, Scitype(T, C))

# the slow fallback:
scitype(A::Arr{<:Any,N}, ::Val, ::Nothing) where N =
AbstractArray{scitype_union(A),N}

# the speed-up:
scitype(::Arr{<:Any,N}, ::Val, S) where N = Arr{S,N}

# partial speed-up for missing types, because broadcast is faster than
# computing scitype_union:
function scitype(A::Arr{<:Any,N}, C::Val, ::Val{S}) where {N,S}
if S == nothing
return scitype(A, C, S)
else
Atight = broadcast(identity, A)
if typeof(A) == typeof(Atight)
return Arr{Union{S,Missing},N}
else
return Arr{S,N}
end
end
end
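
The partial speed-up relies on `broadcast(identity, A)` returning an array whose element type is as tight as the values allow. A small sketch of that behaviour, using plain Base arrays (the variable names are illustrative):

```julia
# Declared element type allows `missing`, but none is present.
A = Union{Missing,Int}[1, 2, 3]
Atight = broadcast(identity, A)
eltype(Atight)                    # narrowed: `Missing` is dropped from the eltype

# Here a `missing` really is present, so the type cannot be narrowed.
B = Union{Missing,Int}[1, missing, 3]
Btight = broadcast(identity, B)
typeof(B) == typeof(Btight)       # the array type is unchanged
```

Comparing `typeof(A)` with `typeof(Atight)` therefore reveals whether any `missing` values occur, without computing a union over all elements.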


# ## STUB FOR COERCE METHOD

"""
coerce(A::AbstractArray, T; verbosity=1)
Coerce the julia types of elements of `A` to ensure the returned array
has `T` or `Union{Missing,T}` as the union of its element scitypes,
according to the active convention.
A warning is issued if missing values are encountered, unless
`verbosity` is `0` or less.
is_type(obj, spkg, stype)
julia> mlj()
julia> v = coerce([1, missing, 5], Continuous)
3-element Array{Union{Missing, Float64},1}:
1.0
missing
5.0
julia> scitype(v)
AbstractArray{Union{Missing,Continuous}, 1}
See also [`scitype`](@ref), [`scitype_union`](@ref).
This is a way to check that an object `obj` is of a given type that may come
from a package that is not loaded in the current environment.
For instance, say `DataFrames` is not loaded in the current environment, a
function from some package could still return a DataFrame in which case you
can check this with
```
is_type(obj, :DataFrames, :DataFrame)
```
"""
function coerce end


# ## TABLE SCHEMA

struct Schema{names, types, scitypes, nrows} end

Schema(names::Tuple{Vararg{Symbol}}, types::Type{T}, scitypes::Type{S}, nrows::Integer) where {T<:Tuple,S<:Tuple} = Schema{names, T, S, nrows}()
Schema(names, types, scitypes, nrows) = Schema{Tuple(Base.map(Symbol, names)), Tuple{types...}, Tuple{scitypes...}, nrows}()

function Base.getproperty(sch::Schema{names, types, scitypes, nrows}, field::Symbol) where {names, types, scitypes, nrows}
if field === :names
return names
elseif field === :types
return types === nothing ? nothing : Tuple(fieldtype(types, i) for i = 1:fieldcount(types))
elseif field === :scitypes
return scitypes === nothing ? nothing : Tuple(fieldtype(scitypes, i) for i = 1:fieldcount(scitypes))
elseif field === :nrows
return nrows === nothing ? nothing : nrows
else
throw(ArgumentError("unsupported property for ScientificTypes.Schema"))
end
function is_type(obj, spkg::Symbol, stype::Symbol)
# If the package is loaded, then it will just be `stype`
# otherwise it will be `spkg.stype`
rx = Regex("^($spkg\\.)?$stype")
match(rx, "$(typeof(obj))") === nothing || return true
return false
end
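
To see what the string matching in `is_type` does, here is a self-contained sketch that copies the function and applies it to a type from Base instead of `DataFrames`. The choice of `UnitRange` is purely for illustration. Since the pattern is anchored only at the start, a name that is merely a prefix of the printed type also matches, which may or may not be intended.

```julia
# Copy of the function from the diff, so the snippet runs on its own.
function is_type(obj, spkg::Symbol, stype::Symbol)
    rx = Regex("^($spkg\\.)?$stype")
    match(rx, "$(typeof(obj))") === nothing || return true
    return false
end

r = 1:3                          # a UnitRange from Base

is_type(r, :Base, :UnitRange)    # true
is_type(r, :Base, :Vector)       # false
is_type(r, :Base, :Unit)         # true: prefix match, the pattern has no end anchor
```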

Base.propertynames(sch::Schema) = (:names, :types, :scitypes, :nrows)

_as_named_tuple(s::Schema) = NamedTuple{(:names, :types, :scitypes, :nrows)}((s.names, s.types, s.scitypes, s.nrows))

function Base.show(io::IO, ::MIME"text/plain", s::Schema)
show(io, MIME("text/plain"), _as_named_tuple(s))
end


"""
schema(X)
Inspect the column types and scitypes of a table.
julia> X = (ncalls=[1, 2, 4], mean_delay=[2.0, 5.7, 6.0])
julia> schema(X)
(names = (:ncalls, :mean_delay),
types = (Int64, Float64),
scitypes = (Count, Continuous))
"""
schema(X) = schema(X, Val(trait(X)))
schema(X, ::Val{:other}) =
throw(ArgumentError("Cannot inspect the internal scitypes of "*
"an object with trait `:other`\n"*
"Perhaps you meant to import Tables first?"))

include("tables.jl")
include("scitype.jl")
include("schema.jl")
include("coerce.jl")
include("autotype.jl")

## ACTIVATE DEFAULT CONVENTION
80 changes: 80 additions & 0 deletions src/coerce.jl
@@ -0,0 +1,80 @@
function _coerce_col(X, name, types; args...)
    y = getproperty(X, name)
    if haskey(types, name)
        # HACK isa LazyArrays.ApplyArray, see issue #49
        if is_type(y, :LazyArrays, :ApplyArray)
            y = convert(Vector, y)
        end
        return coerce(y, types[name]; args...)
    else
        return y
    end
end

"""
    coerce(X, col1=>scitype1, col2=>scitype2, ... ; verbosity=1)
    coerce(X, d::AbstractDict; verbosity=1)

Return a copy of the table `X` with the scitypes of the specified
columns coerced to those specified, or to missing-value versions of
these scitypes, with warnings issued (for positive `verbosity`).
Alternatively, the specifications can be wrapped in a dictionary.

### Example

```julia
using CategoricalArrays, DataFrames, Tables
X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
              height=[152, missing, 148, 163],
              rating=[1, 5, 2, 1])
coerce(X, :name=>Multiclass, :height=>Continuous, :rating=>OrderedFactor)
```

See also [`scitype`](@ref), [`schema`](@ref).
"""
function coerce(X, pairs::Pair{Symbol}...; verbosity=1)
    trait(X) == :table ||
        error("Non-tabular data encountered or Tables pkg not loaded.")
    names = Tables.schema(X).names
    dpairs = Dict(pairs)
    X_ct = Tables.columntable(X)
    ct_new = (_coerce_col(X_ct, col, dpairs; verbosity=verbosity) for col in names)
    return Tables.materializer(X)(NamedTuple{names}(ct_new))
end
coerce(X, types::Dict; kw_args...) = coerce(X, (p for p in types)...; kw_args...)


"""
    coerce!(X, ...)

Same as [`coerce`](@ref) except it does the modification in place provided `X`
supports in-place modification (at the moment, only `DataFrames.DataFrame` does).
An error is thrown otherwise. The arguments are the same as `coerce`.
"""
function coerce!(X, args...; kwargs...)
    # DataFrame --> coerce_df! (defined below)
    is_type(X, :DataFrames, :DataFrame) && return coerce_df!(X, args...; kwargs...)
    # Everything else
    throw(ArgumentError("In place coercion not supported for $(typeof(X)). Try `coerce` instead."))
end
coerce!(X, types::Dict; kwargs...) = coerce!(X, (p for p in types)...; kwargs...)
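
When a `Dict`-based method forwards to a varargs method, the separator before `kwargs...` matters: after a semicolon the keywords stay keywords, while after a comma they are splatted as extra positional `Pair`s. The sketch below shows the difference with a hypothetical `collect_args` function standing in for a coercion method; none of these names come from the package.

```julia
# Hypothetical stand-in for a varargs method such as a table coercion function.
collect_args(pairs::Pair{Symbol}...; verbosity=1) = (pairs=pairs, verbosity=verbosity)

# Semicolon: `kwargs` are forwarded as keyword arguments.
forward_kw(d::Dict; kwargs...) = collect_args((p for p in d)...; kwargs...)

# Comma: `kwargs` are splatted as extra positional `Pair`s instead.
forward_pos(d::Dict; kwargs...) = collect_args((p for p in d)..., kwargs...)

d = Dict(:height => Float64)

kw  = forward_kw(d; verbosity=0)    # one pair, verbosity forwarded
pos = forward_pos(d; verbosity=0)   # two pairs, verbosity left at its default
```

In the comma form, `:verbosity => 0` would be treated as a request to coerce a column named `verbosity`, so the semicolon form is the one that matches the intent.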

function coerce_df!(df, pairs::Pair{Symbol}...; verbosity=1)
    names = Tables.schema(df).names
    types = Dict(pairs)
    for name in names
        name in keys(types) || continue
        # For DataFrames >= 0.19 one could write
        #     df[!, name] = coerce(df[!, name], types[name])
        # but we want something that works more robustly, even for older DataFrames.
        # The only way to do this is to use `df.name = something`; we cannot use
        # `setindex!`, which would throw a deprecation warning.
        name_str = "$name"
        ex = quote
            $df.$name = coerce($df.$name, $types[Symbol($name_str)])
        end
        eval(ex)
    end
    return df
end
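
The `eval` in `coerce_df!` exists to perform `df.name = value` with a column name known only at run time. In Julia, `x.f = v` lowers to `setproperty!(x, :f, v)`, so the same assignment can be written without `eval` for any type that supports property assignment. The sketch below shows this on a hypothetical `ToyTable` type. Whether calling `setproperty!` on a `DataFrame` behaves well across the DataFrames versions this commit targets is an assumption that has not been checked here.

```julia
# Hypothetical toy container standing in for a column table; not a DataFrames type.
struct ToyTable
    cols::Dict{Symbol,Any}
end

# Route `t.name` and `t.name = v` to the underlying dictionary.
Base.getproperty(t::ToyTable, name::Symbol) = getfield(t, :cols)[name]
Base.setproperty!(t::ToyTable, name::Symbol, v) = (getfield(t, :cols)[name] = v)

t = ToyTable(Dict{Symbol,Any}(:height => [152, 148, 163]))

name = :height
# Equivalent to `t.height = float.(t.height)`, with the name chosen at run time.
setproperty!(t, name, float.(getproperty(t, name)))
```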

1 comment on commit f114943

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/5855

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if Julia TagBot is installed, or can be done manually through the github interface, or via:

git tag -a v0.2.6 -m "<description of version>" f1149435c4c9b75417e04319ef32ece194d751d3
git push origin v0.2.6
