Merge pull request #42 from alan-turing-institute/dev
For tagged release 0.2.4
ablaom authored Nov 11, 2019
2 parents d10edb6 + f12c30d commit 4e7f935
Showing 12 changed files with 297 additions and 144 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
@@ -18,7 +18,7 @@ matrix:
- julia: nightly

after_success:
- julia -e 'using Pkg; pkg"add Coverage"; using Coverage; Codecov.submit(Codecov.process_folder())'
- julia -e 'import Pkg; Pkg.add("Coverage"); using Coverage; Coveralls.submit(process_folder())'

jobs:
include:
17 changes: 8 additions & 9 deletions Project.toml
@@ -1,23 +1,22 @@
name = "ScientificTypes"
uuid = "321657f4-b219-11e9-178b-2701a2544e81"
authors = ["Anthony D. Blaom <[email protected]>"]
version = "0.2.3"
version = "0.2.4"

[deps]
InteractiveUtils = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
Requires = "ae029012-a4dd-5104-9daa-d747884805df"
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[compat]
Requires = "0.5.2"
CategoricalArrays = "^0.7"
ColorTypes = "^0.8"
Tables = "^0.2"
julia = "1"

[extras]
AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
ColorTypes = "3da002f7-5984-5a60-b8a6-cbb66c0b333f"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["AbstractTrees", "CategoricalArrays", "ColorTypes", "Random", "Tables", "Test"]
test = ["Random", "Test"]
43 changes: 34 additions & 9 deletions README.md
@@ -4,25 +4,49 @@
| :-----------: | :------: | :-----------: |
| [![Build Status](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl.svg?branch=master)](https://travis-ci.org/alan-turing-institute/ScientificTypes.jl) | [![codecov.io](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl/coverage.svg?branch=master)](http://codecov.io/github/alan-turing-institute/ScientificTypes.jl?branch=master) | [![](https://img.shields.io/badge/docs-stable-blue.svg)](https://alan-turing-institute.github.io/ScientificTypes.jl/dev) |

A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.
A light-weight Julia interface for implementing conventions about the
scientific interpretation of data, and for performing type coercions
enforcing those conventions.

The package makes the distinction between **machine type** and **scientific type**:

* the _machine type_ is a Julia type the data is currently encoded as (for instance: `Float64`)
* the _scientific type_ is a type defined by this package which encapsulates how the data should be _interpreted_ in the rest of the code (for instance: `Continuous` or `Multiclass`)
* the _scientific type_ is a type defined by this package which
encapsulates how the data should be _interpreted_ (for instance:
`Continuous` or `Multiclass`)

As a motivating example, the data might contain a column corresponding to a _number of transactions_, the machine type in that case could be an `Int` whereas the scientific type would be a `Count`.
The distinction is useful because the same machine type is often used
to represent data with *differing* scientific interpretations - `Int`
is used for product numbers (a factor) but also for a person's weight
(a continuous variable) - while the same scientific
type is frequently represented by *different* machine types - both
`Int` and `Float64` are used to represent weights, for example.

The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a `Float64` whereas the meaning should still be a `Count`.

## Very quick start

(For more information and examples please refer to [the doc](https://alan-turing-institute.github.io/ScientificTypes.jl/dev))
For more information and examples please refer to [the
manual](https://alan-turing-institute.github.io/ScientificTypes.jl/dev).

This is a very quick start presenting two key functions exported by ScientificTypes:
ScientificTypes.jl has three components:

* `schema(X)` which gives an extended schema of the table `X` with the column scientific types implied by the current scitype convention,
* `coerce(X, ...)` which allows to overwrite scientific types for specific columns to indicate their appropriate scientific interpretation.
- An *interface*, for articulating a convention about the scientific
interpretation of data. This consists of a definition of a scientific
type hierarchy, and a single function `scitype` with scientific
types as values. Someone implementing a convention must add methods
to this function, while the general user just applies it to data, as
in `scitype(4.5)` (returning `Continuous` in the *mlj* convention).

- A built-in convention, called *mlj*, active by default.

- Convenience methods for working with scientific types, the most commonly used being:

- `schema(X)`, which gives an extended schema of any table `X`,
including the column scientific types implied by the active
convention.
- `coerce(X, ...)`, which coerces the machine types of `X`
to reflect a desired scientific type.

```julia
using ScientificTypes, DataFrames
@@ -49,7 +73,8 @@ will print
:e -- Union{Missing, Unknown}
```

this uses the default "MLJ convention" to attribute a scitype (cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)).
this uses the default *mlj* convention to attribute a scitype
(cf. [docs](https://alan-turing-institute.github.io/ScientificTypes.jl/dev/#The-MLJ-convention-1)).

Now suppose you want to specify that `b` is actually a `Count`, and that `d` and `e` are `Multiclass`; this is done with the `coerce` function:

43 changes: 33 additions & 10 deletions docs/src/index.md
@@ -17,22 +17,35 @@ The package `ScientificTypes` provides:

- A hierarchy of new Julia types representing scientific data types for use in method dispatch (eg, for trait values). Instances of the types play no role:

```@example 0
using ScientificTypes, AbstractTrees
ScientificTypes.tree()
```
Found
├─ Known
│ ├─ Finite
│ │ ├─ Multiclass
│ │ └─ OrderedFactor
│ ├─ Infinite
│ │ ├─ Continuous
│ │ └─ Count
│ ├─ Image
│ │ ├─ ColorImage
│ │ └─ GrayImage
│ └─ Table
└─ Unknown
```

- A single method `scitype` for articulating a convention about what scientific type each Julia object can represent. For example, one might declare `scitype(::AbstractFloat) = Continuous`.

- A default convention called *mlj*, based on optional dependencies `CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a convenience method `coerce` for performing scientific type coercion on `AbstractVectors` and columns of tabular data (any table implementing the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface).
- A default convention called *mlj*, based on dependencies
`CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a
convenience method `coerce` for performing scientific type coercion
on `AbstractVectors` and columns of tabular data (any table
implementing the [Tables.jl](https://github.com/JuliaData/Tables.jl)
interface).

- A `schema` method for tabular data, based on the optional Tables dependency, for inspecting the machine and scientific types of tabular data, in addition to column names and number of rows.

### Dependencies

The only dependencies are [`Requires.jl`](https://github.com/MikeInnes/Requires.jl) and `InteractiveUtils` (from stdlib).

## Quick start
## Getting started

The package is registered and can be installed via the package manager with `add ScientificTypes`.

@@ -182,6 +195,16 @@ Similarly, the scitype of an `AbstractArray` is `AbstractArray{U}` where `U` is
scitype([1.3, 4.5, missing])
```

*Performance note:* Computing type unions over large arrays is
expensive and, depending on the convention's implementation and the
array eltype, computing the scitype can be slow. (In the *mlj*
convention this is mitigated with the help of the
`ScientificTypes.Scitype` method, of which other conventions could
make use. Do `?ScientificTypes.Scitype` for details.) An eltype `Any`
will always be slow and you may want to consider replacing an array
`A` with `broadcast(identity, A)` to collapse the eltype and speed up
the computation.
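
The eltype-collapsing trick mentioned in the performance note can be sketched with Base Julia alone. This is only an illustration under the assumption of small in-memory arrays; the variable names are made up for the example and no ScientificTypes functions are called:

```julia
# An array whose eltype is `Any`, although every element is an integer.
A = Any[1, 2, 3]

# Broadcasting `identity` copies the elements into a new array whose
# eltype is as narrow as the elements allow.
B = broadcast(identity, A)

println(eltype(A))  # the loose eltype of the original array
println(eltype(B))  # the collapsed eltype of the copy

# Missing values are preserved; the eltype becomes a small `Union`.
C = broadcast(identity, Any[1.5, missing])
println(eltype(C))
```

Note that the narrowed array is a copy, so this costs one extra pass over the data before `scitype` is computed.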

Provided the [Tables.jl](https://github.com/JuliaData/Tables.jl) package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:

```@example 5
@@ -288,7 +311,7 @@ X = (a = rand("abc", n), # 3 values, not number --> Multiclass
autotype(X, only_changes=true)
```

For example, we could first apply the `:discrete_to_continuous` rule,
For example, we could first apply the `:discrete_to_continuous` rule,
followed by the `:few_to_finite` rule. The first rule will apply to `b` and `e`
but the subsequent application of the second rule will mean we will
get the same result apart from `e` (which will be `Continuous`)
@@ -298,4 +321,4 @@ autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
```

One should check and possibly modify the returned dictionary
before passing to `coerce`.
before passing to `coerce`.
137 changes: 102 additions & 35 deletions src/ScientificTypes.jl
@@ -2,11 +2,13 @@ module ScientificTypes

export Scientific, Found, Unknown, Finite, Infinite
export OrderedFactor, Multiclass, Count, Continuous
export Binary, Table, ColorImage, GrayImage
export Binary, Table
export ColorImage, GrayImage
export scitype, scitype_union, scitypes, coerce, schema
export mlj
export autotype

using Requires, InteractiveUtils
using Tables, CategoricalArrays, ColorTypes

# ## FOR DEFINING SCITYPES ON OBJECTS DETECTED USING TRAITS

@@ -64,7 +66,7 @@ const Scientific = Union{Missing,Found}
"""
MLJBase.Table{K}
The scientific type for tabular data (a containter `X` for which
The scientific type for tabular data (a container `X` for which
`Tables.is_table(X)=true`).
If `X` has columns `c1, c2, ..., cn`, then, by definition,
@@ -107,45 +109,127 @@ end
# ## THE SCITYPE FUNCTION

"""
scitype(x)
scitype(X)
The scientific type that `X` may represent.
"""
scitype(X) = scitype(X, Val(convention()))
scitype(X, C) = scitype(X, C, Val(trait(X)))
scitype(X, C, ::Val{:other}) = Unknown

scitype(::Missing) = Missing


# ## CONVENIENCE METHOD FOR UNIONS OVER ELEMENTS

"""
scitype_union(A)
scitype_union(A)
Return the type union, over all elements `x` generated by the iterable
`A`, of `scitype(x)`.
See also `scitype`.
"""
scitype_union(A) = reduce((a,b)->Union{a,b}, (scitype(el) for el in A))


# ## SCITYPES OF TUPLES AND ARRAYS
# ## SCITYPES OF TUPLES

scitype(t::Tuple, ::Val) = Tuple{scitype.(t)...}

# The following fallback can be quite slow. Individual conventions
# will usually be able to find more perfomant overloadings of this
# method:
scitype(A::B, ::Val) where {T,N,B<:AbstractArray{T,N}} =

# ## SCITYPES OF ARRAYS

"""
ScientificTypes.Scitype(::Type, C::Val)
Method for implementers of a convention to enable speed-up of scitype
evaluations for large arrays.
In general, one cannot infer the scitype of an object of type
`AbstractArray{T, N}` from the machine type alone. For example, this
never holds in the *mlj* convention for a categorical array, or in the
following examples: `X=Any[1, 2, 3]` and `X=Union{Missing,Int64}[1, 2,
3]`.
Nevertheless, for some *restricted* machine types `U`, the statement
`typeof(X) <: AbstractArray{T, N}` for some `T<:U` already allows one
to deduce that `scitype(X) = AbstractArray{S,N}`, where `S` is determined
by `U` alone. This is the case in the *mlj* convention, for example,
if `U = Integer`, in which case `S = Count`. If one explicitly declares
ScientificTypes.Scitype(::Type{<:U}, ::Val{:convention}) = S
in such cases, then ScientificTypes ensures a considerable speed-up in
the computation of `scitype(X)`. There is also a partial speed-up for
the case that `T <: Union{U, Missing}`.
For example, in *mlj* one has `Scitype(::Type{<:Integer}) = Count`.
"""
Scitype(::Type, C::Val) = nothing
Scitype(::Type{Any}, C::Val) = nothing # because `Any` isa `Union{<:Any, Missing}`

# For all such `T` we can also get almost the same speed-up in the case that
# `T` is replaced by `Union{T, Missing}`, which we detect by wrapping
# the answer:

Scitype(MT::Type{Union{T, Missing}}, C::Val) where T = Val(Scitype(T, C))

# For example, in *mlj* convention, Scitype(::Integer) = Count

const Arr{T,N} = AbstractArray{T,N}

# the dispatcher:
scitype(A::Arr{T}, C) where T = scitype(A, C, Scitype(T, C))

# the slow fallback:
scitype(A::Arr{<:Any,N}, ::Val, ::Nothing) where N =
AbstractArray{scitype_union(A),N}

# the speed-up:
scitype(::Arr{<:Any,N}, ::Val, S) where N = Arr{S,N}

# partial speed-up for missing types, because broadcast is faster than
# computing scitype_union:
function scitype(A::Arr{<:Any,N}, C::Val, ::Val{S}) where {N,S}
if S == nothing
return scitype(A, C, S)
else
Atight = broadcast(identity, A)
if typeof(A) == typeof(Atight)
return Arr{Union{S,Missing},N}
else
return Arr{S,N}
end
end
end


# ## STUB FOR COERCE METHOD

"""
coerce(A::AbstractArray, T; verbosity=1)
Coerce the julia types of elements of `A` to ensure the returned array
has `T` or `Union{Missing,T}` as the union of its element scitypes,
according to the active convention.
A warning is issued if missing values are encountered, unless
`verbosity` is `0` or less.
julia> mlj()
julia> v = coerce([1, missing, 5], Continuous)
3-element Array{Union{Missing, Float64},1}:
1.0
missing
5.0
julia> scitype(v)
AbstractArray{Union{Missing,Continuous}, 1}
See also [`scitype`](@ref), [`scitype_union`](@ref).
"""
function coerce end


@@ -197,33 +281,16 @@ schema(X, ::Val{:other}) =
"an object with trait `:other`\n"*
"Perhaps you meant to import Tables first?"))

include("tables.jl")
include("autotype.jl")

## ACTIVATE DEFAULT CONVENTION

# and include code not requring optional dependencies:
# and include code not requiring optional dependencies:

mlj()
include("conventions/mlj/mlj.jl")


## FOR LOADING OPTIONAL DEPENDENCIES

function __init__()

# for printing out the type tree:
@require(AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c",
include("tree.jl"))

# the scitype and schema of tabular data:
@require(Tables="bd369af6-aec1-5ad0-b16a-f7cc5008161c",
(include("tables.jl"); include("autotype.jl")))

# :mlj conventions requiring external packages
@require(CategoricalArrays="324d7699-5711-5eae-9e2f-1d82baa6b597",
include("conventions/mlj/finite.jl"))
@require(ColorTypes="3da002f7-5984-5a60-b8a6-cbb66c0b333f",
include("conventions/mlj/images.jl"))

end
include("conventions/mlj/finite.jl")
include("conventions/mlj/images.jl")

end # module
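
The array dispatch added to this file (a type-level shortcut consulted first, with a slow element-by-element fallback) can be sketched in isolation as follows. This is a toy reconstruction, not the package's code: `toy_scitype`, `toy_Scitype` and the three stand-in types are hypothetical names, the convention argument (`Val`) is dropped, and the partial speed-up for `Union{T, Missing}` eltypes is omitted.

```julia
# Toy scientific types (hypothetical stand-ins, not the package's definitions).
abstract type Count end
abstract type Continuous end
abstract type Unknown end

# Element-level rules of a toy convention.
toy_scitype(::Integer) = Count
toy_scitype(::AbstractFloat) = Continuous
toy_scitype(::Missing) = Missing
toy_scitype(::Any) = Unknown

# Type-level shortcut: `nothing` means the eltype alone is not enough.
toy_Scitype(::Type) = nothing
toy_Scitype(::Type{<:Integer}) = Count
toy_Scitype(::Type{<:AbstractFloat}) = Continuous

# Dispatcher: look up the shortcut once, then dispatch on its value.
toy_scitype(A::AbstractArray{T}) where T = toy_scitype(A, toy_Scitype(T))

# Slow fallback: inspect every element and take the union of the results.
toy_scitype(A::AbstractArray{<:Any,N}, ::Nothing) where N =
    AbstractArray{reduce((a, b) -> Union{a,b}, (toy_scitype(x) for x in A)), N}

# Fast path: the eltype already determines the answer; no elements are read.
toy_scitype(::AbstractArray{<:Any,N}, S) where N = AbstractArray{S,N}

fast = toy_scitype([1, 2, 3])                     # eltype Int: shortcut applies
slow = toy_scitype(Any[1, 2.0])                   # eltype Any: falls back
withmissing = toy_scitype(Union{Missing,Int}[1, missing])  # also falls back here
```

The design point is that the fast path costs a single method lookup regardless of array length, while the fallback is linear in the number of elements.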
2 changes: 0 additions & 2 deletions src/autotype.jl
@@ -1,5 +1,3 @@
export autotype

"""
autotype(X)