Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplifying UnionType during concatenation #2182

Closed
grst opened this issue Jan 31, 2023 · 15 comments
Closed

Simplifying UnionType during concatenation #2182

grst opened this issue Jan 31, 2023 · 15 comments
Labels
feature New feature or request

Comments

@grst
Copy link

grst commented Jan 31, 2023

Description of new feature

In #2058, you concluded that it is more useful if ak.concatenate simplifies types where possible instead of creating a Union. Here I have another example where I think this would be useful:

Current behavior

>>> a1 = ak.Array([{'a': 'aaa', 'b': 'bbb'}])
>>> a2 = ak.Array([{'b': 'bbb', 'c': 'ccc'}])

>>> res = ak.concatenate([a1, a2])
>>> res
[{a: 'aaa', b: 'bbb'},
 {b: 'bbb', c: 'ccc'}]
----------------------
type: 2 * union[
    {
        a: string,
        b: string
    },
    {
        b: string,
        c: string
    }
]

Like this, I can't even access fields anymore after concatenation:

>>> res['c']
FieldNotFoundError                        Traceback (most recent call last)

Desired behavior
Create an option type

type: 2 * var * {
    a: ?string,
    b: string,
    c: ?string
}

FWIW, awkward infers exactly this option type when passing the concatenated array through a python object:

ak.Array(ak.to_list(res))

CC @ivirshup

@grst grst added the feature New feature or request label Jan 31, 2023
@agoose77
Copy link
Collaborator

I think this would be a useful feature, though probably hidden behind merge_record because there are times when this will lose information (if your fields are already option types).

@jpivarski
Copy link
Member

Yeah, about this kind of simplification—it's one that has come up before.

The option-type and union-type simplifications that we have now (@classmethod simplified) are simplifying the layouts/forms while maintaining the same high-level types. Here, I'm assuming type equivalences like

  • option[option[X]] is the same type as option[X] (each value can either be an instance of X or a missing value; the nested option does nothing visible at the level of values)
  • option[union[X, Y]] is the same type as union[option[X], option[Y]] and it's the same type as union[option[X], Y] (each value can either be an instance of X, an instance of Y, or a missing value).

Merging the types of a1 and a2 in

a1 = ak.Array([{'a': 'aaa', 'b': 'bbb'}])
a2 = ak.Array([{'b': 'bbb', 'c': 'ccc'}])

as

2 * var * {
    a: ?string,
    b: string,
    c: ?string
}

is both visible at the type level and it loses information. If they're merged as a union, then each value either has fields a and b or it has fields b and c—there's an anticorrelation between having a and having c—but if they're merged as a single record with optional fields, the correlation is lost. So that makes this a different kind of simplification than the low-level layout/form simplifications.

If we make that change, not only would we break things for current users, but we also wouldn't be able to back out of it easily if we decide that we needed the lost information after all. (We can't change it by calling it a bug-fix: the current behavior really was/is intended.)

Perhaps we could add a function that does this union-squashing manually, so that it's an opt-in thing (and AnnData could opt into it on the whole-library level, so that your users don't have to think about it). The hard part, then, is what should this function be named? (The implementation would not be very complicated, through recursively_apply.)

@grst
Copy link
Author

grst commented Jan 31, 2023

Thanks for the clarification. An explicit function to simplify types sounds good!
Would this function cope with only this specific case, or are there other cases that you would consider for type simplification?

Some naming ideas

ak.squash_types
ak.merge_types
ak.simplify_types
ak.concatenate(..., simplify=True)

@agoose77
Copy link
Collaborator

agoose77 commented Jan 31, 2023

Perhaps project_union() or squash_union()?1 I lean towards this being a dedicated function, because it will be opinionated and specific to unions. We don't yet have any idea what else we might want to simplify, and I suspect that other simplifications might better fit as separate functions if they're unlikely to be chained.

Footnotes

  1. "project" already has lots of (different) meaning, so perhaps this isn't a good name.

@jpivarski
Copy link
Member

project already has a meaning specifically for unions:

def project(self, index):
lentags = self._tags.length
assert (
self._index.length is None or lentags is None
) or self._index.length >= lentags
lenout = ak.index.Index64.empty(1, self._backend.index_nplike)
tmpcarry = ak.index.Index64.empty(lentags, self._backend.index_nplike)
assert (
lenout.nplike is self._backend.index_nplike
and tmpcarry.nplike is self._backend.index_nplike
and self._tags.nplike is self._backend.index_nplike
and self._index.nplike is self._backend.index_nplike
)
self._handle_error(
self._backend[
"awkward_UnionArray_project",
lenout.dtype.type,
tmpcarry.dtype.type,
self._tags.dtype.type,
self._index.dtype.type,
](
lenout.data,
tmpcarry.data,
self._tags.data,
self._index.data,
lentags,
index,
)
)
nextcarry = ak.index.Index64(
tmpcarry.data[: lenout[0]], nplike=self._backend.index_nplike
)
return self._contents[index]._carry(nextcarry, False)

Would this function cope with only this specific case, or are there other cases that you would consider for type simplification?

It would be better to make it apply only to this case, so that capabilities can be pieced together (like the way that ak.pad_none and ak.fill_none are separate functions: the original request was for one function, but doing it in two phases opens more possibilities).

The name should include both "union" and "record," since it turns unions of records into records with option-type fields. "Squash" sounds like "flatten," and that refers to list-type dimensions.

How about ak.merge_union_of_records (with an axis argument)? It's a long name, but this is a specialized thing and can have a long name.

@ivirshup
Copy link

ivirshup commented Feb 1, 2023

I'm trying to wrap my head around what Option types are. Is Option{T} meant to be equivalent to Union{T, Empty}? I feel like this is the path I see languages like Julia and Python take. It's also what I would expect from @jpivarski's comment:

  • option[option[X]] is the same type as option[X] (each value can either be an instance of X or a missing value; the nested option does nothing visible at the level of values)
  • option[union[X, Y]] is the same type as union[option[X], option[Y]] and it's the same type as union[option[X], Y] (each value can either be an instance of X, an instance of Y, or a missing value).

Especially:

union[option[X], option[Y]] and it's the same type as union[option[X], Y]

@agoose77
Copy link
Collaborator

agoose77 commented Feb 1, 2023

Most explicitly, the definition of option is taken from the datashape grammar that we use: https://datashape.readthedocs.io/en/latest/overview.html#option

Semantically, yes, an option is well defined by Optional[X] = X | None. However, we would not express this relation in the same terms in Awkward; None is not a free-standing layout class. Instead, the concept of option it is explicitly (and exclusively) introduced by wrapping a content in an optional layout e.g. IndexedOptionArray. We do have an Empty type — ak.contents.EmptyArray, but it represents "unknown" types, rather than missing values. This is seen e.g. from_iter([]), which has no typed data to infer a type from.

So, TLDR: at the layout level, None is not a distinct type, but option[T] is.

That said, Jim's example

union[option[X], option[Y]] and it's the same type as union[option[X], Y]

might seem confusing — it implies that the missing values in option[X] are identical to those in option[Y] such that we can express the missing values of Y as missing values of X. This is correct; we have a singular, distinct concept of "null". It's just not possible to define without relating to another type, even if that type is unknown e.g. option[unknown]:

ak.contents.UnmaskedArray(ak.contents.EmptyArray())

I hope that I haven't made this any more confusing. Jim might have a keener insight into your question.

@ivirshup
Copy link

ivirshup commented Feb 1, 2023

Most explicitly, the definition of option is taken from the datashape grammar that we use: https://datashape.readthedocs.io/en/latest/overview.html#option

Datashape doesn't have the concept of unions though, right?

We do have an Empty type — ak.contents.EmptyArray, but it represents "unknown" types, rather than missing values.

This one is a little confusing to me, since you can also have empty typed arrays (though I haven't figured out an easy way to create them). Like:

ak.to_packed(ak.Array([[1], [], [], []])[1:])
# <Array [[], [], []] type='3 * var * int64'>

But I think the concepts of unknown and empty are a little conflated here.

@ivirshup
Copy link

ivirshup commented Feb 1, 2023

I've been playing around a little more. I think this little snippet helped elucidate some things for me (the output is quite long):

arrays = [
    ak.pad_none(ak.contents.EmptyArray(), 1, 0),
    ak.Array([[{"a": 1, "b": "foo"}]]),
    ak.Array([[{"a": 2}]]),
]

for perm in permutations(arrays):
    try:
        res = ak.concatenate(perm)
        res.type.show()
        print(ak.to_buffers(res)[0])
    except:
        print("Error")

This does not return a consistent type. It returns both:

3 * union[
    option[var * {a: int64, b: string}],
    var * {a: int64}
]

and

3 * union[
    option[var * {a: int64}],
    var * {a: int64, b: string}
]

Each of which has a distinct form/ layout. If Optional[X] = X | None, then I think you could have a single canonical representation where IndexedOptionArray either wraps the UnionArray or is expressed as part of it.

@agoose77
Copy link
Collaborator

agoose77 commented Feb 1, 2023

Datashape doesn't have the concept of unions though, right?

Correct! See here.

This one is a little confusing to me, since you can also have empty typed arrays

Yes, the name would be better as UnknownArray. Though, EmptyArray is also empty (to have any values, it would need a concrete type, after all).

EmptyArray means "leaf node of unknown type". It's a type-level statement.

You example with merging is strongly related to the bug that this issue originally reports. Simply put, strings are implemented as views over character arrays. This view-metadata is being lost with our merging logic.

@jpivarski
Copy link
Member

You're right, @ivirshup, about how missing values could be implemented; we just made different choices here. Actually, I can think of three different ways to represent missing values that I'll outline below.

First, though, it would be useful to distinguish these two types:

  • unit type: there is exactly one possible value for this type. For example, NoneType in Python can only be instantiated as None.
  • bottom type: there are no possible values for this type: it cannot be instantiated. For example, the return type of a function that always raises an exception or always goes into an infinite loop (never halts) is the bottom type.

(The usefulness of a bottom type is that it is an identity under type-unification. Suppose you have an if-else branch that each return different values: the return type of that function is the type-union of the if branch and the else branch. If the if branch would return type T and the else branch would return the bottom type, then the function returns type T. That is to say, if it returns anything at all, it's not the bottom type: T + 0 = T. So bottom types are useful for book-keeping.)

The ak.contents.EmptyArray layout node, which has ak.types.UnknownType type, is an array of bottom type, not an array of unit type. For this reason, it can only ever have length zero. (I do think they're well-named. Other arrays can have zero length for other reasons; the implication goes in only one direction.)

Awkward Array doesn't have a unit type. If it did (UnitArray?), then it would be a metadata-only array characterized only by its length. Doing any __getitem__(int) on it would always return the same object, perhaps None. If Awkward Array had a unit type, then missing values could be implemented using a UnionArray with this UnitArray and one or more other arrays. The UnionArray.tags would function like ByteMaskedArray.mask and the UnionArray.index would function like IndexedOptionArray.index. This would use more memory than our current scheme, in which option type is either masked (BitMaskedArray or ByteMaskedArray) or indexed (IndexedOptionArray) or has no actual missing values (UnmaskedArray).

The three ways that I'm aware of for implementing missing values are:

  1. Unions that include a unit type, like Python and Julia, as you pointed out, and I'd also add some data formats like Avro and JSON-schema.
  2. Specialized option-type, which Awkward Array uses, as well as C++'s std::optional and other statically typed languages (Swift, Rust, Go...).
  3. Specialized option-type that is also viewed as an iterable container: option[T] is like list[T] except that option can only have length 0 or 1, whereas list can have any non-negative integer length. An empty option container corresponds to "missing" and a length-1 container corresponds to "not missing." Scala's Option works this way, and it's somewhat more common among functional languages because it lets you use the same map/reduce patterns on option types as on list types. This is the idea behind ak.singletons and ak.firsts, to convert into and out of this form.

(1) has a nice conceptual feature that you can reuse the concept of union to implement option. It's particularly good for dynamically typed languages, which are implicitly heterogenous everywhere. It does require you to introduce a unit type, but maybe you had other reasons for doing that, anyway. The downside is that now missingness has to be implemented by a union's tag and index, whereas with (2), we get a choice.

I just spent a while wondering why Julia, a compiled language, goes with (1), rather than (2), and it's because they're working with non-columnar data structures. As record-oriented structures,

struct UnionOfNullAnd{T}
  tag::UInt8
  null_value::Nothing
  other_value::T
end

is effectively the same as

struct Optional{T}
  is_missing::Bool
  other_value::T
end

since the Nothing doesn't use any memory at runtime. For columnar data structures, it's true that we could have defined UnitArray to be metadata-only, but a UnionArray requires tags to say which of the UnionArray.contents is active in a given element and an index to be able to find the element in the specified content in a random-access way (without having to compute a cumulative sum of the tags == 1 array). In a non-columnar world like Julia, the containing array would just have pointers to all of these structs.

So if we used representation (1), we'd be forced to use a mask-like UnionArray.tags and an index-like UnionArray.index for all option-types. What we do instead with (2) is use different techniques, either masking (BitMaskedArray or ByteMaskedArray) or indexing (IndexedOptionArray), depending on whichever is best for that situation.

The statements I made about

  • option[union[X, Y]] is the same type as union[option[X], option[Y]] and it's the same type as union[option[X], Y] (each value can either be an instance of X, an instance of Y, or a missing value).

are independent of representation. You'd want these type expressions to be equivalent in a high-level view, although all of the representations, (1), (2), and (3), would have to do some work to make them equivalent. Maybe Julia and other compiled languages would lose the wrapping-structs in some compiler optimization pass, so maybe they don't have to deal with it explicitly.1

We have to, and we do it (now) by imposing a conventional layout when the nodes are being constructed. (We used to do it after the nodes were constructed but before they were returned to the user, but missed some cases.)

Footnotes

  1. Actually, no: they can't optimize std::optional<std::optional<T>> into std::optional<T> because the compiler would have to be aware of some very long-distance relationships to know that it's only the logical-and of each std::optional::has_value that matters and be able to coalesce them into a single boolean, rather than two booleans. It gets even more complicated when std::optional and std::variant are mixed. I'd bet $10 that there isn't a compiler in the world that can do that.

@jpivarski
Copy link
Member

My last message here was addressing @grst's original issue, about ak.concatenate making a union of two incompatible record types, rather than merging the record types with optional fields (and #2185 added that ability as an opt-in function you can call).

I had missed this point:

I've been playing around a little more. I think this little snippet helped elucidate some things for me (the output is quite long):

arrays = [
    ak.pad_none(ak.contents.EmptyArray(), 1, 0),
    ak.Array([[{"a": 1, "b": "foo"}]]),
    ak.Array([[{"a": 2}]]),
]

for perm in permutations(arrays):
    try:
        res = ak.concatenate(perm)
        res.type.show()
        print(ak.to_buffers(res)[0])
    except:
        print("Error")

This is showing non-commutativity of the type that comes from ak.concatenate, although it's duplicate with #2173 (comment) and #2191 (reply in thread), all to be fixed by #2179. I think all of the examples that demonstrate non-commutativity of types from ak.concatenate have an EmptyArray in them.

@ivirshup
Copy link

ivirshup commented Feb 3, 2023

Thanks for the explanation @jpivarski, I think this was helpful. Especially seeing that unknown is the Bottom type part of EmptyArray. I can see why you would need a special option type if ?unknown is an allowable type, otherwise this would just be ?.

I think I've figured out how we're going to end up working with this on anndata once this feature is out (scverse/anndata#898).

Any idea on the timeframe for a release with ak.merge_union_of_records (2.1.0)?


Another question, are there other parts of the high level API that let me construct unions of records? I think I've only been seeing these from ak.concatenate, while the constructor tends to fill with optional types.


I just spent a while wondering why Julia, a compiled language, goes with (1), rather than (2), and it's because they're working with non-columnar data structures. As record-oriented structure

As a minor point, I don't think this description is quite accurate. Julia does use a "columnar" representation for null values in arrays. From the blog post describing the feature (which I think was quite good):

Quote describing how Array{Missing{T}} works

The second improvement consists in using a compact memory layout for Array object whose element type is a Union of bits types, i.e. immutable types which contain no references (see the isbits function). This includes Missing and basic types such as Int, Float64, Complex{Float64} and Date. When T is a bits type, Array{Union{Missing,T}} objects are internally represented as a pair of arrays of the same size: an Array{T} holding non-missing values and uninitialized memory for missing values; and an Array{UInt8} storing a type tag indicating whether each entry is of type Missing or T.

@jpivarski
Copy link
Member

Any idea on the timeframe for a release with ak.merge_union_of_records (2.1.0)?

Probably early next week, but I can do a patch release (2.0.7) that would include ak.merge_union_of_records because that's already in main.

Another question, are there other parts of the high level API that let me construct unions of records? I think I've only been seeing these from ak.concatenate, while the constructor tends to fill with optional types.

It would be hard to enumerate them, but there are others. One that comes to mind is ak.fill_none, since the fill value might not match the type of the nullable data:

>>> array = ak.Array([1, 2, 3, None, 4, 5, None, 6])
>>> array.show(type=True)
type: 8 * ?int64
[1,
 2,
 3,
 None,
 4,
 5,
 None,
 6]
>>> array2 = ak.fill_none(array, "hello")
>>> array2.show(type=True)
type: 8 * union[
    int64,
    string
]
[1,
 2,
 3,
 'hello',
 4,
 5,
 'hello',
 6]

As a minor point, I don't think this description is quite accurate. Julia does use a "columnar" representation for null values in arrays. From the blog post describing the feature (which I think was quite good):

I see! I followed through to the deprecated DataArrays.jl (because "In fact, the layout of the DataArray type is very similar to that of Array{Union{Missing,T}} described above.") So even though high-level Julia describes the type as (1) in my numbering scheme, they implement it as (2). I guess it's a SufficientlySmartCompiler.

@agoose77
Copy link
Collaborator

agoose77 commented Feb 4, 2023

This should have been fixed by 2.0.7 :)

@agoose77 agoose77 closed this as completed Feb 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants