`from_arrow` doesn't preserve empty string column name #1069

rocurley · 2021-08-20T20:16:01Z

rocurley
Aug 20, 2021

Version of Awkward Array

1.4.0

Description and code to reproduce

Code

import pyarrow as pa
import awkward as ak

table1 = pa.Table.from_pydict({"key": [True]})
table2 = pa.Table.from_pydict({"": [True]})
table3 = pa.Table.from_pydict({"": [True], "key": [False]})
print(ak.from_arrow(table1))
print(ak.from_arrow(table2))
print(ak.from_arrow(table3))

Output

[{key: True}]
[True]
[{'': True, key: False}]

Expected Output

[{key: True}]
[{'':True}]
[{'': True, key: False}]

agoose77 · 2021-08-20T20:40:08Z

agoose77
Aug 20, 2021
Maintainer

I can reproduce this, and the to_pydict representation of these tables is valid. I assume we support empty field names (Jim may correct me), because the third test case works.

0 replies

agoose77 · 2021-08-20T20:51:47Z

agoose77
Aug 20, 2021
Maintainer

The cause of this behaviour appears to be https://github.com/scikit-hep/awkward-1.0/blob/533f3852b5528543ad6b9c81d8f7d6fede69cb4c/src/awkward/operations/convert.py#L2902

This was introduced in 6dca862

I don't know why we do this — it reminds me of the lazy column reading with Parquet, but a cursory glance doesn't ring any bells.

0 replies

jpivarski · 2021-08-21T15:08:35Z

jpivarski
Aug 21, 2021
Maintainer

What part of that commit is the problem?

Arrow Tables and Parquet files require named fields, but Awkward Arrays can be non-records. For non-records, we use the empty string as a field name, so the empty string is treated differently. I don't know if that's the problem here. I hadn't considered that empty string column names would be desirable in a context with real records (more than one field, more than a formality to fit the technology).

0 replies

agoose77 · 2021-08-21T15:28:57Z

agoose77
Aug 21, 2021
Maintainer

What part of that commit is the problem?

Arrow Tables and Parquet files require named fields, but Awkward Arrays can be non-records. For non-records, we use the empty string as a field name, so the empty string is treated differently. I don't know if that's the problem here. I hadn't considered that empty string column names would be desirable in a context with real records (more than one field, more than a formality to fit the technology).

It's exactly this - we use the presence of only an empty key to implicitly mean "this is an (Awkward) non-record array". @rocurley in this case (I assume) wants to be able to load such a table and maintain the structure.

Regardless, this cannot be "solved" - any convention that applies meaning to the key will ultimately collide with some value chosen by a particularly determined user. But, maybe an Awkward prefix would be better than an empty key?Perhaps this issue should become should be a discussion, to welcome other Arrow users' opinions.

0 replies

jpivarski · 2021-08-22T11:51:16Z

jpivarski
Aug 22, 2021
Maintainer

Then what it boils down to us that we wanted a space larger than the set of all possible records; we wanted records and non-records, so we took a corner out of the space of records to represent the non-records. Now those records are not representable. We could have picked a special-sounding key, like "AwkwardFieldNotARecord," but it seemed like a better idea to use a special value (the only string of length zero) for this special meaning.

Changing it now would touch a lot of code—it wouldn't be clean and easy, though that's not the strongest agreement against it. The strongest argument is that some people have saved files with the one policy, but if new versions of Awkward Array use a different policy for interpreting single-key, empty key name records, the existing files will be interested differently—it would not be backward compatible.

There is something that we could do: we could introduce another convention to encode the single-key, empty key name case in yet another structure. That is, if you really wanted to save this Awkward Array:

{"": XYZ}

then we would put it in a file containing

{"": {"": XYZ}}

Okay, so what if you wanted the latter? Well, we'd save it in s file that contained three nested, single-key, empty key name records. A single rule could shift everything down by one, and even though this breaks existing uses of these structures, I doubt anyone's using them seriously.

Doing so, we'd get the whole space that we want: all records and non-records. The space of records is infinite, and that means that we can make a bijection to wither infinite space with strictly more things in it—this is Hilbert's hotel.

We could have done this uniformly with all structures, but if we did, then the Parquet view of all Awkward Arrays would be strange. This way, only strange structures get a strange encoding. (Strange enough, though, that I expect it would be raised as another bug report, when discovered!)

I think we should make this a discussion, but let's hear back from @rocurley first.

0 replies

agoose77 · 2021-08-22T12:08:48Z

agoose77
Aug 22, 2021
Maintainer

Changing it now would touch a lot of code—it wouldn't be clean and easy, though that's not the strongest agreement against it. The strongest argument is that some people have saved files with the one policy, but if new versions of Awkward Array use a different policy for interpreting single-key, empty key name records, the existing files will be interested differently—it would not be backward compatible.

That's a really important point that I failed to mention yesterday — any change to the convention here breaks reading old files. As you say, we'd want a strong case to be made in order to do that.

0 replies

rocurley · 2021-08-22T16:24:11Z

rocurley
Aug 22, 2021
Author

Hey folks!

So the context for how I encountered this is that I'm trying to read/write data in a very similar format to an Awkward Array from/to parquet, and I've got a property test set up to generate random data and round-trip it. Hypothesis likes to generate "smaller" values, such as the empty string, and so it discovered this corner case.

Practically speaking, it was pretty easy to solve this issue on my end: I just made it an error to make a column with an empty string as a name. Since my column names are directly human generated anyway, that wasn't much of a problem. Given that, I arguably no longer have a horse in this race. The reason I reported this as a bug anyway is that it felt like it made the library less predictable, which is something I value.

I've been thinking about what circumstances could cause somebody to run into this outside the context of a test. I think the most likely case is if you're using it more as a dict than as a struct: your column names are generated at runtime. Maybe something like "I'm recording the visitor counts per hour to each subdomain of my website (including the empty subdomain) and using awkward array to save it to parquet". What does that person want? I think, mostly, they don't want the special case, but that's not on the table for backwards compatibility. Failing that, I think they want reliable round-tripping through parquet. The Hilbert hotel approach will get them that, so that's nice. On the other hand, you could imagine that they're writing the data to parquet so that some other system can consume it: in that case they want the current behavior.

Not really sure what would be best here!

0 replies

jpivarski · 2021-08-23T14:31:44Z

jpivarski
Aug 23, 2021
Maintainer

The Awkward → Parquet → Awkward round trip can be made an identity function, which is more predictable, through the Hilbert Hotel encoding, but that makes the Awkward → Parquet step more complex and less predictable. The reason we're not wrapping everything in {"": is because we want the Parquet to be understandable in other (non-Awkward) programs. We can't have both Awkward → Parquet and Awkward → Parquet → Awkward both be simple because Parquet encodes a different space of data structures than Awkward.

Since this was motivated by hypothesis testing the edge cases, not a real use-case, I think we shouldn't change it yet.

As for the possibility of a real use-case in which records are treated more like dicts than structs: that's an antipattern we should discourage. Field names in C structs are not generated because they get "compiled in," there's a presumption that they're not dynamic (and can't be, unless you've got a JIT compiler). Trying to make them dynamic would make the performance worse, rather than better. Record names in Awkward Arrays are like that, too: each field creates its own memory buffer (and data type, and array infrastructure), so if they're dynamically created, the performance would be worse; it would make a lot more sense in that case to use a Python dict than an Awkward Array.

On the other hand, I should point out that a "mapping" data structure is a future feature (#780), but the structure is not a record with dynamically generated field names: it's two equal length arrays of keys and values (thus enforcing that the keys all have the same type as each other in one compact array, rather than an array and type for each key), in which the keys are sorted for binary searches. Uproot generates arrays of this type when it encounters a C++ std::map in a ROOT file, but there isn't anything in Awkward Array yet to use it for fast lookups.

I'll make this a discussion.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`from_arrow` doesn't preserve empty string column name #1069

{{title}}

Replies: 8 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

from_arrow doesn't preserve empty string column name #1069

rocurley Aug 20, 2021

Version of Awkward Array

Description and code to reproduce

Replies: 8 comments

agoose77 Aug 20, 2021 Maintainer

agoose77 Aug 20, 2021 Maintainer

jpivarski Aug 21, 2021 Maintainer

agoose77 Aug 21, 2021 Maintainer

jpivarski Aug 22, 2021 Maintainer

agoose77 Aug 22, 2021 Maintainer

rocurley Aug 22, 2021 Author

jpivarski Aug 23, 2021 Maintainer

`from_arrow` doesn't preserve empty string column name #1069

rocurley
Aug 20, 2021

agoose77
Aug 20, 2021
Maintainer

agoose77
Aug 20, 2021
Maintainer

jpivarski
Aug 21, 2021
Maintainer

agoose77
Aug 21, 2021
Maintainer

jpivarski
Aug 22, 2021
Maintainer

agoose77
Aug 22, 2021
Maintainer

rocurley
Aug 22, 2021
Author

jpivarski
Aug 23, 2021
Maintainer