Replies: 8 comments
-
I can reproduce this, and the |
Beta Was this translation helpful? Give feedback.
-
The cause of this behaviour appears to be https://github.com/scikit-hep/awkward-1.0/blob/533f3852b5528543ad6b9c81d8f7d6fede69cb4c/src/awkward/operations/convert.py#L2902 This was introduced in 6dca862 I don't know why we do this — it reminds me of the lazy column reading with Parquet, but a cursory glance doesn't ring any bells. |
Beta Was this translation helpful? Give feedback.
-
What part of that commit is the problem? Arrow Tables and Parquet files require named fields, but Awkward Arrays can be non-records. For non-records, we use the empty string as a field name, so the empty string is treated differently. I don't know if that's the problem here. I hadn't considered that empty string column names would be desirable in a context with real records (more than one field, more than a formality to fit the technology). |
Beta Was this translation helpful? Give feedback.
-
It's exactly this - we use the presence of only an empty key to implicitly mean "this is an (Awkward) non-record array". @rocurley in this case (I assume) wants to be able to load such a table and maintain the structure. Regardless, this cannot be "solved" - any convention that applies meaning to the key will ultimately collide with some value chosen by a particularly determined user. But, maybe an Awkward prefix would be better than an empty key?Perhaps this issue should become should be a discussion, to welcome other Arrow users' opinions. |
Beta Was this translation helpful? Give feedback.
-
Then what it boils down to us that we wanted a space larger than the set of all possible records; we wanted records and non-records, so we took a corner out of the space of records to represent the non-records. Now those records are not representable. We could have picked a special-sounding key, like "AwkwardFieldNotARecord," but it seemed like a better idea to use a special value (the only string of length zero) for this special meaning. Changing it now would touch a lot of code—it wouldn't be clean and easy, though that's not the strongest agreement against it. The strongest argument is that some people have saved files with the one policy, but if new versions of Awkward Array use a different policy for interpreting single-key, empty key name records, the existing files will be interested differently—it would not be backward compatible. There is something that we could do: we could introduce another convention to encode the single-key, empty key name case in yet another structure. That is, if you really wanted to save this Awkward Array:
then we would put it in a file containing
Okay, so what if you wanted the latter? Well, we'd save it in s file that contained three nested, single-key, empty key name records. A single rule could shift everything down by one, and even though this breaks existing uses of these structures, I doubt anyone's using them seriously. Doing so, we'd get the whole space that we want: all records and non-records. The space of records is infinite, and that means that we can make a bijection to wither infinite space with strictly more things in it—this is Hilbert's hotel. We could have done this uniformly with all structures, but if we did, then the Parquet view of all Awkward Arrays would be strange. This way, only strange structures get a strange encoding. (Strange enough, though, that I expect it would be raised as another bug report, when discovered!) I think we should make this a discussion, but let's hear back from @rocurley first. |
Beta Was this translation helpful? Give feedback.
-
That's a really important point that I failed to mention yesterday — any change to the convention here breaks reading old files. As you say, we'd want a strong case to be made in order to do that. |
Beta Was this translation helpful? Give feedback.
-
Hey folks! So the context for how I encountered this is that I'm trying to read/write data in a very similar format to an Awkward Array from/to parquet, and I've got a property test set up to generate random data and round-trip it. Hypothesis likes to generate "smaller" values, such as the empty string, and so it discovered this corner case. Practically speaking, it was pretty easy to solve this issue on my end: I just made it an error to make a column with an empty string as a name. Since my column names are directly human generated anyway, that wasn't much of a problem. Given that, I arguably no longer have a horse in this race. The reason I reported this as a bug anyway is that it felt like it made the library less predictable, which is something I value. I've been thinking about what circumstances could cause somebody to run into this outside the context of a test. I think the most likely case is if you're using it more as a dict than as a struct: your column names are generated at runtime. Maybe something like "I'm recording the visitor counts per hour to each subdomain of my website (including the empty subdomain) and using awkward array to save it to parquet". What does that person want? I think, mostly, they don't want the special case, but that's not on the table for backwards compatibility. Failing that, I think they want reliable round-tripping through parquet. The Hilbert hotel approach will get them that, so that's nice. On the other hand, you could imagine that they're writing the data to parquet so that some other system can consume it: in that case they want the current behavior. Not really sure what would be best here! |
Beta Was this translation helpful? Give feedback.
-
The Awkward → Parquet → Awkward round trip can be made an identity function, which is more predictable, through the Hilbert Hotel encoding, but that makes the Awkward → Parquet step more complex and less predictable. The reason we're not wrapping everything in Since this was motivated by hypothesis testing the edge cases, not a real use-case, I think we shouldn't change it yet. As for the possibility of a real use-case in which records are treated more like dicts than structs: that's an antipattern we should discourage. Field names in C structs are not generated because they get "compiled in," there's a presumption that they're not dynamic (and can't be, unless you've got a JIT compiler). Trying to make them dynamic would make the performance worse, rather than better. Record names in Awkward Arrays are like that, too: each field creates its own memory buffer (and data type, and array infrastructure), so if they're dynamically created, the performance would be worse; it would make a lot more sense in that case to use a Python dict than an Awkward Array. On the other hand, I should point out that a "mapping" data structure is a future feature (#780), but the structure is not a record with dynamically generated field names: it's two equal length arrays of keys and values (thus enforcing that the keys all have the same type as each other in one compact array, rather than an array and type for each key), in which the keys are sorted for binary searches. Uproot generates arrays of this type when it encounters a C++ I'll make this a discussion. |
Beta Was this translation helpful? Give feedback.
-
Version of Awkward Array
1.4.0
Description and code to reproduce
Code
Output
Expected Output
Beta Was this translation helpful? Give feedback.
All reactions