
Streaming decoder #12

Closed
wants to merge 11 commits into from

Conversation

pepijndevos

As explained in #11, this PR adds a Wire class that decodes the data without allocating everything at once.

It's a bit hacky: it modifies some things to defer work until later, which narrowly avoids interfering with normal Wire operation. Maybe there is a cleaner way to separate the two, or to make Wire itself lazier, but I didn't want to break Wire too much.

One example could be to, instead of keeping track of repeats, repeat the value in the fmt dict, so that you can index into it directly rather than searching for the right value.
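To illustrate the idea (a hypothetical sketch, not minipb's actual internals, and `expand_schema` is a made-up helper name): expand repeated or skipped entries when building the fmt dict, so the decoder can index by field number in O(1) instead of searching for the right slot. Nested message entries are ignored here to keep the sketch short, and `'xN'` is assumed to mean N consecutive skipped field numbers.

```python
def expand_schema(fields):
    """Map field number -> (name, type_char), expanding 'xN' skips
    into N individual entries so lookup is a plain dict index."""
    fmt = {}
    num = 1
    for name, type_char in fields:
        if type_char.startswith('x'):
            # 'x2' occupies two field numbers; record each one explicitly
            for _ in range(int(type_char[1:] or '1')):
                fmt[num] = (name, 'x')
                num += 1
        else:
            fmt[num] = (name, type_char)
            num += 1
    return fmt

fmt = expand_schema((('id', 'T'), ('_', 'x2'), ('name', 'U')))
# field 1 -> id, fields 2-3 -> skipped, field 4 -> name
```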

Probably needs some review, tests, and polish.
Here is some code I used for testing:

import minipb
import gzip

value = (('string_value', 'U'),
         ('float_value', 'f'),
         ('double_value', 'd'),
         ('int_value', 't'),
         ('uint_value', 'T'),
         ('sint_value', 'z'))

feature = (('id', 'T'),
           ('tags', '#T'),
           ('type', 't'),
           ('geometry', '#T'))

layer = (('name', '*U'),
         ('features', '+[', feature, ']'),
         ('keys', '+U'),
         ('values', '+[', value, ']'),
         ('extent', 'T'),
         ('_', 'x9'),
         ('version', '*T'))

tile = (('_', 'x2'), ('layers', '+[', layer, ']'))

refwire = minipb.Wire(tile)

wire = minipb.IterWire(tile)

dec = gzip.open('[redacted]vtiles/10/522/330.pbf', 'rb').read()

print(refwire.decode(dec))

for p in wire.decode(dec):
    print(p)

Which outputs

{'layers': ({'name': 'water', 'features': ({'id': None, 'tags': (1, 0), 'type': 3, 'geometry': (9, 8320, 127, 26, 0, 8448, 8447, 0, 0, 8447, 15)},), 'keys': ('id', 'class', 'intermittent', 'brunnel'), 'values': ({'string_value': 'ocean', 'float_value': None, 'double_value': None, 'int_value': None, 'uint_value': None, 'sint_value': None},), 'extent': 4096, 'version': 2},)}
(('layers', 'name'), 'water')
(('layers', 'features', 'tags'), 1)
(('layers', 'features', 'tags'), 0)
(('layers', 'features', 'type'), 3)
(('layers', 'features', 'geometry'), 9)
(('layers', 'features', 'geometry'), 8320)
(('layers', 'features', 'geometry'), 127)
(('layers', 'features', 'geometry'), 26)
(('layers', 'features', 'geometry'), 0)
(('layers', 'features', 'geometry'), 8448)
(('layers', 'features', 'geometry'), 8447)
(('layers', 'features', 'geometry'), 0)
(('layers', 'features', 'geometry'), 0)
(('layers', 'features', 'geometry'), 8447)
(('layers', 'features', 'geometry'), 15)
(('layers', 'keys'), 'id')
(('layers', 'keys'), 'class')
(('layers', 'keys'), 'intermittent')
(('layers', 'keys'), 'brunnel')
(('layers', 'values', 'string_value'), 'ocean')
(('layers', 'extent'), 4096)
(('layers', 'version'), 2)

This is using a mapbox vector tile sample:
330.zip

@pepijndevos
Author

Hrm, on MicroPython:

>>> for x in unpack('/sd/vtiles/14/8432/5398.pbf'): pass
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 32, in unpack
  File "/sd/apps/minipb.py", line 987, in _decode_wire
  File "/sd/apps/minipb.py", line 987, in _decode_wire
  File "/sd/apps/minipb.py", line 985, in _decode_wire
  File "/sd/apps/minipb.py", line 731, in _decode_field
AttributeError: 'str' object has no attribute 'decode'
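A guess at the cause from the traceback alone: MicroPython sometimes hands back `str` where CPython APIs return `bytes`, so a second `.decode()` call fails. A defensive helper like the following (hypothetical, not part of the PR) sidesteps the mismatch:

```python
def ensure_str(value, encoding='utf-8'):
    """Return value as str, decoding only when it is actually bytes."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

ensure_str(b'water')  # decodes
ensure_str('water')   # passes through unchanged
```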

@pepijndevos
Author

This now reliably parses even the largest Mapbox vector tiles, though it's quite slow on an embedded device.

One thing I'm considering adding is start/end events, like a proper SAX API. Currently it's hard to tell where repeated elements start and end.
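A rough sketch of what such events could look like (a hypothetical post-processing wrapper, not this PR's API): emit `start`/`end` markers whenever the path prefix changes. Note this reconstruction from paths alone cannot separate two consecutive repeated messages with identical paths, which is exactly why real start/end events from the decoder would help.

```python
def with_events(flat_items):
    """Wrap a stream of (path, value) pairs with start/end events
    around each nesting level."""
    prev = ()
    for path, value in flat_items:
        # close levels no longer shared with the current path
        while prev and prev != path[:len(prev)]:
            yield ('end', prev)
            prev = prev[:-1]
        # open any new intermediate levels
        while len(prev) < len(path) - 1:
            prev = path[:len(prev) + 1]
            yield ('start', prev)
        yield ('value', path, value)
    while prev:
        yield ('end', prev)
        prev = prev[:-1]
```

Feeding it the first two items from the output above yields `('start', ('layers',))`, the two values (with a `start` for `('layers', 'features')` in between), and closing `end` events.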

@dogtopus
Owner

Is this ready?

@pepijndevos
Author

Basically? At least ready for a review. I think there might be an optimization or two left, and I haven't looked at writing tests yet; I just threw a ton of data at it and it didn't break.

One thing I saw in profiles is that my buffer wrapper is quite hot, and I think I can avoid a lot of that by not nesting it.

One thing that's somewhat unavoidable: it's less robust to malformed input, because it reads things more lazily.

It could also be interesting to consider whether it's worth porting the eager version over to this system, since the eager version is pretty wasteful with memory, allocating a byte array for every level of nesting and then turning it into an IO again.

Another small thing that might speed things up is implementing readinto on my buffer object, so that the varint-parsing function doesn't need to allocate like it does now.
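The readinto idea can be sketched like this (hypothetical standalone code, not the PR's actual buffer object): reuse one preallocated byte of scratch space across calls, so the hot varint loop never allocates.

```python
import io

_scratch = bytearray(1)  # reused across calls; no per-byte allocation

def parse_varint(stream):
    """Decode a protobuf base-128 varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        if stream.readinto(_scratch) != 1:
            raise EOFError('truncated varint')
        byte = _scratch[0]
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result
        shift += 7

parse_varint(io.BytesIO(b'\xac\x02'))  # -> 300
```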

@pepijndevos
Author

Concern: debug logging slows the code down considerably even when not actually logging, like 1.9 s vs 2.6 s on a large file.
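This overhead is a known property of the stdlib `logging` module: even when the level filters everything out, each `log.debug(...)` call still builds its argument tuple and walks the logging machinery. Two standard mitigations (illustrative snippet; the logger name is an assumption):

```python
import logging

log = logging.getLogger('minipb')

# 1. Guard hot-path calls so nothing is evaluated when DEBUG is off.
if log.isEnabledFor(logging.DEBUG):
    log.debug('decoded field %d -> %r', 1, 'water')

# 2. Hoist the check out of the loop entirely for the hottest paths.
debug_enabled = log.isEnabledFor(logging.DEBUG)
for field_number in range(3):
    if debug_enabled:
        log.debug('decoding field %d', field_number)
```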

@pepijndevos
Author

So readinto actually resulted in a small speedup, but not nesting the wrappers turned out to be much slower, because tell is a costly operation.

So I guess I'm done with it. The remaining question is what to do about logging and testing.

@dogtopus
Owner

dogtopus commented Sep 3, 2022

Since the core logic seems quite stable now, I'm OK with removing the debug logs. However, I don't think changing the warning messages to print is a good idea, since I still want it to be as easy as possible to integrate into CPython projects.

Also, once you're finished, could you please add a couple of unit tests to test.py and make sure they pass at least on CPython?
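On the print-vs-logging point: the standard library's recommended pattern for libraries is a module-level logger plus a `NullHandler`, so warnings reach whatever the host application has configured and produce no output otherwise (generic sketch; minipb's actual logger setup may differ):

```python
import logging

# Library-side setup: attach a NullHandler so an unconfigured host app
# sees no spurious "No handlers could be found" output.
logger = logging.getLogger('minipb')
logger.addHandler(logging.NullHandler())

def warn_unknown_field(number):
    # Hypothetical example call site: goes to the application's
    # handlers (or nowhere), never straight to stdout.
    logger.warning('unknown field number %d, skipping', number)
```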

@pepijndevos
Author

Will do. I did notice the loglevel logic seems incorrect; I kinda forgot the details.

@dogtopus dogtopus closed this Jun 25, 2023
@pepijndevos
Author

Ohai would you still be interested in this if I got around to adding tests?
