
Streaming decoder #12

Closed
wants to merge 11 commits into from

Conversation

pepijndevos

As explained in #11, this PR adds a Wire class that decodes the data without allocating everything at once.

It's a bit hacky: it modifies some things to defer work until later, which narrowly avoids interfering with normal Wire operation. Maybe there is a cleaner way to separate the two, or to make Wire itself lazier, but I didn't want to break Wire too much.

One example could be to, instead of keeping track of repeats, repeat the value in the fmt dict, so that you can index into it directly rather than searching for the right value.
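To illustrate the idea (a hypothetical sketch, not minipb's actual internals, and `expand_schema` is a made-up helper name): expand repeated or skipped entries when building the fmt dict, so the decoder can index by field number in O(1) instead of searching for the right slot. Nested message entries are ignored here to keep the sketch short, and `'xN'` is assumed to mean N consecutive skipped field numbers.

```python
def expand_schema(fields):
    """Map field number -> (name, type_char), expanding 'xN' skips
    into N individual entries so lookup is a plain dict index."""
    fmt = {}
    num = 1
    for name, type_char in fields:
        if type_char.startswith('x'):
            # 'x2' occupies two field numbers; record each one explicitly
            for _ in range(int(type_char[1:] or '1')):
                fmt[num] = (name, 'x')
                num += 1
        else:
            fmt[num] = (name, type_char)
            num += 1
    return fmt

fmt = expand_schema((('id', 'T'), ('_', 'x2'), ('name', 'U')))
# field 1 -> id, fields 2-3 -> skipped, field 4 -> name
```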

Probably needs some review, tests, and polish.
Here is some code I used for testing:

import minipb
import gzip

value = (('string_value', 'U'),
         ('float_value', 'f'),
         ('double_value', 'd'),
         ('int_value', 't'),
         ('uint_value', 'T'),
         ('sint_value', 'z'))

feature = (('id', 'T'),
           ('tags', '#T'),
           ('type', 't'),
           ('geometry', '#T'))

layer = (('name', '*U'),
         ('features', '+[', feature, ']'),
         ('keys', '+U'),
         ('values', '+[', value, ']'),
         ('extent', 'T'),
         ('_', 'x9'),
         ('version', '*T'))

tile = (('_', 'x2'), ('layers', '+[', layer, ']'))

refwire = minipb.Wire(tile)

wire = minipb.IterWire(tile)

dec = gzip.open('[redacted]vtiles/10/522/330.pbf', 'rb').read()

print(refwire.decode(dec))

for p in wire.decode(dec):
    print(p)

Which outputs

{'layers': ({'name': 'water', 'features': ({'id': None, 'tags': (1, 0), 'type': 3, 'geometry': (9, 8320, 127, 26, 0, 8448, 8447, 0, 0, 8447, 15)},), 'keys': ('id', 'class', 'intermittent', 'brunnel'), 'values': ({'string_value': 'ocean', 'float_value': None, 'double_value': None, 'int_value': None, 'uint_value': None, 'sint_value': None},), 'extent': 4096, 'version': 2},)}
(('layers', 'name'), 'water')
(('layers', 'features', 'tags'), 1)
(('layers', 'features', 'tags'), 0)
(('layers', 'features', 'type'), 3)
(('layers', 'features', 'geometry'), 9)
(('layers', 'features', 'geometry'), 8320)
(('layers', 'features', 'geometry'), 127)
(('layers', 'features', 'geometry'), 26)
(('layers', 'features', 'geometry'), 0)
(('layers', 'features', 'geometry'), 8448)
(('layers', 'features', 'geometry'), 8447)
(('layers', 'features', 'geometry'), 0)
(('layers', 'features', 'geometry'), 0)
(('layers', 'features', 'geometry'), 8447)
(('layers', 'features', 'geometry'), 15)
(('layers', 'keys'), 'id')
(('layers', 'keys'), 'class')
(('layers', 'keys'), 'intermittent')
(('layers', 'keys'), 'brunnel')
(('layers', 'values', 'string_value'), 'ocean')
(('layers', 'extent'), 4096)
(('layers', 'version'), 2)

This is using a mapbox vector tile sample:
330.zip

@pepijndevos
Author

Hrm, on MicroPython:

>>> for x in unpack('/sd/vtiles/14/8432/5398.pbf'): pass
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 32, in unpack
  File "/sd/apps/minipb.py", line 987, in _decode_wire
  File "/sd/apps/minipb.py", line 987, in _decode_wire
  File "/sd/apps/minipb.py", line 985, in _decode_wire
  File "/sd/apps/minipb.py", line 731, in _decode_field
AttributeError: 'str' object has no attribute 'decode'
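A guess at the cause from the traceback alone: MicroPython sometimes hands back `str` where CPython APIs return `bytes`, so a second `.decode()` call fails. A defensive helper like the following (hypothetical, not part of the PR) sidesteps the mismatch:

```python
def ensure_str(value, encoding='utf-8'):
    """Return value as str, decoding only when it is actually bytes."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

ensure_str(b'water')  # decodes
ensure_str('water')   # passes through unchanged
```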

@pepijndevos
Author

This now reliably parses even the largest Mapbox vector tiles, though it's quite slow on an embedded device.

One thing I'm considering adding is start/end events, like a proper SAX API. Currently it's hard to tell where repeated elements start and end.
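A rough sketch of what such events could look like (a hypothetical post-processing wrapper, not this PR's API): emit `start`/`end` markers whenever the path prefix changes. Note this reconstruction from paths alone cannot separate two consecutive repeated messages with identical paths, which is exactly why real start/end events from the decoder would help.

```python
def with_events(flat_items):
    """Wrap a stream of (path, value) pairs with start/end events
    around each nesting level."""
    prev = ()
    for path, value in flat_items:
        # close levels no longer shared with the current path
        while prev and prev != path[:len(prev)]:
            yield ('end', prev)
            prev = prev[:-1]
        # open any new intermediate levels
        while len(prev) < len(path) - 1:
            prev = path[:len(prev) + 1]
            yield ('start', prev)
        yield ('value', path, value)
    while prev:
        yield ('end', prev)
        prev = prev[:-1]
```

Feeding it the first two items from the output above yields `('start', ('layers',))`, the two values (with a `start` for `('layers', 'features')` in between), and closing `end` events.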

@dogtopus
Owner

Is this ready?

@pepijndevos
Author

Basically? At least ready for a review. I think there might be an optimization or two left, and I haven't looked at writing tests yet; I just threw a ton of data at it and it didn't break.

One thing I saw in profiles is that my buffer wrapper is quite hot, and I think I can avoid a lot of that by not nesting it.

One thing that's somewhat unavoidable: it's less robust to malformed input, because it reads things more lazily.

It could also be interesting to consider whether it's worth porting the eager version over to this system, since the eager version is pretty wasteful with memory, allocating a byte array for every level of nesting and then turning it into an IO again.

Another small thing that might speed things up is implementing readinto on my buffer object, so that the varint-parsing function doesn't need to allocate like it does now.
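The readinto idea can be sketched like this (hypothetical standalone code, not the PR's actual buffer object): reuse one preallocated byte of scratch space across calls, so the hot varint loop never allocates.

```python
import io

_scratch = bytearray(1)  # reused across calls; no per-byte allocation

def parse_varint(stream):
    """Decode a protobuf base-128 varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        if stream.readinto(_scratch) != 1:
            raise EOFError('truncated varint')
        byte = _scratch[0]
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result
        shift += 7

parse_varint(io.BytesIO(b'\xac\x02'))  # -> 300
```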

@pepijndevos
Author

Concern: debug logging slows the code down considerably even when not actually logging, like 1.9 s vs 2.6 s on a large file.
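This overhead is a known property of the stdlib `logging` module: even when the level filters everything out, each `log.debug(...)` call still builds its argument tuple and walks the logging machinery. Two standard mitigations (illustrative snippet; the logger name is an assumption):

```python
import logging

log = logging.getLogger('minipb')

# 1. Guard hot-path calls so nothing is evaluated when DEBUG is off.
if log.isEnabledFor(logging.DEBUG):
    log.debug('decoded field %d -> %r', 1, 'water')

# 2. Hoist the check out of the loop entirely for the hottest paths.
debug_enabled = log.isEnabledFor(logging.DEBUG)
for field_number in range(3):
    if debug_enabled:
        log.debug('decoding field %d', field_number)
```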

@pepijndevos
Author

So readinto actually resulted in a small speedup, but not nesting the wrappers turned out to be much slower, because tell is a costly operation.

So I guess I'm done with it. The remaining question is what to do about logging and testing.

@dogtopus
Owner

dogtopus commented Sep 3, 2022

Since the core logic seems quite stable now, I'm OK with removing the debug logs. However, I don't think changing the warning messages to print is a good idea, since I still want it to be as easy as possible to integrate into CPython projects.

Also, once you're finished, could you please add a couple of unit tests to test.py and make sure they pass at least on CPython?
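On the print-vs-logging point: the standard library's recommended pattern for libraries is a module-level logger plus a `NullHandler`, so warnings reach whatever the host application has configured and produce no output otherwise (generic sketch; minipb's actual logger setup may differ):

```python
import logging

# Library-side setup: attach a NullHandler so an unconfigured host app
# sees no spurious "No handlers could be found" output.
logger = logging.getLogger('minipb')
logger.addHandler(logging.NullHandler())

def warn_unknown_field(number):
    # Hypothetical example call site: goes to the application's
    # handlers (or nowhere), never straight to stdout.
    logger.warning('unknown field number %d, skipping', number)
```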

@pepijndevos
Author

Will do. I did notice the loglevel logic seems incorrect; I kinda forgot the details.

@dogtopus dogtopus closed this Jun 25, 2023
@pepijndevos
Author

Ohai would you still be interested in this if I got around to adding tests?
