Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691
Conversation
(Force-pushed from 190d0fe to 891f91e)
Both json and orjson can parse JSON directly from raw bytes; they don't require the bytes to first be decoded to str. It turns out that reading files in binary mode and parsing the bytes directly as JSON is 1.1 to 1.2 times faster than reading the same data in text mode.

This does mean there is a type discrepancy between StreamListener.on_data and HistoricListener.on_data (str vs bytes respectively). It wouldn't be difficult to change betfairlightweight to have StreamListener.on_data accept bytes, and while it would be an API change I doubt anyone would need to change any code, as more or less any .on_data implementation needs to parse the data to JSON immediately anyway, and that would work the same with str or bytes. From my benchmark I can't see any significant difference between str and bytes inside betfairlightweight, so it seems more prudent to bend the type hinting a little here than to make any changes in betfairlightweight itself.

Benchmark: https://gist.github.com/petedmarsh/802f1d2cde79d957afcc744d63d34347

Benchmarks, repeat=5, number=5

┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐
│ Benchmark                                          │ Min     │ Max     │ Mean    │ Min (+)         │ Max (+)         │ Mean (+)        │
├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤
│ (In Memory) json(str) vs json(bytes)               │ 1.188   │ 1.194   │ 1.192   │ 1.199 (-1.0x)   │ 1.208 (-1.0x)   │ 1.203 (-1.0x)   │
│ (In Memory) orjson(str) vs orjson(bytes)           │ 0.577   │ 0.580   │ 0.578   │ 0.576 (1.0x)    │ 0.580 (-1.0x)   │ 0.578 (1.0x)    │
│ (In Memory) json(decoded bytes) vs json(bytes)     │ 1.193   │ 1.200   │ 1.196   │ 1.200 (-1.0x)   │ 1.207 (-1.0x)   │ 1.203 (-1.0x)   │
│ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.585   │ 0.590   │ 0.587   │ 0.579 (1.0x)    │ 0.585 (1.0x)    │ 0.582 (1.0x)    │
│ (From File) rt json vs rt orjson                   │ 1.832   │ 1.855   │ 1.839   │ 1.219 (1.5x)    │ 1.220 (1.5x)    │ 1.220 (1.5x)    │
│ (From File) rt json vs rb json                     │ 1.834   │ 1.836   │ 1.835   │ 1.633 (1.1x)    │ 1.649 (1.1x)    │ 1.641 (1.1x)    │
│ (From File) rt orjson vs rb orjson                 │ 1.217   │ 1.226   │ 1.221   │ 1.000 (1.2x)    │ 1.003 (1.2x)    │ 1.002 (1.2x)    │
│ (From File) rb json vs rb orjson                   │ 1.635   │ 1.643   │ 1.638   │ 1.000 (1.6x)    │ 1.010 (1.6x)    │ 1.007 (1.6x)    │
└────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘
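For context, the bytes-parsing behaviour the description relies on can be sketched as follows; the payload and file name here are illustrative, not from the repository:

```python
import json

# json.loads (and likewise orjson.loads) accept bytes as well as str;
# no explicit .decode("utf-8") step is needed before parsing.
payload = b'{"op": "mcm", "id": 1}'  # illustrative Betfair-style message
assert json.loads(payload) == json.loads(payload.decode("utf-8"))

# Reading in binary mode skips Python's text-decoding layer entirely,
# which is where the 1.1-1.2x speedup was measured:
# with open("history.jsonl", "rb") as f:  # hypothetical file
#     for line in f:
#         update = json.loads(line)
```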
(Force-pushed from 891f91e to 6bb96f5)
Just doing some real-world testing and I am finding this to be considerably slower when using orjson, ~5%.
The benchmark is quite synthetic, so it's very possible it's worse in real use. Revert the merge and I'll see if I can make the benchmark more realistic. It's also possible it varies due to hardware.
I did some very rough-and-ready testing against some of my recorded data and rb+orjson is still ahead for me. Could be the benchmark, or it could be differences in hardware. If you prefer, I can make this configurable somehow?
This is testing on 22 markets but 44 files, py3.11 / orjson / mac M1, using smart_open.

r:

ncalls    tottime  percall  cumtime  percall
195090    0.589    0.000    13.538   0.000

ncalls    tottime  cumtime
37582     0.021    0.827    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:303(read1)
1444735   0.113    0.113    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323(closed)

rb:

ncalls    tottime  percall  cumtime  percall
195090    0.484    0.000    14.357   0.000

ncalls    tottime  cumtime
88        0.000    0.000    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323(closed)
1444647   0.460    1.811    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:397(readline)
I have run the benchmark on an M1 MacBook and there I see no difference between rt and rb with orjson (versus roughly 1.2x better on Windows). So something is going on, and clearly this is not a universal improvement. I will attempt to dig into why the behaviour is different, but I suggest reverting this for now.
Ok, I've been digging into this. It turns out that on Windows Python opens text-mode files using the cp1252 encoding by default, which is an extended-ASCII encoding. My files are UTF-8 encoded, and if I open them with an explicit UTF-8 encoding the difference goes away. So the speedup seen here with 'rb' is due to bypassing the cp1252 codec, and is Windows-specific. Please revert this change. What I'll do instead is make the file encoding parametrizable. I think the vast majority of people have UTF-8 encoded historical data, but it's possible a few don't (if they are running flumine on Windows), and switching over the default encoding outright could break things for them.
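The platform-default behaviour described here can be demonstrated directly. `locale.getpreferredencoding(False)` is what `open()` consults when no `encoding=` is given (cp1252 on many Windows installs, UTF-8 on most Linux/macOS systems); the file path below is illustrative:

```python
import locale
import os
import tempfile

# What open() uses when encoding= is omitted: cp1252 on typical Windows
# installs, UTF-8 on most Linux/macOS systems.
print(locale.getpreferredencoding(False))

# Passing encoding="utf-8" explicitly sidesteps the platform default,
# so UTF-8 files read identically everywhere (path is illustrative):
path = os.path.join(tempfile.mkdtemp(), "market.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write('{"runner": "Señor"}\n')

with open(path, "r", encoding="utf-8") as f:
    data = f.read()
assert "Señor" in data
```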
I took a look into making the encoding parametrizable and, as things are architected now, it's kind of a pain and ugly. So, what could be done instead is hardcoding encoding='utf-8' for every file open.
In Python 3.15 the default encoding for open() will change to UTF-8 (PEP 686). If you merge/release this change as-is, it will also effectively change the encoding to UTF-8 everywhere (because the raw bytes will be passed to the JSON parser, which will not work properly with bytes that are actually encoded as cp1252). So, please don't release this as-is; consider defaulting to UTF-8, and if not, know that one day that will happen whether you want it to or not :)
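One concrete consequence worth noting: json.loads on bytes always decodes as UTF-8 (per the JSON RFC), so passing raw bytes through is equivalent to forcing UTF-8 everywhere and will reject data that is genuinely cp1252-only. A small sketch:

```python
import json

text = '{"name": "Größe"}'          # contains ö/ß, encoded two ways below
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

# bytes input to json.loads is always treated as UTF-8:
assert json.loads(utf8_bytes)["name"] == "Größe"

# ...so cp1252-encoded bytes that are not valid UTF-8 fail to parse:
try:
    json.loads(cp1252_bytes)
except (UnicodeDecodeError, ValueError):
    print("cp1252 bytes rejected: not valid UTF-8 JSON input")
```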
Thanks for looking into this. I have reverted the binary change but made a few small optimisations which make a fair amount of difference on my machines (I haven't benchmarked).
Have you run some numbers on how much RAM you might need vs how many files you have open? If you exceed RAM then the OS will have to start paging to disk and everything will start slowing down again. I haven't done the math myself, to be clear :)