Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691

petedmarsh · 2023-09-02T20:55:32Z

Both json and orjson can parse JSON directly from raw bytes, they
don't require the bytes to first be decoded to str.

It turns out that reading files in binary mode and parsing the bytes
directly to JSON is 1.1 to 1.2 times faster than reading the raw data
in text mode.

This does mean there is a type discrepancy between
StreamListener.on_data and HistoricListener.on_data (str vs bytes
respectively). It wouldn't be difficult to change betfairlightweight
to have StreamListender.on_data accept bytes, and while it would be
an API change I doubt anyone would need to change any code as more or
less any .on_data implementation needs to parse the data to JSON
immediately anyway and that would work the same with str or bytes.
From my benchmark I can't see any significant difference in str vs
bytes inside betfairlightweight and as such it seems to me to be
more prudent to bend the type hinting a little here and not make
any changes in betfairlightweight itself.

Benchmark

https://gist.github.com/petedmarsh/802f1d2cde79d957afcc744d63d34347

                                                       Benchmarks, repeat=5, number=5
┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐
│                                          Benchmark │ Min     │ Max     │ Mean    │ Min (+)         │ Max (+)         │ Mean (+)        │
├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤
│               (In Memory) json(str) vs json(bytes) │ 1.188   │ 1.194   │ 1.192   │ 1.199 (-1.0x)   │ 1.208 (-1.0x)   │ 1.203 (-1.0x)   │
│           (In Memory) orjson(str) vs orjson(bytes) │ 0.577   │ 0.580   │ 0.578   │ 0.576 (1.0x)    │ 0.580 (-1.0x)   │ 0.578 (1.0x)    │
│     (In Memory) json(decoded bytes) vs json(bytes) │ 1.193   │ 1.200   │ 1.196   │ 1.200 (-1.0x)   │ 1.207 (-1.0x)   │ 1.203 (-1.0x)   │
│ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.585   │ 0.590   │ 0.587   │ 0.579 (1.0x)    │ 0.585 (1.0x)    │ 0.582 (1.0x)    │
│                   (From File) rt json vs rt orjson │ 1.832   │ 1.855   │ 1.839   │ 1.219 (1.5x)    │ 1.220 (1.5x)    │ 1.220 (1.5x)    │
│                     (From File) rt json vs rb json │ 1.834   │ 1.836   │ 1.835   │ 1.633 (1.1x)    │ 1.649 (1.1x)    │ 1.641 (1.1x)    │
│                 (From File) rt orjson vs rb orjson │ 1.217   │ 1.226   │ 1.221   │ 1.000 (1.2x)    │ 1.003 (1.2x)    │ 1.002 (1.2x)    │
│                   (From File) rb json vs rb orjson │ 1.635   │ 1.643   │ 1.638   │ 1.000 (1.6x)    │ 1.010 (1.6x)    │ 1.007 (1.6x)    │
└────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

Both json and orjson can parse JSON directly from raw bytes, they don't require the bytes to first be decoded to str. It turns out that reading files in binary mode and parsing the bytes directly to JSON is 1.1 to 1.2 times faster than reading the raw data in text mode. This does mean there is a type discrepancy between StreamListener.on_data and HistoricListener.on_data (str vs bytes respectively). It wouldn't be difficult to change betfairlightweight to have StreamListender.on_data accept bytes, and while it would be an API change I doubt anyone would need to change any code as more or less any .on_data implementation needs to parse the data to JSON immediately anyway and that would work the same with str or bytes. From my benchmark I can't see any significant difference in str vs bytes inside betfairlightweight and as such it seems to me to be more prudent to bend the type hinting a little here and not make any changes in betfairlightweight itself. Benchmark https://gist.github.com/petedmarsh/802f1d2cde79d957afcc744d63d34347 Benchmarks, repeat=5, number=5 ┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐ │ Benchmark │ Min │ Max │ Mean │ Min (+) │ Max (+) │ Mean (+) │ ├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤ │ (In Memory) json(str) vs json(bytes) │ 1.188 │ 1.194 │ 1.192 │ 1.199 (-1.0x) │ 1.208 (-1.0x) │ 1.203 (-1.0x) │ │ (In Memory) orjson(str) vs orjson(bytes) │ 0.577 │ 0.580 │ 0.578 │ 0.576 (1.0x) │ 0.580 (-1.0x) │ 0.578 (1.0x) │ │ (In Memory) json(decoded bytes) vs json(bytes) │ 1.193 │ 1.200 │ 1.196 │ 1.200 (-1.0x) │ 1.207 (-1.0x) │ 1.203 (-1.0x) │ │ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.585 │ 0.590 │ 0.587 │ 0.579 (1.0x) │ 0.585 (1.0x) │ 0.582 (1.0x) │ │ (From File) rt json vs rt orjson │ 1.832 │ 1.855 │ 1.839 │ 1.219 (1.5x) │ 1.220 (1.5x) │ 1.220 (1.5x) │ │ (From File) rt json vs rb json │ 1.834 │ 1.836 │ 1.835 │ 1.633 (1.1x) │ 1.649 (1.1x) │ 1.641 (1.1x) │ │ (From File) rt orjson vs rb orjson │ 1.217 │ 1.226 │ 1.221 │ 1.000 (1.2x) │ 1.003 (1.2x) │ 1.002 (1.2x) │ │ (From File) rb json vs rb orjson │ 1.635 │ 1.643 │ 1.638 │ 1.000 (1.6x) │ 1.010 (1.6x) │ 1.007 (1.6x) │ └────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

liampauling · 2023-09-11T14:54:01Z

Just doing some real world testing and I am finding this to be considerably slower when using orjson, ~5%

petedmarsh · 2023-09-11T15:35:33Z

The benchmark is quite synthetic so it's very possible IRL it's worse. Revert the merge and I'll see if I can make the benchmark realistic. Also possible it varies due to hardware.

…

On Mon, 11 Sept 2023, 16:54 Liam, ***@***.***> wrote: Just doing some real world testing and I am finding this to be considerably slower when using orjson, ~5% — Reply to this email directly, view it on GitHub <#691 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJDC6SA5GPLSJYPMAGLGKTXZ4QZLANCNFSM6AAAAAA4I2HK5Y> . You are receiving this because you authored the thread.Message ID: ***@***.***>

petedmarsh · 2023-09-11T20:01:50Z

                                                      Benchmarks, repeat=5, number=5
┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐
│                                          Benchmark │ Min     │ Max     │ Mean    │ Min (+)         │ Max (+)         │ Mean (+)        │
├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤
│               (In Memory) json(str) vs json(bytes) │ 0.004   │ 0.004   │ 0.004   │ 0.004 (-1.0x)   │ 0.004 (1.0x)    │ 0.004 (1.0x)    │
│           (In Memory) orjson(str) vs orjson(bytes) │ 0.002   │ 0.002   │ 0.002   │ 0.002 (1.0x)    │ 0.002 (1.1x)    │ 0.002 (1.0x)    │
│     (In Memory) json(decoded bytes) vs json(bytes) │ 0.004   │ 0.004   │ 0.004   │ 0.004 (-1.0x)   │ 0.004 (-1.0x)   │ 0.004 (-1.0x)   │
│ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.004   │ 0.004   │ 0.004   │ 0.002 (2.0x)    │ 0.002 (1.9x)    │ 0.002 (2.0x)    │
│                   (From File) rt json vs rt orjson │ 4.183   │ 4.200   │ 4.190   │ 1.558 (2.7x)    │ 1.563 (2.7x)    │ 1.560 (2.7x)    │
│                     (From File) rt json vs rb json │ 4.192   │ 4.512   │ 4.294   │ 4.506 (-1.1x)   │ 4.734 (-1.0x)   │ 4.632 (-1.1x)   │
│                 (From File) rt orjson vs rb orjson │ 1.587   │ 1.634   │ 1.608   │ 1.239 (1.3x)    │ 1.311 (1.2x)    │ 1.275 (1.3x)    │
│                   (From File) rb json vs rb orjson │ 4.443   │ 4.617   │ 4.501   │ 1.317 (3.4x)    │ 1.406 (3.3x)    │ 1.379 (3.3x)    │
└────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

I did some very rough and ready testing against some of my recorded data and rb+orjson is still ahead for me. Could be the benchmark or it could be differences in hardware. If you prefer I can make this configurable somehow?

liampauling · 2023-09-12T05:26:31Z

This is testing on 22 markets but 44 files, py3.22 / orjson / mac M1 using smart_open

r

[ncalls] [tottime] percall  [cumtime] percall
195090    0.589    0.000   13.538    0.000

[ncalls] [tottime] [cumtime] 
37582    0.021    0.827  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:303[read1]
1444735    0.113    0.113  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323[closed]

rb

[ncalls] [tottime] percall  [cumtime] percall
195090    0.484    0.000   14.357    0.000

[ncalls] [tottime] [cumtime]
88    0.000    0.000  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323[closed]
1444647    0.460    1.811  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:397[readline]

petedmarsh · 2023-09-12T11:06:57Z

I have run the benchmark on an M1 mac book and there I see no difference between rt and rb with orjson (versus 1.2x betterish on Windows). So something is going in and clearly this is not a universal improvement, I will attempt to dig into why the behaviour is different but I suggest reverting this for now.

…

On Tue, 12 Sept 2023, 07:26 Liam, ***@***.***> wrote: This is testing on 22 markets but 44 files, orjson / mac M1 using smart_open r [ncalls] [tottime] percall [cumtime] percall195090 0.589 0.000 13.538 0.000 [ncalls] [tottime] [cumtime] 37582 0.021 0.827 /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:303[read1]1444735 0.113 0.113 /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323[closed] rb [ncalls] [tottime] percall [cumtime] percall195090 0.484 0.000 14.357 0.000 [ncalls] [tottime] [cumtime]88 0.000 0.000 /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323[closed]1444647 0.460 1.811 /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:397[readline] — Reply to this email directly, view it on GitHub <#691 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJDC6QPIYW7DC4636TELK3XZ7XBDANCNFSM6AAAAAA4I2HK5Y> . You are receiving this because you authored the thread.Message ID: ***@***.***>

petedmarsh · 2023-09-13T16:53:37Z

Ok I've been digging into this. It turns out that on Windows Python opens files using the cp1252 encoding which is a extended-ascii encoding. My files are UTF-8 encoded and if I open then with open('...', encoding='utf-8') then everything speeds up on Windows.

So the speedup here seen with 'rb' is due to not using the cp1252 codec and is Windows specific.

Please revert this change.

What I'll do is make the file encoding parametrizable instead. I think the vast majority of people have UTF-8 encoded historical data but it's possible a few don't (if they are running flumine on Windows) and straight switching over the default encoding could possibly break things for them.

petedmarsh · 2023-09-16T19:24:07Z

@liampauling

I took a look into making the encoding parametrizable and as things are architected now it's kind of a pain and ugly.

So, what could be done is hardcoding 'encoding=utf8' for every open. This would have the following consequences:

For Linux and OSX no consequence whatsoever
For Windows instead of reading files as cp1252 files will be read as utf8 and reading historical files will be faster as the utf8 codec is much faster than the cp1252 codec on this historical files (as they are largely effectively ASCII). However, anyone who has been recording data on Windows and not setting an encoding explicitly will have been saving data cp1252 encoded and so such a change could cause those people to be unable to open their historical data files. It is not that hard to convert them to utf8 but they would have to do so.

In Python 3.15 the default encoding for open will become utf8 so in the nearish future this change will be forced in any case.

If you merge/release this change as is it will also effectively change the encoding to utf8 everywhere (because the raw bytes will be passed to the JSON parser and it will not work properly with bytes that are actually encoding cp1252.

So, please don't release this as-is, consider defaulting to utf8 and if not know that one day that will happen weather you want it to or not :)

liampauling · 2023-09-18T07:22:52Z

Thanks for looking into this, I have reverted the binary change but made a few small optimisations which make a fair amount of difference on my machines (I haven't benchmarked)

2ba747d

petedmarsh · 2023-09-18T12:30:29Z

Have you run some numbers on how much RAM you might need vs how many files you have open? If you exceed RAM then the OS will have to start paging to disk and everything will start slowig down again. I haven't done the math myself to be clear :)

…

On Mon, 18 Sept 2023, 09:23 Liam, ***@***.***> wrote: Thanks for looking into this, I have reverted the binary change but made a few small optimisations which make a fair amount of difference on my machines (I haven't benchmarked) 2ba747d <2ba747d> — Reply to this email directly, view it on GitHub <#691 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJDC6SLALCBP4WSVLGUAPDX27ZFNANCNFSM6AAAAAA4I2HK5Y> . You are receiving this because you authored the thread.Message ID: ***@***.***>

petedmarsh force-pushed the read-historical-files-as-bytes branch from 190d0fe to 891f91e Compare September 2, 2023 20:59

petedmarsh force-pushed the read-historical-files-as-bytes branch from 891f91e to 6bb96f5 Compare September 3, 2023 19:15

liampauling approved these changes Sep 11, 2023

View reviewed changes

liampauling changed the base branch from master to release/2.5.0 September 11, 2023 14:44

liampauling merged commit 5c97a15 into betcode-org:release/2.5.0 Sep 11, 2023
6 checks passed

petedmarsh deleted the read-historical-files-as-bytes branch September 16, 2023 19:24

petedmarsh changed the title ~~Open historical data files in binary mode (it's faster)~~ Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] Sep 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691

Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691

petedmarsh commented Sep 2, 2023 •

edited

Loading

liampauling commented Sep 11, 2023

petedmarsh commented Sep 11, 2023 via email

petedmarsh commented Sep 11, 2023

liampauling commented Sep 12, 2023 •

edited

Loading

petedmarsh commented Sep 12, 2023 via email

petedmarsh commented Sep 13, 2023

petedmarsh commented Sep 16, 2023

liampauling commented Sep 18, 2023

petedmarsh commented Sep 18, 2023 via email

Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691

Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691

Conversation

petedmarsh commented Sep 2, 2023 • edited Loading

liampauling commented Sep 11, 2023

petedmarsh commented Sep 11, 2023 via email

petedmarsh commented Sep 11, 2023

liampauling commented Sep 12, 2023 • edited Loading

petedmarsh commented Sep 12, 2023 via email

petedmarsh commented Sep 13, 2023

petedmarsh commented Sep 16, 2023

liampauling commented Sep 18, 2023

petedmarsh commented Sep 18, 2023 via email

petedmarsh commented Sep 2, 2023 •

edited

Loading

liampauling commented Sep 12, 2023 •

edited

Loading