Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] #691

Conversation

petedmarsh
Copy link
Contributor

@petedmarsh petedmarsh commented Sep 2, 2023

Both json and orjson can parse JSON directly from raw bytes, they
don't require the bytes to first be decoded to str.

It turns out that reading files in binary mode and parsing the bytes
directly to JSON is 1.1 to 1.2 times faster than reading the raw data
in text mode.

This does mean there is a type discrepancy between
StreamListener.on_data and HistoricListener.on_data (str vs bytes
respectively). It wouldn't be difficult to change betfairlightweight
to have StreamListender.on_data accept bytes, and while it would be
an API change I doubt anyone would need to change any code as more or
less any .on_data implementation needs to parse the data to JSON
immediately anyway and that would work the same with str or bytes.
From my benchmark I can't see any significant difference in str vs
bytes inside betfairlightweight and as such it seems to me to be
more prudent to bend the type hinting a little here and not make
any changes in betfairlightweight itself.

Benchmark

https://gist.github.com/petedmarsh/802f1d2cde79d957afcc744d63d34347

                                                       Benchmarks, repeat=5, number=5
┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐
│                                          Benchmark │ Min     │ Max     │ Mean    │ Min (+)         │ Max (+)         │ Mean (+)        │
├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤
│               (In Memory) json(str) vs json(bytes) │ 1.188   │ 1.194   │ 1.192   │ 1.199 (-1.0x)   │ 1.208 (-1.0x)   │ 1.203 (-1.0x)   │
│           (In Memory) orjson(str) vs orjson(bytes) │ 0.577   │ 0.580   │ 0.578   │ 0.576 (1.0x)    │ 0.580 (-1.0x)   │ 0.578 (1.0x)    │
│     (In Memory) json(decoded bytes) vs json(bytes) │ 1.193   │ 1.200   │ 1.196   │ 1.200 (-1.0x)   │ 1.207 (-1.0x)   │ 1.203 (-1.0x)   │
│ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.585   │ 0.590   │ 0.587   │ 0.579 (1.0x)    │ 0.585 (1.0x)    │ 0.582 (1.0x)    │
│                   (From File) rt json vs rt orjson │ 1.832   │ 1.855   │ 1.839   │ 1.219 (1.5x)    │ 1.220 (1.5x)    │ 1.220 (1.5x)    │
│                     (From File) rt json vs rb json │ 1.834   │ 1.836   │ 1.835   │ 1.633 (1.1x)    │ 1.649 (1.1x)    │ 1.641 (1.1x)    │
│                 (From File) rt orjson vs rb orjson │ 1.217   │ 1.226   │ 1.221   │ 1.000 (1.2x)    │ 1.003 (1.2x)    │ 1.002 (1.2x)    │
│                   (From File) rb json vs rb orjson │ 1.635   │ 1.643   │ 1.638   │ 1.000 (1.6x)    │ 1.010 (1.6x)    │ 1.007 (1.6x)    │
└────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

Both json and orjson can parse JSON directly from raw bytes, they
don't require the bytes to first be decoded to str.

It turns out that reading files in binary mode and parsing the bytes
directly to JSON is 1.1 to 1.2 times faster than reading the raw data
in text mode.

This does mean there is a type discrepancy between
StreamListener.on_data and HistoricListener.on_data (str vs bytes
respectively). It wouldn't be difficult to change betfairlightweight
to have StreamListender.on_data accept bytes, and while it would be
an API change I doubt anyone would need to change any code as more or
less any .on_data implementation needs to parse the data to JSON
immediately anyway and that would work the same with str or bytes.
From my benchmark I can't see any significant difference in str vs
bytes inside betfairlightweight and as such it seems to me to be
more prudent to bend the type hinting a little here and not make
any changes in betfairlightweight itself.

Benchmark

https://gist.github.com/petedmarsh/802f1d2cde79d957afcc744d63d34347

                                                           Benchmarks, repeat=5, number=5
    ┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐
    │                                          Benchmark │ Min     │ Max     │ Mean    │ Min (+)         │ Max (+)         │ Mean (+)        │
    ├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤
    │               (In Memory) json(str) vs json(bytes) │ 1.188   │ 1.194   │ 1.192   │ 1.199 (-1.0x)   │ 1.208 (-1.0x)   │ 1.203 (-1.0x)   │
    │           (In Memory) orjson(str) vs orjson(bytes) │ 0.577   │ 0.580   │ 0.578   │ 0.576 (1.0x)    │ 0.580 (-1.0x)   │ 0.578 (1.0x)    │
    │     (In Memory) json(decoded bytes) vs json(bytes) │ 1.193   │ 1.200   │ 1.196   │ 1.200 (-1.0x)   │ 1.207 (-1.0x)   │ 1.203 (-1.0x)   │
    │ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.585   │ 0.590   │ 0.587   │ 0.579 (1.0x)    │ 0.585 (1.0x)    │ 0.582 (1.0x)    │
    │                   (From File) rt json vs rt orjson │ 1.832   │ 1.855   │ 1.839   │ 1.219 (1.5x)    │ 1.220 (1.5x)    │ 1.220 (1.5x)    │
    │                     (From File) rt json vs rb json │ 1.834   │ 1.836   │ 1.835   │ 1.633 (1.1x)    │ 1.649 (1.1x)    │ 1.641 (1.1x)    │
    │                 (From File) rt orjson vs rb orjson │ 1.217   │ 1.226   │ 1.221   │ 1.000 (1.2x)    │ 1.003 (1.2x)    │ 1.002 (1.2x)    │
    │                   (From File) rb json vs rb orjson │ 1.635   │ 1.643   │ 1.638   │ 1.000 (1.6x)    │ 1.010 (1.6x)    │ 1.007 (1.6x)    │
    └────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘
@liampauling liampauling changed the base branch from master to release/2.5.0 September 11, 2023 14:44
@liampauling liampauling merged commit 5c97a15 into betcode-org:release/2.5.0 Sep 11, 2023
6 checks passed
@liampauling
Copy link
Member

Just doing some real world testing and I am finding this to be considerably slower when using orjson, ~5%

@petedmarsh
Copy link
Contributor Author

petedmarsh commented Sep 11, 2023 via email

@petedmarsh
Copy link
Contributor Author

                                                      Benchmarks, repeat=5, number=5
┌────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────────────┬─────────────────┬─────────────────┐
│                                          Benchmark │ Min     │ Max     │ Mean    │ Min (+)         │ Max (+)         │ Mean (+)        │
├────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────────────┼─────────────────┼─────────────────┤
│               (In Memory) json(str) vs json(bytes) │ 0.004   │ 0.004   │ 0.004   │ 0.004 (-1.0x)   │ 0.004 (1.0x)    │ 0.004 (1.0x)    │
│           (In Memory) orjson(str) vs orjson(bytes) │ 0.002   │ 0.002   │ 0.002   │ 0.002 (1.0x)    │ 0.002 (1.1x)    │ 0.002 (1.0x)    │
│     (In Memory) json(decoded bytes) vs json(bytes) │ 0.004   │ 0.004   │ 0.004   │ 0.004 (-1.0x)   │ 0.004 (-1.0x)   │ 0.004 (-1.0x)   │
│ (In Memory) orjson(decoded bytes) vs orjson(bytes) │ 0.004   │ 0.004   │ 0.004   │ 0.002 (2.0x)    │ 0.002 (1.9x)    │ 0.002 (2.0x)    │
│                   (From File) rt json vs rt orjson │ 4.183   │ 4.200   │ 4.190   │ 1.558 (2.7x)    │ 1.563 (2.7x)    │ 1.560 (2.7x)    │
│                     (From File) rt json vs rb json │ 4.192   │ 4.512   │ 4.294   │ 4.506 (-1.1x)   │ 4.734 (-1.0x)   │ 4.632 (-1.1x)   │
│                 (From File) rt orjson vs rb orjson │ 1.587   │ 1.634   │ 1.608   │ 1.239 (1.3x)    │ 1.311 (1.2x)    │ 1.275 (1.3x)    │
│                   (From File) rb json vs rb orjson │ 4.443   │ 4.617   │ 4.501   │ 1.317 (3.4x)    │ 1.406 (3.3x)    │ 1.379 (3.3x)    │
└────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

I did some very rough and ready testing against some of my recorded data and rb+orjson is still ahead for me. Could be the benchmark or it could be differences in hardware. If you prefer I can make this configurable somehow?

@liampauling
Copy link
Member

liampauling commented Sep 12, 2023

This is testing on 22 markets but 44 files, py3.22 / orjson / mac M1 using smart_open

r

[ncalls] [tottime] percall  [cumtime] percall
195090    0.589    0.000   13.538    0.000

[ncalls] [tottime] [cumtime] 
37582    0.021    0.827  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:303[read1]
1444735    0.113    0.113  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323[closed]

rb

[ncalls] [tottime] percall  [cumtime] percall
195090    0.484    0.000   14.357    0.000

[ncalls] [tottime] [cumtime]
88    0.000    0.000  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:323[closed]
1444647    0.460    1.811  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py:397[readline]

@petedmarsh
Copy link
Contributor Author

petedmarsh commented Sep 12, 2023 via email

@petedmarsh
Copy link
Contributor Author

Ok I've been digging into this. It turns out that on Windows Python opens files using the cp1252 encoding which is a extended-ascii encoding. My files are UTF-8 encoded and if I open then with open('...', encoding='utf-8') then everything speeds up on Windows.

So the speedup here seen with 'rb' is due to not using the cp1252 codec and is Windows specific.

Please revert this change.

What I'll do is make the file encoding parametrizable instead. I think the vast majority of people have UTF-8 encoded historical data but it's possible a few don't (if they are running flumine on Windows) and straight switching over the default encoding could possibly break things for them.

@petedmarsh
Copy link
Contributor Author

@liampauling

I took a look into making the encoding parametrizable and as things are architected now it's kind of a pain and ugly.

So, what could be done is hardcoding 'encoding=utf8' for every open. This would have the following consequences:

  1. For Linux and OSX no consequence whatsoever
  2. For Windows instead of reading files as cp1252 files will be read as utf8 and reading historical files will be faster as the utf8 codec is much faster than the cp1252 codec on this historical files (as they are largely effectively ASCII). However, anyone who has been recording data on Windows and not setting an encoding explicitly will have been saving data cp1252 encoded and so such a change could cause those people to be unable to open their historical data files. It is not that hard to convert them to utf8 but they would have to do so.

In Python 3.15 the default encoding for open will become utf8 so in the nearish future this change will be forced in any case.

If you merge/release this change as is it will also effectively change the encoding to utf8 everywhere (because the raw bytes will be passed to the JSON parser and it will not work properly with bytes that are actually encoding cp1252.

So, please don't release this as-is, consider defaulting to utf8 and if not know that one day that will happen weather you want it to or not :)

@petedmarsh petedmarsh deleted the read-historical-files-as-bytes branch September 16, 2023 19:24
@petedmarsh petedmarsh changed the title Open historical data files in binary mode (it's faster) Open historical data files in binary mode (it's faster) [Edit: on Windows, it turns out] Sep 16, 2023
@liampauling
Copy link
Member

Thanks for looking into this, I have reverted the binary change but made a few small optimisations which make a fair amount of difference on my machines (I haven't benchmarked)

2ba747d

@petedmarsh
Copy link
Contributor Author

petedmarsh commented Sep 18, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants