
GetBars/GetQuotes/GetTrades/etc. cannot be paginated manually and have no way of indicating progress #248

Open
probably-not opened this issue Jun 6, 2023 · 17 comments


@probably-not
Contributor

When fetching historical data via the Market Data API, there is no way to paginate the data manually (via the page tokens specified in the docs), and there is no way to see the progress of the internal pagination.

This is pretty problematic:

  • It can lead to explosions in the amount of memory used (for example, if requesting several months of trades/quotes, since everything is appended into a gigantic slice)
  • It can lead to clients not being able to indicate progress on the request (fetching more than a few weeks of data can take a large amount of time and the program pretty much just hangs...)

In the v2 version of the module, there was a GetxxxAsync version of each of the functions, which let the user stream the entities over a channel. This solved both of the above issues:

  • Data wasn't appended to a giant slice, so the memory could be managed by the user (you can choose to append to a slice in memory, or write to a DB/file/etc. and not hold on to the memory)
  • Progress could be reported as the entities were streamed to the consumer of the channel.

I understand that in v3 the Async functions were removed (which sucks for me, I was using them heavily to stream historical data through my systems for simulations and backtesting), but since the PageToken (on the request) and NextPageToken (on the response) values aren't exposed on the client, I cannot even implement pagination/streaming manually using the official module, which is a huge problem...

Can we either:

  • Return the Async versions of the functions to the module, OR
  • Expose the PageToken on the request structs and NextPageToken on the function responses, so that pagination can be handled manually
@JayBusch

The PR associated with this issue was never merged/accepted. I would really like to know if the maintainers will accept a PR that supports manual paging, or if there is a way to use it now that I am not aware of.

@gnvk
Collaborator

gnvk commented Aug 26, 2024

@JayBusch I completely support the idea of adding back functions that let you avoid loading huge amounts of data into memory by returning the requested data (trades, quotes, bars) in chunks.

However, I don't really support exposing the page token or the internals of pagination on the API because I cannot really think of a real reason any user would want to work with them. This is why I didn't merge #249 but asked for some changes. Honestly I'm not sure why @raja closed it though.

In v2 we had functions postfixed with "Async" that did exactly what you wanted (also mentioned by @probably-not in this issue), but I removed them from v3 for two reasons:

  1. They confused a lot of users, who assumed they would provide live data, not historical data.
  2. They returned data through a channel, one item at a time. This, as I have since learned, is not very performant.

Coincidentally 😄, Go 1.23 was released last week with support for iterators. I see this as a good opportunity to finally fix this issue in a Go-native way, without the aforementioned problems.

One thing I'm a bit unsure about at the moment is whether we would want to iterate over the pages, or the items themselves, e.g.

// Returns an iterator that yields all the trades one by one
func GetTradesIter(symbol string, req GetTradesRequest) func(yield func(Trade, error) bool)

vs

// Returns an iterator that yields the pages of the trades
func GetTradesPaged(symbol string, req GetTradesRequest) func(yield func([]Trade, error) bool)
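For reference, this is how a caller on Go 1.23+ could consume the per-item form with a range-over-func loop. GetTradesIter is only the hypothetical function sketched above (it does not exist in the module today), and the request/field names are illustrative:

// Sketch only: assumes a marketdata.GetTradesIter with the signature proposed above.
req := marketdata.GetTradesRequest{
	Start: time.Date(2024, 1, 2, 0, 0, 0, 0, time.UTC),
	End:   time.Date(2024, 1, 3, 0, 0, 0, 0, time.UTC),
}
// On Go 1.23+, a func(yield func(Trade, error) bool) can be ranged over directly;
// the loop body becomes the yield callback, so nothing is buffered in one big slice.
for trade, err := range marketdata.GetTradesIter("AAPL", req) {
	if err != nil {
		log.Fatalf("fetching trades failed: %v", err)
	}
	fmt.Println(trade.Timestamp, trade.Price, trade.Size)
}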

@JayBusch

@gnvk thank you for your quick and informed response to my question.

I am glad you see a path forward on this and agree that a "go native" approach is likely the best solution. I am not sure, however, why you are reluctant to expose things like the page token or paging "internals" when these mechanisms are already exposed to end users by the API itself. As for why they are useful to users of the library, I think my use case is a good example: I want to retrieve data and load it into a local database without worrying that a large data set will plague my application with OOM errors. This may now be a moot point if the iterator approach is pursued, but just to clarify my position: anything exposed by the API should be made available to the end user of the library. It makes sense to provide convenience functions for users who don't want to manage these details, but the details should not be hidden from users who, for their own reasons, want to make use of them; hiding them risks devaluing the library in favor of users simply writing their own REST clients to access the full API feature set.

Regarding the choice of iterators for individual records vs. pages, I would hold that both may be useful, and supporting both would not be double the work. If one must be chosen over the other, I think it makes the most sense to go with what the API currently supports, which is by page. If I am batch loading data into a database, I may find it more useful to process a page at a time, or to aggregate some number of pages into a batch. If my only option is to iterate over each individual record to build a batch for efficient submission to the database, that would likely become a bottleneck in my code. I can see the value in per-record iterators as well, albeit not for any use case I currently have. Perhaps if someone has an algorithm that requires pulling down small sets of records, using them for a calculation, and then discarding them, this could be useful; in most other circumstances the data is likely to be loaded into local files or databases for offline/local and efficient computation.

If you would let me know what design would likely be accepted, I would be happy to put together a PR. I have already forked the repo and implemented a solution nearly identical to @probably-not's (I didn't fork his repo because I wanted to understand the code in depth, but as I implemented it, it basically became a copy of his solution). I have this working for my own project now, but I would very much prefer not to maintain an ongoing fork of the library. Just let me know what direction will have the support of the team, so any time I invest in a PR is not spent in futility, and I will get started right away.

@gnvk
Collaborator

gnvk commented Aug 27, 2024

I still don't see a valid example for exposing the page token :D But I'm not reluctant to do it, if you insist.

Regarding the question of iterating through pages vs records, providing both is in fact probably the best solution.

I can't overstate how thrilled I am to hear that you'd be happy to work on this. In the next 5 days I'm going to be unavailable, but after that I'll be happy to review your PR and make sure it gets merged once it's in an acceptable shape. Moreover, please realise that this is not an easy/small task: there are a lot of functions, data types, and endpoints. This change is going to be big and it will most certainly require some iterations. If you're okay with this, again, I'm very happy to accept your offer. If not, I can also do this next week.

Anyway, the first step is to agree on the public API. Based on your suggestion now I'm proposing these:

  • Per record iterator, single symbol, no support for page token. I'm okay with skipping this.
func GetTradesIter(symbol string, req GetTradesRequest) func(yield func(Trade, error) bool)
  • Per page iterator, single symbol, support for page token
type GetTradesPagedRequest struct {
	GetTradesRequest
	PageToken string
}

type TradesPage struct {
	Trades        []Trade
	NextPageToken string // *string is also acceptable, but I prefer "" over nil
}

func GetTradesPaged(symbol string, req GetTradesPagedRequest) func(yield func(TradesPage, error) bool)
  • Per page iterator, multi symbol, support for page token
type MultiTradesPage struct {
	Trades        map[string][]Trade
	NextPageToken string
}

func GetMultiTradesPaged(symbols []string, req GetMultiTradesPagedRequest) func(yield func(MultiTradesPage, error) bool)

Alternatively, as I said, I'm fine with not adding the page token to either the request or the response, and keeping the original signature for the paged iterators as well.

One implementation detail: there is quite a lot of duplication in the actual paging logic between the endpoints (trades, quotes, etc.). It is kind of hard to deduplicate, because even with generics it's not trivial to find the right abstraction. This problem is not closely related to this issue, but whoever does the actual implementation should maybe think about this as well. One possible (simplified) shape is sketched below.
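Just to illustrate what I mean: the endpoint-specific code would only supply a closure that fetches a single page and returns the next page token. iteratePages and its fetch parameter are hypothetical names, not the library's actual internals, and the real endpoints carry more per-request state than this shows, which is where finding the right abstraction gets tricky:

// A sketch of a generic paging loop; the page type P could be []Trade, []Quote, etc.
func iteratePages[P any](fetch func(pageToken string) (page P, next string, err error)) func(yield func(P, error) bool) {
	return func(yield func(P, error) bool) {
		token := ""
		for {
			page, next, err := fetch(token)
			// Stop when the consumer breaks out, an error occurs, or there are no more pages.
			if !yield(page, err) || err != nil || next == "" {
				return
			}
			token = next
		}
	}
}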

@probably-not
Contributor Author

Hey @gnvk, I'd like to add one note to the conversation here - I would not use Go's native iterators yet. While I'm very excited for them to be in the standard library, they require Go 1.23 (which was just released two weeks ago) as the toolchain in the user's go.mod and any deployment pipelines (which is not necessarily easy for all users), and they are still pretty new... meaning they don't necessarily match the performance of the previous solutions (this has been an open discussion on the Go performance Slack recently).

Unless the plan is to support both native iterators and non-native iterators via build tags, I would avoid using them, as using them would require bumping the library's toolchain version from Go 1.19 to Go 1.23, which would lock those users out of any other updates that come after this.

@gnvk
Collaborator

gnvk commented Aug 27, 2024

@probably-not My suggestion is completely backward compatible. To actually use the iterators, clients will need Go 1.22 or 1.23, but the library can remain on 1.19.

@probably-not
Contributor Author

Ah my mistake @gnvk, I'm just catching up on the thread here - I see your implementation is just returning the functions for iterating, that works 👍

I can probably re-open my PR (not sure why it was closed...). The main reason we left it as-is on our end is that we had already implemented things internally against the original API that my branch exposed, so we just continued with the fork, kept it up to date with the main branch, and merged/rebased our changes. IIRC the implementation that you mentioned is not very far off from what I've got, so it should (hopefully) not be too difficult to adapt and implement.

A question about the Alpaca pagination implementation/note regarding whether or not to expose the PageToken value: Is a PageToken reusable? And if it is, for how long (i.e. does it have an expiration)?

The reason I'm asking is that I do think that exposing it is useful, specifically in situations where there is an error while paginating. If I'm paginating several years of data and I happen to have an error in the middle, having the current PageToken would let me restart from exactly where I was, but not exposing the PageToken would mean that I have to manage the pagination restart on my own, which can get complicated for users.

If the PageToken cannot be reused, then that point becomes moot, and I would say that you are correct in saying that the SDKs should not expose it - and the docs should explicitly mention that they are an internal implementation detail and should not be relied on (so that there is no more confusion by other devs about why the SDK does not fully implement the API).
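To make the restart scenario concrete, here is the kind of thing I have in mind, written against the GetTradesPaged/TradesPage API proposed above (none of this exists yet, and loadCheckpoint, saveCheckpoint, and writeToDB are placeholder helpers for whatever persistence the user has):

// Hypothetical resume-on-failure flow using the proposed paged API.
req := marketdata.GetTradesPagedRequest{
	GetTradesRequest: marketdata.GetTradesRequest{Start: start, End: end},
	PageToken:        loadCheckpoint(), // token persisted by a previous run, "" on the first run
}
for page, err := range marketdata.GetTradesPaged("AAPL", req) {
	if err != nil {
		log.Fatalf("stopped mid-pagination: %v", err) // a rerun resumes from the saved token
	}
	writeToDB(page.Trades)
	saveCheckpoint(page.NextPageToken) // persist progress after every page
}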

@JayBusch

I'm glad I sparked some movement on this issue. Although I'd be happy to work on this, I'll let @probably-not go for it if he wants to. I have much work to do on my own project and even some medical stuff coming up next week that may slow me down. If by then it's not something anyone else is already addressing, I'll go ahead and cut a PR for page iterators as described. @probably-not please let me know if you don't plan to move forward on this.

@gnvk
Collaborator

gnvk commented Sep 2, 2024

@probably-not: A question about the Alpaca pagination implementation/note regarding whether or not to expose the PageToken value: Is a PageToken reusable? And if it is, for how long (i.e. does it have an expiration)?

Yes, it's reusable indefinitely.

@probably-not: If I'm paginating several years of data and I happen to have an error in the middle, having the current PageToken would let me restart from exactly where I was, but not exposing the PageToken would mean that I have to manage the pagination restart on my own, which can get complicated for users.

Implementing this is complicated with or without the page token. There is already retry logic in the client, configurable with the RetryLimit (defaults to 10) and RetryDelay (defaults to one second) options. If you need something more sophisticated than this, I see that as more of an argument to improve the retry logic itself (so that it can also be leveraged by other users) than an argument to give access to the page token.
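For reference, those options are set when constructing the market data client, along these lines (field names as in the current v3 client; adjust if they differ in your version):

client := marketdata.NewClient(marketdata.ClientOpts{
	RetryLimit: 20,              // number of retry attempts, defaults to 10
	RetryDelay: 5 * time.Second, // wait between attempts, defaults to one second
})
trades, err := client.GetTrades("AAPL", marketdata.GetTradesRequest{Start: start, End: end})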

@JayBusch @probably-not How would you like to proceed with this?

@JayBusch

JayBusch commented Sep 3, 2024

I'm recovering from a medical procedure for the next few days, but as I mentioned before, I will be happy to pick this up afterwards if @probably-not is not interested. I would like to give him proper deference, as this was his issue to begin with and he has already submitted a PR.

Regarding the exposure of the page_token, I don't know his exact use case, but I can imagine someone needing access to it if they were building an application to retrieve a large amount of data and wished to restart the process in case the application fails (provided the token could be persisted somehow). I keep the philosophy that a library is a tool to let developers access the full features of an API or other resource without re-writing common boilerplate code, not a means of restricting their access.

I'm happy to implement this in whatever way will allow its acceptance into the code base and that shows ample respect for the work others have put in before me. I'll wait another day or two for my own healing and for any consensus to be reached before starting any new work.

@gnvk
Collaborator

gnvk commented Sep 13, 2024

I'd like to assign this ticket to either one of you, or to myself. Let's make this decision. Since @probably-not already has a PR, I'll ask him first. Please only say yes if you're committed to seeing it through to merge, potentially with several iterations after my feedback. If he says no, I'll ask @JayBusch. If both of you say no, I'll pick it up.

@JayBusch

Sounds good. Just let me know.

@gnvk
Collaborator

gnvk commented Sep 16, 2024

@probably-not Please let us know if you'd like to work on this.

@gnvk
Collaborator

gnvk commented Sep 20, 2024

With respect, I take the silence from @probably-not as a no, and I pass the ball to @JayBusch.

@JayBusch

Ok, sounds good. I'll get started this weekend.

Please send me any implementation detail requirements/guidance you think would be helpful.

@gnvk
Collaborator

gnvk commented Sep 20, 2024

@JayBusch I don't know how much time you have, but I'd recommend only doing one or two methods first, e.g. only stock trades, with tests and everything. I'll review those and if it all looks good, you can do the rest (quotes, bars, crypto, options, etc.).

@JayBusch

Sounds good, I'll take that approach.
