Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing fetch ken embeddings test #900

Open
jeromedockes opened this issue Apr 25, 2024 · 2 comments
Open

Failing fetch ken embeddings test #900

jeromedockes opened this issue Apr 25, 2024 · 2 comments
Labels
CI / build Continuous integration and build

Comments

@jeromedockes
Copy link
Member

jeromedockes commented Apr 25, 2024

This test has started failing, saying it does not find the footer of the downloaded parquet file.

I cannot reproduce locally; I am wondering if it could be that the file is only partly downloaded but it is weird that it only seems to happen in the python 3.11 environment.
edit: actually it happens randomly in the several macos CI runs, either during the first or second fetch_ken_embeddings call in the test, and while loading varying parts of the parquet table. So that would confirm that it is a failing download. We should make the download more robust and compare a checksum to get a better error message when it fails, and mock downloads in the tests. In the meanwhile I would suggest that we skip this test as it is bound to be flaky and it is currently causing CI to fail in all PRs that are unrelated to ken embeddings, WDYT @jovan-stojanovic ?

In any case we should avoid tests that rely on network resources and mock the downloads, especially tests that download large files. So we need a quick fix for this test but in the medium term we should mock downloads in the main test suite

@jeromedockes jeromedockes added the CI / build Continuous integration and build label Apr 25, 2024
@jeromedockes
Copy link
Member Author

the ken embedding tests download 864M of data and takes over a minute to run locally so we probably want to start with those.
A good part of the data seems to be the embedding type ID (file 39266300) which is 652M; @jovan-stojanovic do you know if this is one of the files that were recently updated on figshare?

@jeromedockes
Copy link
Member Author

I am also wondering about the way files are downloaded: the remote file is opened with pyarrow, then chunks of it are loaded into pandas dataframes and written in separate files with pandas.to_parquet.
However, in all cases the whole file is downloaded, and when using the skrub fetcher, the whole file is loaded into memory. So why do we use this approach rather than streaming the download into a single local file?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI / build Continuous integration and build
Projects
None yet
Development

No branches or pull requests

1 participant