evaluation list is slow #415
Comments
This is still very slow.
This already takes more than 5 minutes for me: https://www.openml.org/api/v1/json/evaluation/list/tag/study_14/limit/1. Is it just me? How do you manage to work with OpenML if this is not working? @PhilippPro @DanielKuehn87 what do you do? Is this also slow for you?
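For reference, a minimal sketch of reproducing the timing from Python, assuming the `requests` package is available (the URL is the one quoted above):

```python
# Minimal sketch: time the evaluation list endpoint quoted above.
import time

import requests

URL = "https://www.openml.org/api/v1/json/evaluation/list/tag/study_14/limit/1"

start = time.time()
response = requests.get(URL, timeout=600)  # generous timeout; the call can take minutes
elapsed = time.time() - start

print(f"HTTP {response.status_code}: {len(response.content)} bytes in {elapsed:.1f} s")
```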
I will look at this tonight.
On Debian:
This is not a big problem for us, as we have everything in a local database. There are other very slow functions as well; see also openml/openml-r#348.
Hm, the thing is: when people start building their own (local) database because OpenML does not meet their requirements, that is a clear sign that something is wrong with OpenML, right?
I don't necessarily think so. With a database this large, it is natural that you cannot fetch everything over the internet; that simply takes too long.
Guys, this is a software issue. We 'just' need to speed up the API call. I already looked at this last night. The query can be made faster, but most of the time goes to writing out all the data (same with the dataset and task lists). I'll do more experiments today. I hope to cut down the time server-side, but worst case we need to restrict the amount of data we have to write and send over the internet.
If you want to set up your own local DB, that's fine, OpenML is designed to allow that, but obviously this should not be required for all users. Maybe we can even provide code to help people run OpenML more locally, but this is not high priority for now.
I agree with restricting the amount of data. But shouldn't this be better solved by the frontend (i.e. the clients) and not by the backend? Otherwise we just end up with issues like openml/openml-r#275.
I think we need to restrict the amount of data in a better way. Currently, I don't like how the evaluation list is "limiting" the amount of data; the limiting could be done better by filtering w.r.t. measures (#423), as I think the most common use case in benchmarking is to focus on one or a small number of measures and to ask the server for all runs w.r.t. those measures (currently we are returning many different measures for only an incomplete subset of runs if the limit is reached). If we only restrict the number of runs, people will still do chunked requests to get all runs anyway. But that might just be my impression.
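A sketch of what such a measure-filtered request could look like from a client. The `function/<measure>` path filter is assumed here, not confirmed; the actual filter name and syntax depend on how #423 is implemented, and the JSON layout below is also an assumption:

```python
# Sketch of the measure-filtered request proposed above (filter syntax assumed).
import requests

MEASURE = "predictive_accuracy"
URL = (
    "https://www.openml.org/api/v1/json/evaluation/list"
    f"/function/{MEASURE}/tag/study_14"
)

response = requests.get(URL, timeout=600)
response.raise_for_status()
payload = response.json()

# The exact JSON layout is assumed; inspect `payload` for the real structure.
evaluations = payload.get("evaluations", {}).get("evaluation", [])
print(f"{len(evaluations)} evaluation rows for measure '{MEASURE}'")
```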
That's exactly what I mean. It's not really about the number of results; it's that we (have to) return the results for all evaluation measures, which means writing a lot of JSON/XML.
One option is to make the evaluation measure required, or to only return one 'default' measure if none is stated.
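A rough sketch of the defaulting rule being proposed. The production backend is PHP, so this Python snippet only illustrates the intended behaviour; the default measure and all names are made up for illustration:

```python
# Illustrative only: shows the proposed defaulting rule, not the real backend code.
DEFAULT_MEASURE = "predictive_accuracy"  # assumed default; not confirmed anywhere

def resolve_measure(requested_measure=None):
    """Return the single measure the endpoint should filter on.

    Instead of returning every measure for a (possibly truncated) subset of runs,
    the endpoint would return one measure for all matching runs: the one the
    client asked for, or a default one if none was given.
    """
    return requested_measure or DEFAULT_MEASURE

print(resolve_measure())                        # -> predictive_accuracy
print(resolve_measure("area_under_roc_curve"))  # -> area_under_roc_curve
```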
OK, what about just creating an image of the OpenML database in SQLite (or some other open database format) and giving users a way to download the whole thing? I guess the OpenML database is something like 10 GB in size, so if I want to run a larger analysis, it might be better to just download the whole database instead of running several queries against the API, which transfers the data via XML/JSON.
Under Guide > Developers there is a link to a nightly snapshot of the database. It has the most useful stuff (not everything; that would be over 100 GB).
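If the snapshot route works for you, a minimal sketch of querying a local copy, assuming it has been converted into SQLite as suggested above; the file, table, and column names below are illustrative, not the real snapshot schema:

```python
# Minimal sketch: query a locally downloaded snapshot instead of the API.
import sqlite3

# Assumed local file name; the real snapshot format and schema may differ.
conn = sqlite3.connect("openml_snapshot.db")

# Hypothetical query: all evaluations of one measure, with no API round trips
# and no XML/JSON serialisation overhead. Table/column names are illustrative.
rows = conn.execute(
    "SELECT run_id, value FROM evaluation WHERE function = ?",
    ("predictive_accuracy",),
).fetchall()

print(f"{len(rows)} local evaluation rows")
conn.close()
```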
The link is not working currently, but this is helpful for me. Thanks.
Snapshot link is fixed.
I just submitted a fix for the slow queries. Under review now.
Fix is running on production.
@joaquinvanschoren knows the solution and has to implement it