evaluation list is slow #415
Comments
This is still very slow.
This already takes more than 5 minutes for me: https://www.openml.org/api/v1/json/evaluation/list/tag/study_14/limit/1. Is it just me? How do you manage to work with OpenML if this is not working? @PhilippPro @DanielKuehn87 what do you do? Is this also slow for you?
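For reference, a minimal sketch of reproducing the timing from Python, assuming the `requests` package is available (the URL is the one quoted above):

```python
# Minimal sketch: time the evaluation list endpoint quoted above.
import time

import requests

URL = "https://www.openml.org/api/v1/json/evaluation/list/tag/study_14/limit/1"

start = time.time()
response = requests.get(URL, timeout=600)  # generous timeout; the call can take minutes
elapsed = time.time() - start

print(f"HTTP {response.status_code}: {len(response.content)} bytes in {elapsed:.1f} s")
```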
I will look at this tonight.
On Debian:
This is not a big problem for us, as we have everything in a local database. There are other very slow functions as well; see also openml/openml-r#348.
Hm, the thing is: when people start building their own (local) database because OpenML does not meet their requirements, that is a clear sign that something is wrong with OpenML, right?
I don't necessarily think so. With a database this large, it is natural that you cannot fetch everything over the internet; that simply takes too long.
Guys, this is a software issue. We 'just' need to speed up the API call. I already looked at this last night. The query can be made faster, but most of the time goes to writing out all the data (same with the dataset and task lists). I'll do more experiments today. I hope to cut down the time server-side, but worst case we need to restrict the amount of data we have to write and send over the internet.
If you want to set up your own local DB, that's fine, OpenML is designed to allow that, but obviously this should not be required for all users. Maybe we can even provide code to help people run OpenML more locally, but this is not high priority for now.
I agree with restricting the amount of data. But shouldn't this be better solved by the frontend (i.e. the clients) and not by the backend? Otherwise we just end up with issues like openml/openml-r#275.
I think we need to restrict the amount of data in a better way. Currently, I don't like how the evaluation list is "limiting" the amount of data; the limiting could be done better by filtering w.r.t. measures (#423), as I think the most common use case in benchmarking is to focus on one or a small number of measures and to ask the server for all runs w.r.t. those measures (currently we are returning many different measures for only an incomplete subset of runs if the limit is reached). If we only restrict the number of runs, people will still do chunked requests to get all runs anyway. But that might just be my impression.
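A sketch of what such a measure-filtered request could look like from a client. The `function/<measure>` path filter is assumed here, not confirmed; the actual filter name and syntax depend on how #423 is implemented, and the JSON layout below is also an assumption:

```python
# Sketch of the measure-filtered request proposed above (filter syntax assumed).
import requests

MEASURE = "predictive_accuracy"
URL = (
    "https://www.openml.org/api/v1/json/evaluation/list"
    f"/function/{MEASURE}/tag/study_14"
)

response = requests.get(URL, timeout=600)
response.raise_for_status()
payload = response.json()

# The exact JSON layout is assumed; inspect `payload` for the real structure.
evaluations = payload.get("evaluations", {}).get("evaluation", [])
print(f"{len(evaluations)} evaluation rows for measure '{MEASURE}'")
```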
That's exactly what I mean. It's not really about the number of results; it's that we (have to) return the results for all evaluation measures, which means writing a lot of JSON/XML.
One option is to make the evaluation measure required, or to only return one 'default' measure if none is stated.
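A rough sketch of the defaulting rule being proposed. The production backend is PHP, so this Python snippet only illustrates the intended behaviour; the default measure and all names are made up for illustration:

```python
# Illustrative only: shows the proposed defaulting rule, not the real backend code.
DEFAULT_MEASURE = "predictive_accuracy"  # assumed default; not confirmed anywhere

def resolve_measure(requested_measure=None):
    """Return the single measure the endpoint should filter on.

    Instead of returning every measure for a (possibly truncated) subset of runs,
    the endpoint would return one measure for all matching runs: the one the
    client asked for, or a default one if none was given.
    """
    return requested_measure or DEFAULT_MEASURE

print(resolve_measure())                        # -> predictive_accuracy
print(resolve_measure("area_under_roc_curve"))  # -> area_under_roc_curve
```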
OK, what about just creating an image of the OpenML database in SQLite (or some other open database format) and giving users a way to download the whole thing? I guess the OpenML database is something like 10 GB in size, so if I want to run a larger analysis, it might be better to just download the whole database instead of running several queries against the API, which transfers the data via XML/JSON.
Under Guide > Developers there is a link to a nightly snapshot of the database. It has the most useful stuff (not everything; that would be over 100 GB).
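If the snapshot route works for you, a minimal sketch of querying a local copy, assuming it has been converted into SQLite as suggested above; the file, table, and column names below are illustrative, not the real snapshot schema:

```python
# Minimal sketch: query a locally downloaded snapshot instead of the API.
import sqlite3

# Assumed local file name; the real snapshot format and schema may differ.
conn = sqlite3.connect("openml_snapshot.db")

# Hypothetical query: all evaluations of one measure, with no API round trips
# and no XML/JSON serialisation overhead. Table/column names are illustrative.
rows = conn.execute(
    "SELECT run_id, value FROM evaluation WHERE function = ?",
    ("predictive_accuracy",),
).fetchall()

print(f"{len(rows)} local evaluation rows")
conn.close()
```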
The link is not working currently, but this is helpful for me. Thanks.
Snapshot link is fixed.
I just submitted a fix for the slow queries. Under review now.
Fix is running on production.
@joaquinvanschoren knows the solution and has to implement it