Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complex filters? #54

Closed
chozabu opened this issue Jan 14, 2016 · 26 comments
Closed

Complex filters? #54

chozabu opened this issue Jan 14, 2016 · 26 comments

Comments

@chozabu
Copy link

chozabu commented Jan 14, 2016

How complex can our filters be?

Say I have a Post model:

class Post:
    user = models.ForeignKey(User, related_name='posts')
    title = models.CharField(max_length=200)

I want to find users who have an old post about gmail AND a new post about ymail

uo = User.objects
uo= uo.filter(posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009")
uo= uo.filter(posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015")

I don't see how I could currently do this with DRFF? I'd actually like to be able to do much more complex filters too

@chozabu
Copy link
Author

chozabu commented Jan 14, 2016

I should note - I've already written some code to support this functionality in plain django - https://gist.github.com/chozabu/86b60caa0ce211f232da - the function takes a table name, list of filters, list of excludes, sort method and gives back JSON.

So, my above question would be answered like:

http://endpoint?table=User&filters=[[posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009"], [ posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015"]]

(probably got URL formatting wrong, could be using JSON post there)

I'm not convinced this method is production-ready (hence saying do not use in its title), and would be much more interested replicating the important bits in DFF - interested to hear thoughts

@philipn
Copy link
Owner

philipn commented Jan 14, 2016

I believe the only problem here is the repeated posts__title__icontains. To accomplish something like this we could allow for a repeated-filter-key heuristic. I'm not sure what the HTTP spec says about repeating query string keys?

@chozabu
Copy link
Author

chozabu commented Jan 14, 2016

@philipn so - if we were to assume the HTTP spec allows for repeated query string keys - and they were decoded into a list of pairs, with order maintained, We could do a query like:

http://endpoint/api/users/?posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009",  posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015"
#rather than:
http://endpoint?table=User&filters=[[posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009"], [ posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015"]]
# note the lack of [[filters], [filters]] list structure

But still would it work? if the filters are all applied in one go, or one at a time, rather than in two "filtersets" would we not get results about users with gmail and ymail post in the wrong dates - or no results atall?

@chozabu
Copy link
Author

chozabu commented Jan 15, 2016

To expand on above thoughts:

http://endpoint/api/users/?posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009",  posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015"

would resolve to either: (4 separate filters)

User.objects.filter(posts__title__icontains="gmail").filter(posts__publish_date__lte="01/01/2009").filter(posts__title__icontains="ymail").filter( posts__publish_date__gte="01/01/2015")

or (1 combined filter)

User.objects.filter(posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009", posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015")

but what we really want is: (two separate filters)

User.objects.filter(posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009").filter(posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015")

@philipn
Copy link
Owner

philipn commented Jan 17, 2016

Ah, I see what you're saying, now. I'm not sure what / if we should do here. Do other libraries allow queries as expressive as these? It's almost as if we would want to duplicate all possible operations on querysets -- that could get messy.

Maybe we should suggest, for people who want multiple levels of queryset operations, that they define the operation explicitly on the FilterSet instead?

@chozabu
Copy link
Author

chozabu commented Jan 18, 2016

All possible operations on a query set would be a bit much, including annotation and aggregation - though, I'd sure be interested in such a lib!

I'm not aware of libraries that allow this functionality - though I'd be surprised if something does not exist...

I can see how multiple querysets/excludesets are "on the line" - though this functionality is sure something that would be very handy - particularly on top of the sensible filter/security systems in DRFF.

I'm not sure of a compatible way of encoding into the url query section - it seems sending lists of filtersets is vital.

@rpkilby
Copy link
Collaborator

rpkilby commented Jan 18, 2016

it seems sending lists of filtersets is vital.

That's kind of the problem with the standard query string syntax - it's a single set of key/value pairs. We would need to define a more complex syntax that supports multiple filter sets. If we defined a multiple set syntax, then that would also allow us to add in OR and negation operations.

eg, something like:

/api/widgets?filter=[field1=value1:field2!=value2] | ![field3=value3:field4=value4]

Which would map to

f1 = Widget.objects.filter(field1=value1).exclude(field2=value2)
f2 = Widget.objects.ftiler(field3=value3).filter(field4=value4)
return f1 | ~f2

@chozabu As to your original question. Even if we implement this more complex syntax, your problem wouldn't be completely solved. The behavior you describe in your original post only applies to filters spanning multi-valued relationships. django-filter processes each filter serially, resulting in

uo.filter(posts__title__icontains="gmail").filter(posts__publish_date__lte="01/01/2009")

not

uo.filter(posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009")

You would probably need to create a method filter that takes both values and does the combined filtering manually.

eg,

/api/users?old_post_title=gmail,01/01/2009&recent_post_title=ymail,01/01/2015

@rpkilby
Copy link
Collaborator

rpkilby commented Jan 18, 2016

Also, a couple of quick notes on query strings:

  • The query string syntax is a single set of key/value pairs. There is no builtin syntax for grouping parameters.
  • There is no official standard on handling repeated keys, although most web frameworks do support them. More info here.
  • Django handles query strings with their QueryDict class. It supports repeated keys, although it's up to the individual form field/FilterSet filter on how to handle multiple values for a key. For example, multiple select fields accept multiple values, while an integer field would only accept one value.

@philipn
Copy link
Owner

philipn commented Jan 27, 2016

I've just pushed out a new release (0.7.0), so you should be able to do a normal pip install to see the fixes here.

@chozabu
Copy link
Author

chozabu commented Feb 1, 2016

So, for query strings, it looks like there are non-standard ways of passing arrays - http://stackoverflow.com/a/9547490/445831

But perhaps when a query complexity is getting to this level, it could make more sense to POST the query as json?

Also - nice links, was not aware you could link to a line range on github! And 0.7.0 seems good :)

@chozabu
Copy link
Author

chozabu commented Jul 30, 2016

I'm hitting a point in my project where Think kind of functionality would be really handy

Writing quite a few filters like:

admin_of_group = filters.MethodFilter()
def filter_admin_of_group(self, oname, qs, value):
    name = 'group_memberships__group'
    name2 = 'group_memberships__role'
    if LOOKUP_SEP in oname:
        rel, _ = oname.rsplit(LOOKUP_SEP, 1)
        name = LOOKUP_SEP.join([rel, name])
        name2 = LOOKUP_SEP.join([rel, name2])
    return qs.filter(**{name: int(value), name2: "admin"})

where my models are like:

class GroupMembership(models.Model):
    group = models.ForeignKey(RGroup, related_name='members')
    role = models.CharField(max_length=100, default='member')
    user = models.ForeignKey(User, related_name='group_memberships')
class Group(models.Model):
    name = models.CharField(max_length=100)

I can query like

/api/questions/?user__admin_of_group=6

This works - with some limitations, hard to find the admin of two groups, becomes a bit more complex for some other filters like:

/api/questions/?tag_rated_gte=cheese,13

to find questions tagged cheese, with a rating above or equal to 13. but, I cannot easily query for two tags matching some criteria.

And If i want to be able to filter by the existing matching methods (icontains,iexact,gte,gt,lte,lt etc), and the model has a few different fields to filter on, and there are quite a few more models to do the same tasks for - this looks more and more like something that should be automated.

It is somewhat tempting to accept syntax like

 /api/questions/?custom_tag_filter=(("name", "cheese"),("rated__gte", 13)),(("name", "tomato"),("rated__gte", 13))

or even

 /api/questions/?custom_filterset=("tag",(("name", "cheese"),("rated__gte", 13)),(("name", "tomato"),("rated__gte", 13))), (group,()

Which I think should be happily parsable by literal_eval. not to awful to look at, and not too hard to do a backend for.

Main problem is "the easy way" would be to filter directly on the users given values - which is as secure as a wet paper wall...

Is there an easy/practical way I can check if a filterstring (like tag__name ) is a valid filter in DRFF?

Hope you don't mind the rambling post!

@rpkilby
Copy link
Collaborator

rpkilby commented Jul 30, 2016

Is it necessary to do this all in one request? The problems you're running into only apply when filtering across relationships. If you remove the relationship from the query and use multiple requests, the issue is no longer relevant.

You could get your user ids through:

/api/group_memberships?role=admin&group__in=6,7

Let's say the above query returns memberships for users 1 and 2. You can filter your questions by doing:

/api/questions?user__in=1,2

The same applies to filtering your tags.

/api/tags?name__in=cheese,tomato&rated__gte=13
/api/questions?tag__in=1,2

@chozabu
Copy link
Author

chozabu commented Jul 31, 2016

@rpkilby I am doing similar things in quite a few places, for something like group admin it is just kinda nice to not have to make the extra request, but there are many situations where it will not scale very well - tags being quite a good example

say I have 1million questions, the first 750k are tagged cheese, and the last 750k tomato.
Something like

/api/tags?name__in=cheese,tomato&rated__gte=2

could well return ~400k items. assuming we were doing something like this

/api/tags/id_only/?name__in=cheese,tomato&rated__gte=2

which I do do on many routes, returning flat_list('id') to minimise the amount of data being passed - it still looks rather heavy - and with the length of a GET url expected to stay under 2000 chars (system dependant) it is way out there (even dividing numbers above by 1000 we are still at 4k ids, and many more chars)

@rpkilby
Copy link
Collaborator

rpkilby commented Jul 31, 2016

See #99 - I'm thinking about a reimplementation based on subqueries, which should address the issue. As this thread points out, users are generally going to want to filter criteria together, not separately.

In the meantime, I'm not sure why you have 400k tags for two names. You should just have the two tags, right?

I'd argue that the extra request is okay if it's relatively small. eg, getting a handful of groups or tags. Obviously, this is problematic for several hundreds or hundreds of thousands of results.

@chozabu
Copy link
Author

chozabu commented Jul 31, 2016

@rpkilby #99 looks interesting - the same direction for sure - though I'm not sure what your plan is for how this will look in a url?

As for the tags - sorry, I may have oversimplified my example a little, current setup is:

Table of Posts
Table of TagConnections
Table of Tags

each Post can have multiple TagConnections to relate it to a Tag
Users can vote on tags - and tags can be something like "Easy", "Ethical" or "Important" - and several stats are kept about the vote results (Avg, Sum, Deviation, etc)
So, it would be probable (even desirable) to have tags like easy/important attached to most questions

I was prototyping a system along these lines in plain 'ol Django last year - Though no longer developed it may be interesting to have a glance at this page: http://voteflow.net/agora/woi/ (may require a second load to work) - it lets you build a query and graph your choice of two data items
click the blue "Filters" button to see my rather messy JQuery GUI query constructor - a somewhat basic but powerful system - but it is backed by the gist I pasted at the top - which is no good for systems that contain private data

Hope you don't mind the rambling posts, I'm usually more concise - #99 looks like it may be a better start to the same conversation :)

@rpkilby
Copy link
Collaborator

rpkilby commented Aug 1, 2016

btw, it looks like I mistyped the subquery example in #99 - it's been corrected.

though I'm not sure what your plan is for how this will look in a url

The idea is that this request:

/api/users?memberships__role=admin&memberships__group__in=6,7

would result in this query+subquery:

User.objects.filter(
    group_memberships=GroupMembership.objects.filter(
        role="admin"
    ).filter(
        group__in=[6, 7]
    )
)

which would ideally return admin users for groups 6 and 7. See the first TODO in #99, as the above query may not actually work as intended.

As for the tags - sorry, I may have oversimplified my example a little

Ah, yeah - the through model would present a problem as all of those results would be distinct. This should be solved by the subquery approach (again, probably...). eg,

/api/questions?connection__rated_gte=13&connection__tag__name__in=cheese,tomato

Note that the above doesn't solve your original problem of being able to achieve:

uo = User.objects
uo = uo.filter(posts__title__icontains="gmail", posts__publish_date__lte="01/01/2009") 
uo = uo.filter(posts__title__icontains="ymail", posts__publish_date__gte="01/01/2015")

which is related to your point here:

It is somewhat tempting to accept syntax like

/api/questions/?custom_filterset=("tag",(("name":"cheese"),("rated__gte", 13)),(("name":"tomato"),("rated__gte", 13)))

At this point, there isn't anything much to do except to define a custom set syntax, urlencode it, then parse it inside the FilterBackend. You would need to manually combine the FilterSet results. It's been on my mind, but it seems somewhat out of scope for the moment.

@chozabu
Copy link
Author

chozabu commented Aug 1, 2016

@rpkilby Ah-ha, good correction - the subquery now makes more sense - though, I think perhaps it should be filter instead of exclude to get the same results?

I too mistyped my custom query syntax thoughts, using : instead if , - now corrected

And I see what you mean - combining the parts of the query by default - which to me is by no means ideal, but sure sounds like a better default than not doing it

thinking more on custom syntax for complex queries a more refined & flexible version could look like

/api/questions/?filtersets=((("tag__name","cheese"),("tag__rated__gte",13)),(("tag__name","tomato"),("tag__rated__gte",3)))&excludesets=((("tag__name","pineapple"),("tag__rated__gte",3)),(("tag__name","ham"),("tag__rated__gte",5)))

which de-uglified looks like

filtersets = (
    (("tag__name", "cheese"), ("tag__rated__gte", 13)),
    (("tag__name", "tomato"), ("tag__rated__gte", 3))
)
excludesets = (
    (("tag__name", "pineapple"), ("tag__rated__gte", 3)),
    (("tag__name", "ham"), ("tag__rated__gte", 5))
)

only uses () and , - which should be valid in URLs - and is trivial to decode on the backend - literal_eval will turn it into a tuple

This would provide all the query power I want from the frontend (and then some!)

To me - the hard part would be getting it to work as part of DRFF, using it to prevent querys that reveal private data - but perhaps I should get out the elbow grease and look into implementing it.

Would you be interested in having this functionality in DRFF?

@rpkilby
Copy link
Collaborator

rpkilby commented Aug 2, 2016

I think perhaps it should be filter instead of exclude to get the same results

Ach, copied it from the django docs, forgot to change the example to use filtering.

And to answer in reverse:

Would you be interested in having this functionality in DRFF?

Something well tested and well specified - I'd definitely be interested. :)
@philipn may have other thoughts as it's potentially complex? At the very least, this should integrate well as another third-party package - the functionality should be able to be implemented purely in a FilterBackend (the permissions issue would be handled by the FilterSet class in drf-filters).

To me - the hard part would be getting it to work as part of DRFF, using it to prevent querys that reveal private data - but perhaps I should get out the elbow grease and look into implementing it.

I've put down some thoughts in #100. The problem is that we're currently filtering on the model's default queryset (aka, all of the objects) without taking into account default filtering based on the request user. For example, a posts end point may allow the logged in user to view all of their posts + all other published posts - all other draft posts. Default user filtering is usually handled in DRF ViewSet's .get_queryset() method. If the subquery approach works for related filter - this gives us a great opportunity to integrate the querysets provided by DRF's viewsets.

The downside is that this couples the FilterSet to the request, but I think we can structure this in a way that makes request handling optional and dry.

but perhaps I should get out the elbow grease and look into implementing it.

By all means, feel free to take a crack at it. It's definitely on my todo list too, and I have every intent of tackling these issues. Current priorities are:

  • Fixup PRs for django-filter, as drf-filters is dependent on some of that work (such as MethodFilter cleanup, DRF FilterSet integration, etc...).
  • v0.9.0 release - this integrates the above django-filter changes, fixes bugs in milestone.
  • v0.10.0 release - anything that comes up between now and v1.0.0?
  • v1.0.0 release - subquery approach to filtering, permissions fixes/viewset integration, set syntax (detailed below).

thinking more on custom syntax for complex queries a more refined & flexible version could look like

Your example seems like a good start for another, more complex FilterBackend. A couple of ideas I've been floating around:

  • I would stick to querystrings as the fundamental syntax unit - something that is natively recognized by the urlencoding/urldecoding. eg, key1=value&key2=value. You could run into escaping issues otherwise - maybe you mean air quotes, but what if you really mean to use "air quotes"? Alternatively, you could use a json dict for each tuple. Important thing is to use a well defined syntax - not something ad hoc. Something ad hoc would be more prone to bugs.

  • I would stick with python's existing set semantics instead of having separate filtersets and excludesets.

    • & (intersection)
    • | (union)
    • - (difference/exclusion)

    For example (using a, b, c etc as key=value shorthand):

    GET /api?filter=(a&b&c) | (d&e) - (f&g)

    or more practically,

    GET /api/users?filter=(posts__title__icontains=gmail&posts__publish_date__lte=2009-01-01) 
                         &(posts__title__icontains=ymail&posts__publish_date__gte=2015-01-01)

    which translates to:

    f1 = User.objects.filter(
        posts=Post.objects
                  .filter(title__icontains="gmail")
                  .filter(publish_date__lte=datetime(2009, 1, 1)))
    f2 = User.objects.filter(
        posts=Post.objects
                  .filter(title__icontains="ymail")
                  .filter(publish_date__gte=datetime(2015, 1, 1)))
    return f1 & f2

As far as getting around URL syntax, I think this could actually be done by two levels of url escaping. escape the key=value sets first, then wrap them in the set syntax, then escape that :P.

Again, with the email posts example:

>>> from urllib2 import quote
>>> set1 = quote('posts__title__icontains=gmail&posts__publish_date__lte=2009-01-01')
>>> set2 = quote('posts__title__icontains=ymail&posts__publish_date__gte=2015-01-01')
>>> set1
'posts__title__icontains%3Dgmail%26posts__publish_date__lte%3D2009-01-01'

>>> f = quote("(%s) & (%s)" % (set1, set2))
>>> f
'%28posts__title__icontains%253Dgmail%2526posts__publish_date__lte%253D2009-01-01%29%20%26%20%28posts__title__icontains%253Dymail%2526posts__publish_date__gte%253D2015-01-01%29'

That value should be entirely querystring safe, so you could then just

GET /api/users?%28posts__title__icontains%253Dgmail%2526posts__publish_date__lte%253D2009-01-01%29%20%26%20%28posts__title__icontains%253Dymail%2526posts__publish_date__gte%253D2015-01-01%29

@chozabu
Copy link
Author

chozabu commented Aug 4, 2016

@rpkilby That looks fantastic - perhaps harder to implement, and URLs have more chance of being a bit ugly, but very worth it for the extra features and safety

Few thoughts:

How does the performance of a large sub-query compare to doing it as a filterset? (hopefully very similar?)

are there situations where we would need sub-sub-querys?

set1 = quote('posts__title__icontains=gmail&posts__publish_date__lte=2009-01-01'&posts__author__age__gte=32&posts__author__popularity__gte=50)

Perhaps not too important - I think this subquery approach is generally better. This also means should be able do even more complex querys and remain secure/private.

On the system I am working on, we have a voting system - but some votes may be private. For normal querys, this is an issue - as I would like to be able to filter "Questions" based on a users vote - but only for public votes. I don't think this can be done without a subquery. (which can be done clientside)

Currently not much of a security issue, but with the system I was suggesting for filtersets it would be

@rpkilby
Copy link
Collaborator

rpkilby commented Aug 4, 2016

URLs have more chance of being a bit ugly, but very worth it for the extra features and safety

The URLs are going to be ugly. I guarantee it.
Luckily, users won't see the URL because ajax.

How does the performance of a large sub-query compare to doing it as a filterset? (hopefully very similar?)

No clue. It'd be worth coming up with some complex examples, then looking at the differences in the resulting SQL and query plans.

are there situations where we would need sub-sub-querys?

Yes. As far as I can tell, nested subqueries should work just fine. Related filters should handle this automatically, as they're processed recursively.

@chozabu
Copy link
Author

chozabu commented Aug 5, 2016

Just performed a very quick test to compare subquerys: https://gist.github.com/chozabu/e2f7e9e56f828feaaf390bd01627de1f

Looks like subquerys are very slightly slower in my test.
only 2k items in query - could still do with performing a much more complex test

some info on the output attached in this post

    #some more relevant parts of the SQL
    print("--subsql--")
    #INNER JOIN "represent_app_question" ON ("represent_app_userextra"."id" = "represent_app_question"."user_id") WHERE "represent_app_question"."id" IN (SELECT U0."id" FROM "represent_app_question" U0 WHERE U0."direct_vote_count" > 1)
    print("--directsql--")
    #INNER JOIN "represent_app_question" ON ("represent_app_userextra"."id" = "represent_app_question"."user_id") WHERE "represent_app_question"."direct_vote_count" > 1
    print("----")

    #both querys return 1724 results at time of writing
    print("subavg:", subavg*.1)
    print("directavg:", directavg*.1)

    #overall results - subquery seems to be slightly slower to do subquery
    # subavg: 0.18400719165802004
    # directavg: 0.18088221549987793

    #another run
    # subavg: 0.17612528800964355
    # directavg: 0.16919872760772706

@chozabu
Copy link
Author

chozabu commented Aug 5, 2016

Funnily enough - looking at http://scottlobdell.me/2015/01/sql-database-best-practices-django-orm/ it is suggested that list() should be called on subquerys before being passed into the next query, but this seemed to have a negative performance impact in my little test:

subavg: 0.18407721519470216
listsubavg: 0.24150981903076174
directavg: 0.17892849445343018

the SQL for subavg and listsubavg look identical - though could do with a better test (and/or more reading) to make sure this holds for a more complex situation - I guess the extra time here is evaluating the query, and turning it back into a fairly optimal query for second part

mistest - the list evaluated subquery does indeed pass in all the ids from the initial query

updated gist here: https://gist.github.com/chozabu/6f41a359423c7ef07171279471d981d0

@rpkilby
Copy link
Collaborator

rpkilby commented Aug 7, 2016

Yeah, I would probably disregard the article as it comes from a complete lack of understanding. As you saw, evaluating the query w/ list() was a third slower than with a join/subquery (which the author strongly recommends against).

@rpkilby
Copy link
Collaborator

rpkilby commented Nov 16, 2016

@chozabu - btw, still churning on related work. Some interesting results.

Anyway, as a quick note on subqueries the two below are actually a bit different:

q1 = User.objects.filter(
    posts=Post.objects
              .filter(title__icontains="gmail")
              .filter(publish_date__lte=datetime(2009, 1, 1)))

q2 = User.objects.filter(
    id__in=Post.objects
              .filter(title__icontains="gmail")
              .filter(publish_date__lte=datetime(2009, 1, 1))
              .values('author'))

Both use a subquery, but q1 relies on a join while q2 is just a subquery. I'd be curious to see if there's any performance difference there.

@rpkilby rpkilby modified the milestones: v0.10.0, v1.0.0 Apr 19, 2017
@rpkilby rpkilby removed this from the v0.10.0 milestone Apr 19, 2017
@rpkilby
Copy link
Collaborator

rpkilby commented Oct 25, 2017

Hi @chozabu, #198 is a first pass at this implementation.

@rpkilby
Copy link
Collaborator

rpkilby commented Dec 30, 2017

Closed by #198.

@rpkilby rpkilby closed this as completed Dec 30, 2017
@rpkilby rpkilby removed this from the v1.0.0 milestone Jul 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants