Map-reduce support #61
Comments
Could you elaborate more on how you would use it? Since many activities with the same key will have different params/owners and such, how can you benefit from that?
Suppose we have the same key repeated. The same goes for, let's say, an Article, if an actor updates it.
I have a similar issue: for example, a User's post is liked 100 times by other users. Currently their Activity Feed is going to be spammed with 100 activities instead of a single aggregated entry.
A simple approach is to update the last activity if some conditions are met. That would be the easiest way to handle it for basic use cases.
I think I need to study map-reduce more in depth to answer this question. I thought it was a simple operation for grouping records by fields and getting the count of each group. If it's really more complex than an ORM+DB problem, I will learn it shortly and post some meaningful comments here.
I think associating a post with an activity could solve the problem, as every new comment would set the unread status (or something similar) on the notification. Also, a habtm relation with a join table could track read/unread for each user concerned with that post (the author and those who commented, for example). I think there's a gem for that, but I can't recall the name.
That is a hard topic, one I have contemplated from the very beginning. We can either merge activities on creation or in views. In one of the apps I developed using public_activity I did the grouping in JavaScript, client-side.
It looked and behaved properly, but required fetching many activities (though it was still fast, as the activities were very simple to render). This is the closest I got to Facebook's implementation, but on the client side (Facebook does this somewhere between the database and the display).
Maybe we can use the params, or a parent_id field on the activity, to know how to group activities. Then we query only the parent activities, and for each parent run the sub-queries, grouping, and display logic. The question is how to set up the logic for parent/child creation, along the lines of the sketch below. Does that make sense?
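A hypothetical sketch of this parent_id idea, assuming a `parent_id` integer column has been added to the activities table by a migration; nothing here ships with public_activity:

```ruby
# Hypothetical: assumes `add_column :activities, :parent_id, :integer` has been
# run. Call `assign_parent` from a before_create callback on the activity model.
module ActivityParenting
  def self.assign_parent(activity)
    activity.parent_id ||= PublicActivity::Activity
      .where(key:            activity.key,
             trackable_id:   activity.trackable_id,
             trackable_type: activity.trackable_type,
             parent_id:      nil)
      .order(created_at: :desc)
      .limit(1)
      .pluck(:id)
      .first
  end
end

# Feed query: fetch only parent activities, then summarize children per parent.
parents = PublicActivity::Activity.where(parent_id: nil).order(created_at: :desc)
parents.each do |parent|
  children = PublicActivity::Activity.where(parent_id: parent.id)
  # render the parent plus something like "and #{children.count} more"
end
```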
My take was, on Activity creation, to find the most recent activity with the same key/trackable combination and "attach" the newly created activity to the existing one. Merging activities is trivial on MongoDB, thanks to its schema-less design. Everything relational requires additional columns, and probably associations. As a Facebook example: when two people comment on my post, Facebook merges those notifications into one.
So we need to merge owners into a list, which could be stored as a serialized array (but then we lose querying possibilities in ActiveRecord) or as an additional association, and keep it updated. This is what I did, but in views, so no merging was done in the database. This was far from perfect, as @michelson mentioned, because we have no way of knowing how many activities we are actually going to get.
I'm not a fan of any of the proposed approaches. What I would do is add a callback on activity creation. For example: Mark triggers an activity on something related to a record that already has recent activity, and the callback decides whether to merge the two. This makes sense for this kind of use case.
@stas good thinking. I like the callback idea.
Also true.
I still think that adding a parent_id association is the simplest solution, but I agree that storing the data in the serialized params field is a problem. It is obvious that the check of the keys has to happen in a before_save. What would the cached data be, an array of names, or of ids? The main problem with only caching serialized params is that we also lose the ability to query them. I'll be happy to help with this task. Regards, Atte.
@stas, @pokonski: I took a shot at this in the app I'm using public_activity in.
If the user is viewing the list of notifications on 1/4/13, you might want to group the notifications by when they happened.
What it seems like to me is that there should be a hook when creating an activity that defines how to build an "aggregation key", which could then be used by a group_by call. By default, that aggregation key could be generated by combining a few of the activity's attributes (see the sketch below).
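A minimal sketch of such a hook, under the assumption that the default aggregation key combines the activity's key, trackable and calendar day; the lambda and its default are illustrative, not part of public_activity:

```ruby
# Hypothetical default: activities with the same key, the same trackable and
# the same calendar day fall into one group. Override the lambda per project.
aggregation_key = lambda do |activity|
  [activity.key,
   activity.trackable_type,
   activity.trackable_id,
   activity.created_at.to_date].join("/")
end

activities = PublicActivity::Activity.order(created_at: :desc).limit(200)
grouped    = activities.group_by { |a| aggregation_key.call(a) }

grouped.each do |key, items|
  # One feed entry per group, e.g. "3 people commented on Post #7".
  puts "#{key}: #{items.size} activities"
end
```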
@mewdriller yes, exactly. All of that makes sense. Since it's a gem it needs to provide reasonable defaults with the possibility of changing them, as everyone needs something slightly different.
+1 Just came across public_activity today. We have implemented a custom notifications system, basically doing what public_activity does with a few additions; public_activity does it much more elegantly than our implementation. We only create new aggregates for a user when they become active again after a few hours of inactivity; otherwise we keep aggregating activities into the latest existing aggregate. This way the user gets a clear separation between things that happened while they were away and things happening now. A custom proc can generate the aggregation key as required; another option is to have it return the activity to aggregate the current activity into. There are a few additional issues/features which we tackled that would make sense for public_activity too. I have started a new thread #82 to discuss them further.
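A sketch of that inactivity rule, assuming a four-hour window and an owner-based lookup (both assumptions; the commenter's actual implementation is not shown here):

```ruby
# Hypothetical: returns the activity to fold the new event into, or nil when a
# fresh aggregate should be started (first event, or the user was away).
INACTIVITY_WINDOW = 4.hours

def aggregate_for(owner, key, trackable)
  latest = PublicActivity::Activity
             .where(owner_id: owner.id, owner_type: owner.class.name)
             .order(created_at: :desc)
             .first

  return nil if latest.nil?                                # nothing to aggregate into
  return nil if latest.created_at < INACTIVITY_WINDOW.ago  # user was away: start fresh

  # Still within the same "session": reuse the latest matching activity, if any.
  PublicActivity::Activity
    .where(owner_id: owner.id, owner_type: owner.class.name,
           key: key, trackable_id: trackable.id, trackable_type: trackable.class.name)
    .order(created_at: :desc)
    .first
end
```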
One more important aspect to consider would be translation keys for aggregated activities. Different projects will have different requirements. Notifications can look like "Alice liked your post", "Alice and Bob liked your post", or "Alice, Bob and 3 others liked your post".
A fallback to i18n based solely on the activity key might not be enough in such a scenario, with a varying number of actors visible up front and pluralization to deal with. Developers should have some way to define the translation key and interpolation parameters for each activity.
All this comes down to joining actors, while keeping in mind that different languages have different syntax; English is the easiest one, I think. The implementation in p_a should be generic enough for people to implement this themselves. Rails already has i18n support for joining arrays in a human-readable format: https://github.com/svenfuchs/rails-i18n/blob/master/rails/locale/en.yml#L188
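A small sketch of how the translation side could look; the locale keys, the two-visible-actors cutoff and the joining strategy are all assumptions, not anything public_activity prescribes:

```ruby
require "i18n"

# Hypothetical translation keys for an aggregated "liked" notification.
I18n.backend.store_translations(:en, activity: {
  post: {
    liked_by_few:  "%{owners} liked your post",
    liked_by_many: "%{owners} and %{count} others liked your post"
  }
})

names = ["Alice", "Bob", "Carol", "Dave"]

message =
  if names.size <= 2
    I18n.t("activity.post.liked_by_few",  owners: names.join(" and "))
  else
    I18n.t("activity.post.liked_by_many", owners: names.first(2).join(", "),
                                          count:  names.size - 2)
  end

message # => "Alice, Bob and 2 others liked your post"
```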
Public Activity already offers the flexibility to implement either i18n-based or view-based rendering for entries. I'm using views, and it works great. I don't think this is a blocker.
Yes, definitely not a blocker by any means.
I've extracted the code I've been using for this purpose into a library at mieko/coalesce, which I'm trying to turn into a general-purpose "object aggregation library". Contributions expanding its scope to cover all of the use-cases explored here would most certainly be welcome. It'd be nice to finally nail down this functionality in a way which doesn't invade PA, and would be useful for other purposes. Coalesce works something like:

```ruby
g = Coalesce::Grouper.new do
  rule :create_and_comment do
    same       :id, :owner
    attr_in    :key, 'ticket.create', 'ticket.comment'
    time_delta 5.minutes, from: :first

    combine :key,  with: :smart_key
    combine :text, with: :array
  end
end

activities = [
  Activity.new(id: 5, key: 'ticket.create',  owner: 10, text: 'Data 1'),
  Activity.new(id: 5, key: 'ticket.comment', owner: 10, text: 'Data 2')
]

results = g.each(activities).to_a
# => [Activity(id: 5, key: 'ticket.comment_create', owner: 10, text: ['Data 1', 'Data 2'])]
```

The PA-specific niceties are not included here. The API is still up in the air, and the functionality only implements enough to cover my use-cases. The gem is pre-release, without proper versioning and packaging yet. I'm working on documentation as I have time, so anyone looking to check it out must be willing to code-dive at the moment. With a bit of help, I think this could be turned into a system powerful enough to reasonably express all the complexities of Facebook-style aggregation, while staying maintainable. Regardless, it may be a good starting point for anyone looking for functionality like this.
@mieko That's awesome. Thanks 👍
But does your library reduce on the server side? If not, performance will suffer when it's done in Ruby.
@farnoy Coalesce works at the Ruby object level. I guess performance is a matter of perspective. My target is mobile, and widely-compatible mobile at that (think Android 2.3 on spotty 3G). When we moved to server-side aggregation, with standard Rails caching techniques, both bandwidth requirements and page-load responsiveness improved. As far as our users are concerned, it was a definite performance win. Our servers take a small performance hit on a cache miss. Where an application lives on the SPA-vs-traditional web app gradient will determine whether this approach is appropriate.
@mieko nice idea! Looking forward to hearing more :)
While caching can mask bottlenecks temporarily, it is not a true solution for map-reduce.
Caching is actually the cornerstone performance strategy of large web deployments, but I'm digressing.
While the issue title contains the phrase "map-reduce", the problems described here are more general aggregation. Anyone forcing a MapReduce (the paper) strategy to get the results discussed here would probably be disappointed by how little is gained, since the data is not necessarily independent. A lot of the rules we use are outright sequential and heavily inter-dependent for the sake of "humanized" grouping. Regardless, Coalesce is neither Hadoop nor a JavaScript library, depending on what is expected of a true solution. Deep ORM query generation is also out of its scope. It's a still-incubating library for aggregating Ruby objects which meet certain conditions in prescribed ways.
@farnoy I disagree. Caching is an absolute necessity in web dev: no matter how brilliantly awesome your code is, serving static content will beat it every time. Like @mieko pointed out, there is no single serve-all-purposes solution, and definitely not one I can include in p_a. I did aggregation with p_a in JS (client-side) and server-side for different projects, depending on the use case. Doing it in the database would bind the p_a implementation very closely to Postgres/MySQL/Oracle or some other crappy database (MongoDB).
We agreed months ago on the mailing list that we can't ship a solution for everyone in p_a. @mieko, people doing large-scale web dev will have serious problems as soon as they use your gem for aggregation. I'm not saying MapReduce specifically. The use case you've described as an example would perform vastly better if developers stored the data in the shape your results take.
The algorithm for this bundling would be really simple using the find_by method, along the lines of the sketch below. This does introduce another query when inserting, but it won't crash your servers when you purge caches. By the way, I think one PostgreSQL query can replace your gem.
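A minimal sketch of that find_by-based bundling, assuming the aggregate keeps an owner-id list and a counter inside the serialized parameters hash (both assumptions; only the default public_activity columns are used):

```ruby
# Hypothetical: bundle repeated events for the same trackable/key into one row.
def record_aggregated(trackable, key, owner)
  existing = PublicActivity::Activity.find_by(
    trackable_id:   trackable.id,
    trackable_type: trackable.class.name,
    key:            key
  )

  if existing
    existing.parameters ||= {}
    existing.parameters["owner_ids"] = (existing.parameters["owner_ids"] || []) | [owner.id]
    existing.parameters["count"]     = existing.parameters["owner_ids"].size
    existing.save!                     # the extra write mentioned above
    existing
  else
    PublicActivity::Activity.create!(
      trackable:  trackable,
      owner:      owner,
      key:        key,
      parameters: { "owner_ids" => [owner.id], "count" => 1 }
    )
  end
end
```

The lookup could also be narrowed with a time window so very old activities are not reopened.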
I've updated my in-database format to resemble my views, and replaced the gem with the proposed query. My servers have stopped crashing. I'll leave Coalesce (now in maintenance-only mode) up as a warning to others intending to scale.
This requires a proper migration, which we'll document in the wiki, but for testing purposes, when parameters is still a serialized string, |
I am now doing remove_duplicate like this in the controller. How is that? I also attempted to do this in the model, but failed.
That algorithm is really bad, because you process all activities every time.
I successfully implemented a sample application with two models plus activities: an Author and the updates they publish. I then implemented simple functionality to share these updates (Facebook-style) by authors (done entirely on activities, just as a simple concept). Finally, I've written a query that finds all activities with the key 'update.shared'.
The structure it returns groups activities by trackable and read status, with the owner ids aggregated into an array. In this simple form it does not achieve anything substantial, but I think it can solve most of the aggregation problems that our users encounter. The short version is:

```ruby
@activities = PublicActivity::Activity.where(key: 'update.shared')
  .group(:trackable_id, :trackable_type)
  .group("activities.parameters -> 'read'")
  .select("activities.parameters -> 'read' AS read,
           activities.trackable_id, activities.trackable_type,
           array_agg(activities.owner_id) AS owner_ids")

# this produces the above-mentioned structure
@activities.map { |o| { [o.trackable.content, o.read] => o.owner_ids } }
```

My solution uses PostgreSQL and requires one change in the p_a codebase plus a migration (to store parameters in native HStore). I'd like to hear what you think about this solution and whether it makes sense for your use cases. Ideally we'd like an RDBMS-agnostic solution, which could probably be achieved (with the exception of referencing into parameters, which I think only Postgres supports).
@farnoy I was expecting a more DB-schema-oriented approach (as I mentioned, I would see the reduce process happening at the write level rather than when activities are queried/read). Reading the above, it is not clear how activity generation handles entries that look the same. Maybe I'm missing something; could you explain more? Thanks. P.S.: By the time I opened this ticket I was working on @Courseware, which is now open source. So if you are looking for a project to benchmark/test things against, feel free to take a look at it (it has a dozen models with activities).
The way I understand map-reduce is that mapping and reducing happen when the data is queried.
Closing this for now as we have no specific way of achieving this on multiple ORMs. |
Is there somewhere else we can discuss this, as it's certainly a logical enhancement to p_a, even if it doesn't necessarily make sense for it to be part of the gem itself? @farnoy, perhaps you could share your sample Authors app please? I'm also struggling to get to grips with a good way of doing this.
If you make your parameters column a native hstore, the query I posted above should work. I don't have that app anymore. It should be easy in general, just limited because of SQL.
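For reference, a hypothetical migration along those lines, assuming PostgreSQL and that the existing serialized values either don't matter or are converted separately (the class name is made up):

```ruby
# Hypothetical: swap the serialized text column for a native hstore column.
# On Rails 5+ the base class needs a version, e.g. ActiveRecord::Migration[5.2].
class ConvertActivityParametersToHstore < ActiveRecord::Migration
  def up
    enable_extension "hstore"
    remove_column :activities, :parameters
    add_column    :activities, :parameters, :hstore
  end

  def down
    remove_column :activities, :parameters
    add_column    :activities, :parameters, :text
  end
end
```

As noted earlier in the thread, public_activity's own serialization of the parameters column would also need to be switched off so ActiveRecord can treat it as a plain hstore attribute.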
Great, thanks for that starting point. I'm going to have a go at it and report back here. I know it's not all-inclusive from an ORM perspective, but hopefully it will help other Postgres users; for others there is the JS solution above.
I solved the problem for PostgreSQL like this. It groups events that happen in sequence and counts the number of occurrences.

```sql
SELECT *, row_number - coalesce(lag(row_number) OVER (), 0) AS count
FROM (
  SELECT
    key,
    created_at,
    row_number() OVER (ORDER BY created_at),
    lead(key)    OVER (ORDER BY created_at)
  FROM activities
  WHERE owner_id = <user-id>
) t
WHERE key IS DISTINCT FROM lead
ORDER BY created_at DESC;
```

Here count is the number of consecutive occurrences of each key in the output.
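If it helps anyone, a hypothetical way to run that query from ActiveRecord, with `<user-id>` replaced by a bind parameter; the method name is made up and `current_user` is assumed from a controller context:

```ruby
# Hypothetical wrapper: same query as above, executed through ActiveRecord.
GROUPED_FEED_SQL = <<-SQL.freeze
  SELECT *, row_number - coalesce(lag(row_number) OVER (), 0) AS count
  FROM (
    SELECT key, created_at,
           row_number() OVER (ORDER BY created_at),
           lead(key)    OVER (ORDER BY created_at)
    FROM activities
    WHERE owner_id = ?
  ) t
  WHERE key IS DISTINCT FROM lead
  ORDER BY created_at DESC
SQL

def grouped_feed(user_id)
  PublicActivity::Activity.find_by_sql([GROUPED_FEED_SQL, user_id])
end

grouped_feed(current_user.id).each do |row|
  puts "#{row.key} x#{row['count']} (ending #{row.created_at})"
end
```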
Hey @oleander - definitely interested :)
@oleander I too am interested! Thanks in advance!
This is more like an open question.
I need a way to reduce the amount of repetitive activities, so that when I get the same keys for the same object I can develop a strategy to map/merge those into a single entry.
Ideally I would patch public_activity to handle that for me. Is anyone interested in getting something like this into the core, or should I not even bother pitching this? Thanks.