Make ChangeSet more full-featured #100
base: master
Conversation
A Changeset now includes a message as well as author/committer objects modeled on JGit's PersonIdent. This is a breaking change for anyone who uses ChangeSet. The payoff is that on complex git-based repositories, users can now better handle the distributed nature of git. The time-based CommitRanges provided by RepoDriller now use the committer time when doing time comparisons. This is what the user generally intends: "the time at which this commit entered the repository". Addresses mauricioaniche#96.
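For orientation, a rough sketch of the enriched ChangeSet shape this PR describes (class and field names are taken from the diff excerpts further down; the exact signatures are assumptions, not the merged API):

import java.util.Calendar;

/* Sketch only: approximate shape of the enriched ChangeSet described above.
 * CommitContributor mirrors JGit's PersonIdent (name, email, time). */
class ChangeSetSketch {
    String id;                   // commit hash, as before
    String message;              // full commit message (new in this PR)
    CommitContributor author;    // who wrote the change
    CommitContributor committer; // who entered it into the repository; its time drives time-based CommitRanges

    static class CommitContributor {
        String name;
        String email;
        Calendar time;
    }
}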
}

@Override
public int hashCode() {
Should I just remove this?
 * A POJO to track details about a person involved with a commit.
 */
public class CommitPerson {
	public String name;
I just access these fields directly. Do you want me to use getters instead or to make these fields final?
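For comparison, a minimal sketch of the "final fields" option being discussed (only name appears in the excerpt above; the other fields are illustrative assumptions):

import java.util.Calendar;

/* Sketch of an immutable CommitPerson: fields stay public for direct access
 * but become final and are set once in the constructor. The field set beyond
 * "name" is assumed for illustration. */
public class CommitPerson {
    public final String name;
    public final String email;
    public final Calendar time;

    public CommitPerson(String name, String email, Calendar time) {
        this.name = name;
        this.email = email;
        this.time = time;
    }
}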
I have two concerns about this:
- The idea of the ChangeSet class was for it to be a lightweight way to get all commit hashes that need to be investigated later. I think we can definitely add the two dates Git stores. However, in this PR, the ChangeSet contains the message and authors, which can be a burden at that moment when repodriller just needs the commit IDs.
- I need to carefully review it, but I suggest that we keep the behaviour of existing filters and add a new one that looks at the committer date, as you need. ChangeSets are only used internally, so I don't see a problem with changing it.
Thoughts?
Since a ChangeSet includes a time and some CommitRanges […]:

Storage cost: The new ChangeSet is slightly larger, it's true. But it's still a relatively small class composed of Strings and Calendars. I don't think the addition of a Calendar and some Strings will noticeably inflate the overhead, especially since we only load the ChangeSets for one SCM at a time.

Computation cost: Obtaining the message and authors uses the same JGit/SVN objects that were used to obtain the commit IDs. We just extract more information from them, so there's not really any extra computational cost.

I don't see any downside to having more information in the ChangeSet. The upside is that a CommitRange can now make better decisions. The current example is that the CommitterTime, not the AuthorTime, is the time we probably want.
This has two parts: […]

Another way to phrase this second part is "If we had to choose between making an improvement and maintaining backwards compatibility, which do we pick?". My opinion is that RepoDriller doesn't have a large enough user base to hold back on improvements just to avoid breaking users. Making a breaking change without a good reason is silly, of course, but I think this is a case where the upsides justify it. A project like BOA has to consider backwards compatibility, but I don't think RepoDriller is there yet. Interested in your thoughts.
I am sure commit messages can be quite large. As we don't have filters based on commit messages, I suggest we do not store them in the ChangeSet. The current […]

No, we do not have hundreds of users using it, but myself and the PhD students here have a lot of studies on it. Changing it would require us to change a lot of our existing code base (which we will not do, as we'll forget about this breaking change two days after the PR is merged, hehe).
Address mauricioaniche's comments. Note: We are defaulting to "probably not the value the user wants". Is this a good idea?
4b88f68 addresses the second suggestion. I've kept the commit messages for now. I don't think it makes sense to worry about the storage overhead of messages -- do you have an example repository with large commit messages I could evaluate?
Mark this deprecated. Existing analyses should migrate to getAuthor().time or getCommitter().time. I suspect everyone should use getCommitter().time.
6ba0cf8 means that existing ChangeSet filters won't break, though I've marked the existing time field deprecated.
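A hypothetical before/after for that migration (the old accessor name, the field-style access, and the package are assumptions based on the commit message above):

import java.util.Calendar;

import org.repodriller.domain.ChangeSet; // package assumed

class TimeMigrationExample {
    /* Old: a single time on the ChangeSet (deprecated in this PR).
     * New: choose explicitly; committer time is usually "when this commit
     * entered the repository". Accessor names are assumptions. */
    static Calendar commitTime(ChangeSet cs) {
        // return cs.getTime();         // deprecated single-time accessor (name assumed)
        return cs.getCommitter().time;  // committer time, per the commit message above
    }
}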
If one needs to do something (like filtering) with the commit message, then it can be done at the […].

I still do not see a reason to make […]. Let me reflect on this.
It seems like the question is how much it costs to extract a full Commit, and what the user might want to filter on. Here are my thoughts:
I don't have any data to back up my intuition, but we could certainly take some measurements -- say, on the Linux kernel, Ruby, NodeJS, Apache, whatever @ayaankazerouni is working on, and so on.

I think a user might want to filter on any of the metadata: "Commits from author X", "Commits between times A and B", "Commits that touch .java files", "Commits whose checkin messages contain profanity", and so on. While a user could filter on data ("Commits that modified class X or introduced method Y"), I think this is a rare enough case that the user can apply their own filter while processing the Commit.

Since my intuition is that "metadata is cheap, data is expensive", I suggest that the ChangeSet have all of the metadata we can get ahold of, but not the data. Basically this is everything printed by […].
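To make that concrete, here is a rough metadata-only filter over enriched ChangeSets, before any full Commit is extracted (the getAuthor()/getCommitter() accessors, field names, and package are assumptions about this PR's API):

import java.util.Calendar;
import java.util.List;
import java.util.stream.Collectors;

import org.repodriller.domain.ChangeSet; // package assumed

/* Illustrative metadata filtering: "commits from author X between times A and B",
 * done purely on ChangeSet metadata. Accessor names are assumptions. */
class MetadataFilterExample {
    static List<ChangeSet> byAuthorAndWindow(List<ChangeSet> all, String author,
            Calendar from, Calendar to) {
        return all.stream()
                .filter(cs -> author.equals(cs.getAuthor().name))
                .filter(cs -> cs.getCommitter().time.after(from)
                        && cs.getCommitter().time.before(to))
                .collect(Collectors.toList());
    }
}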
Two more cents on this: […]

What do you think?
If you like my "metadata vs. data" proposal, then I think it makes sense to refactor […].

If we drop […].

To avoid having to process the entire repository through […], we can walk through the surviving […].

As a future possibility, we could add a nextChangeSet() and nextCommit() API to SCM. Then we could stream […].

Since we already load all of the (minimal) ChangeSets […].
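A sketch of what that streaming addition could look like (nextChangeSet()/nextCommit() come from the comment above; everything else, including the packages and return conventions, is an assumption):

import org.repodriller.domain.ChangeSet; // packages assumed
import org.repodriller.domain.Commit;

/* Hypothetical streaming companion to SCM: pull ChangeSets (and their Commits)
 * one at a time instead of materializing the whole list up front. Sketch only. */
interface StreamingSCM {
    /** Next ChangeSet in the walk, or null when the walk is exhausted. */
    ChangeSet nextChangeSet();

    /** Full Commit corresponding to the most recently returned ChangeSet. */
    Commit nextCommit();
}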
That's how it works right now, right? We get a list of cheap ChangeSets […].

We could definitely offer a Stream API to users, but that can be in another PR. I think for this one, I'd just inject the […].

Just to give you a concrete example of how I use it and why memory is important to me: I often run my experiments on cheap Digital Ocean machines with 200 MB of RAM available to me. I'd love to keep this 200 MB as our goal.
Right, at the moment we load all the ChangeSets […].

I'll add a commit moving […].
Addresses concerns on mauricioaniche#100. This version still maintains two copies of a ChangeSet, one in the Commit and one in the list of ChangeSets that survived the filter. This is because the SCM interface doesn't have a way to pass in (and save a reference to) a Changeset. Just the Id is passed at the moment.
Per comments on mauricioaniche#100, don't duplicate the ChangeSet when creating a Commit. This is an optimization. Add an SCM.getCommit(ChangeSet cs) API. Refactor RepositoryMining to use this API. Deprecate the SCM.getCommit(String id) API.
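Roughly, the call-site change in RepositoryMining described by this commit (the loop is simplified, the packages are assumed, and processCommit is a hypothetical placeholder):

import java.util.List;

import org.repodriller.domain.ChangeSet; // packages assumed
import org.repodriller.domain.Commit;
import org.repodriller.scm.SCM;

/* Sketch of the RepositoryMining change: hand the SCM the ChangeSet it already
 * produced instead of just the id. */
class GetCommitByChangeSetSketch {
    void mine(SCM scm, List<ChangeSet> survivingChangeSets) {
        for (ChangeSet cs : survivingChangeSets) {
            // Commit commit = scm.getCommit(cs.getId()); // old API, now deprecated
            Commit commit = scm.getCommit(cs);            // new API added in this PR
            processCommit(commit);                        // hypothetical downstream processing
        }
    }

    void processCommit(Commit commit) { /* visitor dispatch would go here */ }
}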
@mauricioaniche, commits 91f3bee and 78537b3 refactor Commit to use a ChangeSet, and add a new SCM API, getCommit(ChangeSet).
@davisjam, yes, but we do not store a list of […]. I'll review this commit today at some point!
@mauricioaniche Right, my point is just that even if we don't run out of memory, we still don't want to pay the cost of loading every […]. It wouldn't be hard to pop and discard […].
ee75977 fixes merge conflicts.
We are agreeing on not loading more Commits than necessary. I just wanna test how much bigger our memory footprint will be as soon as […]. Review is still on my ToDo list. Sorry!
This PR sounds ok!
Before merging, I'd love to see a memory comparison. I'll do it as soon as I have time (I have a flight from SFO -> Chicago, on which I'll have some hours).
I think a simple CommitVisitor that tracks the free memory is enough for us to observe the differences.
Runtime runtime = Runtime.getRuntime();
long memory = runtime.totalMemory() - runtime.freeMemory();
Nevertheless, a good optimization would be to remove the ChangeSet from the list after it was used, and let GC remove it from memory.
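A minimal sketch of such a memory-tracking CommitVisitor (the interface, method signatures, and package names follow repodriller as I recall them; treat them as assumptions when wiring this up):

import org.repodriller.domain.Commit;
import org.repodriller.persistence.PersistenceMechanism;
import org.repodriller.scm.CommitVisitor;
import org.repodriller.scm.SCMRepository;

/* Records the JVM's used memory at each visited commit, so runs on master and
 * on this PR's branch can be compared. Sketch only. */
public class MemoryFootprintVisitor implements CommitVisitor {

    @Override
    public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
        Runtime runtime = Runtime.getRuntime();
        long usedBytes = runtime.totalMemory() - runtime.freeMemory();
        writer.write(commit.getHash(), usedBytes / (1024 * 1024)); // hash, used MB
    }

    @Override
    public String name() {
        return "memory-footprint";
    }
}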
private Set<String> parentIds; /* Parent(s) of this commit. Only "merge commits" have more than one parent. Does the root commit have 0? */

private Set<String> branches; /* Set of branches that contain this commit, i.e. the branches for which this commit is an ancestor of the most recent commit in that branch. */
Yeah, this will definitely increase the memory footprint. Maybe we should indeed discard ChangeSets after they are used.
Discarding ChangeSets after they are used will definitely reduce the time spent at the maximum footprint (decreasing the ChangeSet footprint as the expected footprint from CommitVisitors rises). I'll look into adding this; it should just be a few lines of code.
See 1b7affc.
}

@Override
public int hashCode() {
Is this the one that IntelliJ generates automatically?
No. I didn't realize an IDE could do that.
String msg = r.getFullMessage().trim();
CommitContributor author = extractCommitContributor(r.getAuthorIdent());
CommitContributor committer = extractCommitContributor(r.getCommitterIdent());
List<RevCommit> parents = Arrays.asList(r.getParents());
this is the point where I believe we can connect to #104. We could just load what the user really needs in her particular study.
Agreed. I don't think we should do that in this PR though.
Problem: We currently construct all ChangeSets in memory and then process each one. We maintain pointers to all of the ChangeSets during this process, so the overall memory footprint only increases due to the CommitVisitors. See RepositoryMining.processRepo. The "full-featured ChangeSet" PR increases the size of each ChangeSet, and there is concern that this may lead to OOM issues on small machines.

Optimization: Discard the pointer to each ChangeSet after we process it. If a CommitVisitor does not save the Commits it visits, then a ChangeSet can be GC'd after a worker processes it.

Longer-term suggestions:
1. If we stream ChangeSets instead of constructing them up front, this concern goes away entirely.
2. If we make the amount of data stored in a ChangeSet tunable, the old ChangeSet memory footprint can still be achieved.
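A sketch of the "discard the pointer" optimization (the package is assumed, and the per-ChangeSet processing call is a hypothetical stand-in for what RepositoryMining.processRepo does):

import java.util.LinkedList;

import org.repodriller.domain.ChangeSet; // package assumed

/* Consume the ChangeSet list as a queue so each entry becomes garbage-collectable
 * as soon as it has been processed (provided no CommitVisitor keeps a reference). */
class DiscardAfterUseSketch {
    void processAll(LinkedList<ChangeSet> changeSets) {
        while (!changeSets.isEmpty()) {
            ChangeSet cs = changeSets.removeFirst(); // drop our reference as we go
            process(cs);                             // hypothetical per-ChangeSet processing
        }
    }

    void process(ChangeSet cs) { /* extract the Commit and run the visitors here */ }
}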
Awesome. This PR is done! I'll just check it out and run some experiments soon!
Hey @mauricioaniche, had any time for experiments on this?
@davisjam Ah, I'll do the charts and etc. Just need Travis to run there! :)
Paper deadline on Dec 1, I should be able to look at this around Dec. 5?
Sure! Good luck with your submission!
@davisjam See the latest build in the master and the build at this one. In the master, it takes a few milliseconds to start processing the repo:
21:46:17.737 [main] INFO org.repodriller.RepositoryMining - Git repository in /home/travis/build/mauricioaniche/repodriller/target/test-classes/../../test-repos/rails
21:46:19.369 [main] INFO org.repodriller.RepositoryMining - 1001 ChangeSets to process
In this PR, I'm waiting for 7 minutes now, and it doesn't even leave the first line:
21:52:31.963 [main] INFO org.repodriller.RepositoryMining - Git repository in /home/travis/build/mauricioaniche/repodriller/target/test-classes/../../test-repos/rails
I suppose that collecting all the information beforehand is just too much... Thoughts?
The semester is almost over...
A lot has changed since this (old) PR. Merging it will be challenging. What do you wanna do, @davisjam? Shall we just close it?

I'm a bit torn. I think the additional fields are useful; I just don't need them at the moment. Open to thoughts from other RepoDriller users on whether reviving this PR is of interest.
A Changeset now includes a message as well as author/committer objects
modeled on JGit's PersonIdent.
This is a breaking change for anyone who uses ChangeSet's time field.
The payoff is that on complex git-based repositories, users can now better handle the distributed nature of git.
The time-based CommitRanges provided by RepoDriller now use the committer time when doing time comparisons.
This is what the user generally intends: "the time at which this commit entered the repository".
Addresses #96.