Make ChangeSet more full-featured #100
base: master
Conversation
A Changeset now includes a message as well as author/committer objects modeled on JGit's PersonIdent. This is a breaking change for anyone who uses ChangeSet. The payoff is that on complex git-based repositories, users can now better handle the distributed nature of git. The time-based CommitRanges provided by RepoDriller now use the committer time when doing time comparisons. This is what the user generally intends: "the time at which this commit entered the repository". Addresses mauricioaniche#96.
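For orientation, a rough sketch of the enriched ChangeSet shape this PR describes (class and field names are taken from the diff excerpts further down; the exact signatures are assumptions, not the merged API):

import java.util.Calendar;

/* Sketch only: approximate shape of the enriched ChangeSet described above.
 * CommitContributor mirrors JGit's PersonIdent (name, email, time). */
class ChangeSetSketch {
    String id;                   // commit hash, as before
    String message;              // full commit message (new in this PR)
    CommitContributor author;    // who wrote the change
    CommitContributor committer; // who entered it into the repository; its time drives time-based CommitRanges

    static class CommitContributor {
        String name;
        String email;
        Calendar time;
    }
}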
}

@Override
public int hashCode() {
Should I just remove this?
 * A POJO to track details about a person involved with a commit.
 */
public class CommitPerson {
	public String name;
I just access these fields directly. Do you want me to use getters instead or to make these fields final?
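For comparison, a minimal sketch of the "final fields" option being discussed (only name appears in the excerpt above; the other fields are illustrative assumptions):

import java.util.Calendar;

/* Sketch of an immutable CommitPerson: fields stay public for direct access
 * but become final and are set once in the constructor. The field set beyond
 * "name" is assumed for illustration. */
public class CommitPerson {
    public final String name;
    public final String email;
    public final Calendar time;

    public CommitPerson(String name, String email, Calendar time) {
        this.name = name;
        this.email = email;
        this.time = time;
    }
}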
I have two concerns about this:
- The idea of the ChangeSet class was for it to be a lightweight way to get all commit hashes that need to be investigated later. I think we can definitely add the two dates Git stores. However, in this PR, the ChangeSet contains the message and authors, which can be a burden at that moment when repodriller just needs the commit IDs.
- I need to carefully review it, but I suggest that we keep the behaviour of existing filters and add a new one that looks at the committer date, as you need. ChangeSets are only used internally, so I don't see a problem with changing it.
Thoughts?
Since a ChangeSet includes a time and some CommitRanges […]:

Storage cost: The new ChangeSet is slightly larger, it's true. But it's still a relatively small class composed of Strings and Calendars. I don't think the addition of a Calendar and some Strings will noticeably inflate the overhead, especially since we only load the ChangeSets for one SCM at a time.

Computation cost: Obtaining the message and authors uses the same JGit/SVN objects that were used to obtain the commit IDs. We just extract more information from them, so there's not really any extra computational cost.

I don't see any downside to having more information in the ChangeSet. The upside is that a CommitRange can now make better decisions. The current example is that the CommitterTime, not the AuthorTime, is the time we probably want.
This has two parts: […]

Another way to phrase this second part is "If we had to choose between making an improvement and maintaining backwards compatibility, which do we pick?". My opinion is that RepoDriller doesn't have a large enough user base to hold back on improvements just to avoid breaking users. Making a breaking change without a good reason is silly, of course, but I think this is a case where the upsides justify it. A project like BOA has to consider backwards compatibility, but I don't think RepoDriller is there yet. Interested in your thoughts.
I am sure commit messages can be quite large. As we don't have filters based on commit messages, I suggest we do not store them in the ChangeSet. The current […]

No, we do not have hundreds of users using it, but myself and the PhD students here have a lot of studies on it. Changing it would require us to change a lot of our existing code base (which we will not do, as we'll forget about this breaking change two days after the PR is merged, hehe).
Address mauricioaniche's comments. Note: We are defaulting to "probably not the value the user wants". Is this a good idea?
4b88f68 addresses the second suggestion. I've kept the commit messages for now. I don't think it makes sense to worry about the storage overhead of messages -- do you have an example repository with large commit messages I could evaluate?
Mark this deprecated. Existing analyses should migrate to getAuthor().time or getCommitter().time. I suspect everyone should use getCommitter().time.
6ba0cf8 means that existing ChangeSet filters won't break, though I've marked the existing time field deprecated.
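A hypothetical before/after for that migration (the old accessor name, the field-style access, and the package are assumptions based on the commit message above):

import java.util.Calendar;

import org.repodriller.domain.ChangeSet; // package assumed

class TimeMigrationExample {
    /* Old: a single time on the ChangeSet (deprecated in this PR).
     * New: choose explicitly; committer time is usually "when this commit
     * entered the repository". Accessor names are assumptions. */
    static Calendar commitTime(ChangeSet cs) {
        // return cs.getTime();         // deprecated single-time accessor (name assumed)
        return cs.getCommitter().time;  // committer time, per the commit message above
    }
}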
If one needs to do something (like filtering) with the commit message, then it can be done at the […].

I still do not see a reason to make […]. Let me reflect on this.
It seems like the question is how much it costs to extract a full Commit, and what the user might want to filter on. Here are my thoughts:
I don't have any data to back up my intuition, but we could certainly take some measurements -- say, on the Linux kernel, Ruby, NodeJS, Apache, whatever @ayaankazerouni is working on, and so on.

I think a user might want to filter on any of the metadata: "Commits from author X", "Commits between times A and B", "Commits that touch .java files", "Commits whose checkin messages contain profanity", and so on. While a user could filter on data ("Commits that modified class X or introduced method Y"), I think this is a rare enough case that the user can apply their own filter while processing the Commit.

Since my intuition is that "metadata is cheap, data is expensive", I suggest that the ChangeSet have all of the metadata we can get ahold of, but not the data. Basically this is everything printed by […].
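To make that concrete, here is a rough metadata-only filter over enriched ChangeSets, before any full Commit is extracted (the getAuthor()/getCommitter() accessors, field names, and package are assumptions about this PR's API):

import java.util.Calendar;
import java.util.List;
import java.util.stream.Collectors;

import org.repodriller.domain.ChangeSet; // package assumed

/* Illustrative metadata filtering: "commits from author X between times A and B",
 * done purely on ChangeSet metadata. Accessor names are assumptions. */
class MetadataFilterExample {
    static List<ChangeSet> byAuthorAndWindow(List<ChangeSet> all, String author,
            Calendar from, Calendar to) {
        return all.stream()
                .filter(cs -> author.equals(cs.getAuthor().name))
                .filter(cs -> cs.getCommitter().time.after(from)
                        && cs.getCommitter().time.before(to))
                .collect(Collectors.toList());
    }
}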
Two more cents on this: […]

What do you think?
If you like my "metadata vs. data" proposal, then I think it makes sense to refactor […].

If we drop […].

To avoid having to process the entire repository through […], we can walk through the surviving […].

As a future possibility, we could add a nextChangeSet() and nextCommit() API to SCM. Then we could stream […].

Since we already load all of the (minimal) ChangeSets […].
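A sketch of what that streaming addition could look like (nextChangeSet()/nextCommit() come from the comment above; everything else, including the packages and return conventions, is an assumption):

import org.repodriller.domain.ChangeSet; // packages assumed
import org.repodriller.domain.Commit;

/* Hypothetical streaming companion to SCM: pull ChangeSets (and their Commits)
 * one at a time instead of materializing the whole list up front. Sketch only. */
interface StreamingSCM {
    /** Next ChangeSet in the walk, or null when the walk is exhausted. */
    ChangeSet nextChangeSet();

    /** Full Commit corresponding to the most recently returned ChangeSet. */
    Commit nextCommit();
}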
That's how it works right now, right? We get a list of cheap ChangeSets […].

We could definitely offer a Stream API to users, but that can be in another PR. I think for this one, I'd just inject the […].

Just to give you a concrete example of how I use it and why memory is important to me: I often run my experiments on cheap Digital Ocean machines with 200 MB of RAM available to me. I'd love to keep this 200 MB as our goal.
Right, at the moment we load all the ChangeSets […].

I'll add a commit moving […].
Addresses concerns on mauricioaniche#100. This version still maintains two copies of a ChangeSet, one in the Commit and one in the list of ChangeSets that survived the filter. This is because the SCM interface doesn't have a way to pass in (and save a reference to) a Changeset. Just the Id is passed at the moment.
Per comments on mauricioaniche#100, don't duplicate the ChangeSet when creating a Commit. This is an optimization. Add an SCM.getCommit(ChangeSet cs) API. Refactor RepositoryMining to use this API. Deprecate the SCM.getCommit(String id) API.
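Roughly, the call-site change in RepositoryMining described by this commit (the loop is simplified, the packages are assumed, and processCommit is a hypothetical placeholder):

import java.util.List;

import org.repodriller.domain.ChangeSet; // packages assumed
import org.repodriller.domain.Commit;
import org.repodriller.scm.SCM;

/* Sketch of the RepositoryMining change: hand the SCM the ChangeSet it already
 * produced instead of just the id. */
class GetCommitByChangeSetSketch {
    void mine(SCM scm, List<ChangeSet> survivingChangeSets) {
        for (ChangeSet cs : survivingChangeSets) {
            // Commit commit = scm.getCommit(cs.getId()); // old API, now deprecated
            Commit commit = scm.getCommit(cs);            // new API added in this PR
            processCommit(commit);                        // hypothetical downstream processing
        }
    }

    void processCommit(Commit commit) { /* visitor dispatch would go here */ }
}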
@mauricioaniche, commits 91f3bee and 78537b3 refactor Commit to use a ChangeSet, and add a new SCM API, getCommit(ChangeSet).
@davisjam, yes, but we do not store a list of […]. I'll review this commit today at some point!
@mauricioaniche Right, my point is just that even if we don't run out of memory, we still don't want to pay the cost of loading every […]. It wouldn't be hard to pop and discard […].
ee75977 fixes merge conflicts.
We are agreeing on not loading more Commits than necessary. I just wanna test how much bigger our memory footprint will be as soon as […]. Review is still on my ToDo list. Sorry!
This PR sounds ok!
Before merging, I'd love to see a memory comparison. I'll do it as soon as I have time (I have a flight from SFO -> Chicago, on which I'll have some hours).
I think a simple CommitVisitor that tracks the free memory is enough for us to observe the differences.
Runtime runtime = Runtime.getRuntime();
long memory = runtime.totalMemory() - runtime.freeMemory();
Nevertheless, a good optimization would be to remove the ChangeSet from the list after it was used, and let GC remove it from memory.
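A minimal sketch of such a memory-tracking CommitVisitor (the interface, method signatures, and package names follow repodriller as I recall them; treat them as assumptions when wiring this up):

import org.repodriller.domain.Commit;
import org.repodriller.persistence.PersistenceMechanism;
import org.repodriller.scm.CommitVisitor;
import org.repodriller.scm.SCMRepository;

/* Records the JVM's used memory at each visited commit, so runs on master and
 * on this PR's branch can be compared. Sketch only. */
public class MemoryFootprintVisitor implements CommitVisitor {

    @Override
    public void process(SCMRepository repo, Commit commit, PersistenceMechanism writer) {
        Runtime runtime = Runtime.getRuntime();
        long usedBytes = runtime.totalMemory() - runtime.freeMemory();
        writer.write(commit.getHash(), usedBytes / (1024 * 1024)); // hash, used MB
    }

    @Override
    public String name() {
        return "memory-footprint";
    }
}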
private Set<String> parentIds; /* Parent(s) of this commit. Only "merge commits" have more than one parent. Does the root commit have 0? */

private Set<String> branches; /* Set of branches that contain this commit, i.e. the branches for which this commit is an ancestor of the most recent commit in that branch. */
Yeah, this will definitely increase the memory footprint. Maybe we should indeed discard ChangeSets after they are used.
Discarding ChangeSets after they are used will definitely reduce the time spent at the maximum footprint (decreasing the ChangeSet footprint as the expected footprint from CommitVisitors rises). I'll look into adding this; it should just be a few lines of code.
See 1b7affc.
}

@Override
public int hashCode() {
Is this the one that IntelliJ generates automatically?
No. I didn't realize an IDE could do that.
String msg = r.getFullMessage().trim();
CommitContributor author = extractCommitContributor(r.getAuthorIdent());
CommitContributor committer = extractCommitContributor(r.getCommitterIdent());
List<RevCommit> parents = Arrays.asList(r.getParents());
this is the point where I believe we can connect to #104. We could just load what the user really needs in her particular study.
Agreed. I don't think we should do that in this PR though.
Problem: We currently construct all ChangeSets in memory and then process each one. We maintain pointers to all of the ChangeSets during this process, so the overall memory footprint only increases due to the CommitVisitors. See RepositoryMining.processRepo. The "full-featured ChangeSet" PR increases the size of each ChangeSet, and there is concern that this may lead to OOM issues on small machines.

Optimization: Discard the pointer to each ChangeSet after we process it. If a CommitVisitor does not save the Commits it visits, then a ChangeSet can be GC'd after a worker processes it.

Longer-term suggestions:
1. If we stream ChangeSets instead of constructing them up front, this concern goes away entirely.
2. If we make the amount of data stored in a ChangeSet tunable, the old ChangeSet memory footprint can still be achieved.
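A sketch of the "discard the pointer" optimization (the package is assumed, and the per-ChangeSet processing call is a hypothetical stand-in for what RepositoryMining.processRepo does):

import java.util.LinkedList;

import org.repodriller.domain.ChangeSet; // package assumed

/* Consume the ChangeSet list as a queue so each entry becomes garbage-collectable
 * as soon as it has been processed (provided no CommitVisitor keeps a reference). */
class DiscardAfterUseSketch {
    void processAll(LinkedList<ChangeSet> changeSets) {
        while (!changeSets.isEmpty()) {
            ChangeSet cs = changeSets.removeFirst(); // drop our reference as we go
            process(cs);                             // hypothetical per-ChangeSet processing
        }
    }

    void process(ChangeSet cs) { /* extract the Commit and run the visitors here */ }
}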
Awesome. This PR is done! I'll just check it out and run some experiments soon!
Hey @mauricioaniche, had any time for experiments on this?
@davisjam Ah, I'll do the charts and etc. Just need Travis to run there! :)
Paper deadline on Dec 1, I should be able to look at this around Dec. 5?
Sure! Good luck with your submission!
@davisjam See the latest build in the master and the build at this one. In the master, it takes a few milliseconds to start processing the repo:
21:46:17.737 [main] INFO org.repodriller.RepositoryMining - Git repository in /home/travis/build/mauricioaniche/repodriller/target/test-classes/../../test-repos/rails
21:46:19.369 [main] INFO org.repodriller.RepositoryMining - 1001 ChangeSets to process
In this PR, I'm waiting for 7 minutes now, and it doesn't even leave the first line:
21:52:31.963 [main] INFO org.repodriller.RepositoryMining - Git repository in /home/travis/build/mauricioaniche/repodriller/target/test-classes/../../test-repos/rails
I suppose that collecting all the information beforehand is just too much... Thoughts?
The semester is almost over...
A lot has changed since this (old) PR. Merging it will be challenging. What do you wanna do, @davisjam? Shall we just close it?

I'm a bit torn. I think the additional fields are useful; I just don't need them at the moment. Open to thoughts from other RepoDriller users on whether reviving this PR is of interest.
A Changeset now includes a message as well as author/committer objects
modeled on JGit's PersonIdent.
This is a breaking change for anyone who uses ChangeSet's time field.
The payoff is that on complex git-based repositories, users can now better handle the distributed nature of git.
The time-based CommitRanges provided by RepoDriller now use the committer time when doing time comparisons.
This is what the user generally intends: "the time at which this commit entered the repository".
Addresses #96.