You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Let D and D′ be two deduplicated datasets of size n and n′ respectively. Greedy linkage involves two steps: comparisons, which are trivially O(nn′), and solving.
Let p be the probability that a pair of records has similarity above the threshold. As n, n′ → ∞ (for a given source of data) p converges to the probability of a non-matching pair being above the threshold. (p > 0 unless there are never non-matching pairs above the threshold, in which case solving isn’t necessary.)
There are pnn′ pairs above the threshold. Because we sort them, our solving step is O(pnn′ log pnn′). This equals O(nn′ log nn′) since p converges to a positive constant. The entire linkage process is thus O(nn′ + nn′ log nn′) = O(nn′ log nn′).
However, solving can be done in O(pnn′ + m log m) time, where m is the number of found pairs. Observing that m ≤ min(n, n′) for a given source of data, O(pnn′ + m log m) = O(nn′). This is the same as the time complexity of the comparison step and it is the best we can do without ignoring input.
The algorithm is similar to Filter-Kruskal [1] (although I claim I thought of it before I found Filter-Kruskal!). It is a modification of QuickSort, and it filters out pairs that will provably not be accepted before they are sorted. Pseudo-Python:
(Of course, in practice one would set a length limit for pairs, below which quick_greedy_inner calls the trivial greedy algorithm. This improves the speed since the trivial algorithm is faster for smaller problems.)
The time complexity of this algorithm can be proven by lightly modifying the proof of Theorem 3.1 in [1]. It relies on Lemma 1.1 in [2], which also needs adjusting.
[1] Osipov, Vitaly, Peter Sanders, and Johannes Singler. "The Filter-Kruskal Minimum Spanning Tree Algorithm." 2009 Proceedings of the Eleventh Workshop on Algorithm Engineering and Experiments (ALENEX). Society for Industrial and Applied Mathematics, 2009.
[2] Chan, Timothy M. "Backwards analysis of the Karger-Klein-Tarjan algorithm for minimum spanning trees." Information Processing Letters 67.6 (1998): 303-304.
The text was updated successfully, but these errors were encountered:
Let D and D′ be two deduplicated datasets of size n and n′ respectively. Greedy linkage involves two steps: comparisons, which are trivially O(nn′), and solving.
Let p be the probability that a pair of records has similarity above the threshold. As n, n′ → ∞ (for a given source of data) p converges to the probability of a non-matching pair being above the threshold. (p > 0 unless there are never non-matching pairs above the threshold, in which case solving isn’t necessary.)
There are pnn′ pairs above the threshold. Because we sort them, our solving step is O(pnn′ log pnn′). This equals O(nn′ log nn′) since p converges to a positive constant. The entire linkage process is thus O(nn′ + nn′ log nn′) = O(nn′ log nn′).
However, solving can be done in O(pnn′ + m log m) time, where m is the number of found pairs. Observing that m ≤ min(n, n′) for a given source of data, O(pnn′ + m log m) = O(nn′). This is the same as the time complexity of the comparison step and it is the best we can do without ignoring input.
The algorithm is similar to Filter-Kruskal [1] (although I claim I thought of it before I found Filter-Kruskal!). It is a modification of QuickSort, and it filters out pairs that will provably not be accepted before they are sorted. Pseudo-Python:
(Of course, in practice one would set a length limit for
pairs
, below whichquick_greedy_inner
calls the trivial greedy algorithm. This improves the speed since the trivial algorithm is faster for smaller problems.)The time complexity of this algorithm can be proven by lightly modifying the proof of Theorem 3.1 in [1]. It relies on Lemma 1.1 in [2], which also needs adjusting.
[1] Osipov, Vitaly, Peter Sanders, and Johannes Singler. "The Filter-Kruskal Minimum Spanning Tree Algorithm." 2009 Proceedings of the Eleventh Workshop on Algorithm Engineering and Experiments (ALENEX). Society for Industrial and Applied Mathematics, 2009.
[2] Chan, Timothy M. "Backwards analysis of the Karger-Klein-Tarjan algorithm for minimum spanning trees." Information Processing Letters 67.6 (1998): 303-304.
The text was updated successfully, but these errors were encountered: