Optimal time complexity greedy linkage #212

nbgl · 2019-05-23T07:10:34Z

Let D and D′ be two deduplicated datasets of size n and n′ respectively. Greedy linkage involves two steps: comparisons, which are trivially O(nn′), and solving.

Let p be the probability that a pair of records has similarity above the threshold. As n, n′ → ∞ (for a given source of data) p converges to the probability of a non-matching pair being above the threshold. (p > 0 unless there are never non-matching pairs above the threshold, in which case solving isn’t necessary.)

There are pnn′ pairs above the threshold. Because we sort them, our solving step is O(pnn′ log pnn′). This equals O(nn′ log nn′) since p converges to a positive constant. The entire linkage process is thus O(nn′ + nn′ log nn′) = O(nn′ log nn′).

However, solving can be done in O(pnn′ + m log m) time, where m is the number of found pairs. Observing that m ≤ min(n, n′) for a given source of data, O(pnn′ + m log m) = O(nn′). This is the same as the time complexity of the comparison step and it is the best we can do without ignoring input.

The algorithm is similar to Filter-Kruskal [1] (although I claim I thought of it before I found Filter-Kruskal!). It is a modification of QuickSort, and it filters out pairs that will provably not be accepted before they are sorted. Pseudo-Python:

def pick_pivot(pairs):
    raise NotImplementedError()


def partition(pairs, pivot):
    raise NotImplementedError()


def quick_greedy(candidate_pairs):
    matched_A = set()
    matched_B = set()
    accepted = set()

    def filter_provable_nonmatches(pairs):
        return [pair
                for pair in pairs
                if pair.index_A not in matched_A and pair.index_B not in matched_B]

    def quick_greedy_inner(pairs):
        if not candidate_pairs:
            return

        pivot = pick_pivot(pairs)
        left, right = partition(pairs, pivot)
        quick_greedy_inner(left)

        if pivot.index_A not in matched_A and pivot.index_B not in matched_B:
            matched_A.add(pivot.index_A)
            matched_B.add(pivot.index_B)
            accepted.add(pivot)

        right = filter_provable_nonmatches(right)
        quick_greedy_inner(right)

    quick_greedy_inner(candidate_pairs)
    return accepted

(Of course, in practice one would set a length limit for pairs, below which quick_greedy_inner calls the trivial greedy algorithm. This improves the speed since the trivial algorithm is faster for smaller problems.)

The time complexity of this algorithm can be proven by lightly modifying the proof of Theorem 3.1 in [1]. It relies on Lemma 1.1 in [2], which also needs adjusting.

[1] Osipov, Vitaly, Peter Sanders, and Johannes Singler. "The Filter-Kruskal Minimum Spanning Tree Algorithm." 2009 Proceedings of the Eleventh Workshop on Algorithm Engineering and Experiments (ALENEX). Society for Industrial and Applied Mathematics, 2009.
[2] Chan, Timothy M. "Backwards analysis of the Karger-Klein-Tarjan algorithm for minimum spanning trees." Information Processing Letters 67.6 (1998): 303-304.

The text was updated successfully, but these errors were encountered:

nbgl added research optimisation labels May 23, 2019

nbgl mentioned this issue May 24, 2019

Project: faster greedy #213

Merged

nbgl mentioned this issue Jun 28, 2019

Sort similarity scores data61/anonlink-entity-service#335

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimal time complexity greedy linkage #212

Optimal time complexity greedy linkage #212

nbgl commented May 23, 2019 •

edited

Loading

Optimal time complexity greedy linkage #212

Optimal time complexity greedy linkage #212

Comments

nbgl commented May 23, 2019 • edited Loading

nbgl commented May 23, 2019 •

edited

Loading