Detection of Duplicate Records in Large-scale Multi-tenant Platforms


Paul O'Grady

#PyConIE - 6th November 2016


Currently Data Scientist at ...

Twitter: @paul_ogrady





Record Linkage Problem

Multi-Tenant Platform

Records \(\equiv\) Products

Product Similarity


Product Similarity

We will be looking at Categorical Data

Product Space

Products exist in a space/manifold


Product Space

Distances between products indicate similarity


Product Space

Product duplicates are products that are very similar


Cartesian Join
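
The brute-force baseline is a Cartesian self-join: compare every product against every other, which is O(n²) in the number of products. A minimal sketch with hypothetical records (the field names are illustrative, not the platform's schema):

```python
from itertools import combinations

# Hypothetical product records (field names are illustrative)
products = [
    {"id": 1, "brand": "Acme", "colour": "red", "size": "M"},
    {"id": 2, "brand": "Acme", "colour": "red", "size": "L"},
    {"id": 3, "brand": "Zeta", "colour": "blue", "size": "M"},
]

# Cartesian self-join: all unordered pairs, n*(n-1)/2 comparisons
pairs = list(combinations(products, 2))
print(len(pairs))  # 3 pairs for 3 products
```

The pair count grows quadratically, which is what motivates the scaling discussion below.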

Measuring Similarity

Composite Similarity Measure

Hamming Distance
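
For categorical fields, one simple per-pair measure is the Hamming distance: the number of fields on which two records disagree. A minimal sketch, with illustrative field names:

```python
def hamming_distance(p1, p2, fields):
    """Number of categorical fields on which two products disagree."""
    return sum(p1[f] != p2[f] for f in fields)

a = {"brand": "Acme", "colour": "red", "size": "M"}
b = {"brand": "Acme", "colour": "red", "size": "L"}
print(hamming_distance(a, b, ["brand", "colour", "size"]))  # 1
```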

Scaling Up


Scaling Up De-duplication

Q: If we relax our requirements and accept missing some duplicates, can we use a trick?

A: Yes - Locality-Sensitive Hashing (LSH)

Locality-Sensitive Hashing

Jaccard Index
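
The Jaccard index measures the similarity of two sets as the size of their intersection over the size of their union. A minimal sketch, treating each product as a set of attribute tokens (the tokenisation is an assumption for illustration):

```python
def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Products as sets of attribute tokens (illustrative)
p1 = {"acme", "red", "m", "t-shirt"}
p2 = {"acme", "red", "l", "t-shirt"}
print(jaccard(p1, p2))  # 3 shared / 5 total = 0.6
```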

Simple Hashing
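
For Jaccard similarity, the standard LSH hash family is MinHash: keep the minimum hash value of a token set under several independent hash functions, so that two sets agree in a given signature position with probability equal to their Jaccard similarity. A minimal sketch (the salted-md5 construction is an illustrative choice, not the talk's implementation):

```python
import hashlib

def minhash_signature(tokens, n_hashes=8):
    """MinHash signature: for each of n_hashes salted hash functions,
    keep the minimum hash value over the token set."""
    sig = []
    for i in range(n_hashes):
        sig.append(min(
            int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
            for t in tokens))
    return sig
```

Identical sets always produce identical signatures; similar sets produce signatures that agree in roughly a Jaccard-proportional fraction of positions.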





LSH Band Hashing


LSH Bucketing


LSH Duplicate Detection
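
The banding step can be sketched as follows: split each signature into b bands of r rows, hash each band to a bucket, and report any two products that share at least one bucket as a candidate duplicate pair. This is a minimal illustration, not the production implementation:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, r, b):
    """signatures: {product_id: signature of length r*b}.
    Returns pairs that collide in at least one band."""
    buckets = defaultdict(set)
    for pid, sig in signatures.items():
        for band in range(b):
            # Bucket key: band index + that band's slice of the signature
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(pid)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Only pairs that collide in some band are compared in full, which is what makes the approach sub-quadratic.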

System Tuning


Prob. of Two Products in Same Bucket?

For two products with Jaccard similarity s, where r is the number of rows per band and b is the number of bands:

\[s = \operatorname{jaccard}(P_1, P_2)\]

prob. that all rows in a band are equal: \(s^r\)

prob. that some row in a band is unequal: \(1 - s^r\)

prob. that no band is identical: \((1 - s^r)^b\)

prob. that at least 1 band is identical (the s-curve):

\[1 - (1 - s^r)^b.\]
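
Plugging illustrative numbers into the candidate-pair probability (r = 4, b = 16, chosen only for this example) shows the s-curve behaviour: low-similarity pairs are rarely candidates while high-similarity pairs almost always are:

```python
def candidate_prob(s, r=4, b=16):
    """Prob. that at least one band is identical: 1 - (1 - s^r)^b."""
    return 1 - (1 - s**r)**b

print(candidate_prob(0.3))  # low-similarity pair: rarely a candidate
print(candidate_prob(0.8))  # high-similarity pair: almost always one
```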

LSH S-curve Tuning


Candidate-Pair Probability

Tuning r & b for specified s-similarity

s-curve Optimisation

from sympy import Symbol, Integral, lambdify
from scipy.optimize import minimize

class SCurveOpt:

    def __init__(self, target_s_thresh, use_s_thresh_approx=False):
        self.target_s_thresh = target_s_thresh
        self.use_s_thresh_approx = use_s_thresh_approx

    # Opt. *s*-curve Threshold
    def candidate_pair_prob(self, r, b, s):
        return 1 - (1 - s**r)**b

    def candidate_pair_prob_derivative(self, r, b, s):
        s_sym = Symbol('s')
        y = self.candidate_pair_prob(r, b, s_sym)
        yprime = y.diff(s_sym)
        f = lambdify(s_sym, yprime, 'numpy')
        return f(s)

    def scurve_threshold(self, r, b):
        # Start from the common (1/b)^(1/r) approximation
        x0 = (1 / b)**(1 / r)
        func = lambda s: -self.candidate_pair_prob_derivative(r, b, s)
        res = minimize(func, x0, method='L-BFGS-B', bounds=[(0., 1.)])
        return res.x[0]

    # Opt. Area Under Curve
    def candidate_pair_prob_integral(self, r, b, lower, upper):
        s_sym = Symbol('s')
        return float(Integral(self.candidate_pair_prob(r, b, s_sym),
                              (s_sym, lower, upper)))

    def calc_false_positive_pct_area(self, r, b):
        total_area = self.candidate_pair_prob_integral(r, b, 0., 1.)
        area_left_of_threshold = self.candidate_pair_prob_integral(
            r, b, 0., self.scurve_threshold_)
        return area_left_of_threshold / total_area

    # Opt. for Both
    def __call__(self, r, b):
        self.scurve_threshold_ = self.scurve_threshold(r, b)
        self.calc_false_positive_area_ = \
            self.calc_false_positive_pct_area(r, b)
        self.scurve_threshold_error_ = \
            abs(self.scurve_threshold_ - self.target_s_thresh)
        return (self.calc_false_positive_area_,
                self.scurve_threshold_error_)

>>> r = 4; b = 16
>>> scurve_opt = SCurveOpt(target_s_thresh=0.5)
>>> scurve_opt(r, b)
(0.107047, 0.032862)

Optimise With Reg.

import numpy as np
import scipy.optimize

def combine_outputs(x, alpha=1.0):
    area, thres_error = x
    return (1 - alpha) * area + alpha * thres_error

def opt(max_hash_budget, target_s_threshold,
        alpha, use_s_thresh_approx, slices):

    # Setup function: infeasible (r, b) combinations get infinite cost
    scurve = SCurveOpt(target_s_threshold, use_s_thresh_approx)
    func = lambda x: combine_outputs(scurve(x[0], x[1]), alpha) \
                         if x[0] * x[1] <= max_hash_budget else np.inf

    # Perform grid-search optimisation
    x0, fval, grid, Jout = scipy.optimize.brute(
        func, slices, full_output=True, finish=None)
    return x0, fval, grid, Jout

Optimal LSH Params

Opt. args: [  5.  26.] # Rows, Bands
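
As a sanity check, the widely used approximation (1/b)^(1/r) for the s-curve threshold, evaluated at the optimised parameters, lands close to the target threshold of 0.5:

```python
r, b = 5, 26  # optimised rows and bands from above
approx_threshold = (1 / b) ** (1 / r)
print(round(approx_threshold, 3))  # ~0.521, close to the 0.5 target
```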





[1] "Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping" — Mikhail Bilenko, Sugato Basu & Mehran Sahami.
[2] "Locality-Sensitive Hashing for Finding Nearest Neighbors" — Malcolm Slaney & Michael Casey.
[3] "Mining of Massive Datasets" — J. Leskovec, A. Rajaraman & J. Ullman.