Detection of Duplicate Records in Large-scale Multi-tenant Platforms

_images/FASH_2014_zLabels_006.jpg

Paul O'Grady


#PyConIE - 6th November 2016

About Me


Currently a Data Scientist at ...


Twitter: @paul_ogrady

Overview

_images/TECH_2014_Hackweek_001.jpg

Platform

_images/zalando_platform_2.jpg

Record Linkage Problem

Multi-Tenant Platform

Records \(\equiv\) Products

Product Similarity


_images/product_similarity.png

Product Similarity


We will be looking at Categorical Data

Product Space

Products exist in a space/manifold

_images/product_plot_1.png

Product Space

Distances between products indicate similarity

_images/product_plot_2.png

Product Space

Product duplicates are products that are very similar

_images/product_plot_3.png

Cartesian Join
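
A minimal sketch of the Cartesian (all-pairs) join, assuming a hypothetical similarity function and threshold: every product is compared against every other product.

from itertools import combinations

def naive_duplicates(products, similarity, threshold=0.9):
    """Compare all n*(n-1)/2 product pairs; O(n^2) comparisons."""
    return [(p, q) for p, q in combinations(products, 2)
            if similarity(p, q) >= threshold]

For a catalogue of 10^6 products this is roughly 5 × 10^11 comparisons, which is why the brute-force join does not scale.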

Measuring Similarity

Composite Similarity Measure

Hamming Distance
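
A minimal sketch of a Hamming-style similarity over categorical attributes, counting matching attribute values; the attribute names here are illustrative assumptions, and a composite measure may weight attributes differently.

def hamming_similarity(p1, p2, attributes):
    """Fraction of categorical attributes with equal values."""
    matches = sum(p1[a] == p2[a] for a in attributes)
    return matches / len(attributes)

# Hypothetical product records with categorical attributes
ATTRS = ['brand', 'colour', 'category', 'material']
p1 = {'brand': 'acme', 'colour': 'red',
      'category': 'shoe', 'material': 'leather'}
p2 = {'brand': 'acme', 'colour': 'red',
      'category': 'shoe', 'material': 'suede'}

print(hamming_similarity(p1, p2, ATTRS))  # 0.75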

Scaling Up

_images/OPS_2013_Outlet_Berlin_01.jpg

Scaling Up De-duplication


Q: If we can tolerate missing some duplicates, can we use a trick?

A: Yes - Locality-Sensitive Hashing (LSH)

Locality-Sensitive Hashing

Jaccard Index
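
The Jaccard index of two sets is the size of their intersection over the size of their union. A minimal sketch, treating each product as a set of attribute values (an assumption for illustration):

def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({'acme', 'red', 'shoe'},
              {'acme', 'red', 'boot'}))  # 0.5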

Simple Hashing

Min-Hashing

Min-Hashing

_images/signature_matrix.png

[3]
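
A minimal min-hashing sketch: each product's attribute set is reduced to a short signature of per-hash minima, so the probability that two signatures agree in a given position equals their Jaccard similarity [3]. The salted MD5 hash family is an illustrative assumption, not the talk's exact construction.

import hashlib

def minhash_signature(values, num_hashes=100):
    """Build a min-hash signature: for each of num_hashes salted
    hash functions, keep the minimum hash over the set's values."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f'{i}:{v}'.encode()).hexdigest(), 16)
            for v in values))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions approximates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)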

LSH Band Hashing

_images/band_hashing.png
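
A minimal sketch of band hashing, building on the `minhash_signature` sketch above: split a signature of r × b rows into b bands of r rows each and hash every band down to a single value.

def band_hashes(signature, r, b):
    """Split the signature into b bands of r rows; hash each band."""
    assert len(signature) == r * b
    return [hash(tuple(signature[i*r:(i+1)*r])) for i in range(b)]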

LSH Bucketing

_images/bucketing.png
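
Bucketing then groups products by band hash: any two products sharing a bucket in at least one band become a candidate pair. A minimal sketch using the `band_hashes` helper above, assuming `signatures` maps product ids to min-hash signatures:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, r, b):
    """signatures: dict mapping product id -> min-hash signature."""
    buckets = defaultdict(set)
    for pid, sig in signatures.items():
        # Key buckets by band index so different bands don't collide
        for band, h in enumerate(band_hashes(sig, r, b)):
            buckets[(band, h)].add(pid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs  # only these pairs get an exact similarity check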

LSH Duplicate Detection

System Tuning

_images/FASH_2014_zLabels_003.jpg

Prob. of Two Products in Same Bucket?

For two products with Jaccard similarity s, where r is the number of rows per band and b is the number of bands:

\[s = \operatorname{jaccard}(P_1, P_2)\]

prob. that all rows in a band are equal: \(s^r\)

prob. that some row in a band is unequal: \(1 - s^r\)

prob. that no band is identical: \((1 - s^r)^b\)

prob. that at least 1 band is identical (the s-curve):

\[1 - (1 - s^r)^b.\]
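
For example, with \(r = 4\) rows and \(b = 16\) bands, two products with Jaccard similarity \(s = 0.5\) become a candidate pair with probability

\[1 - (1 - 0.5^4)^{16} = 1 - 0.9375^{16} \approx 0.64.\]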

LSH S-curve Tuning


_images/lsh_scurve.png

Candidate-Pair Probability

Tuning r & b for specified s-similarity

s-curve Optimisation

from scipy.optimize import minimize
from sympy import Integral, Symbol, lambdify


class SCurveOpt:

    def __init__(self, target_s_thresh, use_s_thresh_approx=False):
        self.target_s_thresh = target_s_thresh
        self.use_s_thresh_approx = use_s_thresh_approx

    # Opt. *s*-curve Threshold
    def candidate_pair_prob(self, r, b, s):
        # Prob. that two s-similar products share at least one bucket
        return 1 - (1 - s**r)**b

    def candidate_pair_prob_derivative(self, r, b, s):
        # Differentiate the s-curve symbolically, evaluate numerically at s
        s_sym = Symbol('s')
        y = self.candidate_pair_prob(r, b, s_sym)
        yprime = y.diff(s_sym)
        f = lambdify(s_sym, yprime, 'numpy')
        return f(s)

    def scurve_threshold(self, r, b):
        # (1/b)**(1/r) approximates the threshold; either use it
        # directly or refine it by locating the curve's steepest point
        x0 = (1 / b)**(1 / r)
        if self.use_s_thresh_approx:
            return x0
        func = lambda s: -self.candidate_pair_prob_derivative(r, b, s)
        res = minimize(func, x0, method='L-BFGS-B', bounds=[(0., 1.)])
        return float(res.x[0])

    # Opt. Area Under Curve
    def candidate_pair_prob_integral(self, r, b, lower, upper):
        s_sym = Symbol('s')
        return float(Integral(self.candidate_pair_prob(r, b, s_sym),
                              (s_sym, lower, upper)))

    def calc_false_positive_pct_area(self, r, b):
        # Area under the s-curve left of the threshold, as a fraction
        # of the total area: a proxy for the false-positive rate
        total_area = self.candidate_pair_prob_integral(r, b, 0., 1.)
        area_left_of_threshold = self.candidate_pair_prob_integral(
            r, b, 0., self.scurve_threshold_)
        return area_left_of_threshold / total_area

    # Opt. for Both
    def __call__(self, r, b):
        self.scurve_threshold_ = self.scurve_threshold(r, b)
        self.false_positive_area_ = self.calc_false_positive_pct_area(r, b)
        self.scurve_threshold_error_ = \
            abs(self.scurve_threshold_ - self.target_s_thresh)
        return (self.false_positive_area_, self.scurve_threshold_error_)
>>> r = 4; b = 16
>>> scurve_opt = SCurveOpt(target_s_thresh=0.5)
>>> scurve_opt(r, b)
(0.107047, 0.032862)

Optimise With Regularisation

import numpy as np
import scipy.optimize


def combine_outputs(x, alpha=1.0):
    # Blend the two objectives: false-positive area and the
    # distance of the achieved threshold from the target
    area, thresh_error = x
    return (1 - alpha)*area + alpha*thresh_error


def opt(max_hash_budget, target_s_threshold,
        alpha, use_s_thresh_approx, slices):

    # Setup function; (r, b) pairs over the hash budget score inf
    scurve = SCurveOpt(target_s_threshold, use_s_thresh_approx)
    func = lambda x: combine_outputs(scurve(x[0], x[1]), alpha) \
                         if x[0]*x[1] <= max_hash_budget else np.inf

    # Perform optimisation: exhaustive grid search over (r, b)
    x0, fval, grid, Jout = scipy.optimize.brute(
        func, slices, full_output=True, finish=None)
    return x0, fval, grid, Jout
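
A hypothetical usage sketch; the slice ranges, hash budget, and alpha are illustrative assumptions, not the talk's exact values:

# Search r in 2..9 and b in 2..39 under a budget of 200 hashes
slices = (slice(2, 10, 1), slice(2, 40, 1))
x0, fval, grid, Jout = opt(max_hash_budget=200, target_s_threshold=0.5,
                           alpha=0.5, use_s_thresh_approx=True,
                           slices=slices)
print(x0)  # best (r, b) found on the grid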

Optimal LSH Params

Opt. args: [  5.  26.] # Rows, Bands

_images/heatmap.png

Questions?

_images/OPS_2015_Box_04.jpg

References

[1]"Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping"—Mikhail Bilenko, Sugato Basu & Mehran Sahami.
[2]"Locality-Sensitive Hashing for Finding Nearest Neighbors"—Malcolm Slaney & Michael Casey.
[3]Mining of Massive Datasets—J. Leskovec, A. Rajaraman, J. Ullman, http://www.mmds.org