Poster

Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Ping Li · Xiaoyun Li · Cun-Hui Zhang

East Exhibition Hall B + C #34

Keywords: [ Nonlinear Dimensionality Reduction and Manifold Learning ] [ Algorithms ]


Abstract:

Jaccard similarity is widely used as a distance measure in many machine learning and search applications. Typically, hashing methods are essential for the use of Jaccard similarity to be practical in large-scale settings. For hashing binary (0/1) data, the idea of one permutation hashing (OPH) with densification significantly accelerates traditional minwise hashing algorithms while providing unbiased and accurate estimates. In this paper, we propose a strategy named “re-randomization” in the process of densification that could achieve the smallest variance among all densification schemes. The success of this idea naturally inspires us to generalize one permutation hashing to weighted (non-binary) data, which results in the so-called “bin-wise consistent weighted sampling (BCWS)” algorithm. We analyze the behavior of BCWS and compare it with a recent alternative. Extensive experiments on various datasets illustrate the effectiveness of our proposed methods.
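For readers unfamiliar with the setting, the following is a minimal Python sketch of one permutation hashing followed by a simple densification step. It is not the re-randomized scheme proposed in the paper, and it makes no claim about unbiasedness or variance. The function names (`jaccard`, `oph_sketch`, `densify`, `estimate`), the seeds, and the probe-based filling rule are all assumptions chosen for illustration only.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def oph_sketch(items, dim, num_bins, seed=0):
    """One permutation hashing: permute the index space once, split it
    into equal-width bins, and keep the smallest offset seen in each bin.
    Bins that receive no item stay None (empty)."""
    assert dim % num_bins == 0, "dim must be divisible by num_bins"
    perm = list(range(dim))
    random.Random(seed).shuffle(perm)  # same seed gives the same permutation
    bin_size = dim // num_bins
    bins = [None] * num_bins
    for i in items:
        b, offset = divmod(perm[i], bin_size)
        if bins[b] is None or offset < bins[b]:
            bins[b] = offset
    return bins


def densify(bins, seed=1):
    """Illustrative densification (an assumption, not the paper's method):
    each empty bin j follows a probe sequence that depends only on (seed, j)
    and copies the first non-empty bin it hits. Values are tagged with their
    source bin so that copies from different bins never match."""
    k = len(bins)
    if all(v is None for v in bins):
        raise ValueError("cannot densify the sketch of an empty set")
    out = []
    for j in range(k):
        if bins[j] is not None:
            out.append((j, bins[j]))
            continue
        rng = random.Random(seed * 1_000_003 + j)  # shared across all sets
        while True:
            src = rng.randrange(k)
            if bins[src] is not None:
                out.append((src, bins[src]))
                break
    return out


def estimate(s1, s2):
    """Fraction of positions where two densified sketches agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)
```

A typical use would build `densify(oph_sketch(s, dim, num_bins))` for each set with the same seeds and compare the result of `estimate` against `jaccard` on small inputs.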
