Weighted minwise hashing (WMH) is one of the fundamental subroutines required by many celebrated approximation algorithms, commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck with WMH is the computation of multiple (typically a few hundred to thousands of) independent hashes of the data. We propose a simple rejection-type sampling scheme based on a carefully designed red-green map, where we show that the number of rejected samples has exactly the same distribution as weighted minwise sampling. The running time of our method, for many practical datasets, is an order of magnitude smaller than that of existing methods. Experimental evaluations, on real datasets, show that for computing 500 WMH, our proposal can be 60000x faster than Ioffe's method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient ``densified'' one permutation hashing schemes~\cite{Proc:OneHashLSHICML14,Proc:ShrivastavaUAI14}. Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.
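The core idea in the abstract — counting rejected samples until a sample is "accepted" (green), using a random sequence shared across all vectors — can be illustrated with a minimal sketch. This is not the paper's exact red-green map construction (which exploits per-coordinate upper bounds for efficiency); it is a simplified version assuming all weights lie in `[0, max_weight]`, and the names `weighted_minhash`, `max_weight`, and `seed` are illustrative. A shared pseudo-random sequence proposes points `(i, y)`; a point is green for `x` if `y < x[i]`, and the hash value is the count of red (rejected) proposals before the first green one. Two vectors collide exactly when the first point green for either is green for both, so the collision probability equals the weighted Jaccard similarity `sum_i min(x_i, y_i) / sum_i max(x_i, y_i)`.

```python
import random

def weighted_minhash(x, num_hashes, max_weight=1.0, seed=0):
    """Rejection-sampling weighted minwise hashes (simplified sketch).

    x: list of non-negative weights, each assumed to be <= max_weight.
    Returns num_hashes hash values; each hash is the number of
    rejected (red) samples before the first accepted (green) one,
    drawn from a random sequence shared across all input vectors.
    """
    d = len(x)
    hashes = []
    for k in range(num_hashes):
        # Seed per hash index so every vector sees the same sequence.
        rng = random.Random((seed, k))
        rejections = 0
        while True:
            i = rng.randrange(d)               # propose a coordinate
            y = rng.uniform(0.0, max_weight)   # propose a height
            if y < x[i]:                       # green: under the weight
                hashes.append(rejections)
                break
            rejections += 1                    # red: reject, keep going
    return hashes
```

With enough hashes, the fraction of positions where two vectors' hash values agree estimates their weighted Jaccard similarity, which is what makes the scheme usable as a drop-in WMH.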
Author Information
Anshumali Shrivastava (Rice University)
More from the Same Authors
- 2021 Spotlight: Practical Near Neighbor Search via Group Testing »
  Joshua Engels · Benjamin Coleman · Anshumali Shrivastava
- 2021: PISTACHIO: Patch Importance Sampling To Accelerate CNNs via a Hash Index Optimizer »
  Zhaozhuo Xu · Anshumali Shrivastava
- 2022: Adaptive Sparse Federated Learning in Large Output Spaces via Hashing »
  Zhaozhuo Xu · Luyang Liu · Zheng Xu · Anshumali Shrivastava
- 2023 Poster: DESSERT: An Efficient Algorithm for Vector Set Search with Vector Set Queries »
  Joshua Engels · Benjamin Coleman · Vihan Lakshman · Anshumali Shrivastava
- 2023 Poster: One-Pass Distribution Sketch for Measuring Data Heterogeneity in Federated Learning »
  Zichang Liu · Zhaozhuo Xu · Benjamin Coleman · Anshumali Shrivastava
- 2023 Poster: Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time »
  Zichang Liu · Aditya Desai · Fangshuo Liao · Weitao Wang · Victor Xie · Zhaozhuo Xu · Anastasios Kyrillidis · Anshumali Shrivastava
- 2022 Poster: The trade-offs of model size in large recommendation models: 100GB to 10MB Criteo-tb DLRM model »
  Aditya Desai · Anshumali Shrivastava
- 2022 Poster: Retaining Knowledge for Learning with Dynamic Definition »
  Zichang Liu · Benjamin Coleman · Tianyi Zhang · Anshumali Shrivastava
- 2022 Poster: Graph Reordering for Cache-Efficient Near Neighbor Search »
  Benjamin Coleman · Santiago Segarra · Alexander Smola · Anshumali Shrivastava
- 2021 Poster: Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures »
  Zhaozhuo Xu · Zhao Song · Anshumali Shrivastava
- 2021 Poster: Practical Near Neighbor Search via Group Testing »
  Joshua Engels · Benjamin Coleman · Anshumali Shrivastava
- 2021 Poster: Locality Sensitive Teaching »
  Zhaozhuo Xu · Beidi Chen · Chaojian Li · Weiyang Liu · Le Song · Yingyan Lin · Anshumali Shrivastava
- 2021 Poster: Raw Nav-merge Seismic Data to Subsurface Properties with MLP based Multi-Modal Information Unscrambler »
  Aditya Desai · Zhaozhuo Xu · Menal Gupta · Anu Chandran · Antoine Vial-Aussavy · Anshumali Shrivastava
- 2020 Poster: Adaptive Learned Bloom Filter (Ada-BF): Efficient Utilization of the Classifier with Application to Real-Time Information Filtering on the Web »
  Zhenwei Dai · Anshumali Shrivastava
- 2020 Session: Orals & Spotlights Track 03: Language/Audio Applications »
  Anshumali Shrivastava · Dilek Hakkani-Tur
- 2019 Poster: Fast and Accurate Stochastic Gradient Estimation »
  Beidi Chen · Yingchen Xu · Anshumali Shrivastava
- 2019 Poster: Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products »
  Tharun Kumar Reddy Medini · Qixuan Huang · Yiqiu Wang · Vijai Mohan · Anshumali Shrivastava
- 2018 Poster: Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements »
  Ankush Mandal · He Jiang · Anshumali Shrivastava · Vivek Sarkar
- 2014 Poster: Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS) »
  Anshumali Shrivastava · Ping Li
- 2014 Oral: Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS) »
  Anshumali Shrivastava · Ping Li
- 2013 Poster: Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search »
  Anshumali Shrivastava · Ping Li
- 2011 Poster: Hashing Algorithms for Large-Scale Learning »
  Ping Li · Anshumali Shrivastava · Joshua L Moore · Arnd C König