Timezone: »

Hashing Algorithms for Large-Scale Learning
Ping Li · Anshumali Shrivastava · Joshua L Moore · Arnd C König

Tue Dec 13 08:45 AM -- 02:59 PM (PST) @ None #None
Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare $b$-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.

Author Information

Ping Li (Baidu Research USA)
Anshumali Shrivastava (Rice University)
Joshua L Moore (Cornell University)
Arnd C König (Microsoft Research)

More from the Same Authors