Wu, W, Li, B, Chen, L & Zhang, C 2016, 'Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash', Proceedings of the 2016 IEEE 16th International Conference on Data Mining, IEEE International Conference on Data Mining, IEEE, Barcelona, Spain, pp. 1287-1292.
Min-Hash, a member of the Locality Sensitive Hashing (LSH) family for sketching sets, plays an important role in the big data era. It is widely used for efficiently estimating similarities between data represented as bags of words, and it has been extended to handle multi-sets and real-value weighted sets. Improved Consistent Weighted Sampling (ICWS) has been recognized as the state of the art for real-value weighted Min-Hash. However, the algorithmic implementation of ICWS is flawed because it violates the uniformity of the Min-Hash scheme. In this paper, we propose a Canonical Consistent Weighted Sampling (CCWS) algorithm, which not only retains the same theoretical complexity as ICWS but also strictly complies with the definition of Min-Hash. The experimental results demonstrate that the proposed CCWS algorithm runs faster than the state-of-the-art methods while achieving similar classification performance on a number of real-world text data sets.
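For readers unfamiliar with the setting, the following is a minimal Python sketch of the two notions the abstract relies on: plain Min-Hash on unweighted sets, and the generalized Jaccard similarity that weighted Min-Hash schemes aim to estimate. It is not the ICWS or CCWS algorithm, whose details are not given here. The function names, the use of salted BLAKE2b as the hash family, and the choice of 256 hash functions are illustrative assumptions, not taken from the paper.

```python
import hashlib


def generalized_jaccard(s, t):
    """Generalized Jaccard similarity of two weighted sets.

    Each weighted set is a dict mapping element -> non-negative real weight.
    The similarity is sum(min weights) / sum(max weights) over all elements.
    """
    keys = set(s) | set(t)
    num = sum(min(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    den = sum(max(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def _hash(item, index):
    # Assumption: a salted BLAKE2b digest stands in for the i-th random hash function.
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """Min-Hash signature of a non-empty, unweighted set of strings."""
    return [min(_hash(x, i) for x in items) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that collide; estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Weighted example: min-sum is 0.5 + 2.0 = 2.5, max-sum is 1.0 + 3.0 + 1.0 = 5.0.
print(generalized_jaccard({"a": 1.0, "b": 2.0}, {"a": 0.5, "b": 3.0, "c": 1.0}))  # 0.5

# Unweighted example: the exact Jaccard similarity is 2/6; the estimate is approximate.
doc_a = {"min", "hash", "weighted", "sampling"}
doc_b = {"weighted", "sampling", "consistent", "canonical"}
print(estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

Each signature position collides with probability equal to the Jaccard similarity, so the estimate sharpens as the number of hash functions grows. Weighted schemes such as ICWS and CCWS target the same collision property for the generalized Jaccard similarity over real-valued weights.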