
13 - Parallelizing Information-Theoretic Clustering Methods

from Part Two - Supervised and Unsupervised Learning Algorithms

Published online by Cambridge University Press: 05 February 2012

Ron Bekkerman (LinkedIn Corporation, Mountain View, CA, USA)
Martin Scholz (HP Labs, Palo Alto, CA, USA)

Edited by:
Ron Bekkerman (LinkedIn Corporation, Mountain View, California)
Mikhail Bilenko (Microsoft Research, Redmond, Washington)
John Langford (Yahoo! Research, New York)

Summary

Facing the problem of clustering a multimillion-data-point collection, a machine learning practitioner may choose to apply the simplest clustering method possible, because it is hard to believe that fancier methods can be applicable to datasets of such scale. Whoever is about to adopt this approach should first weigh the following considerations:

  • Simple clustering methods are rarely effective. Indeed, four decades of research would not have been spent on data clustering if a simple method could solve the problem. Moreover, even the simplest methods may run for long hours on a modern PC when given a large-scale dataset. Consider, for example, a simple online clustering algorithm (which, we believe, is machine learning folklore): first initialize k clusters with one data point per cluster, then iteratively assign each remaining data point to its closest cluster (in Euclidean space); a minimal sketch appears after this list. If k is small enough, this algorithm can run on one machine, because it is unnecessary to keep the entire dataset in RAM. However, besides being slow, it will produce low-quality results, especially when the data is highly multidimensional.

  • State-of-the-art clustering methods can scale well, which we aim to justify in this chapter.
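
To make the folklore procedure above concrete, here is a minimal Python sketch. It is only an illustration of the description in the first bullet, not code from the chapter: the function name `online_cluster`, the use of NumPy, and the choice to measure "closest cluster" as Euclidean distance to a running-mean centroid (updated after each assignment) are our own assumptions.

```python
import numpy as np

def online_cluster(points, k):
    """Folklore online clustering sketch: seed k clusters with the first k
    points, then assign each remaining point to its closest cluster.

    Illustrative assumption (not from the chapter): "closest cluster" means
    smallest Euclidean distance to the cluster's running-mean centroid,
    which is updated incrementally after every assignment.
    """
    stream = iter(points)
    # Seed: one data point per cluster.
    centroids = np.array([next(stream) for _ in range(k)], dtype=float)
    counts = np.ones(k)                      # points assigned to each cluster
    assignments = list(range(k))             # cluster index of each seen point

    # Stream over the remaining points; only the k centroids stay in RAM.
    for x in stream:
        x = np.asarray(x, dtype=float)
        dists = np.linalg.norm(centroids - x, axis=1)   # Euclidean distances
        j = int(np.argmin(dists))                       # closest cluster
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]  # incremental mean
        assignments.append(j)
    return centroids, assignments

if __name__ == "__main__":
    # Example usage on synthetic 2-D data.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(10_000, 2))
    centroids, labels = online_cluster(data, k=5)
    print(centroids)
```

Even in this single-machine form, the quality problem noted above remains: assignments depend heavily on the order of the stream and on the initial seeds, and Euclidean distances become less informative as dimensionality grows.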

With the deployment of large computational facilities (such as Amazon.com's EC2, IBM's BlueGene, and HP's XC), the parallel computing paradigm is probably the only currently available option for tackling gigantic data processing tasks. Parallel methods are becoming an integral part of any data processing system and are thus receiving special attention (e.g., universities are introducing parallel methods into their core curricula; see Johnson et al., 2008).

Type: Chapter
Information: Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 262–280
Publisher: Cambridge University Press
Print publication year: 2011


References

Bekkerman, R., and Scholz, M. 2008. Data Weaving: Scaling Up the State-of-the-Art in Data Clustering. Pages 1083–1092 of: Proceedings of CIKM-17.
Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. 2001. On Feature Distributional Clustering for Text Categorization. Pages 146–153 of: Proceedings of SIGIR.
Bekkerman, R., El-Yaniv, R., and McCallum, A. 2005. Multi-Way Distributional Clustering via Pairwise Interactions. Pages 41–48 of: Proceedings of ICML-22.
Bekkerman, R., Sahami, M., and Learned-Miller, E. 2006. Combinatorial Markov Random Fields. In: Proceedings of ECML-17.
Besag, J. 1986. On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical Society, 48(3).
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Böhm, C., Faloutsos, C., Pan, J.-Y., and Plant, C. 2006. Robust Information-Theoretic Clustering. Pages 65–75 of: Proceedings of ACM SIGKDD.
Brent, R. P., and Luk, F. T. 1985. The Solution of Singular-Value and Symmetric Eigenvalue Problems on Multiprocessor Arrays. SIAM Journal on Scientific and Statistical Computing, 6, 69–84.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y. Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. 2006. MapReduce for Machine Learning on Multicore. In: Advances in Neural Information Processing Systems (NIPS).
Cilibrasi, R., and Vitányi, P. 2005. Clustering by Compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.
Crammer, K., Talukdar, P., and Pereira, F. 2008. A Rate-Distortion One-Class Model and Its Applications to Clustering. In: Proceedings of the 25th International Conference on Machine Learning.
Dean, J., and Ghemawat, S. 2004. MapReduce: Simplified Data Processing on Large Clusters. Pages 137–150 of: Symposium on Operating System Design and Implementation (OSDI).
Dhillon, I. S., and Modha, D. S. 2000. A Data Clustering Algorithm on Distributed Memory Multiprocessors. In: Large-Scale Parallel Data Mining. Lecture Notes in Artificial Intelligence, vol. 1759.
Dhillon, I. S., Mallela, S., and Modha, D. S. 2003. Information-Theoretic Co-clustering. Pages 89–98 of: Proceedings of SIGKDD-9.
El-Yaniv, R., and Souroujon, O. 2001. Iterative Double Clustering for Unsupervised and Semisupervised Learning. In: Advances in Neural Information Processing Systems (NIPS-14).
Forman, G., and Zhang, B. 2000. Distributed Data Clustering Can Be Efficient and Exact. SIGKDD Explorations Newsletter, 2(2), 34–38.
Friedman, N., Mosenzon, O., Slonim, N., and Tishby, N. 2001. Multivariate Information Bottleneck. In: Proceedings of UAI-17.
Gao, B., Liu, T.-Y., Zheng, X., Cheng, Q.-S., and Ma, W.-Y. 2005. Consistent Bipartite Graph Copartitioning for Star-Structured High-Order Heterogeneous Data Co-clustering. In: Proceedings of ACM SIGKDD.
Hadjidoukas, P. E., and Amsaleg, L. 2006. Parallelization of a Hierarchical Data Clustering Algorithm Using OpenMP. In: Proceedings of the International Workshop on OpenMP (IWOMP).
Johnson, M., Liao, R. H., Rasmussen, A., Sridharan, R., Garcia, D., and Harvey, B. 2008. Infusing Parallelism into Introductory Computer Science using MapReduce. In: Proceedings of SIGCSE: Symposium on Computer Science Education.
Judd, D., McKinley, P. K., and Jain, A. K. 1998. Large-Scale Parallel Data Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 871–876.
Lauritzen, S. L. 1996. Graphical Models. Oxford: Clarendon Press.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397.
McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Pages 169–178 of: Proceedings of ACM SIGKDD.
McCallum, A., Corrada-Emmanuel, A., and Wang, X. 2005. Topic and Role Discovery in Social Networks. Pages 786–791 of: Proceedings of IJCAI-19.
Rocci, R., and Vichi, M. 2008. Two-Mode Multi-partitioning. Computational Statistics and Data Analysis, 52(4).
Slonim, N., and Tishby, N. 2000. Agglomerative Information Bottleneck. Pages 617–623 of: Advances in Neural Information Processing Systems 12 (NIPS).
Slonim, N., Friedman, N., and Tishby, N. 2002. Unsupervised Document Classification Using Sequential Information Maximization. In: Proceedings of SIGIR-25.
Snir, M., Otto, S. W., Huss-Lederman, S., Walker, D. W., and Dongarra, J. 1998. MPI – The Complete Reference: Volume 1, The MPI Core, 2nd ed. Cambridge, MA: MIT Press.
Sutton, C., and McCallum, A. 2005. Piecewise Training of Undirected Models. In: Proceedings of UAI-21.
Tilton, J. C., and Strong, J. P. 1984. Analyzing Remotely Sensed Data on the Massively Parallel Processor. Pages 398–400 of: Proceedings of the 7th International Conference on Pattern Recognition.
Tishby, N., Pereira, F., and Bialek, W. 1999. The Information Bottleneck Method. Invited paper to the 37th Annual Allerton Conference on Communication, Control, and Computing.
