
13 - Parallelizing Information-Theoretic Clustering Methods

from Part Two - Supervised and Unsupervised Learning Algorithms

Published online by Cambridge University Press: 05 February 2012

Ron Bekkerman (LinkedIn Corporation, Mountain View, CA, USA)
Martin Scholz (HP Labs, Palo Alto, CA, USA)

Edited by:
Ron Bekkerman (LinkedIn Corporation, Mountain View, California)
Mikhail Bilenko (Microsoft Research, Redmond, Washington)
John Langford (Yahoo! Research, New York)

Summary

Facing the problem of clustering a multimillion-data-point collection, a machine learning practitioner may choose to apply the simplest clustering method possible, because it is hard to believe that fancier methods can be applicable to datasets of such scale. Whoever is about to adopt this approach should first weigh the following considerations:

  • Simple clustering methods are rarely effective. Indeed, four decades of research would not have been spent on data clustering if a simple method could solve the problem. Moreover, even the simplest methods may run for long hours on a modern PC when given a large-scale dataset. Consider, for example, a simple online clustering algorithm (which, we believe, is machine learning folklore): first initialize k clusters with one data point per cluster, then iteratively assign each remaining data point to its closest cluster (in Euclidean space); a minimal sketch appears after this list. If k is small enough, this algorithm can run on one machine, because it is unnecessary to keep the entire dataset in RAM. However, besides being slow, it will produce low-quality results, especially when the data is highly multidimensional.

  • State-of-the-art clustering methods can scale well, which we aim to justify in this chapter.
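
To make the folklore procedure above concrete, here is a minimal Python sketch. It is only an illustration of the description in the first bullet, not code from the chapter: the function name `online_cluster`, the use of NumPy, and the choice to measure "closest cluster" as Euclidean distance to a running-mean centroid (updated after each assignment) are our own assumptions.

```python
import numpy as np

def online_cluster(points, k):
    """Folklore online clustering sketch: seed k clusters with the first k
    points, then assign each remaining point to its closest cluster.

    Illustrative assumption (not from the chapter): "closest cluster" means
    smallest Euclidean distance to the cluster's running-mean centroid,
    which is updated incrementally after every assignment.
    """
    stream = iter(points)
    # Seed: one data point per cluster.
    centroids = np.array([next(stream) for _ in range(k)], dtype=float)
    counts = np.ones(k)                      # points assigned to each cluster
    assignments = list(range(k))             # cluster index of each seen point

    # Stream over the remaining points; only the k centroids stay in RAM.
    for x in stream:
        x = np.asarray(x, dtype=float)
        dists = np.linalg.norm(centroids - x, axis=1)   # Euclidean distances
        j = int(np.argmin(dists))                       # closest cluster
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]  # incremental mean
        assignments.append(j)
    return centroids, assignments

if __name__ == "__main__":
    # Example usage on synthetic 2-D data.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(10_000, 2))
    centroids, labels = online_cluster(data, k=5)
    print(centroids)
```

Even in this single-machine form, the quality problem noted above remains: assignments depend heavily on the order of the stream and on the initial seeds, and Euclidean distances become less informative as dimensionality grows.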

With the deployment of large computational facilities (such as Amazon.com's EC2, IBM's BlueGene, and HP's XC), the parallel computing paradigm is probably the only currently available option for tackling gigantic data processing tasks. Parallel methods are becoming an integral part of any data processing system and are thus receiving special attention (e.g., universities are introducing parallel methods into their core curricula; see Johnson et al., 2008).

Type: Chapter
Information: Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 262–280
Publisher: Cambridge University Press
Print publication year: 2011


References

Bekkerman, R., and Scholz, M. 2008. Data Weaving: Scaling Up the State-of-the-Art in Data Clustering. Pages 1083–1092 of: Proceedings of CIKM-17.
Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. 2001. On Feature Distributional Clustering for Text Categorization. Pages 146–153 of: Proceedings of SIGIR.
Bekkerman, R., El-Yaniv, R., and McCallum, A. 2005. Multi-Way Distributional Clustering via Pairwise Interactions. Pages 41–48 of: Proceedings of ICML-22.
Bekkerman, R., Sahami, M., and Learned-Miller, E. 2006. Combinatorial Markov Random Fields. In: Proceedings of ECML-17.
Besag, J. 1986. On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical Society, 48(3).
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Böhm, C., Faloutsos, C., Pan, J.-Y., and Plant, C. 2006. Robust Information-Theoretic Clustering. Pages 65–75 of: Proceedings of ACM SIGKDD.
Brent, R. P., and Luk, F. T. 1985. The Solution of Singular-Value and Symmetric Eigenvalue Problems on Multiprocessor Arrays. SIAM Journal on Scientific and Statistical Computing, 6, 69–84.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y. Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. 2006. MapReduce for Machine Learning on Multicore. In: Advances in Neural Information Processing Systems (NIPS).
Cilibrasi, R., and Vitányi, P. 2005. Clustering by Compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.
Crammer, K., Talukdar, P., and Pereira, F. 2008. A Rate-Distortion One-Class Model and Its Applications to Clustering. In: Proceedings of the 25th International Conference on Machine Learning.
Dean, J., and Ghemawat, S. 2004. MapReduce: Simplified Data Processing on Large Clusters. Pages 137–150 of: Symposium on Operating System Design and Implementation (OSDI).
Dhillon, I. S., and Modha, D. S. 2000. A Data Clustering Algorithm on Distributed Memory Multiprocessors. In: Large-Scale Parallel Data Mining. Lecture Notes in Artificial Intelligence, vol. 1759.
Dhillon, I. S., Mallela, S., and Modha, D. S. 2003. Information-Theoretic Co-clustering. Pages 89–98 of: Proceedings of SIGKDD-9.
El-Yaniv, R., and Souroujon, O. 2001. Iterative Double Clustering for Unsupervised and Semisupervised Learning. In: Advances in Neural Information Processing Systems (NIPS-14).
Forman, G., and Zhang, B. 2000. Distributed Data Clustering Can Be Efficient and Exact. SIGKDD Explorations Newsletter, 2(2), 34–38.
Friedman, N., Mosenzon, O., Slonim, N., and Tishby, N. 2001. Multivariate Information Bottleneck. In: Proceedings of UAI-17.
Gao, B., Liu, T.-Y., Zheng, X., Cheng, Q.-S., and Ma, W.-Y. 2005. Consistent Bipartite Graph Copartitioning for Star-Structured High-Order Heterogeneous Data Co-clustering. In: Proceedings of ACM SIGKDD.
Hadjidoukas, P. E., and Amsaleg, L. 2006. Parallelization of a Hierarchical Data Clustering Algorithm Using OpenMP. In: Proceedings of the International Workshop on OpenMP (IWOMP).
Johnson, M., Liao, R. H., Rasmussen, A., Sridharan, R., Garcia, D., and Harvey, B. 2008. Infusing Parallelism into Introductory Computer Science using MapReduce. In: Proceedings of SIGCSE: Symposium on Computer Science Education.
Judd, D., McKinley, P. K., and Jain, A. K. 1998. Large-Scale Parallel Data Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 871–876.
Lauritzen, S. L. 1996. Graphical Models. Oxford: Clarendon Press.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397.
McCallum, A., Nigam, K., and Ungar, L. H. 2000. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Pages 169–178 of: Proceedings of ACM SIGKDD.
McCallum, A., Corrada-Emmanuel, A., and Wang, X. 2005. Topic and Role Discovery in Social Networks. Pages 786–791 of: Proceedings of IJCAI-19.
Rocci, R., and Vichi, M. 2008. Two-Mode Multi-partitioning. Computational Statistics and Data Analysis, 52(4).
Slonim, N., and Tishby, N. 2000. Agglomerative Information Bottleneck. Pages 617–623 of: Advances in Neural Information Processing Systems 12 (NIPS).
Slonim, N., Friedman, N., and Tishby, N. 2002. Unsupervised Document Classification Using Sequential Information Maximization. In: Proceedings of SIGIR-25.
Snir, M., Otto, S. W., Huss-Lederman, S., Walker, D. W., and Dongarra, J. 1998. MPI – The Complete Reference: Volume 1, The MPI Core, 2nd ed. Cambridge, MA: MIT Press.
Sutton, C., and McCallum, A. 2005. Piecewise Training of Undirected Models. In: Proceedings of UAI-21.
Tilton, J. C., and Strong, J. P. 1984. Analyzing Remotely Sensed Data on the Massively Parallel Processor. Pages 398–400 of: Proceedings of the 7th International Conference on Pattern Recognition.
Tishby, N., Pereira, F., and Bialek, W. 1999. The Information Bottleneck Method. Invited paper to the 37th Annual Allerton Conference on Communication, Control, and Computing.
