Knowledge-Based Sampling for Subgroup Discovery

Scholz, Martin

doi:10.1007/11504245_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3539)

Abstract

Subgroup discovery aims at finding interesting subsets of a classified example set that deviate from the overall distribution. The search is guided by a so-called utility function, which trades the size of a subset (its coverage) against its statistical unusualness. With a suitably chosen utility function, subgroup discovery is well suited to finding interesting rules with much smaller coverage and bias than is possible with standard classifier induction algorithms. Smaller subsets can be considered local patterns, but this work uses a different definition: global patterns are all patterns that reflect the prior knowledge available to a learner, including all previously found patterns, and all further unexpected regularities in the data are referred to as local patterns. To address local pattern mining in this scenario, subgroup discovery is extended by the knowledge-based sampling approach to iterative model refinement. This approach is a general, cheap way of incorporating prior probabilistic knowledge in arbitrary form into data mining algorithms that address supervised learning tasks.
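
The trade-off between coverage and unusualness, and the idea of sampling prior knowledge "out" of the data, can be illustrated with a small sketch. It makes several assumptions that the abstract does not state: the utility shown is weighted relative accuracy (coverage times bias), which is one commonly used choice in the subgroup discovery literature; the reweighting step divides each example weight by the lift of its (rule, label) cell, which is a simplified illustration and not necessarily the chapter's exact procedure; and the names `wracc` and `reweight_by_inverse_lift` are hypothetical.

```python
from typing import List, Sequence


def wracc(covered: Sequence[bool], positive: Sequence[bool],
          weights: Sequence[float]) -> float:
    """Assumed utility: coverage * (P(pos | covered) - P(pos)), on weighted examples."""
    total = sum(weights)
    w_cov = sum(w for w, c in zip(weights, covered) if c)
    if total == 0 or w_cov == 0:
        return 0.0
    w_pos = sum(w for w, p in zip(weights, positive) if p)
    w_cov_pos = sum(w for w, c, p in zip(weights, covered, positive) if c and p)
    coverage = w_cov / total                      # relative size of the subgroup
    bias = w_cov_pos / w_cov - w_pos / total      # deviation from the class prior
    return coverage * bias


def reweight_by_inverse_lift(covered: Sequence[bool], positive: Sequence[bool],
                             weights: Sequence[float]) -> List[float]:
    """Illustrative reweighting: divide each weight by the lift of its
    (covered, positive) cell, so the known rule and the label become
    independent while both marginals stay unchanged."""
    total = sum(weights)
    cell = {}
    for w, c, p in zip(weights, covered, positive):
        cell[(c, p)] = cell.get((c, p), 0.0) + w
    new_weights = []
    for w, c, p in zip(weights, covered, positive):
        joint = cell[(c, p)] / total
        p_c = (cell.get((c, True), 0.0) + cell.get((c, False), 0.0)) / total
        p_p = (cell.get((True, p), 0.0) + cell.get((False, p), 0.0)) / total
        if joint == 0:
            new_weights.append(0.0)               # cell carries no weight at all
        else:
            new_weights.append(w * p_c * p_p / joint)  # w / lift
    return new_weights


# Toy data: a rule covering 3 of 8 examples, 2 of them positive.
covered = [True, True, True, False, False, False, False, False]
positive = [True, True, False, True, False, False, False, False]
weights = [1.0] * 8

before = wracc(covered, positive, weights)
after = wracc(covered, positive,
              reweight_by_inverse_lift(covered, positive, weights))
```

In this toy setting, the already known rule has a positive utility under the original weights and a utility of (numerically) zero under the new weights, so a subsequent search on the reweighted data would only reward regularities that are unexpected given that rule.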

Author information

Authors and Affiliations

Artificial Intelligence Group, Department of Computer Science, University of Dortmund, Germany
Martin Scholz

Editor information

Editors and Affiliations

Computer Science VIII, Artificial Intelligence Unit, Technische Universität Dortmund, 44221, Dortmund, Germany
Katharina Morik
INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Department of Computer Science, Universiteit Utrecht
Arno Siebes

About this paper

Cite this paper

Scholz, M. (2005). Knowledge-Based Sampling for Subgroup Discovery. In: Morik, K., Boulicaut, JF., Siebes, A. (eds) Local Pattern Detection. Lecture Notes in Computer Science, vol 3539. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11504245_11

DOI: https://doi.org/10.1007/11504245_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26543-6
Online ISBN: 978-3-540-31894-1
eBook Packages: Computer Science, Computer Science (R0)
