DOI: 10.1145/1081870.1081902
Article

Sampling-based sequential subgroup mining

Published: 21 August 2005

ABSTRACT

Subgroup discovery is a learning task that aims at finding interesting rules in classified examples. The search is guided by a utility function that trades off the coverage of rules against their statistical unusualness. One shortcoming of existing approaches is that they do not incorporate prior knowledge. To this end, a novel generic sampling strategy is proposed. It turns pattern mining into an iterative process: in each iteration, subgroup discovery focuses on those patterns that are unexpected with respect to prior knowledge and previously discovered patterns. The result of this technique is a small, diverse set of understandable rules that characterise a specified property of interest. As a further contribution, this article derives a simple connection between subgroup discovery and classifier induction. For a popular utility function, this connection allows any standard rule induction algorithm to be applied to the task of subgroup discovery after a step of stratified resampling. The proposed techniques are compared empirically to state-of-the-art subgroup discovery algorithms.
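The abstract's two central ideas can be sketched in a few lines: a utility function that multiplies the coverage of a rule by its statistical unusualness, and a sampling scheme that reweights examples so that already-discovered patterns become uninteresting in subsequent iterations. The sketch below uses weighted relative accuracy (WRAcc), a popular utility function of this form, and a stratum-wise reweighting step that makes a rule statistically independent of the class under the new weights. It is an illustration of the general idea only, not the authors' implementation; all function names and the toy data are invented for this example.

```python
def wracc(covered, positive, weights):
    """Weighted relative accuracy: coverage * (precision - prior)."""
    W = sum(weights)
    w_cov = sum(w for w, c in zip(weights, covered) if c)
    w_pos = sum(w for w, p in zip(weights, positive) if p)
    w_cp = sum(w for w, c, p in zip(weights, covered, positive) if c and p)
    if w_cov == 0:
        return 0.0
    return (w_cov / W) * (w_cp / w_cov - w_pos / W)

def reweight(covered, positive, weights):
    """Rescale each (covered, class) stratum so that rule and class
    become independent under the new weights; the discovered pattern
    then carries no further information for later iterations."""
    W = sum(weights)
    w_cov = sum(w for w, c in zip(weights, covered) if c)
    w_pos = sum(w for w, p in zip(weights, positive) if p)
    strata = {}  # (covered, positive) -> current stratum mass
    for w, c, p in zip(weights, covered, positive):
        strata[(c, p)] = strata.get((c, p), 0.0) + w
    # Target mass of each stratum under independence: W * P(rule) * P(class)
    target = {(c, p): W * ((w_cov if c else W - w_cov) / W)
                        * ((w_pos if p else W - w_pos) / W)
              for c in (True, False) for p in (True, False)}
    return [w * target[(c, p)] / strata[(c, p)]
            for w, c, p in zip(weights, covered, positive)]

# Toy data: 6 examples, a rule covering two positives and one negative.
positive = [True, True, True, False, False, False]
covered = [True, True, False, True, False, False]
weights = [1.0] * 6

before = wracc(covered, positive, weights)   # 1/12: the rule is unusual
weights = reweight(covered, positive, weights)
after = wracc(covered, positive, weights)    # 0.0: the rule is "known" now
```

After one reweighting step the same rule scores zero utility, so the next search iteration is steered toward patterns not yet explained, which is the mechanism behind the small, diverse rule sets described above.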


Published in

KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining
August 2005, 844 pages
ISBN: 159593135X
DOI: 10.1145/1081870
Copyright © 2005 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%
