ABSTRACT
Subgroup discovery is a learning task that aims at finding interesting rules from classified examples. The search is guided by a utility function that trades off the coverage of a rule against its statistical unusualness. One shortcoming of existing approaches is that they do not incorporate prior knowledge. To this end, a novel generic sampling strategy is proposed that turns pattern mining into an iterative process. In each iteration, subgroup discovery focuses on those patterns that are unexpected with respect to prior knowledge and previously discovered patterns. The result of this technique is a small, diverse set of understandable rules that characterise a specified property of interest. As a further contribution, this article derives a simple connection between subgroup discovery and classifier induction. For a popular utility function, this connection allows any standard rule induction algorithm to be applied to the task of subgroup discovery after a step of stratified resampling. The proposed techniques are empirically compared to state-of-the-art subgroup discovery algorithms.
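A common instance of the utility function described above is weighted relative accuracy (WRAcc), which multiplies a rule's coverage by its deviation from the overall class distribution. The following is a minimal sketch, assuming boolean coverage and class indicators per example; the function name `wracc` and the data layout are illustrative, not taken from the paper:

```python
def wracc(covered, positive):
    """Weighted relative accuracy of a rule R for target class Y.

    covered  -- list of bools: does rule R cover example i?
    positive -- list of bools: does example i belong to class Y?

    WRAcc(R) = P(R) * (P(Y | R) - P(Y)),
    i.e. coverage times the rule's gain over the default class rate.
    """
    n = len(covered)
    n_cov = sum(covered)
    if n_cov == 0:
        return 0.0  # a rule covering nothing is not unusual
    p_cov = n_cov / n
    p_pos = sum(positive) / n
    p_pos_given_cov = sum(c and p for c, p in zip(covered, positive)) / n_cov
    return p_cov * (p_pos_given_cov - p_pos)


# A rule covering 3 of 4 examples, 2 of them positive, against a 50% base rate:
score = wracc([True, True, True, False], [True, True, False, False])
# 0.75 * (2/3 - 0.5) = 0.125
```

The product form makes the trade-off explicit: a highly accurate rule covering almost no examples scores low, as does a broad rule whose class distribution matches the default.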
- C. Blake and C. Merz. UCI Repository of Machine Learning Databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
- L. Breiman. Random Forests. Machine Learning, 45(1):5--32, 2001.
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '97), pages 255--264, Tucson, AZ, 1997. ACM.
- P. Cunningham and J. Carney. Diversity versus Quality in Classification Ensembles Based on Feature Selection. In Proceedings of the 11th European Conference on Machine Learning (ECML 2000), pages 109--116, Barcelona, Spain, 2000. Springer.
- T. Fawcett. ROC Graphs: Notes and Practical Considerations for Researchers, 2004. Submitted to Machine Learning.
- P. A. Flach. The Geometry of ROC Space: Understanding Machine Learning Metrics through ROC Isometrics. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). Morgan Kaufmann, 2003.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139, 1997.
- J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337--374, 2000.
- J. Fürnkranz and P. Flach. ROC 'n' Rule Learning -- Towards a Better Understanding of Covering Algorithms. Machine Learning, 58(1):39--77, 2005.
- S. Jaroszewicz and D. A. Simovici. Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD-2004). AAAI Press, August 2004.
- G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338--345. Morgan Kaufmann, 1995.
- W. Klösgen. Explora: A Multipattern and Multistrategy Discovery Assistant. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 3, pages 249--272. AAAI Press/The MIT Press, Menlo Park, California, 1996.
- N. Lavrac, P. Flach, and B. Zupan. Rule Evaluation Measures: A Unifying View. In Proceedings of the 9th International Workshop on Inductive Logic Programming, Lecture Notes in Computer Science. Springer, 1999.
- N. Lavrac, B. Kavsek, P. Flach, and L. Todorovski. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5:153--188, February 2004.
- N. Lavrac, F. Zelezny, and P. Flach. RSD: Relational subgroup discovery through first-order feature construction. In Proceedings of the 12th International Conference on Inductive Logic Programming. Springer, 2002.
- D. MacKay. Introduction to Monte Carlo Methods. In Learning in Graphical Models, pages 175--204. 1998.
- I. Mierswa, R. Klinkenberg, S. Fischer, and O. Ritthoff. A Flexible Platform for Knowledge Discovery Experiments: YALE -- Yet Another Learning Environment. In LLWA 03 -- Tagungsband der GI-Workshop-Woche Lernen -- Lehren -- Wissen -- Adaptivität, 2003.
- T. M. Mitchell. Machine Learning. McGraw Hill, New York, 1997.
- R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5:197--227, 1990.
- R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. Incorporating Prior Knowledge into Boosting. In Proceedings of the 19th International Conference on Machine Learning (ICML-02), 2002.
- R. E. Schapire and Y. Singer. Improved Boosting Using Confidence-rated Predictions. Machine Learning, 37(3):297--336, 1999.
- T. Scheffer and S. Wrobel. Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling. Journal of Machine Learning Research, 3:833--862, 2002.
- M. Scholz. Knowledge-Based Sampling for Subgroup Discovery. In K. Morik, J.-F. Boulicaut, and A. Siebes, editors, Proceedings of the Workshop on Detecting Local Patterns, Lecture Notes in Computer Science. Springer, 2005. To appear.
- A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970--974, December 1996.
- E. Suzuki. Discovering Interesting Exception Rules with Rule Pair. In ECML/PKDD 2004 Workshop, Advances in Inductive Rule Learning, 2004.
- I. Witten and E. Frank. Data Mining -- Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
- S. Wrobel. An Algorithm for Multi-relational Discovery of Subgroups. In J. Komorowski and J. Zytkow, editors, Principles of Data Mining and Knowledge Discovery: First European Symposium (PKDD 97), pages 78--87, Berlin, New York, 1997. Springer.
- X. Wu and R. Srihari. Incorporating Prior Knowledge with Weighted Margin Support Vector Machines. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD-2004). AAAI Press, August 2004.
- B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. In Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM'03), 2003.
Index Terms
- Sampling-based sequential subgroup mining