ABSTRACT
Subgroup discovery is a learning task that aims at finding interesting rules from classified examples. The search is guided by a utility function that trades off the coverage of a rule against its statistical unusualness. One shortcoming of existing approaches is that they do not incorporate prior knowledge. To this end, a novel generic sampling strategy is proposed that turns pattern mining into an iterative process. In each iteration, subgroup discovery focuses on those patterns that are unexpected with respect to prior knowledge and previously discovered patterns. The result of this technique is a small, diverse set of understandable rules that characterise a specified property of interest. As a further contribution, this article derives a simple connection between subgroup discovery and classifier induction. For a popular utility function, this connection allows any standard rule induction algorithm to be applied to the task of subgroup discovery after a step of stratified resampling. The proposed techniques are empirically compared to state-of-the-art subgroup discovery algorithms.
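A common instance of the utility function described above is weighted relative accuracy (WRAcc), which multiplies a rule's coverage by its deviation from the overall class distribution. The following is a minimal sketch, assuming boolean coverage and class indicators per example; the function name `wracc` and the data layout are illustrative, not taken from the paper:

```python
def wracc(covered, positive):
    """Weighted relative accuracy of a rule R for target class Y.

    covered  -- list of bools: does rule R cover example i?
    positive -- list of bools: does example i belong to class Y?

    WRAcc(R) = P(R) * (P(Y | R) - P(Y)),
    i.e. coverage times the rule's gain over the default class rate.
    """
    n = len(covered)
    n_cov = sum(covered)
    if n_cov == 0:
        return 0.0  # a rule covering nothing is not unusual
    p_cov = n_cov / n
    p_pos = sum(positive) / n
    p_pos_given_cov = sum(c and p for c, p in zip(covered, positive)) / n_cov
    return p_cov * (p_pos_given_cov - p_pos)


# A rule covering 3 of 4 examples, 2 of them positive, against a 50% base rate:
score = wracc([True, True, True, False], [True, True, False, False])
# 0.75 * (2/3 - 0.5) = 0.125
```

The product form makes the trade-off explicit: a highly accurate rule covering almost no examples scores low, as does a broad rule whose class distribution matches the default.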
- C. Blake and C. Merz. UCI Repository of Machine Learning Databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
- L. Breiman. Random Forests. Machine Learning, 45(1):5--32, 2001.
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD '97), pages 255--264, Tucson, AZ, 1997. ACM.
- P. Cunningham and J. Carney. Diversity versus Quality in Classification Ensembles Based on Feature Selection. In Proceedings of the 11th European Conference on Machine Learning (ECML 2000), pages 109--116, Barcelona, Spain, 2000. Springer.
- T. Fawcett. ROC Graphs: Notes and Practical Considerations for Researchers, 2004. Submitted to Machine Learning.
- P. A. Flach. The Geometry of ROC Space: Understanding Machine Learning Metrics through ROC Isometrics. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). Morgan Kaufmann, 2003.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139, 1997.
- J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337--374, 2000.
- J. Fürnkranz and P. Flach. ROC 'n' Rule Learning -- Towards a Better Understanding of Covering Algorithms. Machine Learning, 58(1):39--77, 2005.
- S. Jaroszewicz and D. A. Simovici. Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD-2004). AAAI Press, August 2004.
- G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338--345. Morgan Kaufmann, 1995.
- W. Klösgen. Explora: A Multipattern and Multistrategy Discovery Assistant. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 3, pages 249--272. AAAI Press/The MIT Press, Menlo Park, California, 1996.
- N. Lavrac, P. Flach, and B. Zupan. Rule Evaluation Measures: A Unifying View. In Proceedings of the 9th International Workshop on Inductive Logic Programming, Lecture Notes in Computer Science. Springer, 1999.
- N. Lavrac, B. Kavsek, P. Flach, and L. Todorovski. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5:153--188, February 2004.
- N. Lavrac, F. Zelezny, and P. Flach. RSD: Relational subgroup discovery through first-order feature construction. In Proceedings of the 12th International Conference on Inductive Logic Programming. Springer, 2002.
- D. MacKay. Introduction to Monte Carlo Methods. In Learning in Graphical Models, pages 175--204. 1998.
- I. Mierswa, R. Klinkenberg, S. Fischer, and O. Ritthoff. A Flexible Platform for Knowledge Discovery Experiments: YALE -- Yet Another Learning Environment. In LLWA 03 -- Tagungsband der GI-Workshop-Woche Lernen -- Lehren -- Wissen -- Adaptivität, 2003.
- T. M. Mitchell. Machine Learning. McGraw Hill, New York, 1997.
- R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5:197--227, 1990.
- R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. Incorporating Prior Knowledge into Boosting. In Proceedings of the 19th International Conference on Machine Learning (ICML-02), 2002.
- R. E. Schapire and Y. Singer. Improved Boosting Using Confidence-rated Predictions. Machine Learning, 37(3):297--336, 1999.
- T. Scheffer and S. Wrobel. Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling. Journal of Machine Learning Research, 3:833--862, 2002.
- M. Scholz. Knowledge-Based Sampling for Subgroup Discovery. In K. Morik, J.-F. Boulicaut, and A. Siebes, editors, Proceedings of the Workshop on Detecting Local Patterns, Lecture Notes in Computer Science. Springer, 2005. To appear.
- A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970--974, December 1996.
- E. Suzuki. Discovering Interesting Exception Rules with Rule Pair. In ECML/PKDD 2004 Workshop, Advances in Inductive Rule Learning, 2004.
- I. Witten and E. Frank. Data Mining -- Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
- S. Wrobel. An Algorithm for Multi-relational Discovery of Subgroups. In J. Komorowski and J. Zytkow, editors, Principles of Data Mining and Knowledge Discovery: First European Symposium (PKDD 97), pages 78--87, Berlin, New York, 1997. Springer.
- X. Wu and R. Srihari. Incorporating Prior Knowledge with Weighted Margin Support Vector Machines. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD-2004). AAAI Press, August 2004.
- B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. In Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM'03), 2003.
Index Terms
- Sampling-based sequential subgroup mining