
RDE: A novel approach to improve the classification performance and expressivity of KDB

  • Hua Lou ,

    Roles Investigation, Methodology, Writing – original draft

    ccit-louhua@139.com

    Affiliation Changzhou College of Information Technology, Changzhou, China

  • LiMin Wang,

    Roles Resources, Software, Writing – original draft

    Affiliation College of Computer Science and Technology, Jilin University, Changchun, China

  • DingBo Duan,

    Roles Formal analysis, Software

    Affiliation Changzhou College of Information Technology, ChangZhou, China

  • Cheng Yang,

    Roles Software, Validation

    Affiliation Changzhou College of Information Technology, ChangZhou, China

  • Musa Mammadov

    Roles Resources, Validation, Writing – original draft

    Affiliation Faculty of Science and Technology, Federation University, Ballarat, Australia

Abstract

Bayesian network classifiers (BNCs) have demonstrated competitive classification performance in a variety of real-world applications. A highly scalable BNC with high expressivity is extremely desirable. This paper proposes Redundant Dependency Elimination (RDE) for improving the classification performance and expressivity of the k-dependence Bayesian classifier (KDB). To capture the unique characteristics of each case, RDE identifies redundant conditional dependencies and then substitutes or removes them. The learned personalized k-dependence Bayesian classifier (PKDB) can achieve high-confidence conditional probabilities and can graphically interpret the dependency relationships between attributes. Two thyroid cancer datasets and four other cancer datasets from the UCI machine learning repository are selected for our experimental study. The experimental results prove the effectiveness of the proposed algorithm in terms of zero-one loss, bias, variance and AUC.

Introduction

Data mining is the analysis step of the “knowledge discovery in databases” process; its goal is the extraction of patterns and knowledge from large amounts of data. During the past decades, statistical models, such as Bayesian networks, neural networks and support vector machines, have been proposed and applied in many real-life applications, e.g., precision medicine. Due to the high prediction performance of these statistical models, researchers would like to gain an understanding of the reasons behind a prediction, especially when the prediction contradicts their intuition. For example, physicians are typically not only interested in the final prediction, but would also like to understand the underlying inference procedure that may help explain why the system makes a certain recommendation. An explanatory, causal and graphical model is more desirable to visualize and mine previously undiscovered knowledge from data [1].

Bayesian network classifiers (BNCs) have long been a popular tool for graphically representing probabilistic dependencies and inferring under conditions of uncertainty [2–5]. Numerous BNCs (e.g., Naive Bayes (NB) [6], tree-augmented Naive Bayes (TAN) [7], Averaged One-Dependence Estimators (AODE) [8] and the k-dependence Bayesian classifier (KDB) [9–11]) have been proposed to mine dependency relationships from data. Among them, KDB can generalize to describe any higher degree of attribute dependence. KDB provides an “average network” to express significant dependencies, and this “one size fits all” solution obviously cannot apply to all cases. Patients with similar symptoms may have different kinds of diseases. For example, because of its low incidence rate, AIDS (Acquired Immune Deficiency Syndrome) at an early stage is often diagnosed as influenza [12]. How to give each case a “personalized network”, which can describe the dependency relationships among the specific characteristics or attribute values of that case, is still challenging. The local graph structure [2] takes each case, i.e., each unlabeled testing instance, as a target and can describe the local causal relationships implicated in it. However, the number of conditional dependencies in this local structure is determined by the user-specified parameter k. Some redundant dependencies should be replaced with more meaningful or “personalized” dependencies that only hold in specific instances.

In this paper, a new approach, called Redundant Dependency Elimination (RDE), is proposed to identify redundant conditional dependencies in the local structure learned from the testing instance x (denoted KDBx below) and then substitute or remove them at classification time. The resulting optimized network structure, denoted by KDBO, can increase the confidence level of the conditional probabilities. The final personalized classifier, PKDB, is an ensemble of the models learned from the training data and from the testing instance, respectively. PKDB combines the computational efficiency of classical generative learning with control of the bias/variance trade-off. Two thyroid disease datasets and four other cancer datasets from the UCI machine learning repository are selected for our experimental study. The experimental results show the advantages of PKDB over other classifiers.

Materials and methods

Classifiers

LibSVM [13] and Random Forest (RF) [14] are introduced in this paper for the comparison study. We use Weka’s implementation and default settings of Random Forest, with the exception of using 20 decision trees. We use Weka’s implementation of LibSVM and perform a grid search on C and γ for the RBF kernel using 5-fold cross-validation. Each pair (C, γ) is tried (C = 2^−5, 2^−3, ⋯, 2^15; γ = 2^−15, 2^−13, ⋯, 2^3) and the pair with the lowest cross-validation zero-one loss is selected. For clarity, the abbreviations of the algorithms mentioned above are shown in Table 1.
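For readers who wish to reproduce this model-selection step, the following is a minimal sketch using scikit-learn’s SVC as a stand-in for Weka’s LibSVM wrapper (an assumption; the original study used Weka). The data arrays are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical placeholders for the training data.
X_train = np.random.rand(200, 10)
y_train = np.random.randint(0, 2, 200)

# The grid described above: C = 2^-5, 2^-3, ..., 2^15; gamma = 2^-15, ..., 2^3.
param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],
}

# 5-fold cross-validation; maximizing accuracy = minimizing zero-one loss.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
```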

Table 1. Abbreviation of algorithms introduced in this paper.

https://doi.org/10.1371/journal.pone.0199822.t001

Data

Six datasets from the UCI machine learning repository [15] are selected in this paper for the case study. A detailed introduction to these datasets is shown in Table 2, which summarizes the characteristics of each dataset, including the numbers of instances, attributes and classes. For each benchmark dataset, quantitative attributes are discretized using MDL discretization [16].

Metrics

Zero-one loss is one of the most commonly used metrics to measure the classification performance of a classifier: it measures how well a classifier correctly identifies or discriminates an unlabeled instance. Let X and Y be the input and output spaces, with elements x and y respectively. The zero-one loss for instance x is defined as [17]

ξ(x) = 1 − δ(y, ŷ),

where δ(y, ŷ) = 1 if ŷ = y and zero otherwise; y and ŷ are respectively the true class label and the predicted label of x. Kohavi and Wolpert presented a bias-variance decomposition of the zero-one loss function [17]. The bias term measures the squared difference between the average output of the target and that of the algorithm, and is defined as follows [17]:

bias² = (1/2) Σ_{y∈Y} [P(y|x) − P̂(y|x)]².

The variance term measures the sensitivity of the algorithm to changes in the training set, and is defined as follows [17]:

variance = (1/2) [1 − Σ_{y∈Y} P̂(y|x)²].
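As an illustration, the per-instance bias and variance terms above can be computed directly once the true class distribution P(y|x) and the estimated distribution P̂(y|x) are available. The following sketch assumes both are given as arrays; in practice the estimated distribution is itself averaged over classifiers trained on repeated training sets.

```python
import numpy as np

def kw_bias_variance(p_true, p_est):
    """Per-instance Kohavi-Wolpert terms for a test point x.

    p_true: true class distribution P(y|x) over the m class labels.
    p_est:  estimated distribution P_hat(y|x), in practice averaged
            over classifiers trained on repeated training sets.
    """
    p_true, p_est = np.asarray(p_true, float), np.asarray(p_est, float)
    bias2 = 0.5 * np.sum((p_true - p_est) ** 2)       # squared difference term
    variance = 0.5 * (1.0 - np.sum(p_est ** 2))       # sensitivity term
    return bias2, variance

# A confident but wrong estimator: high bias at this point, low variance.
print(kw_bias_variance([0.9, 0.1], [0.1, 0.9]))
```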

In machine learning, the bias-variance tradeoff is a central problem for supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods (e.g., high-dependence BNCs) are usually more complex, enabling them to capture more complex multivariate relationships, but at the risk of overfitting to noisy or unrepresentative training data. In contrast, high-bias learning methods correspond to simpler models that do not tend to overfit, but may underfit their training data, failing to capture important regularities.

A statistical hypothesis test, e.g., the Friedman test [18], can test the null hypothesis of no difference between algorithms. The Friedman test ranks the algorithms for each dataset separately: the best-performing algorithm gets the rank of 1, the second best rank 2, and so on. In case of ties, average ranks are assigned. The Friedman statistic can be computed as follows [18]:

χ²F = [12N / (t(t + 1))] [Σ_{j=1}^{t} Rj² − t(t + 1)²/4], (1)

where Rj = (1/N) Σ_{i=1}^{N} r_i^j and r_i^j is the rank of the j-th of t algorithms on the i-th of N datasets.
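A small sketch of this computation (the loss matrix below is hypothetical):

```python
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_statistic(loss):
    """loss[i, j]: zero-one loss of the j-th of t algorithms on the i-th of
    N datasets. Lower loss gets the smaller rank; ties share average ranks."""
    N, t = loss.shape
    ranks = np.vstack([rankdata(row) for row in loss])  # ranks per dataset
    R = ranks.mean(axis=0)                              # average rank R_j
    chi2_f = 12.0 * N / (t * (t + 1)) * (np.sum(R ** 2) - t * (t + 1) ** 2 / 4.0)
    return chi2_f, chi2.ppf(0.95, df=t - 1)             # statistic, critical value

# Hypothetical zero-one losses: 6 datasets x 3 algorithms.
loss = np.array([[0.10, 0.12, 0.15], [0.20, 0.25, 0.24], [0.05, 0.06, 0.08],
                 [0.30, 0.28, 0.35], [0.12, 0.15, 0.16], [0.22, 0.21, 0.25]])
print(friedman_statistic(loss))
```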

Sensitivity measures the proportion of actual positives that are correctly identified, and specificity measures the proportion of actual negatives that are correctly identified. The receiver operating characteristic (ROC) curve is a powerful tool to illustrate the diagnostic ability of a binary classifier: it plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The ROC curve graphically displays the trade-off between sensitivity and specificity and is useful for choosing the best cut-off. The area under the ROC curve (AUC) [19] provides a single numeric measure of performance that complements the visual comparison of ROC curves.
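AUC can also be computed without explicitly drawing the curve, via its equivalent Mann-Whitney form: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal sketch with toy scores:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the probability that a random
    positive is scored above a random negative (ties count one half)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1]))  # toy scores and labels
```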

Bayesian network classifiers

Given class variable Y and a set of discrete attributes X = {X1, X2, ⋯, Xn} (in the following formulas, all variables are assumed to be discrete), the aim of supervised learning is to predict the discrete class label y of a testing instance x = (x1, ⋯, xn), where xi is the value of attribute Xi and y is the value of class variable Y. Restricted BNCs, e.g., KDB, model the joint probability distribution P(x, y) according to the chain rule, which can be described in the form of a product of a set of conditional probabilities:

P(x, y) = P(y) ∏_{i=1}^{n} P(xi|Πi, y), (2)

where Πi represents the parent attribute set of Xi.

From the definition of conditional probability, we classify using

P(y|x) = P(x, y) / P(x) ∝ P(y) ∏_{i=1}^{n} P(xi|Πi, y). (3)

When the number of attributes n is large and/or the data size N is relatively small, it is difficult to obtain a sufficiently accurate estimate of P(xi|Πi, y) from the sample frequencies. One popular solution is to restrict the number of parents of each attribute while trying to retain an accurate estimate of P(xi|Πi, y). That is, given an attribute subset Si ⊆ {X1, ⋯, Xi−1}, P(xi|x1, ⋯, xi−1, y) ≈ P(xi|Si, y) holds. Sahami [11] proposed the notion of k-dependence BNCs, which allow each attribute Xi to have a maximum of k attribute nodes as parents.

NB is the simplest of the BNCs, assuming that all attributes are independent given the class. There exist no dependency relationships between attributes, and thus NB is a 0-dependence BNC. AODE utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. TAN relaxes NB’s independence assumption by allowing every attribute to have at most one other attribute as a parent. Its basic structure extends the Chow-Liu tree [20] to a maximum spanning tree. The arc, or conditional dependence, between attributes Xi and Xj is measured by the conditional mutual information (CMI) I(Xi; Xj|Y) given the class variable, which is defined as follows [21]:

I(Xi; Xj|Y) = Σ_{xi, xj, y} P(xi, xj, y) log [P(xi, xj|y) / (P(xi|y)P(xj|y))]. (4)

KDB further relaxes NB’s independence assumption by allowing any attribute Xi to be conditioned on at most k other attributes, i.e., there are at most k arcs from other attributes to Xi. Unlike TAN, KDB needs to determine the attribute order by comparing the mutual information (MI) I(Xi; Y) between attribute Xi and class Y, which is defined as follows [21]:

I(Xi; Y) = Σ_{xi, y} P(xi, y) log [P(xi, y) / (P(xi)P(y))]. (5)

The learning procedure of KDB is described in Algorithm 1.
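Formulas (4) and (5) can be estimated directly from sample frequencies. A minimal sketch over lists of discrete attribute and class values (the data is a toy example; in the paper these quantities are computed from the training set):

```python
import numpy as np
from collections import Counter

def mi(xi, y):
    """Empirical I(Xi; Y), formula (5), from paired samples."""
    n = len(xi)
    c_xy, c_x, c_y = Counter(zip(xi, y)), Counter(xi), Counter(y)
    return sum(c / n * np.log(c * n / (c_x[a] * c_y[b]))
               for (a, b), c in c_xy.items())

def cmi(xi, xj, y):
    """Empirical I(Xi; Xj | Y), formula (4), from paired samples."""
    n = len(xi)
    c_ijy = Counter(zip(xi, xj, y))
    c_iy, c_jy, c_y = Counter(zip(xi, y)), Counter(zip(xj, y)), Counter(y)
    return sum(c / n * np.log(c * c_y[vy] / (c_iy[(vi, vy)] * c_jy[(vj, vy)]))
               for (vi, vj, vy), c in c_ijy.items())

xi = ["a", "a", "b", "b", "a", "b"]
xj = ["u", "v", "u", "v", "u", "v"]
y  = [ 0,   0,   1,   1,   0,   1 ]
print(mi(xi, y), cmi(xi, xj, y))
```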

Algorithm 1: Structure learning of KDB

Input: Training set D, parameter k = 2, crosstab CMI = {I(Xi; Xj|Y) | 1 ≤ i ≠ j ≤ n} (see formula (4)) and vector MI = {I(Xi; Y) | 1 ≤ i ≤ n} (see formula (5)).

Output: Network structure G = ⟨V, E⟩, where V is the node set and E is the edge set.

1 Let L be a list of all Xi in descending order of I(Xi; Y).

2 V ← {Y}; E ← ∅;

3 for i = 1 → n do

4  V ← V ∪ {Xi};

5  E ← E ∪ {⟨Y, Xi⟩};

6 end

7 for i = 1 → n do

8  S ← ∅;

9  Πi ← ∅;

10  while (|S| < min(i − 1, k)) do

11   m ← arg max {I(Xi; Xj|Y) | Xj ranks before Xi in L, j ∉ S};

12   Πi ← Πi ∪ {Xm};

13   E ← E ∪ {⟨Xm, Xi⟩};

14   S ← S ∪ {m};

15  end

16 end

17 return G = ⟨V, E⟩

Algorithm 2: Structure learning of KDBx

Input: Testing instance x = (x1, ⋯, xn), parameter k = 2, vector LMI = {I(xi; Y) | 1 ≤ i ≤ n} and crosstab CLMI = {I(xi; xj|Y) | 1 ≤ i ≠ j ≤ n} (see formula (6)).

Output: Network structure Gx = ⟨Vx, Ex⟩, where Vx is the node set and Ex is the edge set.

1 Let L be a list of all xi in descending order of I(xi; Y).

2 Vx ← {Y}; Ex ← ∅;

3 for i = 1 → n do

4  Vx ← Vx ∪ {xi};

5  Ex ← Ex ∪ {⟨Y, xi⟩};

6 end

7 for i = 1 → n do

8  S ← ∅;

9  Πi ← ∅;

10  while (|S| < min(i − 1, k)) do

11   m ← arg max {I(xi; xj|Y) | xj ranks before xi in L, j ∉ S};

12   Πi ← Πi ∪ {xm};

13   Ex ← Ex ∪ {⟨xm, xi⟩};

14   S ← S ∪ {m};

15  end

16 end

17 return Gx = ⟨Vx, Ex⟩
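Algorithms 1 and 2 share the same selection procedure, differing only in whether the information measures are computed over attributes (MI, CMI) or over the attribute values of the testing instance (LMI, CLMI). The following is a minimal Python sketch of that procedure, assuming the information values have already been estimated; all names and toy values are illustrative.

```python
import numpy as np

def kdb_structure(mi, cmi, k=2):
    """Pick parents for each attribute as in Algorithm 1.

    mi:  length-n array, mi[i] ~ I(Xi; Y).
    cmi: n x n array, cmi[i][j] ~ I(Xi; Xj | Y).
    Returns parents[i]: parent attribute indices of Xi
    (the class Y is implicitly a parent of every attribute).
    """
    order = np.argsort(-np.asarray(mi))                # descending I(Xi; Y)
    parents = {}
    for pos, i in enumerate(order):
        # candidates = attributes that rank before Xi, best CMI first
        cands = sorted(order[:pos], key=lambda j: -cmi[i][j])
        parents[int(i)] = [int(j) for j in cands[:k]]  # at most k parents
    return parents

# Toy example: 4 attributes with hypothetical information values.
mi = [0.40, 0.30, 0.20, 0.10]
cmi = np.array([[0, .1, .2, .3], [.1, 0, .4, .2],
                [.2, .4, 0, .1], [.3, .2, .1, 0]])
print(kdb_structure(mi, cmi, k=2))
```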

KDB can represent the “average knowledge” or “expert knowledge” mined from data, which roughly describes the dependency relationships between different inputs, e.g., the dependency relationship between {Gender, Age} and TSH. However, KDB cannot finely describe the dependency relationships in different patient records, e.g., the relative independence between {Gender = “male”, Age = 20} and TSH = “yes”, or the relative dependence between {Gender = “female”, Age = 45} and TSH = “yes”. In contrast, KDBx represents the “personalized knowledge” mined from instance x. The “average knowledge” learned from labeled training data and the “personalized knowledge” learned from the unlabeled testing instance are complementary in nature. Thus they should be considered simultaneously for classification. To achieve this goal, KDBx applies the same learning strategy that KDB uses. Given testing instance x, KDBx sorts attribute values by comparing the local mutual information (LMI) I(xi; Y) and chooses appropriate conditional dependencies by comparing the conditional local mutual information (CLMI) I(xi; xj|Y). LMI and CLMI are defined as follows [2]:

I(xi; Y) = Σ_{y} P(xi, y) log [P(xi, y) / (P(xi)P(y))],
I(xi; xj|Y) = Σ_{y} P(xi, xj, y) log [P(xi, xj|y) / (P(xi|y)P(xj|y))]. (6)

From the viewpoint of information theory, MI, i.e., I(Xi; Y), measures the uncertainty reduction in Y given the information from Xi. The attributes corresponding to greater reductions get higher ranks and are added to the network structure first. By comparing formulas (5) and (6) we can see that I(Xi; Y) = Σ_{xi} I(xi; Y). MI refers to the average over all possible events: it is the expected value of LMI over all possible values of Xi. LMI can be used to measure the uncertainty reduction in Y given the information Xi = xi. Because I(Xi; Xj|Y) = Σ_{xi, xj} I(xi; xj|Y), we obtain the similar result that I(xi; xj|Y) measures the conditional dependence between Xi and Xj when they take specific values.
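The decomposition I(Xi; Y) = Σ_{xi} I(xi; Y) noted above is easy to verify numerically; a small sketch with toy samples:

```python
import numpy as np
from collections import Counter

def lmi(xi_samples, y_samples, value):
    """Empirical LMI I(xi; Y) for the fixed value Xi = value (formula (6))."""
    n = len(xi_samples)
    c_x = sum(1 for v in xi_samples if v == value)
    c_y = Counter(y_samples)
    c_xy = Counter((v, lab) for v, lab in zip(xi_samples, y_samples) if v == value)
    return sum(c / n * np.log(c * n / (c_x * c_y[lab]))
               for (_, lab), c in c_xy.items())

xi = ["a", "a", "b", "b", "b", "a"]
y  = [ 0,   0,   1,   1,   0,   1 ]
# Summing LMI over all values of Xi recovers the mutual information I(Xi; Y).
print(sum(lmi(xi, y, v) for v in set(xi)))
```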

The ensemble of KDB and KDBx, i.e., AKDB [2], has better overall prediction accuracy, on average, than either individual member. KDB and KDBx apply the same learning strategy but model different data spaces (the training data and the testing instance). It is difficult to judge which output from these two classifiers should be given priority. The linear combiner is used for models that output real-valued numbers, and is therefore applicable to BNCs. In practice, it is inappropriate to pre-determine the weights of the subclassifiers; thus AKDB uses the uniformly rather than nonuniformly weighted average. The ensemble probability estimate is

P(y|x, AKDB) = [P(y|x, KDB) + P(y|x, KDBx)] / 2. (7)

Given m class labels, the class label y* of unlabeled instance x corresponds to the highest posterior probability P(y|x, AKDB), where y ∈ {y1, ⋯, ym}, i.e.,

y* = arg max_{y∈{y1, ⋯, ym}} P(y|x, AKDB). (8)

The classification procedure of AKDB is shown in Algorithm 3.

Algorithm 3: Classification procedure of AKDB

Input: Testing instance x, KDB learned from Algorithm 1 and KDBx learned from Algorithm 2.

Output: Class label y*.

1 Compute the joint probabilities P(y, x|KDB) and P(y, x|KDBx) by Formula (2);

2 Compute the conditional probabilities P(y|x, KDB) and P(y|x, KDBx) by Formula (3);

3 Compute the conditional probability P(y|x, AKDB) by Formula (7);

4 Compare and predict the class label y* for x by Formula (8);

5 Return y*;
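A minimal sketch of steps 3–4 (formulas (7) and (8)), assuming the two class posteriors have already been computed:

```python
import numpy as np

def akdb_predict(post_kdb, post_kdbx):
    """Uniformly average the posteriors of KDB and KDBx (formula (7))
    and return the arg-max class index (formula (8))."""
    ensemble = 0.5 * (np.asarray(post_kdb) + np.asarray(post_kdbx))
    return int(np.argmax(ensemble)), ensemble

# Toy posteriors over three class labels.
label, post = akdb_predict([0.2, 0.5, 0.3], [0.4, 0.3, 0.3])
print(label, post)  # 1, [0.3 0.4 0.3]
```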

Redundant dependency elimination

Suppose that Πi = {X1, ⋯, Xi−1}; from the chain rule of mutual information we have [21]

I(Xi; Πi|Y) = I(Xi; X1|Y) + I(Xi; X2|X1, Y) + ⋯ + I(Xi; Xi−1|X1, ⋯, Xi−2, Y). (9)

KDB implicitly reduces I(Xi; Xj|X1, ⋯, Xj−1, Y) to I(Xi; Xj|Y) when j > 1. The same strategy is also applied in KDBx. Obviously, the dependency relationships between the parent attributes of Xi are neglected, which inevitably results in estimation bias. For different instances, the dependency relationships may differ. Here, we introduce the pointwise mutual information (PMI) I(xi; xj) and the pointwise conditional mutual information (PCMI) I(xi; xk|xj) to address this issue. The definitions of PMI and PCMI are as follows [22]:

I(xi; xj) = log [P(xi, xj) / (P(xi)P(xj))],  I(xi; xk|xj) = log [P(xi, xk|xj) / (P(xi|xj)P(xk|xj))]. (10)
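Both quantities can be estimated from sample frequencies. A sketch with toy samples (the value names echo the Table 3 example discussed below):

```python
import numpy as np

def pmi(si, sj, xi, xj):
    """Empirical pointwise mutual information I(xi; xj), formula (10)."""
    n = len(si)
    p_i = sum(a == xi for a in si) / n
    p_j = sum(b == xj for b in sj) / n
    p_ij = sum(a == xi and b == xj for a, b in zip(si, sj)) / n
    return np.log(p_ij / (p_i * p_j))

def pcmi(si, sk, sj, xi, xk, xj):
    """Empirical pointwise conditional mutual information I(xi; xk | xj)."""
    rows = [(a, c) for a, c, b in zip(si, sk, sj) if b == xj]  # condition on xj
    n = len(rows)
    p_i = sum(a == xi for a, _ in rows) / n
    p_k = sum(c == xk for _, c in rows) / n
    p_ik = sum(a == xi and c == xk for a, c in rows) / n
    return np.log(p_ik / (p_i * p_k))

si = ["b", "b", "a", "b"]   # values of Xi
sj = ["d", "d", "d", "c"]   # values of Xj
sk = ["e", "f", "e", "e"]   # values of Xk
print(pmi(si, sj, "b", "d"), pcmi(si, sk, sj, "b", "e", "d"))
```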

The dependency relationships in KDBx that are relevant or irrelevant to the class label are measured by formulas (6) and (10) respectively, and the confidence levels of these measures are determined by the estimation of the probability distributions. The probability distributions have to be estimated from training data before structure learning. For small datasets, the sparsely distributed attribute values make lower-order probability estimates much more reliable than higher-order ones. If the probability distributions learned from training data are not reliable, the resulting non-robust classifier will make wrong predictions. That may be the main reason why NB offers competitive performance with high efficiency, strong robustness and loose coupling on some small datasets.

PMI and PCMI refer to single events. Like MI, PMI also follows the chain rule, i.e.,

I(xi; xj, xk) = I(xi; xj) + I(xi; xk|xj). (11)

In computational linguistics, PMI has been used for finding co-occurrences of words in a text corpus, where word frequencies and co-occurrence frequencies are used to approximate the probabilities P(x) and P(x, y), respectively. MI can roughly measure the dependency relationship between the associated variables, but cannot measure the inherent relational mapping between specific variable values. For example, given two attributes Xi and Xj, each having two values {1, 2}, and possible joint outcomes ⟨Xi, Xj⟩ = {(1, 1), (1, 2), (2, 2)}, obviously when Xi = 1 the uncertainty of Xj reaches a maximum, whereas when Xi = 2 the uncertainty of Xj is reduced to zero. In practice, it may be the case that certain values are more significant than others, or that certain patterns of association are more semantically important than others. Further, it is desirable to obtain reasonable causal relationships for causality analysis rather than a simple classification result. Consider the example substructure shown in Fig 1(a); the corresponding training data is presented in Table 3. In this example, Xi has two parents Xj and Xk, and its conditional probability is P(xi|xj, xk, y).
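The ⟨Xi, Xj⟩ example can be checked with a few lines: treating the three joint outcomes as equally likely, the conditional entropy of Xj is one bit when Xi = 1 and zero when Xi = 2.

```python
import numpy as np
from collections import Counter

pairs = [(1, 1), (1, 2), (2, 2)]       # equally likely joint outcomes
for v in (1, 2):
    cond = [xj for xi, xj in pairs if xi == v]
    p = np.array(list(Counter(cond).values())) / len(cond)
    h = -(p * np.log2(p)).sum() + 0.0  # conditional entropy of Xj given Xi = v
    print(f"H(Xj | Xi={v}) = {h:.2f} bits")
```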

Fig 1. Example: Conditional dependencies between Xi and its parents.

(a) Xi has two parent attributes Xj and Xk. (b) Parent attribute Xk is substituted with Xt.

https://doi.org/10.1371/journal.pone.0199822.g001

Table 3. An example of training data with four attributes, in which the mapping relationships among {Xi, Xj, Xk} are shown.

https://doi.org/10.1371/journal.pone.0199822.t003

KDB and KDBx just consider the conditional dependence between Xi and its parents; the relationships among the parents themselves are neglected, which may not help to increase the confidence level of P(xi|Πi, y) or reduce the uncertainty of Xi when it takes specific values. For a testing instance x, its class label is unknown; thus Redundant Dependency Elimination (RDE) just considers the dependency relationships between attribute values. For example, given {Xi = b, Xj = d, Xk = e} in Table 3, from the chain rule of PMI we have

I(xi; xj, xk) = I(xi; xj) + I(xi; xk|xj) = I(xi; xj) + log [P(xi|xj, xk) / P(xi|xj)] = I(xi; xj) + 0. (12)

Thus I(xi; xj, xk) = I(xi; xj), i.e., xk does not provide any extra valuable information to reduce the uncertainty of xi. To further increase the conditional probability of xi, we should select another attribute value, e.g., xt, to take the place of xk. If I(xi; xt|xj) > 0, then

I(xi; xj, xt) = I(xi; xj) + I(xi; xt|xj) > I(xi; xj) = I(xi; xj, xk). (13)

Thus the larger the difference is, the more appropriate Xt is as a parent of Xi. If the attributes are sorted by comparing I(xi; Y) and the resulting order is {x1, ⋯, xn}, then xi can select at most k parents from the i − 1 attributes that rank higher. Suppose that its parents are sorted by comparing I(xi; xj|Y) (j < i) and the order is {xj1, xj2, ⋯}. RDE first operates by iteratively identifying redundant parents of each attribute. It uses the criterion

I(xi; xjm|xj1) > δ · I(xi; xj1) (14)

to infer that, beyond the information that xj1 provides to xi, xjm can provide extra information to xi, where 1 < m ≤ i − 1 and δ is a minimum redundancy ratio. If there exist attribute values that make formula (14) hold, then an appropriate parent of xi should be selected from them. This process terminates if there is no redundancy or no substitute attribute is available. We keep the attribute value with the smallest index and disregard the other attribute values. For instance, if xj2 and xj3 both hold for formula (14), we only take xj2 as the parent of xi.

Starting from the basic network structure learned from the testing instance, KDBO repairs “harmful” interdependencies by applying RDE to remove highly correlated attribute values at classification time. Note that attribute selection approaches, such as backwards sequential elimination (BSE) [23, 24], simply remove attributes to achieve a zero-one loss improvement; BSE operates by iteratively removing successive attributes until no further improvement is obtained. According to Formula (2), attribute Xi can have at most i − 1 parents, i.e., there exist up to i − 1 conditional dependencies between Xi and its parents. If Xi is removed from the Bayesian network structure, those conditional dependencies are implicitly removed with it, which results in a great change in the network structure and in classification bias. In contrast, RDE retains all attributes and resolves such interdependencies with a much more flexible strategy and finer tuning: for some test instances a conditional dependence may be identified as redundant and then substituted or removed, while for other test instances it may hold.

One effective way of resolving the trade-off between bias and variance is to use ensemble learning [9, 25]. For example, boosting combines many “weak” (high-bias) models into an ensemble that has lower bias than the individual models, while bagging combines “strong” learners in a way that reduces their variance. KDB and KDBO are both “strong” learners. KDB takes the training set as a target and builds a general BNC for it. KDBO takes the testing instance x as a target and builds a specific BNC for x. In contrast to KDB, KDBO is defined by the conditional dependencies between the attribute values in x. Obviously, for different testing instances KDB remains the same while KDBO may differ greatly. RDE identifies and then substitutes or removes the redundant dependencies in KDBx, which makes the conditional dependencies in KDBO much more reasonable.

The final model, PKDB, is an ensemble of KDB and KDBO. The ensemble probability estimate for PKDB is

P(y|x, PKDB) = [P(y|x, KDB) + P(y|x, KDBO)] / 2.

PKDB can represent arbitrary k-dependence relationships. It might seem that a PKDB with a higher degree of attribute dependence would fit the training data more closely and achieve better generalization performance than one with a lower degree of attribute dependence. However, a higher degree of attribute dependence needs more training instances to ensure an accurate estimate of the conditional probabilities. From Table 2, the thyroid disease datasets for the experimental study contain a relatively small number (< 3800) of instances but a large number (≥ 25) of attributes. To make the resulting algorithm combine the computational efficiency of classical generative learning with control of the bias/variance trade-off, in the following discussion we restrict PKDB to be 2-dependence, i.e., k = 2, as used in [2]. Since attribute Xi can have k parent attributes with higher ranks, the problem of redundant dependency arises when i ≥ k + 2. The detailed learning procedure of PKDB is presented in Algorithm 4.

Algorithm 4: Redundant Dependency Elimination for KDBO when k = 2

Input: Network structure Gx (learned by Algorithm 2), parameter k, testing instance x = (x1, ⋯, xn).

Output: KDBO, the network structure after applying RDE.

1 Transform Gx into a set of children-parent pairs {x1, Π1}, ⋯, {xn, Πn}.

2 Let L be a list of all xi in descending order of I(xi; Y).

3 for i = k + 2 → n do

4  Let Li be a list of all xj (xj ∈ Πi) in descending order of I(xi; xj|Y);

5  xj1 ← the first element of Li;

6  for j = 2 → i − 1 do

7   if (I(xi; xj|xj1) > δ · I(xi; xj1)) (see formula (14)) then

8    Πi ← {xj1, xj}; break;

9   end

10  end

11 end

12 Transform the revised children-parent pairs {x1, Π1}, ⋯, {xn, Πn} into KDBO.

13 return KDBO
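A sketch of the inner test of Algorithm 4 in Python, under the reading of formula (14) reconstructed above (the exact threshold form is an assumption); the information values in the toy example are hypothetical.

```python
def substitute_redundant_parent(i, parents_i, candidates, pcmi, pmi, delta=0.1):
    """One RDE step for attribute value x_i (a sketch; reading formula (14)
    as PCMI > delta * PMI is an assumption made above).

    parents_i:  [p1, p2], parents sorted by I(x_i; x_j | Y); p1 is kept.
    candidates: indices of higher-ranked attribute values, smallest first.
    pcmi(i, j, p1): estimate of I(x_i; x_j | x_p1).
    pmi(i, p1):     estimate of I(x_i; x_p1).
    """
    p1 = parents_i[0]
    for j in candidates:                  # keep the smallest qualifying index
        if j != p1 and pcmi(i, j, p1) > delta * pmi(i, p1):
            return [p1, j]                # substitute the second parent
    return [p1]                           # no substitute available: remove it

# Toy example with hypothetical, precomputed information values.
pcmi_table = {(3, 1, 0): 0.02, (3, 2, 0): 0.30}
pmi_table = {(3, 0): 0.50}
new_parents = substitute_redundant_parent(
    3, [0, 1], [1, 2],
    pcmi=lambda i, j, p: pcmi_table[(i, j, p)],
    pmi=lambda i, p: pmi_table[(i, p)],
)
print(new_parents)  # [0, 2]: x1 was redundant given x0, so x2 substitutes it
```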

During training, PKDB generates a three-dimensional table of co-occurrence counts for each pair of attribute values and each class value to estimate the probabilities P(y), P(xi, y), P(xi, xj, y), P(xi, xj) and P(xi, xj, xk). KDB requires O(Nm(nv)²) time (dominated by calculating CMI) [11] to build the network structure, where v is the average number of discrete values that an attribute may take. The basic structure of KDBO only considers the attribute values in the testing instance and thus requires O(Nmn²) time. RDE requires O(Nn²) time to calculate PCMI, and an extra pass is then needed to identify and substitute or remove redundant conditional dependencies. The final time complexity for building KDBO is O(Nmn²) + O(Nn³). The time complexities of classifying a single instance are the same for KDB and KDBO, namely O(mnk).

Results

The experimental system is implemented in C++. The experiments are conducted on a desktop computer with an Intel(R) Core(TM) i5-7200 CPU @ 3.20 GHz (64 bits) and 12,288 MB of memory. For the BNCs to be compared, 10-fold cross-validation is applied to obtain an accurate estimate of the average performance. For each fold, leave-one-out cross-validation zero-one loss [26, 27] is used as the selection criterion to determine δ in Formula (14). Table 4 summarizes the experimental results in terms of zero-one loss, bias, variance and AUC. The Friedman statistic is distributed according to the χ² distribution with t − 1 degrees of freedom. Thus, for any pre-determined level of significance α, the null hypothesis will be rejected if χ²F > χ²(α, t − 1). The critical value of χ² for α = 0.05 with seven degrees of freedom is 14.07. The Friedman statistic of zero-one loss in Table 4 is 15.32, which is larger than 14.07. Hence, the null hypothesis is rejected and these classifiers differ significantly.

Table 4. The comparison of classification performance between classifiers in terms of zero-one loss, bias, variance and AUC.

https://doi.org/10.1371/journal.pone.0199822.t004

Quinlan believed that the two relatively large datasets, i.e., Dis and Hypothyroid, have been corrupted [28], and many missing values exist (6064 missing values in dataset Dis and 5329 in dataset Hypothyroid). When we substitute these missing values with a specific value, i.e., “?” or unknown, noise is artificially introduced and the performance of the learned classifier may be degraded. For the other four small datasets with fewer than 800 instances, the training data provided accounts for only a small portion of the full data space. Thus the estimates of conditional probabilities will be of low confidence. A relatively simple structure resulting from underfitting rather than overfitting may help to improve the classification performance of a learning algorithm. From the experimental results of zero-one loss in Table 4 we can see that classifiers with complex structure do not necessarily enjoy a significant advantage over classifiers with simple structure. For example, KDB, LibSVM and RF perform worse than NB on datasets Breast-cancer-w and Heart-disease-c. However, KDBx provides an effective way to learn high-confidence dependency relationships implicated in the testing instance. RDE can remove the redundant dependency relationships that are irrelevant to the class label and add high-confidence conditional dependencies. The negative effect caused by noise and insufficient data is thereby mitigated to some extent. AKDB performs better than KDB, and PKDB performs the best among all classifiers in terms of zero-one loss.

We then analyze the results from the viewpoint of bias-variance decomposition. The variance results are reasonable: AODE achieves higher variance than NB because of its more complex structure. However, AODE achieves higher bias on dataset Dis, which indicates underfitting to some extent. Since AODE indiscriminately represents all 29*28 = 812 conditional dependencies, some weak dependencies may represent a large noise component in the training set and counteract the effect of the strong dependencies, making AODE underfit dataset Dis and its prediction less accurate than NB’s. For dataset Hypothyroid, AODE only needs to represent 25*24 = 600 conditional dependencies, and the negative effect of weak dependencies can be mitigated. When k = 2, KDB can represent 0 + 1 + 2 + ⋯ + 2 = 49 conditional dependencies (as shown in Fig 2), whereas TAN only needs to represent 28 conditional dependencies. Thus KDB achieves higher variance, since it fits the training set well even when noise exists. As a result, KDB does not fit the testing instance much better than TAN. Noisy training data will reduce the confidence level of the classification model. For classifiers learned from the other four small datasets, overfitting is almost inevitable. KDB, LibSVM and RF perform worse than NB on datasets Breast-cancer-w and Heart-disease-c in terms of variance. How to reduce variance is a crucial point for improving classification accuracy. RDE helps to mitigate the negative effect of overfitting; thus the variance of PKDB is always lower than that of AKDB and KDB.

Fig 2. The network structure of KDB(k = 2) on dataset Hypothyroid.

Class variable Y is not included for simplicity. Only conditional dependencies between attributes are shown.

https://doi.org/10.1371/journal.pone.0199822.g002

AUC is often used to evaluate classification performance when dealing with imbalanced data. From Table 4 we can see that TAN and KDB perform better than NB more often than not on the small datasets. This indicates that although the negative effect caused by overfitting may reduce classification accuracy, the dependency relationships implicated in the data help to improve the discriminatory power of BNCs. The definitions of LMI and CLMI consider all possible values of the class variable; thus KDBx cannot overfit the given testing instance x, but provides a possible dependence tree structure to describe the relationships among the attribute values in x. The advantage of PKDB over other classifiers in AUC is especially obvious on datasets Dis and Hypothyroid. In contrast, although AKDB also uses the personalized KDBx, it performs much worse. This can be attributed to the low-confidence dependency relationships mined from these small datasets. LibSVM performs worse on datasets Dis and Hypothyroid but better on the other four small datasets. RF demonstrates significant robustness when dealing with both relatively large and small datasets.

Discussion

Doctors may need to determine whether blood tests are necessary for patients due to their respective risk factors, e.g., family history of goitres, Gender or Age. By computing LMI and CLMI from the local perspective, KDBO, which learns from an individual testing instance, is obviously an example of a learner for precision medicine. PKDB can utilize the information provided by the training set and the testing instance with the help of the aggregating mechanism. To prove this, we take as examples two cases from the Hypothyroid dataset that have different class labels. The first case, which is diagnosed as “hypothyroid”, is shown as follows:

Case 1 = (x19 = 43, x23 = 47, x22 = 1.26, x20 = 2, x21 = 59, x12 = y, x13 = y, x6 = t, x14 = y, x15 = y, x16 = y, x5 = f, x24 = ?, x17 = n, x1 = f, x0 = F, x18 = 28, x4 = f, x2 = f, x8 = f, x11 = f, x7 = f, x9 = t, x3 = f, x10 = f), (15)

where ‘?’ is used to denote a value that is missing or unknown. The attribute values in Case 1 have been sorted by comparing I(xi; Y). Among them, x19, i.e., TSH, ranks the highest; thus the level of TSH is closely related to some definite results and further tests will be needed. The full network structure with 25 attributes is too complex (47 arcs or conditional dependencies) to explain, so we just select one substructure to clarify. The conditional dependencies in KDBx and KDB, which focus on attributes {X8, X6, X9}, are respectively shown in Figs 3(a) and 4. In Fig 3(a), the testing result of X22 (T4U) can explain why the patient does not feel sick (X8 = f); thus X6 (query on hyperthyroid) does not provide valuable information, and the arc X6 → X8 is removed. By comparing KDB shown in Fig 4 and KDBO shown in Fig 3(b), the limitation of KDB in precise representation is obvious. Hyperthyroidism is a condition in which the thyroid gland produces too much of the hormone thyroxine. One symptom of hyperthyroidism is an enlarged thyroid gland, which may appear as a swelling at the base of one’s neck. It is reasonable in Fig 3(b) that X9 (tumor) is related to X6 (query on hyperthyroid), whereas in Fig 4 X9 (tumor) is related to X5 (query on hypothyroid). To judge the possibility of hypothyroidism, blood tests (including TSH (X12), T3 (X20) and T4U (X22)) are needed. These close relationships can be clearly seen in Fig 3.

Fig 3. The substructures of KDBx (a) and KDBO (b) for {X8, X6, X9} learned from Case 1 = (x19 = 43, x23 = 47, x22 = 1.26, x20 = 2, x21 = 59, x12 = y, x13 = y, x6 = t, x14 = y, x15 = y, x16 = y, x5 = f, x24 = ?, x17 = n, x1 = f, x0 = F, x18 = 28, x4 = f, x2 = f, x8 = f, x11 = f, x7 = f, x9 = t, x3 = f, x10 = f).

The arcs X6 (Query hyperthyroid) → X8 (Sick) and X12 (TSH measured) → X9 (Tumor) in (a) are identified as redundant and removed. As shown in (b), no more attributes with higher ranks are considered as possible parents of X8 and X9.

https://doi.org/10.1371/journal.pone.0199822.g003

Fig 4. The substructure of KDB for {X8, X6, X9} learned from training set.

The arc X5(TSH measured) → X9(Tumor) is not reasonable when X9 = t (i.e., ‘true’).

https://doi.org/10.1371/journal.pone.0199822.g004

The second instance, which is diagnosed as “negative”, is shown in detail as follows:

Case 2 = (x23 = 51, x21 = 37, x20 = 0.5, x19 = 9.7, x22 = 0.72, x13 = y, x12 = y, x1 = t, x14 = y, x15 = y, x16 = y, x5 = f, x0 = F, x24 = ?, x17 = n, x4 = f, x18 = 46, x6 = f, x8 = f, x2 = f, x11 = f, x9 = f, x7 = f, x3 = f, x10 = f). (16)

The conditional dependencies in KDBx and KDB, which focus on attributes {X1, X15, X16}, are respectively shown in Figs 5(a) and 6. The information implicated in some attribute values may overlap or even cover that in other attribute values. For example, “TSH measured = y” is a premise of “TSH = 4.6”, and “Sex = F” is a premise of “Pregnant = t”. Although there exist strong dependencies between these attribute values and they may appear simultaneously as the co-parents of some attributes, this kind of dependency is redundant and should be substituted. The arc X12 → X1 is removed from Fig 5(a) and we should find another parent for X1, as shown in Fig 5(b). To provide an accurate diagnosis of hypothyroidism, the blood tests of TT4 and FTI are always used simultaneously. Thus the arc X15 → X16 is also redundant and should be removed. The limitation of KDB in scalability is obvious. As shown in Fig 6, the value of X13 (T3 measured) is a premise of the value of X20 (T3). When they appear as the co-parents of some other attribute, e.g., X1, the conditional probability P(x1|x13, x20, y) will approximate the estimate of P(x1|x20, y). X13 (T3 measured) cannot provide any valuable information to X1.

Fig 5. The substructures of KDBx (a) and KDBO (b) for {X1, X15, X16} learned from Case 2 = (x23 = 51, x21 = 37, x20 = 0.5, x19 = 9.7, x22 = 0.72, x13 = y, x12 = y, x1 = t, x14 = y, x15 = y, x16 = y, x5 = f, x0 = F, x24 = ?, x17 = n, x4 = f, x18 = 46, x6 = f, x8 = f, x2 = f, x11 = f, x9 = f, x7 = f, x3 = f, x10 = f).

The arcs X12(TSH measured) → X15(Query hypothyroid) and X15(T4U measured) → X16(FTI measured) in (a) are identified as redundant and removed. No more attributes with higher ranks are considered as possible parents of X15 and X16. Arc X12(TSH measured) → X1(On thyroxine) is substituted with arc X20(T3) → X1(On thyroxine).

https://doi.org/10.1371/journal.pone.0199822.g005

Fig 6. The substructure of KDB for {X1, X15, X16}.

The arc X13 (T3 measured) → X1 (On thyroxine) is redundant since the information provided by X20 (T3) includes the information provided by X13 (T3 measured).

https://doi.org/10.1371/journal.pone.0199822.g006

Conclusion and future work

KDBx takes instance x as the target, and its network structure describes the dependency relationships in x. Because of the computational overhead, only a limited number of dependencies, which is determined by parameter k, can be described by KDBx. The proposed approach, RDE, is a filter applied to the testing instance that substitutes or removes redundant dependencies at classification time. The experimental results show that classification accuracy (zero-one loss) and robustness (bias and variance) are significantly enhanced by the addition of RDE. Besides, the dependency relationships that RDE identifies in the testing instance are irrelevant to the class label; thus RDE is especially applicable to imbalanced data, e.g., Dis and Hypothyroid. That may be the main reason why RDE obtains the highest AUC values among all the BNCs on the datasets Dis and Hypothyroid.

RDE searches for the mapping relationships between specific attribute values and then identifies redundant ones. Thus it is suited to probabilistic techniques that deal with discrete attributes, such as KDB. RDE can also be extended to deal with continuous attributes. One possible solution is that, if the conditional probability density function p(xj|xi) is relatively high (i.e., greater than a specified threshold δ), then the mapping relationship xi → xj is supposed to exist and xj is redundant. The estimate of p(xj|xi) must be learned reliably from training data, so the data size should be very large. Although estimating p(xj|xi) will be time-consuming and more experimental study is needed to determine the value of δ for different attributes, the research on extending RDE is still very promising.

References

  1. Bellazzi R, Zupan B. Predictive data mining in clinical medicine: Current issues and guidelines. International Journal of Medical Informatics, 2008; 77(2): 81–97.
  2. Wang LM, Zhao HY, Sun MH, Ning Y. General and Local: Averaged k-Dependence Bayesian Classifiers. Entropy, 2015; 17(6): 4134–4154.
  3. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Palo Alto, CA, 1988.
  4. Wu J, Cai Z. A naive Bayes probability estimation model based on self-adaptive differential evolution. Journal of Intelligent Information Systems, 2014; 42(3): 671–694.
  5. Zheng F, Webb GI. Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning. Machine Learning, 2012; 87(1): 93–125.
  6. Park SH, Fürnkranz J. Efficient implementation of class-based decomposition schemes for Naive Bayes. Machine Learning, 2014; 96(3): 295–309.
  7. Jiang LX, Cai ZH, Wang DH. Improving tree augmented naive Bayes for class probability estimation. Knowledge-Based Systems, 2011; 26: 239–245.
  8. Zheng F, Webb GI. Efficient lazy elimination for averaged one-dependence estimators. In Proceedings of the 23rd International Conference on Machine Learning, 2006; 1113–1120.
  9. Louzada F, Ara A. Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool. Expert Systems with Applications, 2012; 39(14): 11583–11592.
  10. Taheri S, Mammadov M. Structure learning of Bayesian Networks using global optimization with applications in data classification. Optimization Letters, 2015; 9(5): 931–948.
  11. Sahami M. Learning limited dependence Bayesian classifiers. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996; 335–338.
  12. https://www.healthline.com/health/hiv-aids/early-signs-hiv-infection#stages-of-hiv
  13. Cortes C, Vapnik V. Support-vector networks. Machine Learning, 1995; 20(3): 273–297.
  14. Breiman L. Random Forests. Machine Learning, 2001; 45(1): 5–32.
  15. Murphy PM, Aha DW. UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/datasets.html, 1995.
  16. Fayyad UM, Irani KB. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1993; 1022–1027.
  17. Kohavi R, Wolpert D. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning, 1996; 275–283.
  18. Friedman M. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 1940; 11(1): 86–92.
  19. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997; 30(7): 1145–1159.
  20. Chow C, Liu C. Approximating discrete probability distributions with dependency trees. IEEE Transactions on Information Theory, 1968; 14: 462–467.
  21. Shannon CE. The Mathematical Theory of Communication. University of Illinois Press, 1949.
  22. Church KW, Hanks P. Word association norms, mutual information, and lexicography. Computational Linguistics, 1990; 16(1): 22–29.
  23. Kittler J. Feature selection and extraction. In: Handbook of Pattern Recognition and Image Processing. New York: Academic Press, 1986.
  24. Blanco R, Inza I, Merino M, Quiroga J. Feature selection in Bayesian classifiers for the prognosis of survival of cirrhotic patients treated with TIPS. Journal of Biomedical Informatics, 2005; 38: 376–388.
  25. Bouckaert RR. Voting massive collections of Bayesian network classifiers for data streams. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence, Berlin, Heidelberg: Springer-Verlag, 2006; 243–252.
  26. Chen SL, Zaidi NA. Scalable learning of Bayesian network classifiers. Journal of Machine Learning Research, 2013; 1–30.
  27. Chen SL, Martinez AM, Webb GI, Wang LM. Sample-Based Attribute Selective AnDE for Large Data. IEEE Transactions on Knowledge and Data Engineering, 2017; 29(1): 172–185.
  28. http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/HELLO