Abstract
The percent model affinity (PMA) index is used to measure the similarity of two probability profiles representing, for example, an ideal profile (i.e. reference condition) and a monitored profile (i.e. possibly impacted condition). The goal of this work is to study the effects of sample size, evenness, true value of the index and number of classes on the statistical properties of the estimator of the PMA index. We derive and extend previous formulas for the expectation and variance of the estimator when the monitored profile is estimated and the reference profile is fixed. Using the obtained extension, we find that the estimator is asymptotically unbiased and converges faster when the profiles differ. When both profiles are estimated, we calculate the expectation using transformation rules for expectation and also derive the formula for the estimator’s variance. Since computing the probabilities in the variance formula is slow, we study the behavior of the variance with simulation experiments and assess whether it could be approximated by the variance for the fixed reference profile. Finally, we provide a set of recommendations for users of the PMA index to help them avoid its most common pitfalls.
Acknowledgments
We thank the Ellen and Artturi Nyyssönen foundation for the grant of Ärje and the Academy of Finland (projects 289076 (SK) and 289104 (KM)). KPC acknowledges the support of Singapore Ministry of Education Academic Research Fund R-155-000-147-112. We kindly thank Jukka Aroviita for insights into the Finnish adaptation of the PMA in the WFD context and Antti Penttinen and Jukka Nyblom for helpful conversations.
Appendix
Lemma 1
Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\), and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z_{n,h}\) a binomial random variable with parameters n and \(p_h\), that is \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\), and define the function \(A(p_h,q_h)\) as follows
Then, for every \(h=1,\ldots ,c\) we have
Proof
By definition we have
Recalling that a binomial random variable \(Z_{n,h}\) can be represented as the sum of two independent binomials \(Z_{n-1,h}\) and \(Z_{1,h}\), we have
and we can reformulate Eq. (24) as follows
proving the Lemma. \(\square\)
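The decomposition \(Z_{n,h}=Z_{n-1,h}+Z_{1,h}\) is often applied through the general binomial identity \(E[Z_{n,h}\,g(Z_{n,h})]=n\,p_h\,E[g(Z_{n-1,h}+1)]\). The Python sketch below only checks that identity numerically; it is an illustration under stated assumptions, not the authors' derivation. The helper names and the choice \(g(k)=|k/n-q_h|\) are hypothetical and were picked for this example.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def expect(g, n, p):
    """Exact E[g(Z)] for Z ~ Binomial(n, p), summing over the support."""
    return sum(g(k) * binom_pmf(k, n, p) for k in range(n + 1))

def size_bias_gap(n, p, q):
    """Return E[Z*g(Z)] - n*p*E[g(Z'+1)] with Z ~ Bin(n, p), Z' ~ Bin(n-1, p).

    g(k) = |k/n - q| is an assumed example function (absolute deviation
    of the sample proportion from a fixed reference probability q).
    """
    g = lambda k: abs(k / n - q)
    lhs = expect(lambda k: k * g(k), n, p)
    rhs = n * p * expect(lambda k: g(k + 1), n - 1, p)
    return lhs - rhs

# The gap should vanish up to floating-point rounding.
print(size_bias_gap(n=20, p=0.3, q=0.25))
```

Because the expectations are computed by exact summation rather than simulation, any non-negligible gap would indicate a coding error rather than sampling noise.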
Proposition 1
Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) are distributed as a multinomial random vector \({\mathcal{M}}(n,{\mathbf{p}})\), the expected value of the estimator \(\hat{I}\) in Eq. (3) is given by
where the \(Z_{n-1,h}\) for \(h=1,\ldots ,c\) are independent binomial random variables, each one with parameters \(n-1\) and \(p_h\), that is \(Z_{n-1,h} \sim {\mathcal{B}}(n-1,p_h)\).
Proof
From Eq. (5) the expectation of \(\hat{I}\) can be represented as
Recalling that the h-th component of a random vector \({\mathbf{X}}\), multinomial distributed with parameters n and \({\mathbf{p}}\), has marginal binomial distribution with parameters n and \(p_h\), that is \(X_{h} \sim {\mathcal{B}}(n,p_h)\), that expectation can be rewritten as
Therefore, the proof follows from combining Eq. (25) in Lemma 1 and Eq. (26). \(\square\)
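The marginal-binomial argument of this proof can be illustrated numerically. The Python sketch below assumes that the estimator has the usual PMA form \(\hat{I}=1-\frac{1}{2}\sum_h |X_h/n-q_h|\) (Eq. (3) is not reproduced in this appendix, so this form is an assumption) and computes \(E[\hat{I}]\) exactly by summing over each binomial marginal. It does not use the closed-form expression of Proposition 1, and all function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pma(p, q):
    """PMA between two profiles, assumed form: 1 - 0.5 * sum |p_h - q_h|."""
    return 1 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def expected_pma_fixed_ref(n, p, q):
    """Exact E[I_hat] for X ~ Multinomial(n, p) and a fixed reference q.

    Linearity of expectation reduces the problem to the Binomial(n, p_h)
    marginal of each component X_h.
    """
    total = 0.0
    for p_h, q_h in zip(p, q):
        total += sum(abs(k / n - q_h) * binom_pmf(k, n, p_h)
                     for k in range(n + 1))
    return 1 - 0.5 * total

def bias(n, p, q):
    """Bias of the estimator: E[I_hat] minus the true index value."""
    return expected_pma_fixed_ref(n, p, q) - pma(p, q)

p = [0.5, 0.3, 0.2]          # assumed monitored profile
q_diff = [0.2, 0.5, 0.3]     # assumed reference that differs in every class
for n in (20, 200):
    # identical profiles versus clearly different profiles
    print(n, bias(n, p, p), bias(n, p, q_diff))
```

With these example profiles, the bias is non-positive and shrinks as n grows, and it is much smaller when the profiles differ in every class, in line with the behavior described in the abstract.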
Lemma 2
Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\), and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z_{n,h}\) a binomial random variable with parameters n and \(p_h\), that is \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\), and define the function \(B(p_h,q_h)\) as follows
Then, for every \(h=1,\ldots ,c\) we have
Proof
By definition we have
Now, let us split Eq. (27) into three terms. First, we have
Second,
Similarly, we also have
Combining results (28), (29) and (30), and recalling that a binomial random variable can be represented as a sum of independent components, for instance \(Z_{n,h}=Z_{n-1,h}+Z_{1,h}\) and \(Z_{n-1,h}=Z_{n-2,h}+Z^*_{1,h}\), we prove the Lemma. In fact,
where we have used iteratively the following property
\(\square\)
Lemma 3
Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\), and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider two values \(\omega _h,\omega _l \in \Omega\) with the corresponding probabilities \(p_h,p_l,q_h,q_l \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(U_{n,h}\) and \(U_{n,l}\) the components of a multinomial random vector \(\mathbf {U}_n=\big (U_{n,h},U_{n,l},n-U_{n,h}-U_{n,l} \big )\) with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), that is \(\mathbf {U}_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})\), and define the function \(C(p_h,p_l,q_h,q_l)\) as follows
Then, for every pair \(h,l=1,\ldots ,c\) with \(h \ne l\) we have
Proof
By definition we have
Now, let us split Eq. (31) into four terms. First, we have
Second,
Similarly, by exchanging h for l, we also derive
Finally,
Combining results (32), (33), (34) and (35) into Eq. (31) we prove the Lemma. \(\square\)
Proposition 2
Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) are distributed as a multinomial random vector \({\mathcal{M}}(n,{\mathbf{p}})\), the variance of the estimator \(\hat{I}\) in Eq. (3) is given by
where \(Z_{n,h}\) are binomial random variables with parameters n and \(p_h\), that is \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\), while \(U_{n,h}\) and \(U_{n,l}\) are the components of a multinomial random vector \(\mathbf {U}_n=\big (U_{n,h},U_{n,l},n-U_{n,h}-U_{n,l} \big )\) with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), that is \(\mathbf {U}_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})\).
Proof
First, we recall the following representation
If the random vector \({\mathbf{X}}=(X_1,\ldots ,X_c)\) is multinomial distributed with parameters n and \({\mathbf{p}}\), we know that the h-th component \(X_h\) has marginal binomial distribution with parameters n and \(p_h\) while two components \(X_h\) and \(X_l\) have marginal joint multinomial distribution with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), therefore we can rewrite Eq. (36) as follows
where \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\) and \((U_{n,h},U_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})\), or equivalently as
where
and
Now combining the results from Lemmas 1, 2 and 3 we prove the Proposition. \(\square\)
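The structure of this proof (single-class terms from binomial marginals, cross terms from pairwise trinomial marginals) can be mirrored in a short Python sketch. It assumes the estimator form \(\hat{I}=1-\frac{1}{2}\sum_h |X_h/n-q_h|\) and evaluates the moments by direct summation rather than through the closed-form functions \(A\), \(B\) and \(C\). The function names are hypothetical, and the double sum makes this practical only for modest n.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def trinom_pmf(i, j, n, ph, pl):
    """P(U_h = i, U_l = j) for (U_h, U_l, rest) ~ Multinomial(n, (ph, pl, 1-ph-pl))."""
    rest = max(0.0, 1 - ph - pl)  # guard against tiny negative rounding error
    return comb(n, i) * comb(n - i, j) * ph**i * pl**j * rest**(n - i - j)

def var_pma_fixed_ref(n, p, q):
    """Exact Var(I_hat) for X ~ Multinomial(n, p) and a fixed reference q.

    Writes Var(sum_h D_h) with D_h = |X_h/n - q_h| as
    sum_h E[D_h^2] + sum_{h != l} E[D_h D_l] - (sum_h E[D_h])^2.
    """
    c = len(p)
    dev = lambda k, h: abs(k / n - q[h])
    m1 = [sum(dev(k, h) * binom_pmf(k, n, p[h]) for k in range(n + 1))
          for h in range(c)]
    m2 = [sum(dev(k, h) ** 2 * binom_pmf(k, n, p[h]) for k in range(n + 1))
          for h in range(c)]
    second = sum(m2)
    for h in range(c):
        for l in range(c):
            if h != l:
                # cross moment from the joint (trinomial) marginal of X_h, X_l
                second += sum(dev(i, h) * dev(j, l)
                              * trinom_pmf(i, j, n, p[h], p[l])
                              for i in range(n + 1)
                              for j in range(n + 1 - i))
    return 0.25 * (second - sum(m1) ** 2)

print(var_pma_fixed_ref(30, [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]))
```

The cost grows roughly as \(c^2 n^2\), which gives a feel for why evaluating such probabilities becomes slow for large samples or many classes.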
Lemma 4
Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\), and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z'_{n,h}\) a binomial random variable with parameters n and \(p_h\) and by \(Z''_{m,h}\) a binomial random variable with parameters m and \(q_h\), that is \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\) and \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), respectively. Let us define the function \(D(p_h,q_h)\) as follows
Then, for every \(h=1,\ldots ,c\) we have
Proof
Let us introduce the indicator function \({\mathbb{I}}(A)\) of an event A, which equals 1 if the event A is observed and 0 otherwise. Then, the quantity \(D(p_h,q_h)\) can be represented as
Let us split Eq. (37) into two terms. Recalling that a binomial random variable \(Z''_{m,h}\) can be represented as the sum of m independent random variables \(Z''_{1,h,i}\) distributed as Bernoulli \(Z''_{1,h}\) or also as the sum of independent Bernoulli \(Z''_{1,h}\) and binomial \(Z''_{m-1,h}\), we have
that, due to the identical distribution of the Bernoulli variables \(Z''_{1,h,i}\) to \(Z''_{1,h}\), can be simplified in the notation. Hence
where we used the properties
and
Similarly, we can also derive
Now, combining the results (38) and (39) into Eq. (37) we prove the Lemma, in fact
\(\square\)
Proposition 3
Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) and \(\mathbf {Y}=(Y_1,\ldots ,Y_c)\) are distributed as multinomial random vectors \({\mathcal{M}}(n,{\mathbf{p}})\) and \({\mathcal{M}}(m,{\mathbf{q}})\), respectively, the expected value of the estimator \(\hat{I}\) in Eq. (19) is given by
where the \(Z'_{n-1,h}\) are binomial random variables with parameters \(n-1\) and \(p_h\) and the \(Z''_{m-1,h}\) are binomial random variables with parameters \(m-1\) and \(q_h\), that is \(Z'_{n-1,h} \sim {\mathcal{B}}(n-1,p_h)\) and \(Z''_{m-1,h} \sim {\mathcal{B}}(m-1,q_h)\), respectively.
Proof
From Eq. (19) we have
hence its expectation can be represented as
Now, recalling that the h-th component of a random vector \({\mathbf{X}}\), multinomial distributed with parameters n and \({\mathbf{p}}\), has marginal binomial distribution with parameters n and \(p_h\), that is \(X_{h} \sim {\mathcal{B}}(n,p_h)\), and the h-th component of a random vector \(\mathbf {Y}\), multinomial distributed with parameters m and \({\mathbf{q}}\), has marginal binomial distribution with parameters m and \(q_h\), that is \(Y_{h} \sim {\mathcal{B}}(m,q_h)\), Eq. (41) can be rewritten as
Therefore, the proof follows from combining Eq. (40) in Lemma 4 and Eq. (42). \(\square\)
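When both profiles are estimated, the same marginal argument applies to pairs of independent binomials. The Python sketch below assumes the estimator form \(\hat{I}=1-\frac{1}{2}\sum_h |X_h/n-Y_h/m|\) and independence of \({\mathbf{X}}\) and \(\mathbf{Y}\) (Eq. (19) is not reproduced in this appendix, so both are assumptions), and computes \(E[\hat{I}]\) by direct double summation rather than through the closed form of Proposition 3. Function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pma(p, q):
    """PMA between two profiles, assumed form: 1 - 0.5 * sum |p_h - q_h|."""
    return 1 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def expected_pma_both_estimated(n, m, p, q):
    """Exact E[I_hat] for independent X ~ Multinomial(n, p), Y ~ Multinomial(m, q).

    For each class h, sums |i/n - j/m| over the product of the
    Binomial(n, p_h) and Binomial(m, q_h) marginals.
    """
    total = 0.0
    for p_h, q_h in zip(p, q):
        px = [binom_pmf(i, n, p_h) for i in range(n + 1)]
        py = [binom_pmf(j, m, q_h) for j in range(m + 1)]
        total += sum(abs(i / n - j / m) * px[i] * py[j]
                     for i in range(n + 1) for j in range(m + 1))
    return 1 - 0.5 * total

p = [0.5, 0.3, 0.2]   # assumed monitored profile
q = [0.2, 0.5, 0.3]   # assumed reference profile
print(pma(p, q), expected_pma_both_estimated(30, 30, p, q))
```

Comparing the two printed values shows how far the expected estimate falls below the true index when the reference profile is itself subject to sampling error.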
Lemma 5
Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\), and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z'_{n,h}\) a binomial random variable with parameters n and \(p_h\) and by \(Z''_{m,h}\) a binomial random variable with parameters m and \(q_h\), that is \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\) and \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), respectively. Let us define the function \(F(p_h,q_h)\) as follows
Then, for every \(h=1,\ldots ,c\) we have
Proof
By using the indicator function \({\mathbb{I}}(A)\) of an event A, the quantity \(F(p_h,q_h)\) can be represented as
Now, let us split Eq. (43) into three terms and recall that a binomial random variable \(Z''_{m,h}\) can be represented as a sum of m independent random variables \(Z''_{1,h,i}\) distributed as Bernoulli \(Z''_{1,h}\) or also as a sum of independent Bernoulli \(Z''_{1,h}\) and binomial \(Z''_{m-1,h}\). For the first term, we have
in which, as in Lemma 4, we have used the independence and the identical distribution of the Bernoulli variables \(Z''_{1,h,i}\). Similarly, we can also derive the second term of Eq. (43)
while for the last term, we have
Now, it suffices to plug the results (44), (45) and (46) into Eq. (43) to prove the Lemma. In fact,
\(\square\)
Lemma 6
Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\), and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider two values \(\omega _h,\omega _l \in \Omega\) with the corresponding probabilities \(p_h,p_l,q_h,q_l \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(U'_{n,h}\) and \(U'_{n,l}\) the components of a multinomial random vector \(\mathbf {U}'_n=\big (U'_{n,h},U'_{n,l},n-U'_{n,h}-U'_{n,l} \big )\) with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), that is \(\mathbf {U}'_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})\), and by \(U''_{m,h}\) and \(U''_{m,l}\) the components of a multinomial random vector \(\mathbf {U}''_m=\big (U''_{m,h},U''_{m,l},m-U''_{m,h}-U''_{m,l} \big )\) with parameters m and \({\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)\), that is \(\mathbf {U}''_m \sim {\mathcal{M}}(m,{\mathbf{q}}_{hl})\). Let us define the function \(G(p_h,p_l,q_h,q_l)\) as follows
Then, for every pair \(h,l=1,\ldots ,c\) with \(h \ne l\) we have
Proof
By using the indicator function \({\mathbb{I}}(A)\) of an event A, the quantity \(G(p_h,p_l,q_h,q_l)\) can be represented as
Now, let us split Eq. (48) into four terms and recall that a multinomial random vector \(\mathbf {U}''_m=\big (U''_{m,h},U''_{m,l},m-U''_{m,h}-U''_{m,l} \big )\) can be represented as a sum of m independent random vectors \(\mathbf {U}''_{1,i}=\big (U''_{1,h,i},U''_{1,l,i},1-U''_{1,h,i}-U''_{1,l,i} \big )\) identically distributed as multinomial \(\mathbf {U}''_1=\big (U''_{1,h},U''_{1,l},1-U''_{1,h}-U''_{1,l} \big )\) or also as a sum of two independent multinomials \(\mathbf {U}''_{1}\) and \(\mathbf {U}''_{m-1}\) with the same probability parameter \({\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)\). Hence, for the first term, we have
where when \(i=j\), we use the property
In fact, the two components \(U''_{1,h,i}\) and \(U''_{1,l,i}\) of a multinomial \(\mathbf {U}''_{1,i}=\big (U''_{1,h,i},U''_{1,l,i},1-U''_{1,h,i}-U''_{1,l,i} \big )\) cannot both be equal to 1.
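A direct consequence of this observation is that the product of two components of a single multinomial trial is always zero, so only the \(i \ne j\) terms contribute to the cross moment and \(E[U''_{m,h}U''_{m,l}]=m(m-1)q_hq_l\). The Python sketch below checks this standard fact by exact enumeration; it is an illustration with hypothetical helper names and assumed example probabilities, not part of the authors' proof.

```python
from math import comb

def trinom_pmf(i, j, m, qh, ql):
    """P(U_h = i, U_l = j) for (U_h, U_l, rest) ~ Multinomial(m, (qh, ql, 1-qh-ql))."""
    rest = max(0.0, 1 - qh - ql)  # guard against tiny negative rounding error
    return comb(m, i) * comb(m - i, j) * qh**i * ql**j * rest**(m - i - j)

def cross_moment(m, qh, ql):
    """Exact E[U_h * U_l], summing over all outcomes with i + j <= m."""
    return sum(i * j * trinom_pmf(i, j, m, qh, ql)
               for i in range(m + 1) for j in range(m + 1 - i))

# Single trial: the two indicators can never both equal 1.
print(cross_moment(1, 0.3, 0.2))
# Ten trials: compare the enumeration with m*(m-1)*qh*ql.
print(cross_moment(10, 0.3, 0.2), 10 * 9 * 0.3 * 0.2)
```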
Similarly, the second term of Eq. (48) is given by
while for the third term of Eq. (48) we have
Similarly, the last term of Eq. (48) can be obtained as follows
Now, it suffices to plug the results (49), (50), (51) and (52) into Eq. (48) to prove the Lemma. In fact,
\(\square\)
Proposition 4
Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) and \(\mathbf {Y}=(Y_1,\ldots ,Y_c)\) are distributed as multinomial random vectors \({\mathcal{M}}(n,{\mathbf{p}})\) and \({\mathcal{M}}(m,{\mathbf{q}})\), respectively, the variance of the estimator \(\hat{I}\) in Eq. (19) is given by
where \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\), \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), \((U'_{n,h},U'_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})\) and \((U''_{m,h},U''_{m,l}) \sim {\mathcal{M}}(m, {\mathbf{q}}_{hl})\).
Proof
First, we recall the following representation
If a random vector \({\mathbf{X}}=(X_1,\ldots ,X_c)\) is multinomial distributed with parameters n and \({\mathbf{p}}\), the h-th component \(X_h\) has marginal binomial distribution with parameters n and \(p_h\) while two components \(X_h\) and \(X_l\) have marginal joint multinomial distribution with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\). Also, if a random vector \(\mathbf {Y}=(Y_1,\ldots ,Y_c)\) is multinomial distributed with parameters m and \({\mathbf{q}}\), the h-th component \(Y_h\) has marginal binomial distribution with parameters m and \(q_h\) while two components \(Y_h\) and \(Y_l\) have marginal joint multinomial distribution with parameters m and \({\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)\). Therefore we can rewrite Eq. (54) as follows
where \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\), \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), \((U'_{n,h},U'_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})\) and \((U''_{m,h},U''_{m,l}) \sim {\mathcal{M}}(m, {\mathbf{q}}_{hl})\), or equivalently as
where
and
Now combining the results from Lemmas 4, 5 and 6 we prove the Proposition. \(\square\)
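Since the exact variance for two estimated profiles is expensive to evaluate, a Monte Carlo comparison with the fixed-reference case is a natural check. The Python sketch below is one way such an experiment could be set up. The estimator form \(1-\frac{1}{2}\sum_h|\hat{p}_h-\hat{q}_h|\), the example profiles, the sample sizes, the number of replicates and the seed are all assumptions made for illustration and do not reproduce the authors' simulation design.

```python
import random
from statistics import variance

def sample_profile(rng, n, probs):
    """Draw one Multinomial(n, probs) sample and return its relative frequencies."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return [x / n for x in counts]

def pma(a, b):
    """PMA between two profiles, assumed form: 1 - 0.5 * sum |a_h - b_h|."""
    return 1 - 0.5 * sum(abs(x - y) for x, y in zip(a, b))

def simulate_variances(n, m, p, q, reps=2000, seed=1):
    """Return (variance with q fixed, variance with q estimated from m draws)."""
    rng = random.Random(seed)  # local generator keeps the run reproducible
    fixed, both = [], []
    for _ in range(reps):
        p_hat = sample_profile(rng, n, p)
        q_hat = sample_profile(rng, m, q)
        fixed.append(pma(p_hat, q))      # reference profile known exactly
        both.append(pma(p_hat, q_hat))   # reference profile also sampled
    return variance(fixed), variance(both)

v_fixed, v_both = simulate_variances(100, 100, [0.5, 0.3, 0.2], [0.2, 0.5, 0.3])
print(v_fixed, v_both)
```

Rerunning the function over a grid of n, m and profile shapes gives a rough picture of when the fixed-reference variance is a usable stand-in for the two-sample variance.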
Cite this article
Ärje, J., Choi, KP., Divino, F. et al. Understanding the statistical properties of the percent model affinity index can improve biomonitoring related decision making. Stoch Environ Res Risk Assess 30, 1981–2008 (2016). https://doi.org/10.1007/s00477-015-1202-6
DOI: https://doi.org/10.1007/s00477-015-1202-6