Understanding the statistical properties of the percent model affinity index can improve biomonitoring related decision making

  • Original Paper
  • Published in: Stochastic Environmental Research and Risk Assessment

Abstract

The percent model affinity (PMA) index is used to measure the similarity of two probability profiles representing, for example, an ideal profile (i.e. reference condition) and a monitored profile (i.e. possibly impacted condition). The goal of this work is to study the effects of sample size, evenness, true value of the index and number of classes on the statistical properties of the estimator of the PMA index. We derive and extend previous formulas for the expectation and variance of the estimator when the monitored profile is estimated and the reference profile is fixed. Using the obtained extension, we find that the estimator is asymptotically unbiased and converges faster when the profiles differ. When both profiles are estimated, we calculate the expectation using transformation rules for expectation and, in addition, derive the formula for the estimator’s variance. Since the computation of the probabilities in the variance formula is slow, we study the behavior of the variance with simulation experiments and assess whether it could be approximated with the variance for the fixed reference profile. Finally, we provide a set of recommendations to help users of the PMA index avoid its most common pitfalls.

Acknowledgments

We thank the Ellen and Artturi Nyyssönen foundation for the grant awarded to Ärje and the Academy of Finland (projects 289076 (SK) and 289104 (KM)). KPC acknowledges the support of Singapore Ministry of Education Academic Research Fund R-155-000-147-112. We kindly thank Jukka Aroviita for insights into the Finnish adaptation of the PMA in the WFD context and Antti Penttinen and Jukka Nyblom for helpful conversations.

Author information

Corresponding author

Correspondence to Johanna Ärje.

Appendix

Lemma 1

Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\) and \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z_{n,h}\) a binomial random variable with parameters n and \(p_h\), that is \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\), and define the function \(A(p_h,q_h)\) as follows

$$\begin{aligned} A(p_h,q_h)=E\big [(nq_h-Z_{n,h})_+ \big ]. \end{aligned}$$

Then, for every \(h=1,\ldots ,c\) we have

$$\begin{aligned} A(p_h,q_h)=n (q_h-p_h) P(Z_{n-1,h} \le \lfloor nq_h\rfloor )+ np_h(1-q_h) P(Z_{n-1,h}=\lfloor nq_h\rfloor ). \end{aligned}$$

Proof

By definition we have

$$\begin{aligned} A(p_h,q_h)& = E\big [(nq_h-Z_{n,h})_+ \big ] \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } (n q_{h}-z)P(Z_{n,h}=z) \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-\sum _{z=0}^{\lfloor nq_h \rfloor } z P(Z_{n,h}=z) \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-\sum _{z=1}^{\lfloor nq_h \rfloor } z \dfrac{n!}{z!(n-z)!}p_h^{z}(1-p_h)^{n-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-\sum _{z=1}^{\lfloor nq_h \rfloor } z \dfrac{n}{z} \dfrac{(n-1)!}{(z-1)!(n-z)!}p_hp_h^{z-1}(1-p_h)^{n-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-np_h \sum _{z=1}^{\lfloor nq_h \rfloor } \dfrac{(n-1)!}{(z-1)!(n-z)!}p_h^{z-1}(1-p_h)^{n-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-np_h \sum _{z=0}^{\lfloor nq_h \rfloor -1} \dfrac{(n-1)!}{z!(n-1-z)!}p_h^{z}(1-p_h)^{n-1-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )- np_h \sum _{z=0}^{\lfloor nq_h\rfloor -1} P(Z_{n-1,h}=z) \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )- np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) . \end{aligned}$$
(24)

Recalling that a binomial random variable \(Z_{n,h}\) can be represented as the sum of two independent binomials \(Z_{n-1,h}\) and \(Z_{1,h}\), we have

$$\begin{aligned} P(Z_{n,h}\le \lfloor nq_h\rfloor )& = P(Z_{n-1,h}+Z_{1,h}\le \lfloor nq_h\rfloor )\\& = P(Z_{n-1,h}+Z_{1,h}\le \lfloor nq_h\rfloor \vert Z_{1,h}=0)P(Z_{1,h}=0)\\&\quad +P(Z_{n-1,h}+Z_{1,h}\le \lfloor nq_h\rfloor \vert Z_{1,h}=1)P(Z_{1,h}=1)\\& = P(Z_{n-1,h}\le \lfloor nq_h\rfloor )(1-p_h)+P(Z_{n-1,h}\le \lfloor nq_h\rfloor -1)p_h \end{aligned}$$

and we can reformulate Eq. (24) as follows

$$\begin{aligned} A(p_h,q_h)& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )- np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) \\& = n(1-p_h)q_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) \\&\quad +np_h q_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) \\&\quad -np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) \\& = nq_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor )-np_hq_hP(Z_{n-1,h} \le \lfloor nq_h\rfloor ) \\&\quad +np_hq_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor )- np_hq_h P(Z_{n-1,h} = \lfloor nq_h\rfloor ) \\&\quad -np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) + np_h P(Z_{n-1,h} = \lfloor nq_h\rfloor ) \\& = n(q_h-p_h) P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) \\&\quad +np_h(1-q_h) P(Z_{n-1,h}=\lfloor nq_h \rfloor ), \end{aligned}$$
(25)

proving the Lemma. \(\square\)
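Lemma 1 can be checked numerically. The following Python sketch is an illustration added here, not part of the original derivation: it compares the closed form with a direct summation of the definition, using exact rational arithmetic so that the comparison is free of rounding error. The helper names (`binom_pmf`, `binom_cdf`, `A_direct`, `A_closed`) are hypothetical.

```python
from fractions import Fraction
from math import comb, floor


def binom_pmf(z, n, p):
    """P(Z = z) for Z ~ B(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(z, n, p):
    """P(Z <= z) for Z ~ B(n, p)."""
    return sum((binom_pmf(k, n, p) for k in range(min(z, n) + 1)), Fraction(0))


def A_direct(n, p, q):
    """E[(n q - Z)_+] summed directly from the definition."""
    return sum(
        (max(n * q - z, 0) * binom_pmf(z, n, p) for z in range(n + 1)),
        Fraction(0),
    )


def A_closed(n, p, q):
    """Closed form stated in Lemma 1."""
    k = floor(n * q)
    return (n * (q - p) * binom_cdf(k, n - 1, p)
            + n * p * (1 - q) * binom_pmf(k, n - 1, p))


# Exact comparison on a small grid of rational p and q (includes 0 and 1).
grid = [Fraction(i, 10) for i in range(11)]
assert all(A_direct(n, p, q) == A_closed(n, p, q)
           for n in (1, 5, 12) for p in grid for q in grid)
```

Passing `p` and `q` as `Fraction` objects keeps \(\lfloor nq_h\rfloor\) exact, which matters when \(nq_h\) is an integer.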

Proposition 1

Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) are distributed as a multinomial random vector \({\mathcal{M}}(n,{\mathbf{p}})\), the expected value of the estimator \(\hat{I}\) in Eq. (3) is given by

$$\begin{aligned} E \big [\hat{I}\big ]= 1-\sum _{h=1}^c (q_h-p_h)P(Z_{n-1,h}\le \lfloor nq_h\rfloor ) -\sum _{h=1}^c p_h(1-q_h)P(Z_{n-1,h}=\lfloor nq_h\rfloor ), \end{aligned}$$

where the \(Z_{n-1,h}\) for \(h=1,\ldots ,c\) are independent binomial random variables, each one with parameters \(n-1\) and \(p_h\), that is \(Z_{n-1,h} \sim {\mathcal{B}}(n-1,p_h)\).

Proof

From Eq. (5) the expectation of \(\hat{I}\) can be represented as

$$\begin{aligned} E \big [\hat{I}\big ]= 1-\frac{1}{n}\sum _{h=1}^c E\big [(nq_h-X_{h})_+ \big ]. \end{aligned}$$

Recalling that the h-th component of a random vector \({\mathbf{X}}\), multinomial distributed with parameters n and \({\mathbf{p}}\), has marginal binomial distribution with parameters n and \(p_h\), that is \(X_{h} \sim {\mathcal{B}}(n,p_h)\), that expectation can be rewritten as

$$\begin{aligned} E \big [\hat{I}\big ]& = 1-\frac{1}{n}\sum _{h=1}^c E\big [(nq_h-Z_{n,h})_+ \big ] \\& = 1-\frac{1}{n}\sum _{h=1}^c A(p_h,q_h). \end{aligned}$$
(26)

Therefore, the proof follows from combining Eq. (25) in Lemma 1 and Eq. (26). \(\square\)
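To illustrate Proposition 1, the sketch below (an added example, not taken from the paper) evaluates the closed-form expectation and compares it with a brute-force enumeration of a three-class multinomial. It assumes the estimator has the form \(\hat{I}=\sum _h \min \{X_h/n,q_h\}\), which is equivalent to the representation \(1-\frac{1}{n}\sum _h (nq_h-X_h)_+\) used in the proof; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial, floor


def binom_pmf(z, n, p):
    """P(Z = z) for Z ~ B(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(z, n, p):
    """P(Z <= z) for Z ~ B(n, p)."""
    return sum((binom_pmf(k, n, p) for k in range(min(z, n) + 1)), Fraction(0))


def expected_pma(n, p, q):
    """E[I_hat] from Proposition 1 (reference profile q fixed)."""
    total = Fraction(1)
    for ph, qh in zip(p, q):
        k = floor(n * qh)
        total -= (qh - ph) * binom_cdf(k, n - 1, ph)
        total -= ph * (1 - qh) * binom_pmf(k, n - 1, ph)
    return total


def expected_pma_bruteforce(n, p, q):
    """Exact E[I_hat] by enumerating all outcomes of a 3-class M(n, p)."""
    total = Fraction(0)
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            x = (x1, x2, n - x1 - x2)
            prob = Fraction(factorial(n))
            for xh, ph in zip(x, p):
                prob = prob * ph**xh / factorial(xh)
            # Assumed estimator form: I_hat = sum_h min(X_h / n, q_h)
            total += prob * sum(min(Fraction(xh, n), qh) for xh, qh in zip(x, q))
    return total


p = (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))
q = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))
assert expected_pma(8, p, q) == expected_pma_bruteforce(8, p, q)

# With identical profiles the true index is 1, so 1 - E[I_hat] is the bias.
for n in (10, 40, 160):
    print(n, float(1 - expected_pma(n, p, p)))
```

For this choice of \({\mathbf{p}}\), the printed bias decreases as n grows, in line with the asymptotic unbiasedness noted in the abstract.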

Lemma 2

Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\) and \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z_{n,h}\) a binomial random variable with parameters n and \(p_h\), that is \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\), and define the function \(B(p_h,q_h)\) as follows

$$\begin{aligned} B(p_h,q_h)=E\big [ \big ( (nq_h-Z_{n,h})_+ \big )^2 \big ]. \end{aligned}$$

Then, for every \(h=1,\ldots ,c\) we have

$$\begin{aligned} B(p_h,q_h)& = n^2(1-p_h)^2q_h^2P(Z_{n-2,h}\le \lfloor nq_h \rfloor )\\&+np_h(1-p_h)\big [1-2nq_h(1-q_h)\big ]P(Z_{n-2,h}\le \lfloor nq_h \rfloor -1)\\&+n^2p_h^2(1-q_h)^2 P(Z_{n-2,h} \le \lfloor nq_h \rfloor -2). \end{aligned}$$

Proof

By definition we have

$$\begin{aligned} B(p_h,q_h)& = E\big [\big ((nq_h-Z_{n,h})_+ \big )^2 \big ] \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } (n q_{h}-z)^2 P(Z_{n,h}=z) \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } \big (n^2q_{h}^2 -2nq_{h}z + z^2 \big ) P(Z_{n,h}=z) \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } \big (n^2q_{h}^2 -2nq_{h}z + z^2+z-z \big ) P(Z_{n,h}=z) \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } \big [n^2q_{h}^2 + (1-2nq_{h})z + z(z-1) \big ] P(Z_{n,h}=z). \end{aligned}$$
(27)

Now, let us split Eq. (27) into three terms. First, we have

$$\begin{aligned} \sum _{z=0}^{\lfloor nq_h\rfloor } n^2q_{h}^2 P(Z_{n,h}=z)=n^2q_{h}^2 P(Z_{n,h} \le \lfloor nq_h\rfloor ). \end{aligned}$$
(28)

Second,

$$\begin{aligned} \sum _{z=0}^{\lfloor nq_h\rfloor } (1-2nq_{h})z P(Z_{n,h}=z)& = \sum _{z=1}^{\lfloor nq_h\rfloor } (1-2nq_{h})z P(Z_{n,h}=z) \\& = (1-2nq_{h})\sum _{z=1}^{\lfloor nq_h\rfloor }z \dfrac{n!}{z!(n-z)!}p_h^z(1-p_h)^{n-z} \\& = (1-2nq_{h})\sum _{z=1}^{\lfloor nq_h\rfloor }z \dfrac{n}{z} \dfrac{(n-1)!}{(z-1)!(n-z)!}p_hp_h^{z-1}(1-p_h)^{n-z} \\& = np_h(1-2nq_{h}) \sum _{z=1}^{\lfloor nq_h\rfloor }\dfrac{(n-1)!}{(z-1)!(n-z)!}p_h^{z-1}(1-p_h)^{n-z} \\& = np_h(1-2nq_{h}) \sum _{z=0}^{\lfloor nq_h\rfloor -1}\dfrac{(n-1)!}{z!(n-z-1)!}p_h^{z}(1-p_h)^{n-z-1} \\& = np_h(1-2nq_{h}) P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1). \end{aligned}$$
(29)

Similarly, we also have

$$\begin{aligned} \sum _{z=0}^{\lfloor nq_h\rfloor } z(z-1) P(Z_{n,h}=z)& = \sum _{z=2}^{\lfloor nq_h\rfloor } z(z-1) P(Z_{n,h}=z) \\& = \sum _{z=2}^{\lfloor nq_h\rfloor } z(z-1) \dfrac{n!}{z!(n-z)!}p_h^z(1-p_h)^{n-z} \\& = \sum _{z=2}^{\lfloor nq_h\rfloor } z(z-1) \dfrac{n(n-1)}{z(z-1)} \dfrac{(n-2)!}{(z-2)!(n-z)!}p_h^2p_h^{z-2}(1-p_h)^{n-z} \\& = n(n-1)p_h^2\sum _{z=2}^{\lfloor nq_h\rfloor } \dfrac{(n-2)!}{(z-2)!(n-z)!}p_h^{z-2}(1-p_h)^{n-z} \\& = n(n-1)p_h^2\sum _{z=0}^{\lfloor nq_h\rfloor -2} \dfrac{(n-2)!}{z!(n-2-z)!}p_h^{z}(1-p_h)^{n-2-z} \\& = n(n-1)p_h^2 P(Z_{n-2,h} \le \lfloor nq_h\rfloor -2). \end{aligned}$$
(30)

Combining results (28), (29) and (30), and recalling that a binomial random variable can be represented as sum of independent components, as for instance \(Z_{n,h}=Z_{n-1,h}+Z_{1,h}\) and \(Z_{n-1,h}=Z_{n-2,h}+Z^*_{1,h}\), we prove the Lemma, in fact

$$\begin{aligned} B(p_h,q_h)& = n^2q_{h}^2 P(Z_{n,h} \le \lfloor nq_h\rfloor )\\&+(1-2nq_{h}) np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1)\\&+n(n-1)p_h^2 P(Z_{n-2,h} \le \lfloor nq_h\rfloor -2)\\& = n^2(1-p_h)^2q_h^2 P(Z_{n-2,h}\le \lfloor nq_h \rfloor )\\&+np_h(1-p_h)\big [1-2nq_h(1-q_h)\big ]P(Z_{n-2,h}\le \lfloor nq_h \rfloor -1)\\&+n^2p_h^2(1-q_h)^2 P(Z_{n-2,h} \le \lfloor nq_h \rfloor -2), \end{aligned}$$

where we have used iteratively the following property

$$\begin{aligned} P(Z_{n,h} \le z)& = P(Z_{n-1,h}+Z_{1,h} \le z)\\& = P(Z_{n-1,h} + Z_{1,h} \le z \vert Z_{1,h}=0)P(Z_{1,h}=0)\\&+P(Z_{n-1,h} + Z_{1,h} \le z \vert Z_{1,h}=1)P(Z_{1,h}=1)\\& = P(Z_{n-1,h} \le z)(1-p_h)+P(Z_{n-1,h} \le z-1)p_h. \end{aligned}$$

\(\square\)
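The identity of Lemma 2 can be verified in the same way. The sketch below is an added illustration with hypothetical function names; it requires \(n \ge 2\) because the closed form involves \(Z_{n-2,h}\).

```python
from fractions import Fraction
from math import comb, floor


def binom_pmf(z, n, p):
    """P(Z = z) for Z ~ B(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(z, n, p):
    """P(Z <= z) for Z ~ B(n, p); an empty sum (zero) when z < 0."""
    return sum((binom_pmf(k, n, p) for k in range(min(z, n) + 1)), Fraction(0))


def B_direct(n, p, q):
    """E[((n q - Z)_+)^2] summed directly from the definition."""
    return sum(
        (max(n * q - z, 0) ** 2 * binom_pmf(z, n, p) for z in range(n + 1)),
        Fraction(0),
    )


def B_closed(n, p, q):
    """Closed form stated in Lemma 2 (needs n >= 2)."""
    k = floor(n * q)
    return (n**2 * (1 - p) ** 2 * q**2 * binom_cdf(k, n - 2, p)
            + n * p * (1 - p) * (1 - 2 * n * q * (1 - q)) * binom_cdf(k - 1, n - 2, p)
            + n**2 * p**2 * (1 - q) ** 2 * binom_cdf(k - 2, n - 2, p))


grid = [Fraction(i, 10) for i in range(11)]
assert all(B_direct(n, p, q) == B_closed(n, p, q)
           for n in (2, 5, 12) for p in grid for q in grid)
```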

Lemma 3

Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\) and \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) two probability distributions defined over \(\Omega\). Let us consider two values \(\omega _h,\omega _l \in \Omega\) with the corresponding probabilities \(p_h,p_l,q_h,q_l \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(U_{n,h}\) and \(U_{n,l}\) the components of a multinomial random vector \(\mathbf {U}_n=\big (U_{n,h},U_{n,l},n-U_{n,h}-U_{n,l} \big )\) with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), that is \(\mathbf {U}_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})\), and define the function \(C(p_h,p_l,q_h,q_l)\) as follows

$$\begin{aligned} C(p_h,p_l,q_h,q_l)=E\big [(nq_h-U_{n,h})_+(nq_l-U_{n,l})_+ \big ]. \end{aligned}$$

Then, for every pair of distinct indices \(h,l \in \{1,\ldots ,c\}\) we have

$$\begin{aligned} C(p_h,p_l,q_h,q_l)& = n^2q_h q_l P(U_{n,h} \le \lfloor nq_h \rfloor , U_{n,l} \le \lfloor nq_l \rfloor )\\&-n^2q_h p_l P(U_{n-1,h} \le \lfloor nq_h \rfloor , U_{n-1,l} \le \lfloor nq_l \rfloor -1)\\&-n^2p_h q_l P(U_{n-1,h} \le \lfloor nq_h \rfloor -1, U_{n-1,l} \le \lfloor nq_l \rfloor )\\&+n(n-1)p_h p_l P(U_{n-2,h} \le \lfloor nq_h \rfloor -1, U_{n-2,l} \le \lfloor nq_l \rfloor -1). \end{aligned}$$

Proof

By definition we have

$$\begin{aligned}&C(p_h,p_l,q_h,q_l)=E\big [(nq_h-U_{n,h})_+(nq_l-U_{n,l})_+ \big ] \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }(nq_h-u_h)(nq_l-u_l)P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }(n^2q_hq_l-nq_hu_l - nu_hq_l+u_hu_l)P(U_{n,h}=u_h,U_{n,l}=u_l). \end{aligned}$$
(31)

Now, let us split Eq. (31) into four terms. First, we have

$$\begin{aligned} \sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }n^2q_hq_l P(U_{n,h}& = u_h, U_{n,l}=u_l) \\& = n^2q_hq_lP(U_{n,h} \le \lfloor nq_h\rfloor , U_{n,l} \le \lfloor nq_l\rfloor ). \end{aligned}$$
(32)

Second,

$$\begin{aligned}&\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }nq_hu_l P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l \rfloor } nq_hu_l P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }nq_hu_l \dfrac{n!}{u_h!u_l!(n-u_h-u_l)!}p_h^{u_h}p_l^{u_l}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }nq_hu_l \dfrac{n}{u_l} \dfrac{(n-1)!}{u_h!(u_l-1)!(n-u_h-u_l)!}p_h^{u_h} p_l p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n^2q_hp_l \sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor } \dfrac{(n-1)!}{u_h!(u_l-1)!(n-u_h-u_l)!}p_h^{u_h} p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n^2q_hp_l \sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor -1} \dfrac{(n-1)!}{u_h!u_l!(n-1-u_h-u_l)!}p_h^{u_h} p_l^{u_l}(1-p_h-p_l)^{n-1-u_h-u_l} \\&\quad = n^2q_hp_l P(U_{n-1,h} \le \lfloor nq_h\rfloor , U_{n-1,l} \le \lfloor nq_l\rfloor -1). \end{aligned}$$
(33)

Similarly, by exchanging h for l, we also derive

$$\begin{aligned}&\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }nq_lu_h P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad = n^2p_hq_l P(U_{n-1,h} \le \lfloor nq_h\rfloor -1 , U_{n-1,l} \le \lfloor nq_l\rfloor ). \end{aligned}$$
(34)

And last

$$\begin{aligned}&\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }u_hu_l P(U_{n,h}= u_h ,U_{n,l}=u_l) = \sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor } u_hu_l P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }u_hu_l \dfrac{n!}{u_h!u_l!(n-u_h-u_l)!}p_h^{u_h}p_l^{u_l}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad =\sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }u_hu_l \dfrac{n(n-1)}{u_hu_l} \dfrac{(n-2)!}{(u_h-1)!(u_l-1)!(n-u_h-u_l)!}p_hp_h^{u_h-1} p_l p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n(n-1)p_hp_l \sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor } \dfrac{(n-2)!}{(u_h-1)!(u_l-1)!(n-u_h-u_l)!}p_h^{u_h-1} p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n(n-1)p_hp_l \sum _{u_h=0}^{\lfloor nq_h\rfloor -1}\sum _{u_l=0}^{\lfloor nq_l\rfloor -1} \dfrac{(n-2)!}{u_h!u_l!(n-u_h-u_l-2)!}p_h^{u_h} p_l^{u_l}(1-p_h-p_l)^{n-u_h-u_l-2} \\&\quad = n(n-1)p_hp_l P(U_{n-2,h} \le \lfloor nq_h\rfloor -1 , U_{n-2,l} \le \lfloor nq_l\rfloor -1). \end{aligned}$$
(35)

Combining results (32), (33), (34) and (35) into Eq. (31) we prove the Lemma. \(\square\)
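A numerical check of Lemma 3 needs the joint distribution of two multinomial components. The following sketch is an added illustration, with hypothetical helper names and \(n \ge 2\); it enumerates the trinomial \({\mathcal{M}}(n,{\mathbf{p}}_{hl})\) exactly and compares the definition of \(C\) with the closed form.

```python
from fractions import Fraction
from math import factorial, floor


def trinom_pmf(u, v, n, ph, pl):
    """P(U_h = u, U_l = v) for (U_h, U_l, n - U_h - U_l) ~ M(n, (ph, pl, 1 - ph - pl))."""
    if u < 0 or v < 0 or u + v > n:
        return Fraction(0)
    coef = Fraction(factorial(n), factorial(u) * factorial(v) * factorial(n - u - v))
    return coef * ph**u * pl**v * (1 - ph - pl) ** (n - u - v)


def trinom_cdf(a, b, n, ph, pl):
    """P(U_h <= a, U_l <= b); an empty sum (zero) when a < 0 or b < 0."""
    return sum(
        (trinom_pmf(u, v, n, ph, pl)
         for u in range(min(a, n) + 1) for v in range(min(b, n) + 1)),
        Fraction(0),
    )


def C_direct(n, ph, pl, qh, ql):
    """E[(n q_h - U_h)_+ (n q_l - U_l)_+] summed directly from the definition."""
    return sum(
        (max(n * qh - u, 0) * max(n * ql - v, 0) * trinom_pmf(u, v, n, ph, pl)
         for u in range(n + 1) for v in range(n + 1 - u)),
        Fraction(0),
    )


def C_closed(n, ph, pl, qh, ql):
    """Closed form stated in Lemma 3 (needs n >= 2)."""
    kh, kl = floor(n * qh), floor(n * ql)
    return (n**2 * qh * ql * trinom_cdf(kh, kl, n, ph, pl)
            - n**2 * qh * pl * trinom_cdf(kh, kl - 1, n - 1, ph, pl)
            - n**2 * ph * ql * trinom_cdf(kh - 1, kl, n - 1, ph, pl)
            + n * (n - 1) * ph * pl * trinom_cdf(kh - 1, kl - 1, n - 2, ph, pl))


args = (Fraction(1, 5), Fraction(3, 10), Fraction(1, 4), Fraction(2, 5))
assert all(C_direct(n, *args) == C_closed(n, *args) for n in (2, 3, 7, 10))
```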

Proposition 2

Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) are distributed as a multinomial random vector \({\mathcal{M}}(n,{\mathbf{p}})\), the variance of the estimator \(\hat{I}\) in Eq. (3) is given by

$$\begin{aligned} Var [{\hat{I}}]& = \sum _{h=1}^c \bigg \{ (1-p_h)^2q_h^2P(Z_{n-2,h}\le \lfloor nq_h \rfloor ) +p_h^2(1-q_h)^2 P(Z_{n-2,h} \le \lfloor nq_h \rfloor -2) \\&\quad +\frac{1}{n} p_h(1-p_h)\big [1-2nq_h(1-q_h)\big ]P(Z_{n-2,h}\le \lfloor nq_h \rfloor -1) \bigg \} \\&\quad +2\sum _{h=1}^{c-1}\sum _{l=h+1}^c \bigg \{ q_h q_l P(U_{n,h} \le \lfloor nq_h \rfloor , U_{n,l} \le \lfloor nq_l \rfloor ) \\&\quad -q_h p_l P(U_{n-1,h} \le \lfloor nq_h \rfloor , U_{n-1,l} \le \lfloor nq_l \rfloor -1) \\&\quad -p_h q_l P(U_{n-1,h} \le \lfloor nq_h \rfloor -1, U_{n-1,l} \le \lfloor nq_l \rfloor ) \\&\quad + \frac{n-1}{n} p_h p_l P(U_{n-2,h} \le \lfloor nq_h \rfloor -1, U_{n-2,l} \le \lfloor nq_l \rfloor -1) \bigg \} \\&\quad - \left( \sum _{h=1}^c \left\{ (q_h-p_h) P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) +\, p_h(1-q_h) P(Z_{n-1,h}=\lfloor nq_h \rfloor ) \right\} \right) ^2 \end{aligned}$$

where \(Z_{n,h}\) are binomial random variables with parameters n and \(p_h\), that is \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\), while \(U_{n,h}\) and \(U_{n,l}\) are the components of a multinomial random vector \(\mathbf {U}_n=\big (U_{n,h},U_{n,l},n-U_{n,h}-U_{n,l} \big )\) with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), that is \(\mathbf {U}_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})\).

Proof

First, we recall the following representation

$$\begin{aligned} Var [\hat{I}]& = \frac{1}{n^2} Var \left[ \sum _{h=1}^c (nq_h-X_{h})_+\right] \\& = \frac{1}{n^2} \sum _{h=1}^c Var \big [(nq_h-X_{h})_+\big ] \\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c Cov\big [(nq_h-X_{h})_+,(nq_l-X_{l})_+\big ] \\& = \frac{1}{n^2} \sum _{h=1}^c E\big [\big ((nq_h-X_{h})_+\big )^2\big ] -\frac{1}{n^2} \sum _{h=1}^c \big (E\big [(nq_h-X_{h})_+\big ]\big )^2 \\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-X_{h})_+ (nq_l-X_{l})_+\big ] \\&\quad -\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-X_{h})_+\big ]E\big [(nq_l-X_{l})_+\big ] \\& = \frac{1}{n^2} \sum _{h=1}^c E\big [\big ((nq_h-X_{h})_+\big )^2\big ] \\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-X_{h})_+ (nq_l-X_{l})_+\big ] \\&\quad - \left( \frac{1}{n} \sum _{h=1}^c E\big [(nq_h-X_{h})_+\big ]\right) ^2 . \end{aligned}$$
(36)

If the random vector \({\mathbf{X}}=(X_1,\ldots ,X_c)\) is multinomial distributed with parameters n and \({\mathbf{p}}\), we know that the h-th component \(X_h\) has marginal binomial distribution with parameters n and \(p_h\) while two components \(X_h\) and \(X_l\) have marginal joint multinomial distribution with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), therefore we can rewrite Eq. (36) as follows

$$\begin{aligned} Var[\hat{I}]& = \frac{1}{n^2} \sum _{h=1}^c E\big [\big ((nq_h-Z_{n,h})_+\big )^2\big ]\\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-U_{n,h})_+ (nq_l-U_{n,l})_+\big ]\\&\quad - \left( \frac{1}{n} \sum _{h=1}^c E\big [(nq_h-Z_{n,h})_+\big ]\right) ^2 \end{aligned}$$

where \(Z_{n,h} \sim {\mathcal{B}}(n,p_h)\) and \((U_{n,h},U_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})\), or equivalently as

$$\begin{aligned} Var[\hat{I}] = \frac{1}{n^2}\sum _{h=1}^c B(p_h,q_h)+\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c C(p_h,p_l,q_h,q_l)-\left[ \frac{1}{n}\sum _{h=1}^c A(p_h,q_h)\right] ^2 \end{aligned}$$

where

$$\begin{aligned} A(p_h,q_h) = E\big [ (nq_h -Z_{n,h})_+\big ], \end{aligned}$$
$$\begin{aligned} B(p_h,q_h)=E\big [ \big ((nq_h-Z_{n,h})_+\big )^2\big ] \end{aligned}$$

and

$$\begin{aligned} C(p_h,p_l,q_h,q_l) = E\big [(nq_h-U_{n,h})_+(nq_l-U_{n,l})_+ \big ]. \end{aligned}$$

Now combining the results from Lemmas 1, 2 and 3 we prove the Proposition. \(\square\)
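The variance formula of Proposition 2 can be checked against an exhaustive enumeration for a small example. The sketch below is an added illustration: it assumes three classes in the brute-force part, the estimator form \(\hat{I}=\sum _h \min \{X_h/n,q_h\}\), and \(n \ge 2\); all function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial, floor


def binom_pmf(z, n, p):
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(z, n, p):
    return sum((binom_pmf(k, n, p) for k in range(min(z, n) + 1)), Fraction(0))


def trinom_cdf(a, b, n, ph, pl):
    """P(U_h <= a, U_l <= b) for (U_h, U_l, rest) ~ M(n, (ph, pl, 1 - ph - pl))."""
    total = Fraction(0)
    for u in range(min(a, n) + 1):
        for v in range(min(b, n - u) + 1):
            coef = Fraction(factorial(n), factorial(u) * factorial(v) * factorial(n - u - v))
            total += coef * ph**u * pl**v * (1 - ph - pl) ** (n - u - v)
    return total


def var_pma(n, p, q):
    """Var[I_hat] from Proposition 2 (reference profile q fixed, n >= 2)."""
    c = len(p)
    k = [floor(n * qh) for qh in q]
    single = mean_term = cross = Fraction(0)
    for ph, qh, kh in zip(p, q, k):
        single += (1 - ph) ** 2 * qh**2 * binom_cdf(kh, n - 2, ph)
        single += ph**2 * (1 - qh) ** 2 * binom_cdf(kh - 2, n - 2, ph)
        single += (Fraction(1, n) * ph * (1 - ph) * (1 - 2 * n * qh * (1 - qh))
                   * binom_cdf(kh - 1, n - 2, ph))
        mean_term += (qh - ph) * binom_cdf(kh, n - 1, ph) + ph * (1 - qh) * binom_pmf(kh, n - 1, ph)
    for h in range(c - 1):
        for l in range(h + 1, c):
            cross += q[h] * q[l] * trinom_cdf(k[h], k[l], n, p[h], p[l])
            cross -= q[h] * p[l] * trinom_cdf(k[h], k[l] - 1, n - 1, p[h], p[l])
            cross -= p[h] * q[l] * trinom_cdf(k[h] - 1, k[l], n - 1, p[h], p[l])
            cross += Fraction(n - 1, n) * p[h] * p[l] * trinom_cdf(k[h] - 1, k[l] - 1, n - 2, p[h], p[l])
    return single + 2 * cross - mean_term**2


def var_pma_bruteforce(n, p, q):
    """Exact Var[I_hat] by enumerating all outcomes of a 3-class M(n, p)."""
    m1 = m2 = Fraction(0)
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            x = (x1, x2, n - x1 - x2)
            prob = Fraction(factorial(n))
            for xh, ph in zip(x, p):
                prob = prob * ph**xh / factorial(xh)
            i_hat = sum(min(Fraction(xh, n), qh) for xh, qh in zip(x, q))
            m1 += prob * i_hat
            m2 += prob * i_hat**2
    return m2 - m1**2


p = (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))
q = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))
assert var_pma(6, p, q) == var_pma_bruteforce(6, p, q)
```

The nested sums in `trinom_cdf` hint at why evaluating the joint probabilities becomes slow for larger n and c, which is the motivation given in the abstract for studying the variance by simulation.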

Lemma 4

Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\) and \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z'_{n,h}\) a binomial random variable with parameters n and \(p_h\) and by \(Z''_{m,h}\) a binomial random variable with parameters m and \(q_h\), that is \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\) and \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), respectively. Let us define the function \(D(p_h,q_h)\) as follows

$$\begin{aligned} D(p_h,q_h)=E\big [(nZ''_{m,h}-mZ'_{n,h})_+ \big ]. \end{aligned}$$

Then, for every \(h=1,\ldots ,c\) we have

$$\begin{aligned} D(p_h,q_h)& = nm(1-p_h)q_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big ) \\& \quad-nmp_h(1-q_h) P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big ). \end{aligned}$$

Proof

Let us introduce the indicator function \({\mathbb{I}}(A)\) of an event A, which equals 1 if the event A occurs and 0 otherwise. Then, the quantity \(D(p_h,q_h)\) can be represented as

$$\begin{aligned} D(p_h,q_h)& = E\big [(nZ''_{m,h}-mZ'_{n,h})_+ \big ] \\& = E\big [(nZ''_{m,h}-mZ'_{n,h}){\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] \\& = nE\big [Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] - mE\big [Z'_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]. \end{aligned}$$
(37)

Let us split Eq. (37) into two terms. Recalling that a binomial random variable \(Z''_{m,h}\) can be represented as the sum of m independent Bernoulli random variables \(Z''_{1,h,i}\), each distributed as \(Z''_{1,h}\), or equivalently as the sum of an independent Bernoulli variable \(Z''_{1,h}\) and a binomial variable \(Z''_{m-1,h}\), we have

$$\begin{aligned} nE\big [Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]& = n E \left[ \left( \sum _{i=1}^m Z''_{1,h,i} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\& = n \sum _{i=1}^m E \left[ Z''_{1,h,i}{\mathbb{I}}(nZ''_{m-1,h}+ nZ''_{1,h,i} \ge mZ'_{n,h}) \right] \end{aligned}$$

which, because the Bernoulli variables \(Z''_{1,h,i}\) are all distributed as \(Z''_{1,h}\), can be simplified in the notation. Hence

$$\begin{aligned} nE\big [Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]& = nm E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+nZ''_{1,h} \ge mZ'_{n,h}) \big ] \\& = nm E\big [{\mathbb{I}}(nZ''_{m-1,h}+n \ge mZ'_{n,h}) \big ]q_h \\& = nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n,h} \big )q_h \\& = nm(1-p_h)q_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big ) \\&\quad+nmp_hq_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big ), \end{aligned}$$
(38)

where we used the properties

$$\begin{aligned} E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+ & nZ''_{1,h} \ge mZ'_{n,h}) \big ]\\& = E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+nZ''_{1,h} \ge mZ'_{n,h} \vert Z''_{1,h}=0)]P(Z''_{1,h}=0) \\&\quad+ E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+nZ''_{1,h} \ge mZ'_{n,h} \vert Z''_{1,h}=1)]P(Z''_{1,h}=1) \\& = E\big [{\mathbb{I}}(nZ''_{m-1,h}+n \ge mZ'_{n,h})]q_h \\& = P\big (nZ''_{m-1,h}+n \ge mZ'_{n,h}\big )q_h \end{aligned}$$

and

$$\begin{aligned} P\big (nZ''_{m-1,h}+ & n \ge mZ'_{n,h}\big )\\& = P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+mZ'_{1,h} \vert Z'_{1,h}=0 \big )P(Z'_{1,h}=0)\\&\quad +P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+mZ'_{1,h} \vert Z'_{1,h}=1 \big )P(Z'_{1,h}=1)\\& = P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)\\&\quad +P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big )p_h. \end{aligned}$$

Similarly, we can also derive

$$\begin{aligned} mE\big [Z'_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]& = mnp_h(1-q_h) P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big ) \\&\quad+mnp_hq_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big ). \end{aligned}$$
(39)

Now, combining the results (38) and (39) into Eq. (37) we prove the Lemma, in fact

$$\begin{aligned} D(p_h,q_h)& = nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)q_h \\&\quad +nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big )p_hq_h \\&\quad -nm P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big )p_h(1-q_h) \\&\quad -nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big )p_hq_h \\& = nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)q_h \\&\quad -nm P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big )p_h(1-q_h). \end{aligned}$$
(40)

\(\square\)
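Lemma 4 can also be checked by exact enumeration over the two binomial variables. The sketch below is an added illustration that treats \(Z'\) and \(Z''\) as independent, as the conditioning steps of the proof do; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binom_pmf(z, n, p):
    """P(Z = z) for Z ~ B(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def prob(n1, p, m1, q, event):
    """P(event(Z', Z'')) for independent Z' ~ B(n1, p) and Z'' ~ B(m1, q)."""
    return sum(
        (binom_pmf(a, n1, p) * binom_pmf(b, m1, q)
         for a in range(n1 + 1) for b in range(m1 + 1) if event(a, b)),
        Fraction(0),
    )


def D_direct(n, m, p, q):
    """E[(n Z'' - m Z')_+] summed directly from the definition."""
    return sum(
        (max(n * b - m * a, 0) * binom_pmf(a, n, p) * binom_pmf(b, m, q)
         for a in range(n + 1) for b in range(m + 1)),
        Fraction(0),
    )


def D_closed(n, m, p, q):
    """Closed form stated in Lemma 4."""
    first = prob(n - 1, p, m - 1, q, lambda a, b: n * b + n >= m * a)
    second = prob(n - 1, p, m - 1, q, lambda a, b: n * b >= m * a + m)
    return n * m * (1 - p) * q * first - n * m * p * (1 - q) * second


p, q = Fraction(3, 10), Fraction(2, 5)
assert all(D_direct(n, m, p, q) == D_closed(n, m, p, q)
           for n in range(1, 7) for m in range(1, 7))
```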

Proposition 3

Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) and \(\mathbf {Y}=(Y_1,\ldots ,Y_c)\) are distributed as multinomial random vectors \({\mathcal{M}}(n,{\mathbf{p}})\) and \({\mathcal{M}}(m,{\mathbf{q}})\), respectively, the expected value of the estimator \(\hat{I}\) in Eq. (19) is given by

$$\begin{aligned} E \big [\hat{I}\big ]& = 1-\sum _{h=1}^c (1-p_h)q_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}\big )\\&\quad +\sum _{h=1}^c p_h(1-q_h)P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big ), \end{aligned}$$

where the \(Z'_{n-1,h}\) are binomial random variables with parameters \(n-1\) and \(p_h\) and the \(Z''_{m-1,h}\) are binomial random variables with parameters \(m-1\) and \(q_h\), that is \(Z'_{n-1,h} \sim {\mathcal{B}}(n-1,p_h)\) and \(Z''_{m-1,h} \sim {\mathcal{B}}(m-1,q_h)\), respectively.

Proof

From Eq. (19) we have

$$\begin{aligned} \hat{I}& = \frac{1}{nm}\sum _{h=1}^{c} \min \{mX_{h},nY_{h}\}\\& = \frac{1}{nm}\sum _{h=1}^c\big [nY_h-(nY_h-mX_{h})_+\big ]\\& = 1-\frac{1}{nm}\sum _{h=1}^c(nY_h-mX_{h})_+, \end{aligned}$$

hence its expectation can be represented as

$$\begin{aligned} E \big [ \hat{I} \big ]=1-\frac{1}{nm}\sum _{h=1}^c E \big [ (nY_h-mX_{h})_+ \big ]. \end{aligned}$$
(41)

Now, recalling that the h-th component of a random vector \({\mathbf{X}}\), multinomial distributed with parameters n and \({\mathbf{p}}\), has marginal binomial distribution with parameters n and \(p_h\), that is \(X_{h} \sim {\mathcal{B}}(n,p_h)\), and the h-th component of a random vector \(\mathbf {Y}\), multinomial distributed with parameters m and \({\mathbf{q}}\), has marginal binomial distribution with parameters m and \(q_h\), that is \(Y_{h} \sim {\mathcal{B}}(m,q_h)\), Eq. (41) can be rewritten as

$$\begin{aligned} E \big [ \hat{I} \big ]& = 1-\frac{1}{nm}\sum _{h=1}^c E \big [ (nZ''_{m,h}-mZ'_{n,h})_+ \big ] \\& = 1-\frac{1}{nm}\sum _{h=1}^c D(p_h,q_h). \end{aligned}$$
(42)

Therefore, the proof follows from combining Eq. (40) in Lemma 4 and Eq. (42). \(\square\)
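As a numerical sanity check, the closed form of Proposition 3 can be compared with a direct enumeration of \(E[(nY_h-mX_h)_+]\) over the marginal binomial distributions. The sketch below is one possible way to do this for small n and m; the helper names and the example profiles are assumptions made for illustration only.

```python
from itertools import product
from math import comb


def binom_pmf(k, size, prob):
    """P(Z = k) for Z ~ B(size, prob)."""
    return comb(size, k) * prob**k * (1 - prob) ** (size - k)


def prob_event(size_y, q, size_x, p, event):
    """P(event(z2, z1)) for independent z2 ~ B(size_y, q) and z1 ~ B(size_x, p)."""
    return sum(
        binom_pmf(z2, size_y, q) * binom_pmf(z1, size_x, p)
        for z2, z1 in product(range(size_y + 1), range(size_x + 1))
        if event(z2, z1)
    )


def expected_pma_formula(p, q, n, m):
    """E[I_hat] from the two probabilities per class in Proposition 3."""
    total = 1.0
    for ph, qh in zip(p, q):
        a = prob_event(m - 1, qh, n - 1, ph, lambda z2, z1: n * z2 + n >= m * z1)
        b = prob_event(m - 1, qh, n - 1, ph, lambda z2, z1: n * z2 >= m * z1 + m)
        total += -(1 - ph) * qh * a + ph * (1 - qh) * b
    return total


def expected_pma_direct(p, q, n, m):
    """E[I_hat] = 1 - (1/nm) * sum_h E[(n*Y_h - m*X_h)_+], by enumeration."""
    total = 1.0
    for ph, qh in zip(p, q):
        for y, x in product(range(m + 1), range(n + 1)):
            weight = binom_pmf(y, m, qh) * binom_pmf(x, n, ph)
            total -= weight * max(n * y - m * x, 0) / (n * m)
    return total


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical monitored profile
    q = [0.2, 0.3, 0.5]  # hypothetical reference profile
    print(expected_pma_formula(p, q, 6, 4), expected_pma_direct(p, q, 6, 4))
```

The two values should agree up to floating-point rounding. The enumeration costs on the order of nm operations per class, so this check is only meant for small sample sizes.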

Lemma 5

Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\) and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider one value \(\omega _h \in \Omega\) with the corresponding probabilities \(p_h,q_h \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(Z'_{n,h}\) a binomial random variable with parameters n and \(p_h\) and by \(Z''_{m,h}\) a binomial random variable with parameters m and \(q_h\), that is \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\) and \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), respectively. Let us define the function \(F(p_h,q_h)\) as follows

$$\begin{aligned} F(p_h,q_h)=E\big [\big (\big (nZ''_{m,h}-mZ'_{n,h}\big )_+ \big )^2 \big ]. \end{aligned}$$

Then, for every \(h=1,\ldots ,c\) we have

$$\begin{aligned} F(p_h,q_h)& = n^2m q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big ) \\&\quad +n^2m(m-1) q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ) \\&\quad +m^2n p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big ) \\&\quad +m^2n(n-1) p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big ) \\&\quad -2n^2m^2 p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m\right) . \end{aligned}$$

Proof

By using the indicator function \({\mathbb{I}}(A)\) of an event A, the quantity \(F(p_h,q_h)\) can be represented as

$$\begin{aligned} F(p_h,q_h)& = E\big [\big (\big (nZ''_{m,h}-mZ'_{n,h}\big )_+ \big )^2 \big ] \\& = E\big [(nZ''_{m,h}-mZ'_{n,h})^2{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] \\& = n^2E\big [Z''^2_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] + m^2E\big [Z'^2_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] \\&-2nm E\big [ Z'_{n,h}Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h})\big ]. \end{aligned}$$
(43)

Now, let us split Eq. (43) into three terms and recall that a binomial random variable \(Z''_{m,h}\) can be represented as the sum of m independent random variables \(Z''_{1,h,i}\), each distributed as a Bernoulli \(Z''_{1,h}\), or also as the sum of an independent Bernoulli \(Z''_{1,h}\) and a binomial \(Z''_{m-1,h}\). For the first term, we have

$$\begin{aligned}&n^2E \big [ Z''^2_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]=n^2E \left[ \left( \sum _{i=1}^m Z''_{1,h,i} \right) ^2{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad =n^2E \left[ \left( \sum _{i=1}^m Z''^2_{1,h,i} + \mathop {\sum \sum }_{i \ne j} Z''_{1,h,i} Z''_{1,h,j} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad =n^2E \left[ \left( \sum _{i=1}^m Z''^2_{1,h,i} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\qquad +n^2E \left[ \left( \mathop {\sum \sum }_{i \ne j} Z''_{1,h,i} Z''_{1,h,j} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad = n^2 \sum _{i=1}^m E \big [Z''^2_{1,h,i} {\mathbb{I}}(nZ''_{m-1,h} +nZ''_{1,h,i} \ge mZ'_{n,h}) \big ] \\&\qquad +n^2\mathop {\sum \sum }_{i \ne j}E \left[ Z''_{1,h,i} Z''_{1,h,j} {\mathbb{I}}(nZ''_{m-2,h} +nZ''_{1,h,i} +nZ''_{1,h,j} \ge mZ'_{n,h}) \right] \\&\quad = n^2m E \big [{\mathbb{I}}(nZ''_{m-1,h} + n \ge mZ'_{n,h}) \big ] q_h \\&\qquad +\,n^2m(m-1)E \big [ {\mathbb{I}}(nZ''_{m-2,h} +2n \ge mZ'_{n,h}) \big ] q_h^2 \\&\quad = n^2m q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big ) \\&\qquad +\,n^2m(m-1) q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ), \end{aligned}$$
(44)

in which, as in Lemma 4, we have used the independence and the identical distribution of the Bernoulli variables \(Z''_{1,h,i}\). Similarly, we can also derive the second term of Eq. (43)

$$\begin{aligned}&m^2E\big [Z'^2_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]= m^2n p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big ) \\&\quad +\,m^2n(n-1) p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big ), \end{aligned}$$
(45)

while for the last term, we have

$$\begin{aligned}&E\big [ Z'_{n,h}Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h})\big ] =E\left[ \left( \sum _{i=1}^n \sum _{j=1}^m Z'_{1,h,i}Z''_{1,h,j} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad =\sum _{i=1}^n \sum _{j=1}^m E\left[ Z'_{1,h,i}Z''_{1,h,j} {\mathbb{I}}(nZ''_{m-1,h} +nZ''_{1,h,j} \ge mZ'_{n-1,h}+mZ'_{1,h,i})\right] \\&\quad =nm E\left[ Z'_{1,h}Z''_{1,h} {\mathbb{I}}(nZ''_{m-1,h} +nZ''_{1,h} \ge mZ'_{n-1,h}+mZ'_{1,h}) \right] \\&\quad =nm E\left[ {\mathbb{I}}(nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m) \right] p_hq_h \\&\quad =nm p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m\right) . \end{aligned}$$
(46)
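The step shared by Eqs. (44), (45) and (46) is that isolating one Bernoulli summand turns \(E[Z\,g(Z)]\) for \(Z \sim {\mathcal{B}}(m,q)\) into \(mq\,E[g(Z^*+1)]\) with \(Z^* \sim {\mathcal{B}}(m-1,q)\). The short Python sketch below checks this identity numerically for indicator functions of the kind used in the proof; the function names and parameter values are illustrative assumptions.

```python
from math import comb


def binom_pmf(k, size, prob):
    """P(Z = k) for Z ~ B(size, prob)."""
    return comb(size, k) * prob**k * (1 - prob) ** (size - k)


def weighted_expectation(m, q, g):
    """E[Z * g(Z)] for Z ~ B(m, q), computed from the definition."""
    return sum(k * g(k) * binom_pmf(k, m, q) for k in range(m + 1))


def shifted_expectation(m, q, g):
    """m * q * E[g(Z* + 1)] for Z* ~ B(m - 1, q)."""
    return m * q * sum(g(k + 1) * binom_pmf(k, m - 1, q) for k in range(m))


if __name__ == "__main__":
    n, m, q, threshold = 5, 4, 0.35, 12  # hypothetical values
    g = lambda k: 1.0 if n * k >= threshold else 0.0  # indicator of {n*Z >= threshold}
    print(weighted_expectation(m, q, g), shifted_expectation(m, q, g))
```

Applying the same step twice gives the terms with \(m-2\) and \(n-2\) trials, and applying it once to each sample gives the cross term in Eq. (46).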

Plugging the results (44), (45) and (46) into Eq. (43) proves the Lemma. In fact

$$\begin{aligned} F(p_h,q_h)& = n^2m q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big ) \\&\quad +n^2m(m-1) q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ) \\&\quad +m^2n p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big ) \\&\quad +m^2n(n-1) p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big ) \\&\quad -2n^2m^2 p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m\right) . \end{aligned}$$
(47)

\(\square\)
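The five-term expression in Eq. (47) can be checked against the definition of \(F(p_h,q_h)\) by enumerating the two binomial distributions. The following Python sketch does this for small n and m (both at least 2, so that \(m-2\) and \(n-2\) are valid numbers of trials); the helper names and the example parameters are assumptions for illustration.

```python
from itertools import product
from math import comb


def binom_pmf(k, size, prob):
    """P(Z = k) for Z ~ B(size, prob)."""
    return comb(size, k) * prob**k * (1 - prob) ** (size - k)


def prob_event(size_y, q, size_x, p, event):
    """P(event(z2, z1)) for independent z2 ~ B(size_y, q) and z1 ~ B(size_x, p)."""
    return sum(
        binom_pmf(z2, size_y, q) * binom_pmf(z1, size_x, p)
        for z2, z1 in product(range(size_y + 1), range(size_x + 1))
        if event(z2, z1)
    )


def second_moment_direct(p, q, n, m):
    """F = E[((n*Z'' - m*Z')_+)^2] by enumeration, Z'' ~ B(m, q), Z' ~ B(n, p)."""
    return sum(
        binom_pmf(y, m, q) * binom_pmf(x, n, p) * max(n * y - m * x, 0) ** 2
        for y, x in product(range(m + 1), range(n + 1))
    )


def second_moment_formula(p, q, n, m):
    """F from the five probabilities in Eq. (47); requires n >= 2 and m >= 2."""
    t1 = prob_event(m - 1, q, n, p, lambda z2, z1: n * z2 + n >= m * z1)
    t2 = prob_event(m - 2, q, n, p, lambda z2, z1: n * z2 + 2 * n >= m * z1)
    t3 = prob_event(m, q, n - 1, p, lambda z2, z1: n * z2 >= m * z1 + m)
    t4 = prob_event(m, q, n - 2, p, lambda z2, z1: n * z2 >= m * z1 + 2 * m)
    t5 = prob_event(m - 1, q, n - 1, p, lambda z2, z1: n * z2 + n >= m * z1 + m)
    return (
        n**2 * m * q * t1
        + n**2 * m * (m - 1) * q**2 * t2
        + m**2 * n * p * t3
        + m**2 * n * (n - 1) * p**2 * t4
        - 2 * n**2 * m**2 * p * q * t5
    )


if __name__ == "__main__":
    print(second_moment_formula(0.3, 0.5, 5, 4), second_moment_direct(0.3, 0.5, 5, 4))
```

Both numbers should coincide up to rounding error.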

Lemma 6

Let \(\Omega\) be a finite set of categorical values \(\{\omega _1,\ldots ,\omega _c\}\) and let \({\mathbf{p}}=\{p_1,\ldots ,p_c\}\) and \({\mathbf{q}}=\{q_1,\ldots ,q_c\}\) be two probability distributions defined over \(\Omega\). Let us consider two values \(\omega _h,\omega _l \in \Omega\) with the corresponding probabilities \(p_h,p_l,q_h,q_l \in [0,1]\) in the profiles \({\mathbf{p}}\) and \({\mathbf{q}}\), respectively. Let us denote by \(U'_{n,h}\) and \(U'_{n,l}\) the components of a multinomial random vector \(\mathbf {U}'_n=\big (U'_{n,h},U'_{n,l},n-U'_{n,h}-U'_{n,l} \big )\) with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\), that is \(\mathbf {U}'_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})\), and by \(U''_{m,h}\) and \(U''_{m,l}\) the components of a multinomial random vector \(\mathbf {U}''_m=\big (U''_{m,h},U''_{m,l},m-U''_{m,h}-U''_{m,l} \big )\) with parameters m and \({\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)\), that is \(\mathbf {U}''_m \sim {\mathcal{M}}(m,{\mathbf{q}}_{hl})\). Let us define the function \(G(p_h,p_l,q_h,q_l)\) as follows

$$\begin{aligned} G(p_h,p_l,q_h,q_l)=E\big [\big (nU''_{m,h}-mU'_{n,h}\big )_+\big (nU''_{m,l}-mU'_{n,l}\big )_+ \big ]. \end{aligned}$$

Then, for every pair \(h,l \in \{1,\ldots ,c\}\) with \(h \ne l\) we have

$$\begin{aligned}&G(p_h,p_l,q_h,q_l)\\&\quad =n^2m(m-1)q_hq_l P\big (\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \big )\\&\qquad +\,m^2n(n-1)p_hp_l P\big (\{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \big )\\&\qquad -\,n^2m^2q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) \\&\qquad -\,m^2n^2p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\} \right) . \end{aligned}$$

Proof

By using the indicator function \({\mathbb{I}}(A)\) of an event A, the quantity \(G(p_h,p_l,q_h,q_l)\) can be represented as

$$\begin{aligned} G(p_h,p_l,q_h,q_l)& = E\big [\big (nU''_{m,h}-mU'_{n,h}\big )_+\big (nU''_{m,l}-mU'_{n,l}\big )_+ \big ] \\& = E\big [\big (nU''_{m,h}-mU'_{n,h}\big )\big (nU''_{m,l}-mU'_{n,l}\big ) \\&\quad \times {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\& = E\big [\big (n^2U''_{m,h}U''_{m,l}+m^2U'_{n,h}U'_{n,l}-nmU''_{m,h}U'_{n,l}-mnU'_{n,h}U''_{m,l}\big ) \\&\quad \times {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ]. \end{aligned}$$
(48)

Now, let us split Eq. (48) into four terms and recall that a multinomial random vector \(\mathbf {U}''_m=\big (U''_{m,h},U''_{m,l},m-U''_{m,h}-U''_{m,l} \big )\) can be represented as the sum of m independent random vectors \(\mathbf {U}''_{1,i}=\big (U''_{1,h,i},U''_{1,l,i},1-U''_{1,h,i}-U''_{1,l,i} \big )\) identically distributed as multinomial \(\mathbf {U}''_1=\big (U''_{1,h},U''_{1,l},1-U''_{1,h}-U''_{1,l} \big )\) or also as the sum of two independent multinomials \(\mathbf {U}''_{1}\) and \(\mathbf {U}''_{m-1}\) with the same probability parameter \({\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)\). Hence, for the first term, we have

$$\begin{aligned}&n^2E\big [U''_{m,h}U''_{m,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad =n^2E\left[ \left( \sum _{i=1}^m\sum _{j=1}^m U''_{1,h,i}U''_{1,l,j}\right) {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \right] \\&\quad = n^2\mathop {\sum _{i=1}^m\sum _{j=1}^m}_{i \ne j} E\big [U''_{1,h,i}U''_{1,l,j} \\&\qquad \times {\mathbb{I}}(\{nU''_{m-2,h} +nU''_{1,h,i} \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +nU''_{1,l,j} \ge mU'_{n,l}\}) \big ] \\&\quad = n^2m(m-1) E\big [{\mathbb{I}}(\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\}) \big ] q_hq_l \\&\quad = n^2m(m-1)q_hq_l P\big (\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \big ), \end{aligned}$$
(49)

where when \(i=j\), we use the property

$$\begin{aligned} E\left[ U''_{1,h,i}U''_{1,l,i} {\mathbb{I}}(\{nU''_{m-2,h} +nU''_{1,h,i} \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +nU''_{1,l,i} \ge mU'_{n,l}\}) \right] =0. \end{aligned}$$

In fact, the two components \(U''_{1,h,i}\) and \(U''_{1,l,i}\) of a multinomial \(\mathbf {U}''_{1,i}=\big (U''_{1,h,i},U''_{1,l,i},1-U''_{1,h,i}-U''_{1,l,i} \big )\) cannot both be equal to 1, since a single trial falls into exactly one class.

Similarly, the second term of Eq. (48) is given by

$$\begin{aligned}&m^2E\big [U'_{n,h}U'_{n,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad = m^2n(n-1)p_hp_l P\left( \{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \right) , \end{aligned}$$
(50)

while for the third term of Eq. (48) we have

$$\begin{aligned}&nm E\big [U''_{m,h}U'_{n,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad =nmE\left[ \left( \sum _{i=1}^m\sum _{j=1}^n U''_{1,h,i}U'_{1,l,j}\right) {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \right] \\&\quad = nm\sum _{i=1}^m\sum _{j=1}^n E\big [U''_{1,h,i}U'_{1,l,j} \\&\qquad \times {\mathbb{I}}(\{nU''_{m-1,h} +nU''_{1,h,i} \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+ mU'_{1,l,j} \}) \big ] \\&\quad = n^2m^2E\left[ {\mathbb{I}}(\{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\}) \right] q_hp_l \\&\quad = n^2m^2q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) . \end{aligned}$$
(51)

Similarly, the last term of Eq. (48) can be obtained as follows

$$\begin{aligned}&mnE\big [U'_{n,h}U''_{m,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad = m^2n^2p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\} \right) . \end{aligned}$$
(52)

Plugging the results (49), (50), (51) and (52) into Eq. (48) proves the Lemma. In fact

$$\begin{aligned}&G(p_h,p_l,q_h,q_l) \\&\quad =n^2m(m-1)q_hq_l P\left( \{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \right) \\&\qquad +\,m^2n(n-1)p_hp_l P\left( \{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \right) \\&\qquad -\,n^2m^2q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) \\&\qquad -\,m^2n^2p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\} \right) . \end{aligned}$$
(53)

\(\square\)
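Equation (53) can be checked in the same spirit as the previous lemmas by enumerating the two trinomial vectors \((U''_{m,h},U''_{m,l})\) and \((U'_{n,h},U'_{n,l})\). The Python sketch below compares the definition of G with the four joint probabilities of Lemma 6 for small n and m (both at least 2); all names and parameter values are illustrative assumptions.

```python
from math import factorial


def trinomial_pmf(a, b, size, pa, pb):
    """P(U_h = a, U_l = b) when (U_h, U_l, rest) ~ M(size, (pa, pb, 1 - pa - pb))."""
    rest = size - a - b
    coef = factorial(size) // (factorial(a) * factorial(b) * factorial(rest))
    return coef * pa**a * pb**b * (1 - pa - pb) ** rest


def support(size):
    """All (a, b) with a, b >= 0 and a + b <= size."""
    return [(a, b) for a in range(size + 1) for b in range(size + 1 - a)]


def expect(f, size_y, size_x, ph, pl, qh, ql):
    """E[f(yh, yl, xh, xl)] for independent U'' ~ M(size_y, q_hl), U' ~ M(size_x, p_hl)."""
    return sum(
        trinomial_pmf(yh, yl, size_y, qh, ql)
        * trinomial_pmf(xh, xl, size_x, ph, pl)
        * f(yh, yl, xh, xl)
        for yh, yl in support(size_y)
        for xh, xl in support(size_x)
    )


def cross_moment_direct(ph, pl, qh, ql, n, m):
    """G by enumerating E[(n*U''_h - m*U'_h)_+ (n*U''_l - m*U'_l)_+]."""
    def f(yh, yl, xh, xl):
        return max(n * yh - m * xh, 0) * max(n * yl - m * xl, 0)
    return expect(f, m, n, ph, pl, qh, ql)


def cross_moment_formula(ph, pl, qh, ql, n, m):
    """G from the four joint probabilities in Eq. (53); requires n >= 2 and m >= 2."""
    def prob(event, size_y, size_x):
        return expect(lambda *u: float(event(*u)), size_y, size_x, ph, pl, qh, ql)

    t1 = prob(lambda yh, yl, xh, xl: n * yh + n >= m * xh and n * yl + n >= m * xl, m - 2, n)
    t2 = prob(lambda yh, yl, xh, xl: n * yh >= m * xh + m and n * yl >= m * xl + m, m, n - 2)
    t3 = prob(lambda yh, yl, xh, xl: n * yh + n >= m * xh and n * yl >= m * xl + m, m - 1, n - 1)
    t4 = prob(lambda yh, yl, xh, xl: n * yh >= m * xh + m and n * yl + n >= m * xl, m - 1, n - 1)
    return (
        n**2 * m * (m - 1) * qh * ql * t1
        + m**2 * n * (n - 1) * ph * pl * t2
        - n**2 * m**2 * qh * pl * t3
        - m**2 * n**2 * ph * ql * t4
    )


if __name__ == "__main__":
    args = (0.2, 0.3, 0.4, 0.1, 4, 3)  # hypothetical p_h, p_l, q_h, q_l, n, m
    print(cross_moment_formula(*args), cross_moment_direct(*args))
```

The enumeration grows roughly like \(n^2m^2\), which illustrates why evaluating these joint probabilities becomes slow for realistic sample sizes.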

Proposition 4

Under the assumption that the sampling frequencies \({\mathbf{X}}=(X_1,\ldots ,X_c)\) and \(\mathbf {Y}=(Y_1,\ldots ,Y_c)\) are distributed as multinomial random vectors \({\mathcal{M}}(n,{\mathbf{p}})\) and \({\mathcal{M}}(m,{\mathbf{q}})\), respectively, the variance of the estimator \(\hat{I}\) in Eq. (19) is given by

$$\begin{aligned} Var \big [\hat{I}\big ]& = \sum _{h=1}^c \bigg \{ \frac{1}{m} q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big )+\frac{m-1}{m} q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ) \\&+\frac{1}{n} p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big )+\frac{n-1}{n} p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big )\\&-2 p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m \right) \bigg \}\\&+2\sum _{h=1}^{c-1}\sum _{l=h+1}^c \bigg \{ \frac{m-1}{m}q_hq_l P\big (\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \big ) \\&+\frac{n-1}{n}p_hp_l P\big (\{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \big )\\&-q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) \\&-p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\}\right) \bigg \} \\&-\left( \sum _{h=1}^c \left\{ P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)q_h -P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big )p_h(1-q_h)\right\} \right) ^2, \end{aligned}$$

where \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\), \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), \((U'_{n,h},U'_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})\) and \((U''_{m,h},U''_{m,l}) \sim {\mathcal{M}}(m, {\mathbf{q}}_{hl})\).

Proof

First, we recall the following representation

$$\begin{aligned} Var[\hat{I}]& = \frac{1}{n^2m^2} Var\left[ \sum _{h=1}^c (nY_h-mX_{h})_+\right] \\& = \frac{1}{n^2m^2} \sum _{h=1}^c Var\big [(nY_h-mX_{h})_+\big ] \\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c Cov\big [(nY_h-mX_{h})_+,(nY_l-mX_{l})_+\big ] \\& = \frac{1}{n^2m^2} \sum _{h=1}^c E\big [\big ((nY_h-mX_{h})_+\big )^2\big ] -\frac{1}{n^2m^2} \sum _{h=1}^c \big (E\big [(nY_h-mX_{h})_+\big ]\big )^2 \\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nY_h-mX_{h})_+ (nY_l-mX_{l})_+\big ] \\&\quad -\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nY_h-mX_{h})_+\big ]E\big [(nY_l-mX_{l})_+\big ] \\& = \frac{1}{n^2m^2} \sum _{h=1}^c E\big [\big ((nY_h-mX_{h})_+\big )^2\big ] \\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nY_h-mX_{h})_+ (nY_l-mX_{l})_+\big ] \\&\quad - \left( \frac{1}{nm} \sum _{h=1}^c E\big [(nY_h-mX_{h})_+\big ]\right) ^2 . \end{aligned}$$
(54)

If a random vector \({\mathbf{X}}=(X_1,\ldots ,X_c)\) is multinomial distributed with parameters n and \({\mathbf{p}}\), the h-th component \(X_h\) has marginal binomial distribution with parameters n and \(p_h\), while two components \(X_h\) and \(X_l\) have marginal joint multinomial distribution with parameters n and \({\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)\). Likewise, if a random vector \(\mathbf {Y}=(Y_1,\ldots ,Y_c)\) is multinomial distributed with parameters m and \({\mathbf{q}}\), the h-th component \(Y_h\) has marginal binomial distribution with parameters m and \(q_h\), while two components \(Y_h\) and \(Y_l\) have marginal joint multinomial distribution with parameters m and \({\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)\). Therefore we can rewrite Eq. (54) as follows

$$\begin{aligned} Var [\hat{I}]&\, =\, \frac{1}{n^2m^2} \sum _{h=1}^c E\big [\big ((nZ''_{m,h}-mZ'_{n,h})_+\big )^2\big ]\\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nU''_{m,h}-mU'_{n,h})_+ (nU''_{m,l}-mU'_{n,l})_+\big ]\\&\quad - \left( \frac{1}{nm} \sum _{h=1}^c E\big [(nZ''_{m,h}-mZ'_{n,h})_+\big ]\right) ^2 \end{aligned}$$

where \(Z'_{n,h} \sim {\mathcal{B}}(n,p_h)\), \(Z''_{m,h} \sim {\mathcal{B}}(m,q_h)\), \((U'_{n,h},U'_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})\) and \((U''_{m,h},U''_{m,l}) \sim {\mathcal{M}}(m, {\mathbf{q}}_{hl})\), or equivalently as

$$\begin{aligned} Var [\hat{I}] = \frac{1}{n^2m^2}\sum _{h=1}^c F(p_h,q_h)+\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c G(p_h,p_l,q_h,q_l)-\left[ \frac{1}{nm}\sum _{h=1}^c D(p_h,q_h)\right] ^2 \end{aligned}$$

where

$$\begin{aligned} D(p_h,q_h) = E\big [ (nZ''_{m,h} -mZ'_{n,h})_+\big ], \end{aligned}$$
$$\begin{aligned} F(p_h,q_h)=E\big [ \big ((nZ''_{m,h}-mZ'_{n,h})_+\big )^2\big ] \end{aligned}$$

and

$$\begin{aligned} G(p_h,p_l,q_h,q_l) = E\big [(nU''_{m,h}-mU'_{n,h})_+(nU''_{m,l}-mU'_{n,l})_+ \big ]. \end{aligned}$$

Now combining the results from Lemmas 4, 5 and 6 we prove the Proposition. \(\square\)
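Because the joint probabilities in Proposition 4 are expensive to evaluate, a simulation is a natural way to inspect \(Var[\hat{I}]\). The sketch below is a minimal Monte Carlo estimate in plain Python, compared with an exact enumeration of the definition of the variance for a small example. It is not the simulation design used in the paper; the function names, seed, number of draws and profiles are all assumptions chosen for illustration.

```python
import random
from math import factorial


def pma_hat(x, y):
    """I_hat = (1/nm) * sum_h min(m*X_h, n*Y_h)."""
    n, m = sum(x), sum(y)
    return sum(min(m * xh, n * yh) for xh, yh in zip(x, y)) / (n * m)


def sample_counts(size, probs, rng):
    """One multinomial draw, returned as a list of class counts."""
    counts = [0] * len(probs)
    for h in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[h] += 1
    return counts


def simulated_variance(p, q, n, m, draws, seed=0):
    """Sample variance of I_hat over independent pairs of multinomial draws."""
    rng = random.Random(seed)
    vals = [pma_hat(sample_counts(n, p, rng), sample_counts(m, q, rng)) for _ in range(draws)]
    mean = sum(vals) / draws
    return sum((v - mean) ** 2 for v in vals) / (draws - 1)


def multinomial_support(size, c):
    """Yield all count vectors of length c that sum to size."""
    if c == 1:
        yield (size,)
        return
    for k in range(size + 1):
        for tail in multinomial_support(size - k, c - 1):
            yield (k,) + tail


def multinomial_pmf(counts, probs):
    """Probability of a count vector under M(sum(counts), probs)."""
    coef = factorial(sum(counts))
    weight = 1.0
    for k, pr in zip(counts, probs):
        coef //= factorial(k)
        weight *= pr**k
    return coef * weight


def exact_variance(p, q, n, m):
    """Var[I_hat] = E[I_hat^2] - E[I_hat]^2 by full enumeration (small n, m, c only)."""
    mean = second = 0.0
    for x in multinomial_support(n, len(p)):
        wx = multinomial_pmf(x, p)
        for y in multinomial_support(m, len(q)):
            w = wx * multinomial_pmf(y, q)
            v = pma_hat(x, y)
            mean += w * v
            second += w * v * v
    return second - mean**2


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical monitored profile
    q = [0.2, 0.3, 0.5]  # hypothetical reference profile
    print(simulated_variance(p, q, 6, 5, draws=20000, seed=1), exact_variance(p, q, 6, 5))
```

The exact enumeration is feasible only for very small n, m and c; for larger problems only the simulated value is practical, and its accuracy depends on the number of draws.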

About this article

Cite this article

Ärje, J., Choi, KP., Divino, F. et al. Understanding the statistical properties of the percent model affinity index can improve biomonitoring related decision making. Stoch Environ Res Risk Assess 30, 1981–2008 (2016). https://doi.org/10.1007/s00477-015-1202-6

  • DOI: https://doi.org/10.1007/s00477-015-1202-6
