Understanding the statistical properties of the percent model affinity index can improve biomonitoring related decision making

Ärje, Johanna; Choi, Kwok-Pui; Divino, Fabio; Meissner, Kristian; Kärkkäinen, Salme

doi:10.1007/s00477-015-1202-6

Original Paper
Published: 16 January 2016

Stochastic Environmental Research and Risk Assessment, Volume 30, pages 1981–2008 (2016)

Abstract

The percent model affinity (PMA) index is used to measure the similarity of two probability profiles representing, for example, an ideal profile (i.e. reference condition) and a monitored profile (i.e. possibly impacted condition). The goal of this work is to study the effects of sample size, evenness, true value of the index and number of classes on the statistical properties of the estimator of the PMA index. We derive and extend previous formulas of the expectation and variance of the estimator for estimated monitored profile and fixed reference profile. Using the obtained extension, we find that the estimator is asymptotically unbiased, converging faster when the profiles differ. When both profiles are estimated, we calculate the expectation using transformation rules for expectation and in addition derive the formula for the estimator’s variance. Since the computation of the probabilities in the variance formula is slow, we study the behavior of the variance with simulation experiments and assess whether it could be approximated with the variance for the fixed reference profile. Finally, we provide a set of recommendations for the users of the PMA index to avoid the most common caveats of the index.

Acknowledgments

We thank the Ellen and Artturi Nyyssönen foundation for the grant awarded to Ärje, and the Academy of Finland for its support (projects 289076 (SK) and 289104 (KM)). KPC acknowledges the support of Singapore Ministry of Education Academic Research Fund R-155-000-147-112. We kindly thank Jukka Aroviita for insights into the Finnish adaptation of the PMA in the WFD context and Antti Penttinen and Jukka Nyblom for helpful conversations.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Jyvaskyla, Jyväskylä, Finland
Johanna Ärje & Salme Kärkkäinen
Department of Statistics and Applied Probability, and Department of Mathematics, National University of Singapore, Singapore, Singapore
Kwok-Pui Choi
Division of Physics, Computer Science and Mathematics, University of Molise, Pesche, Italy
Fabio Divino
Freshwater Centre, Finnish Environment Institute, SYKE, Jyväskylä Office, Jyväskylä, Finland
Kristian Meissner

Corresponding author

Correspondence to Johanna Ärje.

Appendix

Lemma 1

Let $\Omega$ be a finite set of categorical values $\{\omega _1,\ldots ,\omega _c\}$ and ${\mathbf{p}}=\{p_1,\ldots ,p_c\}$ and ${\mathbf{q}}=\{q_1,\ldots ,q_c\}$ two probability distributions defined over $\Omega$. Let us consider one value $\omega _h \in \Omega$ with the corresponding probabilities $p_h,q_h \in [0,1]$ in the profiles ${\mathbf{p}}$ and ${\mathbf{q}}$, respectively. Let us denote by $Z_{n,h}$ a binomial random variable with parameters n and $p_h$, that is $Z_{n,h} \sim {\mathcal{B}}(n,p_h)$, and define the function $A(p_h,q_h)$ as follows

$$\begin{aligned} A(p_h,q_h)=E\big [(nq_h-Z_{n,h})_+ \big ]. \end{aligned}$$

Then, for every $h=1,\ldots ,c$ we have

$$\begin{aligned} A(p_h,q_h)=n (q_h-p_h) P(Z_{n-1,h} \le \lfloor nq_h\rfloor )+ np_h(1-q_h) P(Z_{n-1,h}=\lfloor nq_h\rfloor ). \end{aligned}$$

Proof

By definition we have

$$\begin{aligned} A(p_h,q_h)& = E\big [(nq_h-Z_{n,h})_+ \big ] \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } (n q_{h}-z)P(Z_{n,h}=z) \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-\sum _{z=0}^{\lfloor nq_h \rfloor } z P(Z_{n,h}=z) \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-\sum _{z=1}^{\lfloor nq_h \rfloor } z \dfrac{n!}{z!(n-z)!}p_h^{z}(1-p_h)^{n-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-\sum _{z=1}^{\lfloor nq_h \rfloor } z \dfrac{n}{z} \dfrac{(n-1)!}{(z-1)!(n-z)!}p_hp_h^{z-1}(1-p_h)^{n-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-np_h \sum _{z=1}^{\lfloor nq_h \rfloor } \dfrac{(n-1)!}{(z-1)!(n-z)!}p_h^{z-1}(1-p_h)^{n-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )-np_h \sum _{z=0}^{\lfloor nq_h \rfloor -1} \dfrac{(n-1)!}{z!(n-1-z)!}p_h^{z}(1-p_h)^{n-1-z} \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )- np_h \sum _{z=0}^{\lfloor nq_h\rfloor -1} P(Z_{n-1,h}=z) \\& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )- np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) . \end{aligned}$$

(24)

Recalling that a binomial random variable $Z_{n,h}$ can be represented as the sum of two independent binomials $Z_{n-1,h}$ and $Z_{1,h}$, we have

$$\begin{aligned} P(Z_{n,h}\le \lfloor nq_h\rfloor )& = P(Z_{n-1,h}+Z_{1,h}\le \lfloor nq_h\rfloor )\\& = P(Z_{n-1,h}+Z_{1,h}\le \lfloor nq_h\rfloor \vert Z_{1,h}=0)P(Z_{1,h}=0)\\&\quad +P(Z_{n-1,h}+Z_{1,h}\le \lfloor nq_h\rfloor \vert Z_{1,h}=1)P(Z_{1,h}=1)\\& = P(Z_{n-1,h}\le \lfloor nq_h\rfloor )(1-p_h)+P(Z_{n-1,h}\le \lfloor nq_h\rfloor -1)p_h \end{aligned}$$

and we can reformulate Eq. (24) as follows

$$\begin{aligned} A(p_h,q_h)& = nq_h P(Z_{n,h} \le \lfloor nq_h\rfloor )- np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) \\& = n(1-p_h)q_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) \\&\quad +np_h q_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) \\&\quad -np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1) \\& = nq_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor )-np_hq_hP(Z_{n-1,h} \le \lfloor nq_h\rfloor ) \\&\quad +np_hq_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor )- np_hq_h P(Z_{n-1,h} = \lfloor nq_h\rfloor ) \\&\quad -np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) + np_h P(Z_{n-1,h} = \lfloor nq_h\rfloor ) \\& = n(q_h-p_h) P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) \\&\quad +np_h(1-q_h) P(Z_{n-1,h}=\lfloor nq_h \rfloor ), \end{aligned}$$

(25)

proving the Lemma. $\square$
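Lemma 1 can be checked numerically by comparing the defining sum with the closed form. The following is a minimal sketch, not the authors' code: the function names (`binom_pmf`, `binom_cdf`, `a_direct`, `a_closed`) are hypothetical, and exact rational arithmetic via `fractions.Fraction` is an assumption made here so that both sides can be compared without rounding error (it also keeps $\lfloor nq_h\rfloor$ exact).

```python
from fractions import Fraction
from math import comb, floor


def binom_pmf(n, p, z):
    """P(Z = z) for Z ~ Binomial(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(n, p, k):
    """P(Z <= k) for Z ~ Binomial(n, p); an empty sum (k < 0) gives zero."""
    return sum((binom_pmf(n, p, z) for z in range(min(k, n) + 1)), Fraction(0))


def a_direct(n, p, q):
    """A(p, q) = E[(n*q - Z)_+] with Z ~ Binomial(n, p), from the definition."""
    return sum((max(n * q - z, 0) * binom_pmf(n, p, z) for z in range(n + 1)),
               Fraction(0))


def a_closed(n, p, q):
    """Right-hand side of Lemma 1, written with Z ~ Binomial(n - 1, p)."""
    k = floor(n * q)
    return (n * (q - p) * binom_cdf(n - 1, p, k)
            + n * p * (1 - q) * binom_pmf(n - 1, p, k))


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    n, p, q = 10, Fraction(1, 4), Fraction(2, 5)
    print(a_direct(n, p, q), a_closed(n, p, q))
```

Passing `p` and `q` as `Fraction` objects is required for the comparison to be exact; with floats the two sides would agree only up to rounding.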

Proposition 1

Under the assumption that the sampling frequencies ${\mathbf{X}}=(X_1,\ldots ,X_c)$ are distributed as a multinomial random vector ${\mathcal{M}}(n,{\mathbf{p}})$, the expected value of the estimator $\hat{I}$ in Eq. (3) is given by

$$\begin{aligned} E \big [\hat{I}\big ]= 1-\sum _{h=1}^c (q_h-p_h)P(Z_{n-1,h}\le \lfloor nq_h\rfloor ) -\sum _{h=1}^c p_h(1-q_h)P(Z_{n-1,h}=\lfloor nq_h\rfloor ), \end{aligned}$$

where the $Z_{n-1,h}$ for $h=1,\ldots ,c$ are independent binomial random variables, each one with parameters $n-1$ and $p_h$, that is $Z_{n-1,h} \sim {\mathcal{B}}(n-1,p_h)$.

Proof

From Eq. (5) the expectation of $\hat{I}$ can be represented as

$$\begin{aligned} E \big [\hat{I}\big ]= 1-\frac{1}{n}\sum _{h=1}^c E\big [(nq_h-X_{h})_+ \big ]. \end{aligned}$$

Recalling that the h-th component of a random vector ${\mathbf{X}}$, multinomial distributed with parameters n and ${\mathbf{p}}$, has marginal binomial distribution with parameters n and $p_h$, that is $X_{h} \sim {\mathcal{B}}(n,p_h)$, that expectation can be rewritten as

$$\begin{aligned} E \big [\hat{I}\big ]& = 1-\frac{1}{n}\sum _{h=1}^c E\big [(nq_h-Z_{n,h})_+ \big ] \\& = 1-\frac{1}{n}\sum _{h=1}^c A(p_h,q_h). \end{aligned}$$

(26)

Therefore, the proof follows from combining Eq. (25) in Lemma 1 and Eq. (26). $\square$
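As a sanity check of Proposition 1, the expectation formula can be compared with a brute-force sum over all multinomial outcomes for a small sample size. This is a hedged sketch under stated assumptions: the function names are hypothetical, the estimator is taken in the form $\hat{I}=1-\frac{1}{n}\sum_h (nq_h-X_h)_+$ implied by the representation used in the proof, and the true index is assumed to be $I=\sum_h \min(p_h,q_h)$, its population analogue.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial, floor


def binom_pmf(n, p, z):
    """P(Z = z) for Z ~ Binomial(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(n, p, k):
    """P(Z <= k) for Z ~ Binomial(n, p)."""
    return sum((binom_pmf(n, p, z) for z in range(min(k, n) + 1)), Fraction(0))


def multinomial_pmf(x, p):
    """P(X = x) for X ~ Multinomial(sum(x), p)."""
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)  # each division is exact
    prob = Fraction(coef)
    for xi, pi in zip(x, p):
        prob *= pi**xi
    return prob


def pma(p, q):
    """Assumed true index: sum of class-wise minima of the two profiles."""
    return sum((min(ph, qh) for ph, qh in zip(p, q)), Fraction(0))


def expected_formula(n, p, q):
    """E[I_hat] from Proposition 1 (reference profile q fixed)."""
    total = Fraction(1)
    for ph, qh in zip(p, q):
        k = floor(n * qh)
        total -= (qh - ph) * binom_cdf(n - 1, ph, k)
        total -= ph * (1 - qh) * binom_pmf(n - 1, ph, k)
    return total


def expected_bruteforce(n, p, q):
    """E[I_hat] by enumerating every count vector x with sum(x) = n."""
    total = Fraction(0)
    for x in product(range(n + 1), repeat=len(p)):
        if sum(x) != n:
            continue
        shortfall = sum((max(n * qh - xh, 0) for xh, qh in zip(x, q)), Fraction(0))
        total += (1 - shortfall / n) * multinomial_pmf(x, p)
    return total


if __name__ == "__main__":
    # Illustrative profiles only (not taken from the paper).
    q = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))
    for p in (q, (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))):
        for n in (10, 40, 160):
            bias = expected_formula(n, p, q) - pma(p, q)
            print(n, float(bias))
```

The printed bias is never positive: each summand $\min(X_h/n, q_h)$ is a concave function of $X_h$, so Jensen's inequality bounds its expectation by $\min(p_h,q_h)$.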

Lemma 2

Let $\Omega$ be a finite set of categorical values $\{\omega _1,\ldots ,\omega _c\}$ and ${\mathbf{p}}=\{p_1,\ldots ,p_c\}$ and ${\mathbf{q}}=\{q_1,\ldots ,q_c\}$ two probability distributions defined over $\Omega$. Let us consider one value $\omega _h \in \Omega$ with the corresponding probabilities $p_h,q_h \in [0,1]$ in the profiles ${\mathbf{p}}$ and ${\mathbf{q}}$, respectively. Let us denote by $Z_{n,h}$ a binomial random variable with parameters n and $p_h$, that is $Z_{n,h} \sim {\mathcal{B}}(n,p_h)$, and define the function $B(p_h,q_h)$ as follows

$$\begin{aligned} B(p_h,q_h)=E\big [ \big ( (nq_h-Z_{n,h})_+ \big )^2 \big ]. \end{aligned}$$

Then, for every $h=1,\ldots ,c$ we have

$$\begin{aligned} B(p_h,q_h)& = n^2(1-p_h)^2q_h^2P(Z_{n-2,h}\le \lfloor nq_h \rfloor )\\&+np_h(1-p_h)\big [1-2nq_h(1-q_h)\big ]P(Z_{n-2,h}\le \lfloor nq_h \rfloor -1)\\&+n^2p_h^2(1-q_h)^2 P(Z_{n-2,h} \le \lfloor nq_h \rfloor -2). \end{aligned}$$

Proof

By definition we have

$$\begin{aligned} B(p_h,q_h)& = E\big [\big ((nq_h-Z_{n,h})_+ \big )^2 \big ] \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } (n q_{h}-z)^2 P(Z_{n,h}=z) \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } \big (n^2q_{h}^2 -2nq_{h}z + z^2 \big ) P(Z_{n,h}=z) \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } \big (n^2q_{h}^2 -2nq_{h}z + z^2+z-z \big ) P(Z_{n,h}=z) \\& = \sum _{z=0}^{\lfloor nq_h\rfloor } \big [n^2q_{h}^2 + (1-2nq_{h})z + z(z-1) \big ] P(Z_{n,h}=z). \end{aligned}$$

(27)

Now, let us split Eq. (27) into three terms. First, we have

$$\begin{aligned} \sum _{z=0}^{\lfloor nq_h\rfloor } n^2q_{h}^2 P(Z_{n,h}=z)=n^2q_{h}^2 P(Z_{n,h} \le \lfloor nq_h\rfloor ). \end{aligned}$$

(28)

Second,

$$\begin{aligned} \sum _{z=0}^{\lfloor nq_h\rfloor } (1-2nq_{h})z P(Z_{n,h}=z)& = \sum _{z=1}^{\lfloor nq_h\rfloor } (1-2nq_{h})z P(Z_{n,h}=z) \\& = (1-2nq_{h})\sum _{z=1}^{\lfloor nq_h\rfloor }z \dfrac{n!}{z!(n-z)!}p_h^z(1-p_h)^{n-z} \\& = (1-2nq_{h})\sum _{z=1}^{\lfloor nq_h\rfloor }z \dfrac{n}{z} \dfrac{(n-1)!}{(z-1)!(n-z)!}p_hp_h^{z-1}(1-p_h)^{n-z} \\& = np_h(1-2nq_{h}) \sum _{z=1}^{\lfloor nq_h\rfloor }\dfrac{(n-1)!}{(z-1)!(n-z)!}p_h^{z-1}(1-p_h)^{n-z} \\& = np_h(1-2nq_{h}) \sum _{z=0}^{\lfloor nq_h\rfloor -1}\dfrac{(n-1)!}{z!(n-z-1)!}p_h^{z}(1-p_h)^{n-z-1} \\& = np_h(1-2nq_{h}) P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1). \end{aligned}$$

(29)

Similarly, we also have

$$\begin{aligned} \sum _{z=0}^{\lfloor nq_h\rfloor } z(z-1) P(Z_{n,h}=z)& = \sum _{z=2}^{\lfloor nq_h\rfloor } z(z-1) P(Z_{n,h}=z) \\& = \sum _{z=2}^{\lfloor nq_h\rfloor } z(z-1) \dfrac{n!}{z!(n-z)!}p_h^z(1-p_h)^{n-z} \\& = \sum _{z=2}^{\lfloor nq_h\rfloor } z(z-1) \dfrac{n(n-1)}{z(z-1)} \dfrac{(n-2)!}{(z-2)!(n-z)!}p_h^2p_h^{z-2}(1-p_h)^{n-z} \\& = n(n-1)p_h^2\sum _{z=2}^{\lfloor nq_h\rfloor } \dfrac{(n-2)!}{(z-2)!(n-z)!}p_h^{z-2}(1-p_h)^{n-z} \\& = n(n-1)p_h^2\sum _{z=0}^{\lfloor nq_h\rfloor -2} \dfrac{(n-2)!}{z!(n-2-z)!}p_h^{z}(1-p_h)^{n-2-z} \\& = n(n-1)p_h^2 P(Z_{n-2,h} \le \lfloor nq_h\rfloor -2). \end{aligned}$$

(30)

Combining results (28), (29) and (30), and recalling that a binomial random variable can be represented as a sum of independent components, for instance $Z_{n,h}=Z_{n-1,h}+Z_{1,h}$ and $Z_{n-1,h}=Z_{n-2,h}+Z^*_{1,h}$, we prove the Lemma. Indeed,

$$\begin{aligned} B(p_h,q_h)& = n^2q_{h}^2 P(Z_{n,h} \le \lfloor nq_h\rfloor )\\&+(1-2nq_{h}) np_h P(Z_{n-1,h} \le \lfloor nq_h\rfloor -1)\\&+n(n-1)p_h^2 P(Z_{n-2,h} \le \lfloor nq_h\rfloor -2)\\& = n^2(1-p_h)^2q_h^2 P(Z_{n-2,h}\le \lfloor nq_h \rfloor )\\&+np_h(1-p_h)\big [1-2nq_h(1-q_h)\big ]P(Z_{n-2,h}\le \lfloor nq_h \rfloor -1)\\&+n^2p_h^2(1-q_h)^2 P(Z_{n-2,h} \le \lfloor nq_h \rfloor -2), \end{aligned}$$

where we have used iteratively the following property

$$\begin{aligned} P(Z_{n,h} \le z)& = P(Z_{n-1,h}+Z_{1,h} \le z)\\& = P(Z_{n-1,h} + Z_{1,h} \le z \vert Z_{1,h}=0)P(Z_{1,h}=0)\\&+P(Z_{n-1,h} + Z_{1,h} \le z \vert Z_{1,h}=1)P(Z_{1,h}=1)\\& = P(Z_{n-1,h} \le z)(1-p_h)+P(Z_{n-1,h} \le z-1)p_h. \end{aligned}$$

$\square$
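The second-moment identity of Lemma 2 can be verified in the same spirit by evaluating both sides exactly. This is a minimal sketch with hypothetical function names, assuming $n \ge 2$ (so that $Z_{n-2,h}$ is defined) and rational inputs handled with `fractions.Fraction`.

```python
from fractions import Fraction
from math import comb, floor


def binom_pmf(n, p, z):
    """P(Z = z) for Z ~ Binomial(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def binom_cdf(n, p, k):
    """P(Z <= k) for Z ~ Binomial(n, p); zero when k < 0."""
    return sum((binom_pmf(n, p, z) for z in range(min(k, n) + 1)), Fraction(0))


def b_direct(n, p, q):
    """B(p, q) = E[((n*q - Z)_+)^2] with Z ~ Binomial(n, p), from the definition."""
    return sum((max(n * q - z, 0) ** 2 * binom_pmf(n, p, z) for z in range(n + 1)),
               Fraction(0))


def b_closed(n, p, q):
    """Right-hand side of Lemma 2, written with Z ~ Binomial(n - 2, p)."""
    k = floor(n * q)
    return (n**2 * (1 - p) ** 2 * q**2 * binom_cdf(n - 2, p, k)
            + n * p * (1 - p) * (1 - 2 * n * q * (1 - q)) * binom_cdf(n - 2, p, k - 1)
            + n**2 * p**2 * (1 - q) ** 2 * binom_cdf(n - 2, p, k - 2))


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    n, p, q = 8, Fraction(1, 5), Fraction(3, 8)
    print(b_direct(n, p, q), b_closed(n, p, q))
```

Note that the middle term uses $\lfloor nq_h\rfloor - 1$ and the last term $\lfloor nq_h\rfloor - 2$; both cumulative probabilities are zero when the bound is negative.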

Lemma 3

Let $\Omega$ be a finite set of categorical values $\{\omega _1,\ldots ,\omega _c\}$ and ${\mathbf{p}}=\{p_1,\ldots ,p_c\}$ and ${\mathbf{q}}=\{q_1,\ldots ,q_c\}$ two probability distributions defined over $\Omega$. Let us consider two values $\omega _h,\omega _l \in \Omega$ with the corresponding probabilities $p_h,p_l,q_h,q_l \in [0,1]$ in the profiles ${\mathbf{p}}$ and ${\mathbf{q}}$, respectively. Let us denote by $U_{n,h}$ and $U_{n,l}$ the components of a multinomial random vector $\mathbf {U}_n=\big (U_{n,h},U_{n,l},n-U_{n,h}-U_{n,l} \big )$ with parameters n and ${\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)$, that is $\mathbf {U}_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})$, and define the function $C(p_h,p_l,q_h,q_l)$ as follows

$$\begin{aligned} C(p_h,p_l,q_h,q_l)=E\big [(nq_h-U_{n,h})_+(nq_l-U_{n,l})_+ \big ]. \end{aligned}$$

Then, for every pair $h \ne l$ with $h,l=1,\ldots ,c$ we have

$$\begin{aligned} C(p_h,p_l,q_h,q_l)& = n^2q_h q_l P(U_{n,h} \le \lfloor nq_h \rfloor , U_{n,l} \le \lfloor nq_l \rfloor )\\&-n^2q_h p_l P(U_{n-1,h} \le \lfloor nq_h \rfloor , U_{n-1,l} \le \lfloor nq_l \rfloor -1)\\&-n^2p_h q_l P(U_{n-1,h} \le \lfloor nq_h \rfloor -1, U_{n-1,l} \le \lfloor nq_l \rfloor )\\&+n(n-1)p_h p_l P(U_{n-2,h} \le \lfloor nq_h \rfloor -1, U_{n-2,l} \le \lfloor nq_l \rfloor -1). \end{aligned}$$

Proof

By definition we have

$$\begin{aligned}&C(p_h,p_l,q_h,q_l)=E\big [(nq_h-U_{n,h})_+(nq_l-U_{n,l})_+ \big ] \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }(nq_h-u_h)(nq_l-u_l)P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }(n^2q_hq_l-nq_hu_l - nu_hq_l+u_hu_l)P(U_{n,h}=u_h,U_{n,l}=u_l). \end{aligned}$$

(31)

Now, let us split Eq. (31) into four terms. First, we have

$$\begin{aligned} \sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }n^2q_hq_l P(U_{n,h}& = u_h, U_{n,l}=u_l) \\& = n^2q_hq_lP(U_{n,h} \le \lfloor nq_h\rfloor , U_{n,l} \le \lfloor nq_l\rfloor ). \end{aligned}$$

(32)

Second,

$$\begin{aligned}&\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }nq_hu_l P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l \rfloor } nq_hu_l P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }nq_hu_l \dfrac{n!}{u_h!u_l!(n-u_h-u_l)!}p_h^{u_h}p_l^{u_l}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad =\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }nq_hu_l \dfrac{n}{u_l} \dfrac{(n-1)!}{u_h!(u_l-1)!(n-u_h-u_l)!}p_h^{u_h} p_l p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n^2q_hp_l \sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor } \dfrac{(n-1)!}{u_h!(u_l-1)!(n-u_h-u_l)!}p_h^{u_h} p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n^2q_hp_l \sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor -1} \dfrac{(n-1)!}{u_h!u_l!(n-1-u_h-u_l)!}p_h^{u_h} p_l^{u_l}(1-p_h-p_l)^{n-1-u_h-u_l} \\&\quad = n^2q_hp_l P(U_{n-1,h} \le \lfloor nq_h\rfloor , U_{n-1,l} \le \lfloor nq_l\rfloor -1). \end{aligned}$$

(33)

Similarly, by exchanging h for l, we also derive

$$\begin{aligned}&\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }nq_lu_h P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad = n^2p_hq_l P(U_{n-1,h} \le \lfloor nq_h\rfloor -1 , U_{n-1,l} \le \lfloor nq_l\rfloor ). \end{aligned}$$

(34)

Finally,

$$\begin{aligned}&\sum _{u_h=0}^{\lfloor nq_h\rfloor }\sum _{u_l=0}^{\lfloor nq_l\rfloor }u_hu_l P(U_{n,h}= u_h ,U_{n,l}=u_l) = \sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor } u_hu_l P(U_{n,h}=u_h,U_{n,l}=u_l) \\&\quad =\sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }u_hu_l \dfrac{n!}{u_h!u_l!(n-u_h-u_l)!}p_h^{u_h}p_l^{u_l}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad =\sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor }u_hu_l \dfrac{n(n-1)}{u_hu_l} \dfrac{(n-2)!}{(u_h-1)!(u_l-1)!(n-u_h-u_l)!}p_hp_h^{u_h-1} p_l p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n(n-1)p_hp_l \sum _{u_h=1}^{\lfloor nq_h\rfloor }\sum _{u_l=1}^{\lfloor nq_l\rfloor } \dfrac{(n-2)!}{(u_h-1)!(u_l-1)!(n-u_h-u_l)!}p_h^{u_h-1} p_l^{u_l-1}(1-p_h-p_l)^{n-u_h-u_l} \\&\quad = n(n-1)p_hp_l \sum _{u_h=0}^{\lfloor nq_h\rfloor -1}\sum _{u_l=0}^{\lfloor nq_l\rfloor -1} \dfrac{(n-2)!}{u_h!u_l!(n-u_h-u_l-2)!}p_h^{u_h} p_l^{u_l}(1-p_h-p_l)^{n-u_h-u_l-2} \\&\quad = n(n-1)p_hp_l P(U_{n-2,h} \le \lfloor nq_h\rfloor -1 , U_{n-2,l} \le \lfloor nq_l\rfloor -1). \end{aligned}$$

(35)

Combining results (32), (33), (34) and (35) into Eq. (31) we prove the Lemma. $\square$
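The cross-moment identity of Lemma 3 can likewise be checked by enumerating the trinomial distribution of $(U_{n,h},U_{n,l})$. The sketch below uses hypothetical names (`trinomial_pmf`, `joint_cdf`, `c_direct`, `c_closed`), assumes $n \ge 2$ and $p_h+p_l \le 1$, and relies on exact rational arithmetic so that the two sides can be compared for equality.

```python
from fractions import Fraction
from math import factorial, floor


def trinomial_pmf(n, ph, pl, uh, ul):
    """P(U_h = uh, U_l = ul) for (U_h, U_l, rest) ~ Multinomial(n, (ph, pl, 1-ph-pl))."""
    rest = n - uh - ul
    if uh < 0 or ul < 0 or rest < 0:
        return Fraction(0)
    coef = factorial(n) // (factorial(uh) * factorial(ul) * factorial(rest))
    return coef * ph**uh * pl**ul * (1 - ph - pl) ** rest


def joint_cdf(n, ph, pl, kh, kl):
    """P(U_h <= kh, U_l <= kl); zero when either bound is negative."""
    return sum((trinomial_pmf(n, ph, pl, uh, ul)
                for uh in range(min(kh, n) + 1)
                for ul in range(min(kl, n) + 1)), Fraction(0))


def c_direct(n, ph, pl, qh, ql):
    """C = E[(n*qh - U_h)_+ (n*ql - U_l)_+], summed from the definition."""
    return sum((max(n * qh - uh, 0) * max(n * ql - ul, 0)
                * trinomial_pmf(n, ph, pl, uh, ul)
                for uh in range(n + 1) for ul in range(n + 1 - uh)), Fraction(0))


def c_closed(n, ph, pl, qh, ql):
    """Right-hand side of Lemma 3 (four joint cumulative probabilities)."""
    kh, kl = floor(n * qh), floor(n * ql)
    return (n**2 * qh * ql * joint_cdf(n, ph, pl, kh, kl)
            - n**2 * qh * pl * joint_cdf(n - 1, ph, pl, kh, kl - 1)
            - n**2 * ph * ql * joint_cdf(n - 1, ph, pl, kh - 1, kl)
            + n * (n - 1) * ph * pl * joint_cdf(n - 2, ph, pl, kh - 1, kl - 1))


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    args = (6, Fraction(1, 2), Fraction(3, 10), Fraction(1, 3), Fraction(1, 4))
    print(c_direct(*args), c_closed(*args))
```

The nested enumeration in `joint_cdf` grows quadratically with $n$ for every pair of classes, which gives a feel for why evaluating these probabilities becomes slow for realistic sample sizes and many classes.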

Proposition 2

Under the assumption that the sampling frequencies ${\mathbf{X}}=(X_1,\ldots ,X_c)$ are distributed as a multinomial random vector ${\mathcal{M}}(n,{\mathbf{p}})$, the variance of the estimator $\hat{I}$ in Eq. (3) is given by

$$\begin{aligned} Var [{\hat{I}}]& = \sum _{h=1}^c \bigg \{ (1-p_h)^2q_h^2P(Z_{n-2,h}\le \lfloor nq_h \rfloor ) +p_h^2(1-q_h)^2 P(Z_{n-2,h} \le \lfloor nq_h \rfloor -2) \\&\quad +\frac{1}{n} p_h(1-p_h)\big [1-2nq_h(1-q_h)\big ]P(Z_{n-2,h}\le \lfloor nq_h \rfloor -1) \bigg \} \\&\quad +2\sum _{h=1}^{c-1}\sum _{l=h+1}^c \bigg \{ q_h q_l P(U_{n,h} \le \lfloor nq_h \rfloor , U_{n,l} \le \lfloor nq_l \rfloor ) \\&\quad -q_h p_l P(U_{n-1,h} \le \lfloor nq_h \rfloor , U_{n-1,l} \le \lfloor nq_l \rfloor -1) \\&\quad -p_h q_l P(U_{n-1,h} \le \lfloor nq_h \rfloor -1, U_{n-1,l} \le \lfloor nq_l \rfloor ) \\&\quad + \frac{n-1}{n} p_h p_l P(U_{n-2,h} \le \lfloor nq_h \rfloor -1, U_{n-2,l} \le \lfloor nq_l \rfloor -1) \bigg \} \\&\quad - \left( \sum _{h=1}^c \left\{ (q_h-p_h) P(Z_{n-1,h} \le \lfloor nq_h\rfloor ) +\, p_h(1-q_h) P(Z_{n-1,h}=\lfloor nq_h \rfloor ) \right\} \right) ^2 \end{aligned}$$

where $Z_{n,h}$ are binomial random variables with parameters n and $p_h$, that is $Z_{n,h} \sim {\mathcal{B}}(n,p_h)$, while $U_{n,h}$ and $U_{n,l}$ are the components of a multinomial random vector $\mathbf {U}_n=\big (U_{n,h},U_{n,l},n-U_{n,h}-U_{n,l} \big )$ with parameters n and ${\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)$, that is $\mathbf {U}_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})$.

Proof

First, we recall the following representation

$$\begin{aligned} Var [\hat{I}]& = \frac{1}{n^2} Var \left[ \sum _{h=1}^c (nq_h-X_{h})_+\right] \\& = \frac{1}{n^2} \sum _{h=1}^c Var \big [(nq_h-X_{h})_+\big ] \\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c Cov\big [(nq_h-X_{h})_+,(nq_l-X_{l})_+\big ] \\& = \frac{1}{n^2} \sum _{h=1}^c E\big [\big ((nq_h-X_{h})_+\big )^2\big ] -\frac{1}{n^2} \sum _{h=1}^c \big (E\big [(nq_h-X_{h})_+\big ]\big )^2 \\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-X_{h})_+ (nq_l-X_{l})_+\big ] \\&\quad -\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-X_{h})_+\big ]E\big [(nq_l-X_{l})_+\big ] \\& = \frac{1}{n^2} \sum _{h=1}^c E\big [\big ((nq_h-X_{h})_+\big )^2\big ] \\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-X_{h})_+ (nq_l-X_{l})_+\big ] \\&\quad - \left( \frac{1}{n} \sum _{h=1}^c E\big [(nq_h-X_{h})_+\big ]\right) ^2 . \end{aligned}$$

(36)

If the random vector ${\mathbf{X}}=(X_1,\ldots ,X_c)$ is multinomially distributed with parameters n and ${\mathbf{p}}$, then the h-th component $X_h$ has a marginal binomial distribution with parameters n and $p_h$, while any two components $X_h$ and $X_l$, together with the remainder $n-X_h-X_l$, are jointly multinomial with parameters n and ${\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)$. Therefore, we can rewrite Eq. (36) as follows

$$\begin{aligned} Var[\hat{I}]& = \frac{1}{n^2} \sum _{h=1}^c E\big [\big ((nq_h-Z_{n,h})_+\big )^2\big ]\\&\quad +\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nq_h-U_{n,h})_+ (nq_l-U_{n,l})_+\big ]\\&\quad - \left( \frac{1}{n} \sum _{h=1}^c E\big [(nq_h-Z_{n,h})_+\big ]\right) ^2 \end{aligned}$$

where $Z_{n,h} \sim {\mathcal{B}}(n,p_h)$ and $(U_{n,h},U_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})$, or equivalently as

$$\begin{aligned} Var[\hat{I}] = \frac{1}{n^2}\sum _{h=1}^c B(p_h,q_h)+\frac{2}{n^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c C(p_h,p_l,q_h,q_l)-\left[ \frac{1}{n}\sum _{h=1}^c A(p_h,q_h)\right] ^2 \end{aligned}$$

where

$$\begin{aligned} A(p_h,q_h) = E\big [ (nq_h -Z_{n,h})_+\big ], \end{aligned}$$

$$\begin{aligned} B(p_h,q_h)=E\big [ \big ((nq_h-Z_{n,h})_+\big )^2\big ] \end{aligned}$$

and

$$\begin{aligned} C(p_h,p_l,q_h,q_l) = E\big [(nq_h-U_{n,h})_+(nq_l-U_{n,l})_+ \big ]. \end{aligned}$$

Now combining the results from Lemmas 1, 2 and 3 we prove the Proposition. $\square$
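To illustrate Proposition 2, the variance formula can be compared with the variance obtained by enumerating all multinomial outcomes for a small $n$. This is a hedged sketch rather than the authors' implementation: all function names are hypothetical, $n \ge 2$ is assumed, and the estimator is taken as $\hat{I}=1-\frac{1}{n}\sum_h (nq_h-X_h)_+$, the form used in Eq. (36).

```python
from fractions import Fraction
from itertools import product
from math import factorial, floor


def multinomial_pmf(x, p):
    """P(X = x) for X ~ Multinomial(sum(x), p); zero if any count is negative."""
    if min(x) < 0:
        return Fraction(0)
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)  # each division is exact
    prob = Fraction(coef)
    for xi, pi in zip(x, p):
        prob *= pi**xi
    return prob


def binom_pmf(n, p, z):
    """Binomial probability as a two-class multinomial."""
    return multinomial_pmf((z, n - z), (p, 1 - p))


def binom_cdf(n, p, k):
    return sum((binom_pmf(n, p, z) for z in range(min(k, n) + 1)), Fraction(0))


def joint_cdf(n, ph, pl, kh, kl):
    """P(U_h <= kh, U_l <= kl) for the trinomial (U_h, U_l, n - U_h - U_l)."""
    return sum((multinomial_pmf((uh, ul, n - uh - ul), (ph, pl, 1 - ph - pl))
                for uh in range(min(kh, n) + 1)
                for ul in range(min(kl, n) + 1)), Fraction(0))


def var_formula(n, p, q):
    """Var[I_hat] from Proposition 2 (fixed reference profile q, n >= 2)."""
    c = len(p)
    k = [floor(n * qh) for qh in q]
    single = cross = mean = Fraction(0)
    for h in range(c):
        ph, qh = p[h], q[h]
        single += (1 - ph) ** 2 * qh**2 * binom_cdf(n - 2, ph, k[h])
        single += ph**2 * (1 - qh) ** 2 * binom_cdf(n - 2, ph, k[h] - 2)
        single += (ph * (1 - ph) * (1 - 2 * n * qh * (1 - qh))
                   * binom_cdf(n - 2, ph, k[h] - 1) / n)
        mean += (qh - ph) * binom_cdf(n - 1, ph, k[h])
        mean += ph * (1 - qh) * binom_pmf(n - 1, ph, k[h])
        for l in range(h + 1, c):
            pl, ql = p[l], q[l]
            cross += qh * ql * joint_cdf(n, ph, pl, k[h], k[l])
            cross -= qh * pl * joint_cdf(n - 1, ph, pl, k[h], k[l] - 1)
            cross -= ph * ql * joint_cdf(n - 1, ph, pl, k[h] - 1, k[l])
            cross += (Fraction(n - 1, n) * ph * pl
                      * joint_cdf(n - 2, ph, pl, k[h] - 1, k[l] - 1))
    return single + 2 * cross - mean**2


def var_bruteforce(n, p, q):
    """Var[I_hat] by enumerating every count vector x with sum(x) = n."""
    m1 = m2 = Fraction(0)
    for x in product(range(n + 1), repeat=len(p)):
        if sum(x) != n:
            continue
        shortfall = sum((max(n * qh - xh, 0) for xh, qh in zip(x, q)), Fraction(0))
        i_hat = 1 - shortfall / n
        w = multinomial_pmf(x, p)
        m1 += i_hat * w
        m2 += i_hat**2 * w
    return m2 - m1**2


if __name__ == "__main__":
    # Illustrative profiles only (not taken from the paper).
    p = (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))
    q = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))
    print(float(var_formula(6, p, q)), float(var_bruteforce(6, p, q)))
```

The brute-force version is only feasible for very small $n$ and few classes; it is included purely as a cross-check of the closed-form expression.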

Lemma 4

Let $\Omega$ be a finite set of categorical values $\{\omega _1,\ldots ,\omega _c\}$ and ${\mathbf{p}}=\{p_1,\ldots ,p_c\}$ and ${\mathbf{q}}=\{q_1,\ldots ,q_c\}$ two probability distributions defined over $\Omega$. Let us consider one value $\omega _h \in \Omega$ with the corresponding probabilities $p_h,q_h \in [0,1]$ in the profiles ${\mathbf{p}}$ and ${\mathbf{q}}$, respectively. Let us denote by $Z'_{n,h}$ a binomial random variable with parameters n and $p_h$ and by $Z''_{m,h}$ a binomial random variable with parameters m and $q_h$, that is $Z'_{n,h} \sim {\mathcal{B}}(n,p_h)$ and $Z''_{m,h} \sim {\mathcal{B}}(m,q_h)$, respectively. Let us define the function $D(p_h,q_h)$ as follows

$$\begin{aligned} D(p_h,q_h)=E\big [(nZ''_{m,h}-mZ'_{n,h})_+ \big ]. \end{aligned}$$

Then, for every $h=1,\ldots ,c$ we have

$$\begin{aligned} D(p_h,q_h)& = nm(1-p_h)q_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big ) \\& \quad-nmp_h(1-q_h) P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big ). \end{aligned}$$

Proof

Let us introduce the indicator function ${\mathbb{I}}(A)$ of an event A, which equals 1 if the event A occurs and 0 otherwise. Then, the quantity $D(p_h,q_h)$ can be represented as

$$\begin{aligned} D(p_h,q_h)& = E\big [(nZ''_{m,h}-mZ'_{n,h})_+ \big ] \\& = E\big [(nZ''_{m,h}-mZ'_{n,h}){\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] \\& = nE\big [Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] - mE\big [Z'_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]. \end{aligned}$$

(37)

Let us treat the two terms of Eq. (37) separately. Recall that a binomial random variable $Z''_{m,h}$ can be represented as the sum of m independent Bernoulli random variables $Z''_{1,h,i}$, each distributed as $Z''_{1,h}$, or equivalently as the sum of an independent Bernoulli $Z''_{1,h}$ and a binomial $Z''_{m-1,h}$. For the first term we then have

$$\begin{aligned} nE\big [Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]& = n E \left[ \left( \sum _{i=1}^m Z''_{1,h,i} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\& = n \sum _{i=1}^m E \left[ Z''_{1,h,i}{\mathbb{I}}(nZ''_{m-1,h}+ nZ''_{1,h,i} \ge mZ'_{n,h}) \right] \end{aligned}$$

Since the Bernoulli variables $Z''_{1,h,i}$ are identically distributed as $Z''_{1,h}$, the m summands are equal and the index i can be dropped. Hence

$$\begin{aligned} nE\big [Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]& = nm E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+nZ''_{1,h} \ge mZ'_{n,h}) \big ] \\& = nm E\big [{\mathbb{I}}(nZ''_{m-1,h}+n \ge mZ'_{n,h}) \big ]q_h \\& = nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n,h} \big )q_h \\& = nm(1-p_h)q_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big ) \\&\quad+nmp_hq_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big ), \end{aligned}$$

(38)

where we used the properties

$$\begin{aligned} E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+ & nZ''_{1,h} \ge mZ'_{n,h}) \big ]\\& = E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+nZ''_{1,h} \ge mZ'_{n,h} \vert Z''_{1,h}=0)]P(Z''_{1,h}=0) \\&\quad+ E\big [Z''_{1,h}{\mathbb{I}}(nZ''_{m-1,h}+nZ''_{1,h} \ge mZ'_{n,h} \vert Z''_{1,h}=1)]P(Z''_{1,h}=1) \\& = E\big [{\mathbb{I}}(nZ''_{m-1,h}+n \ge mZ'_{n,h})]q_h \\& = P\big (nZ''_{m-1,h}+n \ge mZ'_{n,h}\big )q_h \end{aligned}$$

and

$$\begin{aligned} P\big (nZ''_{m-1,h}+ & n \ge mZ'_{n,h}\big )\\& = P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+mZ'_{1,h} \vert Z'_{1,h}=0 \big )P(Z'_{1,h}=0)\\&\quad +P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+mZ'_{1,h} \vert Z'_{1,h}=1 \big )P(Z'_{1,h}=1)\\& = P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)\\&\quad +P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big )p_h. \end{aligned}$$

Similarly, we can also derive

$$\begin{aligned} mE\big [Z'_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]& = mnp_h(1-q_h) P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big ) \\&\quad+mnp_hq_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big ). \end{aligned}$$

(39)

Now, combining the results (38) and (39) into Eq. (37) we prove the Lemma. Indeed,

$$\begin{aligned} D(p_h,q_h)& = nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)q_h \\&\quad +nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big )p_hq_h \\&\quad -nm P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big )p_h(1-q_h) \\&\quad -nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}+m \big )p_hq_h \\& = nm P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)q_h \\&\quad -nm P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big )p_h(1-q_h). \end{aligned}$$

(40)

$\square$
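Lemma 4 can be checked numerically by enumerating the two binomial variables. The following is a minimal sketch with hypothetical names (`d_direct`, `shifted_prob`, `d_closed`); it assumes, as the conditioning steps of the proof do, that $Z'_{n,h}$ and $Z''_{m,h}$ are independent, and it uses exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def binom_pmf(n, p, z):
    """P(Z = z) for Z ~ Binomial(n, p); zero outside 0..n."""
    if z < 0 or z > n:
        return Fraction(0)
    return comb(n, z) * p**z * (1 - p) ** (n - z)


def d_direct(n, m, p, q):
    """D(p, q) = E[(n*Z'' - m*Z')_+], Z' ~ Bin(n, p), Z'' ~ Bin(m, q) independent."""
    return sum((max(n * y - m * x, 0) * binom_pmf(n, p, x) * binom_pmf(m, q, y)
                for x in range(n + 1) for y in range(m + 1)), Fraction(0))


def shifted_prob(n, m, p, q, shift_left, shift_right):
    """P(n*Y + shift_left >= m*X + shift_right), X ~ Bin(n-1, p), Y ~ Bin(m-1, q)."""
    return sum((binom_pmf(n - 1, p, x) * binom_pmf(m - 1, q, y)
                for x in range(n) for y in range(m)
                if n * y + shift_left >= m * x + shift_right), Fraction(0))


def d_closed(n, m, p, q):
    """Right-hand side of Lemma 4."""
    return (n * m * (1 - p) * q * shifted_prob(n, m, p, q, n, 0)
            - n * m * p * (1 - q) * shifted_prob(n, m, p, q, 0, m))


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    n, m, p, q = 4, 6, Fraction(1, 3), Fraction(2, 5)
    print(d_direct(n, m, p, q), d_closed(n, m, p, q))
```

Note that the multipliers inside the events remain the original sample sizes $n$ and $m$, while the binomial variables have $n-1$ and $m-1$ trials.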

Proposition 3

Under the assumption that the sampling frequencies ${\mathbf{X}}=(X_1,\ldots ,X_c)$ and $\mathbf {Y}=(Y_1,\ldots ,Y_c)$ are distributed as multinomial random vectors ${\mathcal{M}}(n,{\mathbf{p}})$ and ${\mathcal{M}}(m,{\mathbf{q}})$, respectively, the expected value of the estimator $\hat{I}$ in Eq. (19) is given by

$$\begin{aligned} E \big [\hat{I}\big ]& = 1-\sum _{h=1}^c (1-p_h)q_h P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h}\big )\\&\quad +\sum _{h=1}^c p_h(1-q_h)P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big ), \end{aligned}$$

where the $Z'_{n-1,h}$ are binomial random variables with parameters $n-1$ and $p_h$ and the $Z''_{m-1,h}$ are binomial random variables with parameters $m-1$ and $q_h$, that is $Z'_{n-1,h} \sim {\mathcal{B}}(n-1,p_h)$ and $Z''_{m-1,h} \sim {\mathcal{B}}(m-1,q_h)$, respectively.

Proof

From Eq. (19) we have

$$\begin{aligned} \hat{I}& = \frac{1}{nm}\sum _{h=1}^{c} \min \{mX_{h},nY_{h}\}\\& = \frac{1}{nm}\sum _{h=1}^c\big [nY_h-(nY_h-mX_{h})_+\big ]\\& = 1-\frac{1}{nm}\sum _{h=1}^c(nY_h-mX_{h})_+, \end{aligned}$$

hence its expectation can be represented as

$$\begin{aligned} E \big [ \hat{I} \big ]=1-\frac{1}{nm}\sum _{h=1}^c E \big [ (nY_h-mX_{h})_+ \big ]. \end{aligned}$$

(41)
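The rewriting of $\hat{I}$ above rests on the identity $\min \{a,b\}=b-(b-a)_+$ together with $\sum _h Y_h=m$. A minimal Python sketch, with hypothetical function names and made-up counts, evaluates both forms with exact rational arithmetic:

```python
from fractions import Fraction

def pma_min_form(x, y):
    """I_hat = (1/nm) * sum_h min(m*X_h, n*Y_h), the first line of the display."""
    n, m = sum(x), sum(y)
    return Fraction(sum(min(m * xh, n * yh) for xh, yh in zip(x, y)), n * m)

def pma_positive_part_form(x, y):
    """I_hat = 1 - (1/nm) * sum_h (n*Y_h - m*X_h)_+, the last line of the display."""
    n, m = sum(x), sum(y)
    excess = sum(max(n * yh - m * xh, 0) for xh, yh in zip(x, y))
    return 1 - Fraction(excess, n * m)

# Illustrative counts only (not data from the paper): n = 5, m = 7, c = 3.
x = (3, 1, 1)
y = (2, 2, 3)
print(pma_min_form(x, y), pma_positive_part_form(x, y))  # both print 24/35
```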

Now, recalling that the h-th component of a random vector ${\mathbf{X}}$, multinomial distributed with parameters n and ${\mathbf{p}}$, has marginal binomial distribution with parameters n and $p_h$, that is $X_{h} \sim {\mathcal{B}}(n,p_h)$, and the h-th component of a random vector $\mathbf {Y}$, multinomial distributed with parameters m and ${\mathbf{q}}$, has marginal binomial distribution with parameters m and $q_h$, that is $Y_{h} \sim {\mathcal{B}}(m,q_h)$, Eq. (41) can be rewritten as

$$\begin{aligned} E \big [ \hat{I} \big ]& = 1-\frac{1}{nm}\sum _{h=1}^c E \big [ (nZ''_{m,h}-mZ'_{n,h})_+ \big ] \\& = 1-\frac{1}{nm}\sum _{h=1}^c D(p_h,q_h). \end{aligned}$$

(42)

Therefore, the proof follows from combining Eq. (40) in Lemma 4 and Eq. (42). $\square$
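To illustrate how Proposition 3 can be used, the sketch below evaluates $E[\hat{I}]$ for given profiles and sample sizes. It is a hedged example rather than the authors' implementation: the function names and the profiles are hypothetical, the probabilities are obtained by plain enumeration, and the population value is assumed to be $\sum _h \min \{p_h,q_h\}$, the plug-in counterpart of Eq. (19).

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Z = k) for Z ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob(event, n1, p, m1, q):
    """P(event(a, b)) for independent a ~ B(n1, p) (Z') and b ~ B(m1, q) (Z'')."""
    return sum(binom_pmf(a, n1, p) * binom_pmf(b, m1, q)
               for a in range(n1 + 1) for b in range(m1 + 1)
               if event(a, b))

def expected_pma(n, m, p, q):
    """E[I_hat] according to the formula of Proposition 3."""
    total = 1.0
    for ph, qh in zip(p, q):
        up = prob(lambda a, b: n * b + n >= m * a, n - 1, ph, m - 1, qh)
        down = prob(lambda a, b: n * b >= m * a + m, n - 1, ph, m - 1, qh)
        total += -(1 - ph) * qh * up + ph * (1 - qh) * down
    return total

def population_pma(p, q):
    """Assumed population index: sum_h min(p_h, q_h)."""
    return sum(min(ph, qh) for ph, qh in zip(p, q))

# Illustrative identical profiles (assumed values), where the bias is most visible.
p = q = (0.5, 0.3, 0.2)
for size in (5, 50):
    print(size, expected_pma(size, size, p, q), population_pma(p, q))
```

Because $\min$ is concave, Jensen's inequality gives $E[\hat{I}] \le \sum _h \min \{p_h,q_h\}$, so the printed expectations should lie below the assumed population value and move towards it as the sample sizes grow.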

Lemma 5

Let $\Omega$ be a finite set of categorical values $\{\omega _1,\ldots ,\omega _c\}$ and let ${\mathbf{p}}=\{p_1,\ldots ,p_c\}$ and ${\mathbf{q}}=\{q_1,\ldots ,q_c\}$ be two probability distributions defined over $\Omega$. Let us consider one value $\omega _h \in \Omega$ with the corresponding probabilities $p_h,q_h \in [0,1]$ in the profiles ${\mathbf{p}}$ and ${\mathbf{q}}$, respectively. Let us denote by $Z'_{n,h}$ a binomial random variable with parameters n and $p_h$ and by $Z''_{m,h}$ a binomial random variable with parameters m and $q_h$, that is $Z'_{n,h} \sim {\mathcal{B}}(n,p_h)$ and $Z''_{m,h} \sim {\mathcal{B}}(m,q_h)$, respectively. Let us define the function $F(p_h,q_h)$ as follows

$$\begin{aligned} F(p_h,q_h)=E\big [\big (\big (nZ''_{m,h}-mZ'_{n,h}\big )_+ \big )^2 \big ]. \end{aligned}$$

Then, for every $h=1,\ldots ,c$ we have

$$\begin{aligned} F(p_h,q_h)& = n^2m q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big ) \\&\quad +n^2m(m-1) q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ) \\&\quad +m^2n p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big ) \\&\quad +m^2n(n-1) p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big ) \\&\quad -2n^2m^2 p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m\right) . \end{aligned}$$

Proof

By using the indicator function ${\mathbb{I}}(A)$ of an event A, the quantity $F(p_h,q_h)$ can be represented as

$$\begin{aligned} F(p_h,q_h)& = E\big [\big (\big (nZ''_{m,h}-mZ'_{n,h}\big )_+ \big )^2 \big ] \\& = E\big [(nZ''_{m,h}-mZ'_{n,h})^2{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] \\& = n^2E\big [Z''^2_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] + m^2E\big [Z'^2_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ] \\&-2nm E\big [ Z'_{n,h}Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h})\big ]. \end{aligned}$$

(43)

Now, let us split Eq. (43) into three terms and recall that a binomial random variable $Z''_{m,h}$ can be represented as the sum of m independent random variables $Z''_{1,h,i}$ distributed as Bernoulli $Z''_{1,h}$, or also as the sum of an independent Bernoulli $Z''_{1,h}$ and a binomial $Z''_{m-1,h}$. For the first term, we have

$$\begin{aligned}&n^2E \big [ Z''^2_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]=n^2E \left[ \left( \sum _{i=1}^m Z''_{1,h,i} \right) ^2{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad =n^2E \left[ \left( \sum _{i=1}^m Z''^2_{1,h,i} + \mathop {\sum \sum }_{i \ne j} Z''_{1,h,i} Z''_{1,h,j} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad =n^2E \left[ \left( \sum _{i=1}^m Z''^2_{1,h,i} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\qquad +n^2E \left[ \left( \mathop {\sum \sum }_{i \ne j} Z''_{1,h,i} Z''_{1,h,j} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad = n^2 \sum _{i=1}^m E \big [Z''^2_{1,h,i} {\mathbb{I}}(nZ''_{m-1,h} +nZ''_{1,h,i} \ge mZ'_{n,h}) \big ] \\&\qquad +n^2\mathop {\sum \sum }_{i \ne j}E \left[ Z''_{1,h,i} Z''_{1,h,j} {\mathbb{I}}(nZ''_{m-2,h} +nZ''_{1,h,i} +nZ''_{1,h,j} \ge mZ'_{n,h}) \right] \\&\quad = n^2m E \big [{\mathbb{I}}(nZ''_{m-1,h} + n \ge mZ'_{n,h}) \big ] q_h \\&\qquad +\,n^2m(m-1)E \big [ {\mathbb{I}}(nZ''_{m-2,h} +2n \ge mZ'_{n,h}) \big ] q_h^2 \\&\quad = n^2m q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big ) \\&\qquad +\,n^2m(m-1) q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ), \end{aligned}$$

(44)

in which, as in Lemma 4, we have used the independence and the identical distribution of the Bernoulli variables $Z''_{1,h,i}$. Similarly, we can also derive the second term of Eq. (43)

$$\begin{aligned}&m^2E\big [Z'^2_{n,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \big ]= m^2n p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big ) \\&\quad +\,m^2n(n-1) p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big ), \end{aligned}$$

(45)

while for the last term, we have

$$\begin{aligned}&E\big [ Z'_{n,h}Z''_{m,h}{\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h})\big ] =E\left[ \left( \sum _{i=1}^n \sum _{j=1}^m Z'_{1,h,i}Z''_{1,h,j} \right) {\mathbb{I}}(nZ''_{m,h} \ge mZ'_{n,h}) \right] \\&\quad =\sum _{i=1}^n \sum _{j=1}^m E\left[ Z'_{1,h,i}Z''_{1,h,j} {\mathbb{I}}(nZ''_{m-1,h} +nZ''_{1,h,j} \ge mZ'_{n-1,h}+mZ'_{1,h,i})\right] \\&\quad =nm E\left[ Z'_{1,h}Z''_{1,h} {\mathbb{I}}(nZ''_{m-1,h} +nZ''_{1,h} \ge mZ'_{n-1,h}+mZ'_{1,h}) \right] \\&\quad =nm E\left[ {\mathbb{I}}(nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m) \right] p_hq_h \\&\quad =nm p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m\right) . \end{aligned}$$

(46)

Now, it suffices to plug the results (44), (45) and (46) into Eq. (43) to prove the Lemma. In fact,

$$\begin{aligned} F(p_h,q_h)& = n^2m q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big ) \\&\quad +n^2m(m-1) q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ) \\&\quad +m^2n p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big ) \\&\quad +m^2n(n-1) p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big ) \\&\quad -2n^2m^2 p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m\right) . \end{aligned}$$

(47)

$\square$
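As with Lemma 4, Eq. (47) can be compared against a direct evaluation of the second moment. The following Python sketch is only an illustration under stated assumptions: the function names and parameter values are hypothetical, and it assumes $n \ge 2$ and $m \ge 2$ so that $Z'_{n-2,h}$ and $Z''_{m-2,h}$ are well defined.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Z = k) for Z ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob(event, n1, p, m1, q):
    """P(event(a, b)) for independent a ~ B(n1, p) (Z') and b ~ B(m1, q) (Z'')."""
    return sum(binom_pmf(a, n1, p) * binom_pmf(b, m1, q)
               for a in range(n1 + 1) for b in range(m1 + 1)
               if event(a, b))

def F_direct(n, m, p, q):
    """E[((n Z''_m - m Z'_n)_+)^2] by enumerating both binomials."""
    return sum(binom_pmf(a, n, p) * binom_pmf(b, m, q) * max(n * b - m * a, 0) ** 2
               for a in range(n + 1) for b in range(m + 1))

def F_formula(n, m, p, q):
    """Right-hand side of Eq. (47); requires n >= 2 and m >= 2."""
    t1 = prob(lambda a, b: n * b + n >= m * a, n, p, m - 1, q)
    t2 = prob(lambda a, b: n * b + 2 * n >= m * a, n, p, m - 2, q)
    t3 = prob(lambda a, b: n * b >= m * a + m, n - 1, p, m, q)
    t4 = prob(lambda a, b: n * b >= m * a + 2 * m, n - 2, p, m, q)
    t5 = prob(lambda a, b: n * b + n >= m * a + m, n - 1, p, m - 1, q)
    return (n * n * m * q * t1 + n * n * m * (m - 1) * q * q * t2
            + m * m * n * p * t3 + m * m * n * (n - 1) * p * p * t4
            - 2 * n * n * m * m * p * q * t5)

# Illustrative values only (assumed, not taken from the paper).
n, m, p_h, q_h = 4, 6, 0.35, 0.5
print(F_direct(n, m, p_h, q_h), F_formula(n, m, p_h, q_h))
```

Each of the five probabilities involves binomials with different numbers of trials, which is why the helper `prob` takes the two trial counts as separate arguments.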

Lemma 6

Let $\Omega$ be a finite set of categorical values $\{\omega _1,\ldots ,\omega _c\}$ and let ${\mathbf{p}}=\{p_1,\ldots ,p_c\}$ and ${\mathbf{q}}=\{q_1,\ldots ,q_c\}$ be two probability distributions defined over $\Omega$. Let us consider two values $\omega _h,\omega _l \in \Omega$ with the corresponding probabilities $p_h,p_l,q_h,q_l \in [0,1]$ in the profiles ${\mathbf{p}}$ and ${\mathbf{q}}$, respectively. Let us denote by $U'_{n,h}$ and $U'_{n,l}$ the components of a multinomial random vector $\mathbf {U}'_n=\big (U'_{n,h},U'_{n,l},n-U'_{n,h}-U'_{n,l} \big )$ with parameters n and ${\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)$, that is $\mathbf {U}'_n \sim {\mathcal{M}}(n,{\mathbf{p}}_{hl})$, and by $U''_{m,h}$ and $U''_{m,l}$ the components of a multinomial random vector $\mathbf {U}''_m=\big (U''_{m,h},U''_{m,l},m-U''_{m,h}-U''_{m,l} \big )$ with parameters m and ${\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)$, that is $\mathbf {U}''_m \sim {\mathcal{M}}(m,{\mathbf{q}}_{hl})$. Let us define the function $G(p_h,p_l,q_h,q_l)$ as follows

$$\begin{aligned} G(p_h,p_l,q_h,q_l)=E\big [\big (nU''_{m,h}-mU'_{n,h}\big )_+\big (nU''_{m,l}-mU'_{n,l}\big )_+ \big ]. \end{aligned}$$

Then, for every $h,l=1,\ldots ,c$ with $h \ne l$ we have

$$\begin{aligned}&G(p_h,p_l,q_h,q_l)\\&\quad =n^2m(m-1)q_hq_l P\big (\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \big )\\&\qquad +\,m^2n(n-1)p_hp_l P\big (\{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \big )\\&\qquad -\,n^2m^2q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) \\&\qquad -\,m^2n^2p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\} \right) . \end{aligned}$$

Proof

By using the indicator function ${\mathbb{I}}(A)$ of an event A, the quantity $G(p_h,p_l,q_h,q_l)$ can be represented as

$$\begin{aligned} G(p_h,p_l,q_h,q_l)& = E\big [\big (nU''_{m,h}-mU'_{n,h}\big )_+\big (nU''_{m,l}-mU'_{n,l}\big )_+ \big ] \\& = E\big [\big (nU''_{m,h}-mU'_{n,h}\big )\big (nU''_{m,l}-mU'_{n,l}\big ) \\&\quad \times {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\& = E\big [\big (n^2U''_{m,h}U''_{m,l}+m^2U'_{n,h}U'_{n,l}-nmU''_{m,h}U'_{n,l}-mnU'_{n,h}U''_{m,l}\big ) \\&\quad \times {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ]. \end{aligned}$$

(48)

Now, let us split Eq. (48) into four terms and recall that a multinomial random vector $\mathbf {U}''_m=\big (U''_{m,h},U''_{m,l},m-U''_{m,h}-U''_{m,l} \big )$ can be represented as the sum of m independent random vectors $\mathbf {U}''_{1,i}=\big (U''_{1,h,i},U''_{1,l,i},1-U''_{1,h,i}-U''_{1,l,i} \big )$ identically distributed as multinomial $\mathbf {U}''_1=\big (U''_{1,h},U''_{1,l},1-U''_{1,h}-U''_{1,l} \big )$, or also as the sum of two independent multinomials $\mathbf {U}''_{1}$ and $\mathbf {U}''_{m-1}$ with the same probability parameter ${\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)$. Hence, for the first term, we have

$$\begin{aligned}&n^2E\big [U''_{m,h}U''_{m,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad =n^2E\left[ \left( \sum _{i=1}^m\sum _{j=1}^m U''_{1,h,i}U''_{1,l,j}\right) {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \right] \\&\quad = n^2\mathop {\sum _{i=1}^m\sum _{j=1}^m}_{i \ne j} E\big [U''_{1,h,i}U''_{1,l,j} \\&\qquad \times {\mathbb{I}}(\{nU''_{m-2,h} +nU''_{1,h,i} \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +nU''_{1,l,j} \ge mU'_{n,l}\}) \big ] \\&\quad = n^2m(m-1) E\big [{\mathbb{I}}(\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\}) \big ] q_hq_l \\&\quad = n^2m(m-1)q_hq_l P\big (\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \big ), \end{aligned}$$

(49)

where, for $i=j$, we use the property

$$\begin{aligned} E\left[ U''_{1,h,i}U''_{1,l,i} {\mathbb{I}}(\{nU''_{m-2,h} +nU''_{1,h,i} \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +nU''_{1,l,i} \ge mU'_{n,l}\}) \right] =0. \end{aligned}$$

In fact, the two components $U''_{1,h,i}$ and $U''_{1,l,i}$ of a multinomial $\mathbf {U}''_{1,i}=\big (U''_{1,h,i},U''_{1,l,i},1-U''_{1,h,i}-U''_{1,l,i} \big )$ cannot both be equal to 1.

Similarly, the second term of Eq. (48) is given by

$$\begin{aligned}&m^2E\big [U'_{n,h}U'_{n,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad = m^2n(n-1)p_hp_l P\left( \{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \right) , \end{aligned}$$

(50)

while for the third term of Eq. (48) we have

$$\begin{aligned}&nm E\big [U''_{m,h}U'_{n,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad =nmE\left[ \left( \sum _{i=1}^m\sum _{j=1}^n U''_{1,h,i}U'_{1,l,j}\right) {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \right] \\&\quad = nm\sum _{i=1}^m\sum _{j=1}^n E\big [U''_{1,h,i}U'_{1,l,j} \\&\qquad \times {\mathbb{I}}(\{nU''_{m-1,h} +nU''_{1,h,i} \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+ mU'_{1,l,j} \}) \big ] \\&\quad = n^2m^2E\left[ {\mathbb{I}}(\{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\}) \right] q_hp_l \\&\quad = n^2m^2q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) . \end{aligned}$$

(51)

Similarly, the last term of Eq. (48) can be obtained as follows

$$\begin{aligned}&mnE\big [U'_{n,h}U''_{m,l} {\mathbb{I}}(\{nU''_{m,h} \ge mU'_{n,h}\}\cap\, \{nU''_{m,l} \ge mU'_{n,l}\}) \big ] \\&\quad = m^2n^2p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\} \right) . \end{aligned}$$

(52)

Now, it suffices to plug the results (49), (50), (51) and (52) into Eq. (48) to prove the Lemma. In fact,

$$\begin{aligned}&G(p_h,p_l,q_h,q_l) \\&\quad =n^2m(m-1)q_hq_l P\left( \{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \right) \\&\qquad +\,m^2n(n-1)p_hp_l P\left( \{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \right) \\&\qquad -\,n^2m^2q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) \\&\qquad -\,m^2n^2p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\} \right) . \end{aligned}$$

(53)

$\square$
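Eq. (53) can likewise be checked by enumeration, now over pairs of trinomial vectors. The Python sketch below is a hedged illustration with hypothetical names and parameter values; it assumes $n \ge 2$, $m \ge 2$ and $p_h+p_l<1$, $q_h+q_l<1$, and it is practical only for very small sample sizes.

```python
from math import factorial

def trinomial(N, ph, pl):
    """Yield ((a, b), prob): counts of categories h and l in N multinomial trials."""
    r = 1 - ph - pl
    for a in range(N + 1):
        for b in range(N - a + 1):
            coef = factorial(N) // (factorial(a) * factorial(b) * factorial(N - a - b))
            yield (a, b), coef * ph**a * pl**b * r**(N - a - b)

def prob(event, n1, p_hl, m1, q_hl):
    """P(event(xh, xl, yh, yl)) with (xh, xl) from U'_{n1} and (yh, yl) from U''_{m1}."""
    return sum(px * py
               for (xh, xl), px in trinomial(n1, *p_hl)
               for (yh, yl), py in trinomial(m1, *q_hl)
               if event(xh, xl, yh, yl))

def G_direct(n, m, p_hl, q_hl):
    """E[(n U''_h - m U'_h)_+ (n U''_l - m U'_l)_+] by enumeration."""
    return sum(px * py * max(n * yh - m * xh, 0) * max(n * yl - m * xl, 0)
               for (xh, xl), px in trinomial(n, *p_hl)
               for (yh, yl), py in trinomial(m, *q_hl))

def G_formula(n, m, p_hl, q_hl):
    """Right-hand side of Eq. (53); requires n >= 2 and m >= 2."""
    (ph, pl), (qh, ql) = p_hl, q_hl
    t1 = prob(lambda xh, xl, yh, yl: n * yh + n >= m * xh and n * yl + n >= m * xl,
              n, p_hl, m - 2, q_hl)
    t2 = prob(lambda xh, xl, yh, yl: n * yh >= m * xh + m and n * yl >= m * xl + m,
              n - 2, p_hl, m, q_hl)
    t3 = prob(lambda xh, xl, yh, yl: n * yh + n >= m * xh and n * yl >= m * xl + m,
              n - 1, p_hl, m - 1, q_hl)
    t4 = prob(lambda xh, xl, yh, yl: n * yh >= m * xh + m and n * yl + n >= m * xl,
              n - 1, p_hl, m - 1, q_hl)
    return (n * n * m * (m - 1) * qh * ql * t1 + m * m * n * (n - 1) * ph * pl * t2
            - n * n * m * m * qh * pl * t3 - m * m * n * n * ph * ql * t4)

# Illustrative values only (assumed, not taken from the paper).
p_hl, q_hl = (0.2, 0.5), (0.4, 0.3)
print(G_direct(3, 4, p_hl, q_hl), G_formula(3, 4, p_hl, q_hl))
```

The nested enumeration makes the cost of these joint probabilities visible: every term runs over all pairs of trinomial outcomes, and the number of such pairs grows quickly with $n$ and $m$.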

Proposition 4

Under the assumption that the sampling frequencies ${\mathbf{X}}=(X_1,\ldots ,X_c)$ and $\mathbf {Y}=(Y_1,\ldots ,Y_c)$ are distributed as multinomial random vectors ${\mathcal{M}}(n,{\mathbf{p}})$ and ${\mathcal{M}}(m,{\mathbf{q}})$, respectively, the variance of the estimator $\hat{I}$ in Eq. (19) is given by

$$\begin{aligned} Var \big [\hat{I}\big ]& = \sum _{h=1}^c \bigg \{ \frac{1}{m} q_h P\big (nZ''_{m-1,h} + n \ge mZ'_{n,h} \big )+\frac{m-1}{m} q_h^2 P \big (nZ''_{m-2,h} +2n \ge mZ'_{n,h} \big ) \\&+\frac{1}{n} p_h P\big (nZ''_{m,h} \ge mZ'_{n-1,h} +m \big )+\frac{n-1}{n} p_h^2 P \big (nZ''_{m,h} \ge mZ'_{n-2,h}+2m \big )\\&-2 p_hq_hP\left( nZ''_{m-1,h} +n \ge mZ'_{n-1,h}+m \right) \bigg \}\\&+2\sum _{h=1}^{c-1}\sum _{l=h+1}^c \bigg \{ \frac{m-1}{m}q_hq_l P\big (\{nU''_{m-2,h} +n \ge mU'_{n,h}\}\cap\, \{nU''_{m-2,l} +n \ge mU'_{n,l}\} \big ) \\&+\frac{n-1}{n}p_hp_l P\big (\{nU''_{m,h} \ge mU'_{n-2,h} +m \}\cap\, \{nU''_{m,l} \ge mU'_{n-2,l}+m\} \big )\\&-q_hp_l P\left( \{nU''_{m-1,h} +n \ge mU'_{n-1,h}\}\cap\, \{nU''_{m-1,l} \ge mU'_{n-1,l}+m\} \right) \\&-p_hq_l P\left( \{nU''_{m-1,h} \ge mU'_{n-1,h}+m\}\cap\, \{nU''_{m-1,l}+n \ge mU'_{n-1,l}\}\right) \bigg \} \\&-\left( \sum _{h=1}^c \left\{ P\big (nZ''_{m-1,h}+n \ge mZ'_{n-1,h} \big )(1-p_h)q_h -P\big (nZ''_{m-1,h} \ge mZ'_{n-1,h}+m \big )p_h(1-q_h)\right\} \right) ^2, \end{aligned}$$

where $Z'_{n,h} \sim {\mathcal{B}}(n,p_h)$, $Z''_{m,h} \sim {\mathcal{B}}(m,q_h)$, $(U'_{n,h},U'_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})$ and $(U''_{m,h},U''_{m,l}) \sim {\mathcal{M}}(m, {\mathbf{q}}_{hl})$.

Proof

First, we recall the following representation

$$\begin{aligned} Var[\hat{I}]& = \frac{1}{n^2m^2} Var\left[ \sum _{h=1}^c (nY_h-mX_{h})_+\right] \\& = \frac{1}{n^2m^2} \sum _{h=1}^c Var\big [(nY_h-mX_{h})_+\big ] \\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c Cov\big [(nY_h-mX_{h})_+,(nY_l-mX_{l})_+\big ] \\& = \frac{1}{n^2m^2} \sum _{h=1}^c E\big [\big ((nY_h-mX_{h})_+\big )^2\big ] -\frac{1}{n^2m^2} \sum _{h=1}^c \big (E\big [(nY_h-mX_{h})_+\big ]\big )^2 \\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nY_h-mX_{h})_+ (nY_l-mX_{l})_+\big ] \\&\quad -\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nY_h-mX_{h})_+\big ]E\big [(nY_l-mX_{l})_+\big ] \\& = \frac{1}{n^2m^2} \sum _{h=1}^c E\big [\big ((nY_h-mX_{h})_+\big )^2\big ] \\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nY_h-mX_{h})_+ (nY_l-mX_{l})_+\big ] \\&\quad - \left( \frac{1}{nm} \sum _{h=1}^c E\big [(nY_h-mX_{h})_+\big ]\right) ^2 . \end{aligned}$$

(54)

If a random vector ${\mathbf{X}}=(X_1,\ldots ,X_c)$ is multinomial distributed with parameters n and ${\mathbf{p}}$, the h-th component $X_h$ has marginal binomial distribution with parameters n and $p_h$, while two components $X_h$ and $X_l$ have marginal joint multinomial distribution with parameters n and ${\mathbf{p}}_{hl}=(p_h,p_l,1-p_h-p_l)$. Likewise, if a random vector $\mathbf {Y}=(Y_1,\ldots ,Y_c)$ is multinomial distributed with parameters m and ${\mathbf{q}}$, the h-th component $Y_h$ has marginal binomial distribution with parameters m and $q_h$, while two components $Y_h$ and $Y_l$ have marginal joint multinomial distribution with parameters m and ${\mathbf{q}}_{hl}=(q_h,q_l,1-q_h-q_l)$. Therefore we can rewrite Eq. (54) as follows

$$\begin{aligned} Var [\hat{I}]& = \frac{1}{n^2m^2} \sum _{h=1}^c E\big [\big ((nZ''_{m,h}-mZ'_{n,h})_+\big )^2\big ]\\&\quad +\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c E\big [(nU''_{m,h}-mU'_{n,h})_+ (nU''_{m,l}-mU'_{n,l})_+\big ]\\&\quad - \left( \frac{1}{nm} \sum _{h=1}^c E\big [(nZ''_{m,h}-mZ'_{n,h})_+\big ]\right) ^2, \end{aligned}$$

where $Z'_{n,h} \sim {\mathcal{B}}(n,p_h)$, $Z''_{m,h} \sim {\mathcal{B}}(m,q_h)$, $(U'_{n,h},U'_{n,l}) \sim {\mathcal{M}}(n, {\mathbf{p}}_{hl})$ and $(U''_{m,h},U''_{m,l}) \sim {\mathcal{M}}(m, {\mathbf{q}}_{hl})$, or equivalently as

$$\begin{aligned} Var [\hat{I}] = \frac{1}{n^2m^2}\sum _{h=1}^c F(p_h,q_h)+\frac{2}{n^2m^2}\sum _{h=1}^{c-1}\sum _{l=h+1}^c G(p_h,p_l,q_h,q_l)-\left[ \frac{1}{nm}\sum _{h=1}^c D(p_h,q_h)\right] ^2 \end{aligned}$$

where

$$\begin{aligned} D(p_h,q_h) = E\big [ (nZ''_{m,h} -mZ'_{n,h})_+\big ], \end{aligned}$$

$$\begin{aligned} F(p_h,q_h)=E\big [ \big ((nZ''_{m,h}-mZ'_{n,h})_+\big )^2\big ] \end{aligned}$$

and

$$\begin{aligned} G(p_h,p_l,q_h,q_l) = E\big [(nU''_{m,h}-mU'_{n,h})_+(nU''_{m,l}-mU'_{n,l})_+ \big ]. \end{aligned}$$

Now combining the results from Lemmas 4, 5 and 6 we prove the Proposition. $\square$
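The decomposition of $Var[\hat{I}]$ into $D$, $F$ and $G$ used in this proof can be checked against a direct computation of the variance. The Python sketch below is an illustration under stated assumptions, not the authors' code: all names and parameter values are hypothetical, $D$, $F$ and $G$ are evaluated as plain expectations under the binomial and trinomial marginals (not through the closed forms of Lemmas 4, 5 and 6), and the full enumeration is feasible only for tiny $n$, $m$ and $c$.

```python
from math import factorial, prod
from itertools import product

def multinomial_outcomes(N, probs):
    """Yield (counts, probability) for every count vector of a multinomial M(N, probs)."""
    for counts in product(range(N + 1), repeat=len(probs)):
        if sum(counts) == N:
            coef = factorial(N)
            for k in counts:
                coef //= factorial(k)
            yield counts, coef * prod(pk**k for pk, k in zip(probs, counts))

def expect(f, n, p_sub, m, q_sub):
    """E[f(x, y)] for independent x ~ M(n, p_sub) and y ~ M(m, q_sub)."""
    return sum(px * py * f(x, y)
               for x, px in multinomial_outcomes(n, p_sub)
               for y, py in multinomial_outcomes(m, q_sub))

def pos(v):
    """Positive part (v)_+."""
    return max(v, 0)

def var_direct(n, m, p, q):
    """Var[I_hat] from the full multinomial distributions of X and Y."""
    i_hat = lambda x, y: sum(min(m * a, n * b) for a, b in zip(x, y)) / (n * m)
    e1 = expect(i_hat, n, p, m, q)
    e2 = expect(lambda x, y: i_hat(x, y) ** 2, n, p, m, q)
    return e2 - e1 ** 2

def var_decomposed(n, m, p, q):
    """(sum F + 2 sum G) / (nm)^2 - (sum D / (nm))^2 using only marginal distributions."""
    D = F = G = 0.0
    c = len(p)
    for h in range(c):
        ph, qh = (p[h], 1 - p[h]), (q[h], 1 - q[h])
        D += expect(lambda x, y: pos(n * y[0] - m * x[0]), n, ph, m, qh)
        F += expect(lambda x, y: pos(n * y[0] - m * x[0]) ** 2, n, ph, m, qh)
        for l in range(h + 1, c):
            phl = (p[h], p[l], 1 - p[h] - p[l])
            qhl = (q[h], q[l], 1 - q[h] - q[l])
            G += expect(lambda x, y: pos(n * y[0] - m * x[0]) * pos(n * y[1] - m * x[1]),
                        n, phl, m, qhl)
    return (F + 2 * G) / (n * m) ** 2 - (D / (n * m)) ** 2

# Illustrative profiles with c = 4 classes (assumed values, not from the paper).
p = (0.4, 0.3, 0.2, 0.1)
q = (0.25, 0.25, 0.25, 0.25)
print(var_direct(3, 4, p, q), var_decomposed(3, 4, p, q))
```

Using four classes makes the marginalisation non-trivial: each pair $(X_h,X_l)$ is handled through its own three-cell multinomial rather than through the full vector.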

About this article

Cite this article

Ärje, J., Choi, KP., Divino, F. et al. Understanding the statistical properties of the percent model affinity index can improve biomonitoring related decision making. Stoch Environ Res Risk Assess 30, 1981–2008 (2016). https://doi.org/10.1007/s00477-015-1202-6

Published: 16 January 2016
Issue Date: October 2016
DOI: https://doi.org/10.1007/s00477-015-1202-6
