Introduction

In Europe, the issue of forest decline emerged as the major environmental concern of the 1980s (e.g. Innes 1993). As a consequence, the ‘International Co-operative Programme on Assessment and Monitoring of Air Pollution Effects on Forests’ (ICP Forests) was established in 1985 by the Economic Commission for Europe of the United Nations under the Convention on Long-Range Transboundary Air Pollution (Innes 1993). Within the framework of the ICP Forests Programme, efforts were made to widely harmonise and standardise methods for forest monitoring throughout Europe. The methods were recorded in the ICP Forests manual (UNECE 2010) that was first published in 1985 and has continuously been subject to updates since its publication.

The forest condition survey represents an essential part of the forest monitoring, which was described in Part IV of the ICP Forests manual (Eichhorn et al. 2010) and became mandatory throughout the European Union in 1987 (Redfern and Boswell 2004; Solberg and Strand 1999). The survey on forest condition has been conducted annually on the systematic wide-scale monitoring plots (Level I), which were established wherever forest coincided with a 16 × 16 km grid over Europe, as well as on plots of the intensive monitoring programme (Level II; since the 1990s) (Eichhorn et al. 2010; Ferretti et al. 1999; Innes 1993). The forest condition survey according to ICP Forests has taken place annually in Western Germany since 1984 and in the whole Federal Republic of Germany since 1990 (BMELV 2012). The 16 German federal states (including three city states) are responsible for the field assessment of the forest condition survey and publish the results for the federal states annually. Within the federal states, grid densifications are common. The evaluation of the 16 × 16 km grid data for the whole of Germany (415 plots in 2012; ranging from 4 plots per state (Berlin) to 96 plots (Bavaria), and no plots for the city states Bremen and Hamburg) is carried out by the Thünen Institute of Forest Ecosystems and the results are finally published by the Federal Ministry for Food, Agriculture and Consumer Protection, which represents the National Focal Centre of Germany (BMELV 2012; Eichhorn et al. 2010).

The forest condition survey is based on the assessment of defoliation (e.g. Arbeitsgruppe AG Diagnose und Klassifizierung der neuartigen Waldschäden 1984; Eichhorn et al. 2010; Innes et al. 1993), which is the most widely used indicator for assessment of tree condition (Ferretti 1997; Ghosh et al. 1995; Innes et al. 1994). Defoliation is assessed visually using binoculars and a scoring system of 5 % classes (e.g. Durrant and Boswell 2002; Eichhorn et al. 2010). Consistency and reproducibility of defoliation data have frequently been the focus of scientific criticism due to subjectivity in the visual assessment (Ferretti 1998; Innes 1988; Innes 1993; Schöpfer 1985a). Consistency of observations, however, is of major importance for spatial and temporal data evaluations. Several studies conducted in different European countries during the 1980s and early 1990s revealed a poor level of reproducibility and significant variations among observer teams (training courses) and between observer and control teams (field check) (Ferretti et al. 1999; Ghosh et al. 1995; Innes et al. 1993; Kandler and Innes 1995; Köhl et al. 2000; Solberg and Strand 1999). In addition, Klap et al. (2000) found the factor ‘country’ to be the most statistically significant predictor for European defoliation data whereas Seidling (2004) detected the factor ‘federal states’ as the most significant predictor for defoliation data of the 89 German Level II plots. Despite criticism, an objective and feasible alternative method for the determination of defoliation is not yet available. The image analysis system CROCO has, however, been recommended for verification of visual assessments (Mizoue 2002; Nakajima et al. 2011). Hence, main emphasis has been placed on the quality assurance (QA) programme to improve and to document the consistency and reproducibility of visually assessed defoliation data (e.g. Ferretti et al. 1999). Regular training courses and further QA procedures (e.g. field checks) are believed to remove a great amount of subjectivity and variation among individual observers (Innes 1993; Köhl 1991; Schöpfer 1985b). For instance, a rapid and steady improvement of defoliation data consistency in Italy following the adoption of the QA programme was reported by Bussotti et al. (2009).

Thus, the present study investigated the defoliation data from the annual training courses in Germany from 1992 to 2012. The aim was (1) to evaluate the consistency of defoliation assessments over time and (2) to determine possible tree species-specific differences in the consistency.

Materials and methods

Procedure of the national training courses

The national training course has taken place annually in June since 1984 before the training courses of the federal states, which in turn have taken place immediately before the field surveys in July and August. The four most frequent tree species of Germany are investigated, namely, beech (Fagus sylvatica), oak (Quercus robur and Quercus petraea), Norway spruce (Picea abies), and pine (Pinus sylvestris). The participating observer teams consist of at least one representative who conducts the forest condition survey or who is responsible for the training course in its respective federal state. The training courses aim at eliminating differences in the assessments of defoliation among the observer teams of different federal states in order to obtain consistent data in terms of spatial and temporal comparability within Germany during the forest condition survey. Therefore, interfering factors such as the social position of trees, tree species, and visibility of the crown are kept constant during the course (Köhl 1993). Easily visible trees are generally chosen, although good visibility does usually not reflect the real condition during the forest condition survey. Moreover, the assessment occurs from a fixed observation point. Before the assessment, five trees of each species are jointly assessed and discussed. Subsequently, ten trees are independently assessed by the observer teams (first round). The results are recorded and distinct discrepancies among the teams are discussed following the first round. Finally, a second round where another ten trees are independently assessed is carried out. Discrepancies among the teams are discussed again. Over the years, the procedure of the training courses has changed. The second round including the discussion in between has been obligatory since 2011 but had already been performed in earlier years, e.g. in 2008 for beech and oak. 
In general, ten to twenty trees were assessed per tree species; however, the number of assessed trees ranged from ten to one hundred (Table 1). During the first years, assessment was not necessarily performed from the same observation point. In 1997, the guidelines for assessment (internal guideline of the AG Dauerbeobachtungsflächen – Waldschäden published in Dammann et al. 2001) as well as the reference book for defoliation ‘Waldbäume – Bilderserien zur Einschätzung von Kronenverlichtungen bei Waldbäumen’ (Arbeitsgruppe AG Diagnose und Klassifizierung der neuartigen Waldschäden 1984; Meining et al. 2007) were introduced. The age of trees was not recorded before 2011, but was over 60 years for most trees. Furthermore, the observer teams of the different federal states have partly changed over time; however, this was not recorded. In some cases, two separate persons representing two different federal states started as a joint team (Berlin and Brandenburg in most years, Bremen and Hamburg in earlier years, and Baden-Württemberg and Bavaria in 2003). The number of persons per team varied as well and was not noted before 2011. Generally, two persons formed a team, but sometimes up to five persons or, in some cases, an individual person represented one federal state. Additionally, experience in defoliation assessment varied among the observers, from persons who had taken part in the courses since the beginning in the 1980s to inexperienced beginners. The annual training courses have usually taken place in one federal state for two consecutive years (Table 1). Sites are selected so as to reflect a wide range of defoliation levels, but the focus is on trees with intermediate defoliation because their defoliation has been reported to be more difficult to assess than that of healthy or heavily damaged trees (e.g. Solberg and Strand 1999). For 1993, no data were available for oak due to strong pest infestation and for 1996 and 1997, data were unrecoverable.

Table 1 Year, federal state, location, number of assessed trees (trees), and number of participating observer teams (observers) at the national training courses given for the main tree species

Observer errors

Potential errors that may occur during surveys are sampling and non-sampling errors (Kish 1995). Sampling errors can be attributed to the fact that the sample does not represent the entirety of the population and can therefore be reduced by increasing the sample size provided that a probabilistic sampling design is adopted (Kish 1995; Schöpfer 1985b). Non-sampling errors include the observer error, which results from the visual assessment of defoliation (Kish 1995; Schöpfer 1985b) and is the error addressed in the present study. The observer error has to be considered in addition to the sampling error when regarding errors of defoliation assessments during the forest condition survey (e.g. Gertner and Köhl 1995). According to Cochran (1977) and Kish (1995), the total error, which is usually given as mean-squared error (MS), is defined as the sum of the squared variable (random) error and the squared bias (systematic error):

$$ MS=\mathrm{total}\ \mathrm{error}={\left(\mathrm{variable}\ \mathrm{error}\right)}^2+{\left(\mathrm{bias}\right)}^2 $$
(1)

Precision refers to the random error (size of deviation of the estimated value from the sample mean value) whereas accuracy refers to the total error including the bias (size of deviation of the estimated value from the true mean value) (Cochran 1977; Kish 1995). Therefore, an estimate may be precise but biased. The observer error is defined as difference between the true value of defoliation and the assessed value (Gertner and Köhl 1995; Wulff 2002). Hence, random as well as systematic components can be included in the observer error (Köhl et al. 2000). Systematic errors can be ascribed to several sources of error such as different definitions of the assessable crown, weather conditions, flowering of pine trees, as well as the observer's style of assessment (Schöpfer 1985b; Solberg and Strand 1999). Calculation of the observer error is not trivial, since the true value of defoliation is unknown. In studies like the present one, it is not possible to estimate the true errors and the accuracy of defoliation assessments. When it is assumed that the arithmetic mean of defoliation assessments of all observer teams regarding an individual tree is the unbiased true value, then the mean absolute deviation of defoliation assessments represents an estimator for the observer error.
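The decomposition in Eq. 1 can be illustrated with a short Python sketch. It assumes a known true defoliation value, which is not available in studies like the present one, so the function name `mse_decomposition` and all numbers are hypothetical and serve only to show how the random and systematic parts add up.

```python
from statistics import fmean


def mse_decomposition(estimates, true_value):
    """Split the mean-squared error of `estimates` into variance and bias.

    `true_value` is assumed to be known here purely for illustration.
    """
    mean_estimate = fmean(estimates)
    # Systematic error: distance of the average estimate from the true value.
    bias = mean_estimate - true_value
    # Random error: spread of the estimates around their own mean.
    variance = fmean([(e - mean_estimate) ** 2 for e in estimates])
    # Total error: spread of the estimates around the true value.
    mse = fmean([(e - true_value) ** 2 for e in estimates])
    return mse, variance, bias


# Hypothetical defoliation scores (%) of five teams for one tree.
scores = [30, 35, 40, 35, 45]
mse, variance, bias = mse_decomposition(scores, true_value=30)
print(mse, variance, bias ** 2)
```

In this made-up case the estimates are fairly precise (small variance) but biased, which is the situation described above as "precise but biased".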

Statistical analyses

In the present study, the consistency of defoliation data was evaluated. Therefore, deviations, correlations, and agreements among observer teams (federal states) were examined and variance components were estimated. The defoliation data are scored in 5 % classes and thus are pseudo-continuous.

The absolute deviation of defoliation assessments between the observer teams and the arithmetic mean of all observer teams was calculated as well as the standard deviation of defoliation assessments, which corresponded to the standard deviation of the deviation. At site level (one tree species per year), the number of degrees of freedom was corrected because the assessments of different teams on an individual tree were not independent.
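The deviation measures described above can be sketched in Python as follows. This is a minimal illustration, not the original R code: the function name `site_summary` and the data are hypothetical, and the degrees-of-freedom correction is assumed to be n(k − 1) for n trees and k teams, because the deviations of the k teams sum to zero on each tree.

```python
from math import sqrt
from statistics import fmean


def site_summary(site):
    """Summarise one site (one tree species in one year).

    `site` is a list of trees; each tree is a list with one defoliation
    score (%) per observer team. All trees need the same number of teams.
    """
    deviations = []
    for scores in site:
        tree_mean = fmean(scores)  # stands in for the unknown true value
        deviations.extend(score - tree_mean for score in scores)

    n_trees, n_teams = len(site), len(site[0])
    mean_abs_dev = fmean([abs(d) for d in deviations])
    # Assumed correction: one degree of freedom is lost per tree.
    df = n_trees * (n_teams - 1)
    std_dev = sqrt(sum(d * d for d in deviations) / df)
    within_5 = sum(abs(d) <= 5 for d in deviations) / len(deviations)
    within_10 = sum(abs(d) <= 10 for d in deviations) / len(deviations)
    return {"mad": mean_abs_dev, "sd": std_dev,
            "within_5": within_5, "within_10": within_10}


# Hypothetical scores: three trees, three observer teams.
site = [[20, 25, 30], [40, 40, 55], [10, 15, 5]]
summary = site_summary(site)
print(summary)
```

The shares within ±5 % and ±10 % are included because they are reported alongside the deviations in the Results.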

A two-way analysis of variance (ANOVA) was performed in order to estimate the variance components as well as systematic and random errors (Wulff 2002). The observer teams were assumed to be a random selection of a population of possible observers, some of which will later carry out the forest condition survey (Bravo and Potvin 1991; Wulff 2002). Additionally, the precision of assessment was assumed to be the same for all k observer teams, i.e. $\sigma_1^2=\dots=\sigma_k^2=\sigma_E^2$ (Köhl 1993; Wulff 2002). The general model was described by Bravo and Potvin (1991) and Wulff (2002):

$$ {X}_{ij}=\mu +{S}_i+{O}_j+{E}_{ij}, $$
(2)

where $X_{ij}$ is the assessment by observer team j on tree i, $\mu$ is the grand mean of all estimations, $S_i$ is the tree effect, $O_j$ is the observer bias for observer team j, and $E_{ij}$ is the random error of the assessment by observer team j on tree i. The bias is assumed to be the same on all kinds of trees. The sum $\mu + S_i$ denotes the true value; in practice, it is taken to be the arithmetic mean of all assessments of tree i. The sum $O_j + E_{ij}$ denotes the error term, which is separated into systematic and random error components. The estimates of the corresponding variance ($s^2$) components are (Bravo and Potvin 1991; Wulff 2002):

$$ {s}_E^2=M{S}_E $$
(3)
$$ {s}_O^2=\frac{M{S}_O-M{S}_E}{n}, $$
(4)

where n is the number of trees. It was tested whether observer teams differed significantly from the average defoliation assessment ($H_0$: $\sigma_O^2 = 0$).
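As a hedged illustration of Eqs. 2 to 4, the following Python sketch computes the mean squares of a two-way ANOVA without replication and the two variance components. It assumes a complete table with one assessment per tree and team; the function names and the data are hypothetical. The F ratio for the observer effect is returned as well, but no P value is computed.

```python
def two_way_mean_squares(x):
    """Mean squares for a complete trees-by-teams table x[i][j]."""
    n, k = len(x), len(x[0])  # n trees, k observer teams
    grand = sum(sum(row) for row in x) / (n * k)
    tree_means = [sum(row) / k for row in x]
    team_means = [sum(x[i][j] for i in range(n)) / n for j in range(k)]

    ss_trees = k * sum((m - grand) ** 2 for m in tree_means)
    ss_teams = n * sum((m - grand) ** 2 for m in team_means)
    ss_error = sum(
        (x[i][j] - tree_means[i] - team_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_s = ss_trees / (n - 1)
    ms_o = ss_teams / (k - 1)
    ms_e = ss_error / ((n - 1) * (k - 1))
    return ms_s, ms_o, ms_e


def variance_components(ms_o, ms_e, n):
    """Random (Eq. 3) and systematic (Eq. 4) variance components."""
    s2_e = ms_e
    s2_o = (ms_o - ms_e) / n
    return s2_e, s2_o


# Hypothetical defoliation scores (%): three trees, two observer teams.
table = [[10, 20], [30, 40], [50, 72]]
ms_s, ms_o, ms_e = two_way_mean_squares(table)
s2_e, s2_o = variance_components(ms_o, ms_e, n=len(table))
f_ratio = ms_o / ms_e  # test statistic for the observer effect
print(ms_s, ms_o, ms_e, s2_e, s2_o, f_ratio)
```

With real data, the estimate of the between-observer variance can become negative when MS_O is smaller than MS_E; how such cases were handled is not stated in the text.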

Correlations between the observer teams were examined using the Pearson correlation coefficient r and agreements among observers were examined using the intraclass correlation coefficient (ICC) (Bravo and Potvin 1991):

$$ ICC=\frac{n\left(M{S}_S-M{S}_E\right)}{ nM{S}_S+ kM{S}_O+\left( nk-n-k\right)M{S}_E}, $$
(5)

where k is the number of observer teams.
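Eq. 5 translates directly into code. The sketch below is illustrative only: the function name `icc` and the mean squares passed to it are hypothetical values, not results from the training courses.

```python
def icc(ms_s, ms_o, ms_e, n, k):
    """Intraclass correlation coefficient following Eq. 5.

    ms_s, ms_o, ms_e: mean squares for trees, observer teams, and error.
    n: number of trees; k: number of observer teams.
    """
    numerator = n * (ms_s - ms_e)
    denominator = n * ms_s + k * ms_o + (n * k - n - k) * ms_e
    return numerator / denominator


# Hypothetical mean squares for three trees and two observer teams.
agreement = icc(ms_s=1064.0, ms_o=294.0, ms_e=24.0, n=3, k=2)
print(round(agreement, 3))
```

Because MS_O enters the denominator, systematic differences among teams lower the ICC, whereas the Pearson correlation coefficient is insensitive to a constant offset between two teams.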

Furthermore, analysis of variance combined with Tukey's HSD test was used for multiple comparisons of means. In the case of non-normality of data, the non-parametric Kruskal–Wallis H test combined with the multiple comparison test by Castellan and Siegel (1988) was used. The non-parametric Mann–Whitney U test was used for comparisons between two groups. In addition, simple linear regressions were performed. In the case of heteroscedasticity, the generalised least squares method with a variance function was applied. Normality of residuals and homogeneity of variances were tested prior to all statistical analyses. Statistical significance was stated at P ≤ 0.05. The whiskers of the boxplots correspond to 1.5 times the interquartile distance. Evaluation and visualisation of data were performed using R 2.15.1 (R Development Core Team 2012).
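For readers who want to reproduce a temporal trend check, the following Python sketch fits a simple linear regression by ordinary least squares. It is not the original R analysis and does not cover the generalised least squares variant; the function name `ols_fit` and the yearly values are hypothetical.

```python
def ols_fit(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination from residual and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical mean absolute deviations (% defoliation) for four years.
years = [2009, 2010, 2011, 2012]
deviations = [6.0, 5.0, 5.0, 4.0]
slope, intercept, r_squared = ols_fit(years, deviations)
print(slope, r_squared)
```

A negative slope corresponds to a decreasing deviation over time; a significance test of the slope would additionally require its standard error and the t distribution, which are omitted here.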

Results

The absolute deviation from the arithmetic mean at site level averaged 4.4 % defoliation (3.2–7.1 %) from 1992 to 2012, and no significant differences were observed among the tree species. The absolute deviations of oak (P = 0.015) and spruce (P = 0.017) decreased from 1992 to 2012 (Fig. 1). In 2011 and 2012, for example, the mean absolute deviations for all tree species were 3.5 and 3.6 %, respectively. The highest absolute deviations were observed in 2008 (7.1 %) and 2009 (5.8 %) at pine sites. On average, 93.4 (spruce) to 95.2 % (oak) of the assessments from 1992 to 2012 were located within the ±10 % interval of deviation from the mean (Fig. 2). Assessments outside this interval were detected at all sites in every year. The proportion of assessments within the interval increased over the years for oak (P = 0.001) and spruce (P = 0.02). In 2011, 94.0 (beech) to 99.5 % (spruce) and in 2012, 96.0 (pine) to 99.0 % (oak) of the assessments lay within this interval. Moreover, 72.0 (beech) to 86.6 % (pine) of the assessments were located within the ±5 % interval of deviation in 2011 and 70.5 (pine) to 84.0 % (oak) of assessments in 2012. The lowest proportion of assessments within the ±5 % interval (48.5 and 56.4 %) and ±10 % interval of deviation (78.5 and 85.5 %) were observed at the pine sites in 2008 and 2009. The introduction of two rounds during the training courses had no significant effect on the error of assessment in beech, oak, and spruce stands considering the years 2008 (only beech and oak), 2011, and 2012 (Fig. 3). The range of deviation even tended to increase from the first to the second round. For pine, the absolute deviation, however, decreased from the first to the second round (P < 0.001) (Fig. 3). This decrease occurred in 2011 as well as in 2012.

Fig. 1

Mean absolute deviation of defoliation assessments from the arithmetic mean given for the beech sites (top, left side), the oak sites (top, right side), the spruce sites (bottom, left side), and the pine sites (bottom, right side) for the years 1992 to 2012. The lines and parameters of significant (oak, spruce) and insignificant (beech, pine) linear regressions are indicated

Fig. 2

Proportion of defoliation assessments that was within the 0 to ±5 % interval of deviation from the arithmetic mean of all assessments of an individual tree (0 to ±5 %) and within the interval of more than ±5 to ±10 % deviation (> ±5 to ±10 %). The proportions are representative for beech (top), oak (upper middle), spruce (lower middle), and pine (bottom) from 1992 to 2012. For 1993, no data were available for oak due to strong pest infestation, and for 1996 and 1997, data were unrecoverable

Fig. 3

Absolute deviation of defoliation assessments from the arithmetic mean of individual trees given for the first and second round during the training courses in 2008 (beech, oak), 2011, and 2012. Data are presented separately for beech (top, left side), oak (top, right side), spruce (bottom, left side), and pine (bottom, right side). The number of trees per round is indicated by ‘n’. Significant differences between the first and second round are represented by different letters (Mann–Whitney U test)

The maximal positive and negative deviations at individual trees were 34.7 and −47.8 % defoliation both of which were observed at the pine sites in 1995. The maximal deviations at individual oak (−25.0 %) and spruce (32.3 %) trees were also found in 1995 whereas the maximal deviation of beech trees (30.7 %) was observed in 2004. Deviations greater than ±20 % were rarely observed and accounted for less than or equal to 1 % of the total assessments of one site except for the pine sites in 1995 and 2008 and the oak site in 1995 where these deviations accounted for up to 3 % of the total assessments.

The average standard deviation at site level was 5.5 % defoliation ranging from 3.9 to 8.6 %. As with the absolute deviation, no significant differences occurred among tree species and the standard deviation of oak (P < 0.005) and spruce (P < 0.026) decreased from 1992 to 2012 (Fig. 4). Mean standard deviations for all tree species in 2011 and 2012 were 4.4 and 4.5 % defoliation, respectively.

Fig. 4

Mean standard deviation of defoliation assessments given for the beech sites (top, left side), the oak sites (top, right side), the spruce sites (bottom, left side), and the pine sites (bottom, right side) for the years 1992 to 2012. The lines and parameters of significant (oak, spruce) and insignificant (beech, pine) linear regressions are indicated

Systematic errors in assessed defoliations that were attributed to the observer teams were detected at all tree species and in nearly every year (Table 2). However, few consistent temporal or species-specific patterns regarding systematic errors among observer teams were determined. The variance among observers of oak decreased from 1992 to 2012 (P = 0.036, R² = 0.26). The variance among observers (systematic error) was approximately one fifth of the variance of the random error regarding every year between 1992 and 2012 (Table 2). The average ICC was 0.83 and ranged from 0.52 (pine 1994, spruce 2001) to 0.97 (pine 2005, oak 2012) (Table 2). No temporal or species-specific trends were observed. In 2011 and 2012, the mean ICCs were 0.89 and 0.91 and the Pearson correlation coefficients r were 0.91 (0.71–0.98) and 0.93 (0.73–0.99), respectively. For both years, ICC and r displayed the highest values for oak trees. The Pearson correlations were significant among all combinations of observer teams (P < 0.001).

Table 2 Results of the two-way analysis of variance and the ICC for the defoliation assessments on beech, oak, spruce, and pine at the national training courses from 1992 to 2012

The four tree species differed in the frequency of the level of defoliation (Fig. 5). Beech trees showed the highest mean defoliation with 37 % whereas pine trees showed the lowest mean defoliation with 25 %. The frequency distributions of defoliation of all tree species displayed a distinct skewness to the right and defoliations of more than 60 % were rarely present. The absolute deviation from the mean significantly and non-linearly depended on the defoliation level (P < 0.001) (Fig. 6), being highest at intermediate levels of defoliation.

Fig. 5

Frequency distribution of defoliation levels for beech (top, left side), oak (top, right side), spruce (bottom, left side), and pine (bottom, right side) including all data from 1992 to 2012. The number of assessed trees is indicated by ‘n’

Fig. 6

Relationship between absolute deviation of defoliation assessments and defoliation level of individual trees. The regression equation is y = −0.00007 x² + 0.074 x + 2.56 (P < 0.001; residual df = 1,344) and was derived from generalised least squares using a variance function

Discussion

Temporal changes in the observer error

The mean absolute deviation of 4.4 % defoliation, which may be used as an estimate for the observer error, assuming that the arithmetic mean of all assessments of an individual tree represents the unbiased true value, as well as the mean standard deviation of 5.5 % defoliation were comparable to or lower than deviations reported in the literature. Additionally, both measures displayed a decreasing trend from 1992 to 2012 for the four tree species, which was statistically significant for oak and spruce sites. In comparison, Schöpfer (1985b) estimated a higher observer error of ±8 % defoliation at the national training courses in 1983 and 1984. The mean deviation that was calculated among three European countries at a training course in 1989 was higher as well (deviation between two countries) (Innes et al. 1993). Standard deviations of single observations of control surveys (field checks) in Sweden amounted to 4.7–12.6 % defoliation between 1995 and 1999 (Wulff 2002). Solberg and Strand (1999) estimated a standard deviation of 10 % for single trees and of 5 % for plot means from field checks in Norway between 1990 and 1995. The reported standard deviations are similar but not directly comparable to our results since the reported ones were derived from pairwise assessments. In general, a deviation of ±10 % defoliation from a reference value is an acceptable limit of deviation (e.g. Innes 1988; Köhl 1991). In the present study, 94 % of the assessments were located within these limits and the proportion of assessments within the limits displayed an increasing trend, which was significant for oak and spruce. In fact, pronounced deviations (more than ±20 % defoliation) from the mean were not observed during the last three years.
Proportions determined during the national training courses in Italy and the South European training courses between 1996 and 2004 were slightly lower, ranging from 80 to 90 %, and a marginal increase in the proportion was observed over time (Bussotti et al. 2009). The proportions calculated in the present study are high in comparison to those reported in the literature from the second assessments by a control team during field checks. Innes et al. (1994) reported from a field check in Switzerland in 1993 that the quality limits had to be broadened to ±15 % defoliation to achieve an acceptable result with more than 90 % of the assessments lying within this interval. Innes (1993), Ferretti et al. (1999), and Solberg and Strand (1999) obtained similar results during field checks in Great Britain (1988), Italy (1996), and Norway (1990–95), respectively. In contrast, Bussotti et al. (2009) found that at least 90 % of the assessments during the field control in Italy from 1999 to 2004 fell within ±10 % of the reference team. More recently, the prescribed aim (data quality limit) of ICP Forests for field checks during the forest condition survey was set such that at least 70 % of assessments must not deviate more than ±10 % from one another (Eichhorn et al. 2010). The results at hand suggest introducing a data quality limit for national and international training courses requiring that 90 % of the assessments have to range within ±10 % from the reference value (here arithmetic mean of all assessments on an individual tree).

The observed trend towards more consistent assessments among observer teams may be ascribed to several changes over time. In particular, the introduction of the guidelines for assessment (internal guideline of the AG Dauerbeobachtungsflächen – Waldschäden published in Dammann et al. 2001) and the defoliation reference book in 1997 (Arbeitsgemeinschaft AG Dauerbeobachtungsflächen der Länder und des Bundes 1997; Meining et al. 2007) may have improved the assessments, although no abrupt improvement occurred following the introduction. Additionally, the definition of the observation point for assessment probably led to greater consistency. A positive effect of the introduction of two rounds was shown for pine. Maintenance of the achieved consistency during the forest condition survey, however, is of great importance. The positive effect was not observed for beech, oak, and spruce where the discussion in between possibly led to an increased uncertainty in the assessment. Since mandatory second rounds have been introduced only recently, no final conclusion can be made whether this implementation represents an improvement for harmonisation.

Explanation for outliers

Occasionally, deviations of more than ±20 % defoliation were determined at individual trees of all tree species but the cases where deviations of more than ±20 % accounted for slightly more than 1 % of total assessments occurred only at three sites and thus played a negligible role during the training courses. The most pronounced deviations and the highest proportion of deviations outside the ±10 % interval of deviation were observed at pine trees in 1995, 2008, and 2009. Innes et al. (1993) reported similar high deviations of ±45 % defoliation at individual trees, which were observed at a training course among three European countries in 1989. Innes et al. (1993) and Wulff (2002) also mentioned difficulties in the defoliation assessment of pine trees. However, despite difficulties in 1995, 2008, and 2009, pine trees did not differ from the other species over time in the present study. In these years, the training courses were carried out in Bavaria. According to the participants, the selected pine trees showed an uncommon type of growth for pine trees in Germany. It was not possible to classify the defoliation using the reference book due to the special type of growth. However, these errors are negligible for the forest condition surveys where this growth type is very rare.

Common reasons for frequent occurrence of deviations of more than ±10 % on individual trees were difficulties in setting the assessable crown or anomalies such as crown damage, substitution crowns, and uncommon growth types (the information was derived from participants but was not recorded in the past). In spite of high consistencies among the observer teams in 2011 and 2012, the assessments on beech and pine displayed higher deviations compared to the other species in both years. Both tree species showed an extraordinary fructification in 2011, which in the case of beech was additionally accompanied by notably small-sized leaves, which caused the comparatively high uncertainties in 2011. However, a systematic influence of fructification on defoliation assessment could not be observed between 1992 and 2012. In 2012, the current-year needle set of the pine trees was still lacking, as the training course took place in early June. The absence of the current-year needle set increased the level of error because the observers had to imagine the needle set. This error, however, is unimportant for the forest condition survey, which takes place later in the year when the current-year needles are developed.

Systematic and random errors

Systematic errors among observer teams were found in nearly every year for the four tree species and have frequently been reported from studies investigating the quality of defoliation assessment (Ghosh et al. 1995; Innes 1993; Mues and Seidling 2003; Solberg and Strand 1999; Wulff 2002). Systematic errors resulting from different weather conditions could be excluded because all assessments took place at the same time and were additionally conducted from the same observation point. Systematic temporal or species-specific patterns regarding significant differences among observer teams (federal states) were hardly observed, probably because the composition of the observer teams changed over time. However, in several years, two of the federal states showed systematic deviations for one tree species each. In one case, the corresponding tree species is rarely represented in the respective federal state and plays no role for the forest condition survey. The forest condition survey of the federal state that deviated the most over the years and for the four tree species has meanwhile been taken over by a cross-federal states institution. Under the cross-federal states institution, the survey is conducted by the same teams as before but systematic errors have not been observed so far.

Between-observer variances were lower than variances of the random error (within-observer variances). The variances were comparable to variances given for observer teams at the national training courses in Sweden in 1995–1999 (Wulff 2002). In 2011 and 2012, variances were in general low as compared to earlier years. The fructification and small-sized leaves of beech in 2011 resulted in comparably high uncertainties in the assessments due to a relatively high random error whereas a systematic error was not observed. In contrast, the absence of the actual needle set at pine trees in 2012 resulted in a comparably high systematic error, which however was not relevant to the survey as already mentioned.

In spite of significant systematic errors, correlations and agreements among the observer teams were extraordinarily high. Agreements were slightly higher for deciduous trees than for coniferous trees. To evaluate whether assessments of one year and one species were consistent, we applied a three-way evaluation approach. Assessments were considered inconsistent if (1) the mean absolute deviation from the mean was more than ±5 %, (2) less than 90 % of the assessments lay within the ±10 % interval of deviation from the mean, and (3) significant systematic differences among observer teams existed. The present study demonstrated that defoliation assessments at the national training courses were consistent with the exception of the assessments at the beech site in 2005, at the spruce site in 1993, and at the pine sites in 1995, 2008, and 2009. The inconsistency at the pine sites in Bavaria can be attributed to the growth type (see above) whereas reasons for the inconsistency at the beech and spruce sites could not be reconstructed. Results from the international cross-comparison courses in Europe in 2001 and 2002 also indicated consistent assessments among teams from one country (Mues and Seidling 2003). Despite consistent defoliation assessments among the German federal states, temporal and spatial evaluations of defoliation data from the forest condition survey should focus on pronounced alterations due to between-observer variance and, particularly, due to often considerable within-observer variance. Solberg and Strand (1999) as well as Wulff (2002) made similar statements for the Norwegian and Swedish forest condition survey.
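The three-way evaluation approach can be written as a small decision rule. The Python sketch below reads the criteria literally, i.e. a site is flagged as inconsistent only if all three conditions hold at once; this reading, the function name `is_inconsistent`, and the flag for significant systematic differences in the example calls are assumptions made for illustration.

```python
def is_inconsistent(mean_abs_dev, share_within_10, systematic_significant):
    """Apply the three criteria to one site (one species in one year).

    mean_abs_dev: mean absolute deviation from the mean (% defoliation).
    share_within_10: fraction of assessments within the +/-10 % interval.
    systematic_significant: True if observer teams differed significantly.
    """
    return (
        mean_abs_dev > 5.0
        and share_within_10 < 0.90
        and systematic_significant
    )


# Deviation and share as reported for the pine site in 2008; the
# significance flag is assumed here.
pine_2008 = is_inconsistent(7.1, 0.785, True)
# Hypothetical site with a small deviation and a high share within +/-10 %.
example_site = is_inconsistent(3.5, 0.96, True)
print(pine_2008, example_site)
```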

Dependence of the absolute deviation on the level of defoliation

The distribution of defoliations was right-skewed and displayed slightly higher mean defoliations for the four tree species compared to the nationwide distribution (BMELV 2012). The observed non-linear dependence of the absolute deviation on the degree of defoliation was expected for assessments within an interval and was in line with results from other studies (e.g. Solberg and Strand 1999). Although average defoliation levels of the deciduous tree species were slightly higher than those of the coniferous tree species and hence supposedly more difficult to assess, this did not have a great effect on the absolute deviation. The average deviations ranged within one ±5 % class for all defoliation levels, and therefore, the dependency of the deviation on the defoliation level does not appear to be of critical importance for the quality assessment within the framework of the training courses. It should be noted that the determined relationship between observer error and defoliation level in the present study was based on unbalanced data and it may be worthwhile to investigate this relationship using equal sample sizes of all defoliation levels.

Conclusions

The present study demonstrated that the visual defoliation assessment produced consistent results within Germany at the national training courses for the forest condition survey from 1992 to 2012. Significant tree species-specific differences in the deviations were not observed but the assessment of deciduous trees tended to be slightly more consistent than the assessment of coniferous trees. In large part, pronounced deviations that were observed at the courses (e.g. assessments on pine trees in 1995, 2008, and 2009) are probably of little relevance for the national forest condition survey. Assuming similar assessment behaviours of the observers, a similar distribution of defoliation levels, and that the mean value of all observations on an individual tree represents the unbiased true value, then an observer error of one ±5 % class (absolute deviation of ±4.4 % defoliation from the mean) has to be considered in addition to the sampling error during the forest condition survey. The true bias could, however, not be calculated in the present study. In order to ensure that assessment behaviours do not drift apart subsequent to the training course and that the results can be generalised to the forest condition survey, an intercomparison course during the forest condition survey at which the federal states have to assess trees without previous consultation may be an appropriate measure. In addition, considerable temporal and spatial alterations should be the main focus of interest of the national defoliation assessments rather than short-term trends due to random and systematic observer errors.