Introduction

In telecommunications, quality of transmitted speech is commonly assessed by behavioral measures like the subjective evaluation of stimuli on category rating scales [36]. Such quality ratings are relative in nature, i.e., they reflect a person’s judgment about the quality of a test stimulus either in comparison to another presented stimulus [e.g., in the degradation category rating (DCR) or comparison category rating (CCR) paradigms] or to a long-standing internal reference [e.g., when being requested in an absolute category rating (ACR) paradigm] [25]. Although these measures reliably capture the behavioral descriptions of quality judgments, they cannot be utilized to study the mostly unconscious processes involved in the formation of these judgments within the experiencing person [37, 38].

A new approach in quality assessment tries to overcome this limitation by employing non-invasive electrophysiological methods to measure changes in brain activity associated with internal quality-related representations and processes [2, 20]. Electroencephalography (EEG) proved to be the most frequently employed method due to its high temporal resolution, showing modulations of specific frequency bands in the continuous EEG as well as effects on the morphological and temporal characteristics of event-related brain potential (ERP) components [21, 47].

A number of prior studies have demonstrated effects of stimulus quality on brain activity. In the auditory modality, long-term exposure to degraded speech stimuli was associated with an increase in the relative power of alpha and theta frequency bands located in frontal and parietal-occipital areas, reflecting the influence of emotional processing and mental fatigue [3, 5, 8]. Visual perception was shown to be affected by the perceived quality of images and videos [12, 13, 33, 34, 39, 40, 58] on both explicit and implicit visual processing stages [41], also exerting an influence on the person’s affective state [32]. Moreover, first attempts have been made to investigate the neural correlates of audiovisual quality [7, 9,10,11].

Rapid, short-term changes in neuro-electrical activity time-locked to the onset of a defined sensory, cognitive, affective or motor event manifest in the averaged EEG signal as a stereotypical sequence of ERP components [35, 71]. Specifically the P300 component refers to a positive voltage change occurring approx. 300 ms after the unexpected onset of a meaningful event, which is presumably caused by increased neuro-inhibitory activity [49,50,51, 54, 64]. The P300 waveform results from the superposition of the waveforms of two subcomponents P3a and P3b, both of which typically show middle latencies, with the P3a having a shorter latency than the P3b. While the P3a reflects the allocation of attentional resources—indicating the strength and speed of the involuntary orienting response elicited by novel stimuli [23, 42, 61]—the P3b has been associated with stimulus categorization and memory operations. Accordingly, distinct neural sources associated with attention and memory functions were shown to be involved in the generation of the P3a and P3b, respectively [24, 62, 66, 72].

Usually the P300 or its subcomponents are evoked by applying a variant of the classic oddball paradigm. The three-stimulus oddball paradigm is based on the sequential presentation of stimuli, each of which belongs to one of three different stimulus types: A frequent, task-irrelevant standard (S), an infrequent, task-relevant target (T) and an infrequent, task-irrelevant distractor (D). Due to their unexpected occurrence, T and D are jointly referred to as “oddballs”. Subjects are instructed to respond only to the occurrence of T. Presentation of T triggers the P300 (i.e., the superposition of the P3a and P3b), presentation of D triggers only the P3a. Since the P300 constitutes a very robust and well-studied neuro-electrical phenomenon, an extensive literature base including methodological standards is readily available (e.g., see [17] for the clinical domain).

Using speech stimuli of different complexity (phonemes, words, sentences), previous studies revealed positive and negative effects of degradation intensity (e.g., discrete levels of signal-to-noise ratio, SNR) on P300 amplitude and latency [4, 6, 56]. Interestingly, stimuli with only a small degree of quality degradation, which were explicitly judged as having normal quality, have been shown to affect the P300 characteristics in the same way as highly-impaired stimuli, which were judged as degraded. This implies that the degree of quality degradation has to rise above a certain threshold in order to be perceived (and judged) consciously, yet already influences early implicit processes and probably favors the generation of discomfort, stress and fatigue [4, 55]. So far, cumulative evidence suggests that the P300 can reliably indicate differences in quality percepts and judgments, but it still remains an open question whether its subcomponents P3a and P3b are equally suitable as indicators of quality-related processes.

Fig. 1

Simplified model of quality perception (circles internal processes; parallel horizontal lines internal representations; rectangles input signals, output), based on a conceptual model of internal quality formation [57] and a model of cognitive P300 activity [51]

Quality degradations of speech signals can be of different perceptual nature, e.g., a speech signal may be impaired by additive noise, frequency distortions, interruptions, or the like. Depending on the strength of the degradation, all these impairments may result in the same quality judgment (e.g., a moderately degraded speech signal, obtaining a “fair” rating on a five-point category scale according to ITU-T Recommendation P.800 [25]), although being perceptually quite different: Additive noise will result in a perception of “noisiness”, frequency distortions in a perception of “coloration”, and interruptions in a perception of “discontinuity” [69, 70]. To analyze this observation, multidimensional analyses have been carried out based on multidimensional scaling or semantic differential paradigms, revealing three [68,69,70] or five [59, 60] independent dimensions. Such perceptual dimensions can be utilized to diagnose the sources of impairments, thus providing help for service optimization. However, it still remains unclear how perceptual dimensions of quality are reflected in the human brain, i.e., whether they provoke the same type and strength of neural responses.

The present study aimed at exploring the effects of speech quality on the neural response characteristics of the P300 subcomponents, P3a and P3b. In particular, it focused on answering the following joint questions: First, based on the assumption that the three independent dimensions noisiness, coloration and discontinuity underlie the quality perception of transmitted speech signals and therefore have to be regarded as separate perceptual quality features [68]: Are these three dimensions represented as different perceptual references within the experiencing person? And second, assuming that both P3a and P3b indicate attentional and memory-related processes involved in the formation of quality percepts: Does manipulation of the stimulus environment in terms of quality or a single-dimensional degradation of the perceptual reference modulate the morphological and temporal characteristics of both subcomponents?

A simplified model of quality perception sketched in Fig. 1 provided the theoretical framework for the present study, combining a model of internal quality formation [57] with a model of cognitive P300 activity [50, 51]. As the three-stimulus oddball paradigm requires the perceptual discrimination of stimuli, triggering of the P3a and P3b was anticipated during the early perception stage of the quality formation process. After sensory processing & perception of a physical event present within a certain contextual setting, the resulting perceptual event is compared with the perceptual reference, i.e., an internal representation of the current environment (“stimulus context”), containing all recently perceived stimuli (anticipation & matching). In case an oddball (T, D) was presented, the perceptual event deviates from the perceptual reference and attentional resources are allocated for the upcoming orienting response, which evokes the P3a. If the oddball is further classified as task-relevant (T), subsequent memory processing causes the updating and storage of the perceptual reference, eliciting the P3b. Finally, response-related processing leads to the generation of a task-appropriate motor response.

Methods

The design of the present study followed practical guidelines concerning the standardized application of the EEG method to quality assessment [1, 2]. These guidelines recommend the use of minimal stimulus sets, which include only a small number of stimulus types with short presentation time and synchronous onset, as well as the employment of well-established experimental paradigms with sufficiently high reliability like the classic oddball paradigm [67]. Data collection was preceded by a short pretest on one member of the laboratory to demonstrate the effectiveness of the planned manipulations and to refine the experimental setup [e.g., intensity and duration of stimuli, duration of the inter-stimulus interval (ISI), number of trials per block].

Subjects

Twenty-four subjects participated in the study (11 males, 13 females; age: \({M} = 28.96\) years, \({SD} = 6.69\), \(R = 29\); 87.5% right-handed), all of whom were native German speakers with no self-reported hearing problems. All subjects gave their informed consent after reading a short information sheet about the EEG method provided by the Deutsche Gesellschaft für Psychologie (DGPs; English: German Psychological Society) and received monetary compensation at the end of the experiment.

Stimuli

A clean audio recording of the German word /Haus/ (English: “house”) spoken by a female voice in a sound-insulated environment with a high-quality microphone (AKG Acoustics, C414 B-XLS/ST) was chosen as stimulus material for the experiment due to its neutral meaning as well as clear and strong onset (duration: 500 ms, sampling rate: 16 kHz, bit depth: 16 bit).

Three quality-impaired stimuli were produced from the audio speech file by degrading only one of the three perceptual dimensions (noisiness, coloration, discontinuity) at a time:

  1. Signal-correlated noise generated according to ITU-T Recommendation P.810 [27] with a distortion rate of 18 Q resulted in a noisiness-impaired stimulus (N).

  2. Application of a bandpass filter with a low-cutoff of 1 kHz and a high-cutoff of 2 kHz resulted in a coloration-impaired stimulus (C).

  3. Packet loss generated according to ITU-T Recommendation G.722.2 [26] with a packet loss rate of 20% resulted in a discontinuity-impaired stimulus (P).

Besides, the original recording served as a high-quality stimulus (HQ). The intensities of all stimuli were normalized according to ITU-R Recommendation BS.1770-3 [25], resulting in a target loudness of −23 LUFS (Loudness Units relative to Full Scale).

It was assumed that the selected degradation methods (1.–3.) would create a noticeable change in the perceived quality on each perceptual dimension. This quality manipulation was checked empirically by using standardized category rating paradigms (see Sect. 2.4). In addition, an instrumental estimation of speech quality based on the perceptual objective listening quality assessment (POLQA) model was conducted for each stimulus in accordance with ITU-T Recommendation P.863 [28]. The application of the POLQA model (3 repetitions, duration: 1.5 s) resulted in predicted MOS values for HQ (4.21), N (1.92), C (3.95) and P (1.20), which indicate relevant differences in the predicted speech quality of the four stimuli.

Experimental design and hypotheses

The present study adopted the three-stimulus oddball paradigm to investigate the impact of independent perceptual dimensions of speech quality on P3a and P3b characteristics. It was assumed that the abrupt presentation of an oddball (T, D) would cause an orienting response, which in case of T would be succeeded by memory operations. After T, both P3a and P3b waveforms should be triggered, producing the superposed P300 waveform. After D, only the P3a waveform should be evoked.

Six stimulus sets resulted from the combination of the degraded stimuli (N, C, P) and stimulus types (S, T, D) without repetition. Six additional stimulus sets were generated with HQ as S, combining the degraded stimuli and oddballs (T, D) without repetition. No stimulus sets with HQ as oddball were used, since further increasing the number of stimulus sets would result in excessively long test sessions with a higher probability of interference by mental fatigue and drowsiness. Table 1 lists all stimulus sets used in the present study. In the following sections, particular stimulus sets are named according to their stimulus-to-stimulus-type assignments in the order of S-T-D, e.g., N-C-P being the stimulus set with N as S, C as T and P as D, respectively.
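The construction of the twelve stimulus sets can be illustrated with a short sketch. It is not taken from the study materials; the function name `build_stimulus_sets` is hypothetical, and the only assumption is the S-T-D naming convention described above.

```python
from itertools import permutations

DEGRADED = ("N", "C", "P")  # noisiness, coloration, discontinuity


def build_stimulus_sets():
    """Return the twelve stimulus sets as S-T-D strings such as 'N-C-P'."""
    # Six sets: every ordering of the three degraded stimuli over S, T and D.
    degraded_sets = ["-".join(p) for p in permutations(DEGRADED, 3)]
    # Six sets: HQ as S, two different degraded stimuli as T and D.
    hq_sets = ["-".join(("HQ",) + p) for p in permutations(DEGRADED, 2)]
    return degraded_sets + hq_sets


stimulus_sets = build_stimulus_sets()
print(len(stimulus_sets), stimulus_sets)
```

Both halves are permutations without repetition, which is why no stimulus ever fills two roles within the same set and why HQ never appears as an oddball.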

Table 1 Twelve stimulus sets used in the experiment

Hypotheses about the effects on P3a and P3b characteristics were derived from the simplified model of quality perception depicted in Fig. 1. According to this model, the repeated presentation of degraded or HQ stimuli serving as S should lead to the formation of different perceptual references, i.e., different internal representations for each degraded or HQ stimulus. Hence, the deviation of a certain degraded stimulus serving as oddball from the constant perceptual reference (anticipation & matching) varies depending on which dimension of the quality percept has changed. It was presumed that the quality of the stimulus context would affect the morphological and temporal characteristics of the P3a. Following the presentation of T, memory processes responsible for the updating of the perceptual reference (memory processing) should vary depending on which represented dimension they operate on. Therefore, effects of the perceptual references on P3b characteristics were expected. Due to the exploratory nature of the study, the directions of the anticipated effects on both P3a and P3b characteristics remained unspecified.

Experimental procedure

Each experimental session started with an interview, during which subjects were asked to report their handedness, potential hearing problems, vision correction, drug consumption and the hours of sleep they had had the previous night.

During the oddball task, word stimuli were presented in a pseudo-randomized order such that at least one S fell between two consecutive oddballs, and subjects responded to the occurrence of T by pressing a button box. Moreover, two category rating paradigms were performed in accordance with ITU-T Recommendation P.800 [25]: an absolute category rating (ACR) task and a comparison category rating (CCR) task, one before and the other one after the oddball task, in randomized order across subjects. During the ACR task, a test stimulus was presented, whose quality had to be evaluated on a unipolar discrete five-point rating scale (mean opinion score, MOS) with the values “1”, “2”, “3”, “4” and “5” as well as respective category labels in German “schlecht”, “dürftig”, “ordentlich”, “gut” and “ausgezeichnet” (English: “bad”, “poor”, “fair”, “good” and “excellent”). Each of the four stimuli used in the oddball task (HQ, N, C, P) occurred five times, and stimulus presentation was carried out in randomized order. During the CCR task, a reference stimulus and a test stimulus were presented sequentially with a short pause of 1 s in between. Afterwards, subjects had to evaluate the quality of the second stimulus relative to the quality of the first stimulus on a bipolar discrete seven-point rating scale (comparison mean opinion score, CMOS) with the values “−3”, “−2”, “−1”, “0”, “1”, “2” and “3” as well as corresponding labels in German “viel schlechter”, “schlechter”, “etwas schlechter”, “etwa gleich”, “etwas besser”, “besser” and “viel besser” (English: “much worse”, “worse”, “slightly worse”, “about the same”, “slightly better”, “better” and “much better”). The 16 pairings of stimuli (including pairs of the same stimulus: HQ-HQ, N-N, C-C, P-P) were presented twice, each time in randomized order. The subjects delivered their quality judgments by pressing buttons on a keyboard, which were labeled with the values of MOS and CMOS, respectively. Subfigure 1 in Fig. 2 illustrates the order of tasks to be performed by each subject.
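As a minimal sketch of how the trial lists of the two rating tasks could be assembled from the description above (the function names `acr_trials` and `ccr_trials` are hypothetical, and the actual presentation scripts may have been organized differently):

```python
import random
from itertools import product

STIMULI = ("HQ", "N", "C", "P")


def acr_trials(repetitions=5, rng=None):
    """ACR: each stimulus occurs `repetitions` times in randomized order."""
    rng = rng or random.Random()
    trials = [s for s in STIMULI for _ in range(repetitions)]
    rng.shuffle(trials)
    return trials


def ccr_trials(passes=2, rng=None):
    """CCR: all 16 (reference, test) pairs per pass, reshuffled in each pass."""
    rng = rng or random.Random()
    trials = []
    for _ in range(passes):
        pairs = list(product(STIMULI, repeat=2))  # includes HQ-HQ, N-N, C-C, P-P
        rng.shuffle(pairs)
        trials.extend(pairs)
    return trials


print(len(acr_trials()), len(ccr_trials()))
```

Under these assumptions the ACR task comprises 20 trials and the CCR task 32 trials per subject.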

At the beginning of the oddball task, subjects were instructed to press the button box as fast and as accurately as possible whenever they detected T. Within each block only stimulus-to-stimulus-type assignments of a particular stimulus set were presented. Prior to the start of a new block, subjects were introduced to the new stimulus-to-stimulus-type assignments: first by passively listening to the repetitions of each stimulus type, then by actively performing a sequence of demo trials, after which they could start the test trials. Trials were presented with a 70:20:10 frequency ratio for the three stimulus types S, T and D. Each trial consisted of the presentation of a physical event, followed by an inter-stimulus interval (ISI) of 1200 ms plus/minus a jitter of 200 ms. Sequences of experimental blocks (stimulus sets) and trials were randomized for each session. In total, subjects ran through 12 blocks: six stimulus sets with a degraded stimulus as S, each consisting of 300 trials, and six stimulus sets with HQ as S, each consisting of only 100 trials to reduce the total length of the experiment. Subfigure 2 in Fig. 2 illustrates a randomized sequence of stimulus presentations during the oddball task with Subfigure 3 depicting the composition of the associated neural responses.
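The pseudo-randomization constraint (at least one S between two consecutive oddballs) and the jittered ISI can be sketched as follows. This is an illustration only: the function names are hypothetical, the gap-filling strategy is one of several that satisfy the constraint, and a uniform jitter distribution is an assumption not stated in the text.

```python
import random


def oddball_sequence(n_trials=300, ratio=(70, 20, 10), rng=None):
    """Return a list of 'S', 'T', 'D' labels with no two adjacent oddballs."""
    rng = rng or random.Random()
    total = sum(ratio)
    n_t = n_trials * ratio[1] // total
    n_d = n_trials * ratio[2] // total
    n_s = n_trials - n_t - n_d
    oddballs = ["T"] * n_t + ["D"] * n_d
    rng.shuffle(oddballs)
    # gaps[i] standards precede oddball i; the last gap trails the sequence.
    gaps = [0] * (len(oddballs) + 1)
    for i in range(1, len(oddballs)):
        gaps[i] = 1  # guarantees one S between consecutive oddballs
    spare = n_s - sum(gaps)
    if spare < 0:
        raise ValueError("too few standards to separate all oddballs")
    for _ in range(spare):
        gaps[rng.randrange(len(gaps))] += 1
    sequence = []
    for gap, oddball in zip(gaps, oddballs):
        sequence.extend(["S"] * gap)
        sequence.append(oddball)
    sequence.extend(["S"] * gaps[-1])
    return sequence


def jittered_isi_ms(rng=None, base=1200.0, jitter=200.0):
    """Draw one ISI in ms; the uniform distribution is an assumption."""
    rng = rng or random.Random()
    return rng.uniform(base - jitter, base + jitter)


block = oddball_sequence(300)
print(block.count("S"), block.count("T"), block.count("D"))
```

Calling `oddball_sequence(100)` yields the shorter blocks used with HQ as S while keeping the same frequency ratio.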

All test sessions were carried out in a quiet standardized test room in accordance with ITU-T Recommendation P.800 [25]. During an entire session subjects sat in front of a monitor, wearing an EEG electrode cap as well as headphones (Sennheiser HD 280) through which stimuli were presented binaurally. The subjects themselves were allowed to set their preferred loudness level at the beginning of the first category rating task in order to create a more ecologically valid listening situation. During each block subjects held the button box in their dominant hand and kept their eyes open. The monitor screen displayed the instructions for all tasks and prompted a response from the subject after the test stimulus (ACR) or stimulus pair (CCR) had been presented; otherwise it remained black. After each block subjects were offered a short break. Including the time to prepare the equipment and conduct the whole experiment, a single test session took about two and a half hours.

Fig. 2

1 Experimental procedure consisting of three parts: the first part, category rating 1, being either an absolute category rating (ACR) or comparison category rating (CCR) task. The second part, oddball, being a three-stimulus oddball task consisting of twelve consecutive blocks with changing stimulus-to-stimulus-type assignments. The third part, category rating 2, being either a CCR or ACR task, depending on which category rating task was used in category rating 1. 2 Visual illustration of the three-stimulus oddball task. Quality-impaired stimuli generated from an audio recording of the spoken word /Haus/ are presented in a randomized sequence. Each stimulus (black, green, red) is assigned to one of three stimulus types, standard (S), target (T) and distractor (D), which vary in probability and task relevance. Subjects are instructed to respond only to the occurrence of T. 3 Schematic ERP waveforms of the P3a and P3b subcomponents, whose superposition produces the P300 waveform. Presentation of S (dotted line) triggers no P300-related activity, presentation of T (solid line) triggers both P3a and P3b, presentation of D (dashed line) triggers only P3a. Subfigures 2 and 3 are based on Figure 7.1 (Subfigure C) in a review on the P300 literature [51]

Data collection

Running on a standard laboratory PC, Presentation® software (version 18.3; Neurobehavioral Systems, Inc., USA) executed stimulus presentation, sending event triggers to the EEG amplifier. Electrocortical activity was measured at 16 scalp positions (FP1, F3, C3, P3, TP9, FP2, F4, C4, P4, TP10, Fz, FCz, Cz, CPz, Pz, Oz) according to the extended 10–20 system [44] via a medical-grade EEG recording system (g.GAMMAsys; g.tec medical engineering GmbH, Austria) with active electrodes (g.LADYbird). The reference electrode was placed on the left mastoid, the ground electrode on the right mastoid. The sampling rate was set to 512 Hz.

Table 2 Stimulus set comparisons to examine effects of impairment on P3a and P3b characteristics. Summation in brackets symbolizes merging of stimulus sets prior to averaging
Table 3 Stimulus set comparisons to examine effects of perceptual reference on P3a and P3b characteristics. Summation in brackets symbolizes merging of stimulus sets prior to averaging

Data processing and analysis

The recorded EEG data was processed by using the EEGLAB toolbox (v13.6.5b) for MATLAB software (R2015b).

First, the raw signal was down-sampled from 512 to 256 Hz, then filtered with a finite impulse response (FIR) filter—setting the low-cutoff to 0.1 Hz and the high-cutoff to 40 Hz—as well as re-referenced to average reference. The extracted epochs were stimulus-locked with a pre-stimulus time of 200 ms serving as baseline activity and a post-stimulus time of 1000 ms. Maximum and minimum thresholds for the signal amplitude were specified for each subject across all channels, such that at most 10% of epochs would be rejected. Epochs exceeding this threshold were considered to be artifacts and rejected from further analysis.
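Baseline correction and the subject-wise rejection threshold can be sketched in a few lines. This is not the EEGLAB procedure used in the study; it is a simplified stand-in that assumes a symmetric threshold (maximum = −minimum) and picks the strictest threshold that still rejects no more than 10% of a subject's epochs. All function names are hypothetical, and an epoch is represented as a list of channels, each a list of samples.

```python
def baseline_correct(samples, n_baseline):
    """Subtract the mean of the first n_baseline (pre-stimulus) samples."""
    baseline = sum(samples[:n_baseline]) / n_baseline
    return [v - baseline for v in samples]


def epoch_peak(epoch):
    """Largest absolute amplitude over all channels and samples of an epoch."""
    return max(abs(v) for channel in epoch for v in channel)


def rejection_threshold(epochs, max_reject=0.10):
    """Smallest symmetric threshold that rejects at most max_reject of epochs."""
    peaks = sorted(epoch_peak(e) for e in epochs)
    n_reject = int(max_reject * len(peaks) + 1e-9)  # guard against float error
    return peaks[len(peaks) - n_reject - 1]


def reject_artifacts(epochs, max_reject=0.10):
    """Return (kept epochs, threshold); epochs above the threshold are dropped."""
    threshold = rejection_threshold(epochs, max_reject)
    kept = [e for e in epochs if epoch_peak(e) <= threshold]
    return kept, threshold
```

At a sampling rate of 256 Hz, the 200 ms pre-stimulus interval corresponds to roughly 51 samples, which would be passed as `n_baseline`.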

Ocular artifacts caused by eye movements and blinks were corrected individually for each subject: after computing an independent component analysis (ICA), components associated with ocular activity were subtracted from the data.

Single-subject averaging of single-trial epochs resulted in single-subject average waveforms for each subject, channel, stimulus set and stimulus type. Subsequent grand averaging of single-trial epochs resulted in grand average waveforms for each channel, stimulus set and stimulus type. Average difference epochs T-S and D-S were computed via the following processing steps: Because the number of epochs with S as sensory event (“S epochs”) always exceeded the number of epochs with T or D as sensory event (“oddball epochs”), both numbers had to be equalized first. This was achieved by randomly selecting as many S epochs as there were oddball epochs. Subtraction of single-trial S epochs from single-trial oddball epochs resulted in single-trial difference epochs, across which the average was taken and variance measures were calculated. The variance measures were needed for computing confidence bands and multiple paired t-tests across the time course of the difference epochs (“running t-tests”).
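The difference-epoch computation for a single channel can be sketched as follows. The function names are hypothetical and the code is not the original MATLAB pipeline. It relies on the fact that a paired t-test between oddball and S epochs is equivalent to a one-sample t-test of their differences against zero.

```python
import math
import random
import statistics


def difference_epochs(oddball_epochs, standard_epochs, rng=None):
    """Pair each oddball epoch with a randomly drawn S epoch and subtract."""
    rng = rng or random.Random()
    chosen = rng.sample(standard_epochs, len(oddball_epochs))  # equalize counts
    return [[o - s for o, s in zip(odd, std)]
            for odd, std in zip(oddball_epochs, chosen)]


def mean_waveform(epochs):
    """Average across epochs at each time point."""
    return [statistics.fmean(samples) for samples in zip(*epochs)]


def running_t(diff_epochs):
    """t statistic of the differences against zero per time point (df = n - 1)."""
    n = len(diff_epochs)
    t_values = []
    for samples in zip(*diff_epochs):
        sd = statistics.stdev(samples)
        mean = statistics.fmean(samples)
        # A time point without variance has no defined t value.
        t_values.append(mean / (sd / math.sqrt(n)) if sd > 0 else float("nan"))
    return t_values
```

The resulting t values would then be compared against the critical value for the adjusted significance level at each time point.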

The anticipated effects on P3a and P3b characteristics explicated in Sect. 2.3 could be examined by comparing stimulus sets with degraded stimuli as S and corresponding stimulus sets with HQ as S, given a constant ERP difference and oddball stimulus. Two common ERP parameters, peak amplitude and peak latency, were chosen to quantify the morphological and temporal characteristics of the P3a and P3b subcomponents, respectively. Peak amplitude is defined as the voltage difference between a certain positive/negative maximum (“peak”) in post-stimulus time and the (pre-stimulus) baseline, while peak latency refers to the time passing between the onset of the stimulus and a certain peak. A first analysis step aimed at determining whether different perceptual references exist for quality-impaired vs. HQ stimulus contexts by comparing stimulus sets listed in Table 2. Thus, for each ERP difference (T-S/D-S) and oddball stimulus (N/C/P) the effects of impairment (Impaired, HQ) on P3a and P3b peak parameters were analyzed. A second analysis step further examined whether different perceptual references exist for each of the three perceptual dimensions of speech quality via stimulus set comparisons summarized in Table 3. Hence, for each ERP difference (T-S/D-S) and oddball stimulus (N/C/P), the effects of perceptual reference (C, P, HQ/N, P, HQ/N, C, HQ) on P3a and P3b peak parameters were analyzed. For each analysis step, stimulus sets with HQ as S were merged prior to computing the difference waveforms as outlined above. To compare neural responses in quality-impaired vs. HQ stimulus contexts (first analysis step), also those two stimulus sets with degraded stimuli as S were merged beforehand. Separate time windows, from 350 to 550 ms and from 550 to 750 ms, were used for ERP peak parameter extraction in order to quantify post-stimulus activity related to the P3a and P3b subcomponents, respectively. Both time windows were selected through visual inspection of the grand average T-S and D-S difference waveforms as well as based on the significant results obtained from the running t-tests (see Sect. 3.2).
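Peak parameter extraction within a time window can be sketched as below. The function name `peak_in_window` is hypothetical. The sketch assumes a baseline-corrected waveform, searches only for the most positive sample (appropriate for the positive-going P3a and P3b), and treats the window as half-open so that a sample at exactly 550 ms is not counted in both windows; the text does not state how the shared boundary was handled.

```python
P3A_WINDOW_MS = (350.0, 550.0)
P3B_WINDOW_MS = (550.0, 750.0)


def peak_in_window(waveform, times_ms, window_ms):
    """Return (peak amplitude, peak latency in ms) within lo <= t < hi."""
    lo, hi = window_ms
    candidates = [(v, t) for v, t in zip(waveform, times_ms) if lo <= t < hi]
    if not candidates:
        raise ValueError("no samples fall inside the requested window")
    # The first maximum wins if several samples share the peak amplitude.
    amplitude, latency = max(candidates, key=lambda c: c[0])
    return amplitude, latency


# Toy difference waveform with one sample every 100 ms.
times = [300.0, 400.0, 500.0, 600.0, 700.0]
wave = [0.0, 2.0, 1.0, 3.0, 0.5]
print(peak_in_window(wave, times, P3A_WINDOW_MS))
print(peak_in_window(wave, times, P3B_WINDOW_MS))
```

For real epochs sampled at 256 Hz with a 200 ms baseline, the time axis would be built as `[-200 + i * 1000 / 256 for i in range(n_samples)]`.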

Two response measures, target response time and hit rate, were chosen to quantify the subject’s response behavior during the oddball task, the latter being defined as the relative frequency of responses to T, i.e.,

$$\text{hit rate} = \frac{\text{number of responses to T}}{\text{total number of T}}.$$

Response times less than 1 s were extracted for responses to T in each experimental block (stimulus set).
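Both response measures can be computed per block with a small helper. The function name is hypothetical, and one detail is an assumption: the hit rate below counts every response to T, while only response times below 1 s enter the mean. The text does not say whether slow responses were also excluded from the hit rate.

```python
from statistics import fmean


def response_measures(n_targets, target_rts_ms, max_rt_ms=1000.0):
    """Return (hit rate, mean target response time in ms) for one block."""
    hit_rate = len(target_rts_ms) / n_targets
    fast = [rt for rt in target_rts_ms if rt < max_rt_ms]
    mean_rt = fmean(fast) if fast else float("nan")
    return hit_rate, mean_rt


hit_rate, mean_rt = response_measures(4, [400.0, 500.0, 1200.0])
print(hit_rate, mean_rt)
```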

Using RStudio software (version 1.0.143), independent repeated-measures analyses of variance (ANOVA) were computed for behavioral (category ratings, response measures) as well as electrophysiological data (ERP parameters), each based on a significance level of \(\alpha = 0.05\):

  • To analyze whether the four stimuli differed in absolute or relative quality, two repeated-measures ANOVAs were calculated with stimulus (HQ, N, C, P) as within-subject factor and either MOS (ACR) or CMOS (CCR) as dependent variable.

  • To analyze the effects on response measures, four repeated-measures ANOVAs were calculated with either block (1–12) or stimulus set (see Table 1) as within-subject factor and either target response time or hit rate as dependent variable.

  • To analyze the effects on P3a and P3b peak parameters, 144 repeated-measures ANOVAs were carried out with either perceptual reference (C, P, HQ/N, P, HQ/N, C, HQ) or impairment (Impaired, HQ) as within-subject factor and either peak amplitude or peak latency as dependent variable for each combination of channel (Fz, Cz, Pz; see Sect. 3.2), ERP difference (T-S/D-S), oddball stimulus (N/C/P) and time window (350–550/550–750 ms).

In case of violating the assumption of sphericity, degrees of freedom were corrected according to Huynh-Feldt (\(\epsilon _{\text {HF}}\)). Generalized eta squared (\(\eta _{\text {G}}^2\)) was calculated as a measure of effect size [14, 45]. For all post-hoc pairwise comparisons, multiple paired t-tests with Šidák correction were computed.
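For reference, the Šidák correction for m comparisons tests each comparison at the adjusted level 1 − (1 − α)^(1/m), or equivalently adjusts each p-value to 1 − (1 − p)^m. A minimal sketch with hypothetical function names (the study itself used R, not Python):

```python
def sidak_alpha(alpha, m):
    """Per-comparison significance level for m comparisons."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def sidak_adjust_p(p, m):
    """Šidák-adjusted p-value for one of m comparisons."""
    return 1.0 - (1.0 - p) ** m


# Six pairwise comparisons arise among the four stimuli HQ, N, C and P.
print(sidak_alpha(0.05, 6))
```

With six comparisons, each individual test is therefore evaluated at a level of about 0.0085 instead of 0.05.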

Results

Behavioral data analysis

The analysis of the category ratings showed significant effects of stimulus on MOS (\(F[3,69] = 129.316, p < 0.001, \eta _{\text {G}}^2 = 0.755\)) and CMOS (\(F[3,69] = 127.844, p < 0.001, \eta _{\text {G}}^2 = 0.846\)). Post-hoc comparisons revealed significant differences between all stimuli for both category ratings, except HQ vs. C, which was only significant for MOS (\(p = 0.006\)). Figure 3 illustrates average MOS and CMOS as a function of stimulus.

The analysis of the response measures revealed significant effects of block (\(F[11,253] = 2.406, \epsilon _{\text {HF}} = 0.708, p = 0.018, \eta _{\text {G}}^2 = 0.033\)) and stimulus set (\(F[11,253] = 2.719, \epsilon _{\text {HF}} = 0.448, p = 0.034, \eta _{\text {G}}^2 = 0.037\)) on target response time as well as a significant effect of stimulus set on hit rate (\(F[11,253] = 6.059, \epsilon _{\text {HF}} = 0.334, p < 0.001, \eta _{\text {G}}^2 = 0.155\)). Post-hoc comparisons revealed significant differences in average target response times for blocks 2 vs. 9 (\(p = 0.020\)), 2 vs. 10 (\(p = 0.012\)) and 2 vs. 12 (\(p = 0.044\)) as well as for stimulus sets HQ-P-C vs. N-C-P (\(p = 0.010\)) and HQ-P-C vs. HQ-N-P (\(p = 0.041\)). No significant post-hoc effects on hit rate were found. Figure 4 illustrates average target response times and hit rates as a function of block and stimulus set.

Fig. 3

Effects of stimulus on average mean opinion score (MOS) and comparison mean opinion score (CMOS) as measured in absolute category rating (ACR) and comparison category rating (CCR) tasks, respectively. Error bars represent 95% confidence intervals

Fig. 4

Effects of block and stimulus set on target response time and hit rate in the oddball task. Error bars represent 95% confidence intervals

Fig. 5

1 Grand average ERP waveforms S, T and D for stimulus set C-N-P at electrode position Cz, indicating average neural responses to stimulus types S, T and D. 2 Grand average ERP difference waveforms T-S and D-S for stimulus set C-N-P at electrode position Cz, indicating average deviations of neural responses to T and D from the neural response to S. Prior to plotting, a second artifact rejection with an absolute threshold of 250 µV was performed on the electrophysiological data. The legend contains the number of trials per waveform. Colored arrows point at assumed peaks of P3a and P3b subcomponents in both waveforms. Colored lines of dots (above the time axis) indicate at which time points the ERP difference waveforms significantly deviate from the baseline (based on multiple paired t-tests with Šidák-adjusted \(\alpha = 0.05\)). Error bands represent 95% confidence intervals

Fig. 6

Grand average T-S (1) and D-S (2) difference waveforms for each oddball stimulus (N, C, P) at electrode position Cz, indicating average deviations of neural responses to T (1) and D (2) from neural responses to S. Prior to plotting, a second artifact rejection with an absolute threshold of 250 µV was performed on the electrophysiological data. The legend contains the number of trials for each waveform. Colored lines of dots (above the time axis) indicate at which time points the ERP difference waveforms significantly deviate from the baseline (based on multiple paired t-tests with Šidák-adjusted \(\alpha = 0.05\)). Error bands represent 95% confidence intervals

ERP plots

Visual inspection of single-subject average ERP plots guided the exclusion of three subjects due to technical artifacts and unacceptably high levels of noise in their recordings.

Grand average ERP plots were generated for three electrode positions Fz, Cz and Pz, at which the topographical P300 characteristics should be most prominent (“P300 scalp distribution”) [29, 52], displaying the average neural responses across the remaining 21 subjects. Subfigure 1 in Fig. 5 shows grand average ERP waveforms for each stimulus type in stimulus set C-N-P at electrode position Cz to exemplify the successful triggering of the P3a and P3b.

Table 4 Effects of impairment on ERP parameters (peak amplitude, peak latency)
Fig. 7

Arithmetic means (M) and standard deviations (SD) of the effects of impairment on ERP parameters (peak amplitude, peak latency). Error bars represent 95% confidence intervals

In addition, grand average ERP difference plots were generated to enable the selection of time windows for the extraction of P3a and P3b peak parameters: While the T-S waveform should indicate superposed post-stimulus activity of both P3a and P3b (i.e., the P300), the D-S waveform should indicate only P3a-related activity. Accordingly, subfigure 2 in Fig. 5 shows grand average ERP difference waveforms T-S and D-S for stimulus set C-N-P at electrode position Cz. Running t-tests with Šidák-adjusted \(\alpha = 0.05\) were calculated across the epoch time course to indicate when each waveform significantly deviated from baseline.

Visual inspection of the grand average waveforms as well as significant results from the running t-tests enabled the localization of both subcomponents: In the T-S waveform (red waveforms in subfigures 1 and 2), the potential peaks around 460 ms and 600 ms should reflect P3a- and P3b-related neural activity, respectively. In the D-S waveform, only one potential peak manifested around 440 ms, which should indicate P3a-related neural activity.

Table 5 Effects of perceptual reference on ERP parameters (peak amplitude, peak latency)
Table 6 Post-hoc comparisons of the effects of perceptual reference on ERP parameters (peak amplitude, peak latency)
Fig. 8

Arithmetic means (M) and standard deviations (SD) of the effects of perceptual reference on ERP parameters (peak amplitude, peak latency). Error bars represent 95% confidence intervals

Electrophysiological data analysis

The analysis of the electrophysiological data was based on comparisons of specific stimulus sets (see Tables 2, 3) in order to isolate effects of impairment and perceptual reference on P3a and P3b peak parameters (peak amplitude, peak latency), which were extracted from two separate time windows (350–550 ms and 550–750 ms). Figure 6 illustrates grand average ERP difference waveform plots for each stimulus set comparison listed in Table 3 at electrode position Cz. These plots visualize average neural response differences T-S (1) and D-S (2) for different stimuli serving as S (perceptual reference), given a constant degraded stimulus (N, C, P) as T (1) and D (2), respectively.
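The extraction of peak parameters from a difference waveform can be sketched as below. The sketch assumes that the peak is defined as the largest positive value inside the window and that both window borders are inclusive; the text does not specify either choice. Function names, sample times and amplitudes are hypothetical.

```python
def difference_waveform(oddball_avg, standard_avg):
    """Point-wise difference (e.g., T-S or D-S) of two averaged waveforms."""
    return [o - s for o, s in zip(oddball_avg, standard_avg)]


def peak_in_window(times_ms, amplitudes, start_ms, end_ms):
    """Return (peak amplitude, peak latency in ms) within [start_ms, end_ms].

    Assumption: the peak is the maximum amplitude in the window.
    """
    window = [(amp, t) for t, amp in zip(times_ms, amplitudes)
              if start_ms <= t <= end_ms]
    if not window:
        raise ValueError("no samples inside the requested window")
    return max(window, key=lambda pair: pair[0])


# Hypothetical toy waveforms sampled every 50 ms (arbitrary units)
times_ms = [300, 350, 400, 450, 500, 550, 600, 650, 700, 750]
target_avg = [1.0, 2.0, 4.0, 6.5, 5.0, 4.0, 5.5, 4.5, 3.0, 2.0]
standard_avg = [0.5] * len(times_ms)

t_minus_s = difference_waveform(target_avg, standard_avg)
p3a_peak = peak_in_window(times_ms, t_minus_s, 350, 550)  # P3a-related window
p3b_peak = peak_in_window(times_ms, t_minus_s, 550, 750)  # P3b-related window
```

With real data, the per-subject peak amplitudes and latencies obtained this way would be the dependent variables entering the statistical comparisons.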

Table 4 and Fig. 7 show significant results from four independently conducted ANOVAs of the effects of impairment on peak amplitude and peak latency ordered by channel, ERP difference, oddball stimulus, time window of ERP quantification as well as the extracted ERP parameter.

Accordingly, Tables 5, 6 as well as Fig. 8 show significant results from eight independently conducted ANOVAs of the effects of perceptual reference on peak amplitude and peak latency.

Discussion

The goal of the present study was to explore the neuro-electrical correlates of perceptual references for three independent dimensions of speech quality: noisiness, coloration and discontinuity [69, 70]. Perceptual references are considered to be part of the early perception stage of the quality formation process [57], exerting an influence on attentional and memory-related processing. The subcomponents of the P300, P3a and P3b, were chosen as neurophysiological indicators of these early quality-related processes (see Fig. 1). An audio recording of the German word |Haus| spoken by a female voice was degraded on one of the three perceptual dimensions at a time, producing three quality-impaired stimuli (N = noisiness-impaired, C = coloration-impaired, P = discontinuity-impaired) with the original recording serving as a high-quality (HQ) stimulus. In order to test the effects of HQ vs. quality-impaired stimulus contexts as well as single perceptual references on P3a and P3b characteristics, 24 subjects were invited to perform an active three-stimulus oddball task with the degraded stimuli (N, C, P) being assigned to three stimulus types, standard (S), target (T) and distractor (D). During the oddball task, each subject’s electrical brain activity was recorded via EEG. Statistical analyses of the resulting behavioral (category rating scores, response measures) and electrophysiological data (ERP parameters) revealed several significant effects.

Behavioral effects: category rating scores

At first, the degradation method applied to the original audio recording affected both the mean opinion score (ACR) and comparison mean opinion score (CCR), with P and N differing significantly from HQ and C. Hence, quality percepts of HQ and C might be described subjectively as “normal” or “close-to-normal”, while quality percepts of N and P appeared to be “clearly degraded”. Both effects demonstrated the successful manipulation of speech quality in the present experiment and were considered in the following analyses of the response measures and the electrophysiological data. Moreover, the observed MOS values matched the predicted MOS values obtained by applying the POLQA model to the four stimuli (see again Sect. 2.2). In general, the observed differences in quality judgments were quite robust with only small variability between subjects (see Fig. 3). This might be an indication that the individual adjustment to a preferred loudness level did not significantly influence quality perception and evaluation of the presented stimuli.

Behavioral effects: response measures

During the oddball task, only weak time-on-task effects (probably due to an increase in mental fatigue and/or drowsiness) were observed, manifesting as a general reduction in average target response times over the course of the experiment. In contrast, the target detection performance as quantified by average hit rates remained unaffected by the amount of time spent on the task, which is not surprising, given that the cognitive load required for optimal task performance should be relatively low in healthy subjects. Furthermore, both response measures were shown to depend on the particular stimulus set: Average responses to T in stimulus set HQ-P-C were faster than in N-C-P and HQ-N-P. Since HQ and P were judged as having the highest and lowest quality of all stimuli, respectively, the difficulty of perceptually discriminating between S and T (T-S discriminability) should be minimal in HQ-P-C, reducing the average target response time and increasing the average hit rate. On the other hand, stimulus sets HQ-C-P and HQ-C-N demonstrated slower and more erroneous responses to T compared to the other stimulus sets with HQ as S, which corresponds to the observation that HQ and C were judged similarly on MOS/CMOS. Besides, variability increased drastically in these more demanding stimulus sets, pointing to relevant differences in auditory discrimination ability between subjects.

Electrophysiological effects: perceptual references for quality-impaired vs. high-quality stimulus contexts

The present study focused mainly on the question of whether P3a and P3b peak parameters are modulated by perceptual references which represent independent perceptual dimensions of speech quality. To investigate this, peak amplitudes and latencies were extracted from difference epochs of specific stimulus sets and contrasted against each other. Visual inspection of the grand average difference waveforms as well as significant time points—obtained from running t-tests across the epoch time course—suggested that both P3a and P3b subcomponents were successfully triggered by the oddball task, demonstrating typical morphological and temporal characteristics (see subplot 2 in Fig. 5). Based on the same information P3a and P3b peak parameters were extracted, using separate time windows from 350 to 550 ms and from 550 to 750 ms, respectively.

Results obtained from the first step of electrophysiological data analysis supported the notion of different perceptual references for quality-impaired vs. HQ stimulus contexts due to significant modulations of peak amplitude at electrode position Cz and peak latency at Pz. Interestingly, given C as D (see comparison 2 in Table 4), a larger P3a peak amplitude was evoked in the HQ stimulus context compared to the degraded stimulus context, the latter in this case consisting of N and P. With MOS/CMOS distances between C and N as well as C and P being larger than between C and HQ, an impaired stimulus context should have produced a larger deviation between D and S (D-S deviation) compared to a HQ stimulus context. Since the P3a constitutes a central part of the orienting response elicited by novel stimuli [61] and the occurrence of C as D was completely task-irrelevant, the opposite effect would have been expected, i.e., a larger D-S deviation causing a stronger (and faster) orienting response with larger P3a peak amplitudes (and shorter P3a peak latencies; e.g., as observed by [22, 43]). Thus, the observed differences in the neural response were attributed to the detection of change not in the quality intensity, but in the perceptual quality dimension underlying the degradation of the presented oddball stimulus.

Electrophysiological effects: perceptual references for single quality dimensions

Corresponding results emerged in the second analysis step, demonstrating that the perceptual reference affected the ERP parameters primarily within the P3a-related time window (350–550 ms) at electrode positions Fz and Cz. For set comparisons 6 and 7 listed in Table 5, the neural response to C as D was stronger for HQ as S relative to N or P as S, again indicating a positive relationship between the size of the D-S deviation and the P3a peak amplitude, which contradicts the orienting response viewpoint. Furthermore, modulations of peak latency with regard to neural responses to T were found for set comparisons 1 and 5 (see Table 6), confirming previously reported negative effects of T-S discriminability (also termed “perceptual distinctiveness” in the literature) on P300 amplitude and latency [15, 16, 30, 48, 53]. Based on this result pattern as well as the fact that T-S discriminability directly determines the difficulty of the oddball task, an alternative account appears plausible: High T-S discriminability should allow for an easy detection task because T is clearly separable from the S sequence, hence stimulus evaluation is fast and fewer resources are depleted in subsequent memory-related processing as indicated by shorter P300 latencies and larger P300 amplitudes. In contrast, low T-S discriminability should make the detection task more difficult, prolonging stimulus evaluation and increasing the cognitive load of the memory operations as indicated by longer P300 latencies and smaller P300 amplitudes [31]. This explanation is further supported by post-hoc comparison 4 with N serving as T (see Table 6), which revealed a larger P300 peak amplitude for HQ as S relative to C or P as S. Correspondingly, inspection of Fig. 6 (subfigure 1) shows that at electrode position Cz the average neural response evoked by the presentation of T was modulated by the perceptual reference, which had been established by the previous frequent presentation of S: When N served as oddball, the P3a amplitude was highest for the HQ waveform in contrast to the C and P waveforms.

Despite the prediction of effects on P3b peak parameters, which should indicate changes in memory-related processing, all the observed modulations of P300 parameters following the occurrence of T were only significant in the P3a-related time window. Thus, it remained unclear whether the described account would suffice as an exhaustive explanation of the observed data. Potential effects of loudness level on P3a and P3b peak parameters [46] were anticipated to be non-significant as preferred loudness levels varied across subjects and standard deviations of quality ratings were very small (compare with confidence intervals plotted in Fig. 3). Again, the observed effects on P300 parameters were more likely attributable to the detection of change in the perceptual dimension of the degraded oddball stimulus, causing variation in the resource demands of memory-related processing.

Confounding factors and alternative explanations

Since the goal of the present study was to provide a first exploration of the effects of perceptual references of independent dimensions of speech quality on P3a and P3b parameters, the quality intensity of the three degraded stimuli was not controlled. This is apparent from the predicted MOS values as well as observed MOS/CMOS values, but also from the two response parameters (response time and correct rate), which were extracted from the oddball task (see again Sects. 4.1 and 4.2). Consequently, it could be argued that the strength and speed of the elicited neural response actually depend on differences in quality intensity between the oddballs and the perceptual references detected during an early matching process, which in turn modulates the P3a and P3b parameters. Whether this alternative account is valid and how it relates to the assumed detection of changes in the underlying quality dimensions of the degraded stimuli cannot be decided based on the present experimental data, however, and therefore needs to be addressed by future studies.

It is noticeable from Table 5 that none of the comparisons with P as oddball turned out to be statistically significant. A potential explanation for this lies in the temporal domain of the degraded stimuli: In the case of N and C, the degradation methods induced a continuous degradation across all samples of the audio speech file, whereas in the case of P the method resulted in a discretely degraded stimulus, since only a number of distinct samples were omitted. This in turn might have led to changes both in the quality perception of the degraded stimuli and the neural responses evoked after their occurrence as oddballs: N and C were probably perceived as continuously impaired and also detected faster on the neural level, since the presented sound file was already impaired in the first sample. In contrast, the characteristics of the neural response to P, which was presumably perceived as a discrete type of quality degradation, might have varied depending on the position of the impaired samples.

Taken together, only one significant modulation of peak latency in the time window from 550 to 750 ms could be interpreted as reflecting changes in P3b-related activity (see comparison 3 in Table 4). This is surprising, given that more T epochs than D epochs were collected as a consequence of the higher probability of T (0.2) vs. D (0.1), presumably resulting in a higher signal-to-noise ratio (SNR) for T-S vs. D-S waveforms. Since the P3a is highly sensitive to the probability of novel stimuli (oddballs), displaying larger amplitudes and shorter latencies for more infrequent oddballs [18, 19, 52, 63, 65], this might have led to more pronounced P3a responses after the occurrence of D. According to the theoretical framework depicted in Fig. 1, existing perceptual references should affect mostly attentional and memory-related processes located on the perception stage of the quality formation process. Since the P3a and P3b are generally associated with similar cognitive functions, i.e., P3a and P3b reflecting attentional resource allocation and memory updating operations [50, 51], they were regarded as suitable indicators of these early quality-related processes: The P3a should reflect attentional processing after the unexpected occurrence of an oddball, while the P3b should reflect memory processing after the oddball was categorized as task-relevant. Although the present study provided a first exploration of the validity of this model, the difference in probabilities precluded the addition of a second factor oddball to the experimental design. Hence, the assumption that P3a and P3b reflect distinct quality-related processes of attentional and memory processing could not be fully addressed in the present study. A more detailed analysis will be needed to test whether the three-stimulus oddball paradigm can reliably dissociate distinct processes in human quality perception.

Conclusions

To sum up, the analysis of the electrophysiological data provided first evidence that the three perceptual dimensions of speech quality, noisiness, coloration and discontinuity, are internally represented as different perceptual references as indicated by modulations of the P3a subcomponent. Yet, since the degraded stimuli varied considerably in their average quality judgments (see Fig. 3), the observed effects on P300 parameters might still be ascribed to differences in degradation intensity. In a follow-up experiment, this confounding factor would have to be controlled by adjusting the degraded stimuli, such that their respective judgments on the MOS/CMOS scales are approximately equal. Degraded stimuli produced this way would then vary only in the particular perceptual dimension on which the quality degradation was initially induced.

Altogether, the present study exemplifies the use of neurophysiological methods for the assessment of speech quality, providing more direct indicators of human influencing factors on speech perception. Detailed knowledge about the internal representations and processes governing the formation of quality percepts and judgments will help to improve the evaluation, prediction and optimization of media transmission quality, ultimately increasing acceptance and long-term satisfaction on the user side [37, 38].