Is visual salience top-down or bottom-up?

Listeners perform an amplitude-modulation (AM) detection task by attending to a tone sequence and indicating the presence of intermittent modulated target tones (orange note in Figure 1). Concurrently, a busy acoustic scene is presented in the background, and subjects are asked to ignore it completely. Background scenes are taken from the JHU DNSS (Dichotic Natural Salience Soundscapes) database, for which behavioral estimates of salience timing and strength have been previously collected (Huang and Elhilali, 2017) (see Materials and methods for details). In a first experiment, easy and hard AM detection tasks are interleaved in experimental blocks by changing the modulation depth of the target note (easy: 0 dB, hard: −5 dB). As expected, subjects show a higher overall detection accuracy for the easy condition (75.4%) than for the hard condition (48.2%). Moreover, target detection (in both easy and hard conditions) is disrupted by the presence of a salient event in the ignored background scenes; detection accuracy drops significantly over a period of up to a second after onset of the salient event [drop in detection accuracy; hard task, t(62) = 5.25, p = 1.96 × 10⁻⁶; easy task, t(62) = 5.62, p = 4.92 × 10⁻⁷]. Salient events attract listeners' attention away from the task at hand and cause a drop in detection accuracy that is proportional to the salience level of background distractors, especially for high- and mid-salience events [hard task: high-salience event t(62) = 4.97, p = 5.57 × 10⁻⁶; mid-salience event t(62) = 3.70, p = 4.54 × 10⁻⁴; low-salience event t(62) = 0.75, p = 0.46; easy task: high-salience event t(62) = 4.20, p = 8.54 × 10⁻⁵; mid-salience event t(62) = 2.29, p = 0.025; low-salience event t(62) = 1.51, p = 0.14]. To further explore the neural underpinnings of changes in the attentional state of listeners, this paradigm is repeated with the easy task while neural activity is measured using electroencephalography (EEG).

Figure 1. Stimulus paradigm during EEG recording.

Listeners are presented with two concurrent sounds in each trial: (top stimulus) a recording of a natural audio clip, which subjects are asked to ignore; and (bottom stimulus) a rhythmic tone sequence, which subjects attend to in order to detect the presence of occasional modulated tones (shown in orange). A segment of the neural recording from one trial is shown at the bottom. Analyses focus on changes in neural responses due to the presence of salient events in the ambient scene or target tones in the attended scene.

The attended tone sequence is presented at a regular tempo of 2.6 Hz and induces a strong overall phase-locked response around this frequency despite the concurrent presentation of a natural scene in the background. Figure 2A shows the grand average spectral profile of the neural response observed throughout the experiment. The plot displays strong energy at 2.6 Hz, with a left-lateralized fronto-central response (Figure 2A, inset), consistent with activation of Heschl's gyrus and conforming to prior observations of precise phase-locking to relatively slow rates in core auditory cortex (Lütkenhöner and Steinsträter, 1998; Liégeois-Chauvel et al., 2004; Stropahl et al., 2018).

Figure 2. Phase-locking results.

(A) Spectral density across all stimuli. The peak in energy at the tone presentation frequency is marked by a red arrow. Inset shows average normalized tone-locking energy for individual electrodes. (B) Spectral density around target tones (top) and salient events (bottom). Black lines show energy preceding the target or event, while colored lines depict energy following. Note that target tones are fewer throughout the experiment leading to lower resolution of the spectral profile. (C) Change in phase-locking energy across target tones, non-events, and salient events. (D) Change in tone-locking energy across high, mid, and low salience events. Error bars depict ±1 SEM.

Taking a closer look at this phase-locked activity aligned to the tone sequence, the response appears to change during the course of each trial, particularly when coinciding with task-specific AM tone targets, as well as when concurring with salient events in the background scene. Phase-locking near modulated-tone targets shows an increase in 2.6 Hz power relative to the average level, reflecting an expected increase in neural power induced by top-down attention (Figure 2B-top). The same phase-locked response is notably reduced when tones coincide with salient events in the background (Figure 2B-bottom - blue curve), indicating diversion of resources away from the attended sequence and potential markers of distraction caused by salient events in the ignored background.

We contrast variability of 2.6 Hz phase-locked energy over three windows of interest in each trial: (i) near AM tone targets, (ii) near salient events, and (iii) near tones chosen randomly away from either targets or salient events and used as control baseline responses. We compare activity in each of these windows relative to a preceding window (e.g. Figure 1, post- vs. pre-event interval). Figure 2C shows that phase-locking to 2.6 Hz after target tones increases significantly [t(443) = 4.65, p = 4.43 × 10⁻⁶], whereas it decreases significantly following salient events [t(443) = 5.89, p < 10⁻⁷], relative to preceding non-target tones. A random sampling of tones away from target tones or salient events does not show any significant variability [t(443) = 0.78, p = 0.43, Bayes Factor 0.072], indicating a relatively stable phase-locked power in control segments of the experiment away from task-relevant targets or bottom-up background events (Figure 2C, middle bar). Compared to each other, the top-down attentional effect due to target tones is significantly different from the inherent variability in phase-locked responses in control segments [t(886) = 3.81, p = 1.48 × 10⁻⁴], while distraction due to salient events induces a decrease in phase-locking that is significantly different from inherent variability in control segments [t(886) = 3.58, p = 3.66 × 10⁻³].
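The post- versus pre-window comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the sampling rate, window length, number of events, and synthetic signals are all hypothetical, and energy is simply read from the FFT bin closest to the 2.6 Hz tone rate before a paired t statistic is computed across events.

```python
import numpy as np

def energy_at(x, fs, f0):
    """Energy of x at the FFT bin closest to f0 (Hz), after a Hann taper."""
    spectrum = np.fft.rfft(x * np.hanning(len(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(np.abs(spectrum[np.argmin(np.abs(freqs - f0))]) ** 2)

def paired_t(post, pre):
    """Paired t statistic and degrees of freedom for the difference post - pre."""
    d = np.asarray(post, dtype=float) - np.asarray(pre, dtype=float)
    n = d.size
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(n))), n - 1

# Synthetic demo: every value below is a hypothetical choice for illustration.
rng = np.random.default_rng(0)
fs, f0, dur, n_events = 100.0, 2.6, 5.0, 40
t = np.arange(int(fs * dur)) / fs
tone = np.sin(2 * np.pi * f0 * t)

# Simulate a weaker tone-locked component after each "salient event".
pre = [energy_at(1.0 * tone + rng.standard_normal(t.size), fs, f0)
       for _ in range(n_events)]
post = [energy_at(0.5 * tone + rng.standard_normal(t.size), fs, f0)
        for _ in range(n_events)]

t_stat, dof = paired_t(post, pre)
print(f"t({dof}) = {t_stat:.2f}")  # negative t: energy drops after the event
```

In this toy setting a negative t statistic corresponds to a loss of tone-locked energy after the event, and a positive one to an enhancement, mirroring the direction of the effects reported for salient events and targets.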

Interestingly, this salience-induced decrease is modulated in strength by the level of salience of background events. The decrease in phase-locked energy is strongest for events with a higher level of salience [t(443) = 3.78, p = 1.8 × 10⁻⁴]. It is also significant for events with mid-level salience [t(443) = 2.57, p = 0.01], whereas the reduction for events with the lowest salience does not reach significance [t(359) = 1.33, p = 0.20, Bayes Factor 0.14] (Figure 2D). A one-way ANOVA did not show a significant difference between the mean suppression at the three salience levels [F(1329) = 1.65, p = 0.19].

A potential confound to reduced phase-locking due to distraction could be local acoustic variability associated with salient events, rather than actual deployment of bottom-up attention that disrupts phase-locking to the attended sequence. While this possibility is unlikely given the significant effect of salient events on behavioral detection of targets, we further reassess the loss of phase-locking to the attended rhythm near events by excluding salient events with the highest loudness, which could cause energetic masking effects (Moore, 2013). This analysis confirms that phase-locking to 2.6 Hz is still significantly reduced relative to non-event control moments [t(443) = 3.88, p < 10⁻³]. A complementary measure of loudness is also explored by excluding events with the highest energy in one equivalent rectangular bandwidth (ERB) around the tone frequency at 440 Hz (Moore and Glasberg, 1983). Excluding the loudest 25% of events by this measure still yields a significant reduction in tone-locking [t(443) = 4.93, p = 1.17 × 10⁻⁶]. In addition, we analyze the acoustic attributes of all salient events in background scenes and compare them to those of randomly selected intervals in non-salient segments. This comparison assesses whether salient events have unique acoustic attributes that are never observed at other moments in the scene. A Bhattacharyya coefficient (BC; Kailath, 1967) reveals that salient events share the same global acoustic attributes as non-salient moments in the ambient background across a wide range of features (BC for loudness 0.9655, brightness 0.9851, pitch 0.9867, harmonicity 0.9775, and scale 0.9868). Moreover, the significant drop in phase-locking is maintained when events are split by strength of low-level acoustic features such as harmonicity or brightness [High Harmonicity, t(443) = 3.75, p = 1.97 × 10⁻⁴; Low Harmonicity, t(443) = 3.77, p = 1.82 × 10⁻⁴; High Brightness, t(443) = 4.18, p = 3.51 × 10⁻⁵; Low Brightness, t(443) = 3.26, p = 1.21 × 10⁻³], further validating that the effect of salience is not solely due to low-level acoustic features.
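For readers unfamiliar with the Bhattacharyya coefficient, the sketch below computes it between two sets of feature values using shared histogram bins. The binning scheme, the number of bins, and the synthetic "feature" samples are assumptions made for illustration; the source does not specify how the distributions were estimated.

```python
import numpy as np

def bhattacharyya_coefficient(a, b, bins=30):
    """BC between the empirical distributions of samples a and b.

    Both samples are histogrammed on a common grid; BC = sum(sqrt(p * q)),
    which is 1 for identical histograms and 0 for non-overlapping ones.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))

# Hypothetical feature values (e.g. loudness) at events and control moments.
rng = np.random.default_rng(1)
events = rng.normal(0.0, 1.0, 2000)
controls = rng.normal(0.1, 1.0, 2000)   # nearly the same distribution
far = rng.normal(8.0, 1.0, 2000)        # clearly different distribution

bc_similar = bhattacharyya_coefficient(events, controls)
bc_distinct = bhattacharyya_coefficient(events, far)
print(f"similar: {bc_similar:.3f}, distinct: {bc_distinct:.3f}")
```

Values close to 1, such as those reported above for loudness, brightness, pitch, harmonicity, and scale, indicate heavily overlapping distributions.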

The reduction of phase-locking to the attended sequence's rhythm in the presence of salient events raises the question of whether these attention-grabbing instances result in momentarily increased neural entrainment to the background scene. While the ambient scene does not contain a steady rate at which to examine exact phase-locking, its dynamic nature as a natural soundscape allows us to explore the fidelity of encoding of the stimulus envelope before and after salient events. Generally, synchronization to ignored stimuli tends to be greatly suppressed (Ding and Simon, 2012; Fuglsang et al., 2017). Nonetheless, we note a momentary enhancement in decoding accuracy after high-salience events compared to a preceding period [paired t-test, t(102) = 2.18, p = 0.03], though no such effects are observed for mid-salience [t(113) = 1.09, p = 0.28] and low-salience [t(107) = 0.24, p = 0.81] events (Figure 3).

Figure 3. Reconstruction of ignored scene envelopes from neural responses before and after salient events for high, mid and low salience instances.

The accuracy quantifies the correlation between neural reconstructions and scene envelopes estimated using ridge regression (see Materials and methods). Error bars depict ±1 SEM.
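The envelope-reconstruction measure can be sketched as a backward (decoding) model: time-lagged EEG channels are mapped to the stimulus envelope with ridge regression, and accuracy is the correlation between the reconstruction and the true envelope on held-out data. The lag count, regularization value, train/test split, and synthetic data below are assumptions for illustration and are not taken from the source.

```python
import numpy as np

def lagged(eeg, n_lags):
    """Stack time-lagged copies of each channel: output is (T, C * n_lags).

    Row t holds eeg[t], eeg[t + 1], ..., since the neural response follows
    the stimulus in time. Rows without enough future samples are zero-padded.
    """
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[: T - lag, lag * C:(lag + 1) * C] = eeg[lag:]
    return X

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X'X + lam * I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data: a smooth "envelope" seen by 8 noisy, delayed channels.
rng = np.random.default_rng(2)
T, C, n_lags = 2000, 8, 5
env = np.convolve(rng.standard_normal(T), np.ones(10) / 10, mode="same")
env = (env - env.mean()) / env.std()
eeg = np.stack([np.roll(env, c % 3) for c in range(C)], axis=1)
eeg = eeg + 0.5 * rng.standard_normal((T, C))

X = lagged(eeg, n_lags)
half = T // 2
# Train on the first half (minus a few rows that would peek into the test half).
w = ridge_fit(X[: half - n_lags], env[: half - n_lags], lam=1.0)
recon = X[half:] @ w
accuracy = float(np.corrcoef(recon, env[half:])[0, 1])
print(f"reconstruction accuracy (Pearson r): {accuracy:.2f}")
```

In an analysis like the one in Figure 3, such an accuracy would be computed separately for windows before and after each salient event and then compared with a paired test.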

Next, we probe other markers of attentional shift and focus particularly on the Gamma band energy in the neural response (Ray et al., 2008). We contrast spectral profiles of neural responses after target tones, salient events and during control tones. Figure 4A depicts a time-frequency profile of neural energy around modulated target tones (0 on the x-axis denotes the start of the target tone). A strong increase in Gamma activity occurs after the onset of target tones and spans a broad spectral bandwidth from 40 to 120 Hz. Figure 4B shows the same time-frequency profile of neural energy relative to attended tones closest to a salient event. The figure shows a decrease in spectral power after the onset of attended tones nearest salient events, which is also spectrally broad, though strongest in a high-Gamma range (60–120 Hz).

Figure 4. High gamma band energy results.

(A) Time-frequency spectrogram of neural responses aligned to onsets nearest modulated targets, averaged across central and frontal electrodes. Contours depict the highest 80% and 95% of the gamma response. (B) Time-frequency spectrogram of tones nearest salient events in the background scene. Contours depict the lowest 80% and 95% of the gamma response. (C) Change in energy in the high gamma frequency band (70–110 Hz) across target tones, non-events, and salient events relative to a preceding time window. (D) Change in high gamma band energy across high, mid, and low salience events. Error bars depict ±1 SEM.

Figure 4C quantifies the variations of Gamma energy relative to targets, salient events, and control tones as compared to a preceding time window. High-Gamma band energy increases significantly following target tones [t(443) = 11.5, p < 10⁻⁷], while it drops significantly for attended tones near salient events [t(443) = 6.83, p < 10⁻⁷]. Control non-event segments show no significant variations in Gamma energy [t(443) = 1.5, p = 0.13, Bayes Factor 0.16], confirming a relatively stable Gamma energy throughout the experimental trials overall. The increase in spectral energy around the Gamma band is significantly different in a direct comparison between target and control tones [t(886) = 10.3, p < 10⁻⁷]. Similarly, the decrease in spectral energy around the Gamma band is significantly different when comparing salient events against control tones [t(886) = 6.68, p < 10⁻⁷]. As with the decrease in tone-locking, the Gamma band energy drop is more prominent for higher-salience events [t(443) = 7.72, p < 10⁻⁷], is lower but still significant for mid-level salience events [t(443) = 3.64, p = 3.02 × 10⁻⁴], but is not significant for low-salience events [t(443) = 0.84, p = 0.40, Bayes Factor 0.076] (Figure 4D). A one-way ANOVA shows that the three levels of salience strength have significantly different changes in gamma power [F(1329) = 20.79, p = 1.29 × 10⁻⁹], with all levels found to be significantly different from each other based on a post-hoc Tukey test.
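One common way to track band-limited energy over time is to build a band-limited analytic signal and take its squared magnitude. The sketch below does this with an FFT mask over the 70–110 Hz band named in Figure 4C and compares mean energy before and after a simulated onset. The sampling rate, the synthetic 90 Hz burst, and the FFT-mask filter itself are assumptions; the source does not describe its filtering approach here.

```python
import numpy as np

def band_energy_envelope(x, fs, lo, hi):
    """Instantaneous energy of x within [lo, hi] Hz.

    Keeps only positive-frequency FFT bins inside the band (doubling them),
    which yields a band-limited analytic signal; its squared magnitude is
    the energy envelope.
    """
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(len(x), d=1.0 / fs)
    keep = (freqs >= lo) & (freqs <= hi)  # lo > 0 excludes negative frequencies
    analytic = np.fft.ifft(2.0 * spectrum * keep)
    return np.abs(analytic) ** 2

# Synthetic trial: weak noise throughout, plus a 90 Hz burst after t = 1 s.
rng = np.random.default_rng(4)
fs = 500.0
t = np.arange(int(2 * fs)) / fs
x = 0.1 * rng.standard_normal(t.size)
x[t >= 1.0] += np.sin(2 * np.pi * 90.0 * t[t >= 1.0])

energy = band_energy_envelope(x, fs, 70.0, 110.0)
pre_energy = float(energy[t < 1.0].mean())
post_energy = float(energy[t >= 1.0].mean())
print(f"pre: {pre_energy:.4f}, post: {post_energy:.4f}")
```

The difference between the post and pre values plays the role of the "change in high gamma band energy" plotted per condition; a real analysis would also need to handle edge effects and averaging across electrodes.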

Furthermore, the modulation of gamma band energy by both bottom-up and top-down attention varies with subjects' behavior, quantified using signed error (defined as detected targets minus actual targets; see Materials and methods). Targets in scenes with negative signed error (suggesting that modulated targets were missed due to lower top-down attentional focus) show a smaller increase in gamma power than those in scenes with positive signed error. This difference is significant based on a two-sample t-test [t(886) = 3.96, p = 8.06 × 10⁻⁵]. Conversely, salient events within negative signed error scenes showed a significantly higher increase in gamma than those in positive signed error scenes [t(886) = 4.32, p = 1.74 × 10⁻⁵], suggesting that lower top-down attention indicates higher bottom-up attention, and vice versa. A qualitatively similar result is obtained by grouping subjects' behavior by error size (absolute error) rather than signed error.

Given this push-pull competition between bottom-up and top-down attentional responses to tones in the attended rhythmic sequence, we examine similarities between neural loci engaged during these different phases of the neural response. Using the Brainstorm software package (Tadel et al., 2011), electrode activations are mapped onto brain surface sources using standardized low resolution brain electromagnetic tomography (sLORETA, see Materials and methods for details). This analysis of localized Gamma activity across cortical voxels examines brain regions uniquely engaged while attending to target tones or distracted by a salient event (relative to background activity of control tones).

We correlate the topography of these top-down and bottom-up brain voxels using sparse canonical correlation analysis (sCCA) (Roeber et al., 2003; Witten and Tibshirani, 2009; Lin et al., 2013) to estimate multivariate similarity between these brain networks at different time lags (Figure 5A, see Materials and methods for details). Canonical correlation analysis (CCA) is a form of multivariate correlation analysis in which high-dimensional data are compared in order to discover interpretable associations (or correlations) represented as data projections, called canonical vectors (Uurtio et al., 2018). Imposing sparsity constraints on this procedure improves the interpretability of these projections by confining the mappings to constrained vectors. We cross-correlate brain activation maps at different time lags, and consider that similar brain networks are engaged if a statistically significant correlation emerges from the canonical analysis. Figure 5B shows that a significant correlation between Gamma activity in brain voxels is observed about 1 s after tone onset, with bottom-up attention to salient events engaging these circuits about 0.5 s earlier relative to activation by top-down attention. The contoured area denotes statistically significant canonical correlations with p < 0.005, and highlights that the overlap in bottom-up and top-down brain networks is slightly offset in time (mostly off the diagonal axis) with an earlier activation by salient events. A closer look at the canonical vectors resulting from this correlation analysis reveals the topography of the brain networks contributing most to this correlation. Canonical vectors reflect the set of weights applied to each voxel map that results in maximal correlation, and can therefore be represented in voxel space. These canonical vectors show a stable pattern over time lags of significant correlation and reveal a topography with strong contributions of frontal and parietal brain regions. Figure 5C shows a representative profile of overlapped canonical vectors obtained from the sCCA analysis corresponding to the time lag shown with an asterisk in Figure 5B, and reveals the engagement of inferior/middle frontal gyrus (IFG/MFG) as well as the superior parietal lobule (SPL).
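To make the idea of sparse canonical vectors concrete, the following is a simplified sketch in the spirit of penalized matrix decomposition (Witten and Tibshirani, 2009): weight vectors are updated alternately, soft-thresholded to zero out weak entries, and renormalized. It is not the authors' implementation. The fixed soft-threshold values (in place of a search over an L1 bound), the single pair of data matrices (no time lags, no permutation test), and the synthetic data are all assumptions.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink entries of a toward zero by lam; entries below lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def unit(v):
    """Scale v to unit L2 norm (left unchanged if it is all zeros)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def sparse_cca(X, Y, lam_x, lam_y, n_iter=100):
    """First pair of sparse canonical vectors (u, v) and their correlation."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    K = X.T @ Y / X.shape[0]                 # cross-correlation matrix
    v = np.linalg.svd(K, full_matrices=False)[2][0]  # dense starting point
    for _ in range(n_iter):
        u = unit(soft_threshold(K @ v, lam_x))
        v = unit(soft_threshold(K.T @ u, lam_y))
    rho = float(np.corrcoef(X @ u, Y @ v)[0, 1])
    return u, v, rho

# Synthetic "maps": only the first 3 of 30 features share a latent signal.
rng = np.random.default_rng(3)
n, p = 200, 30
z = rng.standard_normal(n)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, p))
X[:, :3] = z[:, None] + 0.5 * rng.standard_normal((n, 3))
Y[:, :3] = z[:, None] + 0.5 * rng.standard_normal((n, 3))

u, v, rho = sparse_cca(X, Y, lam_x=0.3, lam_y=0.3)
top_u = set(np.argsort(-np.abs(u))[:3].tolist())
top_v = set(np.argsort(-np.abs(v))[:3].tolist())
print(f"rho = {rho:.2f}, largest weights in u: {sorted(top_u)}")
```

The non-zero entries of `u` and `v` play the role of the canonical vectors projected back into voxel space: they identify which features drive the shared correlation. Assessing significance, as done for Figure 5B, would additionally require recomputing `rho` on permuted data.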

Figure 5. Analysis of overlapping brain networks.

(A) Sparse canonical correlation analysis (sCCA) is applied to compare top-down (near target) $\hat{X}_T$ and bottom-up (near salient event) $\hat{X}_S$ activation maps. Activations at different time lags $\tau_S$ and $\tau_T$ are compared using sCCA, which yields a canonical correlation value $q$ that maximizes the correlation between linear transformations of the original maps: $q = \max_{w_S, w_T} w_S^{\top} \hat{X}_S^{\top} \hat{X}_T w_T$. A statistical significance (p-value) of the correlation value $q$ is also estimated at each computation lag using a permutation-based approach (see Materials and methods). (B) Canonical correlation values $q$ comparing neural activation patterns after tones near salient events (x-axis) and target tones (y-axis). The contour depicts all canonical correlations with statistical significance p < 0.005. (C) Projection of the canonical vector (mapping function) that yields maximal correlation between the response after salient events and the response after target tones (at the point shown with an asterisk in panel B). The red dashed lines are visual guides to highlight the earliest point of observed significant correlation as well as the time index of the correlation point indicated by an asterisk. The overlap is right-lateralized and primarily located within the superior parietal lobule (SPL), the inferior frontal gyrus (IFG), and the medial frontal gyrus (MFG).

Given the profound effects of bottom-up attention on neural responses, we examine the predictive power of changes in tone-locking and Gamma-energy modulations as biomarkers of auditory salience. We train a neural network classifier to infer whether or not a tone in the attended sequence is aligned with a salient distractor in the background scene. Figure 6 shows the classification accuracy for each neural marker, measured by the area under the ROC curve. Both Gamma and tone-locking yield significant predictions above chance [Gamma energy: 68.5% accuracy, t(9) = 4.12, p < 10⁻³; Tone-locking: 73% accuracy, t(9) = 6.03, p < 10⁻⁵]. Interestingly, the best accuracy is achieved when including both features [79% accuracy, t(9) = 7.20, p < 10⁻⁷], suggesting that Gamma-band energy and phase-locking may contribute complementary information regarding the presence of attention-grabbing salient events in the background. Furthermore, an estimate of the noise floor for this classification (see Materials and methods) yields a prediction range of 2%, which is below the improvement in accuracy observed from combining both features. In addition, interaction information (IF) across these features was assessed. IF is an information-theoretic metric that quantifies whether two features are complementary with respect to a class variable (Yeung, 1991; Matsuda, 2000; Shuai and Elhilali, 2014). The mutual information obtained using gamma energy and tone-locking jointly, I(F1, F2; S) = 0.65, is greater than the sum of the individual measures, I(F1; S) + I(F2; S) = 0.23 + 0.27, again suggesting a possible complementary role of both features as biomarkers of salience.
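The comparison between joint and individual mutual information can be illustrated on a toy discrete example. The sketch below estimates mutual information from co-occurrence counts and checks whether I(F1, F2; S) exceeds I(F1; S) + I(F2; S). The XOR labels, the use of bits as the unit, and the plug-in count estimator are assumptions; the source does not state how its features were discretized or in what units the values 0.65, 0.23, and 0.27 are expressed.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Toy example: the class label is the XOR of two binary features, so neither
# feature is informative alone but the pair determines the label exactly.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
s = [a ^ b for a, b in zip(f1, f2)]

i_f1 = mutual_information(f1, s)
i_f2 = mutual_information(f2, s)
i_joint = mutual_information(list(zip(f1, f2)), s)
synergy = i_joint - (i_f1 + i_f2)
print(f"I(F1;S)={i_f1:.2f}, I(F2;S)={i_f2:.2f}, I(F1,F2;S)={i_joint:.2f}")
```

A positive difference, as in the reported 0.65 versus 0.23 + 0.27, indicates that the two features carry complementary rather than redundant information about the class variable.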

Figure 6. Event prediction accuracy.

A neural network classifier is used to detect whether a tone in the attended sequence coincides with a salient event or not. The figure quantifies the average prediction accuracy (area under the ROC curve) resulting from training (and testing) the classifier using only high gamma band energy, only tone-locking energy, and both features. Error bars depict ±1 SEM. The noise floor is computed by shuffling feature values and labels (coincidence with salient tone).