Hearing-impaired listeners' auditory, visual, and audio-visual
word recognition speed: Predicting individual differences
 
 

Philip F. Seitz and Ken W. Grant
Army Audiology and Speech Center, Walter Reed Army Medical Center
 
 

[Poster paper presented at the NIH-NIDCD/VA Second Biennial Hearing Aid Research and Development Conference,
Bethesda, MD, September 22-24, 1997]


OBJECTIVES
In older people with acquired sensorineural hearing loss, investigate factors affecting recognition speed for auditory (A), visual (V), and audio-visual (AV) spoken words. Address the following specific questions:

• Are there significant recognition speed differences among A, V, and AV words?

• Are individual differences in unimodal (A, V) and bimodal (AV) recognition speed predicted by audiological factors and age?

• Is there "AV benefit" with respect to recognition speed, and if so, what factors predict AV speed benefit for individuals?

BACKGROUND
The perceptual and cognitive processes involved in spoken word recognition require time in which to operate. Abnormally slow word recognition might be a factor in the speech understanding difficulties experienced by hearing-impaired patients, and might limit the benefit they derive from hearing aids. Given limitations of working memory capacity (Baddeley, 1986; Norman and Bobrow, 1975), the receiver of a spoken sentence must process the signal fast enough to keep up with the continuous flow of phonetic information, arriving at word percepts not much later than when the next word starts. Otherwise, the processing operations and memory requirements for earlier-heard input could interfere with those of later-heard input. Thus, because sentence comprehension depends on the perceiver's ability to retain earlier-heard information in memory while receiving the rest of the utterance, abnormally slow word recognition could limit listeners' ability to understand sentences. The difficulties with sentences produced at faster rates experienced by older listeners (Working Group on Speech Understanding and Aging, 1988; Gordon-Salant and Fitzgibbons, 1997) and by listeners with hearing loss (Picheny et al., 1985; Uchanski et al., 1996) might be traceable in part to abnormally slow word recognition.

Visual speech cues typically afford large intelligibility improvements for people with hearing loss (Erber, 1975; Walden et al., 1974, 1975) and for people with normal hearing listening to degraded signals (Grant and Walden, 1996; MacLeod and Summerfield, 1987; Sumby and Pollack, 1954). Comprehension of spoken discourse also appears to require less attention and effort when presented audio-visually than auditorily (Reisberg et al., 1987). However, perceptual encoding speeds for A, V, and AV speech have not been compared previously. Based on observations of 1) faster choice reaction time to phonetically richer than to phonetically poorer speech signals (e.g., natural vs. synthetic speech [Pisoni and Greene, 1990]), and 2) faster choice reaction time to bimodal than to unimodal non-speech stimuli (e.g., Hughes et al., 1994), it would not be surprising to find faster perceptual encoding of AV than A spoken words. However, in a choice RT experiment using synthetic speech syllables, Massaro and Cohen (1983) found no speed "benefit" for AV versus A stimuli.

APPROACH
Task:
Sternberg "memory scanning": the subject is presented serially with a "memory set" of 1-4 spoken words, then with a "probe" word, to which he or she makes a speeded YES/NO button response indicating whether the probe is a member of the memory set (Sternberg, 1966, 1975). See Figure 1.

[Figure 1]

Fig. 1. Schematic representation of a typical memory-scanning trial. The trial depicted has a memory set size of 3 and a correct YES response.
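
To make the trial structure concrete, the following minimal Python sketch runs one memory-scanning trial. It is illustrative only: the word pool is the 10-word stimulus set listed under Stimuli below, but the timing, response collection, and probe selection here are simplified assumptions, not the actual experimental software.

```python
# Illustrative sketch of one memory-scanning trial (assumptions: console input,
# 50% YES trials, uniform sampling of the memory set and probe).
import random
import time

WORD_POOL = ["bread", "ear", "felt", "jump", "live", "owl", "pie", "star", "three", "wool"]

def run_trial(set_size):
    """Present a memory set of `set_size` words, then a probe; return (correct, rt_seconds)."""
    memory_set = random.sample(WORD_POOL, set_size)
    if random.random() < 0.5:                       # probe drawn from the memory set (YES trial)
        probe, is_member = random.choice(memory_set), True
    else:                                           # probe drawn from the remaining words (NO trial)
        probe, is_member = random.choice([w for w in WORD_POOL if w not in memory_set]), False
    print("Memory set:", ", ".join(memory_set))
    t0 = time.monotonic()
    answer = input(f"Probe: {probe} -- in the set? (y/n): ").strip().lower() == "y"
    rt = time.monotonic() - t0
    return answer == is_member, rt

if __name__ == "__main__":
    correct, rt = run_trial(set_size=3)             # set size varied from 1 to 4 in the study
    print(f"correct={correct}, RT={rt:.3f} s")
```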

Design:
All within-subjects.

Dependent Variables:
• Perceptual Encoding Speed, represented by the intercept of the memory set size-reaction time (RT) function, as fit to a linear model;

• AV Benefit, represented by the difference between RT_A and RT_AV.

Independent Variables:
• Modality (A, V, AV) of spoken word stimuli;

• Subject variables: age, low-PTA (.5, 1, 2 kHz), high-PTA (3, 4 kHz), and clinically-measured word recognition;

• Derived measure: absolute encoding speed difference between V and A (|RT_V - RT_A|), called "V-A Synchrony"

METHODS
Subjects:
N = 26; mean age = 66.2 y., std. dev. = 6.1;
mean 0.5, 1, 2 kHz average audiometric threshold = 37 dB HL, std. dev. = 11.6 (see Figure 2).
 

[Figure 2. Subjects' audiometric thresholds.]

 

 Stimuli:
Two tokens of each of 10 words from the CID W-22 list, spoken by one talker, professionally video recorded on optical disc:

bread, ear, felt, jump, live, owl, pie, star, three, wool

This set of words was selected so as to allow 100% accurate closed-set identification in A, V, and AV modalities.

Procedure:
• After passing a 20/30 vision screening test, subjects established their most-comfortable listening levels (MCL) for monotic, better-ear, linear-amplified presentation of the spoken words under an Etymotic ER-3A insert earphone. All subsequent auditory testing was done at individual subjects' MCLs.

• In a screening test, subjects had to demonstrate perfect recognition of the 20 stimuli (10 words × 2 tokens) in A, V, and AV modalities, given advance knowledge of the words. Five subjects out of 33 recruited (15%) did not pass the V (lipreading) part of this screening.

• After practice blocks in each modality, subjects completed three 80-trial A, V, and AV test blocks in counterbalanced order over three 2-hour sessions, providing 240 responses per modality and 60 responses per memory set size within a modality.

• For each subject, 12 RT means were calculated (3 modalities × 4 memory set sizes). Incorrect responses were discarded, as were responses more than 2.5 standard deviations greater than the mean for each of the 12 cells (Ratcliff, 1993). Least-squares lines were fitted to the four data points representing each subject's memory set size-RT function. Two subjects out of 28 were dropped because of excessive errors or excessively long RT in testing. See Figure 3 for overall results.
 

[Figure 3. Overall memory set size-RT results for A, V, and AV words.]
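
The RT reduction just described can be summarized in a short sketch. It assumes a per-trial record with fields modality, set_size, correct, and rt_ms; these field names and the list-of-dicts layout are assumptions for illustration, not the authors' analysis code.

```python
# Sketch of the per-subject RT reduction: outlier trimming, 12 cell means,
# and a least-squares line per modality whose intercept indexes encoding speed.
import numpy as np

def cell_mean(rts):
    """Mean RT after discarding responses more than 2.5 SD above the cell mean (Ratcliff, 1993)."""
    rts = np.asarray(rts, dtype=float)
    return rts[rts <= rts.mean() + 2.5 * rts.std(ddof=1)].mean()

def encoding_speed(trials, modality, set_sizes=(1, 2, 3, 4)):
    """Intercept of the least-squares memory set size-RT line for one modality (correct trials only)."""
    means = [cell_mean([t["rt_ms"] for t in trials
                        if t["modality"] == modality and t["set_size"] == n and t["correct"]])
             for n in set_sizes]
    slope, intercept = np.polyfit(set_sizes, means, deg=1)
    return intercept

# Derived per-subject measures used later in the poster:
#   AV benefit    = RT_A - RT_AV     (difference of intercepts)
#   V-A synchrony = |RT_V - RT_A|
```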



RESULTS AND DISCUSSION
Modality-Related Differences in Perceptual Encoding Speed:
Figure 4 shows that there were significant perceptual encoding speed differences due to modality. The intercept of the linear model of a subject's memory set size-RT function is assumed to represent the sum of encoding speed and motor execution speed (Sternberg, 1975). Motor execution speed is assumed to be constant within a subject, so it is relative differences among conditions within a subject that indicate factor effects. Wilcoxon signed-ranks tests (2-tailed) were performed on the intercepts for each modality pairing, and all comparisons were significant.
 

[Figure 4. Memory set size-RT intercepts (perceptual encoding speed) by modality.]
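
A sketch of how the pairwise modality comparisons could be computed from the per-subject intercepts, using SciPy's Wilcoxon signed-rank test (two-tailed by default); the `intercepts` dictionary layout is an assumption.

```python
# Pairwise Wilcoxon signed-rank tests on per-subject intercepts for A, V, and AV.
from itertools import combinations
from scipy.stats import wilcoxon

def compare_modalities(intercepts):
    """intercepts: dict mapping "A", "V", "AV" to equal-length lists of per-subject intercepts (ms)."""
    for m1, m2 in combinations(["A", "V", "AV"], 2):
        stat, p = wilcoxon(intercepts[m1], intercepts[m2])   # two-sided by default
        print(f"{m1} vs {m2}: W = {stat:.1f}, p = {p:.4f}")
```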

 

Encoding Speed Correlations Among Modalities:
Figures 5-7 show that there are strong predictive relations among subjects' encoding speeds for words received in the three modalities.  (In these and following figures, results of 2-tailed Pearson and Kendall's tau-b correlation tests are noted.)  Although part of the pattern of individual differences seen here is due to constant motor execution speed differences, the larger part is likely due to individuals' characteristic speeds for perceptually encoding speech signals. This result lends new support, by way of an RT measure, to Watson et al.'s (1996) and others' claims that there is a central, modality-independent source of individual differences in speech recognition.




[Figures 5-7. Encoding speed correlations between pairs of modalities (A, V, AV).]
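
For reference, a sketch of the two correlation tests noted in these figures, computed with SciPy; the array names are illustrative.

```python
# Two-tailed Pearson r and Kendall's tau-b between encoding speeds in two modalities.
from scipy.stats import pearsonr, kendalltau

def correlate(x, y, label):
    r, p_r = pearsonr(x, y)
    tau, p_tau = kendalltau(x, y)        # SciPy computes the tau-b variant
    print(f"{label}: r = {r:.2f} (p = {p_r:.3f}), tau-b = {tau:.2f} (p = {p_tau:.3f})")

# e.g., correlate(rt_a, rt_v, "A vs V"); correlate(rt_a, rt_av, "A vs AV"); correlate(rt_v, rt_av, "V vs AV")
```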

 

 

 

 

 

 

 




Audiological Factors and Encoding Speed:
Table 1 shows that the only subject variable significantly correlated with encoding speed was age. Age had a moderate predictive relation to V encoding speed, as Figure 8 shows. However, when V and A cues are combined in AV perception, older perceivers no longer appear to be slower. It is notable that the audiometric threshold measures did not predict RT to auditory words, as might be expected based on earlier studies of congenitally hearing-impaired listeners (Seitz and Rakerd, 1996).
   

 

             Age     LO-PTA  HI-PTA  WRec    RT_A    RT_V    RT_AV   |RT_V-RT_A|
Age          1.0
LO-PTA       -.31    1.0
HI-PTA        .19     .41*   1.0
WRec         -.12    -.37    -.40*   1.0
RT_A          .28     .11     .12    -.20    1.0
RT_V          .54**  -.14    -.02     .07     .72**  1.0
RT_AV         .33    -.03     .06    -.06     .92**   .86**  1.0
|RT_V-RT_A|   .47**  -.32    -.16     .32    -.10     .63**   .19    1.0
RT_A-RT_AV    .09     .35     .15    -.37     .36    -.22    -.04    -.69**

Table 1. Correlations among subject variables, RT intercepts, and derived measures.
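
A correlation table of this kind could be assembled as in the sketch below (pandas; the column names are assumptions). Pearson correlation is used here because the r² quoted for Figure 9 matches the squared value of the -.69 entry above; the original table's exact method is not stated.

```python
# Build a Table 1-style correlation matrix from a per-subject data frame.
import pandas as pd

def correlation_table(df):
    """df: one row per subject, columns Age, LO_PTA, HI_PTA, WRec, RT_A, RT_V, RT_AV, VA_sync, AV_benefit."""
    return df.corr(method="pearson").round(2)
```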

 

[Figure 8. V encoding speed as a function of age.]

 

 

Audio-Visual Encoding Speed Benefit and Its Predictor, "V-A Synchrony":
It is clear from Figures 3 and 7 that there is an AV benefit with respect to perceptual encoding speed for spoken words. A difficult and interesting challenge is to explain individual differences in AV benefit. The present data point to a construct that can be called V-A synchrony as a predictor of AV benefit. V-A synchrony is simply the absolute difference between an individual's V and A encoding speeds. Figure 9 shows that the V-A synchrony construct accounts for a rather large portion of the variability in AV benefit in the present data (r² = 0.49). Apparently, in order to obtain AV encoding speed benefit, it is helpful for an individual's A and V unimodal encoding speeds to be similar. Analysis of data from 15 of the present study's subjects who also participated in a study of AV benefit in consonant identification (Grant et al., 1998) suggests that V-A encoding speed synchrony may also be an effective predictor of individual differences in AV benefit in speech identification tasks.
 

[Figure 9. AV encoding speed benefit (RT_A - RT_AV) as a function of V-A synchrony.]
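
The relation in Figure 9 amounts to a simple regression of AV benefit on V-A synchrony. A sketch under the same assumptions as above (arrays of per-subject intercepts):

```python
# Regress AV benefit (RT_A - RT_AV) on V-A synchrony (|RT_V - RT_A|) and report r^2.
import numpy as np
from scipy.stats import pearsonr

def av_benefit_regression(rt_a, rt_v, rt_av):
    rt_a, rt_v, rt_av = map(np.asarray, (rt_a, rt_v, rt_av))
    synchrony = np.abs(rt_v - rt_a)
    benefit = rt_a - rt_av
    slope, intercept = np.polyfit(synchrony, benefit, deg=1)
    r, _ = pearsonr(synchrony, benefit)
    return slope, intercept, r ** 2      # the poster reports r^2 = 0.49 for these data
```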

 

CONCLUSIONS
• There were significant perceptual encoding speed differences related to modality, suggesting potential modality-related advantages and difficulties in recognition of fast speech and connected speech.

• There were strong inter-modal encoding speed correlations, lending support from a reaction time perspective to the idea of a central, modality-independent source of individual differences in speech perception.

• Older subjects tended to be slow at V encoding, but this did not affect their AV encoding speed. There were no other significant correlations among audiological and RT measures.

• There was a highly significant AV benefit with respect to perceptual encoding speed.

• The construct V-A synchrony predicted individual differences in degree of AV encoding speed benefit.

REFERENCES
Baddeley, A.D. (1986). Working Memory (Oxford University Press, Oxford, UK).

Erber, N.P. (1975). "Auditory-visual perception of speech," J. Speech Hearing Res. 40, 481-492.

Gordon-Salant, S., and Fitzgibbons, P.J. (1997). "Selected cognitive factors and speech recognition performance among young and elderly listeners," J. Speech Hear. Res. 40, 423-431.

Grant, K.W., and Walden, B.E. (1996). "Evaluating the articulation index for auditory-visual consonant recognition," J. Acoust. Soc. Am. 100, 2415-2424.

Grant, K.W., Walden, B.E., and Seitz, P.F. (1998). "Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration," J. Acoust. Soc. Am. 103, 2677-2690.

Hughes, H.C., Reuter-Lorenz, P.A., Nozawa, G., and Fendrich, R. (1994). "Visual-auditory interactions in sensorimotor processing: Saccades versus manual responses," J. Exp. Psychol.: Human Percept. Perform. 20, 131-153.

Luce, R.D. (1986). Response Times: Their Role in Inferring Elementary Mental Organization (Oxford University Press, New York).

Massaro, D.W., and Cohen, M.M. (1983). "Categorical or continuous speech perception: A new test," Speech Commun. 2, 15-35.

MacLeod, A., and Summerfield, Q. (1987). "Quantifying the contribution of vision to speech perception in noise," British J. Audiol. 21, 131-141.

Norman, D.A., and Bobrow, D.G. (1975). "On data-limited and resource-limited processes," Cognitive Psychol. 7, 44-64.

Picheny, M.A., Durlach, N.I., and Braida, L.D. (1985). "Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech," J. Speech Hear. Res. 28, 96-103.

Pisoni, D.B., and Greene, B.G. (1990). "The role of cognitive factors in the perception of synthetic speech," in Research on Speech Perception Progress Report No. 16 (Indiana University), pp. 193-214.

Ratcliff, R. (1993). "Methods for dealing with reaction time outliers," Psychol. Bull. 114, 510-532.

Reisberg, D., McLean, J., and Goldfield, A. (1987). "Easy to hear but hard to understand: A lipreading advantage with intact auditory stimuli," in Hearing by Eye: The Psychology of Lipreading, edited by B. Dodd and R. Campbell (Lawrence Erlbaum Assoc., Hillsdale, NJ), pp. 97-114.

 

Seitz, P.F., and Rakerd, B. (1996). "Hearing impairment and same-different reaction time," J. Acoust. Soc. Am. 99, 2602.

Sternberg, S. (1966). "High-speed scanning in human memory," Science 153, 652-654.

Sternberg, S. (1975). "Memory scanning: New findings and current controversies," Q. J. Exp. Psychol. 27, 1-32.

Sumby, W.H., and Pollack, I. (1954). "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Am. 26, 212-215.

Uchanski, R.M., Choi, S.S., Braida, L.D., Reed, C.M., and Durlach, N.I. (1996). "Speaking clearly for the hard of hearing IV: Further studies on the role of speaking rate," J. Speech Hear. Res. 39, 494-509.

Working Group on Speech Understanding and Aging (Committee on Hearing, Bioacoustics and Biomechanics, National Research Council, Washington, DC) (1988). "Report on speech understanding and aging," J. Acoust. Soc. Am. 88, 859-893.

Walden, B.E., Prosek, R.A., and Worthington, D.W. (1974). "Predicting audiovisual consonant recognition performance of hearing-impaired adults," J. Speech Hear. Res. 21, 5-36.

Walden, B.E., Prosek, R.A., and Worthington, D.W. (1975). "Auditory and audiovisual feature transmission in hearing-impaired adults," J. Speech Hear. Res. 18, 272-280.

Watson, C.S., Qiu, W.W., Chamberlain, M.M., and Li, X. (1996). "Auditory and visual speech perception: Confirmation of a modality-independent source of individual differences in speech recognition," J. Acoust. Soc. Am. 100, 1153-1162.

ACKNOWLEDGMENT
Supported by research grant numbers R29 DC 01643 and R29 DC 00792 from the National Institute on Deafness and Other Communication Disorders, National Institutes of Health.  This research was carried out as part of an approved Walter Reed Army Medical Center, Department of Clinical Investigation research study protocol, Work Unit 2553, entitled "Scanning Memory for Auditory, Visual, and Audiovisual Spoken Words." The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or reflecting the views of the Department of the Army or the Department of Defense.

CONTACT INFORMATION
Ken W. Grant, Ph.D.
Army Audiology and Speech Center
Walter Reed Army Medical Center
Washington, DC  20307-5001
(202) 782-8596
grant@tidalwave.net