Faculty from the Departments of Neuroscience and Computer Science at The University of Texas at Austin have published research on the non-invasive capture, recording, and interpretation of the dynamically evolving images produced by sequential brain scans of neurotypical adults as they listen to people relating tales from their lives in extended, unscripted narratives. The publication reports progress towards brain-computer interfaces that capture cerebral signals non-invasively and decode them, in real time, into continuous speech output that recognizably reconstructs the content of an individual’s inner speech. The reconstructions are generally close approximations at the informational level, with occasional exact reconstructions of sequences of words.
In one-hour sessions, subjects in the study sat for a total of 16 hours of brain scans using fMRI (functional magnetic resonance imaging) while they listened to extended spontaneous narratives. The data were arranged into ‘brain-scan movies’ of high spatial resolution but low temporal resolution: one fMRI image can be captured every two seconds, and each image reflects the brain’s response to about 20 words. The word sequences were aligned with the scanned images, and the resulting correlations were used to train an encoding model to predict how that subject’s brain would respond to novel natural-language utterances. With that in hand, the researchers then sought to run the process in reverse, producing continuous speech from sequences of brain scans. To this end, they introduce two AI-based components. The first is a large language model that, out of the many candidate word sequences initially suggested by a neural-activity decoder, rejects the majority as either linguistically impossible or semantically improbable. The second is a ‘beam search algorithm’ that predicts, from the word sequences that survive the previous step, the most likely next words. Brain activity images are then predicted for each candidate and compared with what is actually observed, again to reject poor candidates. Applied iteratively, these steps continually predict the most likely scan-associated word sequences over an arbitrarily long period.
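To make the decoding loop concrete, the following is a minimal, purely illustrative sketch of a beam search of this kind. It is not the authors’ code: the functions lm_propose_continuations() and encoding_model() are hypothetical stand-ins for the large language model and the trained encoding model, and the random ‘scans’ merely play the role of observed fMRI images.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["i", "said", "leave", "me", "alone", "scream", "cry", "run"]
N_VOXELS = 100     # hypothetical number of voxels per fMRI image
BEAM_WIDTH = 4     # candidate word sequences kept after each scan

def lm_propose_continuations(sequence, n=3):
    """Stand-in for the large language model: propose plausible next words."""
    return rng.choice(VOCAB, size=n, replace=False).tolist()

def encoding_model(sequence):
    """Stand-in for the subject-specific encoding model: predict the fMRI
    image (a voxel vector) that a candidate word sequence would evoke."""
    seed = abs(hash(tuple(sequence))) % (2**32)
    return np.random.default_rng(seed).standard_normal(N_VOXELS)

def match_score(candidate, observed_scan):
    """Cosine similarity between the predicted and the observed brain image."""
    predicted = encoding_model(candidate)
    return float(np.dot(predicted, observed_scan)
                 / (np.linalg.norm(predicted) * np.linalg.norm(observed_scan)))

def decode(observed_scans):
    """Iteratively extend candidate word sequences, keeping only those whose
    predicted brain responses best match each newly observed scan."""
    beam = [[]]                                  # start from an empty sequence
    for scan in observed_scans:
        candidates = [seq + [word]
                      for seq in beam
                      for word in lm_propose_continuations(seq)]
        # Keep the BEAM_WIDTH candidates that best explain the observed scan.
        candidates.sort(key=lambda c: match_score(c, scan), reverse=True)
        beam = candidates[:BEAM_WIDTH]
    return " ".join(beam[0])

# Toy usage: decode a run of five randomly generated 'observed' images.
toy_scans = [rng.standard_normal(N_VOXELS) for _ in range(5)]
print(decode(toy_scans))
```

The essential design choice reflected here is that the language model only proposes candidate continuations; it is the comparison between predicted and observed brain images that decides which candidates survive each step.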
The progress reported here is noteworthy. Currently, exact word matches run around 5%, the correct gist is captured around 60% of the time, and roughly a third of the output is still wrong. The following is an example. The narration heard by the subject: “I didn’t know whether to scream or cry or run away, so instead I said, leave me alone.” The decoded output: “I started to scream and cry, and then she said, ‘I told you to leave me alone.’” The authors include an initial discussion of the privacy implications of their findings.
For further reading: J. Tang, A. LeBel, S. Jain, et al., 2023. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience 26: 858–866.