Abstract
Interpreting neural activity through meaningful latent representations remains a complex and evolving challenge at the intersection of neuroscience and artificial intelligence. We investigate the potential of multimodal foundation models to align invasive brain recordings with natural language. We present SSENSE, a contrastive learning framework that projects single-subject stereo-electroencephalography (sEEG) signals into the sentence embedding space of a frozen CLIP model, enabling sentence-level retrieval directly from brain activity. SSENSE trains a neural encoder on spectral representations of sEEG using InfoNCE loss, without fine-tuning the text encoder. We evaluate our method on time-aligned sEEG and spoken transcripts from a naturalistic movie-watching dataset. Despite limited data, SSENSE achieves promising results, demonstrating that general-purpose language representations can serve as effective priors for neural decoding.
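The contrastive objective and retrieval step described above can be illustrated with a minimal NumPy sketch. It is not the SSENSE implementation. The function names (`info_nce`, `retrieve`), the temperature of 0.07, and the one-directional (brain-to-text) form of the loss are all assumptions made for illustration. The inputs stand in for the outputs of a trainable sEEG encoder and for fixed sentence embeddings from a frozen text encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(brain_emb, text_emb, temperature=0.07):
    """One-directional InfoNCE loss over a batch of N paired embeddings.

    brain_emb: (N, D) array, hypothetical output of the sEEG encoder.
    text_emb:  (N, D) array, hypothetical frozen sentence embeddings.
    Row i of brain_emb is the positive for row i of text_emb; all other
    rows in the batch act as negatives. The temperature is an assumed value.
    """
    b = l2_normalize(brain_emb)
    t = l2_normalize(text_emb)
    logits = b @ t.T / temperature                      # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching sentence (the diagonal) as the target.
    return float(-np.mean(np.diag(log_probs)))

def retrieve(brain_emb, text_emb):
    """Return, for each brain embedding, the index of the most similar sentence."""
    sims = l2_normalize(brain_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)
```

In this sketch only the brain embeddings would receive gradients during training, which mirrors the frozen-text-encoder setup; at test time, `retrieve` ranks candidate sentences by cosine similarity to each sEEG segment.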