Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

Abstract

Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, these semantic exemplar gestures are injected into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.