https://grisoon.github.io/INFP/

INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Yongming Zhu^*, Longhao Zhang^*, Zhengkun Rong^*, Tianshu Hu^*+, Shuang Liang, Zhipeng Ge
Bytedance
^*Equal Contribution  ^+Corresponding Author

We present INFP, an audio-driven interactive head generation framework for dyadic conversations. Given the dual-track audio of a dyadic conversation and a single portrait image of an arbitrary agent, our framework dynamically synthesizes verbal, non-verbal, and interactive agent videos with lifelike facial expressions and rhythmic head pose movements. The framework is also lightweight yet powerful, making it practical for instant-communication scenarios such as video conferencing. INFP denotes that our method is Interactive, Natural, Flash and Person-generic.

Abstract

Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing this, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that focus only on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method.

Objective Illustration

Figure: Existing interactive head generation (left) applies manual role assignment and explicit role switching. Our proposed INFP (right) is a unified framework that dynamically and naturally adapts to various conversational states.

Method Overview

Figure: Schematic illustration of INFP. The first stage (Motion-Based Head Imitation) learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the latent codes to animate a static image. The second stage (Audio-Guided Motion Generation) learns the mapping from the input dyadic audio to motion latent codes through denoising, achieving audio-driven interactive head generation.

Generated Videos

Motion Diversity
Our method generates motion-adapted synthesis results for the same reference image from different audio inputs.

Out-of-Distribution
Our method supports non-human realistic and side-face images.

Instant Communication
Thanks to the fast inference speed of INFP (over 40 fps on an Nvidia Tesla A10), our method achieves real-time agent-agent communication. Human-agent interaction is also enabled.
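To make the two-stage inference flow concrete, the following is a minimal, illustrative sketch of how an INFP-style streaming loop could be organized: an audio-conditioned denoiser produces a low-dimensional motion latent from the dual-track conversation audio, and a renderer animates the single portrait with that latent. All module names, dimensions, and the toy denoising schedule are assumptions for illustration; this is not the authors' released implementation.

```python
# Minimal, illustrative sketch of an INFP-style two-stage inference loop.
# All module names, dimensions, and the simple denoising schedule are
# assumptions for illustration only; they are NOT the released INFP code.

import torch
import torch.nn as nn

AUDIO_DIM, MOTION_DIM, STEPS = 128, 64, 8  # hypothetical sizes


class AudioToMotionDenoiser(nn.Module):
    """Stage 2 (Audio-Guided Motion Generation): predicts a cleaner motion
    latent from a noisy one, conditioned on the dual-track dyadic audio."""

    def __init__(self):
        super().__init__()
        # condition = agent-track + partner-track audio features + timestep
        self.net = nn.Sequential(
            nn.Linear(MOTION_DIM + 2 * AUDIO_DIM + 1, 256),
            nn.ReLU(),
            nn.Linear(256, MOTION_DIM),
        )

    def forward(self, noisy_motion, agent_audio, partner_audio, t):
        t_embed = t.expand(noisy_motion.shape[0], 1)
        x = torch.cat([noisy_motion, agent_audio, partner_audio, t_embed], dim=-1)
        return self.net(x)


class MotionRenderer(nn.Module):
    """Stage 1 (Motion-Based Head Imitation), decoder side: animates a static
    portrait feature with a low-dimensional motion latent code."""

    def __init__(self, portrait_dim=512):
        super().__init__()
        self.net = nn.Linear(portrait_dim + MOTION_DIM, 3 * 64 * 64)

    def forward(self, portrait_feat, motion_code):
        frame = self.net(torch.cat([portrait_feat, motion_code], dim=-1))
        return frame.view(-1, 3, 64, 64)  # toy low-resolution frame


@torch.no_grad()
def generate_frame(denoiser, renderer, portrait_feat, agent_audio, partner_audio):
    """One streaming step: denoise a motion latent from the current audio
    window, then render the next video frame from the single portrait."""
    motion = torch.randn(portrait_feat.shape[0], MOTION_DIM)
    for step in range(STEPS, 0, -1):  # crude iterative refinement
        t = torch.tensor([step / STEPS])
        motion = denoiser(motion, agent_audio, partner_audio, t)
    return renderer(portrait_feat, motion)


if __name__ == "__main__":
    denoiser, renderer = AudioToMotionDenoiser(), MotionRenderer()
    portrait = torch.randn(1, 512)            # encoded reference portrait
    agent_track = torch.randn(1, AUDIO_DIM)   # audio window, agent side
    partner_track = torch.randn(1, AUDIO_DIM) # audio window, partner side
    frame = generate_frame(denoiser, renderer, portrait, agent_track, partner_track)
    print(frame.shape)  # torch.Size([1, 3, 64, 64])
```

The point of the sketch is the division of labor: because the denoiser operates on a low-dimensional motion latent rather than on pixels, the per-frame generation step can stay cheap, which is consistent with the lightweight, real-time behavior claimed above; the actual throughput of INFP of course depends on its real architecture, not on this toy code.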
Comparison to SOTA Methods

Interactive Head Generation
Unlike existing methods, which must explicitly and manually switch roles between Listener and Speaker, our method dynamically adapts to various conversational states, leading to smoother and more natural results. INFP also adapts naturally to related tasks such as talking head and listening head generation without any modification.

Talking Head Generation
Our method generates results with high lip-sync accuracy, expressive facial expressions, and rhythmic head pose movements. It also supports singing and different languages.

Listening Head Generation
Our method generates results with high fidelity, natural facial behaviors, and diverse head motions. Note that all results except ours are taken from the homepage of DIM.

Ethics Concerns

This work is intended for research purposes only. The images and audio used in these demos come from public sources. If there are any concerns, please contact us (hutianshu007@gmail.com) and we will remove the material promptly.

Acknowledgement

This website is built using the Academic Project Page Template. We would like to thank Nerfies for providing the source code.