Editing in progress (not finished yet)
This project focuses on audio-driven gesture generation, producing gestures as 3D keypoints.
Input: audio, text, gesture, etc. → Output: gesture motion
Gesture generation is the task of producing gestures from speech or text. The goal is to generate gestures that are natural, realistic, and appropriate for the given context; the generated gestures can then be used to animate virtual characters, robots, or embodied conversational agents.
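To make the input/output contract concrete, here is a minimal NumPy sketch of the mapping the task defines. All names and sizes (frame count, MFCC dimension, joint count) are illustrative assumptions, and the linear map stands in for a learned generator, not any specific model from the papers below.

```python
import numpy as np

# Hypothetical shapes -- assumptions for illustration, not from any paper.
T = 120        # frames of motion (e.g. 4 s at 30 fps)
F_AUDIO = 26   # per-frame audio features (e.g. MFCCs)
N_JOINTS = 25  # skeleton joints; output is 3D keypoints per joint

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(T, F_AUDIO))

# A deliberately trivial frame-wise linear mapping audio -> pose,
# standing in for the learned generator g: audio -> gesture motion.
W = rng.normal(size=(F_AUDIO, N_JOINTS * 3)) * 0.01
poses = (audio_feats @ W).reshape(T, N_JOINTS, 3)

print(poses.shape)  # (120, 25, 3): T frames of 3D keypoints
```

Real systems replace the linear map with the architectures listed under "Approaches" (RNNs, Transformers, diffusion models, ...), but the interface stays the same: a time-aligned audio (and optionally text) stream in, a sequence of 3D poses out.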
ACM CCS: • Human-centered computing → Human computer interaction (HCI).
Papers by folder: 📁/survey | 📁/approach | 📁/papers | 📁/dataset | 📁/books |
Main resource
【EUROGRAPHICS 2023】A Comprehensive Review of Data-Driven Co-Speech Gesture Generation; [paper]
2014 - Gesture and speech in interaction: An overview ; [paper]
Method (Team*) | Paper | Video | 🏆 |
---|---|---|---|
FineMotion | 【ICMI 2023】The FineMotion entry to the GENEA Challenge 2023: DeepPhase for conversational gestures generation [paper] | [youtube] | |
Gesture Motion Graphs | 【ICMI 2023】Gesture Motion Graphs for Few-Shot Speech-Driven Gesture Reenactment [paper] | [youtube] | |
Diffusion-based | 【ICMI 2023】Diffusion-based co-speech gesture generation using joint text and audio representation [paper] | [youtube] | |
UEA Digital Humans | 【ICMI 2023】The UEA Digital Humans entry to the GENEA Challenge 2023 [paper] ; [JonathanPWindle/UEA-DH-GENEA23] | [youtube] | |
FEIN-Z | 【ICMI 2023】FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation [paper] | [youtube] | |
DiffuseStyleGesture+ | 【ICMI 2023】The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 [paper] | [youtube] | 🏆 |
Discrete Diffusion | 【ICMI 2023】Discrete Diffusion for Co-Speech Gesture Synthesis [paper] | [youtube] | |
KCL-SAIR | 【ICMI 2023】The KCL-SAIR team’s entry to the GENEA Challenge 2023: Exploring Role-based Gesture Generation in Dyadic Interactions: Listener vs. Speaker [paper] | [youtube] | |
Gesture Generation | 【ICMI 2023】Gesture Generation with Diffusion Models Aided by Speech Activity Information [paper] | [youtube] | |
Co-Speech Gesture Generation | 【ICMI 2023】Co-Speech Gesture Generation via Audio and Text Feature Engineering [paper] | [youtube] | |
DiffuGesture | 【ICMI 2023】DiffuGesture: Generating Human Gesture From Two-person Dialogue With Diffusion Models [paper] | [youtube] | |
KU-ISPL | 【ICMI 2023】The KU-ISPL entry to the GENEA Challenge 2023-A Diffusion Model for Co-speech Gesture generation [paper] | [youtube] |
Papers | Video | 🏆 |
---|---|---|
【ICMI 2023】 MultiFacet A Multi-Tasking Framework for Speech-to-Sign Language Generation [paper] | ||
【ICMI 2023】 Look What I Made It Do - The ModelIT Method for Manually Modeling Nonverbal Behavior of Socially Interactive Agents [paper] | ||
【ICMI 2023】 A Methodology for Evaluating Multimodal Referring Expression Generation for Embodied Virtual Agents [paper] | ||
【ICMI 2023】 Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent [paper]; [aldelb/non_verbal_facial_animation] | 🏆 |
Team (Method) | Paper | Video | 🏆 |
---|---|---|---|
DeepMotion | 【ICMI 2022】The DeepMotion entry to the GENEA Challenge 2022 [paper] | [youtube] | |
DSI | 【ICMI 2022】Hybrid Seq2Seq Architecture for 3D Co-Speech Gesture Generation [paper] | [youtube] | |
FineMotion | 【ICMI 2022】ReCell: replicating recurrent cell for auto-regressive pose generation [paper] [FineMotion/GENEA_2022] | [youtube] | |
Forgerons | 【ICMI 2022】Ubisoft Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022 [paper] | [youtube] | |
GestureMaster | 【ICMI 2022】GestureMaster: Graph-based Speech-driven Gesture Generation [paper] | [youtube] | |
IVI Lab | 【ICMI 2022】The IVI Lab entry to the GENEA Challenge 2022 – A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism [paper] [Tacotron2-SpeechGesture] | [youtube] | 🏆 |
ReprGesture | 【ICMI 2022】The ReprGesture entry to the GENEA Challenge 2022 [paper] [YoungSeng/ReprGesture] | [youtube] | |
TransGesture | 【ICMI 2022】TransGesture: Autoregressive Gesture Generation with RNN-Transducer [paper] | [youtube] | |
UEA Digital Humans | 【ICMI 2022】UEA Digital Humans entry to the GENEA Challenge 2022 [paper] [UEA/GENEA22] | [youtube] |
Papers | Video | 🏆 |
---|---|---|
【ICMI 2022】 Understanding Interviewees’ Perceptions and Behaviour towards Verbally and Non-verbally Expressive Virtual Interviewing Agents [paper] | [youtube] | |
【ICMI 2022】 Emotional Respiration Speech Dataset [paper] | [youtube] | |
【ICMI 2022】 Automatic facial expressions, gaze direction and head movements generation of a virtual agent [paper] | [youtube] | 🏆 |
【ICMI 2022】 Can you tell that I’m confused? An overhearer study for German backchannels by an embodied agent [paper] | [youtube] |
Papers | Video | 🏆 |
---|---|---|
【ICMI 2021】 Probabilistic Human-like Gesture Synthesis from Speech using GRU-based WGAN [paper] [wubowen416/gesture-generation-using-WGAN] | [youtube] | 🏆 |
【ICMI 2021】 Influence of Movement Energy and Affect Priming on the Perception of Virtual Characters Extroversion and Mood [paper] | ❌ | |
【ICMI 2021】 Crossmodal clustered contrastive learning: Grounding of spoken language to gesture [paper] [dondongwon/CC_NCE_GENEA] | [youtube] |
Papers | Video |
---|---|
【IVA 2020】 The StyleGestures entry to the GENEA Challenge 2020 [paper] ; [simonalexanderson/StyleGestures] | [youtube] |
【IVA 2020】 The FineMotion entry to the GENEA Challenge 2020 [paper] ; [FineMotion/GENEA_2020] | [youtube] |
【IVA 2020】 Double-DCCCAE: Estimation of Sequential Body Motion Using Wave-Form - AlltheSmooth [paper] | [youtube] |
【IVA 2020】 CGVU: Semantics-guided 3D Body Gesture Synthesis [paper] | [youtube] |
【IVA 2020】 Interpreting and Generating Gestures with Embodied Human Computer Interactions [paper] | [youtube] |
【IVA 2020】 The Nectec Gesture Generation System entry to the GENEA Challenge 2020 [paper] | [youtube] |
[1994] Rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents [paper]
Speech to sequence gesture
Probabilistic model of speech to gesture
Probabilistic model of personal style
Neural classification model of personal style
This section is not yet accurate; editing continues.
MLP (Multilayer perceptron)
RNN (Recurrent Neural Networks)
CNN (Convolutional Networks)
Transformers
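As a reference point for the architecture families above, a minimal MLP regressor can be sketched in a few lines. This is an illustrative NumPy toy (all sizes and weights are assumptions), showing frame-wise regression from audio features to a flattened pose vector; it is not the implementation of any listed paper.

```python
import numpy as np

# Illustrative sizes -- assumptions, not from any specific paper.
rng = np.random.default_rng(1)
F_IN, H, F_OUT = 26, 64, 75  # 25 joints * 3 coords = 75 output dims

W1, b1 = rng.normal(size=(F_IN, H)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(H, F_OUT)) * 0.1, np.zeros(F_OUT)

def mlp(x):
    """One hidden layer with ReLU, linear output: pose regression."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

frame = rng.normal(size=F_IN)   # one frame of audio features
pose = mlp(frame)
print(pose.shape)  # (75,)
```

RNNs, CNNs, and Transformers replace this frame-wise map with models that share information across time, which is what lets generated gestures stay temporally coherent.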
Generative models (not yet accurate; editing continues)
Normalising Flows
WGAN
VAEs
Learnable noise codes
CaMN BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis [paper] ; [PantoMatrix/BEAT]
Diffusion
Periodic autoencoders (DeepPhase)
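Several of the entries above are diffusion-based. The core forward (noising) step they share, in the standard DDPM formulation, can be sketched as follows; the schedule values and tensor sizes here are illustrative assumptions, not taken from any listed paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear noise schedule (illustrative values).
T_STEPS = 1000
betas = np.linspace(1e-4, 0.02, T_STEPS)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product, decreasing in t

x0 = rng.normal(size=(120, 75))       # clean motion: frames x pose dims
t = 500                               # an intermediate diffusion timestep
eps = rng.normal(size=x0.shape)       # Gaussian noise

# Closed-form sample from q(x_t | x_0):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
print(x_t.shape)  # (120, 75)
```

Training then fits a network to predict `eps` from `x_t`, `t`, and the speech conditioning; generation runs the reverse process from pure noise to a motion sequence.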
Text to Gesture
Others (Uncategorized)
【ICMI 2022】 GestureMaster: Graph-based Speech-driven Gesture Generation [paper]
Full name | Description |
---|---|
Adversarial Loss (Adv) | Used in Generative Adversarial Networks (GANs), this loss function pits a generator network against a discriminator network, with the goal of the generator producing samples that can fool the discriminator into thinking they are real. |
Categorical Cross Entropy (CCE) | A common loss function used in multi-class classification tasks, where the goal is to minimize the difference between the predicted and true class labels. |
Cross-modal Cluster Noise Contrastive Estimation (CC-NCE) | Used in multimodal learning to learn joint representations across different modalities, this loss function maximizes the similarity between matching modalities while minimizing the similarity between non-matching modalities. |
Edge Transition Cost (ETC) | Used in graph-based image segmentation, this loss function measures the similarity between adjacent pixels in an image to preserve the coherence and smoothness of segmented regions. |
Expectation Maximization (EM) | Used for maximum likelihood estimation when dealing with incomplete or missing data, this algorithm involves computing the expected likelihood of the missing data and updating model parameters to maximize the likelihood of the observed data given the expected values. |
Geodesic Distance (GeoD) | Used in deep learning for image segmentation, this loss function penalizes the discrepancy between the predicted segmentation map and the ground truth, while also considering the spatial relationships between different image regions. |
Wasserstein-GAN Gradient Penalty (WGAN-GP) | An extension of the Wasserstein GAN algorithm that adds a gradient penalty term to the loss function, used to enforce the Lipschitz continuity constraint and ensure stability during training. |
Hamming Distance (Hamm) | Used in information theory, this metric measures the number of positions at which two strings differ. |
Huber Loss (Huber) | A robust loss function used in regression tasks that is less sensitive to outliers than the Mean Squared Error (MSE) loss. |
Imitation Reward (IR) | Used in imitation learning to train a model to mimic the behavior of an expert agent, by providing a reward signal based on how closely the model’s behavior matches that of the expert. |
Kullback–Leibler Divergence (KL) | Used to measure the difference between two probability distributions, this loss function is commonly used in probabilistic models and deep learning for regularization and training. |
L2 Distance (L2) | Measures the Euclidean distance between two points in space, commonly used in regression tasks. |
Mean Absolute Error (MAE) | A loss function used in regression tasks that measures the average difference between the predicted and true values. |
Maximum Likelihood Estimation (MLE) | A statistical method used to estimate the parameters of a probability distribution that maximize the likelihood of observing the data. |
Mean Squared Error (MSE) | A common loss function used in regression tasks that measures the average squared difference between the predicted and true values. |
Negative Log-likelihood (NLL) | Used in probabilistic models to maximize the likelihood of the observed data by minimizing the negative log-likelihood. |
Structural Similarity Index Measure (SSIM) | Used in image processing to measure the similarity between two images based on their luminance, contrast, and structural content. |
Task Reward (TR) | Used in reinforcement learning to provide a reward signal to an agent based on its performance in completing a given task. |
Variance (Var) | A statistical metric used to measure the variability of a set of data points around their mean. |
Within-cluster Sum of Squares (WCSS) | Used in cluster analysis to measure the variability of data points within a cluster by computing the sum of squared distances between each data point and the mean of the cluster. |
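For the simple regression losses in the table (MSE, MAE, Huber), reference NumPy implementations are short enough to state directly. These are textbook definitions, sketched for illustration; the example values below are arbitrary.

```python
import numpy as np

def mse(pred, true):
    """Mean Squared Error: average squared difference."""
    return np.mean((pred - true) ** 2)

def mae(pred, true):
    """Mean Absolute Error: average absolute difference."""
    return np.mean(np.abs(pred - true))

def huber(pred, true, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(pred - true)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))

pred = np.array([0.0, 2.0, 4.0])
true = np.array([0.0, 1.0, 1.0])
print(mse(pred, true))    # mean of [0, 1, 9]   = 3.333...
print(mae(pred, true))    # mean of [0, 1, 3]   = 1.333...
print(huber(pred, true))  # mean of [0, 0.5, 2.5] = 1.0
```

Huber's reduced sensitivity to outliers (the error of 3 contributes 2.5 instead of 4.5) is why it is often preferred over MSE for noisy motion-capture targets.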
Appropriateness (specificity): whether the motion is appropriate for the given speech, controlling for the human-likeness of the motion
🧑🦲 : Upper-body tier | 🧍 : Full-body tier |
🧍♂️ : motion | 📃 : text | 🔊 : audio | ⚙️ : custom by teams |
Metric (Description) | Body tier | Type | 2020 | 2021 | 2022 | 2023 |
---|---|---|---|---|---|---|
FNA (Full-body Natural Motion) | 🧍 | 🧍♂️ | ||||
FBT (Full-body Text-based) | 🧍 | 📃 | ||||
FSA (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSB (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSC (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSD (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSF (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSG (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSH (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSI (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
UNA (Upper-body Natural Motion) | 🧑🦲 | 🧍♂️ | ||||
UBA (Upper-body Audio-based) | 🧑🦲 | 🔊 | ||||
UBT (Upper-body Text-based) | 🧑🦲 | 📃 | ||||
USJ (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USK (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USL (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USM (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USN (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USO (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USP (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USQ (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ |
Modalities type:
🔊 : audio | 📃 : text | 🤯 : emotion | 🚶 : gesture motion | ℹ️ : gesture properties | 🎞️ : gesture segment |
Type
👥 : Dialog (Conversation between two people 🤼) | 👤 : Monolog (Self conversation 🧍) |
Dataset | Modalities | Type | Download | Paper |
---|---|---|---|---|
IEMOCAP | 🚶, 🔊, 📃, 🤯 | 👥 | sail.usc.edu/iemocap | [paper] |
Creative-IT | 🚶, 🔊, 📃, 🤯 | 👥 | sail.usc.edu/CreativeIT | |
Gesture-Speech Dataset | 🚶, 🔊 | 👤 | dropbox | |
CMU Panoptic | 🚶, 🔊, 📃 | 👥 | domedb.perception.cmu | [paper] |
Speech-Gesture | 🚶, 🔊 | 👤 | amirbar/speech2gesture | [paper] |
TED Dataset [homepage] | 🚶, 🔊 | 👤 | youtube-gesture-dataset | |
Talking With Hands ([github]) | 🚶, 🔊 | 👥 | facebookresearch/TalkingWithHands32M | [paper] |
PATS ([homepage], [github]) | 🚶, 🔊, 📃 | 👤 | chahuja.com/pats | [paper] |
Trinity Speech-Gesture I | 🚶, 🔊, 📃 | 👤 | Trinity Speech-Gesture I | |
Trinity Speech-Gesture II | 🚶, 🔊, 🎞️ | 👤 | Trinity Speech GestureII | |
Speech-Gesture 3D extension | 🚶, 🔊 | 👤 | nextcloud.mpi-klsb | |
Talking With Hands GENEA Extension | 🚶, 🔊, 📃 | 👥 | zenodo/6998231 | [paper] |
SaGA | 🚶, 🔊, ℹ️ | 👥 | phonetik.uni-muenchen | [paper] |
SaGA++ | 🚶, 🔊, ℹ️ | 👥 | zenodo/6546229 | |
ZEGGS Dataset [youtube] | 🚶, 🔊 | 👤 | ubisoft-laforge-ZeroEGGS | [paper] |
BEAT Dataset ([homepage] [homepage], [github]) | 🚶, 🔊, 📃, ℹ️, 🤯 | 👥, 👤 | github.io/BEAT | [paper] |
InterAct homepage | 🚶, 🔊, 📃 | 👥 | hku-cg.github.io | [paper] |
Algorithms
Recognition:
Audio pre-processing:
Mesh processing:
Visualization:
TEDTalk (Extract skeleton from video Dataset)
BEAT (Motion Capture Dataset)
Your contributions are always welcome! Please take a look at the contribution guidelines first.
This project is licensed under the MIT License - see the LICENSE.md file for details.
OpenHuman.ai - Open Store for Realistic Digital Human