Editing in progress (not finished yet)
This project focuses on audio-driven gesture generation, producing gestures as 3D keypoints.
Input: audio, text, gesture, etc. → Output: gesture motion
Gesture generation is the task of producing gestures from speech or text. The goal is to generate gestures that are natural, realistic, and appropriate for the given context; the generated gestures can then be used to animate virtual characters, robots, or embodied conversational agents.
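To make the input/output contract concrete, here is a minimal NumPy sketch of the mapping the task defines. All names and sizes (frame count, MFCC dimension, joint count) are illustrative assumptions, and the linear map stands in for a learned generator, not any specific model from the papers below.

```python
import numpy as np

# Hypothetical shapes -- assumptions for illustration, not from any paper.
T = 120        # frames of motion (e.g. 4 s at 30 fps)
F_AUDIO = 26   # per-frame audio features (e.g. MFCCs)
N_JOINTS = 25  # skeleton joints; output is 3D keypoints per joint

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(T, F_AUDIO))

# A deliberately trivial frame-wise linear mapping audio -> pose,
# standing in for the learned generator g: audio -> gesture motion.
W = rng.normal(size=(F_AUDIO, N_JOINTS * 3)) * 0.01
poses = (audio_feats @ W).reshape(T, N_JOINTS, 3)

print(poses.shape)  # (120, 25, 3): T frames of 3D keypoints
```

Real systems replace the linear map with the architectures listed under "Approaches" (RNNs, Transformers, diffusion models, ...), but the interface stays the same: a time-aligned audio (and optionally text) stream in, a sequence of 3D poses out.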
ACM CCS: • Human-centered computing → Human computer interaction (HCI).
Papers by folder: 📁/survey | 📁/approach | 📁/papers | 📁/dataset | 📁/books |
Main resource
【EUROGRAPHICS 2023】A Comprehensive Review of Data-Driven Co-Speech Gesture Generation; [paper]
2014 - Gesture and speech in interaction: An overview ; [paper]
Method (Team*) | Paper | Video | 🏆 |
---|---|---|---|
FineMotion | 【ICMI 2023】The FineMotion entry to the GENEA Challenge 2023: DeepPhase for conversational gestures generation [paper] | [youtube] | |
Gesture Motion Graphs | 【ICMI 2023】Gesture Motion Graphs for Few-Shot Speech-Driven Gesture Reenactment [paper] | [youtube] | |
Diffusion-based | 【ICMI 2023】Diffusion-based co-speech gesture generation using joint text and audio representation [paper] | [youtube] | |
UEA Digital Humans | 【ICMI 2023】The UEA Digital Humans entry to the GENEA Challenge 2023 [paper] ; [JonathanPWindle/UEA-DH-GENEA23] | [youtube] | |
FEIN-Z | 【ICMI 2023】FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation [paper] | [youtube] | |
DiffuseStyleGesture+ | 【ICMI 2023】The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 [paper] | [youtube] | 🏆 |
Discrete Diffusion | 【ICMI 2023】Discrete Diffusion for Co-Speech Gesture Synthesis [paper] | [youtube] | |
KCL-SAIR | 【ICMI 2023】The KCL-SAIR team’s entry to the GENEA Challenge 2023: Exploring Role-based Gesture Generation in Dyadic Interactions: Listener vs. Speaker [paper] | [youtube] | |
Gesture Generation | 【ICMI 2023】Gesture Generation with Diffusion Models Aided by Speech Activity Information [paper] | [youtube] | |
Co-Speech Gesture Generation | 【ICMI 2023】Co-Speech Gesture Generation via Audio and Text Feature Engineering [paper] | [youtube] | |
DiffuGesture | 【ICMI 2023】DiffuGesture: Generating Human Gesture From Two-person Dialogue With Diffusion Models [paper] | [youtube] | |
KU-ISPL | 【ICMI 2023】The KU-ISPL entry to the GENEA Challenge 2023-A Diffusion Model for Co-speech Gesture generation [paper] | [youtube] |
Papers | Video | 🏆 |
---|---|---|
【ICMI 2023】 MultiFacet A Multi-Tasking Framework for Speech-to-Sign Language Generation [paper] | ||
【ICMI 2023】 Look What I Made It Do - The ModelIT Method for Manually Modeling Nonverbal Behavior of Socially Interactive Agents [paper] | ||
【ICMI 2023】 A Methodology for Evaluating Multimodal Referring Expression Generation for Embodied Virtual Agents [paper] | ||
【ICMI 2023】 Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent [paper]; [aldelb/non_verbal_facial_animation] | 🏆 |
Team (Method) | Paper | Video | 🏆 |
---|---|---|---|
DeepMotion | 【ICMI 2022】The DeepMotion entry to the GENEA Challenge 2022 [paper] | [youtube] | |
DSI | 【ICMI 2022】Hybrid Seq2Seq Architecture for 3D Co-Speech Gesture Generation [paper] | [youtube] | |
FineMotion | 【ICMI 2022】ReCell: replicating recurrent cell for auto-regressive pose generation [paper] [FineMotion/GENEA_2022] | [youtube] | |
Forgerons | 【ICMI 2022】Ubisoft Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022 [paper] | [youtube] | |
GestureMaster | 【ICMI 2022】GestureMaster: Graph-based Speech-driven Gesture Generation [paper] | [youtube] | |
IVI Lab | 【ICMI 2022】The IVI Lab entry to the GENEA Challenge 2022 – A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism [paper] [Tacotron2-SpeechGesture] | [youtube] | 🏆 |
ReprGesture | 【ICMI 2022】The ReprGesture entry to the GENEA Challenge 2022 [paper] [YoungSeng/ReprGesture] | [youtube] | |
TransGesture | 【ICMI 2022】TransGesture: Autoregressive Gesture Generation with RNN-Transducer [paper] | [youtube] | |
UEA Digital Humans | 【ICMI 2022】UEA Digital Humans entry to the GENEA Challenge 2022 [paper] [UEA/GENEA22] | [youtube] |
Papers | Video | 🏆 |
---|---|---|
【ICMI 2022】 Understanding Interviewees’ Perceptions and Behaviour towards Verbally and Non-verbally Expressive Virtual Interviewing Agents [paper] | [youtube] | |
【ICMI 2022】 Emotional Respiration Speech Dataset [paper] | [youtube] | |
【ICMI 2022】 Automatic facial expressions, gaze direction and head movements generation of a virtual agent [paper] | [youtube] | 🏆 |
【ICMI 2022】 Can you tell that I’m confused? An overhearer study for German backchannels by an embodied agent [paper] | [youtube] |
Papers | Video | 🏆 |
---|---|---|
【ICMI 2021】 Probabilistic Human-like Gesture Synthesis from Speech using GRU-based WGAN [paper] [wubowen416/gesture-generation-using-WGAN] | [youtube] | 🏆 |
【ICMI 2021】 Influence of Movement Energy and Affect Priming on the Perception of Virtual Characters Extroversion and Mood [paper] | ❌ | |
【ICMI 2021】 Crossmodal clustered contrastive learning: Grounding of spoken language to gesture [paper] [dondongwon/CC_NCE_GENEA] | [youtube] |
Papers | Video |
---|---|
【IVA 2020】 The StyleGestures entry to the GENEA Challenge 2020 [paper] ; [simonalexanderson/StyleGestures] | [youtube] |
【IVA 2020】 The FineMotion entry to the GENEA Challenge 2020 [paper] ; [FineMotion/GENEA_2020] | [youtube] |
【IVA 2020】 Double-DCCCAE: Estimation of Sequential Body Motion Using Wave-Form - AlltheSmooth [paper] | [youtube] |
【IVA 2020】 CGVU: Semantics-guided 3D Body Gesture Synthesis [paper] | [youtube] |
【IVA 2020】 Interpreting and Generating Gestures with Embodied Human Computer Interactions [paper] | [youtube] |
【IVA 2020】 The Nectec Gesture Generation System entry to the GENEA Challenge 2020 [paper] | [youtube] |
[1994] Rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents [paper]
Speech to sequence gesture
Probabilistic model of speech to gesture
Probabilistic model of personal style
Neural classification model of personal style
This section is not yet accurate; editing continues.
MLP (Multilayer perceptron)
RNN (Recurrent Neural Networks)
CNN (Convolutional Networks)
Transformers
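As a reference point for the architecture families above, a minimal MLP regressor can be sketched in a few lines. This is an illustrative NumPy toy (all sizes and weights are assumptions), showing frame-wise regression from audio features to a flattened pose vector; it is not the implementation of any listed paper.

```python
import numpy as np

# Illustrative sizes -- assumptions, not from any specific paper.
rng = np.random.default_rng(1)
F_IN, H, F_OUT = 26, 64, 75  # 25 joints * 3 coords = 75 output dims

W1, b1 = rng.normal(size=(F_IN, H)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(H, F_OUT)) * 0.1, np.zeros(F_OUT)

def mlp(x):
    """One hidden layer with ReLU, linear output: pose regression."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

frame = rng.normal(size=F_IN)   # one frame of audio features
pose = mlp(frame)
print(pose.shape)  # (75,)
```

RNNs, CNNs, and Transformers replace this frame-wise map with models that share information across time, which is what lets generated gestures stay temporally coherent.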
Generative models (not yet accurate; editing continues)
Normalising Flows
WGAN
VAEs
Learnable noise codes
CaMN BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis [paper] ; [PantoMatrix/BEAT]
Diffusion
Periodic autoencoders (DeepPhase)
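Several of the entries above are diffusion-based. The core forward (noising) step they share, in the standard DDPM formulation, can be sketched as follows; the schedule values and tensor sizes here are illustrative assumptions, not taken from any listed paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear noise schedule (illustrative values).
T_STEPS = 1000
betas = np.linspace(1e-4, 0.02, T_STEPS)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product, decreasing in t

x0 = rng.normal(size=(120, 75))       # clean motion: frames x pose dims
t = 500                               # an intermediate diffusion timestep
eps = rng.normal(size=x0.shape)       # Gaussian noise

# Closed-form sample from q(x_t | x_0):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
print(x_t.shape)  # (120, 75)
```

Training then fits a network to predict `eps` from `x_t`, `t`, and the speech conditioning; generation runs the reverse process from pure noise to a motion sequence.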
Text to Gesture
Others (Uncategorized)
【ICMI 2022】 GestureMaster: Graph-based Speech-driven Gesture Generation [paper]
Full name | Description |
---|---|
Adversarial Loss (Adv) | Used in Generative Adversarial Networks (GANs), this loss function pits a generator network against a discriminator network, with the goal of the generator producing samples that can fool the discriminator into thinking they are real. |
Categorical Cross Entropy (CCE) | A common loss function used in multi-class classification tasks, where the goal is to minimize the difference between the predicted and true class labels. |
Cross-modal Cluster Noise Contrastive Estimation (CC-NCE) | Used in multimodal learning to learn joint representations across different modalities, this loss function maximizes the similarity between matching modalities while minimizing the similarity between non-matching modalities. |
Edge Transition Cost (ETC) | Used in graph-based image segmentation, this loss function measures the similarity between adjacent pixels in an image to preserve the coherence and smoothness of segmented regions. |
Expectation Maximization (EM) | Used for maximum likelihood estimation when dealing with incomplete or missing data, this algorithm involves computing the expected likelihood of the missing data and updating model parameters to maximize the likelihood of the observed data given the expected values. |
Geodesic Distance (GeoD) | Used in deep learning for image segmentation, this loss function penalizes the discrepancy between the predicted segmentation map and the ground truth, while also considering the spatial relationships between different image regions. |
Wasserstein-GAN Gradient Penalty (WGAN-GP) | An extension of the Wasserstein GAN algorithm that adds a gradient penalty term to the loss function, used to enforce the Lipschitz continuity constraint and ensure stability during training. |
Hamming Distance (Hamm) | Used in information theory, this metric measures the number of positions at which two strings differ. |
Huber Loss (Huber) | A robust loss function used in regression tasks that is less sensitive to outliers than the Mean Squared Error (MSE) loss. |
Imitation Reward (IR) | Used in imitation learning to train a model to mimic the behavior of an expert agent, by providing a reward signal based on how closely the model’s behavior matches that of the expert. |
Kullback–Leibler Divergence (KL) | Used to measure the difference between two probability distributions, this loss function is commonly used in probabilistic models and deep learning for regularization and training. |
L2 Distance (L2) | Measures the Euclidean distance between two points in space, commonly used in regression tasks. |
Mean Absolute Error (MAE) | A loss function used in regression tasks that measures the average difference between the predicted and true values. |
Maximum Likelihood Estimation (MLE) | A statistical method used to estimate the parameters of a probability distribution that maximize the likelihood of observing the data. |
Mean Squared Error (MSE) | A common loss function used in regression tasks that measures the average squared difference between the predicted and true values. |
Negative Log-likelihood (NLL) | Used in probabilistic models to maximize the likelihood of the observed data by minimizing the negative log-likelihood. |
Structural Similarity Index Measure (SSIM) | Used in image processing to measure the similarity between two images based on their luminance, contrast, and structural content. |
Task Reward (TR) | Used in reinforcement learning to provide a reward signal to an agent based on its performance in completing a given task. |
Variance (Var) | A statistical metric used to measure the variability of a set of data points around their mean. |
Within-cluster Sum of Squares (WCSS) | Used in cluster analysis to measure the variability of data points within a cluster by computing the sum of squared distances between each data point and the mean of the cluster. |
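For the simple regression losses in the table (MSE, MAE, Huber), reference NumPy implementations are short enough to state directly. These are textbook definitions, sketched for illustration; the example values below are arbitrary.

```python
import numpy as np

def mse(pred, true):
    """Mean Squared Error: average squared difference."""
    return np.mean((pred - true) ** 2)

def mae(pred, true):
    """Mean Absolute Error: average absolute difference."""
    return np.mean(np.abs(pred - true))

def huber(pred, true, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(pred - true)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))

pred = np.array([0.0, 2.0, 4.0])
true = np.array([0.0, 1.0, 1.0])
print(mse(pred, true))    # mean of [0, 1, 9]   = 3.333...
print(mae(pred, true))    # mean of [0, 1, 3]   = 1.333...
print(huber(pred, true))  # mean of [0, 0.5, 2.5] = 1.0
```

Huber's reduced sensitivity to outliers (the error of 3 contributes 2.5 instead of 4.5) is why it is often preferred over MSE for noisy motion-capture targets.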
Appropriateness (specificity): whether the motion is appropriate for the given speech, controlling for the human-likeness of the motion
🧑🦲 : Upper-body tier | 🧍 : Full-body tier |
🧍♂️ : motion | 📃 : text | 🔊 : audio | ⚙️ : custom by teams |
Metric (Description) | Body tier | Type | 2020 | 2021 | 2022 | 2023 |
---|---|---|---|---|---|---|
FNA (Full-body Natural Motion) | 🧍 | 🧍♂️ | ||||
FBT (Full-body Text-based) | 🧍 | 📃 | ||||
FSA (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSB (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSC (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSD (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSF (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSG (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSH (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
FSI (Full-body Custom by Teams) | 🧍 | ⚙️ | ||||
UNA (Upper-body Natural Motion) | 🧑🦲 | 🧍♂️ | ||||
UBA (Upper-body Audio-based) | 🧑🦲 | 🔊 | ||||
UBT (Upper-body Text-based) | 🧑🦲 | 📃 | ||||
USJ (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USK (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USL (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USM (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USN (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USO (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USP (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ | ||||
USQ (Upper-body Custom by Teams) | 🧑🦲 | ⚙️ |
Modalities type:
🔊 : audio | 📃 : text | 🤯 : emotion | 🚶 : gesture motion | ℹ️ : gesture properties | 🎞️ : gesture segment |
Type
👥 : Dialog (Conversation between two people 🤼) | 👤 : Monolog (Self conversation 🧍) |
Dataset | Modalities | Type | Download | Paper |
---|---|---|---|---|
IEMOCAP | 🚶, 🔊, 📃, 🤯 | 👥 | sail.usc.edu/iemocap | [paper] |
Creative-IT | 🚶, 🔊, 📃, 🤯 | 👥 | sail.usc.edu/CreativeIT | |
Gesture-Speech Dataset | 🚶, 🔊 | 👤 | dropbox | |
CMU Panoptic | 🚶, 🔊, 📃 | 👥 | domedb.perception.cmu | [paper] |
Speech-Gesture | 🚶, 🔊 | 👤 | amirbar/speech2gesture | [paper] |
TED Dataset [homepage] | 🚶, 🔊 | 👤 | youtube-gesture-dataset | |
Talking With Hands ([github]) | 🚶, 🔊 | 👥 | facebookresearch/TalkingWithHands32M | [paper] |
PATS ([homepage], [github]) | 🚶, 🔊, 📃 | 👤 | chahuja.com/pats | [paper] |
Trinity Speech-Gesture I | 🚶, 🔊, 📃 | 👤 | Trinity Speech-Gesture I | |
Trinity Speech-Gesture II | 🚶, 🔊, 🎞️ | 👤 | Trinity Speech GestureII | |
Speech-Gesture 3D extension | 🚶, 🔊 | 👤 | nextcloud.mpi-klsb | |
Talking With Hands GENEA Extension | 🚶, 🔊, 📃 | 👥 | zenodo/6998231 | [paper] |
SaGA | 🚶, 🔊, ℹ️ | 👥 | phonetik.uni-muenchen | [paper] |
SaGA++ | 🚶, 🔊, ℹ️ | 👥 | zenodo/6546229 | |
ZEGGS Dataset [youtube] | 🚶, 🔊 | 👤 | ubisoft-laforge-ZeroEGGS | [paper] |
BEAT Dataset ([homepage] [homepage], [github]) | 🚶, 🔊, 📃, ℹ️, 🤯 | 👥, 👤 | github.io/BEAT | [paper] |
InterAct homepage | 🚶, 🔊, 📃 | 👥 | hku-cg.github.io | [paper] |
Algorithms
Recognition:
Audio pre-processing:
Mesh processing:
Visualization:
TEDTalk (Extract skeleton from video Dataset)
BEAT (Motion Capture Dataset)
Your contributions are always welcome! Please take a look at the contribution guidelines first.
This project is licensed under the MIT License - see the LICENSE.md file for details.
OpenHuman.ai - Open Store for Realistic Digital Human