StatMuse · Anywhere · Dec. 13, 2018
💸 Undisclosed Salary
We need a system that predicts 51 "blend shape" coefficients per frame (at 60 fps) from any new audio clip of our specific speaker talking.
Blend shapes are identifiers for specific facial features, e.g. eyeBlinkLeft, mouthSmileRight, noseSneerRight, etc. The full list can be found here: https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation
Each blend shape has a floating point value indicating the current position of that feature relative to its neutral configuration, ranging from 0.0 (neutral) to 1.0 (maximum movement).
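To make the output format concrete, here is a minimal sketch of one frame of coefficients and the amount of data a clip would produce. The three blend shape names are the examples listed above; the helper names `clamp_frame` and `frames_for_duration` are hypothetical and only for illustration.

```python
FPS = 60               # target frame rate from the brief
NUM_BLEND_SHAPES = 51  # coefficients predicted per frame


def clamp_frame(frame):
    """Clamp every coefficient to [0.0, 1.0] (0.0 = neutral, 1.0 = maximum movement)."""
    return {name: min(1.0, max(0.0, value)) for name, value in frame.items()}


def frames_for_duration(seconds, fps=FPS):
    """Approximate number of frames spanning an audio clip of the given length."""
    return round(seconds * fps)


# A raw model output may stray slightly outside the valid range.
raw = {"eyeBlinkLeft": -0.05, "mouthSmileRight": 0.42, "noseSneerRight": 1.3}
clamped = clamp_frame(raw)

# A 2.5 second clip needs 150 frames of 51 values each.
total_values = frames_for_duration(2.5) * NUM_BLEND_SHAPES
```

Clamping after prediction is just one option; a model could also be built so that its outputs can never leave the range.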
Our training data will consist of 3-5 hours of the speaker's recorded speech, time-aligned with blend shape coefficients obtained through motion capture, which reflect how the speaker's face moved as they spoke.
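One way to turn aligned recordings into training examples is to pair each 60 fps frame with a window of audio centred on that frame's timestamp. The sketch below assumes a 16 kHz sample rate and a 500 ms window; both values, and the function names, are illustrative assumptions rather than requirements of the project.

```python
def audio_window_for_frame(frame_index, sample_rate=16000, fps=60, window_ms=500):
    """Return (start, end) sample indices of the audio window centred on a frame.

    Indices may fall outside the clip near its edges; callers should pad.
    """
    center = round(frame_index * sample_rate / fps)
    half = (sample_rate * window_ms) // 2000  # half the window, in samples
    return center - half, center + half


def training_pairs(samples, frames, sample_rate=16000, fps=60, window_ms=500):
    """Yield (audio_window, blend_shape_frame) pairs, zero-padding at clip edges."""
    for i, frame in enumerate(frames):
        start, end = audio_window_for_frame(i, sample_rate, fps, window_ms)
        window = [samples[j] if 0 <= j < len(samples) else 0.0
                  for j in range(start, end)]
        yield window, frame


# One second of placeholder audio and 60 placeholder frames of 51 coefficients.
samples = [1.0] * 16000
frames = [[0.0] * 51 for _ in range(60)]
pairs = list(training_pairs(samples, frames))
```

Because 16000 / 60 is not a whole number, the window centres are rounded to the nearest sample, so consecutive windows are offset by 266 or 267 samples.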
One obvious challenge will be handling the ambiguity in the data, which is largely driven by the speaker’s emotions while delivering sounds (i.e., the same sound can produce a variety of facial poses depending on the emotional state of the speaker).
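A small worked example shows why this ambiguity matters for training. If the same sound appears with two different values of a coefficient, a model fitted with squared error is pushed toward their average, which matches neither pose. The target values below are made up for illustration.

```python
def mse(prediction, targets):
    """Mean squared error of one constant prediction against several targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Same sound, two emotional states: hypothetical targets for one coefficient.
targets = [0.1, 0.9]

# Search a grid of candidate predictions for the one with the lowest error.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mse(p, targets))
# best is 0.5, the mean of the two targets.
```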
The Nvidia team appears to have solved a similar problem: https://research.nvidia.com/publication/2017-07_Audio-Driven-Facial-Animation
From the paper:
"We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet.
At first, the problem may seem intractable because of its inherent ambiguity—the same sounds can be uttered with vastly different facial expressions, and the audio track simply does not contain enough information to distinguish between the different variations [Petrushin 1998]. While modern convolutional neural networks have proven extremely effective in various inference and classification tasks, they tend to regress toward the mean if there are ambiguities in the training data. To tackle these problems, we present three main contributions:"
Their solution predicts the relative positions of 5K vertices of a fixed-mesh face model, whereas we want to predict 51 blend shape coefficients. Aside from that distinction, the ideal solution would be very similar to Nvidia’s approach.
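The difference in output size can be sketched with a toy output layer. This is not Nvidia's architecture and not a trained model: the feature sizes, the random weights, and the choice to simply concatenate an emotion code with the audio features are all assumptions made for illustration. A sigmoid keeps each coefficient between 0 and 1.

```python
import math
import random

NUM_BLEND_SHAPES = 51           # our target: one coefficient per blend shape
NUM_VERTEX_OUTPUTS = 5000 * 3   # roughly 5K vertices with 3D coordinates each


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def blend_shape_head(audio_features, emotion_code, weights, biases):
    """Map audio features plus an emotion code to 51 coefficients in (0, 1)."""
    inputs = list(audio_features) + list(emotion_code)
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


rng = random.Random(0)
audio_features = [rng.uniform(-1, 1) for _ in range(32)]  # hypothetical size
emotion_code = [rng.uniform(-1, 1) for _ in range(4)]     # hypothetical size
weights = [[rng.gauss(0, 0.1) for _ in range(36)] for _ in range(NUM_BLEND_SHAPES)]
biases = [0.0] * NUM_BLEND_SHAPES

coefficients = blend_shape_head(audio_features, emotion_code, weights, biases)
# The same audio with a different emotion code gives a different pose.
neutral = blend_shape_head(audio_features, [0.0] * 4, weights, biases)
```

With 51 outputs instead of about 15,000, the final layer is far smaller, but the need for some extra input that separates emotional states remains the same.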
Timeline and Cost Estimates:
Our ideal timeline is rather ambitious: we are hoping to get something working by the end of January 2019. We're definitely open to discussion, though; please mention in your proposal how long you think this will realistically take.
Please submit a short, informal proposal that includes:
If you have any clarifying questions you would like answered before submitting a proposal, please feel free to send them to email@example.com and we'll be happy to respond.
You can submit your proposal to firstname.lastname@example.org, and we’ll be in touch.