Remote Machine Learning Contract

StatMuse · Anywhere · Dec. 13, 2018


💸 Undisclosed Salary

Project Description:

We need to create a system that can predict 51 "blend shape" coefficients per frame (at 60 fps) given any new arbitrary audio clip of our specific speaker talking.
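To make the expected output concrete, here is a minimal sketch of the size of a prediction for a clip of a given length. The 60 fps rate and the 51 coefficients come from the description above; the 48 kHz sample rate in the usage lines is only an assumption for illustration, as are the function names.

```python
import math

FPS = 60               # frame rate stated in the project description
NUM_BLEND_SHAPES = 51  # coefficients per frame, also from the description

def output_shape(duration_s, fps=FPS, num_shapes=NUM_BLEND_SHAPES):
    """Return (num_frames, num_shapes) for a clip of the given length in seconds."""
    num_frames = math.ceil(duration_s * fps)
    return (num_frames, num_shapes)

def samples_per_frame(sample_rate_hz, fps=FPS):
    """Number of audio samples that fall within one animation frame."""
    return sample_rate_hz / fps

# A 2.5 second clip yields 150 frames of 51 coefficients each.
print(output_shape(2.5))
# Assumed 48 kHz audio: 800 samples per 60 fps frame.
print(samples_per_frame(48_000))
```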

Blend shapes are identifiers for specific facial features, e.g. eyeBlinkLeft, mouthSmileRight, noseSneerRight, etc. The full list can be found here:

Each blend shape has a floating point value indicating the current position of that feature relative to its neutral configuration, ranging from 0.0 (neutral) to 1.0 (maximum movement).
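Since every coefficient must stay within 0.0 to 1.0, a model's raw outputs would likely need to be clipped before they drive the face. A minimal sketch, assuming a frame is represented as a name-to-value dictionary (that representation and the helper name are hypothetical; the three names are taken from the examples above):

```python
def clamp_coefficients(frame):
    """Clip every coefficient in a {name: value} frame to the range [0.0, 1.0]."""
    return {name: min(1.0, max(0.0, value)) for name, value in frame.items()}

# Hypothetical raw model outputs that slightly overshoot the valid range.
raw = {"eyeBlinkLeft": 1.07, "mouthSmileRight": 0.42, "noseSneerRight": -0.03}
clean = clamp_coefficients(raw)
print(clean)
```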

Our training data will consist of 3-5 hours of a speaker's recorded speech audio, aligned with blend shape coefficients obtained through motion capture, which reflect how the speaker moved their face while talking.
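One way to turn aligned recordings into training examples is to pair each motion capture frame with a short window of audio centered on that frame's timestamp. The sketch below assumes the audio is a plain list of samples and the coefficients are a list of per-frame vectors; the window length, the zero padding at clip boundaries, and the function names are illustrative choices, not requirements of the project.

```python
def frame_windows(samples, sample_rate_hz, fps=60, window_len=20):
    """Yield one fixed-length audio window per animation frame.

    Each window is centered on the frame's timestamp and zero-padded
    where it would run past the start or end of the clip.
    """
    half = window_len // 2
    num_frames = len(samples) * fps // sample_rate_hz
    for i in range(num_frames):
        center = round(i * sample_rate_hz / fps)
        yield [
            samples[j] if 0 <= j < len(samples) else 0.0
            for j in range(center - half, center + half)
        ]

def make_pairs(samples, coefficients, sample_rate_hz, fps=60, window_len=20):
    """Pair each audio window with the coefficient vector of the same frame."""
    windows = list(frame_windows(samples, sample_rate_hz, fps, window_len))
    n = min(len(windows), len(coefficients))
    return list(zip(windows[:n], coefficients[:n]))

# Toy data: 0.1 s of audio at an artificially low 600 Hz, i.e. 6 frames at 60 fps.
samples = [float(k) for k in range(60)]
coefficients = [[0.0] * 51 for _ in range(6)]
pairs = make_pairs(samples, coefficients, sample_rate_hz=600)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

In practice the window would cover far more samples than in this toy example; the point is only that every training target has a well-defined slice of audio attached to it.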

One obvious challenge will be in handling the ambiguity within the data, which is largely driven by the speaker’s emotions while delivering sounds (i.e., the same sound can produce a variety of different facial poses based on the emotional state of the speaker). 
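A toy illustration of why this ambiguity matters: if the same sound appears in the data with two different poses, a model trained with a plain mean-squared-error loss is pushed toward their average, which matches neither. The coefficient values below are made up for the example.

```python
def mse(prediction, targets):
    """Mean squared error of a single constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Hypothetical values of one coefficient for the same sound,
# once spoken with a neutral face and once with a broad smile.
targets = [0.1, 0.9]
mean = sum(targets) / len(targets)

# Compare predicting either observed pose against predicting the average.
candidates = [0.1, 0.5, 0.9]
best = min(candidates, key=lambda p: mse(p, targets))

for p in candidates:
    print(p, round(mse(p, targets), 3))
print("lowest error at", best)
```

The lowest error is at the average pose, so without some extra signal for emotional state the predicted face would tend to look muted.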

It appears a similar problem has been solved here by the Nvidia team:

From the paper:

"We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet.

At first, the problem may seem intractable because of its inherent ambiguity—the same sounds can be uttered with vastly different facial expressions, and the audio track simply does not contain enough information to distinguish between the different variations [Petrushin 1998]. While modern convolutional neural networks have proven extremely effective in various inference and classification tasks, they tend to regress toward the mean if there are ambiguities in the training data. To tackle these problems, we present three main contributions:

  • A convolutional network architecture tailored to effectively process human speech and generalize over different speakers (Sections 3.1 and 3.2).
  • A novel way to enable the network to discover variations in the training data that cannot be explained by the audio alone, i.e., apparent emotional state (Section 3.3).
  • A three-way loss function to ensure that the network remains temporally stable and responsive under animation, even with highly ambiguous training data (Section 4.3)."

Their solution predicts the relative positions of 5K vertices of a fixed-mesh face model, whereas we want to predict 51 blend shape coefficients. Aside from that distinction, the ideal solution would be very similar to Nvidia’s approach.
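To show how the two output formats relate, here is a sketch of the common linear blend shape model, in which each coefficient scales a per-vertex offset that is added to the neutral mesh. This is an assumption about how the coefficients would drive a face; neither our rig nor the details of Nvidia's mesh are specified here, and the tiny two-vertex mesh is purely illustrative.

```python
def apply_blend_shapes(neutral, deltas, weights):
    """Return deformed vertices: neutral + sum over shapes of weight * delta."""
    out = [list(v) for v in neutral]
    for name, w in weights.items():
        for vertex, delta in zip(out, deltas[name]):
            for axis in range(3):
                vertex[axis] += w * delta[axis]
    return out

# Hypothetical two-vertex "mesh" and a single blend shape offset.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = {"mouthSmileRight": [(0.0, 0.0, 0.0), (0.2, 0.1, 0.0)]}

# A coefficient of 0.5 moves the second vertex halfway along its offset.
posed = apply_blend_shapes(neutral, deltas, {"mouthSmileRight": 0.5})
print(posed)
```

Under this model, predicting 51 coefficients is a much lower-dimensional target than predicting thousands of vertex positions directly.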

Timeline and Cost Estimates:

Our ideal timeline is rather ambitious: we are hoping to get something working by the end of January 2019. We're definitely open to discussion, though; please mention in your proposal how long you think this will realistically take.


Please submit a short, informal proposal that includes:

  • Resume and summary of your experience with machine learning
  • Summary of the steps you would take in approaching this problem, including desired programming language(s) and infrastructure
  • Estimate of the time required
  • Your rate, either as fixed cost or hourly

If you have any clarifying questions you would like answered before submitting a proposal, please feel free to send them to and we'll be happy to respond. 

You can submit your proposal to, and we’ll be in touch.

Machine Learning · Audio Analysis · 3D Facial Animation