Emotion is defined as a natural instinctive state of mind deriving from one's circumstances, mood, or relationships with others. Emotions are believed to be species-specific rather than culture-specific. In humans, emotions are expressed through behavior, actions, thoughts, and feelings. Among these, facial expressions are one of the most natural displays of human emotion. Facial expressions in humans are controlled by the action of more than 40 muscles. A motion detector such as the Kinect, originally designed for gaming, can track the movements of these muscles, and a machine learning technique can then classify these movements as different emotions.
We used a client-server architecture for our experiments. All training is done on the server side. For the server, we used an Intel Core i7-6700HQ processor with four cores and an NVIDIA graphics processor supporting the Compute Unified Device Architecture (CUDA) with 512 GPU cores. Since no training is done on the client side, we used a much less resource-intensive system there, with only a dual-core Intel Core i5 processor.
For motion detection, we used a Microsoft Kinect for Windows sensor. Its RGB camera records frames at a rate of 30 fps, which was adequate to capture the frames required for our task. The sensor also provides a special set of configurations for the depth sensor to detect nearby objects. Distances are categorized as too near, normal, too far, or unknown. Too near means an object was detected but is too close to the sensor for a reliable distance measurement; too far means an object was detected but is too distant to measure reliably; unknown means no object was detected. As we can see from the figure, in the near range any depth from 0.4 m to 3 m can be captured reliably. This depth data can be converted to RGB values for further processing.
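The depth categories and the depth-to-RGB conversion described above can be sketched as follows. This is an illustrative implementation, not part of the Kinect SDK: the function names, the category strings, and the linear grayscale mapping are assumptions; only the 0.4 m to 3 m near-range limits come from the text.

```python
# Hypothetical sketch: classify a raw depth reading (in metres) into the
# Kinect-style reliability categories, and map reliable depths to an
# 8-bit grayscale RGB triple for further processing.

NEAR_MIN_M = 0.4  # closest reliably measurable distance in near range
NEAR_MAX_M = 3.0  # farthest reliably measurable distance in near range

def categorize_depth(depth_m):
    """Return the depth category for a reading in metres (None = no object)."""
    if depth_m is None:
        return "unknown"    # no object detected
    if depth_m < NEAR_MIN_M:
        return "too near"   # object detected, too close to measure reliably
    if depth_m > NEAR_MAX_M:
        return "too far"    # object detected, too far to measure reliably
    return "normal"

def depth_to_rgb(depth_m):
    """Map a reliable depth to a grayscale (R, G, B) triple."""
    if categorize_depth(depth_m) != "normal":
        return (0, 0, 0)    # unreliable readings rendered as black
    # Linear scale: nearest -> white (255), farthest -> black (0).
    scale = 1.0 - (depth_m - NEAR_MIN_M) / (NEAR_MAX_M - NEAR_MIN_M)
    v = int(round(255 * scale))
    return (v, v, v)

print(categorize_depth(0.2))  # too near
print(depth_to_rgb(0.4))      # (255, 255, 255)
```

A linear mapping is the simplest choice here; any monotone mapping that preserves relative distance would serve the same purpose.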
For the early training phase, we used the Cohn-Kanade dataset, an AU-coded facial expression database for research in automatic facial image analysis and synthesis and for perceptual studies. It includes 486 sequences from 97 posers. Each sequence begins with a neutral expression and proceeds to a peak expression. The peak expression of each sequence is coded using the Facial Action Coding System (FACS) and given an emotion label. For the late training phase, samples were recorded directly from end users on the client system, and their facial expressions were tracked using the Kinect sensor.
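Because each Cohn-Kanade sequence runs from a neutral frame to a labeled peak frame, labeled training pairs can be extracted by pairing the first frame with a neutral label and the last frame with the sequence's emotion label. The sketch below shows this idea; the data structures and the "neutral" label string are assumptions for illustration, not the dataset's actual file format.

```python
# Hypothetical sketch: build (frame, label) training pairs from
# Cohn-Kanade-style sequences. Each sequence is (frames, emotion_label),
# with frames ordered from neutral onset to peak expression.

def build_training_pairs(sequences):
    """Return a list of (frame, label) pairs: the first frame of every
    sequence as a 'neutral' example, the last frame with its FACS-derived
    emotion label."""
    pairs = []
    for frames, label in sequences:
        pairs.append((frames[0], "neutral"))  # sequence starts neutral
        pairs.append((frames[-1], label))     # peak frame carries the label
    return pairs

demo = [(["f0", "f1", "f2"], "happy"), (["g0", "g1"], "angry")]
print(build_training_pairs(demo))
# [('f0', 'neutral'), ('f2', 'happy'), ('g0', 'neutral'), ('g1', 'angry')]
```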
Confusion matrix for the early training phase
We modified a level of an open-source version of a video game similar to the famous classic Super Mario Bros. to incorporate information about player emotions. We altered the gameplay to make it easier if the player appeared angry or sad, and tougher if the player appeared happy. To test our system, we relied on user feedback: users were asked to play the unmodified levels and their modified versions in a blind test.
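The difficulty rule described above can be expressed as a small mapping from the detected emotion to a difficulty scale. The multipliers and emotion strings below are assumptions for illustration; the text only specifies the direction of the adjustment, not its magnitude.

```python
# Minimal sketch of the emotion-driven difficulty rule: ease the game
# when the player looks angry or sad, toughen it when the player looks
# happy, and leave it unchanged otherwise.

def difficulty_multiplier(emotion, base=1.0):
    """Scale a base difficulty according to the detected emotion."""
    if emotion in ("angry", "sad"):
        return base * 0.8   # ease off for frustrated players (assumed factor)
    if emotion == "happy":
        return base * 1.2   # raise the challenge (assumed factor)
    return base             # neutral or unrecognized: no change

print(difficulty_multiplier("angry"))  # 0.8
```

In the actual game, such a multiplier could drive parameters like enemy count or speed; the mapping itself is the only part the text specifies.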