Match Box

Match Box is an iPhone game prototype that uses isolated word recognition. My colleague Sebastian Böhm and I have been working on this for a couple of weeks during the lab course Similarity Modeling 1 at our university.

The main idea of the game is the following:

  • Take a picture with your iPhone.
  • Describe, in one word, an object in the picture. The word is recorded into a sound file.
  • Send the picture and the spoken word to a friend.
  • Your friend has to guess an object shown in the picture without listening to the description you sent along. The word-matching algorithm then compares the two spoken words and gives your friend visual feedback: whether the description matches yours and, if not, how similar it sounds.

Since we had no previous experience with speech recognition, this project was above all a chance for us to learn about this particularly interesting area. To make the isolated word recognition work, we developed a library called Word Match that lets you extract features from recorded speech and measure their similarity. This is achieved by extracting Mel-Frequency Cepstral Coefficients (MFCCs) [1] from both samples and then measuring the distance between them using Dynamic Time Warping (DTW) [2]. The MFCCs are an uncorrelated set of values that capture the important characteristics of a spoken word; a very nice description of the extraction process is given by Jurafsky [3]. DTW compensates for non-linear time-stretching between the two utterances, i.e. the individual pace of each speaker.
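To illustrate the matching step, here is a minimal DTW sketch in C. This is not the Word Match API, just the textbook dynamic-programming recurrence; frame_dist and dtw_distance are hypothetical names, and the Euclidean distance between MFCC frames is one common choice:

    #include <math.h>
    #include <stdlib.h>
    #include <float.h>

    /* Euclidean distance between two MFCC frames of `dim` coefficients each. */
    static double frame_dist(const float *x, const float *y, int dim) {
        double d = 0.0;
        for (int k = 0; k < dim; k++) {
            double diff = x[k] - y[k];
            d += diff * diff;
        }
        return sqrt(d);
    }

    /* DTW distance between sequence a (n frames) and sequence b (m frames),
     * each stored row-major with `dim` coefficients per frame. */
    double dtw_distance(const float *a, int n, const float *b, int m, int dim) {
        int w = m + 1;
        double *D = malloc((size_t)(n + 1) * w * sizeof *D);
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                D[i * w + j] = DBL_MAX;
        D[0] = 0.0;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = frame_dist(a + (i - 1) * dim, b + (j - 1) * dim, dim);
                double best = D[(i - 1) * w + j];        /* advance in a only */
                if (D[i * w + j - 1] < best)             /* advance in b only */
                    best = D[i * w + j - 1];
                if (D[(i - 1) * w + j - 1] < best)       /* advance in both   */
                    best = D[(i - 1) * w + j - 1];
                D[i * w + j] = cost + best;
            }
        }
        double result = D[n * w + m];
        free(D);
        return result;
    }

The smaller the returned distance, the more alike the two utterances sound; in the game, this value drives the visual feedback on how similar the two descriptions are.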

The game prototype MatchBox is built around our word-matching library, with additional code for simple peer-to-peer networking, audio recording, and user interaction. Audio capturing is done with the Voice Processing Audio Unit of iOS, which delivers great recording quality for speech since noise canceling and automatic gain control are built-in features. The MFCC/DTW pipeline also required us to perform some additional thresholding at the beginning and end of the recorded utterance. These threshold values strongly influence the performance of the comparison algorithm and therefore have to be configured carefully.
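As a rough idea of what such endpoint thresholding can look like, here is an illustrative C sketch that treats low-RMS frames at either end of the recording as silence. The function name, frame layout, and fixed threshold are assumptions made for the sketch, not the project's actual parameters:

    #include <math.h>

    /* Scan non-overlapping frames of `win` samples (the caller guarantees
     * that n_frames * win samples are available) and report the first and
     * last frame whose RMS energy exceeds `threshold`. Only the sample
     * range [*start * win, *end * win) would then be fed into MFCC
     * extraction. If no frame exceeds the threshold, the whole recording
     * is kept. */
    void trim_silence(const float *samples, int n_frames, int win,
                      float threshold, int *start, int *end) {
        *start = 0;
        *end = n_frames;
        for (int i = 0; i < n_frames; i++) {
            double e = 0.0;
            for (int k = 0; k < win; k++)
                e += (double)samples[i * win + k] * samples[i * win + k];
            if (sqrt(e / win) > threshold) { *start = i; break; }
        }
        for (int i = n_frames - 1; i >= *start; i--) {
            double e = 0.0;
            for (int k = 0; k < win; k++)
                e += (double)samples[i * win + k] * samples[i * win + k];
            if (sqrt(e / win) > threshold) { *end = i + 1; break; }
        }
    }

Picking the threshold is exactly the careful configuration mentioned above: too low and background noise leaks into the comparison, too high and the word itself gets clipped.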

As I am currently working on other projects that require efficient audio analysis, I took this project as an opportunity to implement the MFCC extraction process as efficiently as I could. In the sources of this project you will find a class MFCCProcessor that uses Apple's Accelerate framework (specifically vDSP) to perform the calculations needed to compute the MFCC components from a window of audio samples. The MFCC extraction is robust and efficient on both platforms, iOS and OS X, and our implementation has been verified for correctness against a commonly used MATLAB MFCC implementation.
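To give a feel for what the extraction computes (leaving aside framing, windowing, and the FFT, which MFCCProcessor accelerates with vDSP), here is an illustrative plain-C sketch of the mel filterbank and DCT stages, starting from an already-computed power spectrum. The function names and parameters are assumptions made for the sketch, not the actual MFCCProcessor interface:

    #include <math.h>

    static double hz_to_mel(double hz)  { return 2595.0 * log10(1.0 + hz / 700.0); }
    static double mel_to_hz(double mel) { return 700.0 * (pow(10.0, mel / 2595.0) - 1.0); }

    /* power: n_bins power-spectrum values covering 0 .. sample_rate/2.
     * Writes n_coeffs MFCCs to out, using n_filters triangular mel
     * filters. The sketch assumes n_filters <= 64. */
    void mfcc_from_power_spectrum(const double *power, int n_bins,
                                  double sample_rate, int n_filters,
                                  int n_coeffs, double *out) {
        double mel_lo = hz_to_mel(0.0);
        double mel_hi = hz_to_mel(sample_rate / 2.0);
        double log_energy[64];

        for (int f = 0; f < n_filters; f++) {
            /* Triangle corners at equally spaced points on the mel scale */
            double m0 = mel_lo + (mel_hi - mel_lo) * f       / (n_filters + 1);
            double m1 = mel_lo + (mel_hi - mel_lo) * (f + 1) / (n_filters + 1);
            double m2 = mel_lo + (mel_hi - mel_lo) * (f + 2) / (n_filters + 1);
            double f0 = mel_to_hz(m0), f1 = mel_to_hz(m1), f2 = mel_to_hz(m2);

            double e = 0.0;
            for (int b = 0; b < n_bins; b++) {
                double hz = b * (sample_rate / 2.0) / (n_bins - 1);
                double w = 0.0;
                if (hz > f0 && hz <= f1)
                    w = (hz - f0) / (f1 - f0);
                else if (hz > f1 && hz < f2)
                    w = (f2 - hz) / (f2 - f1);
                e += w * power[b];
            }
            log_energy[f] = log(e + 1e-10);   /* small floor avoids log(0) */
        }

        /* DCT-II decorrelates the log filterbank energies into cepstral
         * coefficients, i.e. the MFCCs. */
        for (int c = 0; c < n_coeffs; c++) {
            double sum = 0.0;
            for (int f = 0; f < n_filters; f++)
                sum += log_energy[f] * cos(M_PI * c * (f + 0.5) / n_filters);
            out[c] = sum;
        }
    }

The DCT at the end is what makes the resulting coefficients largely uncorrelated, which in turn is why a simple per-frame distance works well in the DTW comparison above.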

The complete source code of this project is available under the MIT license. Please note that the MatchBox application itself is more a nice use-case scenario for the extraction process than a ready-to-deploy iOS application. We focused most of our work on the MFCC and DTW code, which we have tested thoroughly. I believe it contains some really useful pieces of code for developers working with audio feature extraction on OS X / iOS.

References:

[1] P. Mermelstein, “Distance measures for speech recognition, psychological and instrumental,” Pattern Recognition and Artificial Intelligence, vol. 116, 1976.

[2] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.

[3] D. Jurafsky, J. Martin, and A. Kehler, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2009, pp. 295–302.