Tuenti voice control

Published on 07/2/2012 by Ismael González, UI Designer, and Michael Clark, Engineer

Tuenti Voice Control is a proof of concept that allows users to browse Tuenti with their voice instead of a mouse or keyboard. It was created for HackMeUp 15, a 24-hour code competition held among Tuenti engineers every quarter, and uses the Experimental Speech API available in Google Chrome since 2011.

In the demonstration video, Ismael González shows browsing Tuenti tabs, going to specific profiles, and starting chats. Watch the video now!

After creating a Chrome plugin that communicates speech-to-text data to the website, we spent the remaining three hours adding commands related to Tuenti. By the deadline we could:

  1. Access top-level pages on the site, such as “mensajes” (messages) and “salir” (log out)
  2. Target specific friends with “chat” or “perfil” (profile) followed by their name: users can go directly to Jose’s profile by saying “perfil jose”, or be more specific with “perfil jose manuel”
  3. Write speech directly to the chat conversation and send the message: after starting a chat with Natalia via “chat natalia”, any following speech, such as “hola natalia como estas” (hi Natalia, how are you), is written to the screen
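As a rough illustration of the command set above, a minimal parser might look like this (a hypothetical sketch; the function and table names are ours, not the plugin's actual code):

```javascript
// Hypothetical sketch of the command routing described above.
// The first recognized word selects an action; the rest name a target.
var COMMANDS = {
  mensajes: 'navigate',
  salir: 'navigate',
  chat: 'friend',
  perfil: 'friend'
};

function parseCommand(text) {
  var words = text.toLowerCase().trim().split(/\s+/);
  var action = words[0];
  if (!COMMANDS.hasOwnProperty(action)) {
    return null; // not a recognized command word
  }
  // "perfil jose manuel" -> { action: 'perfil', target: 'jose manuel' }
  return { action: action, target: words.slice(1).join(' ') };
}
```

In practice the matching is fuzzier than a first-word lookup, as described in the speech-processing section below.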

Plugin architecture

Chrome’s Experimental Speech API implements a subset of the features detailed in the W3C Speech Recognition Grammar Specification (a W3C Recommendation since March 2004) and allows extensions to start speech recognition and retrieve the captured text. To use experimental extension APIs, you must start Chrome with the command-line option --enable-experimental-extension-apis.

Google Chrome extensions are composed of HTML pages with specific functions. We use a single content script to capture events from the browser and send requests to a background page:

// Relay the page's 'speechstart' event to the background page,
// then tell the page that recognition has started
window.addEventListener("speechstart", function(e) {
  chrome.extension.sendRequest('speechstart', function(response) {
    triggerSimpleEvent('speechstarted');
  });
});
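The triggerSimpleEvent helper used here is not part of Chrome's API and is not shown in the post; a minimal sketch (our own, assuming it simply dispatches a plain, payload-free event) could be:

```javascript
// Hypothetical helper: fire a simple, data-less event on a target.
// The content script calls this with names like 'speechstarted'.
function triggerSimpleEvent(name, target) {
  target = target || document; // the content script fires on the document
  target.dispatchEvent(new Event(name));
}
```

The website can then react by adding an ordinary addEventListener for the same event name.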

This background page is able to access the experimental API and start speech recognition:

// Start speech recognition in Spanish (BCP 47 language tag)
chrome.experimental.speechInput.start({
  language: 'es-ES'
}, function () {
  if (chrome.extension.lastError) {
    console.debug("Couldn't start speech input: " + chrome.extension.lastError.message);
  }
});

The background page then communicates the result to the content script via an asynchronous request.

  // Target active tab
  chrome.tabs.getSelected(null, function (tab) {
    chrome.tabs.sendRequest(tab.id, {
      success: true,
      result: result
    }, function (response) {
      // Handle request callback
    });
  });

When the background page reports a result, the content script appends a JSON-serialized version of the speech hypotheses to the DOM and fires a ‘speechresult’ event; the success attribute records whether recognition succeeded.

  chrome.extension.onRequest.addListener(
    function(request, sender, sendResponse) {
      var voice = document.getElementById('voice');
      voice.setAttribute('success', request.success ? 'true' : '');
      voice.setAttribute('data', JSON.stringify(request.success ? request.result.hypotheses : []));
      triggerSimpleEvent('speechresult');
    }
  );

Serialization is required because the content script and the underlying website run in different JavaScript contexts, so objects cannot be shared between them.
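On the website side, a ‘speechresult’ handler can read those attributes back and deserialize the hypotheses. A minimal sketch (our own code; only the attribute names come from the snippet above):

```javascript
// Hypothetical site-side reader: recover the hypotheses array that
// the content script serialized into the #voice element's attributes.
function readVoiceResult(voiceElement) {
  if (voiceElement.getAttribute('success') !== 'true') {
    return []; // recognition failed or returned nothing
  }
  return JSON.parse(voiceElement.getAttribute('data'));
}
```

A ‘speechresult’ listener on the page would call readVoiceResult(document.getElementById('voice')) and feed the resulting hypotheses to the command processor.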

Processing the speech result

The W3C recommendation includes a method for specifying a grammar. This is crucial for achieving high accuracy and precision in a speech recognition system, because error rates decrease as the vocabulary shrinks: the digits 0-9 can be recognized essentially without error, while vocabularies of 200, 5,000, or 100,000 words can have error rates of roughly 3%, 7%, or 45%. After experimentation we found that custom grammars were not implemented in Chrome as of December 2011, and that Google returns any set of words from its dictionary.

We solved this issue by converting the recognized text into a bag of words and calculating the probability that the user wants to perform a given action on a given friend, based on the number of occurrences of words related to that action/friend pair.

  1. Each word is normalized and looked up in a hash map that maps words to the friends with that name or the commands referenced by that word
  2. Each matching friend or command has its score increased by a value weighted by the confidence returned by Google and the rank of the hypothesis in the result list
  3. The action/friend pair with the highest cumulative weight is performed
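The weighting steps above might be sketched like this (hypothetical code; the index structure, weighting formula, and names are our own reconstruction, not the plugin's actual implementation):

```javascript
// Hypothetical sketch of the bag-of-words weighting described above.
// `index` maps a normalized word to the candidate friends/commands
// that word can refer to; each hypothesis carries a confidence score.
function pickBestCandidate(hypotheses, index) {
  var scores = {};
  hypotheses.forEach(function (hypothesis, rank) {
    // Earlier hypotheses weigh more; scale by Google's confidence.
    var weight = (hypothesis.confidence || 1) / (rank + 1);
    hypothesis.utterance.toLowerCase().split(/\s+/).forEach(function (word) {
      (index[word] || []).forEach(function (candidate) {
        scores[candidate] = (scores[candidate] || 0) + weight;
      });
    });
  });
  var best = null;
  for (var candidate in scores) {
    if (best === null || scores[candidate] > scores[best]) {
      best = candidate;
    }
  }
  return best; // candidate with the highest cumulative weight, or null
}
```

Any additive scoring of this shape works; the key point is that weighting by confidence and rank lets a valid action/friend pair win even when some hypotheses contain unrelated dictionary words.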

This approach worked flawlessly whenever the words in the text returned by Google corresponded to a valid action/friend pair. It helps to speak clearly and to use a high-quality noise-cancelling microphone (we used an Apple MacBook Pro’s) so that the speech recognizer can detect the beginning and end of the command.

Conclusions

Before starting the project we did not know whether it would be possible, using a single key, to start speech recognition, let alone recognize commands. It was, and we think such techniques can provide a better web experience. For that to happen, Google (and other browser makers) and the W3C must work together to provide a stable API that any website can use without extensions.

HackMeUps are code competitions we hold every quarter at Tuenti. If you would like to participate in development at Tuenti, consider applying! We are always looking for talented candidates.

Tuenti Voice Control was created by Michael Clark (frontend developer) and Ismael Gonzalez (CSS architect).
