Make a Voice-Controlled Audio Player with the Web Speech API


This article was peer reviewed by Edwin Reynoso and Mark Brown. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!

The Web Speech API is a JavaScript API that enables web developers to incorporate speech recognition and synthesis into their web pages.

There are many reasons to do this. For example, to enhance the experience of people with disabilities (particularly users with sight problems, or users with limited ability to move their hands), or to allow users to interact with a web app while performing a different task (such as driving).

If you have never heard of the Web Speech API, or you would like a quick primer, then it might be a good idea to read Aurelio De Rosa’s articles Introducing the Web Speech API, Speech Synthesis API and the Talking Form.

Browser Support

Browser vendors have only recently started implementing both the Speech Recognition API and the Speech Synthesis API. Support for these features is still far from perfect, so if you are following along with this tutorial, please use an appropriate browser.

In addition, the Speech Recognition API currently requires an Internet connection, as the speech data is sent over the wire and the results are returned to the browser. If the connection uses HTTP, the user has to permit a site to use their microphone on every request. If the connection uses HTTPS, then this is only necessary once.

Speech Recognition Libraries

Libraries can help us manage complexity and can ensure we stay forward compatible. For example, when another browser starts supporting the Speech Recognition API, we would not have to worry about adding vendor prefixes.

One such library is Annyang, which is incredibly easy to work with.

To initialize Annyang, we add its script to our website:

<script src="//cdnjs.cloudflare.com/ajax/libs/annyang/1.6.0/annyang.min.js"></script>

We can check if the API is supported like so:

if (annyang) { /*logic */ }

We then add commands using an object with the command names as keys and the callbacks as values:

var commands = {
  'show divs': function() {
    $('div').show();
  },
  'show forms': function() {
    $("form").show();
  }
};

Finally, we just add them and start the speech recognition using:

annyang.addCommands(commands);
annyang.start();

Voice-controlled Audio Player

In this article, we will be building a voice-controlled audio player. We will be using both the Speech Synthesis API (to inform users which song is beginning, or that a command was not recognized) and the Speech Recognition API (to convert voice commands to strings which will trigger different app logic).

The great thing about an audio player that uses the Web Speech API is that users will be able to surf to other pages in their browser or minimize the browser and do something else while still being able to switch between songs. If we have a lot of songs in the playlist, we could even request a particular song without searching for it manually (if we know its name or singer, of course).

We will not be relying on a third-party library for the speech recognition, as we want to show how to work with the API without adding extra dependencies to our projects. The voice-controlled audio player will only support browsers that implement the interimResults attribute. The latest version of Chrome should be a safe bet.

As ever, you can find the complete code on GitHub, and a demo on CodePen.

Getting Started — a Playlist

Let’s start with a static playlist. It consists of an object with different songs in an array. Each song is a new object containing the path to the file, the singer’s name and the name of the song:

var data = {
  "songs": [
    {
      "fileName": "https://www.ruse-problem.org/songs/RunningWaters.mp3",
      "singer" : "Jason Shaw",
      "songName" : "Running Waters"
    },
    ...

We should be able to add new objects to the songs array and have the new songs automatically included in our audio player.

The Audio Player

Now we come to the player itself. This will be an object containing the following things:

  • some setup data
  • methods pertaining to the UI (e.g. populating the list of songs)
  • methods pertaining to the Speech API (e.g. recognizing and processing commands)
  • methods pertaining to the manipulation of audio (e.g. play, pause, stop, prev, next)

Setup Data

This is relatively straightforward.

var audioPlayer = {
  audioData: {
    currentSong: -1,
    songs: []
  },

The currentSong property refers to the index of the song that the user is currently on. This is useful, for example, when we have to play the next/previous song, or stop/pause the song.

The songs array contains all the songs that the user has listened to. This means that the next time the user listens to the same song, we can load it from the array and not have to download it.

You can see the full code here.
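To make the caching idea concrete, here is a rough sketch of how a song might be loaded once and then reused on later plays. This is only an illustration based on the setup data above (the helper name loadSong is hypothetical), not the repository code:

loadSong: function (index) {
  // Illustration only: reuse the cached Audio object if this song has been played before
  if (this.audioData.songs[index]) {
    return this.audioData.songs[index];
  }
  // Otherwise create an Audio object from the playlist entry and cache it for next time
  var song = new Audio(data.songs[index].fileName);
  this.audioData.songs[index] = song;
  return song;
}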

UI Methods

The UI will consist of a list of available commands, a list of available tracks and a context box to inform the user of both the current operation and the previous command. I won’t go into the UI methods in detail, but rather offer a brief overview. You can find the code for these methods here.

load

This iterates over our previously declared playlist and appends the name of the song, as well as the name of the artist to a list of available tracks.
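As a rough illustration (the #playlist element and the markup are assumptions, not the repository’s):

load: function () {
  // Illustration only: append each song and artist as a list item to an assumed #playlist element
  data.songs.forEach(function (song, i) {
    $("#playlist").append(
      "<li data-index='" + i + "'>" + song.songName + " by " + song.singer + "</li>"
    );
  });
}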

changeCurrentSongEffect

This indicates which song is currently playing (by marking it green and adding a pair of headphones next to it) as well as those which have finished playing.

playSong

This indicates to the user that a song is playing, or that it has ended. It does this via the changeStatusCode method, which adds this information to the context box and informs the user of the change via the Speech API.

changeStatusCode

As mentioned above, this updates the status message in the context box (e.g. to indicate that a new song is playing) and utilizes the speak method to announce this change to the user.
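A stripped-down version might look something like this (the #status element is an assumption for the sake of illustration):

changeStatusCode: function (code, scope) {
  // Illustration only: show the message in the context box and read it aloud
  $("#status").text(code);
  this.speak(code, scope);
}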

changeLastCommand

A small helper which updates the last command box.

toggleSpinner

A small helper to hide or show the spinner icon (which indicates to the user that their voice command is currently being processed).

Player Methods

The player will be responsible for what you might expect, namely: starting, stopping and pausing playback, as well as moving backwards and forwards through the tracks. Again, I don’t want to go into the methods in detail, but would rather point you towards our GitHub repo.

play

This checks whether the user has listened to a song yet. If not, it starts the song; otherwise, it calls the playSong method we discussed previously on the currently cached song, which is located in audioData.songs at the currentSong index.

pauseSong

This pauses or completely stops (returns playback time to the song’s beginning) a song, depending on what is passed as the second parameter. It also updates the status code to notify the user that the song has either been stopped or paused.

stop

This either pauses or stops the song, based on its first and only parameter.
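Since pauseSong already does the actual work, stop can be a thin wrapper around it. As a rough sketch (the way the flag is passed on to pauseSong is an assumption, not the repository code):

stop: function (fullStop) {
  // Illustration only: hand the current song and the pause/stop flag to pauseSong
  var song = this.audioData.songs[this.audioData.currentSong];
  this.pauseSong(song, fullStop);
}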

prev

This checks whether the previous song is cached and if so, it pauses the current song, decrements currentSong and plays the current song again. If the new song is not in the array, it does the same but it first loads the song from the file name/path corresponding to the decremented currentSong index.

next

If the user has listened to a song before, this method tries to pause it. If there is a next song in our data object (i.e. our playlist) it loads it and plays it. If there is no next song it just changes the status code and informs the user that they have reached the final song.

searchSpecificSong

This takes a keyword as an argument and performs a linear search across song names and artists, before playing the first match.
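As a rough sketch of the idea (the real implementation is in the repository):

searchSpecificSong: function (keyword) {
  // Illustration only: a simple linear search over the playlist data
  keyword = keyword.toLowerCase();
  for (var i = 0; i < data.songs.length; i++) {
    var song = data.songs[i];
    if (song.songName.toLowerCase().indexOf(keyword) !== -1 ||
        song.singer.toLowerCase().indexOf(keyword) !== -1) {
      // Found a match: jump to that index and play it
      this.audioData.currentSong = i;
      this.play();
      return;
    }
  }
  // No match found: let the user know
  this.changeStatusCode("Could not find " + keyword);
}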

Speech API Methods

The Speech API is surprisingly easy to implement. In fact, it only takes two lines of code to get a web app talking to users:

var utterance = new SpeechSynthesisUtterance('Hello');
window.speechSynthesis.speak(utterance);

What we are doing here is creating an utterance object which contains the text we wish to be spoken. The speechSynthesis interface (which is available on the window object) is responsible for processing this utterance object and controlling the playback of the resulting speech.

Go ahead and try it out in your browser. It’s that easy!

speak

We can see this in action in our speak method, which reads aloud the message passed as an argument:

speak: function(text, scope) {
  // Build an utterance from the text (replacing a hyphen so it reads naturally)
  var message = new SpeechSynthesisUtterance(text.replace("-", " "));
  message.rate = 1;
  window.speechSynthesis.speak(message);
  // If a scope (an Audio object) was passed in, resume it once the message has been spoken
  if (scope) {
    message.onend = function() {
      scope.play();
    }
  }
}

If there is a second argument (scope), we call the play method on scope (which would be an Audio object) after the message has finished playing.
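For example, the player can announce a track and hand over the Audio object so that playback resumes once the announcement has finished (the variable name currentAudio is just for illustration):

// Illustration only: announce the track, then resume currentAudio when the message ends
audioPlayer.speak("Playing Running Waters by Jason Shaw", currentAudio);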

processCommands

This method is not as exciting. It receives a command as a parameter and calls the appropriate method to respond to it. It uses a regular expression to check whether the user wants to play a specific song; otherwise, it enters a switch statement to test different commands. If none corresponds to the command received, it informs the user that the command was not understood.

You can find the code for it here.
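As a rough idea, a trimmed-down version could be structured like this (the exact command words, messages and the pause/stop flag are assumptions):

processCommands: function (command) {
  // Illustration only: "play <something>" is treated as a request for a specific song
  var match = command.match(/^play (.+)$/);
  if (match) {
    this.searchSpecificSong(match[1]);
    return;
  }
  switch (command) {
    case "play":
      this.play();
      break;
    case "pause":
      this.stop(false);
      break;
    case "stop":
      this.stop(true);
      break;
    case "next":
      this.next();
      break;
    case "previous":
      this.prev();
      break;
    default:
      // Unknown command: tell the user it was not understood
      this.changeStatusCode("Command not recognized: " + command);
  }
}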

Tying Things Together

By now we have a data object representing our playlist, as well as an audioPlayer object representing the player itself. Now we need to write some code to recognize and deal with user input. Please note that this will only work in webkit browsers.

The code to have users talk to your app is just as simple as before:

var recognition = new webkitSpeechRecognition();
recognition.onresult = function(event) {
  console.log(event);
};
recognition.start();

This will invite the user to allow the page access to their microphone. If you allow access, you can start talking, and when you stop, the onresult event will be fired, making the results of the speech capture available as a JavaScript object.

Reference: The HTML5 Speech Recognition API

We can implement this in our app as follows:

if (window['webkitSpeechRecognition']) {
  var speechRecognizer = new webkitSpeechRecognition();

  // Recognition will not end when user stops speaking
  speechRecognizer.continuous = true;

  // Process the request while the user is speaking
  speechRecognizer.interimResults = true;

  // Account for accent
  speechRecognizer.lang = "en-US";

  speechRecognizer.onresult = function (evt) { ... }
  speechRecognizer.onend = function () { ... }
  speechRecognizer.start();
} else {
  alert("Your browser does not support the Web Speech API");
}

As you can see, we test for the presence of webkitSpeechRecognition on the window object. If it is there, we’re good to go; otherwise we inform the user that the browser doesn’t support it. If all’s good, we then set a couple of options. Of these, lang is an interesting one which can improve the results of the recognition, based on where your users hail from.

We then declare handlers for the onresult and the onend events, before kicking things off with the start method.

Handling a result

There are a few things we want to do when the speech recognizer gets a result, at least in the context of the current implementation of speech recognition and our needs. Each time there is a result, we want to save it in an array and set a timeout to wait for three seconds, so the browser can collect any further results. Once the three seconds are up, we want to loop over the gathered results in reverse order (newer results have a better chance of being accurate) and check whether any recognized transcript contains one of our available commands. If it does, we execute the command and restart the speech recognition. We do this because waiting for a final result can take up to a minute, making our audio player seem quite unresponsive and pointless, as it would be faster to just click a button.

speechRecognizer.onresult = function (evt) {
  // Show the spinner and store the latest results
  audioPlayer.toggleSpinner(true);
  results.push(evt.results);
  if (!timeoutSet) {
    setTimeout(function() {
      timeoutSet = false;
      // Loop over the gathered results in reverse order (newest first)
      results.reverse();
      try {
        results.forEach(function (val, i) {
          var el = val[0][0].transcript.toLowerCase();
          // If the first word of the transcript matches one of our commands, execute it
          if (currentCommands.indexOf(el.split(" ")[0]) !== -1) {
            speechRecognizer.abort();
            audioPlayer.processCommands(el);
            audioPlayer.toggleSpinner();
            results = [];
            // BreakLoopException (defined elsewhere in the project) is thrown to exit the forEach early
            throw new BreakLoopException;
          }
          // Otherwise, still pass the newest result to processCommands,
          // which will report an unrecognized command
          if (i === 0) {
            audioPlayer.processCommands(el);
            speechRecognizer.abort();
            audioPlayer.toggleSpinner();
            results = [];
          }
        });
      }
      catch(e) {return e;} // swallow the exception used to break out of the loop
    }, 3000);
  }
  timeoutSet = true;
};

As we are not using a library we have to write more code to set up our speech recognizer, looping over each result and checking if its transcript matches a given keyword.

Lastly, we restart speech recognition as soon as it ends:

speechRecognizer.onend = function () {
  speechRecognizer.start();
}

You can see the full code for this section here.

And that’s it. We now have an audio player which is fully functional and voice-controlled. I urge you to download the code from GitHub and have a play around with it, or check out the CodePen demo. I have also made available a version which is served via HTTPS.

Conclusion

I hope this practical tutorial has served as a healthy introduction as to what is possible with the Web Speech API. I think that we will see use of this API grow, as implementations stabilize and new features are added. For example, I see a YouTube of the future which is completely voice-controlled, where we can watch the videos of different users, play specific songs and move between songs just with voice commands.

There are also many other areas where the Web Speech API could bring improvements, or open new possibilities. For example browsing email, navigating websites, or searching the web — all with your voice.

Are you using this API in your projects? I’d love to hear from you in the comments below.

Frequently Asked Questions (FAQs) on Voice-Controlled Audio Player with Web Speech API

How does the Web Speech API work in a voice-controlled audio player?

The Web Speech API is a powerful tool that allows developers to incorporate speech recognition and synthesis into their web applications. In a voice-controlled audio player, the API works by converting spoken commands into text, which the application can then interpret and act upon. For instance, if a user says “play”, the API will convert this into text, and the application will understand this as a command to start playing the audio. This process involves complex algorithms and machine learning techniques to accurately recognize and interpret human speech.

What are the benefits of using a voice-controlled audio player?

A voice-controlled audio player offers several benefits. Firstly, it provides a hands-free experience, which can be particularly useful when the user is busy with other tasks. Secondly, it can enhance accessibility for users with physical disabilities who may find traditional controls difficult to use. Lastly, it offers a novel and engaging user experience, which can make your application stand out from the competition.

Can I use the Web Speech API in any web browser?

The Web Speech API is supported by most modern web browsers, including Google Chrome, Mozilla Firefox, and Microsoft Edge. However, it’s always a good idea to check the specific browser compatibility before implementing the API in your application, as support can vary between different versions and platforms.

How can I improve the accuracy of speech recognition in my voice-controlled audio player?

The accuracy of speech recognition can be improved by using a high-quality microphone, reducing background noise, and training the API to better understand the user’s voice and accent. Additionally, you can implement error handling in your application to deal with unrecognized commands and provide feedback to the user.

Can I customize the voice commands in my voice-controlled audio player?

Yes, you can customize the voice commands in your voice-controlled audio player. This can be done by defining your own set of commands in the application code, which the Web Speech API will then recognize and interpret. This allows you to tailor the user experience to your specific needs and preferences.

Is it possible to use the Web Speech API for languages other than English?

Yes, the Web Speech API supports a wide range of languages. You can specify the language in the API settings, and it will then recognize and interpret commands in that language. This makes it a versatile tool for developing applications for international audiences.
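For example, both recognition and synthesis accept a language tag (the values below are only examples):

// Recognize spoken Spanish
speechRecognizer.lang = "es-ES";

// Speak a reply in German
var utterance = new SpeechSynthesisUtterance("Hallo");
utterance.lang = "de-DE";
window.speechSynthesis.speak(utterance);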

How secure is the Web Speech API?

The Web Speech API is designed with security in mind. It uses secure HTTPS connections to transmit speech data, and it does not store any personal information. However, as with any web technology, it’s important to follow best practices for security, such as regularly updating your software and protecting your application against common web vulnerabilities.

Can I use the Web Speech API in mobile applications?

While the Web Speech API is primarily designed for web applications, it can also be used in mobile applications through web views. However, for native mobile applications, you may want to consider using platform-specific speech recognition APIs, which may offer better performance and integration.

What are the limitations of the Web Speech API?

While the Web Speech API is a powerful tool, it does have some limitations. For instance, it requires an internet connection to work, and its accuracy can be affected by factors such as background noise and the user’s accent. Additionally, support for the API can vary between different web browsers and platforms.

How can I get started with the Web Speech API?

To get started with the Web Speech API, you’ll need a basic understanding of JavaScript and web development. You can then explore the API documentation, which provides detailed information on its features and how to use them. There are also many online tutorials and examples available, which can help you learn how to incorporate the API into your own applications.

Ivan Dimov

Ivan is a student of IT, a freelance web developer and a tech writer. He deals with both front-end and back-end stuff. Whenever he is not in front of an Internet-enabled device he is probably reading a book or travelling.
