Speech recognition: voice control (Java, Kaldi)

Speech recognition is a field of computer science and computational linguistics that develops methods and technologies enabling computers to recognize spoken language and translate it into text.

The first speech recognition applications that come to mind are probably Google Assistant, Alexa, and Siri. The technology is used in many fields: banking, e-commerce, the workplace, IoT, language learning, and more.

To be honest, this is not a topic you can easily dive into without a strong theoretical foundation. There are, however, ready-to-use solutions with decent accuracy that may be more than enough for your needs. We’ll use one of them to implement voice control for a web application.

Docker container setup

Kaldi is an open-source speech recognition toolkit written in C++ and freely available under the Apache License v2.0. Under the hood it uses deep neural networks trained on large amounts of audio data. Kaldi and neural networks are complex topics, so we are definitely not going to cover them here beyond the basics. Instead, we’ll focus on a ready-to-use solution that can be quickly adjusted and applied to your needs.

We’ll need Docker for this. There is an existing Docker image with Kaldi and GStreamer already installed and configured, so all we need to do is provide a speech recognition model to the Kaldi instance running inside a container.

Let’s go through the steps:

  • Download and install Docker
  • Download the image using the following command:
docker pull madiskarli/mindtitan-kaldi-gstreamer-server:1.0
  • Download a speech recognition model. Zamia Speech provides great pre-trained models.
    We’re going to use kaldi-generic-en-tdnn_f, which is trained on ~1200 hours of English speech.
  • Extract the contents of the downloaded archive into a folder with the following structure (the expected layout is sketched below):
kaldi-models/en/exp/nnet3_chain/
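The tree below is derived from the paths referenced by the configuration in the next step; the archive may contain additional files:

kaldi-models/en/exp/nnet3_chain/
├── conf/
│   └── mfcc_hires.conf
├── ivectors_test_hires/
│   └── conf/
│       └── ivector_extractor.conf
└── model/
    ├── final.mdl
    └── graph/
        ├── HCLG.fst
        └── words.txt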
  • Create a configuration file nnet2.yaml and put it into kaldi-models/en:
use-nnet2: True
decoder:
    nnet-mode: 3
    use-threaded-decoder: true
    model: exp/nnet3_chain/model/final.mdl
    word-syms: exp/nnet3_chain/model/graph/words.txt
    fst: exp/nnet3_chain/model/graph/HCLG.fst
    mfcc-config: exp/nnet3_chain/conf/mfcc_hires.conf
    frame-subsampling-factor: 3
    ivector-extraction-config: exp/nnet3_chain/ivectors_test_hires/conf/ivector_extractor.conf
    max-active: 7000
    beam: 15.0
    lattice-beam: 6.0
    acoustic-scale: 1.0 #0.083
    do-endpointing: true
    endpoint-silence-phones: "1:2:3:4:5:6:7:8:9:10"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
out-dir: tmp

use-vad: False
silence-timeout: 10

logging:
    version: 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]
  • Run a container from the folder that contains kaldi-models. This command will also connect you to a shell inside the container:
docker run -it -p 8888:80 -v $(pwd)/kaldi-models/en:/opt/models madiskarli/mindtitan-kaldi-gstreamer-server:1.0 /bin/bash
  • Run the following command inside the container to start the speech recognition service:
/opt/start.sh -y /opt/models/nnet2.yaml
  • Check /opt/worker.log to make sure there are no errors and everything is running successfully

From now on, we can send an HTTP PUT request with audio bytes to http://localhost:8888/client/dynamic/recognize and receive a JSON response with the recognized speech as text.
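For a quick smoke test, you can PUT an audio file with curl (the file name here is just an example):

curl -T test.raw "http://localhost:8888/client/dynamic/recognize"

A successful response looks roughly like this (the field names come from the kaldi-gstreamer-server project; the exact shape may vary between server versions):

{"status": 0, "hypotheses": [{"utterance": "next"}], "id": "…"}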

Speech recognition application

Now that the speech recognition container is set up, it’s time to decide how to test it in an interesting and practical way. Let’s create a web application with a fixed (for simplicity) list of images that can be switched using the voice commands “next” and “previous”.

  • Generate a Spring Boot web application using Spring Initializr
  • Set up Spring MVC to render views (see the snippet below)
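Rendering JSP pages from Spring Boot typically needs two things: the tomcat-embed-jasper dependency on the classpath and a view resolver configuration. A minimal sketch in application.properties, assuming the views live under src/main/webapp/WEB-INF/jsp/:

spring.mvc.view.prefix=/WEB-INF/jsp/
spring.mvc.view.suffix=.jsp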
  • Implement a service that sends an HTTP request for speech recognition:
import java.util.Optional;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class SpeechRecognitionService {

    @Autowired
    private RestTemplate restTemplate;

    public Optional<String> recognize(byte[] blob) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.parseMediaType("audio/x-raw"));

        HttpEntity<byte[]> requestEntity = new HttpEntity<>(blob, headers);

        // PUT the audio bytes to the Kaldi GStreamer server and read the JSON reply as a string
        ResponseEntity<String> response = restTemplate.exchange(
                "http://localhost:8888/client/dynamic/recognize", HttpMethod.PUT, requestEntity, String.class
        );

        // The body may be null on an empty response, so wrap it with ofNullable instead of of
        return Optional.ofNullable(response.getBody());
    }
}
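Note that Spring Boot does not auto-configure a RestTemplate bean, so one has to be declared explicitly. The page in the next step posts the recorded audio to a /recognize endpoint and expects a JSON response with success and recognizedText fields. Here is a minimal sketch of both pieces (the class names and the exact JSON parsing are my assumptions; the full version is in the repository mentioned at the end of the article):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;
import org.springframework.web.multipart.MultipartFile;

@Configuration
class RestTemplateConfig {

    // RestTemplate is not auto-configured by Spring Boot, so expose it as a bean
    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

@RestController
public class SpeechRecognitionController {

    @Autowired
    private SpeechRecognitionService speechRecognitionService;

    private final ObjectMapper objectMapper = new ObjectMapper();

    @PostMapping("/recognize")
    public Map<String, Object> recognize(@RequestParam("blob") MultipartFile blob) throws IOException {
        Map<String, Object> result = new HashMap<>();
        Optional<String> response = speechRecognitionService.recognize(blob.getBytes());
        if (response.isPresent()) {
            // kaldi-gstreamer-server replies with {"status": 0, "hypotheses": [{"utterance": "..."}], ...}
            JsonNode root = objectMapper.readTree(response.get());
            if (root.path("status").asInt() == 0 && root.path("hypotheses").size() > 0) {
                result.put("success", true);
                result.put("recognizedText", root.path("hypotheses").get(0).path("utterance").asText().trim());
                return result;
            }
        }
        result.put("success", false);
        return result;
    }
}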
  • Implement a page that captures audio while you hold a button down and switches images when you say “next” or “previous”:
<%@ page contentType="text/html;charset=UTF-8" language="java" %>

<html>
    <head>
        <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
    </head>
    <body>
        <script>
            var SpeechRecognition = (function() {
                return {
                    curImgPos: 1,
                    imagesNum: 3,
                    incrementPos: function() {
                        if (this.curImgPos >= this.imagesNum) {
                            this.curImgPos = 1;
                        } else {
                            this.curImgPos++;
                        }
                    },
                    decrementPos: function() {
                        if (this.curImgPos <= 1) {
                            this.curImgPos = this.imagesNum;
                        } else {
                            this.curImgPos--;
                        }
                    },
                    executeCommand: function (recognizedText) {
                        if (recognizedText.toLowerCase() == 'next') {
                            this.incrementPos();
                            $('#image').attr('src', '/images/' + this.curImgPos + '.jpg');
                        } else if (recognizedText.toLowerCase() == 'previous') {
                            this.decrementPos();
                            $('#image').attr('src', '/images/' + this.curImgPos + '.jpg');
                        } else {
                            alert('Can not recognize a command: ' + recognizedText);
                        }
                    },
                    initAudio: function() {
                        var self = this;
                        var audioChunks;
                        $('#record-btn').on('mousedown', function() {
                            navigator.mediaDevices.getUserMedia({
                                audio:true
                            }).then(stream => {
                                audioChunks = [];
                                self.rec = new MediaRecorder(stream);
                                self.rec.ondataavailable = e => {
                                    audioChunks.push(e.data);
                                    if (self.rec.state == "inactive"){
                                        var blob = new Blob(audioChunks, { type:'audio/x-mpeg-3' });
                                        var fd = new FormData();
                                        fd.append('blob', blob);
                                        $.ajax({
                                            url: '/recognize',
                                            type: 'POST',
                                            data: fd,
                                            cache: false,
                                            processData: false,
                                            contentType: false,
                                            success: function (data) {
                                                if (data.success) {
                                                    self.executeCommand(data.recognizedText)
                                                } else {
                                                    console.log("Could not recognize speech");
                                                }
                                            },
                                            error: function (e) {
                                                console.log("Could not recognize speech: " + e);
                                            }
                                        });
                                    }
                                }
                                self.rec.start();
                            }).catch(e => console.log(e));
                        });
                        $('#record-btn').on('mouseup', function() {
                            self.rec.stop();
                        });
                    },
                    init: function () {
                        this.initAudio();
                    }
                };
            })();
            $(window).on('load', function() {
                SpeechRecognition.init();
            });
        </script>

        <div>
            <div style="display: flex; text-align: center; flex-direction: column;">
                <h3>Speech Recognition</h3>
                <div>
                    <img id="image" src="/images/1.jpg" />
                </div>

                <div style="margin-top: 1rem;">
                    <button id="record-btn" class="btn btn-primary">Press And Command</button>
                </div>
            </div>
        </div>
    </body>

</html>

The full code can be found on the project’s GitHub page.

So, we’ve set up a speech recognition container with Kaldi and GStreamer and created a web application that reacts to your voice commands (by switching images, in this example).

A more detailed walkthrough can be found in the accompanying video.
