Introduction

In this article, we'll explore TensorFlow.js and the COCO-SSD model for object detection. I'll describe the steps needed to load the pre-trained COCO-SSD model, how to use it, and how to build a simple implementation to detect objects from a given image. Moreover, besides presenting an example, I want to provide a small preface to what object detection is, explain what's behind the COCO-SSD model, and introduce the TensorFlow Object Detection API, the library initially used to train the model.

Try the demo on your own webcam feed. Take a look!

As the demand for data products increases, the community has been rapidly developing solutions that allow us to create and apply the recent and groundbreaking advances in the field of AI across a variety of platforms. During the first years of the so-called Big Data or AI era, it was common to have a machine learning model running in a script. Then, as our problems and requirements evolved, these models moved into platforms such as production systems, the cloud, IoT devices, mobile devices, and the browser.

To answer the call for a battle-tested, trusted, and browser-first solution, in March 2018, the TensorFlow team released TensorFlow.js, a library aimed at web and JavaScript developers that lets them develop and train machine learning models in JavaScript and deploy them in the browser.

Similar to its big and more complete counterpart, TensorFlow.js provides many tools and out-of-the-box models that simplify the already-arduous and time-consuming task of training a machine learning model from scratch. For starters, it provides the means to convert pre-trained models from Python into TensorFlow.js, supports transfer learning (a technique for retraining pre-existing models with custom data), and, through the library ml5.js, even offers a way to create ML solutions without having to deal with the low-level implementations.
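As a taste of the first point, here is a minimal sketch (not part of this tutorial) of loading a model that was converted from Python with the tensorflowjs_converter tool; the model URL is hypothetical.

// A converted Keras/Python model is served as a model.json file plus weight shards.
// tf.loadLayersModel fetches and reassembles it; the URL below is hypothetical.
tf.loadLayersModel('https://example.com/my-model/model.json')
  .then(model => {
    // Once loaded, the model behaves like any other TensorFlow.js layers model.
    model.summary();
  });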

tensorflow.js logo

On the models' side, TensorFlow.js comes with several pre-trained models that serve different purposes, like PoseNet to estimate a person's pose in real time, the toxicity classifier to detect whether a piece of text contains toxic content, and lastly, the COCO-SSD model, an object detection model that identifies and localizes multiple objects in an image. If you ask me, what makes TensorFlow.js interesting, compelling, and attractive is how simple it is to load a pre-trained model and get it running. In another setting, let's say "normal" TensorFlow, if we'd like to use a pre-trained object detection model, we'd have to manually download it and then import it, and while this is not necessarily hard, it's another step that here, in TensorFlow.js, we can avoid.
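To give a sense of that simplicity, here is a minimal sketch of loading the coco-ssd package and running it on an image. It assumes the TensorFlow.js and coco-ssd scripts have already been included in the page, and the image element id is hypothetical.

// Grab an image already present in the page (hypothetical element id).
const img = document.getElementById('some-image');

// Load the pre-trained model and run it on the image; no manual download needed.
cocoSsd.load().then(model => {
  model.detect(img).then(predictions => {
    // Each prediction has a class, a score, and a bounding box.
    console.log(predictions);
  });
});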

TensorFlow Object Detection API

The task of image classification is a staple deep learning application. Here, you feed an image to the model, and it tells you its label. For example, in a typical cat and dog classifier, the label of the following image would (hopefully) be "cat."

white cat staring
A cat. Do you agree? Taken from Wikipedia.

And indeed, there's a cat here. However, where exactly is the cat? For us, the question is easy to answer, but not for our deep learning models.

For use cases in which we, the end-users, need to know the precise location of an object, there's a deep learning technique known as object detection. And it does precisely that: it detects objects in a frame, which could be an image or a video. The use cases for object detection include surveillance, visual inspection, and analyzing drone imagery, among others.
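To make the difference with classification concrete, this is a sketch of what a detector's output typically looks like (the values are made up): each detection carries a label, a confidence score, and the bounding box that localizes the object. The exact field names vary by library; the shape shown here matches what the COCO-SSD model we'll use later returns.

// Hypothetical detections: label, confidence score, and bounding box [x, y, width, height].
const detections = [
  { class: 'cat',    score: 0.97, bbox: [35, 20, 180, 240] },
  { class: 'person', score: 0.88, bbox: [260, 10, 140, 395] },
];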

The TensorFlow Object Detection API is TensorFlow's framework dedicated to training and deploying detection models. The package, based on the paper "Speed/accuracy trade-offs for modern convolutional object detectors" by Huang et al., provides support for several object detection architectures, such as SSD (Single Shot Detector) and Faster R-CNN (Faster Region-based Convolutional Neural Network), as well as feature extractors like MobileNet and Inception. The variety in architectures and extractors gives us lots of options, but deciding which one to use depends on our use case: how much accuracy we need and how much time we can spend making predictions. This accuracy/speed trade-off allows us to build models that suit a whole range of needs and platforms, for example, a light and portable object detector capable of running on a phone.

The model we'll be using in this article, COCO-SSD, is on the "fast-but-less-accurate" side of the spectrum, making it capable of running in a browser. However, before we start with the tutorial, I'd like to give an introduction to COCO-SSD and explain what it is.

accuracy vs time
Accuracy vs. time. Image taken from https://arxiv.org/pdf/1611.10012.pdf

COCO-SSD

In this tutorial, we'll use COCO-SSD, a pre-trained model ported to TensorFlow.js. To explain it, let's take a look at what each term, "COCO" and "SSD," means.

COCO refers to the "Common Objects in Context" dataset, the data on which the model was trained. This collection of images is mostly used for object detection, segmentation, and captioning, and it consists of over 200k labeled images whose objects belong to one of 90 different categories, such as "person," "bus," "zebra," and "tennis racket."

coco dataset example
COCO dataset examples. Image from COCO website.

Then, there's the term "SSD," which refers to the model architecture. SSD, or Single Shot Detector, is a neural network architecture made of a single feed-forward convolutional neural network that predicts the objects' labels and positions in the same pass. The alternative to this "single-shot" approach is an architecture that uses a "proposal generator," a component whose purpose is to search for regions of interest within an image.

Once the regions of interest have been identified, the typical second step is to extract the visual features of these regions and determine which objects are present in them, a process known as "feature extraction." COCO-SSD's default feature extractor is lite_mobilenet_v2, an extractor based on the MobileNet architecture. In general, MobileNet is designed for low-resource devices, such as mobile phones, single-board computers (e.g., the Raspberry Pi), and even drones.

The following image shows the building blocks of the MobileNetV2 architecture. In it, you can see that each block is made of only three layers. The first one is a 1 x 1 convolutional layer with ReLU6 as the activation function, followed by a depthwise convolution with a 3 x 3 kernel (also with ReLU6), and lastly, a 1 x 1 linear convolution.

MobileNetV2. Taken from MobileNetV2: Inverted Residuals and Linear Bottlenecks.
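To make that structure concrete, here is a minimal sketch of a single MobileNetV2-style block written with the TensorFlow.js layers API. The input shape and filter counts are illustrative, not the ones used by COCO-SSD's actual feature extractor, and batch normalization and the residual connections of the full architecture are omitted for brevity.

// A single MobileNetV2-style block: 1 x 1 expansion conv + ReLU6,
// 3 x 3 depthwise conv + ReLU6, and a 1 x 1 linear projection conv.
const block = tf.sequential();

block.add(tf.layers.conv2d({
  inputShape: [56, 56, 32], // illustrative input shape
  filters: 192,             // illustrative expansion of the channels
  kernelSize: 1,
  padding: 'same',
}));
block.add(tf.layers.reLU({ maxValue: 6 })); // ReLU6

block.add(tf.layers.depthwiseConv2d({ kernelSize: 3, padding: 'same' }));
block.add(tf.layers.reLU({ maxValue: 6 })); // ReLU6

// Linear 1 x 1 convolution: no activation function.
block.add(tf.layers.conv2d({ filters: 32, kernelSize: 1, padding: 'same' }));

block.summary();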

Now that we know a bit of the theory behind object detection and the model, it's time to apply it to a real use case.

Building a Web App for Object Detection

In this tutorial, we'll create a simple React web app that takes your webcam's live video feed as input and sends its frames to a pre-trained COCO-SSD model to detect objects in them. The only requirements are a browser (I'm using Google Chrome) and Python (either version works). Now, open your favorite code editor, create a new file, and name it index.html.

The first step is to load the TensorFlow.js library, the COCO-SSD model, and the React library from a CDN (Content Delivery Network). By doing it this way, we avoid installing anything locally on our machines... isn't that cool?

<!DOCTYPE html>
<html>

<head>
  <meta charset="UTF-8" />
  <title>TensorFlow.js OBD Demo</title>
  <!-- Load TensorFlow.js-->
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
  <!-- Load the coco-ssd model. -->
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>
  <!-- Load React. -->
  <script src="https://unpkg.com/react@16/umd/react.development.js" crossorigin></script>
  <script src="https://unpkg.com/react-dom@16/umd/react-dom.development.js" crossorigin></script>

  <script src="https://unpkg.com/babel-standalone@6.26.0/babel.min.js"></script>
</head>
    
    ...
    
    

Once that's done, the following step is to create the <body> tag and an excellent header using <h1>. Next, right under the header, we're going to add a <script> tag that imports detect.js, the file that contains our web app and all its magic. To import it, add the following line:

<script src="detect.js" type="text/babel"></script>

Notice the type attribute "text/babel", which is essential because, without it, we'd encounter errors like "Uncaught SyntaxError: Unexpected token <."

Lastly, we'll add a <div> tag for placing the React component, which in this case renders the video and its detections.

This is how index.html looks:

<!DOCTYPE html>
<html>

<head>
  <meta charset="UTF-8" />
  <title>TensorFlow.js OBD Demo</title>
  <!-- Load TensorFlow.js-->
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
  <!-- Load the coco-ssd model. -->
  <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>
  <!-- Load React. -->
  <script src="https://unpkg.com/react@16/umd/react.development.js" crossorigin></script>
  <script src="https://unpkg.com/react-dom@16/umd/react-dom.development.js" crossorigin></script>

  <script src="https://unpkg.com/babel-standalone@6.26.0/babel.min.js"></script>
</head>

<body>

  <h1>Demo of TensorFlow.js Coco SSD's model object detection</h1>
  <!-- Load our React component. -->
  <script src="detect.js" type="text/babel"></script>

  <!-- We will put our React component inside this div. -->
  <div id="root"></div>
  
</body>

</html>

To summarize, this HTML file is just the "shell" of the app; we are mostly using it to load the required libraries, import our JavaScript file, and display the video. Now, let's create a new file named detect.js.

The detect.js script will be the central part of our tutorial. In this file, we are going to write a React component that, in a nutshell, does the following things.

  1. It requests the user's permission to use their webcam.
  2. If the user accepts (please do), it fires up the webcam and consumes its stream.
  3. It loads the COCO-SSD model.
  4. The model consumes the webcam feed and checks for objects.
  5. Then, it uses the model's output to render the bounding boxes on the video frame.
  6. It returns said frame.

That was the outline; now, let's write the script.

We'll start by creating a class named App. After that, we'll create two React refs (objects that provide a way to access the nodes we'll be creating in the render method) to reference the video and the canvas that'll be used for drawing the bounding boxes. Following the refs, we'll define an object, named styles, that provides the (CSS) styling for the video and canvas.

class App extends React.Component {
  // reference to both the video and canvas
  videoRef = React.createRef();
  canvasRef = React.createRef();

  // we are gonna use inline style
  styles = {
    position: 'fixed',
    top: 150,
    left: 150,
  };
  
  ...
}

In the next step, we'll define the function detectFromVideoFrame, which takes as parameters the model (I'll soon show how to create it) and a video frame. For now, it looks like this.

detectFromVideoFrame = (model, video) => {
    ...
  };

Now, things get a bit more tricky. From inside this function, we could call model.detect(video) to perform the predictions. But the problem is that the predictions are not instantaneously produced because, after all, the model needs to process the input. Therefore, while the model is thinking, we'd be blocking the main application thread, or in simple words, the app would "freeze" while the prediction is being cooked.

However, luckily for us, there's a way to circumvent this, and that way is called a Promise (cute). A Promise is a programming pattern that will return a value sometime in the future, and Promises are used for "deferred and asynchronous computations" (as defined by Mozilla), meaning that we won't block the main thread (our web app) while we wait for the model's result. So, in our application, we'll use a Promise to detect the objects.
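If Promises are new to you, here is a tiny, self-contained illustration (not part of the app): the slow work happens asynchronously, and the callback passed to .then() receives the value once it's ready, without blocking the main thread.

// A toy Promise that resolves with a value after one second.
const slowComputation = new Promise((resolve, reject) => {
  setTimeout(() => resolve(42), 1000); // pretend this is a slow model
});

slowComputation.then(
  value => console.log(`Done: ${value}`), // runs when the Promise is fulfilled
  error => console.error(error)           // runs if the Promise is rejected
);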

To use a Promise, we simply have to call ".then(...)" after .detect(video). This is how the function looks so far:

detectFromVideoFrame = (model, video) => {
    model.detect(video).then(() => {
      ...
    });
  };

But what happens when the Promise has been fulfilled? Suppose everything worked, and the Promise delivered the detection. Now we need to do something with it.

To define how we'll use the result, we use a callback function (a function that will be executed after another one has finished) inside the Promise. And in the callback I'm about to present, we'll handle our detections.

detectFromVideoFrame = (model, video) => {
    model.detect(video).then(predictions => {
      this.showDetections(predictions);

      requestAnimationFrame(() => {
        this.detectFromVideoFrame(model, video);
      });
    }, (error) => {
      console.log("Couldn't start the webcam")
      console.error(error)
    });
  };

This is how the final function looks. Everything you see inside "predictions => {...}" is the callback. Inside it, we're calling this.showDetections(...) (I'll define it soon) and a function I won't fully explain (it's out of the scope of this tutorial), named requestAnimationFrame(), which schedules the next call to detectFromVideoFrame before the browser's next repaint (you heard that right: the function calls itself again), creating the detection loop.

A brief note before I move on. A Promise is not always successful, and it can fail for a million reasons. If we wish to handle this error, or simply log what happened, we can add to the Promise a second, optional callback function that will be called if the Promise is rejected. In this example, the callback simply logs "Couldn't start the webcam."

The following function I want to define is showDetections, and its purpose is to draw the detections' bounding boxes, as well as the labels and confidence scores, over the video. To manage this, first, we're going to iterate over all the predictions, and at each iteration, we'll get the coordinates of the predicted bounding box by accessing the prediction's bbox property. Then, we'll make some cosmetic changes to the canvas' context, e.g., the line width, and draw the box using ctx.strokeRect(x, y, width, height).

Once the box is drawn, the following step is drawing the label and score. To better visualize these, I'll add a small rectangle, using ctx.fillRect, that serves as a background for the text. Then, using ctx.fillText, we'll write the predicted class in the top-left corner of the box and the score in the bottom-left corner. That's the end of it.

showDetections = predictions => {
    const ctx = this.canvasRef.current.getContext("2d");
    ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
    const font = "24px helvetica";
    ctx.font = font;
    ctx.textBaseline = "top";

    predictions.forEach(prediction => {
      const x = prediction.bbox[0];
      const y = prediction.bbox[1];
      const width = prediction.bbox[2];
      const height = prediction.bbox[3];
      // Draw the bounding box.
      ctx.strokeStyle = "#2fff00";
      ctx.lineWidth = 1;
      ctx.strokeRect(x, y, width, height);
      // Draw the label background.
      ctx.fillStyle = "#2fff00";
      const textWidth = ctx.measureText(prediction.class).width;
      const textHeight = parseInt(font, 10);
      // draw top left rectangle
      ctx.fillRect(x, y, textWidth + 10, textHeight + 10);
      // draw bottom left rectangle
      ctx.fillRect(x, y + height - textHeight, textWidth + 15, textHeight + 10);

      // Draw the text last to ensure it's on top.
      ctx.fillStyle = "#000000";
      ctx.fillText(prediction.class, x, y);
      ctx.fillText(prediction.score.toFixed(2), x, y + height - textHeight);
    });
  };

So far, we have defined, in two functions, the main functionality of the app: detecting objects and drawing boxes. Now, for the final steps, we'll combine them under another function, and then we'll render everything we have just created so we can see it in the browser.

The function that wraps up both detectFromVideoFrame and showDetections is a React method named componentDidMount(). (I won't explain this one because it's out of the scope of this article. For now, trust me when I say that this method is executed right after the component is mounted, i.e., inserted into the DOM.)

The first thing we'll do in componentDidMount is ask the user for permission to access the webcam. If the user doesn't accept, then nothing happens. But if they do, then we'll declare two Promises. The first of them, webcamPromise, will be used to read the webcam stream, and the second one, loadModelPromise, to load the model. Then, we'll wait for both Promises to resolve by calling

Promise.all([loadModelPromise, webcamPromise]),

and as we already learned, this will run a callback function, and from this function, we'll call detectFromVideoFrame.

componentDidMount() {
    if (navigator.mediaDevices.getUserMedia) {
      // define a Promise that'll be used to load the webcam and read its frames
      const webcamPromise = navigator.mediaDevices
        .getUserMedia({
          video: true,
          audio: false,
        })
        .then(stream => {
          // store the webcam stream in window.stream
          window.stream = stream;
          // pass the stream to the videoRef
          this.videoRef.current.srcObject = stream;

          return new Promise(resolve => {
            this.videoRef.current.onloadedmetadata = () => {
              resolve();
            };
          });
        }, (error) => {
          console.log("Couldn't start the webcam")
          console.error(error)
        });

      // define a Promise that'll be used to load the model
      const loadModelPromise = cocoSsd.load();
      
      // resolve all the Promises
      Promise.all([loadModelPromise, webcamPromise])
        .then(values => {
          this.detectFromVideoFrame(values[0], this.videoRef.current);
        })
        .catch(error => {
          console.error(error);
        });
    }
  }

Lastly, to complete our App class, we need to define the React component's render() method, which will simply return a <div> whose inner nodes are a <video> and a <canvas>.

render() {
    return (
      <div> 
        <video
          style={this.styles}
          autoPlay
          muted
          ref={this.videoRef}
          width="720"
          height="600"
        />
        <canvas style={this.styles} ref={this.canvasRef} width="720" height="650" />
      </div>
    );
  }

Then, finally, at the very end of the file (not in the class!), we need to select our DOM container, the "place" in which we'll render our component, and for this, we'll use the root <div> tag we created in index.html. Following this, we'll call ReactDOM.render(), using as parameters a React element (created from the App class) and the DOM container from the previous line.

const domContainer = document.querySelector('#root');
ReactDOM.render(React.createElement(App), domContainer);

And that's the end of the code! Now it's your turn to play. To launch the web app, go to the root directory of the app and launch a web server. An easy way to create one is with Python, using the command $ python3 -m http.server, or $ python -m SimpleHTTPServer if you're using Python 2 (please update it).

Once the server is up and running, open your browser and go to http://localhost:8000/, and you'll be greeted by a prompt requesting permission to access the webcam. Upon accepting said request, wait a bit until the model is downloaded (meanwhile, you can check out your face on the screen [remember, we are using a Promise]), and voila, rejoice in the glory of out-of-the-box deep learning. Have fun!

A small note before I finish. By default, the loaded model is based on the "lite_mobilenet_v2" architecture. However, there are two other options: "mobilenet_v1" and "mobilenet_v2." If you wish to use one of them, pass a ModelConfig object to the cocoSsd.load() function and set its base attribute to the desired architecture, like this:

cocoSsd.load({base: "mobilenet_v2"})

Enjoy a video showcasing the app!

Recap

In this article, I explained how we can build an object detection web app using TensorFlow.js. First, I introduced the TensorFlow.js library and the TensorFlow Object Detection API. Then, I described the model we used, COCO-SSD, and said a couple of words about its architecture, feature extractor, and the dataset it was trained on. Next, following the theory, we built a React app that uses a pre-trained COCO-SSD model to detect objects from a webcam stream.

The tutorial's complete source code is available at: https://github.com/juandes/tensorflowjs-objectdetection-tutorial