Clothes shopping is a taxing experience. My eyes get bombarded with too much information. Sales, coupons, colors, toddlers, flashing lights, and crowded aisles are just a few examples of all the signals forwarded to my visual cortex, whether or not I actively try to pay attention. The visual system absorbs an abundance of information. Should I go for those H&M khaki pants? Is that a Nike tank top? What color are those Adidas sneakers?

Can a computer automatically detect pictures of shirts, pants, dresses, and sneakers? It turns out that accurately classifying images of fashion items is surprisingly straightforward, given quality training data to start from. In this tutorial, we’ll walk through building a machine learning model for recognizing images of fashion objects using the Fashion-MNIST dataset. We’ll cover how to train the model, design the input and output for category classification, and finally display the accuracy results for each model.

Image Classification

The problem of Image Classification goes like this: given a set of images that are all labeled with a single category, we are asked to predict these categories for a novel set of test images and measure the accuracy of the predictions. There are a variety of challenges associated with this task, including viewpoint variation, scale variation, intra-class variation, image deformation, image occlusion, illumination conditions, and background clutter.

How might we go about writing an algorithm that can classify images into distinct categories? Computer Vision researchers have come up with a data-driven approach to solve this. Instead of trying to specify what every one of the image categories of interest looks like directly in code, they provide the computer with many examples of each image class and then develop learning algorithms that look at these examples and learn the visual appearance of each class. In other words, they first accumulate a training dataset of labeled images, then feed it to the computer so it can get familiar with the data.

Floral dress identification using image classification - deep learning

Given that fact, the complete image classification pipeline can be formalized as follows:

  • Our input is a training dataset that consists of N images, each labeled with one of K different classes.
  • Then, we use this training set to train a classifier to learn what every one of the classes looks like.
  • In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are the most popular neural network model used for image classification problems. The big idea behind CNNs is that a local understanding of an image is good enough. The practical benefit is that having fewer parameters greatly improves the time it takes to learn as well as reduces the amount of data required to train the model. Instead of a fully connected network of weights from each pixel, a CNN has just enough weights to look at a small patch of the image. It’s like reading a book with a magnifying glass; eventually, you read the whole page, but you look at only a small patch of the page at any given time.

Consider a 256 × 256 image. A CNN can efficiently scan it chunk by chunk — say, a 5 × 5 window. The 5 × 5 window slides along the image (usually left to right, and top to bottom), as shown below. How “quickly” it slides is called its stride length. For example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it spans the entire image.

A convolution is a weighted sum of the pixel values of the image as the window slides across the whole image. It turns out that this convolution process throughout an image with a weight matrix produces another image (of the same size or slightly smaller, depending on the convention). Convolving is the process of applying a convolution.

convolution process
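
To make the sliding-window mechanics concrete, here is a minimal NumPy sketch (not from the original post) of a single 5 × 5 convolution with a configurable stride; the weight values are arbitrary, and only the mechanics matter.

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image and collect the weighted sums
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)  # weighted sum over one 5 x 5 window
    return out

image = np.random.rand(256, 256)   # a dummy 256 x 256 "image"
kernel = np.random.rand(5, 5)      # one 5 x 5 weight matrix
print(convolve2d(image, kernel, stride=2).shape)  # (126, 126)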

The sliding-window shenanigans happen in the convolution layer of the neural network. A typical CNN has multiple convolution layers. Each convolutional layer typically generates many alternate convolutions, so the weight matrix is a tensor of 5 × 5 × n, where n is the number of convolutions.

As an example, let’s say an image goes through a convolution layer with a weight matrix of 5 × 5 × 64. It generates 64 convolutions by sliding a 5 × 5 window. Therefore, this layer has 5 × 5 × 64 (= 1,600) parameters, which is remarkably fewer than a fully connected layer over all 256 × 256 (= 65,536) pixels.

The beauty of the CNN is that the number of parameters is independent of the size of the original image. You can run the same CNN on a 300 × 300 image, and the number of parameters won’t change in the convolution layer.
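
As a quick sanity check of this claim, the following sketch (an illustrative assumption, not code from the original post) builds the same 5 × 5 × 64 convolution layer on a 256 × 256 input and on a 300 × 300 input; Keras reports the same number of parameters in both cases (5 × 5 × 64 = 1,600 weights, plus 64 biases).

from keras.models import Sequential
from keras.layers import Conv2D

for size in (256, 300):
    m = Sequential()
    m.add(Conv2D(64, kernel_size=(5, 5), input_shape=(size, size, 1)))
    # 5*5*1*64 weights + 64 biases = 1,664, regardless of the input size
    print(size, m.count_params())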

Data Augmentation

Image classification research datasets are typically very large. Nevertheless, data augmentation is often used in order to improve generalisation properties. Typically, random cropping of rescaled images together with random horizontal flipping and random RGB colour and brightness shifts are used. Different schemes exist for rescaling and cropping the images (i.e. single-scale vs. multi-scale training). Multi-crop evaluation during test time is also often used, although it is computationally more expensive and yields limited performance improvement. Note that the goal of the random rescaling and cropping is to learn the important features of each object at different scales and positions. Keras does not implement all of these data augmentation techniques out of the box, but they can easily be implemented through the ImageDataGenerator module and its preprocessing function.
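
For illustration, here is a minimal sketch of this kind of augmentation with Keras’s ImageDataGenerator; the rotation, shift, zoom, and flip settings below are illustrative choices, not values used in any particular experiment.

from keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, zooms and horizontal flips
datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)

# datagen.flow(X_train, y_train, batch_size=256) then yields augmented batches during training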


Fashion MNIST Dataset

Recently, Zalando Research published a new dataset, which is very similar to the well-known MNIST database of handwritten digits. The dataset is designed for machine learning classification tasks and contains a total of 60,000 training and 10,000 test grayscale images, each 28 × 28 pixels. Each training and test example is assigned one of ten labels (0–9). Up to this point, Zalando’s dataset is basically the same as the original handwritten digits data. However, instead of images of the digits 0–9, Zalando’s data contains (not surprisingly) images of 10 different fashion products. Consequently, the dataset is called Fashion-MNIST, and it can be downloaded from GitHub. The data is also featured on Kaggle. A few examples are shown in the following image, where each row contains one fashion item.

Multiple fashion items T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot

The 10 different class labels are:

  1. T-shirt/top
  2. Trouser
  3. Pullover
  4. Dress
  5. Coat
  6. Sandal
  7. Shirt
  8. Sneaker
  9. Bag
  10. Ankle boot
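
If you prefer not to download the CSV files by hand, a convenient alternative (assuming a recent Keras version that bundles the dataset) is to load Fashion-MNIST directly from keras.datasets; the integer labels 0–9 map to the class names in the order listed above.

from keras.datasets import fashion_mnist

# Downloads the data on first use, then loads 60,000 training and 10,000 test images
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
print(x_train.shape)              # (60000, 28, 28)
print(class_names[y_train[0]])    # class name of the first training image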

According to the authors, the Fashion-MNIST data is intended to be a direct drop-in replacement for the old MNIST handwritten digits data, since there were several issues with the handwritten digits. For example, it was possible to correctly distinguish between several digits by simply looking at a few pixels. Even with linear classifiers it was possible to achieve high classification accuracy. The Fashion-MNIST data promises to be more diverse, so that machine learning (ML) algorithms have to learn more advanced features in order to separate the individual classes reliably.

Embedding Visualization of Fashion MNIST

Embedding is a way to map discrete objects (images, words, etc.) to high-dimensional vectors. The individual dimensions in these vectors typically have no inherent meaning. Instead, it’s the overall patterns of location and distance between vectors that machine learning takes advantage of. Embeddings are therefore important inputs for machine learning, since classifiers, and neural networks more generally, work on vectors of real numbers. They train best on dense vectors, where all values contribute to defining an object.

TensorBoard has a built-in visualizer, called the Embedding Projector, for interactive visualization and analysis of high-dimensional data like embeddings. The embedding projector will read the embeddings from my model checkpoint file. Although it’s most useful for embeddings, it will load any 2D tensor, including my training weights.

Here, I’ll attempt to represent the high-dimensional Fashion MNIST data using TensorBoard. After reading the data and creating the test labels, I use this code to build TensorBoard’s Embedding Projector:

import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

logdir = 'fashionMNIST-logs'
os.makedirs(logdir, exist_ok=True)  # make sure the log directory exists

# Creating the embedding variable with all the images defined above under X_test
embedding_var = tf.Variable(X_test, name='fmnist_embedding')

# Format: tensorflow/contrib/tensorboard/plugins/projector/projector_config.proto
config = projector.ProjectorConfig()

# You can add multiple embeddings. Here I add only one.
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name

# Link this tensor to its metadata file (e.g. labels).
embedding.metadata_path = os.path.join(logdir, 'metadata.tsv')

# Use this logdir to create a summary writer
summary_writer = tf.summary.FileWriter(logdir)

# The next line writes a projector_config.pbtxt in the logdir. TensorBoard will read this file during startup.
projector.visualize_embeddings(summary_writer, config)

# Periodically save the model variables in a checkpoint in logdir.
with tf.Session() as sess:
    sess.run(embedding_var.initializer)
    saver = tf.train.Saver()
    saver.save(sess, os.path.join(logdir, 'model.ckpt'))
# Create the sprite image
rows = 28 
cols = 28
label = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

sprite_dim = int(np.sqrt(X_test.shape[0]))
sprite_image = np.ones((cols * sprite_dim, rows * sprite_dim))

index = 0 
labels = [] 
for i in range(sprite_dim): 
    for j in range(sprite_dim):
        # y_test holds the integer test labels created earlier
        labels.append(label[int(y_test[index])])

        sprite_image[
            i * cols: (i + 1) * cols,
            j * rows: (j + 1) * rows
        ] = X_test[index].reshape(28, 28) * -1 + 1

        index += 1

# Save the sprite image to disk so the projector can find it under sprite.png
plt.imsave(os.path.join(logdir, 'sprite.png'), sprite_image, cmap='gray')
# After constructing the sprite, I need to tell the Embedding Projector where to find it
embedding.sprite.image_path = os.path.join(logdir, 'sprite.png')
embedding.sprite.single_image_dim.extend([28, 28])

# Create the metadata (labels) file
with open(embedding.metadata_path, 'w') as meta:
    for index, label in enumerate(labels):
        meta.write('{}\t{}\n'.format(index, label))

Principal Component Analysis: A straightforward technique for reducing dimensions is Principal Component Analysis (PCA). The Embedding Projector computes the top 10 principal components. The menu lets me project those components onto any combination of two or three. PCA is a linear projection, often effective at examining global geometry.

Principal Component Analysis - examine global geometry
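
Outside of TensorBoard, a similar global view can be approximated with scikit-learn; this is only an illustrative sketch, not part of the Embedding Projector itself, and it assumes X_test holds the flattened test images used for the embedding above.

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
coords = pca.fit_transform(X_test)       # (10000, 3) coordinates to plot or inspect
print(pca.explained_variance_ratio_)     # share of variance captured by each component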

t-SNE: A popular non-linear dimensionality reduction technique is t-SNE. The Embedding Projector offers both two- and three-dimensional t-SNE views. Layout is performed client-side, animating every step of the algorithm. Because t-SNE often preserves some local structure, it is useful for exploring local neighborhoods and finding clusters.

t-SNE embedding projector using 2-D and 3-D
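
For comparison, here is a scikit-learn t-SNE sketch on a small subset of the test images (an illustrative assumption, since t-SNE is slow on all 10,000 points; the Projector animates this computation client-side instead).

from sklearn.manifold import TSNE

subset = X_test[:1000]                   # keep the run time manageable
tsne = TSNE(n_components=2, perplexity=30, random_state=13)
coords_2d = tsne.fit_transform(subset)   # (1000, 2) points preserving local neighborhoods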

Custom: I can also construct specialized linear projections based on text searches for finding meaningful directions in space. To define a projection axis, enter two search strings or regular expressions. The program computes the centroids of the sets of points whose labels match these searches, and uses the difference vector between the centroids as a projection axis.

You can view the full code for the visualization steps at this notebook: TensorBoard-Visualization.ipynb

Training CNN Models on Fashion MNIST

Let’s now move to the fun part: I will create a variety of different CNN-based classification models to evaluate their performance on Fashion MNIST. I will be building our model using the Keras framework. For more information on the framework, you can refer to the documentation here. Here is the list of models I will try out and compare:

  1. CNN with 1 Convolutional Layer
  2. CNN with 3 Convolutional Layers
  3. CNN with 4 Convolutional Layers
  4. VGG-19 Pre-Trained Model

For all the models (except for the pre-trained one), here is my approach:

  • Split the original training data (60,000 images) into 80% training (48,000 images) and 20% validation (12,000 images) to optimize the classifier, while keeping the test data (10,000 images) to finally evaluate the accuracy of the model on data it has never seen. This helps me see whether I’m over-fitting on the training data, whether I should lower the learning rate and train for more epochs if validation accuracy is higher than training accuracy, and whether I should stop over-training if training accuracy shifts higher than the validation accuracy.
  • Train the model for 10 epochs with a batch size of 256, compiled with the categorical_crossentropy loss function and the Adam optimizer.
  • Then, add data augmentation, which generates new training samples by rotating, shifting and zooming in on the training samples, and train the model on the augmented data for another 50 epochs.

Here’s the code to load and split the data:

# Import libraries
import numpy as np
import pandas as pd
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Load training and test data into dataframes
data_train = pd.read_csv('data/fashion-mnist_train.csv')
data_test = pd.read_csv('data/fashion-mnist_test.csv')

# X forms the training images, and y forms the training labels
X = np.array(data_train.iloc[:, 1:])
y = to_categorical(np.array(data_train.iloc[:, 0]))

# Here I split original training data to sub-training (80%) and validation data (20%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)

# X_test forms the test images, and y_test forms the test labels
X_test = np.array(data_test.iloc[:, 1:])
y_test = to_categorical(np.array(data_test.iloc[:, 0]))

After loading and splitting the data, I preprocess the images by reshaping them into the shape the network expects and scaling them so that all values are in the [0, 1] interval. Previously, the training images were stored in an array of type uint8 with values in the [0, 255] interval. I transform them into float32 arrays of shape (n, 28, 28, 1) with values between 0 and 1.

# Each image's dimension is 28 x 28
img_rows, img_cols = 28, 28
input_shape = (img_rows, img_cols, 1)

# Prepare the training images
X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
X_train = X_train.astype('float32')
X_train /= 255

# Prepare the test images
X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
X_test = X_test.astype('float32')
X_test /= 255

# Prepare the validation images
X_val = X_val.reshape(X_val.shape[0], img_rows, img_cols, 1)
X_val = X_val.astype('float32')
X_val /= 255

1 — 1-Conv CNN

Here’s the code for the CNN with 1 Convolutional Layer:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

cnn1 = Sequential()
cnn1.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
cnn1.add(MaxPooling2D(pool_size=(2, 2)))
cnn1.add(Dropout(0.2))  # dropout rate is an assumption; the original value is not shown

cnn1.add(Flatten())
cnn1.add(Dense(128, activation='relu'))
cnn1.add(Dense(10, activation='softmax'))

cnn1.compile(loss='categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])


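Before looking at the results, here is a minimal sketch of the training procedure described earlier (10 epochs at batch size 256, then 50 more epochs on augmented data); the augmentation ranges are assumptions, and the same pattern is reused for the other models.

from keras.preprocessing.image import ImageDataGenerator

# Phase 1: train on the original images for 10 epochs
history = cnn1.fit(X_train, y_train,
                   batch_size=256,
                   epochs=10,
                   validation_data=(X_val, y_val))

# Phase 2: another 50 epochs on augmented images (rotation/shift/zoom ranges are illustrative)
datagen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                             height_shift_range=0.1, zoom_range=0.1)
history_aug = cnn1.fit_generator(datagen.flow(X_train, y_train, batch_size=256),
                                 steps_per_epoch=len(X_train) // 256,
                                 epochs=50,
                                 validation_data=(X_val, y_val))

# Evaluate on the held-out test set
score = cnn1.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0], 'Test accuracy:', score[1])
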
After training the model, here’s the test loss and test accuracy:

After applying data augmentation, here’s the test loss and test accuracy:

For visualization purposes, I plot the training and validation accuracy and loss:

Training and validation accuracy
Training and validation loss

You can view the full code for this model at this notebook: CNN-1Conv.ipynb

2 — 3-Conv CNN

Here’s the code for the CNN with 3 Convolutional Layers:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

cnn3 = Sequential()
cnn3.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
cnn3.add(MaxPooling2D((2, 2)))
cnn3.add(Dropout(0.25))  # dropout rates are assumptions; the original values are not shown

cnn3.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
cnn3.add(MaxPooling2D(pool_size=(2, 2)))
cnn3.add(Dropout(0.25))

cnn3.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
cnn3.add(Dropout(0.4))

cnn3.add(Flatten())
cnn3.add(Dense(128, activation='relu'))
cnn3.add(Dropout(0.3))
cnn3.add(Dense(10, activation='softmax'))

cnn3.compile(loss='categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])


After training the model, here’s the test loss and test accuracy:

After applying data augmentation, here’s the test loss and test accuracy:

For visualization purposes, I plot the training and validation accuracy and loss:

You can view the full code for this model at this notebook: CNN-3Conv.ipynb

3 — 4-Conv CNN

Here’s the code for the CNN with 4 Convolutional Layers:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization

cnn4 = Sequential()
cnn4.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
cnn4.add(BatchNormalization())

cnn4.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
cnn4.add(BatchNormalization())
cnn4.add(MaxPooling2D(pool_size=(2, 2)))
cnn4.add(Dropout(0.25))  # dropout rates are assumptions; the original values are not shown

cnn4.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
cnn4.add(BatchNormalization())

cnn4.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
cnn4.add(BatchNormalization())
cnn4.add(MaxPooling2D(pool_size=(2, 2)))
cnn4.add(Dropout(0.25))

cnn4.add(Flatten())
cnn4.add(Dense(512, activation='relu'))
cnn4.add(Dropout(0.5))
cnn4.add(Dense(128, activation='relu'))
cnn4.add(Dropout(0.5))
cnn4.add(Dense(10, activation='softmax'))

cnn4.compile(loss='categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])


After training the model, here’s the test loss and test accuracy:

After applying data augmentation, here’s the test loss and test accuracy:

For visualization purposes, I plot the training and validation accuracy and loss:

You can view the full code for this model at this notebook: CNN-4Conv.ipynb

4 — Transfer Learning

A common and highly effective approach to deep learning on small image datasets is to use a pre-trained network. A pre-trained network is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pre-trained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.

I attempted to implement the VGG19 pre-trained model, which is a widely used ConvNet architecture for ImageNet. Here’s the code you can follow:

import numpy as np
import keras
from keras.applications import VGG19
from keras.applications.vgg19 import preprocess_input
from keras.layers import Dense, Dropout
from keras.models import Model
from keras import models
from keras import layers
from keras import optimizers

# Create the base model of VGG19
vgg19 = VGG19(weights='imagenet', include_top=False, input_shape = (150, 150, 3), classes = 10)

# Preprocessing the input
# Note: the 28 x 28 grayscale Fashion MNIST images must first be converted to
# 150 x 150 RGB arrays (e.g. by stacking the single channel three times and
# resizing) to match the VGG19 input_shape defined above.
X_train = preprocess_input(X_train)
X_val = preprocess_input(X_val)
X_test = preprocess_input(X_test)

# Extracting features
train_features = vgg19.predict(np.array(X_train), batch_size=256, verbose=1)
test_features = vgg19.predict(np.array(X_test), batch_size=256, verbose=1)
val_features = vgg19.predict(np.array(X_val), batch_size=256, verbose=1)

# Flatten extracted features
train_features = np.reshape(train_features, (48000, 4*4*512))
test_features = np.reshape(test_features, (10000, 4*4*512))
val_features = np.reshape(val_features, (12000, 4*4*512))

# Add Dense and Dropout layers on top of VGG19 pre-trained
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_dim=4 * 4 * 512))
model.add(layers.Dense(10, activation="softmax"))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train the classifier on the extracted features
# (epochs and batch size are assumptions; the original values are not shown)
history = model.fit(train_features, y_train,
                    epochs=10,
                    batch_size=256,
                    validation_data=(val_features, y_val))

After training the model, here’s the test loss and test accuracy:

For visualization purposes, I plot the training and validation accuracy and loss:

You can view the full code for this model at this notebook: VGG19-GPU.ipynb

Nanonets makes Transfer Learning easier

Having experienced the accuracy problem with transfer learning, I set out to solve it by building an easy-to-use cloud-based Deep Learning service that uses Transfer Learning. It contains a set of pre-trained models, each with millions of trained parameters. I can upload the Fashion MNIST data, and the service then selects the best model to use for the task. Finally, it creates a new NanoNet on top of the existing pre-trained model and fits the NanoNet to the data.

Because the NanoNets models are heavily pre-trained, I used a much smaller training dataset of only ~100 images per class. From this model I got 83.3% test accuracy. This is 7% more than the VGG19 model, in spite of using 1/60th of the data! The NanoNets model performs better because of its large amount of pre-training, optimal hyper-parameter selection, and data augmentation.

1. Using a GUI:

2. Using NanoNets API:

Below, we will give you a step-by-step guide to training your own model using the Nanonets API, in 9 simple steps.

Step 1: Clone the Repo

git clone
cd nanonets_fashion_mnist
sudo pip install requests

Step 2: Get your free API Key

Get your free API Key from

Step 3: Set the API key as an Environment Variable


Step 4: Create a New Model

python ./code/
Note: This generates a MODEL_ID that you need for the next step

Step 5: Add Model Id as Environment Variable


Step 6: Upload the Training Data

Collect the images of the objects you want to detect. Once you have the dataset ready in the images folder (image files), start uploading the dataset.

python ./code/

Step 7: Train Model

Once the images have been uploaded, begin training the model:

python ./code/

Step 8: Get Model State

The model takes ~30 minutes to train. You will get an email once the model is trained. In the meantime, you can check the state of the model:

watch -n 100 python ./code/

Step 9: Make Prediction

Once the model is trained, you can make predictions using the model:

python ./code/ PATH_TO_YOUR_IMAGE.jpg

There are 10 test images for each class in the folder “images/test” that you can use for prediction.