Crowd counting is an active area of research and has seen several developments since the advent of deep learning. In this post, we briefly review methods and techniques for dense and sparse crowd counting, which can be used in a wide range of settings such as industrial sites, hospitals, and crowded events.

What is Crowd Counting?

Crowd counting is the task of estimating the number of people in an image or a video. Consider the image below and make a wild guess at how many people are in it.

crowd watching game
Credits: unsplash

There are too many people crammed into this picture for our brain to count them accurately. One quick approach is to start at the bottom-left and count people one by one, but you are sure to lose track somewhere in the middle. The task is barely tractable for a human. A machine, however, can do it: given a suitable model, it returns a close estimate of the count.

Need help with counting people on your CCTV footage or analysing drone imagery? Check out what Nanonets can do for you here.

Why Crowd Counting?

Crowd counting has several use-cases in various industries. Some of them are:

  1. Counting crowds in community events in real time to get metrics on what performances, shows and gigs work, in what setting, etc.
  2. Counting crowds in forbidden areas in a manufacturing unit to enforce safety rules and minimize health risks.
  3. Managing high traffic roads and public spaces.
  4. Automating resource allotment by constantly monitoring consumer count.
  5. Counting attendance in educational institutions.
  6. Urban planning.
  7. Video surveillance.

Crowd Counting - Methods and Techniques

Several techniques have been proposed to answer the above question. Initially, researchers developed classical machine learning and computer vision algorithms, namely detection-based, regression-based, and density-based approaches, to predict crowd counts and density maps. These methods struggle with challenges such as variations in scale and perspective, occlusions, and non-uniform density. Later, when Convolutional Neural Networks (CNNs) proved their capability on various computer vision tasks, researchers shifted their attention to them and built crowd counting algorithms on top of CNN features.

Crowd counting tasks can broadly be divided into dense and sparse crowd counting.

Dense vs Sparse crowds - When many people are packed into one place, the crowd is termed dense; when people are spread out, it’s a sparse crowd. The methods and techniques we explore deal with both dense and sparse crowds. Sparse crowd counting is relatively easy compared to dense crowd counting, so algorithms have to work harder on dense crowds.

Literature Review

Now, let’s review a few research papers about how crowd counting techniques were implemented earlier, and how neural networks achieved state-of-the-art performance with advancements in deep learning.

We start with the traditional approaches that were used a few years back.

A useful starting point is the 2013 work of Chen Change Loy and his team at Queen Mary University of London. The goal of their research was crowd population profiling and density estimation in public spaces for global situation analysis. Below are the titles of two works that cover the early phases of research on crowd counting techniques.

  • A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation
  • Crowd Counting and Profiling: Methodology and Evaluation

To summarise, this research applied classic machine learning techniques such as regression and Support Vector Machines to images and videos, and described three approaches to crowd counting. We discuss them briefly in the following sections.

Counting by Detection

Counting by detection can be classified into three types, based on the features we use to identify the crowd in images and videos.

  • Monolithic detection: A classifier is trained on the full-body appearance of people in the training images, using typical features such as Haar wavelets and gradient-based features such as histograms of oriented gradients (HOG). Learners such as SVMs and random forests are applied in a sliding-window fashion. These detectors are limited to sparse crowds; for dense crowds, part-based detection is often more useful.
  • Part-based detection: Rather than taking the whole human body, this technique considers a part, say the head or shoulders, and applies a classifier to it. The head alone isn’t sufficient to reliably establish the presence of a person, so head + shoulder is the preferred combination.
  • Shape matching: Ellipses are used to model the boundaries of humans, and a stochastic process then estimates the number and shape configuration.

Below are the three images that correspond to crowd counting by detection; figure one, two, and three represent monolithic detection, part-based detection, and shape matching detection respectively.

Figures 1, 2, 3: Traditional crowd counting techniques (src)
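The sliding-window scheme used by monolithic detectors can be sketched as follows. This is a minimal illustration, not the implementation from any of the cited works: `is_person` is a hypothetical stand-in for a trained classifier (for example HOG features fed to an SVM), and the window size and stride are arbitrary.

```python
def sliding_windows(image_h, image_w, win_h, win_w, stride):
    """Yield (top, left, bottom, right) boxes that tile the image."""
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield (top, left, top + win_h, left + win_w)


def count_by_detection(image_h, image_w, win_h, win_w, stride, is_person):
    """Count the windows that the (hypothetical) classifier flags as a person."""
    boxes = sliding_windows(image_h, image_w, win_h, win_w, stride)
    return sum(1 for box in boxes if is_person(box))
```

With overlapping windows, a real detector would also need a step such as non-maximum suppression so that one person is not counted several times; that step is omitted here.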

Counting by Regression

Counting by detection is not very accurate when the crowd is dense and background clutter is high. Counting by regression addresses this by mapping features extracted from local image patches directly to the count, with no segmentation or tracking of individuals. One of the earliest attempts extracts low-level features such as edge details and foreground pixels, then fits a regression model that maps those features to the count. Let’s look at two papers to see how regression is used in different scenarios.

The first paper learns a regression model when only sparse and imbalanced data are available. A cumulative-attribute based regression model maps the features extracted from sparse and imbalanced images onto a cumulative attribute space.

The second paper targets cases where regression must be applied to several localised regions of an image. Rather than training a separate regression model for each region, a single multi-output regression model estimates the number of people in all regions jointly, i.e. the model learns the functional mapping between interdependent low-level features and multi-dimensional structured outputs.
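The basic idea of counting by regression, mapping a low-level feature to a count, can be sketched with a one-feature least-squares fit. This is a simplified illustration rather than either paper’s method: it uses a single feature (foreground pixel count), and the training pairs are made-up numbers.

```python
def foreground_pixels(mask):
    """Low-level feature: number of foreground (1) pixels in a binary mask."""
    return sum(sum(row) for row in mask)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: foreground pixel counts and annotated people counts.
train_features = [100, 200, 300, 400]
train_counts = [2, 4, 6, 8]
slope, intercept = fit_line(train_features, train_counts)


def predict_count(feature):
    """Estimate the number of people from the feature value."""
    return slope * feature + intercept
```

In practice several features (edges, texture, foreground area) are combined, and regularised models such as ridge regression are common choices.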

Counting by Estimating the Density

Most of the previous approaches ignore the spatial information present in the images. Density-based approaches instead learn a mapping between local features and object density maps, thereby incorporating spatial information. This avoids detecting each individual separately; the model estimates the density of a group of individuals at a time. The mapping can be linear or nonlinear. Let’s see how a nonlinear mapping can be learned with a random forest.

  • COUNT Forest: CO-voting Uncertain Number of Targets using Random Forest for Crowd Density Estimation

A random forest regressor votes for the densities of multiple target objects, learning a nonlinear mapping between the patch features and the relative locations of all objects inside the patch. A crowdedness prior is defined to handle the differences between crowded and uncrowded image patches, which yields two different forests, one for each value of the prior.
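To make the notion of a density map concrete, here is a minimal sketch of how a ground-truth map is commonly built from head annotations: each annotated point contributes a Gaussian that sums to one, so summing the whole map recovers the count. The fixed `sigma` and the per-image normalisation are assumptions of this sketch, not details taken from the paper above.

```python
import math


def density_map(points, height, width, sigma=1.5):
    """Build a density map with one unit-mass Gaussian per (row, col) annotation."""
    density = [[0.0] * width for _ in range(height)]
    for r0, c0 in points:
        kernel = [
            [math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, kernel))  # normalise so each person adds exactly 1
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / total
    return density


def count_from_density(density):
    """The estimated count is the sum (integral) of the density map."""
    return sum(map(sum, density))
```

A model trained to predict such maps gives both a count (the sum) and a rough idea of where in the image the people are.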

Deep Learning for Crowd Counting

Traditional approaches aside, Convolutional Neural Network (CNN) based techniques are now used to achieve better accuracy than the conventional methods. Many CNN architectures have been designed to estimate crowd density. Let’s group them for clarity.

  • Basic CNNs: The initial deep learning approaches, built from basic convolutional layers, kernels, and pooling layers.
  • Scale-aware models: More robust CNNs that use multi-column or multi-resolution architectures.
  • Context-aware models: Both local and global contextual information is incorporated into the CNN.
  • Multi-task frameworks: Besides crowd counting, the network also handles other tasks such as crowd-velocity estimation and foreground-background subtraction.

For now, let’s understand a few popular ones.

This is one of the first approaches to use a CNN regression model. AlexNet is taken as the base network, and its final layer of 4096 neurons is replaced with a single neuron that estimates the count. In addition, the training data is augmented with negative samples whose ground-truth count is zero. Below is a simple five-layer convolutional architecture that was first used to count the crowd in a given image.

CNN regression model
Image Source

A new dataset of 1198 images with 330,000 annotations is used to train the model. A Multi-Column CNN architecture maps the image to its crowd density map. Each column uses filters with a different receptive field size, so the features learned by each column adapt to variations in people/head size caused by perspective or image resolution. The ground-truth density map is computed using geometry-adaptive kernels.
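The idea behind geometry-adaptive kernels is that heads in crowded regions appear smaller, so the Gaussian placed on each annotation should be narrower where neighbours are close. A minimal sketch, assuming the spread is set proportional to the mean distance to the `k` nearest annotated heads; the values `k=3` and `beta=0.3`, and the fallback for a lone head, are assumptions of this sketch.

```python
import math


def adaptive_sigmas(points, k=3, beta=0.3, fallback=1.0):
    """Per-head Gaussian spread: beta * mean distance to the k nearest heads."""
    sigmas = []
    for i, (r, c) in enumerate(points):
        dists = sorted(
            math.hypot(r - r2, c - c2)
            for j, (r2, c2) in enumerate(points)
            if j != i
        )
        nearest = dists[:k]
        if nearest:
            sigmas.append(beta * sum(nearest) / len(nearest))
        else:
            sigmas.append(fallback)  # a single head has no neighbours
    return sigmas
```

Each sigma would then replace the fixed spread when the Gaussian for that head is added to the density map.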

  • CrowdNet: A Deep Convolutional Network for Dense Crowd Counting

CrowdNet combines a deep and a shallow fully convolutional network, which lets it capture both high-level and low-level features. The dataset is augmented so that the model learns scale-invariant representations. The deep network is similar to the well-known VGG-16 network. It captures the high-level semantics needed for crowd counting and returns density maps as shown in the image below.

Density Maps for Crowd Counting (src)

The authors add a shallow network to recognise the low-level head-blob patterns of people far from the camera. It has 3 convolutional layers. The VGG network, by comparison, has 5 max-pool layers, each with a stride of 2, so its output features have a spatial resolution of only 1/32 of the input image.

crowd counting network
Image Source
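The 1/32 figure follows from halving the resolution five times (2^5 = 32). A small sketch of that arithmetic, assuming each stride-2 pooling layer halves the spatial size with floor division:

```python
def output_size(input_size, strides):
    """Spatial size after a sequence of pooling layers with the given strides."""
    size = input_size
    for stride in strides:
        size //= stride
    return size


vgg_pool_strides = [2, 2, 2, 2, 2]
# A 224-pixel side shrinks as 224 -> 112 -> 56 -> 28 -> 14 -> 7, i.e. by a factor of 32.
side = output_size(224, vgg_pool_strides)
```

Such coarse outputs are why density-map networks often remove or relax some of the pooling layers.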

This paper contributes two neural networks: first, a Counting CNN (CCNN), a regression model that learns to map the appearance of image patches to their corresponding density maps; and second, a Hydra CNN, a scale-aware counting model that uses image patches extracted at multiple scales to estimate the final density. Below are the architectures of the two networks.

CCNN representation
Hydra CNN Architecture Image Source

Switching CNN takes crowd density variations into account to improve the accuracy and localisation of the predicted crowd count. It relays patches from a grid within a crowd scene to independent CNN regressors, with the choice made by a switch classifier. A given regressor is trained on a patch only if it performs best on that patch among all regressors. The switch classifier is trained alternately with the CNN regressors so that it learns to relay each patch to the right regressor.

Architecture of Switching CNN (src)
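The routing rule described above can be sketched as a labelling step: for each training patch, find the regressor whose count estimate is closest to the ground truth, and use its index as the target label for the switch classifier. The function names and the absolute-error criterion are assumptions of this sketch, and the regressors themselves are represented only by their predicted counts.

```python
def best_regressor(predicted_counts, true_count):
    """Index of the regressor whose count estimate is closest to the truth."""
    errors = [abs(p - true_count) for p in predicted_counts]
    return errors.index(min(errors))


def switch_labels(patch_predictions, true_counts):
    """One label per patch: which regressor the switch should relay it to."""
    return [
        best_regressor(preds, truth)
        for preds, truth in zip(patch_predictions, true_counts)
    ]
```

During training, each patch would then update only its best regressor, and the switch classifier would be fitted to predict these labels from the patch itself.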

This is a mostly unsupervised technique that learns the bulk of its filters from unannotated data, since procuring annotated data is often expensive. The paper proposes an architecture named “Grid Winner-Take-All (GWTA) autoencoder” to learn these filters. It divides a convolutional layer into grids of neurons, and within each grid cell only the most strongly activated neuron is allowed to update the filter. The GWTA autoencoder can therefore leverage the diversity of features, allowing scalable and efficient training with diverse crowd data. The architecture is a bit more complex than the networks reviewed above. Its last two layers are convolutional layers trained with supervision to regress the density map, by backpropagating the L2 loss between the predicted and ground-truth maps.

Architecture of GWTA (src)
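The winner-take-all step can be sketched as a masking operation on a single feature map: the map is split into square cells, and in each cell only the largest activation survives, so only that neuron would receive a gradient. The cell size and the tie-breaking rule here are assumptions of this sketch, not details from the paper.

```python
def grid_winner_take_all(feature_map, cell):
    """Keep the maximum activation in each cell x cell block and zero the rest."""
    height, width = len(feature_map), len(feature_map[0])
    out = [[0.0] * width for _ in range(height)]
    for r0 in range(0, height, cell):
        for c0 in range(0, width, cell):
            block = [
                (feature_map[r][c], r, c)
                for r in range(r0, min(r0 + cell, height))
                for c in range(c0, min(c0 + cell, width))
            ]
            value, r, c = max(block)  # ties go to the largest (row, col)
            out[r][c] = value
    return out
```

Applying the mask per cell, rather than over the whole map, keeps winners spread across the image, which suits crowd scenes where people appear everywhere.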

This approach is designed for video. A video is a sequence of frames, so the temporal information between frames should be taken into account when estimating the crowd. To do this, the paper introduces a “Temporal-Channel-Aware” (TCA) block, which uses 3D kernels to capture temporal features. These TCA blocks are stacked to form a 3D convolutional neural network.

3D CNN (src)
TCA Block (src)
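To show what a 3D kernel adds over a 2D one, here is a minimal sketch of a valid-mode 3D convolution (implemented as cross-correlation, as in most deep learning libraries) over a stack of frames. It illustrates only the 3D kernel itself, not the TCA block; the nested-list layout (time, height, width) is an assumption of this sketch.

```python
def conv3d_valid(video, kernel):
    """Slide a (kt, kh, kw) kernel over a (T, H, W) video without padding."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# A kernel spanning two frames: responds to change between consecutive frames.
temporal_diff = [[[-1.0]], [[1.0]]]
```

Because the kernel extends along the time axis, its output depends on how the scene changes between frames, which a per-frame 2D convolution cannot see.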

There are a lot more approaches available that go beyond CNN. A few are mentioned below.

Crowd Counting with Nanonets

The Nanonets API allows you to build object detection models with ease. You can upload your data, annotate it, train the model, and get predictions through a browser-based UI without writing a single line of code, worrying about GPUs, or finding the right architecture for your deep learning models.

Nanonets also provides a ready-to-use model for pedestrian detection in aerial images, which you can use out of the box without gathering training data or spending time building and training models. To use this model, simply visit the website mentioned below and look for 'Pedestrian Detection in Aerial Images' in the ready-to-use models.

using the GUI:

To learn more about people counting with Nanonets you can check out this case study.

Check out this Github repository for pedestrian detection which will help you build a model yourself using the Nanonets API.

Using Nanonets API

Below, we will give you a step-by-step guide to training your own model using the Nanonets API, in 9 simple steps.

Step 1: Clone the Repo

git clone
cd nanonets-pedestrian-detection
sudo pip install requests tqdm

Step 2: Get your free API Key

Get your free API Key from

Step 3: Set the API key as an Environment Variable


Step 4: Create a New Model

python ./code/

Note: This generates a MODEL_ID that you need for the next step

Step 5: Add Model Id as Environment Variable


Step 6: Upload the Training Data

Collect images of the objects you want to detect. Once you have the dataset ready in the folder images (image files), start uploading it.

python ./code/

Step 7: Train Model

Once the Images have been uploaded, begin training the Model

python ./code/

Step 8: Get Model State

The model takes ~30 minutes to train. You will get an email once the model is trained. In the meantime, you can check the state of the model:

watch -n 100 python ./code/

Step 9: Make Prediction

Once the model is trained, you can make predictions with it:

python ./code/ PATH_TO_YOUR_IMAGE.jpg


In this post, we’ve seen a range of crowd counting techniques, from detection methods that draw boundaries around humans and regression on low-level features to CNNs trained on large datasets of crowd images. There is still plenty of room to improve these techniques, whether by tuning parameters or by implementing new models from scratch. Solving this problem helps automate tedious manual counting and makes it easier to manage resources efficiently.