DeepLab v3: Semantic Segmentation (2017)
Hands-on PyTorch Tutorial
DeepLab v3 is a semantic segmentation model that supports ResNet-50, ResNet-101, and MobileNet-V3 backbones. This hands-on article explains how to use DeepLab v3 with PyTorch.
1 A Quick Introduction to Semantic Segmentation
Semantic segmentation divides an image into semantically different parts, such as roads, cars, buildings, the sky, etc. Below is an example of an image from the PASCAL-Context Dataset and its semantic segmentation ground truth.
In other words, a semantic segmentation model classifies each pixel in an image. It does not distinguish between different instances of the same class: person A and person B both belong to the person class, with no further distinction between them, so we cannot tell individual people apart when they overlap.
In addition to semantic segmentation, there is a task called instance segmentation, which not only classifies but can also distinguish each instance of objects belonging to the same class. For example, in the image above, an instance segmentation model would distinguish each person as a separate object (instance). However, DeepLab v3 is a semantic segmentation model and handles only class-level classifications.
2 Python Environment Setup
Here, I’m using Python’s built-in venv module to set up a Python environment. If you prefer other tools like Conda or Miniconda, they work just as well as long as the same dependencies are available.
# Create a project folder and move there
mkdir deeplab-v3
cd deeplab-v3
# Create and activate a Python environment using venv
python3 -m venv venv
source venv/bin/activate
# Always upgrade pip first; the bundled version is often outdated
# and may have stale information about libraries
pip install --upgrade pip
# Also, install the dependencies
pip install torch torchvision
At the time of writing this article, I installed the following versions.
torch==1.12.1
torchvision==0.13.1
You can put the above lines into requirements.txt and run pip install -r requirements.txt to install the dependencies. This is handy when you experiment on multiple computers and need the same dependencies on all of them.
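For example, a minimal requirements.txt pinning the versions above would look like this:

# requirements.txt
torch==1.12.1
torchvision==0.13.1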
Next, let’s download a test image.
3 Image Download
We’ll download a dog image from the PyTorch hub site as follows:
from urllib import request

url = "https://github.com/pytorch/hub/raw/master/images/dog.jpg"
filename = "dog.jpg"
request.urlretrieve(url, filename)
It saves the downloaded image as “dog.jpg”. We can programmatically open the image as follows:
from PIL import Image

input_image = Image.open(filename)

# in case alpha channel exists
input_image = input_image.convert("RGB")
input_image.show()
Converting the image to RGB eliminates the alpha (transparency) channel in case such a channel exists.
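As a quick sanity check (purely illustrative, not required for the tutorial), PIL can report the image’s color mode and size:

print(input_image.mode)  # 'RGB' after the conversion above
print(input_image.size)  # (width, height), e.g. (1546, 1213) for this image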
4 DeepLab v3 Model Download
Let’s download the DeepLab v3 model with the ResNet-101 backbone. The code below is based on the PyTorch Hub sample code.
import torch
from torchvision.models.segmentation import DeepLabV3_ResNet101_Weights

model = torch.hub.load(
    'pytorch/vision:v0.10.0',
    'deeplabv3_resnet101',
    weights=DeepLabV3_ResNet101_Weights.DEFAULT)
model.eval()  # evaluation mode
When downloading the model for the first time, PyTorch saves it to ~/.cache/torch/hub/checkpoints/. The next time we call torch.hub.load, it loads the model from the local cache, which is much faster. DeepLabV3_ResNet101_Weights.DEFAULT means we are downloading the most up-to-date weights for this model.
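If you want to confirm where the cache lives on your machine, torch.hub exposes the directory; a quick illustrative check:

print(torch.hub.get_dir())  # typically ~/.cache/torch/hub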
Note: DeepLab v3 ResNet-101 is a reasonably big model that runs slowly on a CPU. If it’s too slow, consider a smaller backbone like ResNet-50.
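For example, the ResNet-50 variant loads the same way; a minimal sketch, assuming torchvision 0.13’s DeepLabV3_ResNet50_Weights enum:

from torchvision.models.segmentation import DeepLabV3_ResNet50_Weights

# Same loading pattern, smaller backbone
model = torch.hub.load(
    'pytorch/vision:v0.10.0',
    'deeplabv3_resnet50',
    weights=DeepLabV3_ResNet50_Weights.DEFAULT)
model.eval()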
5 Preprocessing for DeepLab v3
Let’s create a transform that preprocesses the image. Since we use the ResNet backbone (trained on ImageNet images), the normalization values are the same as for the original ResNet image classification model: we subtract the per-channel mean of the ImageNet images and divide by the per-channel standard deviation.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

input_tensor = preprocess(input_image)
print('input_tensor', input_tensor.shape)
The shape of the input tensor is torch.Size([3, 1213, 1546]), which means the input image has three channels (RGB) and a size of 1213 x 1546 pixels.
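To make the normalization concrete, here is what Normalize computes per channel (an illustrative sketch, separate from the pipeline):

# Illustrative only: reproduce Normalize by hand and compare
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
unnormalized = transforms.ToTensor()(input_image)  # pixels scaled to [0, 1]
manual = (unnormalized - mean) / std  # channel-wise (x - mean) / std
print(torch.allclose(manual, input_tensor))  # True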
Next, we create a mini-batch using the preprocessed image. Since we have one image only, the batch size is 1. So, we add a new dimension to the input image tensor by calling the unsqueeze function.
input_batch = input_tensor.unsqueeze(0)
print('input_batch', input_batch.shape)
Now, the shape of the input batch is torch.Size([1, 3, 1213, 1546]), which means a batch of size 1, three color channels (RGB), and an image size of 1213 x 1546.
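As a side note, indexing with None adds the same leading dimension, so this illustrative check holds:

print(torch.equal(input_batch, input_tensor[None]))  # True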
We are ready to perform inference with DeepLab v3 ResNet-101.
6 Inference with DeepLab v3 ResNet-101
If you have a computer with NVIDIA GPU (CUDA) or Apple’s Metal Performance Shaders (MPS), you should move the batch and the model to the device for faster processing. The following code should work for any device: CUDA, MPS, or CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

input_batch = input_batch.to(device)
model.to(device)
We don’t need the above code when experimenting on a CPU-only computer. However, such a code block is handy when switching between computers with different devices.
Let’s run the model for inference:
with torch.no_grad():
    output = model(input_batch)
We use with torch.no_grad(): because gradients are unnecessary for inference. It uses fewer resources (and can run a little faster).
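As an aside, newer PyTorch versions (1.9+) also provide torch.inference_mode(), a stricter and sometimes slightly faster variant of no_grad(); a minimal sketch:

# Alternative to no_grad() for pure inference
with torch.inference_mode():
    output = model(input_batch)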
7 Understanding the Inference Output
The output is a dictionary with keys “out” and “aux”. We can ignore “aux” since it’s for the loss value calculation during training. The “out” value contains a batch of output images. Since the batch size is 1, we take the first item from it.
out = output['out'][0]
print('out', out.shape)
The shape of “out” is torch.Size([21, 1213, 1546]), which means the model predicts 21 class scores for every pixel of the 1213 x 1546 image. The 21 values per pixel are unnormalized scores (logits): the bigger the value, the higher the probability of that class. We can take the class with the maximum value per pixel to get the most probable prediction.
prediction = out.argmax(0)
print(prediction.shape)
The shape of the prediction is torch.Size([1213, 1546]), which means it selected one class (the one with the biggest value) per pixel.
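If you want actual per-pixel probabilities rather than just the winning class, you can apply a softmax over the class dimension; a short illustrative sketch:

probabilities = torch.softmax(out, dim=0)  # shape [21, 1213, 1546]; sums to 1 per pixel
confidence = probabilities.max(dim=0).values  # confidence of the winning class per pixel
print(confidence.shape)  # torch.Size([1213, 1546])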
The class definition is from the Pascal VOC 2012 semantic segmentation dataset.
0,background
1,aeroplane
2,bicycle
3,bird
4,boat
5,bottle
6,bus
7,car
8,cat
9,chair
10,cow
11,diningtable
12,dog
13,horse
14,motorbike
15,person
16,pottedplant
17,sheep
18,sofa
19,train
20,tvmonitor
255,ignore label (excluded from the ground truth for training/evaluation)
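For convenience, we can transcribe this list into a Python list to translate predicted indices into names (the ignore label 255 never appears in model predictions, so it is omitted):

# Pascal VOC 2012 class names, indexed by class id 0-20
VOC_CLASSES = [
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor',
]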
8 Visualizing the Inference Output
Following the sample code in PyTorch Hub, we color the predicted classes in the image as a semantic segmentation result.
# Assign a color for each of 21 classes
palette = torch.tensor([2 ** 25 - 1, 2 ** 15 - 1, 2 ** 21 - 1])
colors = torch.as_tensor([i for i in range(21)])[:, None] * palette
colors = (colors % 255).numpy().astype("uint8")

# Convert the prediction to NumPy and show the result with PIL Image
prediction_numpy = prediction.byte().cpu().numpy()
r = Image.fromarray(prediction_numpy)
r.putpalette(colors)
r.show()
It seems to classify the dog well, except for some parts of the tail. Let’s print the unique values of the predicted classes.
print(torch.unique(prediction, return_counts=True))
The output gives the following:
(tensor([0, 8, 12]), tensor([1241642, 4433, 629223]))
The first element indicates the classes “background”, “cat”, and “dog”, and the second element shows the pixel count for each class. So, most pixels are “background” (black), the second most are “dog” (green), and the rest are “cat” (cyan). The model thought some parts of the tail were “cat”, possibly due to the fluffy white hairs.
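Using the VOC_CLASSES list from earlier, we can print these counts with readable names; an illustrative snippet:

values, counts = torch.unique(prediction, return_counts=True)
for value, count in zip(values.tolist(), counts.tolist()):
    print(VOC_CLASSES[value], count)  # e.g. background 1241642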
I hope it’s clear that semantic segmentation classifies each pixel, unlike image classification and object detection.
9 Source Code
import torch
from urllib import request
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import DeepLabV3_ResNet101_Weights

# Download the dog image from PyTorch hub
url = "https://github.com/pytorch/hub/raw/master/images/dog.jpg"
filename = "dog.jpg"
request.urlretrieve(url, filename)

# Open the image (convert to RGB in case an alpha channel exists)
input_image = Image.open(filename)
input_image = input_image.convert("RGB")
input_image.show()

# Download the model
model = torch.hub.load(
    'pytorch/vision:v0.10.0',
    'deeplabv3_resnet101',
    weights=DeepLabV3_ResNet101_Weights.DEFAULT)
model.eval()  # evaluation mode

# Preprocess the image
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
input_tensor = preprocess(input_image)
print('input_tensor', input_tensor.shape)

# Create a batch (of size 1)
input_batch = input_tensor.unsqueeze(0)
print('input_batch', input_batch.shape)

# Choose a device for faster processing
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
print('device', device)

input_batch = input_batch.to(device)
model.to(device)

# Run inference
with torch.no_grad():
    output = model(input_batch)

# Output shape
out = output['out'][0]
print('out', out.shape)

# Predicted class per pixel
prediction = out.argmax(0)

# Unique predicted class values
print(torch.unique(prediction, return_counts=True))

# Assign a color for each of 21 classes
palette = torch.tensor([2 ** 25 - 1, 2 ** 15 - 1, 2 ** 21 - 1])
colors = torch.as_tensor([i for i in range(21)])[:, None] * palette
colors = (colors % 255).numpy().astype("uint8")

# Convert the prediction to NumPy and show the result with PIL Image
prediction_numpy = prediction.byte().cpu().numpy()
r = Image.fromarray(prediction_numpy)
r.putpalette(colors)
r.show()
10 References
- Rethinking Atrous Convolution for Semantic Image Segmentation. Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam. arXiv:1706.05587, 2017.