Convolutional Neural Networks (CNNs) are well known for their ability to capture spatial and positional features. 2D convolutional networks are widely used in computer vision tasks. A great deal of research has been done, and is still ongoing, with 2D CNNs, and models from the famous ImageNet challenge have even surpassed human-level accuracy!
Research teams have introduced several network architectures for image classification and related computer vision tasks. LeNet (1998), AlexNet (2012), VGGNet (2014), GoogLeNet (2014) and ResNet (2015) are some of the famous CNN architectures in use now. (I’ve discussed using pre-trained models to perform transfer learning with these architectures here. Take a look. 🙂 )

That was all about 2D images. What about videos? 3D convolutions, which apply a 3D kernel to the data and move it in three directions (x, y and z) to compute feature representations, are helpful in video event detection and related tasks.
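To make the idea concrete, here is a minimal PyTorch sketch (the channel counts and clip size here are just illustrative) showing a single 3D convolution sliding over a 16-frame clip:

import torch
import torch.nn as nn

# a single 3D convolution: the kernel slides over time (frames), height and width
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(3, 3, 3), padding=1)
clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
out = conv(clip)
print(out.shape)  # torch.Size([1, 8, 16, 112, 112]) -- padding preserves all three dimensions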
As with 2D CNN architectures, researchers have introduced CNN architectures built with 3D convolutional layers. They perform well in video classification and event detection tasks. Some of these architectures have been adapted from the prevailing 2D CNN models by introducing 3D layers into them.

A 3D convolution operation
Tran et al. from Facebook AI Research introduced the C3D model to learn spatiotemporal features in videos using 3D convolutional networks. This is the paper: “Learning Spatiotemporal Features with 3D Convolutional Networks”. In the original paper they used Dropout to regularize the network.
Instead of dropout, I tried using Batch Normalization to regularize the network. Each convolutional layer is followed by a 3D batch normalization layer. With batch normalization, you can use slightly larger learning rates to train the network, and it allows each layer to learn a little more independently from the other layers.
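As a minimal sketch (not part of the model definition below), the repeated building block is just a 3D convolution followed by 3D batch normalization and a ReLU:

import torch.nn as nn

# conv -> batch norm -> ReLU: the building block repeated throughout the network
block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
    nn.BatchNorm3d(64),  # normalizes activations over the 64 output channels
    nn.ReLU(),
)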
This is just the PyTorch port of the network. I use this network for video classification tasks in which each video has 16 RGB frames of size 112×112 pixels. So the input tensor has shape (batch_size, 3, 16, 112, 112). You can select the batch size according to the computation capacity you have.
import torch.nn as nn


class C3D_BN(nn.Module):
    """
    The C3D network as described in [1],
    with Batch Normalization as described in [2].
    """

    def __init__(self):
        super(C3D_BN, self).__init__()
        # All convolutions use 3x3x3 kernels with padding 1, so they preserve
        # the temporal and spatial dimensions; the pooling layers downsample.
        self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv1_bn = nn.BatchNorm3d(64)
        # The first pool keeps the full 16-frame temporal resolution.
        self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

        self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv2_bn = nn.BatchNorm3d(128)
        self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv3a_bn = nn.BatchNorm3d(256)
        self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv3b_bn = nn.BatchNorm3d(256)
        self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv4a_bn = nn.BatchNorm3d(512)
        self.conv4b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv4b_bn = nn.BatchNorm3d(512)
        self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv5a_bn = nn.BatchNorm3d(512)
        self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv5b_bn = nn.BatchNorm3d(512)
        # The spatial padding here keeps the output at 512 x 1 x 4 x 4 = 8192 features.
        self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))

        self.fc6 = nn.Linear(8192, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, 8)  # 8 output classes for my task; change as needed

        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch_size, 3, 16, 112, 112)
        h = self.relu(self.conv1_bn(self.conv1(x)))
        h = self.pool1(h)

        h = self.relu(self.conv2_bn(self.conv2(h)))
        h = self.pool2(h)

        h = self.relu(self.conv3a_bn(self.conv3a(h)))
        h = self.relu(self.conv3b_bn(self.conv3b(h)))
        h = self.pool3(h)

        h = self.relu(self.conv4a_bn(self.conv4a(h)))
        h = self.relu(self.conv4b_bn(self.conv4b(h)))
        h = self.pool4(h)

        h = self.relu(self.conv5a_bn(self.conv5a(h)))
        h = self.relu(self.conv5b_bn(self.conv5b(h)))
        h = self.pool5(h)

        h = h.view(-1, 8192)  # flatten to (batch_size, 8192)
        h = self.relu(self.fc6(h))
        h = self.relu(self.fc7(h))
        h = self.fc8(h)
        return h


"""
References
----------
[1] Tran, Du, et al. "Learning spatiotemporal features with 3D convolutional
    networks." Proceedings of the IEEE International Conference on Computer
    Vision. 2015.
[2] Ioffe, Sergey, and Christian Szegedy. "Batch Normalization: Accelerating
    deep network training by reducing internal covariate shift."
    arXiv:1502.03167v2 [cs.LG], 13 Feb 2015.
"""
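For a quick sanity check, you can run a random batch through the network (the batch size of 4 here is just an example):

import torch

model = C3D_BN()
clip = torch.randn(4, 3, 16, 112, 112)  # a batch of 4 clips, 16 RGB frames of 112x112 each
logits = model(clip)
print(logits.shape)  # torch.Size([4, 8]) -- one score per class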
Let the 3D Convo power be with you! Happy coding! 🙂