[Deep Learning] Convolutional Neural Network (CNN)

2023. 11. 7.

What is a Convolution? 

: a standard operation, long used in compression, signal processing, computer vision, and image processing 


Convolution = Filtering = Feature Extraction 




Main differences from the MLP

1) Local Connection

: Local connections can capture local patterns better than fully-connected models 

-> search for all the local patterns by sliding the same kernel

-> the kernel has the chance to react to the same pattern at different positions
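A minimal sketch of the sliding-kernel idea, in plain Python. The `conv1d` helper and the edge-detecting kernel are made up for illustration, and the operation is the unflipped (cross-correlation) form that deep learning libraries usually call convolution. It shows one kernel responding to the same pattern at two different positions.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding, stride 1) and return the responses."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical kernel: responds with 1 where the signal rises from 0 to 1.
kernel = [-1.0, 1.0]

# The same rising edge placed at two different positions.
early = [0, 1, 1, 1, 1, 1]
late = [0, 0, 0, 0, 1, 1]

early_response = conv1d(early, kernel)  # peak at index 0
late_response = conv1d(late, kernel)    # peak at index 3

print(early_response)
print(late_response)
```

The peak of the response simply moves with the pattern, which is what "reacting to patterns in different positions" means here.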


2) Weight Sharing 

: unlike MLPs, which employ different weights for different neurons, a convolution reuses the same kernel weights at every position
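To get a feel for what weight sharing saves, here is a rough count under assumed sizes (input length 1000, kernel size 5, single channel, no biases). The function names are hypothetical helpers, not a library API.

```python
def dense_weight_count(n_inputs, n_outputs):
    """Fully-connected layer: every output neuron has its own weight per input."""
    return n_inputs * n_outputs

def conv1d_weight_count(kernel_size):
    """Single-channel 1D convolution: one kernel reused at every position."""
    return kernel_size

n_inputs = 1000                          # assumed input length
kernel_size = 5                          # assumed kernel size
n_outputs = n_inputs - kernel_size + 1   # kernel positions (no padding, stride 1)

dense = dense_weight_count(n_inputs, n_outputs)
shared = conv1d_weight_count(kernel_size)

print(dense, shared)
```

Both layers produce the same number of outputs, but the convolution does it with `kernel_size` weights instead of one weight per input-output pair.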




Kernel Size

: the length of the convolving filter (kernel size = k)

- the weights of the filter w are learned from data



Input Channels

- In general, the convolution takes a multi-channel input, applies a set of filters to it, and returns a multi-channel output

- The convolutional framework is flexible enough to manage multi-channel inputs as well.

- Each filter that processes a multi-channel input spans all the input channels, so its dimensionality is:

filter size = kernel_size x input channels
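A small sketch of one filter applied to a multi-channel 1D input. The layout is an assumption for illustration: the input is `[in_channels][length]` and the filter is `[in_channels][kernel_size]`. Each output value sums over every channel and every kernel tap.

```python
def conv1d_multichannel(x, w):
    """x: [in_channels][length], w: [in_channels][kernel_size].
    Returns one output value per kernel position (no padding, stride 1)."""
    in_channels = len(x)
    length = len(x[0])
    k = len(w[0])
    out = []
    for t in range(length - k + 1):
        acc = 0.0
        for c in range(in_channels):      # sum over input channels
            for j in range(k):            # sum over kernel taps
                acc += w[c][j] * x[c][t + j]
        out.append(acc)
    return out

x = [
    [1.0, 2.0, 3.0, 4.0],   # channel 0
    [0.0, 1.0, 0.0, 1.0],   # channel 1
]
w = [
    [1.0, 0.0],             # filter weights for channel 0
    [0.0, 2.0],             # filter weights for channel 1
]

y = conv1d_multichannel(x, w)
n_weights = len(w) * len(w[0])   # input channels x kernel_size

print(y, n_weights)
```

Note that a single filter still produces a single output channel, however many input channels it reads.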



Output Channels 

: the number of different feature maps generated through the convolution operation.


- Each output channel is responsible for detecting and storing various features.

- In CNN, we want each filter to react to different local patterns 

- All the outputs produced by the convolution are gathered in a single matrix with dimensionality output length x output channels

- To capture multiple patterns, we have to process the input with many different filters


-> Multiple output channels can be used by the model to learn more complex patterns.
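As a sketch of multiple output channels: each filter below yields its own feature map, and the maps are stacked into one matrix. The filter values are hand-picked for illustration (in a real CNN they are learned), and the `[out_channels][in_channels][kernel_size]` layout is an assumption.

```python
def conv1d_layer(x, filters):
    """x: [in_channels][length]; filters: [out_channels][in_channels][kernel_size].
    Returns feature maps as [out_channels][length - kernel_size + 1]."""
    length = len(x[0])
    k = len(filters[0][0])
    feature_maps = []
    for w in filters:  # one filter per output channel
        row = [
            sum(w[c][j] * x[c][t + j] for c in range(len(x)) for j in range(k))
            for t in range(length - k + 1)
        ]
        feature_maps.append(row)
    return feature_maps

x = [[0.0, 0.0, 1.0, 1.0, 0.0]]   # one input channel, length 5

filters = [
    [[-1.0, 1.0]],   # reacts to rising edges
    [[1.0, -1.0]],   # reacts to falling edges
    [[0.5, 0.5]],    # local average
]

maps = conv1d_layer(x, filters)
for row in maps:
    print(row)
```

Three filters give three feature maps of length 4, so the stacked output has one row per output channel.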





Stride

: quantifies the amount of movement (step size) by which we slide a filter over an input

- hyperparameter of the convolution layer 

- if the stride factor is bigger than 1, the effect is to compress (downsample) the input
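A quick sketch of how the stride shortens the output. It assumes no padding, and both the helper names and the toy kernel are illustrative.

```python
def conv1d_strided(signal, kernel, stride=1):
    """1D convolution (no padding) that moves the kernel `stride` samples at a time."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def output_length(n, k, stride=1):
    """Number of outputs for input length n and kernel size k, without padding."""
    return (n - k) // stride + 1

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 1]   # hypothetical kernel: sum of two neighbouring samples

stride1_out = conv1d_strided(signal, kernel, stride=1)
stride2_out = conv1d_strided(signal, kernel, stride=2)

print(stride1_out)
print(stride2_out)
```

With stride 2 the kernel skips every other position, so the output is roughly half as long.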




Dilated Convolution 

- a technique that expands the filter by inserting holes between its consecutive elements

- lets the filter cover a larger area of the input without adding any weights
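A minimal sketch of dilation, assuming no padding and stride 1. The kernel keeps its 3 weights, but with dilation 2 its taps are spread over 5 input samples. `conv1d_dilated` is an illustrative helper.

```python
def conv1d_dilated(signal, kernel, dilation=1):
    """1D convolution whose kernel taps are `dilation` samples apart."""
    k = len(kernel)
    span = dilation * (k - 1) + 1   # input samples covered by one kernel placement
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(k))
        for i in range(len(signal) - span + 1)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

plain = conv1d_dilated(signal, kernel, dilation=1)    # taps at i, i+1, i+2
dilated = conv1d_dilated(signal, kernel, dilation=2)  # taps at i, i+2, i+4

print(plain)
print(dilated)
```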



Stacking Convolutional Layers

- we can stack multiple convolutional layers to form a deep convolutional network.

- optionally, normalization is applied right after the convolution (layer norm or batch norm)

- then, a non-linearity is applied (ReLU or LeakyReLU)

=> after non-linearity, the set of features = feature maps 


- After stacking multiple convolutional layers, we apply a flatten operation that stacks all the output channels into a single big vector

- Finally, we can apply a final linear transformation (fully-connected layer) to that vector

=> Feature Extraction 
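The stacking recipe can be sketched end to end in plain Python. Everything here is an assumption for illustration: the weights are hand-picked rather than learned, normalization is skipped since it is optional, and flattening is done before the final linear layer because a fully-connected layer expects a vector.

```python
def conv1d_layer(x, filters):
    """x: [in_channels][length]; filters: [out_channels][in_channels][kernel_size]."""
    length = len(x[0])
    k = len(filters[0][0])
    return [
        [
            sum(w[c][j] * x[c][t + j] for c in range(len(x)) for j in range(k))
            for t in range(length - k + 1)
        ]
        for w in filters
    ]

def relu(maps):
    """Element-wise non-linearity applied to every feature map."""
    return [[max(0.0, v) for v in row] for row in maps]

def flatten(maps):
    """Stack all output channels into one vector."""
    return [v for row in maps for v in row]

def linear(vector, weights, bias):
    """Fully-connected layer: one weight row and one bias per output."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

x = [[0.0, 1.0, 2.0, 1.0, 0.0, 1.0]]            # 1 channel, length 6

filters1 = [[[1.0, -1.0]], [[-1.0, 1.0]]]       # 1 -> 2 channels, kernel size 2
filters2 = [[[1.0, 1.0], [1.0, 1.0]]]           # 2 -> 1 channel, kernel size 2

h1 = relu(conv1d_layer(x, filters1))            # feature maps: 2 x 5
h2 = relu(conv1d_layer(h1, filters2))           # feature maps: 1 x 4
features = flatten(h2)                          # vector of length 4
outputs = linear(features, [[1.0] * 4, [-1.0] * 4], [0.0, 0.0])

print(h1)
print(h2)
print(outputs)
```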




Receptive Field 

: the region of the input space that affects a particular unit of the network



- Shows which specific area of the input image a particular neuron is looking at.

- To put it simply, the receptive field of a neuron represents the area in the input that the neuron uses for its computations.

- Smaller receptive fields enable neurons to detect finer and more detailed features.


=> Therefore, in CNNs, the receptive field is a crucial concept that explains how the model perceives and understands various parts of an image. 



* The receptive field depends on several factors

1) Kernel Size
2) Number of Layers
3) Stride Factor
4) Dilation factor
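How the four factors combine can be sketched with the usual recurrence for a plain stack of 1D convolutions: each layer adds (kernel size - 1) x dilation x (product of the strides of the earlier layers) to the receptive field. The function name and the `(kernel_size, stride, dilation)` tuple layout are assumptions for this example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation), ordered from input to output.
    Returns the receptive field (in input samples) of one unit in the last layer."""
    rf = 1
    jump = 1   # distance, in input samples, between neighbouring units of the current layer
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf

one_layer = receptive_field([(3, 1, 1)])
three_layers = receptive_field([(3, 1, 1)] * 3)
with_stride = receptive_field([(3, 2, 1)] * 3)
with_dilation = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])

print(one_layer, three_layers, with_stride, with_dilation)
```

Adding layers grows the receptive field linearly, while strides or growing dilations make it grow much faster for the same number of layers.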





Number of Parameters

- the number of parameters in a 1D convolutional layer (ignoring biases)

=> kernel size x input channels x output channels


- the number of parameters in a 2D convolutional layer (ignoring biases)

=> kernel size(x) x kernel size(y) x input channels x output channels
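The two formulas as small helper functions, with example layer sizes chosen arbitrarily. The optional `bias` flag is an addition not in the formulas above: when a layer uses biases, it has one extra parameter per output channel.

```python
def conv1d_params(kernel_size, in_channels, out_channels, bias=False):
    """Weights of a 1D convolutional layer, optionally plus one bias per output channel."""
    n = kernel_size * in_channels * out_channels
    return n + out_channels if bias else n

def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=False):
    """Weights of a 2D convolutional layer, optionally plus one bias per output channel."""
    n = kernel_h * kernel_w * in_channels * out_channels
    return n + out_channels if bias else n

p1d = conv1d_params(3, 16, 32)                    # 3 x 16 x 32
p2d = conv2d_params(3, 3, 3, 64)                  # 3 x 3 x 3 x 64
p2d_bias = conv2d_params(3, 3, 3, 64, bias=True)  # same, plus 64 biases

print(p1d, p2d, p2d_bias)
```

Note that the count does not depend on the length or resolution of the input, only on the kernel size and channel counts.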





Pooling

: helps to make feature maps approximately invariant to small translations of the input


- pooling is often applied after the non-linearity (activation function)

- the size of the sliding window and its stride factor are hyperparameters of the pooling operation 


(e.g.) Max Pooling, Average Pooling
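A small sketch of max and average pooling on a 1D feature map, with window 2 and stride 2 as assumed hyperparameters. `pool1d` and `mean` are illustrative helpers. It also shows the approximate invariance: a one-sample shift that stays inside the same window leaves the max-pooled output unchanged.

```python
def pool1d(signal, window, stride, reduce):
    """Apply `reduce` (e.g. max or mean) to each sliding window of the signal."""
    return [
        reduce(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, stride)
    ]

def mean(values):
    return sum(values) / len(values)

feature_map = [0, 0, 5, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0, 0, 0]   # same activation, moved by one sample

max_original = pool1d(feature_map, window=2, stride=2, reduce=max)
max_shifted = pool1d(shifted, window=2, stride=2, reduce=max)
avg_original = pool1d(feature_map, window=2, stride=2, reduce=mean)

print(max_original, max_shifted)
print(avg_original)
```

The invariance is only approximate: a shift that crosses a window boundary would move the peak to the neighbouring output.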


