Convolutional Layer

Training Convolutional Nets to Discover Calcified Plaque in IVUS Sequences

Ricardo Ñanculef , ... Simone Balocco , in Intravascular Ultrasound, 2020

2.1.1 Convolutional architectures

Given an unknown function $f_0 : X \to Y$ that one needs to learn from data, neural networks implement a hypothesis $f : X \to Y$ that decomposes as the composition $f = f_1 \circ f_2 \circ \cdots \circ f_M$ of simpler functions $f_m$ referred to as layers. In classic feed-forward nets (FFNs), layers receive as input a vector $a^{(m-1)}$ of size $I_{m-1}$ and compute as output a vector $a^{(m)}$ of size $I_m$, implementing a map of the form $a^{(m)} = g_m(W^{(m)} a^{(m-1)} + b^{(m)})$, where $W^{(m)}$ is a matrix of shape $I_m \times I_{m-1}$, $b^{(m)} \in \mathbb{R}^{I_m}$, and $g_m(\cdot)$ is a nonlinear function applied component-wise. Compared to FFNs, the early layers of a CNN allow two additional types of computation: convolution and pooling.

Convolutional layers receive as input an image $A^{(m-1)}$ (with $K_m$ channels) and compute as output a new image $A^{(m)}$ (composed of $O_m$ channels). The output at each channel is known as a feature map and is computed as

(1) $A_o^{(m)} = g_m\left( \sum_{k} W_{ok}^{(m)} * A_k^{(m-1)} + b_o^{(m)} \right),$

where $*$ denotes the (2D) convolution operation

(2) $\left(W_{ok} * A_k\right)[s,t] = \sum_{p,q} A_k[s+p,\, t+q]\; W_{ok}[P-1-p,\, Q-1-q],$

where $W_{ok}^{(m)}$ is a matrix of shape $P_m \times Q_m$ and $b_o^{(m)} \in \mathbb{R}$. The matrix $W_{ok}^{(m)}$ parameterizes a spatial filter that the layer can use to detect or enhance some feature in the incoming image. The specific action of this filter is learnt automatically from data during the training of the network.
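To make Eqs. (1)–(2) concrete, here is a minimal NumPy sketch that computes one output feature map from a multi-channel input; the array names and the "valid" (no-padding) convention are illustrative assumptions, not part of the original text, and the activation $g_m$ is left to be applied afterwards.

```python
import numpy as np

def conv_feature_map(A_prev, W, b):
    """One output feature map A_o of Eqs. (1)-(2).

    A_prev : (K, H, Wd)  input image with K channels
    W      : (K, P, Q)   one filter slice per input channel (W_ok for a fixed o)
    b      : float       scalar bias b_o
    """
    K, H, Wd = A_prev.shape
    _, P, Q = W.shape
    out_h, out_w = H - P + 1, Wd - Q + 1          # "valid" convolution
    A_o = np.full((out_h, out_w), b, dtype=float)
    for k in range(K):                            # sum over input channels (Eq. 1)
        for s in range(out_h):
            for t in range(out_w):
                patch = A_prev[k, s:s + P, t:t + Q]
                # flipped kernel indexing W[P-1-p, Q-1-q] of Eq. (2)
                A_o[s, t] += np.sum(patch * W[k, ::-1, ::-1])
    return A_o                                    # apply g_m afterwards, e.g. np.maximum(A_o, 0)

A_o = conv_feature_map(np.random.rand(3, 8, 8), np.random.rand(3, 3, 3), 0.1)
print(A_o.shape)                                  # (6, 6)
```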

Pooling layers of a CNN implement a spatial dimensionality reduction operation designed to reduce the number of trainable parameters for the next layers and allow them to focus on larger areas of the input pattern. Given an image $A^{(m-1)}$, a typical pooling layer with pool sizes $P_m, Q_m \in \mathbb{N}$ and strides $\alpha_m, \beta_m \in \mathbb{N}$ implements a channel-wise operation of the form

(3) $A_o^{(m)}[s,t] = \left( \kappa \sum_{p,q} \left( A_o^{(m-1)}[\alpha_m s + p,\, \beta_m t + q] \right)^{\rho} \right)^{1/\rho},$

where $\kappa, \rho \in \mathbb{N}$ are fixed parameters. Note that using $P_m = Q_m = \alpha_m = \beta_m$ corresponds to splitting each channel of the input image into nonoverlapping $P_m \times Q_m$ patches and substituting the values in each region with a single value determined by $\rho$ and $\kappa$. In max pooling layers ($\rho = \infty$, $\kappa = 1$), this value is the maximum of the values found in the patch. In average pooling layers ($\rho = 1$, $\kappa = 1/PQ$), one takes the average of the values in the corresponding patch. The right choice of this function can make the model more robust to distortions in the input pattern.
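A short NumPy sketch of the pooling rule in Eq. (3), with average pooling obtained directly from the formula (ρ = 1, κ = 1/PQ) and max pooling shown separately since it corresponds to the limit ρ → ∞; the sizes are arbitrary illustrations.

```python
import numpy as np

def pool_channel(A, P, Q, alpha, beta, kappa=1.0, rho=1):
    """Channel-wise pooling of Eq. (3): (kappa * sum(patch**rho))**(1/rho)."""
    H, W = A.shape
    out_h, out_w = (H - P) // alpha + 1, (W - Q) // beta + 1
    out = np.empty((out_h, out_w))
    for s in range(out_h):
        for t in range(out_w):
            patch = A[alpha * s:alpha * s + P, beta * t:beta * t + Q]
            out[s, t] = (kappa * np.sum(patch ** rho)) ** (1.0 / rho)
    return out

A = np.random.rand(8, 8)
avg = pool_channel(A, 2, 2, 2, 2, kappa=1 / 4, rho=1)   # average pooling over 2x2 patches
mx = A.reshape(4, 2, 4, 2).max(axis=(1, 3))             # max pooling (the rho -> infinity limit)
print(avg.shape, mx.shape)                              # (4, 4) (4, 4)
```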

Architecting a deep CNN amounts to devising an appropriate succession of convolutional, pooling, and traditional (fully connected) layers, as well as their hyperparameters. As depicted in Fig. 3, typical architectures introduce a pooling layer after one or two convolutional layers, defining a convolutional block that is repeated until the size of the feature map is small enough to introduce traditional layers. The transition from bidimensional (or multidimensional) layers to one-dimensional fully connected layers requires a special reshaping operation called a "flatten layer."

Fig. 3

Fig. 3. Example of a CNN architecture popularized after AlexNet.
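As an illustration of the block pattern just described (convolution–pooling blocks, a flatten layer, then fully connected layers), here is a minimal PyTorch sketch; the layer sizes are arbitrary choices for a 3-channel 32 × 32 input and are not the AlexNet configuration of Fig. 3.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional block 1
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolutional block 2
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),                                 # the "flatten layer": 32*8*8 = 2048 features
    nn.Linear(32 * 8 * 8, 128),                   # traditional (fully connected) layers
    nn.ReLU(),
    nn.Linear(128, 10),                           # e.g. 10 output classes
)

print(model(torch.randn(1, 3, 32, 32)).shape)     # torch.Size([1, 10])
```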

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128188330000096

Deep learning: a review

KC Santosh , ... Swarnendu Ghosh , in Deep Learning Models for Medical Imaging, 2022

2.3.5 Output layer

Convolutional layers are good at feature extraction from images as they deal with spatial redundancy by weight sharing. As we go deeper down the network, the features become more exclusive and informative, and redundancy is reduced. This is primarily due to repeated cascaded convolutions and information compression by subsampling layers. As redundancy is reduced, we end up with a compressed feature representation of the content of the image. The output layers then deal with mapping this feature to the necessary output categories. This mapping function no longer requires any weight sharing, because the entire feature vector is needed to make an informed decision. The standard practice is to convert the learned features from the convolutional feature extractors into a vector that works as an image descriptor. This conversion can be done in two ways, which are shown in Fig. 2.23. One way is to simply reshape all the activations of the last layer of the feature extractor into a one-dimensional tensor [44,48]. The second is to use a full-scale average pooling: for an activation map of resolution h × w, an average pool with a kernel of the same resolution h × w reduces the map to a scalar value signifying the gross activation [46,47]. In this way the last layer can be mapped to a feature vector. This vector is then connected to the output classifier. The classifier is a standard multilayer network consisting of optional hidden layers and an output layer with the requisite number of output neurons that map from the feature descriptor to the output space.

Figure 2.23

Effigy 2.23. Boilerplate pooling versus tensor flattening for conversion of spatial representation to characteristic embeddings.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012823504100012X

AI and Cloud Computing

Chengsheng Yuan , ... Sheng Wu , in Advances in Computers, 2021

3.2 Structure of our FLD method

Our FLD scheme contains five components: Input, Convolution, SPP-net, Full Connection, and Output. Its structure is illustrated in Fig. 8.

Fig. 8

Fig. 8. Our FLD model structure.

Source: Author.

3.2.1 Convolutional layers and pooling layers

The convolutional layer operations are regarded as high-level feature representations of the given fingerprints, obtained by finding correlations in the raw pixel-level intensities. Drawing on [29], we construct an improved CNN with SPP-net for FLD. As shown in Fig. 9, (A) is our model structure and (B) is the AlexNet model architecture. In both structures, the output of each layer is entered as the input of the successive layer. For simplicity, we only show the differences between the two model architectures. Different from the former method in [26], the sequence of the normalization layer and pooling layer has been adjusted in our scheme to reduce the parameters of the trained model and to eliminate the image size problem.

Fig. 9

Fig. 9. Comparison of our model structure and the original model architecture: (A) the part of our model structure and (B) the part of the original model architecture.

Source: Author.

The CNN comprises alternating convolutional layers, pooling layers, full connection layers, and a final classification layer. In Fig. 10, we visualize a feature map produced by a convolutional operation and the pooling operation. Convolutional features are obtained by computing the inner product of the original fingerprint image and the filters, and the process of convolution is considered the process of feature extraction. Next, ReLU is used as the activation function to compute the feature maps. After the convolution, a max-pooling operation is performed to reduce the dimensionality of the feature maps and prevent over-fitting. The principle of max-pooling is to take the maximum within sliding windows, such as the green solid-line window in Fig. 10. All the convolutions are followed by a non-linear activation operation (ReLU). The model takes a three-channel fingerprint image as input. The first convolution operation generates a 64-channel feature map with a spatial dimension of 224 × 224. Then, the second max-pooling operation reduces the spatial dimension to 112 × 112.

Fig. 10

Fig. 10. Architecture of the single-layer CNN feature extraction process.

Source: Author.
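The first stage described above (a convolution producing a 64-channel 224 × 224 map, ReLU, then 2 × 2 max-pooling down to 112 × 112) can be sketched in PyTorch as follows; the 3 × 3 kernel size and padding are assumptions chosen only to preserve the stated spatial dimensions.

```python
import torch
import torch.nn as nn

stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 3-channel fingerprint -> 64 feature maps, 224x224
    nn.ReLU(),                                    # non-linear activation after the convolution
    nn.MaxPool2d(kernel_size=2, stride=2),        # maximum in 2x2 sliding windows: 224 -> 112
)

x = torch.randn(1, 3, 224, 224)                   # one fingerprint image
print(stage(x).shape)                             # torch.Size([1, 64, 112, 112])
```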

3.2.2 The construction of the SPP-net deep feature vector

The scale of fingerprint images varies because of the different fingerprint acquisition devices, and different scales contain different detail information of the images. Our scheme attempts to learn multiscale high-level features so as to capture detailed spatial information. To achieve this goal, a fixed-length vector is necessary for the classifiers or full connection layers. Therefore, we establish a network layer structure based on the Spatial Pyramid Pooling network (SPP-net).

The SPP-net keeps spatial information by pooling in local spatial sub-regions. Feature maps of the last convolutional layers are partitioned into sub-regions whose sizes are proportional to the map, so the number of sub-regions is fixed for images of arbitrary size/scale. As shown in Fig. 8, in our scheme a spatial pyramid pooling layer is installed after five successive convolutional layers and two successive general pooling layers; it is followed by two full connection layers and a classification layer. In this way, to eliminate the image size problem, the spatial pyramid layer has been added between the last convolution layer and the first full connection layer.

Suppose the size of the last convolutional layer's output image (feature map) is a × a and each feature map is divided into n × n blocks. SPP-net is viewed as a convolution-like operation in which the size/scale of the sliding window is win = ⌈a/n⌉ and the stride is str = ⌊a/n⌋, where ⌈.⌉ and ⌊.⌋ are the ceiling and floor operators, respectively. A three-level pyramid is used to extract high-level semantic features, in which the divided sub-regions n × n are set, respectively, to 1 × 1, 2 × 2, and 4 × 4. In each spatial sub-region, SPP-net is used to pool the responses of each convolutional kernel. The final output is composed by concatenating the pooling results of the three levels to generate a fixed-length semantic representation of size kM for an input image of arbitrary size/scale, where k is the number of sub-regions and M is the number of feature maps in the last convolutional layers. After this process, the fixed-length vectors are fed into the first full connection layer.
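A minimal NumPy sketch of the fixed-length pooling just described, assuming max-pooling within each sub-region (the pooling operator is not spelled out in the text) and arbitrary feature-map sizes; two inputs of different spatial size yield vectors of the same length M(1 + 4 + 16).

```python
import numpy as np

def spp_vector(feature_maps, levels=(1, 2, 4)):
    """Concatenate pooled sub-region responses into a fixed-length vector.

    feature_maps : (M, a, a) output of the last convolutional layer
    """
    M, a, _ = feature_maps.shape
    out = []
    for n in levels:
        win, stride = int(np.ceil(a / n)), int(np.floor(a / n))   # window and stride per the text
        for i in range(n):
            for j in range(n):
                patch = feature_maps[:, i * stride:i * stride + win, j * stride:j * stride + win]
                out.append(patch.max(axis=(1, 2)))                # one response per feature map
    return np.concatenate(out)

# 256 maps -> 256 * (1 + 4 + 16) = 5376 values, regardless of the input scale.
print(spp_vector(np.random.rand(256, 13, 13)).shape, spp_vector(np.random.rand(256, 10, 10)).shape)
```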

3.2.3 Fine-tuning fingerprint neural network parameters

Limited by the number of fingerprint images, trained model classifiers are prone to over-fitting. In order to resolve this problem, we obtain some parameters from a model pre-trained on the ImageNet 2012 database. Then we use the training samples from the LivDet databases to fine-tune the pre-trained model parameters. The training procedure of our network model is based on the open-source Caffe DCNN library [30,31]. After the pre-training on ImageNet 2012, the weights and bias parameters learned in the five convolutional and general pooling layers are used as the initialization parameters of our network model. The pre-trained model parameters are directly used as the initialization of our model, and we only need to use the training samples of the fingerprint data set to fine-tune our model parameters.
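The original scheme is implemented in Caffe; purely to illustrate the fine-tuning idea (initialize from ImageNet-pre-trained weights, replace the classifier, then continue training on the target data), here is a hedged PyTorch sketch that assumes a recent torchvision and a two-class live/fake output.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet 2012 pre-trained weights (assumes torchvision >= 0.13).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[-1] = nn.Linear(4096, 2)     # new output layer, e.g. live vs. fake fingerprint

# Fine-tune with a small learning rate; alternatively freeze model.features entirely.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```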

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0065245820300863

AI and Cloud Computing

Bin Yang , ... Enguo Cao , in Advances in Computers, 2021

4.3 Convolutional module

The convolutional layers are capable of extracting different features from an image such as edges, textures, objects, and scenes [50]. As pointed out above, forgery is better captured around the boundary of forged regions. Thus, the low-level features are critical to identify manipulated regions. The filters in a convolutional layer create feature maps that are connected to local regions of the previous layer.

Two pairs of convolutional (C1 and C3) and pooling layers (P2 and P4) are designed following the micro neural network in our CNN. In the convolutional layers, we use a kernel size of m × m × C, where C is the depth of a filter and m is the size of the convolutional kernel. The parameters C and m take different values for different layers in the network, as demonstrated in Fig. 14. For example, the convolutional kernel for the first layer is 5 × 5 × 32. The size of the output (C1) is 64 × 64 × 32, which means the number of feature maps is 32 and the resolution of each feature map is 64 × 64. The convolution operation can be denoted as:

(20) $x_j^{l} = \sum_{i=1}^{n} x_i^{l-1} \times k_{ij}^{l-1} + b_j^{l}$

where × denotes convolution, $x_j^{l}$ is the jth output map in layer l, and the convolutional kernel $k_{ij}^{l-1}$ (also called a weight) can be updated while training the network; it connects the ith output map in layer l − 1 and the jth output map in layer l. $b_j^{l}$ is the trainable bias parameter of the jth output map in layer l.

The pooling layer is used for a down-sampling operation after obtaining feature maps through the convolution process. In a classical CNN, convolution layers are followed by a subsampling layer. The size of the feature maps is reduced by the pooling layer, and some invariance is introduced. A max-pooling layer is a variant which has shown some merit in [51]. The output of a max-pooling layer is given by the maximum activation over non-overlapping regions, instead of averaging the inputs as in a classical subsampling layer. A bias is added to the pooling result, and the output map is passed through the squashing function.

In our network, a max-pooling layer with a filter of size 2 × 2 is used to decrease the size of the feature maps to 30 × 30 after the C1 layer. Let l denote the index of a max-pooling layer. The layer's output is a set P_l of square maps of size w_l. We obtain P_l from P_{l − 1}. The square map size w_l is obtained by w_l = w_{l − 1}/k, where k is the size of the square max-pooling kernel. Following the pooling layer (P2) is another pair of convolution and pooling layers, with 64 kernels of size 3 × 3 and a filter of size 2 × 2. Dropout [52] is a widely used technique for avoiding overfitting in neural networks. Therefore, Rectified Linear Units (ReLUs) [53] and dropout are used in our proposed CNN architecture. Based on Eq. (17), the operation is expressed as:

(21) $f_{m,n} = \max\left( x_{m,n}^{l},\, 0 \right)$

where $x_{m,n}$ is the input patch centered at the feature map point (m, n) in layer l.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0065245820300814

Machine learning

Jim Jeffers , ... Avinash Sodani , in Intel Xeon Phi Processor High Performance Programming (Second Edition), 2016

Cache-Blocking Strategy for Convolutional Layers

The convolutional layer (forward-propagation) operation consists of a 6-nested loop, as shown in Fig. 24.3. When written in the naïve manner as in Fig. 24.6, the convolution operation is bandwidth bound for many instances. It is simple to see that, unless the activations (input[] and output[]) and weights completely fit in cache (which is often not the case), the third loop of the neural network convolution operation (line 3) pulls in OFH*OFW output activations, (OFH*STRIDE + KH − 1)*(OFW*STRIDE + KW − 1) input activations (denoted as IFH*IFW), and KH*KW weights, while it performs KH*KW*OFW*OFH multiply-and-accumulate operations. The bytes-to-flops ratio can easily be computed to be:

Fig. 24.6. A naïve 6-nested loop for forward-propagation of convolutional layers.

B/F = data_size*(OFW*OFH + IFW*IFH + KW*KH)/
(2*KW*KH*OFH*OFW).

For a typical CNN layer with IFM = OFM = 1024, OFH = OFW = 12, KH = KW = 3, STRIDE = 1 (as in layer C5 of the OverFeat-FAST CNN), we obtain a bytes-to-flops (B/F) ratio of 0.54. The single-precision bytes-to-flops (B/F) ratio for Knights Landing is computed as follows: 490 GB/s/(68 * 1.4 * 16 * 2 * 2) GFlops = 0.08, assuming 490 GB/s of bandwidth. Clearly, a kernel written as in Fig. 24.6 will be heavily bandwidth bound. The naïve loop is therefore theoretically limited to 15% efficiency (0.08/0.54). Since the operation is bandwidth bound and the algorithmic B/F ratio is 0.54, the achievable performance is (490 GB/s)/(0.54 GB per GFlop), or 907 GFlops out of a peak of 6092 GFlops. This is also the ratio of the algorithmic B/F and the machine B/F. Now, even if we assume that after loop 2 the content can be stored in on-die caches, the B/F ratio improves only to 0.24 and the efficiency is limited to about 30% (0.08/0.24).
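The arithmetic above is easy to reproduce; this short Python sketch recomputes the algorithmic and machine B/F ratios for the quoted C5 layer (single precision, 4 bytes per value).

```python
# Naive bytes-to-flops ratio for the C5 layer of OverFeat-FAST (loop nest of Fig. 24.6).
data_size = 4                        # bytes per single-precision value
OFH = OFW = 12
KH = KW = 3
STRIDE = 1
IFH = OFH * STRIDE + KH - 1          # 14
IFW = OFW * STRIDE + KW - 1          # 14

bf_naive = data_size * (OFW * OFH + IFW * IFH + KW * KH) / (2 * KW * KH * OFH * OFW)
print(round(bf_naive, 2))            # 0.54

# Machine B/F for Knights Landing: 490 GB/s over 68 cores * 1.4 GHz * 16 lanes * 2 VPUs * 2 (FMA).
peak_gflops = 68 * 1.4 * 16 * 2 * 2
print(round(peak_gflops, 1), round(490 / peak_gflops, 2))   # 6092.8 0.08
```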

Clearly there is a need and an opportunity to block loops 1 and 2 in on-die caches over the input and output feature maps. We first consider blocking on loop 1 over output feature maps (block size = OB), producing the loop structure in Fig. 24.7. If OB output features, each of size OFH*OFW, can be stored in the cache, then the B/F ratio becomes:

Fig. 24.7. A blocked 7-nested loop for forward-propagation of convolutional layers, with the loop over output feature maps blocked (loop-1 and loop-7).

B/F = data_size*
(OB*OFH*OFW + OB*IFM*KH*KW + IFM*IFH*IFW)/
(OB*2*OFW*OFH*KH*KW*IFM)

In Fig. 24.7, we stream through the input features, and we reuse each input feature to compute OB output features. For the C5 layer of OverFeat-FAST, the values of the variables are IFM = OFM = 1024, OFH = OFW = 12, and KH = KW = 3. For this layer, we compute the B/F ratio for OB = 16 to be 0.033. This is below the 0.08 machine B/F ratio and makes the loop compute bound. Many convolutional layers become compute bound when OB is set to the SIMD width (sixteen 32-bit values for Knights Landing) of the processor. We can further improve the B/F ratio by additionally blocking on loop 2 (Fig. 24.7) over IFM, whereby the B/F ratio further improves to:

B/F = data_size*
(OB*OFH*OFW + OB*IB*KH*KW + IB*IFH*IFW)/
(IB*OB*2*OFW*OFH*KH*KW)

For this example, the C5 layer can achieve a B/F ratio of 0.02. Note, however, that a B/F ratio below the machine B/F ratio is sufficient, and blocking the loop over input feature maps with a simple SIMD_WIDTH-sized block keeps the implementation simple and the B/F ratio within the requisite limits. Moreover, keeping the innermost loop over OB (line 7, Fig. 24.7) enables vectorization of the fused multiply-and-add operation.
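Continuing the sketch above, the blocked ratios can be checked the same way; OB = 16 reproduces the quoted 0.033, and IB is an assumed input-feature block size used only to exercise the second formula.

```python
data_size = 4
OFH = OFW = 12
KH = KW = 3
IFH = IFW = 14
IFM = OFM = 1024
OB = 16                              # output-feature block = SIMD width

# Blocking loop 1 over output feature maps (Fig. 24.7):
bf_ob = data_size * (OB * OFH * OFW + OB * IFM * KH * KW + IFM * IFH * IFW) \
        / (OB * 2 * OFW * OFH * KH * KW * IFM)
print(round(bf_ob, 3))               # 0.033 -- below the 0.08 machine B/F, hence compute bound

# Additionally blocking loop 2 over input feature maps with an assumed block size IB:
IB = 64
bf_ob_ib = data_size * (OB * OFH * OFW + OB * IB * KH * KW + IB * IFH * IFW) \
           / (IB * OB * 2 * OFW * OFH * KH * KW)
print(round(bf_ob_ib, 3))            # 0.036 for IB = 64; larger OB and IB push this toward the quoted 0.02
```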

The backward pass has a similar loop blocking, wherein loops 1 and 2 are swapped and the blocking is performed over the input-feature-map dimension. The cache-blocking strategy for the weight update is the same as that for forward-propagation.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128091944000247

13th International Symposium on Process Systems Engineering (PSE 2018)

Feng Hua , ... Tong Qiu , in Computer Aided Chemical Engineering, 2018

three The neural network architecture

One common type of artificial neural network model is the so-called "black-box" model, where prior knowledge of the chemical mechanism is completely neglected. This kind of purely data-driven model can simulate much faster than traditional kinetic models and with high predictive accuracy given a large training set. However, even with sufficient training data, the model might still perform poorly when the input data are outside the range of the training set.

In order to improve the flexibility of the neural network, we make use of prior knowledge of the reaction network to guide the design of the neural network architecture. By combining the structural features of the naphtha pyrolysis network with the designed neural net, we transform the data-driven model into a hybrid model of both mechanism and data.

Our proposed CNN architecture is shown in Fig. 2. The input consists of the detailed molecular composition of the feedstock and the operating conditions. The reaction network is embedded in the convolution layer, where the features of the network are learned. The output of the convolutional layer goes into loops. Here, we set two loops in the structure.

Figure 2

Figure 2. A schematic view of the proposed CNN architecture

The output layer exports the yields of nine key products: H2, CH4, C2H4, C2H6, C3H6, C3H8, C4H6, NC4H8 and IC4H8.

3.1 Convolutional and pooling layer

The design of the convolutional layer is presented in Fig. 3. Each convolutional layer consists of 16 kernels. Each kernel contains 4694 sets of parameters representing 4694 reactions, and each set of parameters operates on the relevant feedstock components. After the activation function, the convolutional layer gives out a 16*4694 matrix. The max-pooling operation is equivalent to summarizing the reaction information by components. The output of the max-pooling layer is 142 values, representing the 142 components in the reaction network.

Figure 3

Figure 3. A schematic view of the convolutional and max-pooling layer
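As a rough illustration of the pooling-by-component idea (not the authors' code), the sketch below groups the 16 × 4694 reaction activations by an assumed reaction-to-component index and takes the maximum per component, giving 142 values per kernel.

```python
import numpy as np

n_kernels, n_reactions, n_components = 16, 4694, 142

# Assumed mapping from each reaction to the component it is summarized under.
reaction_to_component = np.random.randint(0, n_components, size=n_reactions)

activations = np.random.rand(n_kernels, n_reactions)    # 16*4694 output of the convolutional layer

# Max-pool the reaction activations component-wise, for each kernel.
pooled = np.empty((n_kernels, n_components))
for c in range(n_components):
    pooled[:, c] = activations[:, reaction_to_component == c].max(axis=1)

print(pooled.shape)                                     # (16, 142)
```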

3.2 Loop

Each loop consists of a fully connected layer, a convolutional layer, and a pooling layer. In addition to the output of the previous layer, the input of the first fully connected layer involves 5 operating conditions, including coil inlet temperature (CIT), coil outlet temperature (COT), coil inlet pressure (CIP), feed rate, and water/oil ratio. Thus, the number of inputs is 16*142 + 5 = 2277. We set the number of neurons in the fully connected layer to 142. Each neuron in the fully connected layer adds the weighted inputs together plus a certain bias. The result is then transformed by a certain activation function and delivered to the next layer. From the mathematical point of view, the operation can be regarded as a row vector (1*2277) multiplied by a matrix of 2277*142 and added to a row vector of 1*142. Thus, the output is reduced to 142 values, in line with the input of the next layer. A schematic view of the loop is given in Fig. 4.

Figure 4

Figure 4. A schematic view of the loop
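The dimensional bookkeeping of this fully connected layer can be checked with a few lines of NumPy; the ReLU activation is an assumption made only for illustration.

```python
import numpy as np

pooled = np.random.rand(16, 142)       # max-pooling output: 16 kernels x 142 components
conditions = np.random.rand(5)         # CIT, COT, CIP, feed rate, water/oil ratio

x = np.concatenate([pooled.ravel(), conditions])   # 16*142 + 5 = 2277 inputs
W = np.random.rand(2277, 142)          # weight matrix of the fully connected layer
b = np.random.rand(142)                # one bias per neuron

h = np.maximum(x @ W + b, 0)           # (1x2277) . (2277x142) + (1x142), then ReLU
print(x.shape, h.shape)                # (2277,) (142,)
```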

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978044464241750135X

Learning

Zhongzhi Shi , in Intelligence Science, 2021

7.8.5.1 Feed-forward propagation of the convolutional layer

Each neuron in the convolutional layer extracts the features in the local receptive field at the same location of all feature maps in the former layer. The neurons in the same feature map share the same weight matrix. The convolutional process can be viewed as the convolutional neurons seamlessly scanning the former feature maps line by line through the weight matrix. The output, $O_{(x,y)}^{(l,k)}$, of the neuron located at row x, column y in the kth feature map of the lth convolutional layer can be computed by Eq. (7.26), where tanh(·) is the activation function:

(7.26) $O_{(x,y)}^{(l,k)} = \tanh\left( \sum_{t=0}^{f_i} \sum_{r=0}^{kh} \sum_{c=0}^{kw} W_{(r,c)}^{(k,t)}\, O_{(x+r,\,y+c)}^{(l-1,\,t)} + \mathrm{Bias}^{(l,k)} \right)$

From Eq. (7.26), we need to traverse all neurons of the convolutional window in all feature maps of the former layer to compute the output of a neuron in the convolutional layer. The feed-forward propagation of the full connection layer is similar to that of the convolutional layer; it can be viewed as a convolutional operation in which the convolutional weight matrix and the input have the same size.
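A direct NumPy transcription of Eq. (7.26) for a single output neuron, with array shapes assumed for illustration.

```python
import numpy as np

def conv_neuron_output(O_prev, W, bias, x, y):
    """Output of the neuron at (x, y) in one feature map of layer l, per Eq. (7.26).

    O_prev : (f, H, W)    all feature maps of the former layer l-1
    W      : (f, kh, kw)  weight matrix shared by the whole feature map (one slice per former map)
    bias   : float        Bias^(l,k)
    """
    f, kh, kw = W.shape
    window = O_prev[:, x:x + kh, y:y + kw]      # local receptive field in every former map
    return np.tanh(np.sum(W * window) + bias)   # triple sum over t, r, c, then tanh

O_prev = np.random.rand(4, 10, 10)              # 4 feature maps in the former layer
W = np.random.rand(4, 3, 3)
print(conv_neuron_output(O_prev, W, 0.1, x=2, y=5))
```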

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780323853804000075

Machine learning and its application in microscopic image analysis

F. Xing , L. Yang , in Machine Learning and Medical Imaging, 2016

4.2.2.2 CNN architecture

The proposed structured regression model contains several convolutional layers (C), max-pooling layers (M), and fully connected layers (F). Fig. 4.4 illustrates the architecture and the mapped proximity patches in the proposed model on the NET dataset. The detailed model configuration is: Input(39 × 39 × 3) − C(34 × 34 × 32) − M(17 × 17 × 32) − C(14 × 14 × 32) − M(7 × 7 × 32) − F(1024) − F(1024) − F(289). The input image size depends on cell scales, and a 39 × 39 patch is large enough to cover a single cell in NET images. Due to the small size of the input image patch, it is sufficient to stack two pairs of C-M layers for feature computation. Meanwhile, multiple F layers are designed to learn a higher-level feature representation, which can benefit the final regression. The activation function of the last F (regression) layer is chosen as the sigmoid function, and a ReLU function is used for all the other F and C layers. The sizes of the C and M layers are defined as width × height × depth, where width × height determines the dimensionality of each feature map and depth represents the number of feature maps. Since the input image size is relatively small, the filter size is chosen as 6 × 6 for the first convolutional layer and 3 × 3 for the other. The max-pooling layers use a window of size 2 × 2 with a stride of 2, which has been widely adopted in current object detection algorithms and gives encouraging performance. Similar CNN architectures are used for the breast cancer and HeLa cervical cancer datasets, but with input patch sizes of 49 × 49 × 3 and 31 × 31 × 3, respectively.
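For readers who want to reproduce the stated configuration, here is a hedged PyTorch sketch using the quoted filter and pooling sizes; note that with a 3 × 3 kernel the second convolution yields a 15 × 15 map rather than the quoted 14 × 14, which the following 2 × 2 pooling still reduces to 7 × 7.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=6), nn.ReLU(),    # 39x39 -> 34x34, 32 maps
    nn.MaxPool2d(2, stride=2),                     # 34x34 -> 17x17
    nn.Conv2d(32, 32, kernel_size=3), nn.ReLU(),   # 17x17 -> 15x15 (text quotes 14x14)
    nn.MaxPool2d(2, stride=2),                     # -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 289), nn.Sigmoid(),            # 289 regression outputs
)

print(model(torch.randn(1, 3, 39, 39)).shape)      # torch.Size([1, 289])
```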

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128040768000049

Machine learning for object detection

Zuo Xiang , ... Patrick Seeling , in Computing in Communication Networks, 2020

19.2.2 Model split

Typical components of CNNs for object detection are convolutional layers, pooling layers, fully connected layers, and batch normalization layers. The most common form of a CNN architecture in CV applications stacks several convolutional layers with proper activation functions, follows them with pooling layers, and repeats this pattern until the image has been merged spatially to a small size (Section 8.2.7). At some point, it is common to transition to fully connected layers. The last fully connected layer holds the output, such as the class scores [306]. Considering that edge nodes are usually limited in available CPU and memory resources (physical or virtual), the total number of layers that can be offloaded from the server and deployed in-network is limited. By comparing the front layers of different object detection models, such as YOLOv2, SSD, VGG, and Faster R-CNN, the structures they all have in common are different combinations of convolutional layers followed by pooling layers, as shown in Table 19.1.

Table 19.1. Structure of the first ten layers in different object detection models.

Model  Structure of first ten layers
YOLOv2  Conv. + Pool. + Conv. + Pool. + 3 Conv. + Pool. + 2 Conv.
SSD  2 Conv. + Pool. + 2 Conv. + Pool. + 3 Conv. + Pool.
VGG16  2 Conv. + Pool. + 2 Conv. + Pool. + 3 Conv. + Pool.
Faster R-CNN  2 Conv. + Pool. + 2 Conv. + Pool. + 3 Conv. + Pool.

Choosing a proper split point of a model needs to take into consideration that i) the part before the split point should be capable of running on network devices and ii) the split point should result in bandwidth savings to reduce congestion. Consequently, the number of layers before the split point should not be too high, and, to realize bandwidth savings, the output data of the front part should be smaller than the original input image size. In the example of this chapter, YOLOv2 is applied and analyzed for an explanation of model split strategies. As YOLOv2 has structural similarity with other commonly employed feature extractors, the model split approach can easily be transferred to those.
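A minimal sketch of the split idea using a generic sequential feature extractor (not the actual YOLOv2 implementation): the front layers run on the edge node, and only their smaller output tensor is sent to the server.

```python
import torch
import torch.nn as nn

# A stand-in feature extractor with the "Conv. + Pool." pattern of Table 19.1.
layers = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

split_point = 6                        # split after the second pooling layer
front = layers[:split_point]           # deployed in-network on the edge node
back = layers[split_point:]            # stays on the server

image = torch.randn(1, 3, 416, 416)
intermediate = front(image)            # tensor shipped over the network instead of the image
print(image.numel(), intermediate.numel())   # 519168 vs 173056: the front output is smaller
out = back(intermediate)
```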

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128204887000347

Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications

Ehsan Fathi , Babak Maleki Shoja , in Handbook of Statistics, 2018

7.1 What is Convolution?

The following is the 1-D discrete convolution, which is the simplest definition of the convolution operator of a filter over another function.

$(f \times g)[n] = \sum_{m=-M}^{M} f[n-m]\, g[m]$

where n is a specific point in time and M, in the context of NLP, is the window size. Basically, you multiply a filter at different locations of the input. Convolution is very strong and effective for extracting features from images. The next figure shows a 2D example where the yellow square is the filter weights and the green one is the input (Fig. 24).

Fig. 24


Fig. 24. 2D convolution example: yellow shows the filter weights and green shows the input, from the Stanford UFLDL wiki. (A) Step 1. (B) Step 2. (C) Step 3. (D) Step 4. (E) Step 5. (F) Step 6. (G) Step 7. (H) Step 8. (I) Step 9.
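The 1-D definition above matches NumPy's discrete convolution; a quick check with a length-3 filter (the index range m = −M…M is simply shifted to start at 0 inside the library, which does not change the values).

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # the input sequence
g = np.array([0.25, 0.5, 0.25])           # a small filter with window size M = 1

# (f x g)[n] = sum_m f[n - m] g[m]; 'same' keeps the output aligned with the input.
print(np.convolve(f, g, mode='same'))      # [1.  2.  3.  4.  3.5]

# The same value written out explicitly for one position n, with g[m] stored at g[m + 1]:
n = 2
print(sum(f[n - m] * g[m + 1] for m in (-1, 0, 1)))   # 3.0
```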

A single-layer CNN is a simple variant using one convolutional layer and pooling. This is based on the work of Collobert and Weston (Collobert et al., 2011) and Kim (2014) on convolutional neural networks for sentence classification. First we define the notation clearly.

Word vectors: $x_i \in \mathbb{R}^k$

Sentence: $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$ (vectors concatenated)

Concatenation of words in range: $x_{i:i+j}$

Convolutional filter: $w \in \mathbb{R}^{hk}$ (goes over a window of h words)

Window size can be 2 or higher, e.g., 3

According to this notation, we start with word vectors in a k-dimensional vector space. Then, we represent a sentence through concatenation. For concatenating all n word vectors, the circled-plus operator is used, and they are concatenated lengthwise, assuming they form one long row. We may want to extract the words in a specific range, from time step i to time step i + j. Therefore, our convolutional filter (defined in terms of the window size h and the vector size k) will be a vector w of parameters that are going to be learned with standard stochastic gradient descent-type optimization methods. The size of the filter affects the learning significantly. A longer filter leads to more computation to handle. Also, longer filters are able to capture more phrases, but you are more likely to overfit your model. The size of the filter should be a hyperparameter. There are some tricks, where you can have multiple filters with multiple lengths, which help you prevent overfitting. We will discuss this more in this section.

Assume that we need a convolutional filter that at each time step looks at three different word vectors and tries to combine them into some kind of feature representation, such as a single number. Then, the filter has three times the number of dimensions of each word vector. Fig. 25 shows a very simple example of a convolutional filter for a two-dimensional word vector with a window size of 3; hence, we basically have a six-dimensional w here. You should note that w is a single vector, not a matrix, just like our word vectors that are concatenated into a single vector.

Fig. 25

Fig. 25. Example of a convolutional filter for a two-dimensional word vector with a window size of 3.

Now we discuss why it is a neural network and describe the computations. In order to compute a feature at a time step, for the previous example, we take the inner product of the parameter vector w with the word vectors from the i-th time step through the window of size h:

$c_i = f\left( w^{T} x_{i:i+h-1} + b \right)$

For example, to calculate $c_1$, we take $w^{T} x_{1:3}$, i.e., simply the concatenation of those word vectors, in our product. b is the bias term, and we add a nonlinearity at the end.

Given the sentence $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, all possible windows of length h are $x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}$. Since we have the computation of $c_i$ at every time step, this means that we have a feature map defined as $C = [c_1, c_2, \ldots, c_{n-h+1}] \in \mathbb{R}^{n-h+1}$. Each c value uses the same w and takes an inner product with a different window at each time step.

If we keep concatenating the words, when we reach the last word vector we require two more words to apply the filter. As illustrated in Fig. 26, zero vectors (in this example two vectors) are appended so that the filter can be applied to the last word of the sentence. This may also be done on the left side of the sentence.

Fig. 26

Fig. 26. How to apply the filter to the last words of the sentence.

The C vector is going to be a long (n − h + 1)-dimensional vector, and it will have a different length for sentences with different numbers of words. However, if we want to plug it into a softmax classifier, it needs to be a fixed-dimensional vector. Because of this variable-length vector, we eventually want a fixed-dimensional feature vector representing the whole sentence. To do so, a new type of building block called a pooling operator or pooling layer will be introduced. In particular, we apply a max-over-time pooling layer (or max-pooling layer). The idea is to capture the most important activation. As there are different elements computed for every window, we hope that the inner product is large enough for a filter when it sees a certain kind of phrase. Assume that the word vectors are relatively normalized. Then it is desirable to have a large cosine similarity between the filter and a certain pattern, e.g., positive words or phrases, and only one filter will be good at picking up that pattern. This will be captured by the largest $c_i$, which has a very large activation for that particular filter w. Consequently, we compute $\hat{c} = \max C$, and the rest of the sentence can be ignored. It is going to be able to pick out one particular bigram very accurately. The problem here is that ĉ is just a single number out of all the elements in the C vector. However, in addition to just one particular type of bigram or trigram, we need to extract more features; hence, we are going to have multiple filters w and convolve with all of them. As we train this model, we hope that some of the filters will be very active and have very large inner products with particular types of bigrams or trigrams.

We can use multiple different window sizes, and at each time step we will max-pool to get a single number for that filter for that sentence. It should be noted that, since we use a random initialization, when we apply different filters of different or the same length they learn different features. In other words, as we apply SGD, different filters will move and start to pick up different patterns in order to maximize the overall objective function, as a result of the random initialization.

It is worth mentioning that several studies explore different pooling schemes, and there is no formal mathematical reason why one works better. However, with max pooling we can intuitively see that we try to fire when a specific type of N-gram is observed, and this signal is passed to the next higher layer. Thus, using a single value (the result of the max pool) is better than other approaches like averaging all values in C (averaging may also wash out the strong signal that we get from one particular unigram, bigram, or trigram).

Now we are going to discuss another idea that combines the concept of word vectors with some extensions: instead of representing the sentence only as a single concatenation of all of the word vectors, we start with two copies of the sentence. Then, we are going to backpropagate into only one set and keep the other "static." In order to explain this, recall that word vectors can be trained on a very large unsupervised corpus, so they capture semantic similarities. Now, if you start backpropagating your specific task into the word vectors, they will start to move around when you see that word vector in your supervised classification problem in that dataset. This means that, as you push certain vectors that you see in your training dataset somewhere else, the vectors that you do not see stay where they are and might be misclassified if they only appear in the test set. Consequently, by having these two channels, we try to keep some of the goodness of the first copy of the word vectors, which are tuned to be really good on the task, while the second set of word vectors stays where it is, retaining the proper general semantic similarities in vector space that we get from the unsupervised word vectors.

Both of these channels are going to be added into each of the $c_i$'s before applying the max-pool; hence we will pool over both channels.

The final model, which is the simplest one, just concatenates all the $\hat{c}_i$'s to obtain the final feature vector, $z = [\hat{c}_1, \ldots, \hat{c}_m]$, where m is the number of filters. Then we plug z directly into a softmax and train $y = \mathrm{softmax}(W^{(S)} z + b)$ with the standard logistic regression cross-entropy error. Note that by using two copies of word vectors we certainly double the memory requirement of the model. However, it is only the second copy of word vectors that we are going to backpropagate into for the task.

The following is a graphical description of the model (Kim, 2014) (Fig. 27). Here, we have n words, and each word has k dimensions. This particular model shows us two applications of a bigram filter (shown with red lines) and one of a trigram filter (shown with yellow lines); then, they are max-pooled to a single number. For each of the filters, we obtain one long set of features, and then we get a single number after max-pooling over all the activations.

Fig. 27

Fig. 27. Graphical description of a CNN for n words with k dimensions.
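To tie the pieces together, here is a compact NumPy sketch of the single-layer model described above (window features with a nonlinearity, max-over-time pooling, then a softmax classifier); all dimensions and the tanh/softmax choices are illustrative assumptions in the spirit of Kim (2014), not the authors' exact setup.

```python
import numpy as np

k, n, h, m, n_classes = 4, 7, 3, 5, 2      # word dim, sentence length, window, #filters, classes
rng = np.random.default_rng(0)

X = rng.normal(size=(n, k))                # word vectors x_1 ... x_n
W = rng.normal(size=(m, h * k))            # m convolutional filters, each over h concatenated words
b = np.zeros(m)

# Feature maps: c_i = f(w^T x_{i:i+h-1} + b) at every window position
windows = np.stack([X[i:i + h].ravel() for i in range(n - h + 1)])   # (n-h+1, h*k)
C = np.tanh(windows @ W.T + b)                                       # (n-h+1, m)

# Max-over-time pooling: one number per filter, independent of sentence length
z = C.max(axis=0)                                                    # (m,)

# Softmax classifier on the pooled feature vector z
W_s = rng.normal(size=(n_classes, m))
b_s = np.zeros(n_classes)
scores = W_s @ z + b_s
y = np.exp(scores - scores.max()); y /= y.sum()
print(C.shape, z.shape, y)                 # (5, 5) (5,) class probabilities
```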

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S016971611830021X