CN114898360B - Food material image classification model establishment method based on attention and depth feature fusion - Google Patents
- Publication number
- CN114898360B (application CN202210342846.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- convolution
- dimension
- network
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/045 — Neural networks; architecture: combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a method for building a food material image classification model based on attention and depth feature fusion. The method comprises: collecting food material image data, including historical image data and image data to be classified; embedding squeeze-and-excitation (SE) attention into parallel ResNet and DenseNet121 networks, the two networks then forming a parallel attention feature extraction network to extract food material image features; feeding the extracted features into a depth feature fusion module to further extract deep food material features; and building a food material image classification model that classifies the images to obtain the food material types. By embedding an attention mechanism in the feature extraction network, the extracted features focus more on the local details of the food materials and the sub-network features discriminate food materials better, while classification accuracy is effectively improved. In the feature fusion, network parameters are greatly reduced, and the residual addition of features prevents the gradient from vanishing as the network deepens, making food material classification efficient and fast.
Description
Technical Field
The invention belongs to the field of image processing, and particularly relates to a method for building a food material image classification model based on attention and depth feature fusion.
Background
With the improvement of residents' living standards, people's health awareness has risen markedly, and expectations and demands for a healthy diet keep growing. Accurate diet assessment is an important way to evaluate the effectiveness of food nutrition pairing. At present, food suppliers rely mainly on manual methods to sort and evaluate food materials, but this process is extremely tedious, laborious, expensive and subjective. With the rapid development of the Internet industry and multimedia technology, image classification research in the food material field has drawn increasing attention in multimedia analysis and applications, but existing food material classification algorithms suffer from problems such as indistinct extracted features and low classification accuracy, and cannot meet people's needs.
Disclosure of Invention
The invention aims to provide a method for building a food material image classification model based on attention and depth feature fusion, which enables the network to extract more discriminative food material features, effectively fuses the depth features of two different networks, and improves the accuracy of food material classification.
The invention provides a method for building a food material image classification model based on attention and depth feature fusion, comprising the following steps:
S1, collecting food material image data, comprising historical image data and image data to be classified;
S2, embedding squeeze-and-excitation (SE) attention into parallel ResNet and DenseNet121 networks, the two networks then forming a parallel attention feature extraction network to extract food material image features;
S3, feeding the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features;
S4, building a food material image classification model and classifying the image data to be classified to obtain the food material types.
In step S2, the ResNet network comprises a Res Block structural block and a first SE attention layer; the Res Block comprises a first convolution layer, a first pooling layer and a first activation layer. The DenseNet121 network comprises a Dense Block structural block and a second SE attention layer; the Dense Block comprises a second convolution layer, a second pooling layer and a second activation layer.
The SE attention layer comprises: encoding the spatial features of each channel into one global feature, using global average pooling to output the value distribution of the layer's c feature maps; applying a sigmoid-form gating mechanism so that the network learns the nonlinear relations among all channels; and using a bottleneck structure of two fully connected layers to reduce the feature dimension, with the reduction ratio r as a hyperparameter, followed by ReLU (Rectified Linear Unit) activation; finally, the learned activation value of each channel is multiplied by the original feature to obtain the final attention feature map y.
The SE attention layer specifically comprises: encoding the spatial features of each channel into one global feature, using global average pooling to output the value distribution of the layer's c feature maps,

z_c = F_GAP(u_c)

where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel.

A sigmoid-form gating mechanism lets the network learn the nonlinear relations among all channels:

s_c = σ(g(z_c, W))

where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; W denotes a network weight; g(·) denotes the bottleneck (gating) transform; σ(·) denotes the Sigmoid activation function.

A bottleneck structure of two fully connected layers is used to reduce the feature dimension, with the reduction ratio r as a hyperparameter, followed by ReLU activation; finally, the learned activation value of each channel is multiplied by the original feature to obtain the final attention feature map y:

y_c = s_c · u_c

where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the subnet-1 feature F_in1 and the subnet-2 feature F_in2.
Step S3 comprises the following steps:
A1. Input the subnet-1 feature and the subnet-2 feature, and concatenate them along the third (channel) dimension into the concatenated subnet feature F;
A2. Feed F into the 1st branch: apply 3×3 average pooling to obtain the pooled feature F_1, then a 1×1 convolution compressing the feature dimension to 1024, giving F'_1, the feature obtained by passing the average-pooled feature F_1 through the 1×1 convolution kernel;
A3. Feed F into the 2nd branch: first reduce the dimension to 512 with a 3×3 convolution, then apply 1×3 and 3×1 asymmetric convolutions on two sub-branches, and finally merge by concatenation, raising the dimension from 512 back to 1024, giving F_2, F_21, F_22 and F'_2, where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution; F_21 denotes F_2 after the 1×3 asymmetric convolution; F_22 denotes F_2 after the 3×1 asymmetric convolution; F'_2 denotes F_21 and F_22 concatenated along the 3rd dimension;
A4. Feed F into the 3rd branch: first a 3×3 dilated convolution with rate 2, then 1×3 and 3×1 asymmetric convolutions, giving F_3, F_31 and F_32, where F_3 denotes F after the 3×3 dilated convolution with rate 2; F_31 denotes F_3 after the 1×3 asymmetric convolution; F_32 denotes F_3 after the 3×1 asymmetric convolution;
A5. Feed F into the 4th branch: first a 3×3 dilated convolution with rate 3, then 1×3 and 3×1 asymmetric convolutions, giving F_4, F_41 and F_42, where F_4 denotes F after the 3×3 dilated convolution with rate 3; F_41 denotes F_4 after the 1×3 asymmetric convolution; F_42 denotes F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then reduce to 1024 with a 1×1 convolution, giving F'_3 and F''_3, where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 denotes F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution, giving F''_2 and F'''_2, where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 denotes F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out is the sum of F'_1 and F'''_2, with dimension 1024.
Step S3 specifically comprises the following steps:
A1. Input the subnet-1 feature F_in1 and the subnet-2 feature F_in2, and concatenate them along the third dimension into a tensor of size 7×7×n: F = Concat[F_in1, F_in2, axis=3], where F denotes the concatenated subnet feature; Concat[·] denotes the concatenation operation; axis denotes the feature dimension;
A2. Feed the concatenated subnet feature F into the 1st branch and apply 3×3 average pooling to obtain F_1 = Avg_Pool(F, pool_size=[3,3]), where Avg_Pool(·) denotes the average-pooling operation and pool_size denotes the pooling kernel size; then apply a 1×1 convolution compressing the feature dimension to 1024: F'_1 = Conv2d(F_1, fs=1024, size=[1,1]), where F'_1 denotes the feature obtained by passing the average-pooled feature F_1 through the 1×1 convolution kernel; Conv2d(·) denotes a two-dimensional convolution; fs denotes the feature dimension; size denotes the convolution kernel size;
A3. Feed F into the 2nd branch: first reduce the dimension to 512 with a 3×3 convolution, then apply 1×3 and 3×1 asymmetric convolutions on two sub-branches, and finally merge by concatenation, raising the dimension from 512 back to 1024. F_2 = Conv2d(F, fs=512, size=[3,3]) denotes the 512-dimensional feature of F after the 3×3 convolution; F_21 = Conv2d(F_2, size=[1,3]) denotes F_2 after the 1×3 asymmetric convolution; F_22 = Conv2d(F_2, size=[3,1]) denotes F_2 after the 3×1 asymmetric convolution; F'_2 = Concat[F_21, F_22, axis=3] denotes F_21 and F_22 concatenated along the 3rd dimension;
A4. Feed F into the 3rd branch: first a 3×3 dilated convolution with rate 2, then 1×3 and 3×1 asymmetric convolutions. F_3 = Dilated_Conv(F, ratio=2, fs=512, size=[3,3]) denotes F after the 3×3 dilated convolution with rate 2, where ratio denotes the dilation rate and Dilated_Conv(·) denotes a dilated convolution; F_31 = Conv2d(F_3, size=[1,3]) denotes F_3 after the 1×3 asymmetric convolution; F_32 = Conv2d(F_3, size=[3,1]) denotes F_3 after the 3×1 asymmetric convolution;
A5. Feed F into the 4th branch: first a 3×3 dilated convolution with rate 3, then 1×3 and 3×1 asymmetric convolutions. F_4 = Dilated_Conv(F, ratio=3, fs=512, size=[3,3]) denotes F after the 3×3 dilated convolution with rate 3; F_41 = Conv2d(F_4, size=[1,3]) denotes F_4 after the 1×3 asymmetric convolution; F_42 = Conv2d(F_4, size=[3,1]) denotes F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then reduce to 1024 with a 1×1 convolution: F'_3 = Concat[F_31, F_32, F_41, F_42, axis=3] denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 = Conv2d(F'_3, fs=1024, size=[1,1]) denotes F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution: F''_2 = Add[F'_2, F''_3] denotes the element-wise sum of F'_2 and F''_3; F'''_2 = Conv2d(F''_2, fs=1024, size=[3,3]) denotes F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out (7×7×1024) is the sum of F'_1 and F'''_2, with dimension 1024: F_out = Add[F'_1, F'''_2].
With the method for building a food material image classification model based on attention and depth feature fusion provided by the invention, the attention mechanism embedded in the feature extraction network makes the extracted features focus more on the local details of the food materials, so the sub-network features discriminate food materials better. The depth feature fusion module effectively fuses the features extracted by the two individual networks, combining complementary depth features from different sub-networks into a feature with stronger representational power, which effectively improves classification accuracy. In the feature fusion, asymmetric and dilated convolutions are used instead of ordinary convolutions, greatly reducing network parameters, while a residual structure adds the features to prevent the gradient from vanishing as the network deepens, making food material classification efficient and fast.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a depth feature fusion module according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the invention. The invention provides a method for building a food material image classification model based on attention and depth feature fusion, comprising the following steps:
S1, collecting food material image data, comprising historical image data and image data to be classified;
S2, embedding squeeze-and-excitation (SE) attention into parallel ResNet and DenseNet121 networks, the two networks then forming a parallel attention feature extraction network to extract food material image features;
S3, feeding the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features;
S4, building a food material image classification model and classifying the image data to be classified to obtain the food material types.
In food material classification the differences between categories are very fine, and the discriminative information often lies in local regions of the image: for example, banana and plantain, or shallot and garlic, look so similar that the human eye cannot tell them apart at a glance and can only distinguish them by subtle differences in shape and color. For food material classification and recognition it is therefore essential to capture discriminative local features, so the SE attention mechanism is embedded into the ResNet network (sub-feature-extraction network 1) and the DenseNet network (sub-feature-extraction network 2) to extract better local features.
Fig. 2 is a schematic flow chart of an embodiment of the invention. In step S2, the ResNet network comprises a Res Block structural block and a first SE attention layer; the Res Block comprises a first convolution layer, a first pooling layer, a first activation layer, etc. The DenseNet121 network comprises a Dense Block structural block and a second SE attention layer; the Dense Block comprises a second convolution layer, a second pooling layer, a second activation layer, etc.
SE attention focuses on the relations among channels and automatically learns the importance of each channel. SE attention first encodes the spatial features of each channel into one global feature, using global average pooling to output the value distribution z_c of the layer's c-th channel feature map,

z_c = F_GAP(u_c)

where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel.

A sigmoid-form gating mechanism lets the network learn the nonlinear relations among all channels:

s_c = σ(g(z_c, W))

where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; W denotes a network weight; g(·) denotes the bottleneck (gating) transform; σ(·) denotes the Sigmoid activation function.

To reduce model complexity and improve generalization, a bottleneck structure of two fully connected layers reduces the feature dimension, with the reduction ratio r as a hyperparameter, followed by ReLU (Rectified Linear Unit) activation; finally, the learned activation value of each channel is multiplied by the original feature to obtain the final attention feature map y:

y_c = s_c · u_c

where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the subnet-1 feature F_in1 (7×7×n1) and the subnet-2 feature F_in2 (7×7×n2).
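A minimal PyTorch sketch of the SE attention layer described above. The class name and the reduction ratio r=16 are my choices for illustration; the patent only fixes r as a hyperparameter:

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation attention sketch.

    Global average pooling squeezes each channel map u_c to one value z_c,
    a two-fully-connected-layer bottleneck (reduction ratio r) with ReLU and
    a sigmoid gate learns per-channel activations s_c, and the input is
    rescaled channel-wise: y_c = s_c * u_c.
    """
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # F_GAP
        self.excite = nn.Sequential(                  # bottleneck of two FCs
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                             # sigmoid gate sigma(.)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                # z_c
        s = self.excite(z).view(b, c, 1, 1)           # s_c
        return u * s                                  # y_c = s_c * u_c

x = torch.randn(2, 64, 7, 7)
y = SEAttention(64)(x)
print(tuple(y.shape))  # (2, 64, 7, 7) -- rescaling preserves the shape
```

The same module can be appended after a Res Block or Dense Block stage to play the role of the first and second SE attention layers.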
Fig. 3 is a schematic diagram of the depth feature fusion module according to an embodiment of the invention. Traditional feature fusion concatenates features along the third dimension, i.e. stacks channels, which hinders the flow of feature information; directly using the concatenated features from the two parallel subnets for subsequent inference would also cause problems such as high feature dimension and high computational cost. To improve the expressive power of the network and achieve better food material image recognition, a novel depth feature fusion module is proposed, as shown in Fig. 3. It merges the features extracted by the two parallel sub-networks and learns more complementary features, improving the classification accuracy of food material images. For reasonable use of computing resources, small convolution kernels (1×1, 1×3 and 3×1) are mainly adopted, which reduces the consumption of computing resources; meanwhile, to obtain richer image features, multiple groups of convolution operations are executed in parallel.
Step S3 comprises the following steps:
A1. Input the subnet-1 feature F_in1 (7×7×n1) and the subnet-2 feature F_in2 (7×7×n2), and concatenate them along the third dimension into a tensor of size 7×7×n: F = Concat[F_in1, F_in2, axis=3], where F denotes the concatenated subnet feature; Concat[·] denotes the concatenation operation; axis denotes the feature dimension;
A2. Feed the concatenated subnet feature F into the 1st branch and apply 3×3 average pooling to obtain F_1 = Avg_Pool(F, pool_size=[3,3]), where Avg_Pool(·) denotes the average-pooling operation and pool_size denotes the pooling kernel size; then apply a 1×1 convolution compressing the feature dimension to 1024: F'_1 = Conv2d(F_1, fs=1024, size=[1,1]), where F'_1 denotes the feature obtained by passing the average-pooled feature F_1 through the 1×1 convolution kernel; Conv2d(·) denotes a two-dimensional convolution; fs denotes the feature dimension; size denotes the convolution kernel size;
A3. Feed F into the 2nd branch: first reduce the dimension to 512 with a 3×3 convolution, then apply 1×3 and 3×1 asymmetric convolutions on two sub-branches, and finally merge by concatenation, raising the dimension from 512 back to 1024. F_2 = Conv2d(F, fs=512, size=[3,3]) denotes the 512-dimensional feature of F after the 3×3 convolution; F_21 = Conv2d(F_2, size=[1,3]) denotes F_2 after the 1×3 asymmetric convolution; F_22 = Conv2d(F_2, size=[3,1]) denotes F_2 after the 3×1 asymmetric convolution; F'_2 = Concat[F_21, F_22, axis=3] denotes F_21 and F_22 concatenated along the 3rd dimension;
A4. Feed F into the 3rd branch: first a 3×3 dilated convolution with rate 2, then 1×3 and 3×1 asymmetric convolutions. F_3 = Dilated_Conv(F, ratio=2, fs=512, size=[3,3]) denotes F after the 3×3 dilated convolution with rate 2, where ratio denotes the dilation rate and Dilated_Conv(·) denotes a dilated convolution; F_31 = Conv2d(F_3, size=[1,3]) denotes F_3 after the 1×3 asymmetric convolution; F_32 = Conv2d(F_3, size=[3,1]) denotes F_3 after the 3×1 asymmetric convolution;
A5. Feed F into the 4th branch: first a 3×3 dilated convolution with rate 3, then 1×3 and 3×1 asymmetric convolutions. F_4 = Dilated_Conv(F, ratio=3, fs=512, size=[3,3]) denotes F after the 3×3 dilated convolution with rate 3; F_41 = Conv2d(F_4, size=[1,3]) denotes F_4 after the 1×3 asymmetric convolution; F_42 = Conv2d(F_4, size=[3,1]) denotes F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then reduce to 1024 with a 1×1 convolution: F'_3 = Concat[F_31, F_32, F_41, F_42, axis=3] denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 = Conv2d(F'_3, fs=1024, size=[1,1]) denotes F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution: F''_2 = Add[F'_2, F''_3] denotes the element-wise sum of F'_2 and F''_3; F'''_2 = Conv2d(F''_2, fs=1024, size=[3,3]) denotes F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out (7×7×1024) is the sum of F'_1 and F'''_2, with dimension 1024: F_out = Add[F'_1, F'''_2].
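Steps A1-A8 above can be sketched in PyTorch as follows. Note that the text indexes channels as axis 3 (NHWC layout), while PyTorch uses NCHW, so concatenation here is along dim=1; the padding values and the input channel counts in the usage line are my assumptions, chosen to keep the 7×7 spatial size:

```python
import torch
import torch.nn as nn

class DeepFeatureFusion(nn.Module):
    """Sketch of the four-branch depth feature fusion module (steps A1-A8).
    Channel widths (512/1024/2048) follow the text; padding is assumed."""
    def __init__(self, in_ch: int):
        super().__init__()
        # branch 1: 3x3 average pooling, then 1x1 conv compressing to 1024
        self.pool1 = nn.AvgPool2d(3, stride=1, padding=1)
        self.conv1 = nn.Conv2d(in_ch, 1024, 1)
        # branch 2: 3x3 conv down to 512, then 1x3 / 3x1 asymmetric convs
        self.b2 = nn.Conv2d(in_ch, 512, 3, padding=1)
        self.b2_13 = nn.Conv2d(512, 512, (1, 3), padding=(0, 1))
        self.b2_31 = nn.Conv2d(512, 512, (3, 1), padding=(1, 0))
        # branch 3: dilated 3x3 conv (rate 2), then asymmetric convs
        self.b3 = nn.Conv2d(in_ch, 512, 3, padding=2, dilation=2)
        self.b3_13 = nn.Conv2d(512, 512, (1, 3), padding=(0, 1))
        self.b3_31 = nn.Conv2d(512, 512, (3, 1), padding=(1, 0))
        # branch 4: dilated 3x3 conv (rate 3), then asymmetric convs
        self.b4 = nn.Conv2d(in_ch, 512, 3, padding=3, dilation=3)
        self.b4_13 = nn.Conv2d(512, 512, (1, 3), padding=(0, 1))
        self.b4_31 = nn.Conv2d(512, 512, (3, 1), padding=(1, 0))
        # merge: 2048 -> 1024 via 1x1 conv, residual add, final 3x3 conv
        self.reduce34 = nn.Conv2d(2048, 1024, 1)
        self.conv_out = nn.Conv2d(1024, 1024, 3, padding=1)

    def forward(self, fin1: torch.Tensor, fin2: torch.Tensor) -> torch.Tensor:
        f = torch.cat([fin1, fin2], dim=1)                   # A1: F
        f1_out = self.conv1(self.pool1(f))                   # A2: F'_1
        f2 = self.b2(f)                                      # A3: F_2
        f2_out = torch.cat([self.b2_13(f2), self.b2_31(f2)], dim=1)  # F'_2
        f3 = self.b3(f)                                      # A4: F_3
        f4 = self.b4(f)                                      # A5: F_4
        f34 = torch.cat([self.b3_13(f3), self.b3_31(f3),
                         self.b4_13(f4), self.b4_31(f4)], dim=1)     # A6: F'_3
        f34 = self.reduce34(f34)                             # F''_3
        f_mid = self.conv_out(f2_out + f34)                  # A7: F'''_2
        return f1_out + f_mid                                # A8: F_out

# hypothetical channel counts, e.g. ResNet50 (2048) and DenseNet121 (1024)
m = DeepFeatureFusion(in_ch=2048 + 1024)
out = m(torch.randn(1, 2048, 7, 7), torch.randn(1, 1024, 7, 7))
print(tuple(out.shape))  # (1, 1024, 7, 7)
```

All branches preserve the 7×7 spatial size, so the residual additions in A7 and A8 are shape-compatible, matching the 7×7×1024 output stated in A8.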
In this example, the method was tested on the Food-41 food material dataset. Food-41 was collected from Mealcome (MLC dataset), a large food supply-chain platform in China. It contains 4100 images of 41 food ingredient types, split into three subsets: a 60% training set, a 20% validation set and a 20% test set. The experiments use the PyTorch deep learning platform; the loss function is multi-class cross-entropy loss (Categorical Cross Entropy Loss); network weights are optimized with a stochastic gradient descent (SGD) optimizer; the learning rate γ is multiplied by the decay coefficient α after each decay-step iteration; the base learning rate is set to 0.001, the decay coefficient to 0.94, the momentum to 0.9, and the number of iterations (epochs) to 30. First, the classical networks VGG16, ResNet50, InceptionV3, DenseNet121 and MobileNetV2, pre-trained on ImageNet, are selected for fine-tuning (the last fully connected layer of each network is rebuilt from 1000 outputs for ImageNet to 41 for Food-41). Each is trained for 30 epochs on the Food-41 training set and evaluated on the Food-41 test set, averaging 3 test runs for comparison. Then the three networks with the highest accuracy, ResNet50, InceptionV3 and DenseNet121, are taken two at a time as the feature extraction networks (ResNet50-InceptionV3, InceptionV3-DenseNet121 and ResNet50-DenseNet121), the network model based on attention and depth feature fusion is built, trained for 30 epochs on the Food-41 training set, and evaluated on the Food-41 test set by averaging 3 test runs.
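The training configuration above (SGD, base learning rate 0.001, momentum 0.9, learning-rate decay factor 0.94) can be sketched in PyTorch as follows; the placeholder classifier head and the decay step of one epoch are assumptions, since the text does not give the decay step length:

```python
import torch
from torch import nn, optim

# hypothetical head: any classifier mapping fused 7x7x1024 features to 41 classes
model = nn.Sequential(nn.Flatten(), nn.Linear(7 * 7 * 1024, 41))

criterion = nn.CrossEntropyLoss()  # multi-class cross-entropy loss
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# multiply the learning rate by 0.94 after each decay step (assumed: 1 epoch)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.94)

features = torch.randn(4, 1024, 7, 7)       # dummy fused features
labels = torch.randint(0, 41, (4,))         # dummy class labels in [0, 41)
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
scheduler.step()                            # lr becomes 0.001 * 0.94
print(round(optimizer.param_groups[0]["lr"], 6))  # 0.00094
```

In the real experiments this inner step would run over the Food-41 training set for 30 epochs, with the same optimizer and scheduler settings.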
TABLE 1
Classical network model | Accuracy (%) |
---|---|
VGG16 | 90.60 |
ResNet50 | 94.68 |
InceptionV3 | 93.90 |
DenseNet121 | 93.98 |
MobileNetV2 | 93.42 |
TABLE 2
Table 1 shows the experimental results on the Food-41 test set after fine-tuning the classical networks. Table 2 shows the comparative experimental results of the proposed method on the Food-41 test set. As shown in Tables 1 and 2, the experimental results indicate that the method achieves a better effect on the Food-41 dataset. ResNet50-DenseNet121 in Table 2 reaches the maximum of 95.73%, and the models constructed from a parallel attention feature extraction network consisting of two networks achieve higher accuracy than any single network; ResNet50-InceptionV3 in Table 2 shows an improvement of up to 1.89% over InceptionV3 alone. In this method, squeeze-excitation attention embedded in each sub-feature extraction network makes the model focus better on local detail features of food material images, and the features extracted by the two sub-networks are then fused into a feature with stronger representational capability, thereby improving the accuracy of food material image classification.
Claims (1)
1. A food material image classification model building method based on attention and depth feature fusion is characterized by comprising the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding squeeze-excitation attention into a ResNet network and a DenseNet network which are parallel, the two networks then forming a parallel attention feature extraction network to extract food material image features;
In particular implementations, the ResNet network includes a Res Block structural block and a first SE attention layer; the Res Block structural block comprises a first convolution layer, a first pooling layer and a first activation layer; the DenseNet121 network includes a Dense Block structural block and a second SE attention layer; the Dense Block structural block comprises a second convolution layer, a second pooling layer and a second activation layer;
Wherein the SE attention layer comprises: encoding the spatial features on each channel into a global feature, and adopting global average pooling to output the numerical distribution of the c feature maps of the layer; adopting a gate mechanism in sigmoid form so that the network learns the nonlinear relations among the channels; adopting a bottleneck structure comprising two fully connected layers to reduce the feature dimension, wherein the dimension reduction coefficient r is a hyperparameter, then activating with a ReLU function, and finally multiplying the learned activation value of each channel by the original feature to obtain the final attention feature map y;
Specifically, the SE attention layer includes: encoding the spatial features on each channel into a global feature, and outputting the numerical distribution of the c feature maps of the layer by global average pooling,
z_c = F_GAP(u_c)
wherein F_GAP(·) represents global average pooling; u_c represents the original feature map of the c-th channel;
Adopting a gate mechanism in sigmoid form so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
wherein s_c represents the activation value of the c-th channel; z_c represents the numerical distribution of the c-th channel feature map; w represents the network weights; g(·) represents a pooling function; σ(·) represents the sigmoid activation function;
Adopting a bottleneck structure comprising two fully connected layers to reduce the feature dimension, wherein the dimension reduction coefficient r is a hyperparameter, then activating with a ReLU function, and finally multiplying the learned activation value of each channel by the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
wherein s_c represents the activation value of the c-th channel; u_c represents the original feature map of the c-th channel; this yields the subnet-1 feature F_in1 and the subnet-2 feature F_in2;
S3, inputting the features extracted by the parallel ResNet network and DenseNet121 network into a deep feature fusion module to further extract deep food material features; specifically comprising the following steps:
A1. Inputting the subnet-1 feature and the subnet-2 feature, and splicing them in the third dimension into a spliced subnet feature F;
A2. Inputting the spliced subnet feature F into the 1st branch, performing 3×3 average pooling to obtain the average-pooled feature F1, then performing a 1×1 convolution to compress the feature dimension to 1024, obtaining F′1, the feature of the average-pooled feature F1 after the 1×1 convolution kernel;
A3. Inputting the spliced subnet feature F into the 2nd branch, first using a 3×3 convolution to reduce the dimension to 512, then performing 1×3 and 3×1 asymmetric convolutions on two sub-branches respectively, and finally converging and splicing to raise the dimension from 512 back to 1024, obtaining F2, F21, F22 and F′2, wherein F2 represents the feature, of dimension 512, of the spliced subnet feature F after the 3×3 convolution kernel; F21 represents the feature of F2 after the 1×3 asymmetric convolution; F22 represents the feature of F2 after the 3×1 asymmetric convolution; F′2 represents the feature of F21 and F22 after channel splicing in the 3rd dimension;
A4. Inputting the spliced subnet feature F into the 3rd branch, first applying a 3×3 dilated (hole) convolution with dilation rate 2, then performing 1×3 and 3×1 asymmetric convolutions respectively, obtaining F3, F31 and F32, wherein F3 represents the feature of the spliced subnet feature F after the 3×3 dilated convolution with rate 2; F31 represents the feature of F3 after the 1×3 asymmetric convolution; F32 represents the feature of F3 after the 3×1 asymmetric convolution;
A5. Inputting the spliced subnet feature F into the 4th branch, first applying a 3×3 dilated convolution with dilation rate 3, then performing 1×3 and 3×1 asymmetric convolutions respectively, obtaining F4, F41 and F42, wherein F4 represents the feature of the spliced subnet feature F after the 3×3 dilated convolution with rate 3; F41 represents the feature of F4 after the 1×3 asymmetric convolution; F42 represents the feature of F4 after the 3×1 asymmetric convolution;
A6. Splicing F31, F32, F41 and F42 in the 3rd dimension to dimension 2048, then performing a 1×1 convolution to reduce the dimension to 1024, obtaining F′3 and F″3, wherein F′3 represents the multi-branch splice of F31, F32, F41 and F42 in the third dimension; F″3 represents the feature of F′3 after the 1×1 convolution dimension reduction;
A7. Adding F′2 and F″3 and then performing a 3×3 convolution, obtaining F″2 and F‴2, wherein F″2 represents the feature obtained by element-wise addition of F′2 and F″3; F‴2 represents the feature of F″2 after the 3×3 convolution;
A8. The finally output fusion feature F_out is obtained by adding F′1 and F‴2, with dimension 1024;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain food material types.
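The SE attention layer described in step S2 follows the standard squeeze-and-excitation design and might be sketched in PyTorch as below; the reduction ratio r=16 is a common default, while the claim only states that r is a hyperparameter:

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Sketch of the squeeze-excitation attention layer of step S2."""
    def __init__(self, channels, r=16):
        super().__init__()
        # Squeeze: global average pooling, z_c = F_GAP(u_c)
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excite: two-FC bottleneck (reduce by r, ReLU, restore), sigmoid gate
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                     # s_c = sigma(g(z_c, w))
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)        # per-channel global statistics
        s = self.excite(z).view(b, c, 1, 1)   # learned channel activation values
        return u * s                          # y_c = s_c * u_c
```

Because the gate values s_c lie in (0, 1), the output rescales each channel of the original feature map without changing its spatial dimensions.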
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210342846.8A CN114898360B (en) | 2022-03-31 | 2022-03-31 | Food material image classification model establishment method based on attention and depth feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114898360A CN114898360A (en) | 2022-08-12 |
CN114898360B true CN114898360B (en) | 2024-04-26 |
Family
ID=82715937
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488301A (en) * | 2020-12-09 | 2021-03-12 | 孙成林 | Food inversion method based on multitask learning and attention mechanism |
CN113486981A (en) * | 2021-07-30 | 2021-10-08 | 西安电子科技大学 | RGB image classification method based on multi-scale feature attention fusion network |
CN113887410A (en) * | 2021-09-30 | 2022-01-04 | 杭州电子科技大学 | Deep learning-based multi-category food material identification system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109389078B (en) * | 2018-09-30 | 2022-06-21 | 京东方科技集团股份有限公司 | Image segmentation method, corresponding device and electronic equipment |
Non-Patent Citations (1)
Title |
---|
Image classification algorithm using fused convolutional networks; Li Cong et al.; Computer Engineering & Science; 2019-12-15 (No. 12); pp. 89-96 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||