CN114898360B - Food material image classification model establishment method based on attention and depth feature fusion - Google Patents

Food material image classification model establishment method based on attention and depth feature fusion

Info

Publication number
CN114898360B
CN114898360B
Authority
CN
China
Prior art keywords
feature
convolution
dimension
network
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210342846.8A
Other languages
Chinese (zh)
Other versions
CN114898360A (en)
Inventor
潘丽丽
马俊勇
雷前慧
蒋湘辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University of Forestry and Technology
Original Assignee
Central South University of Forestry and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University of Forestry and Technology filed Critical Central South University of Forestry and Technology
Priority to CN202210342846.8A
Publication of CN114898360A
Application granted
Publication of CN114898360B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for building a food material image classification model based on attention and depth feature fusion. The method comprises: collecting food material image data, including historical image data and image data to be classified; embedding compressed excitation (squeeze-and-excitation, SE) attention into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network that extracts food material image features; inputting the extracted features into a depth feature fusion module to further extract deep food material features; and establishing a food material image classification model that classifies the images to obtain the food material types. By embedding the attention mechanism in the feature extraction networks, the extracted features focus more on the local details of the food materials and the sub-network features become more discriminative for food material classification, which effectively improves classification accuracy. In the feature fusion, network parameters are greatly reduced, and residual-style feature addition prevents gradients from vanishing as the network deepens, making food material classification efficient and fast.

Description

Food material image classification model establishment method based on attention and depth feature fusion
Technical Field
The invention belongs to the field of image processing, and particularly relates to a food material image classification model building method based on attention and depth feature fusion.
Background
As living standards improve, people's health awareness has grown markedly, and expectations and demands for a healthy diet keep rising. Accurate dietary assessment is an important way to evaluate the effectiveness of nutritional food combinations. Food suppliers currently rely mainly on manual methods to sort and evaluate food materials, but this process is tedious, labor-intensive, expensive and subjective. With the rapid development of the Internet industry and multimedia technology, image classification research in the food material field has received increasing attention in multimedia analysis and applications. However, existing food material classification algorithms suffer from problems such as indistinct extracted features and low classification accuracy, and cannot meet practical needs.
Disclosure of Invention
The invention aims to provide a method for establishing a food material image classification model based on attention and depth feature fusion, which enables the network to extract more discriminative food material features, effectively fuses the deep features of two different networks, and improves the accuracy of food material classification.
The invention provides a food material image classification model establishment method based on attention and depth feature fusion, which comprises the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding compressed excitation attention (Squeeze-and-Excitation Attention, SE) into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network to extract food material image features;
S3, inputting the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain the food material types.
In step S2, the ResNet network comprises a Res Block structure block and a first SE attention layer; the Res Block structure block comprises a first convolution layer, a first pooling layer and a first activation layer. The DenseNet121 network comprises a Dense Block structure block and a second SE attention layer; the Dense Block structure block comprises a second convolution layer, a second pooling layer and a second activation layer.
The SE attention layer includes: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps; adopting a sigmoid-form gate mechanism so that the network learns the nonlinear relations among the channels; and adopting a bottleneck structure with two fully connected layers to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU (Rectified Linear Unit) activation. Finally, the learned activation value of each channel is multiplied with the original feature to obtain the final attention feature map y.
The SE attention layer specifically comprises: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps,
z_c = F_GAP(u_c)
where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel;
A sigmoid-form gate mechanism is adopted so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; w denotes the network weights; g(·) denotes the pooling function; σ(·) denotes the sigmoid activation function;
A bottleneck structure with two fully connected layers is adopted to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU activation; finally, the learned activation value of each channel is multiplied with the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the sub-network 1 feature F_in1 and the sub-network 2 feature F_in2.
Step S3 comprises the following steps:
A1. Input the sub-network 1 feature and the sub-network 2 feature, and concatenate them along the third dimension into the concatenated sub-network feature F;
A2. Feed the concatenated sub-network feature F into the 1st branch: apply 3×3 average pooling to obtain the average-pooled feature F_1, then apply a 1×1 convolution to compress the feature dimension to 1024, obtaining F'_1, the feature of the average-pooled feature F_1 after the 1×1 convolution kernel;
A3. Feed the concatenated sub-network feature F into the 2nd branch: first apply a 3×3 convolution to reduce the dimension to 512, then apply 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merge them by concatenation, raising the dimension from 512 to 1024, obtaining F_2, F_21, F_22 and F'_2, where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution kernel; F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. Feed the concatenated sub-network feature F into the 3rd branch: first apply a 3×3 dilated (hole) convolution with dilation rate 2, then apply 1×3 and 3×1 asymmetric convolutions, obtaining F_3, F_31 and F_32, where F_3 denotes the feature of F after the 3×3 dilated convolution with rate 2; F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. Feed the concatenated sub-network feature F into the 4th branch: first apply a 3×3 dilated convolution with dilation rate 3, then apply 1×3 and 3×1 asymmetric convolutions, obtaining F_4, F_41 and F_42, where F_4 denotes the feature of F after the 3×3 dilated convolution with rate 3; F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then apply a 1×1 convolution to reduce the dimension to 1024, obtaining F'_3 and F''_3, where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution, obtaining F''_2 and F'''_2, where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out is the sum of F'_1 and F'''_2, with feature dimension 1024.
Specifically, step S3 comprises the following steps:
A1. Input the sub-network 1 feature F_in1 and the sub-network 2 feature F_in2, and concatenate them along the third dimension into a feature of size 7×7×n: F = Concat[F_in1, F_in2, axis=3], where F denotes the concatenated sub-network feature; Concat[·] denotes the concatenation operation; axis denotes the feature dimension along which concatenation is performed;
A2. Feed the concatenated sub-network feature F into the 1st branch and apply 3×3 average pooling to obtain F_1 = Avg_Pool(F, pool_size=[3,3]), where Avg_Pool(·) denotes the average pooling operation and pool_size denotes the pooling kernel size; then apply a 1×1 convolution to compress the feature dimension to 1024: F'_1 = Conv2d(F_1, fs=1024, size=[1,1]), where F'_1 denotes the feature of the average-pooled feature F_1 after the 1×1 convolution kernel; Conv2d(·) denotes a two-dimensional convolution operation; fs denotes the feature dimension; size denotes the convolution kernel size;
A3. Feed the concatenated sub-network feature F into the 2nd branch: first apply a 3×3 convolution to reduce the dimension to 512, then apply 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merge them by concatenation, raising the dimension from 512 to 1024: F_2 = Conv2d(F, fs=512, size=[3,3]), where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution kernel; F_21 = Conv2d(F_2, size=[1,3]), where F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 = Conv2d(F_2, size=[3,1]), where F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 = Concat[F_21, F_22, axis=3], where F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. Feed the concatenated sub-network feature F into the 3rd branch: first apply a 3×3 dilated (hole) convolution with dilation rate 2, then apply 1×3 and 3×1 asymmetric convolutions: F_3 = Dilated_Conv(F, ratio=2, fs=512, size=[3,3]), where F_3 denotes the feature of F after the 3×3 dilated convolution with rate 2; ratio denotes the dilation rate; Dilated_Conv(·) denotes the dilated convolution; F_31 = Conv2d(F_3, size=[1,3]), where F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 = Conv2d(F_3, size=[3,1]), where F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. Feed the concatenated sub-network feature F into the 4th branch: first apply a 3×3 dilated convolution with dilation rate 3, then apply 1×3 and 3×1 asymmetric convolutions: F_4 = Dilated_Conv(F, ratio=3, fs=512, size=[3,3]), where F_4 denotes the feature of F after the 3×3 dilated convolution with rate 3; F_41 = Conv2d(F_4, size=[1,3]), where F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 = Conv2d(F_4, size=[3,1]), where F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then apply a 1×1 convolution to reduce the dimension to 1024: F'_3 = Concat[F_31, F_32, F_41, F_42, axis=3], where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 = Conv2d(F'_3, fs=1024, size=[1,1]), where F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution: F''_2 = Add[F'_2, F''_3], where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 = Conv2d(F''_2, fs=1024, size=[3,3]), where F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out (7×7×1024) is the sum of F'_1 and F'''_2, with feature dimension 1024: F_out = Add[F'_1, F'''_2].
According to the method for establishing a food material image classification model based on attention and depth feature fusion provided by the invention, the attention mechanism is embedded into the feature extraction networks, so that the extracted features focus more on the local details of the food materials and the sub-network features become more discriminative for food material classification. The depth feature fusion module effectively fuses the features extracted by the two individual networks, combining complementary deep features of the different sub-networks into a feature with stronger representational power, which effectively improves classification accuracy. In the feature fusion, the convolution kernels include asymmetric convolutions and dilated convolutions instead of ordinary convolutions, which greatly reduces the network parameters; at the same time, a residual-style structure adds the features to prevent gradients from vanishing as the network deepens, making food material classification efficient and fast.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a depth feature fusion module according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a method for establishing a food material image classification model based on attention and depth feature fusion, which comprises the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding compressed excitation attention (Squeeze-and-Excitation Attention, SE) into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network to extract food material image features;
S3, inputting the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain the food material types (an illustrative pipeline sketch is given after these steps).
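By way of illustration only, the sketch below assembles steps S2 to S4 in PyTorch: two parallel backbone feature extractors, a pluggable fusion module, and a linear classification head. It is a simplified sketch under stated assumptions: torchvision's ResNet50 and DenseNet121 feature extractors stand in for the two sub-networks, the SE attention layers of step S2 are omitted here (a separate SE sketch follows the SE description below), the fusion module is passed in as an argument (for example, the depth feature fusion sketch given after steps A1-A8), and the fused feature is assumed to have 1024 channels, consistent with F_out.

```python
import torch
import torch.nn as nn
from torchvision import models

class ParallelAttentionClassifier(nn.Module):
    """Illustrative assembly of steps S2-S4: parallel ResNet50/DenseNet121
    feature extractors, a pluggable fusion module, and a linear classifier."""
    def __init__(self, fusion: nn.Module, num_classes: int = 41):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Drop the average-pool and FC head: output is a (B, 2048, 7, 7) feature map
        self.subnet1 = nn.Sequential(*list(resnet.children())[:-2])
        # DenseNet121 feature extractor: output is a (B, 1024, 7, 7) feature map
        self.subnet2 = models.densenet121(weights=None).features
        self.fusion = fusion                              # e.g. DepthFeatureFusion(2048 + 1024)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(1024, num_classes)    # assumes a 1024-channel fused feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_in1 = self.subnet1(x)                   # sub-network 1 feature F_in1
        f_in2 = self.subnet2(x)                   # sub-network 2 feature F_in2
        f_out = self.fusion(f_in1, f_in2)         # fused deep feature F_out
        return self.classifier(self.pool(f_out).flatten(1))
```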
In food material classification, the differences between categories are often very subtle, and the discriminative information usually lies in local regions of the image. For example, bananas and plantains, or shallots and garlic, are very similar in appearance; the human eye cannot tell them apart at a glance and can only distinguish them through subtle differences in shape and color. For the food material classification and recognition problem it is therefore crucial to capture discriminative local features, and the SE attention mechanism is embedded into the ResNet network (sub-feature extraction network 1) and the DenseNet121 network (sub-feature extraction network 2) to extract better local features.
Fig. 2 is a schematic flow chart of an embodiment of the present invention. In step S2, the ResNet network includes a Res Block structure block and a first SE attention layer; the Res Block structure block includes a first convolution layer, a first pooling layer, a first activation layer, and the like. The DenseNet121 network includes a Dense Block structure block and a second SE attention layer; the Dense Block structure block includes a second convolution layer, a second pooling layer, a second activation layer, and the like.
SE attention focuses on the relations among channels and automatically learns the importance of each channel. SE attention first encodes the spatial features of each channel into a global feature, using global average pooling to output the value distribution z_c of the c-th channel feature map of the layer,
z_c = F_GAP(u_c)
where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel;
A sigmoid-form gate mechanism is adopted so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; w denotes the network weights; g(·) denotes the pooling function; σ(·) denotes the sigmoid activation function;
To reduce model complexity and improve generalization, a bottleneck structure with two fully connected layers is adopted to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU (Rectified Linear Unit) activation; finally, the learned activation value of each channel is multiplied with the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the sub-network 1 feature F_in1 (7×7×n1) and the sub-network 2 feature F_in2 (7×7×n2).
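A minimal PyTorch sketch of such an SE attention layer is shown below. This is an illustrative reconstruction, not the patented code: the reduction ratio of 16 and the bias-free fully connected layers are assumptions (the description only states that r is a hyperparameter).

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation attention: global average pooling, a two-FC
    bottleneck with reduction ratio r, ReLU, a sigmoid gate, and channel-wise
    rescaling of the original feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # z_c = F_GAP(u_c)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                                # s_c = sigma(g(z_c, w))
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                   # squeeze to (B, C)
        s = self.excite(z).view(b, c, 1, 1)              # per-channel activation values
        return u * s                                     # y_c = s_c * u_c
```

Such a layer would sit after the Res Block and Dense Block structure blocks of the two sub-networks, as described for step S2.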
Fig. 3 is a schematic diagram of the depth feature fusion module according to an embodiment of the invention. Traditional feature fusion simply concatenates the features along the third dimension, i.e., stacks the channels, which hinders the flow of feature information; directly using the concatenated features of the two parallel sub-networks for subsequent inference leads to problems such as high feature dimensionality and high computational cost. To improve the expressive power of the network and achieve a better food material image recognition effect, a novel depth feature fusion module is proposed, as shown in Fig. 3. It combines the features extracted by the two parallel sub-networks and learns more complementary features, thereby improving the classification accuracy of food material images. To make reasonable use of computing resources, small convolution kernels (1×1, 1×3 and 3×1) are mainly adopted, which helps reduce computational consumption; at the same time, multiple groups of convolution operations are executed in parallel to obtain richer image features.
Step S3 comprises the following steps:
A1. Input the sub-network 1 feature F_in1 (7×7×n1) and the sub-network 2 feature F_in2 (7×7×n2), and concatenate them along the third dimension into a feature of size 7×7×n: F = Concat[F_in1, F_in2, axis=3], where F denotes the concatenated sub-network feature; Concat[·] denotes the concatenation operation; axis denotes the feature dimension along which concatenation is performed;
A2. Feed the concatenated sub-network feature F into the 1st branch and apply 3×3 average pooling to obtain F_1 = Avg_Pool(F, pool_size=[3,3]), where Avg_Pool(·) denotes the average pooling operation and pool_size denotes the pooling kernel size; then apply a 1×1 convolution to compress the feature dimension to 1024: F'_1 = Conv2d(F_1, fs=1024, size=[1,1]), where F'_1 denotes the feature of the average-pooled feature F_1 after the 1×1 convolution kernel; Conv2d(·) denotes a two-dimensional convolution operation; fs denotes the feature dimension; size denotes the convolution kernel size;
A3. Feed the concatenated sub-network feature F into the 2nd branch: first apply a 3×3 convolution to reduce the dimension to 512, then apply 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merge them by concatenation, raising the dimension from 512 to 1024: F_2 = Conv2d(F, fs=512, size=[3,3]), where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution kernel; F_21 = Conv2d(F_2, size=[1,3]), where F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 = Conv2d(F_2, size=[3,1]), where F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 = Concat[F_21, F_22, axis=3], where F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. Feed the concatenated sub-network feature F into the 3rd branch: first apply a 3×3 dilated (hole) convolution with dilation rate 2, then apply 1×3 and 3×1 asymmetric convolutions: F_3 = Dilated_Conv(F, ratio=2, fs=512, size=[3,3]), where F_3 denotes the feature of F after the 3×3 dilated convolution with rate 2; ratio denotes the dilation rate; Dilated_Conv(·) denotes the dilated convolution; F_31 = Conv2d(F_3, size=[1,3]), where F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 = Conv2d(F_3, size=[3,1]), where F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. Feed the concatenated sub-network feature F into the 4th branch: first apply a 3×3 dilated convolution with dilation rate 3, then apply 1×3 and 3×1 asymmetric convolutions: F_4 = Dilated_Conv(F, ratio=3, fs=512, size=[3,3]), where F_4 denotes the feature of F after the 3×3 dilated convolution with rate 3; F_41 = Conv2d(F_4, size=[1,3]), where F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 = Conv2d(F_4, size=[3,1]), where F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then apply a 1×1 convolution to reduce the dimension to 1024: F'_3 = Concat[F_31, F_32, F_41, F_42, axis=3], where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 = Conv2d(F'_3, fs=1024, size=[1,1]), where F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution: F''_2 = Add[F'_2, F''_3], where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 = Conv2d(F''_2, fs=1024, size=[3,3]), where F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out (7×7×1024) is the sum of F'_1 and F'''_2, with feature dimension 1024: F_out = Add[F'_1, F'''_2] (see the code sketch after these steps).
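The four-branch fusion of steps A1-A8 can be sketched in PyTorch as follows. This is a reconstruction under assumptions rather than the patentee's implementation: a channel-first layout is used instead of the axis=3 (channel-last) notation above, the padding values are chosen so that every branch keeps the 7×7 spatial size and the final additions line up, and no normalization or activation layers are shown between the convolutions.

```python
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    """Four-branch depth feature fusion sketch following steps A1-A8."""
    def __init__(self, in_channels: int):
        super().__init__()
        # Branch 1: 3x3 average pooling, then 1x1 conv to 1024 channels (F_1 -> F'_1)
        self.branch1 = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 1024, kernel_size=1),
        )
        # Branch 2: 3x3 conv to 512 (F_2), then parallel 1x3 / 3x1 convs (F_21, F_22)
        self.b2_reduce = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.b2_1x3 = nn.Conv2d(512, 512, kernel_size=(1, 3), padding=(0, 1))
        self.b2_3x1 = nn.Conv2d(512, 512, kernel_size=(3, 1), padding=(1, 0))
        # Branch 3: 3x3 dilated conv, rate 2 (F_3), then 1x3 / 3x1 (F_31, F_32)
        self.b3_dilated = nn.Conv2d(in_channels, 512, kernel_size=3, dilation=2, padding=2)
        self.b3_1x3 = nn.Conv2d(512, 512, kernel_size=(1, 3), padding=(0, 1))
        self.b3_3x1 = nn.Conv2d(512, 512, kernel_size=(3, 1), padding=(1, 0))
        # Branch 4: 3x3 dilated conv, rate 3 (F_4), then 1x3 / 3x1 (F_41, F_42)
        self.b4_dilated = nn.Conv2d(in_channels, 512, kernel_size=3, dilation=3, padding=3)
        self.b4_1x3 = nn.Conv2d(512, 512, kernel_size=(1, 3), padding=(0, 1))
        self.b4_3x1 = nn.Conv2d(512, 512, kernel_size=(3, 1), padding=(1, 0))
        # 2048-channel concatenation of F_31, F_32, F_41, F_42 reduced to 1024 (F''_3)
        self.reduce_2048 = nn.Conv2d(2048, 1024, kernel_size=1)
        # 3x3 conv applied after the residual-style addition (F''_2 -> F'''_2)
        self.post_add = nn.Conv2d(1024, 1024, kernel_size=3, padding=1)

    def forward(self, f_in1: torch.Tensor, f_in2: torch.Tensor) -> torch.Tensor:
        f = torch.cat([f_in1, f_in2], dim=1)          # A1: concatenate sub-network features
        f1 = self.branch1(f)                          # A2: F'_1, 1024 channels
        f2 = self.b2_reduce(f)                        # A3: F_2
        f2_cat = torch.cat([self.b2_1x3(f2), self.b2_3x1(f2)], dim=1)   # F'_2, 1024 channels
        f3 = self.b3_dilated(f)                       # A4: F_3
        f4 = self.b4_dilated(f)                       # A5: F_4
        multi = torch.cat([self.b3_1x3(f3), self.b3_3x1(f3),
                           self.b4_1x3(f4), self.b4_3x1(f4)], dim=1)    # A6: 2048 channels
        f3_reduced = self.reduce_2048(multi)          # F''_3, 1024 channels
        f2_sum = f2_cat + f3_reduced                  # A7: F''_2 = Add[F'_2, F''_3]
        f2_out = self.post_add(f2_sum)                # F'''_2
        return f1 + f2_out                            # A8: F_out = Add[F'_1, F'''_2]
```

For the ResNet50/DenseNet121 pair used in the experiments, in_channels would be 2048 + 1024 = 3072.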
In this example, the invention was tested on the Food-41 food material dataset. Food-41 was collected from Mealcome (MLC dataset), a large food supply chain platform in China. It contains 4100 images of 41 food material types, split into three subsets: a 60% training set, a 20% validation set and a 20% test set. The experiments use the PyTorch deep learning platform; the loss function is the multi-class cross-entropy loss (categorical cross-entropy loss); the network weights are optimized with a stochastic gradient descent (SGD) optimizer; the learning rate γ is multiplied by the decay coefficient α after every decay-step iterations; the base learning rate is set to 0.001, the decay coefficient to 0.94, the momentum to 0.9, and the number of epochs to 30. First, the classical networks VGG16, ResNet50, InceptionV3, DenseNet121 and MobileNetV2, pre-trained on ImageNet, are selected for fine-tuning (the last fully connected layer of each network is rebuilt, from the 1000 classes of ImageNet to the 41 classes of Food-41). Each is trained for 30 epochs on the Food-41 training set and evaluated on the Food-41 test set, averaging three test runs for comparison. Then, the three networks with the highest accuracy, ResNet50, InceptionV3 and DenseNet121, are taken pairwise as the feature extraction networks (ResNet50-InceptionV3, InceptionV3-DenseNet121 and ResNet50-DenseNet121) to construct the network model based on attention and depth feature fusion; each model is trained for 30 epochs on the Food-41 training set and evaluated by averaging three test runs on the Food-41 test set.
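A minimal training-loop sketch consistent with these settings is given below. The function name train_food41 and the data loader handling are illustrative, and the decay step length of one epoch is an assumption (the text does not state it); only the SGD optimizer, base learning rate 0.001, momentum 0.9, decay coefficient 0.94, cross-entropy loss and 30 epochs come from the description.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader

def train_food41(model: nn.Module, train_loader: DataLoader, epochs: int = 30) -> None:
    """Fine-tuning loop with the hyper-parameters quoted in the experiments."""
    criterion = nn.CrossEntropyLoss()                        # multi-class cross-entropy loss
    optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = StepLR(optimizer, step_size=1, gamma=0.94)   # decay step assumed to be 1 epoch
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```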
TABLE 1
Classical network model Accuracy (%)
VGG16 90.60
ResNet50 94.68
InceptionV3 93.90
DenseNet121 93.98
MobileNetV2 93.42
TABLE 2
Table 1 shows the experimental results of the fine-tuned classical networks on the Food-41 test set. Table 2 shows the comparison of the experimental results of the proposed method on the Food-41 test set. As shown in Tables 1 and 2, the experimental results indicate that the method achieves a better effect on the Food-41 dataset. ResNet50-DenseNet121 in Table 2 reaches the highest accuracy of 95.73%, and the models built on a parallel attention feature extraction network composed of two networks all achieve higher accuracy than any single network; for example, ResNet50-InceptionV3 improves by up to 1.89% over InceptionV3 alone. In the method, compressed excitation attention is embedded in the sub-feature extraction networks, so that the local detail features of food material images are better attended to, and the features extracted by the two sub-networks are then fused into a feature with stronger representational power, thereby improving the accuracy of food material image classification.

Claims (1)

1. A food material image classification model building method based on attention and depth feature fusion is characterized by comprising the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding compressed excitation attention into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network to extract food material image features;
in a particular implementation, the ResNet network includes a Res Block structure block and a first SE attention layer; the Res Block structure block comprises a first convolution layer, a first pooling layer and a first activation layer; the DenseNet121 network includes a Dense Block structure block and a second SE attention layer; the Dense Block structure block comprises a second convolution layer, a second pooling layer and a second activation layer;
wherein the SE attention layer comprises: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps; adopting a sigmoid-form gate mechanism so that the network learns the nonlinear relations among the channels; adopting a bottleneck structure with two fully connected layers to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU activation; finally, multiplying the learned activation value of each channel with the original feature to obtain the final attention feature map y;
specifically, the SE attention layer comprises: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps,
z_c = F_GAP(u_c)
where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel;
adopting a sigmoid-form gate mechanism so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; w denotes the network weights; g(·) denotes the pooling function; σ(·) denotes the sigmoid activation function;
adopting a bottleneck structure with two fully connected layers to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU activation; finally, multiplying the learned activation value of each channel with the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the sub-network 1 feature F_in1 and the sub-network 2 feature F_in2;
S3, inputting the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features, specifically comprising the following steps:
A1. inputting the sub-network 1 feature and the sub-network 2 feature, and concatenating them along the third dimension into the concatenated sub-network feature F;
A2. feeding the concatenated sub-network feature F into the 1st branch, applying 3×3 average pooling to obtain the average-pooled feature F_1, then applying a 1×1 convolution to compress the feature dimension to 1024, obtaining F'_1, the feature of the average-pooled feature F_1 after the 1×1 convolution kernel;
A3. feeding the concatenated sub-network feature F into the 2nd branch, first applying a 3×3 convolution to reduce the dimension to 512, then applying 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merging them by concatenation, raising the dimension from 512 to 1024, to obtain F_2, F_21, F_22 and F'_2, wherein F_2 denotes the 512-dimensional feature of the concatenated sub-network feature F after the 3×3 convolution kernel; F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. feeding the concatenated sub-network feature F into the 3rd branch, first applying a 3×3 dilated convolution with dilation rate 2, then applying 1×3 and 3×1 asymmetric convolutions, to obtain F_3, F_31 and F_32, wherein F_3 denotes the feature of the concatenated sub-network feature F after the 3×3 dilated convolution with rate 2; F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. feeding the concatenated sub-network feature F into the 4th branch, first applying a 3×3 dilated convolution with dilation rate 3, then applying 1×3 and 3×1 asymmetric convolutions, to obtain F_4, F_41 and F_42, wherein F_4 denotes the feature of the concatenated sub-network feature F after the 3×3 dilated convolution with rate 3; F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. concatenating F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then applying a 1×1 convolution to reduce the dimension to 1024, to obtain F'_3 and F''_3, wherein F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. adding F'_2 and F''_3, then applying a 3×3 convolution, to obtain F''_2 and F'''_2, wherein F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. the finally output fused feature F_out is the sum of F'_1 and F'''_2, with feature dimension 1024;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain the food material types.
CN202210342846.8A 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion Active CN114898360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210342846.8A CN114898360B (en) 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210342846.8A CN114898360B (en) 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion

Publications (2)

Publication Number Publication Date
CN114898360A CN114898360A (en) 2022-08-12
CN114898360B true CN114898360B (en) 2024-04-26

Family

ID=82715937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210342846.8A Active CN114898360B (en) 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion

Country Status (1)

Country Link
CN (1) CN114898360B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488301A (en) * 2020-12-09 2021-03-12 孙成林 Food inversion method based on multitask learning and attention mechanism
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network
CN113887410A (en) * 2021-09-30 2022-01-04 杭州电子科技大学 Deep learning-based multi-category food material identification system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389078B (en) * 2018-09-30 2022-06-21 京东方科技集团股份有限公司 Image segmentation method, corresponding device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488301A (en) * 2020-12-09 2021-03-12 孙成林 Food inversion method based on multitask learning and attention mechanism
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network
CN113887410A (en) * 2021-09-30 2022-01-04 杭州电子科技大学 Deep learning-based multi-category food material identification system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
采用融合卷积网的图像分类算法 (Image classification algorithm using fused convolutional networks); 李聪 (Li Cong) et al.; 计算机工程与科学 (Computer Engineering & Science); 2019-12-15 (No. 12); pp. 89-96 *

Also Published As

Publication number Publication date
CN114898360A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN108256482B (en) Face age estimation method for distributed learning based on convolutional neural network
CN109000930B (en) Turbine engine performance degradation evaluation method based on stacking denoising autoencoder
CN109389171B (en) Medical image classification method based on multi-granularity convolution noise reduction automatic encoder technology
CN110728656A (en) Meta-learning-based no-reference image quality data processing method and intelligent terminal
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN104268593A (en) Multiple-sparse-representation face recognition method for solving small sample size problem
CN111160189A (en) Deep neural network facial expression recognition method based on dynamic target training
CN116645716B (en) Expression recognition method based on local features and global features
CN110264407B (en) Image super-resolution model training and reconstruction method, device, equipment and storage medium
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN107240136A (en) A kind of Still Image Compression Methods based on deep learning model
CN109657707A (en) A kind of image classification method based on observing matrix transformation dimension
CN109767789A (en) A kind of new feature extracting method for speech emotion recognition
CN110458189A (en) Compressed sensing and depth convolutional neural networks Power Quality Disturbance Classification Method
CN111695611B (en) Bee colony optimization kernel extreme learning and sparse representation mechanical fault identification method
CN111368734B (en) Micro expression recognition method based on normal expression assistance
CN114169377A (en) G-MSCNN-based fault diagnosis method for rolling bearing in noisy environment
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN112365139A (en) Crowd danger degree analysis method under graph convolution neural network
Li et al. A deep learning method for material performance recognition in laser additive manufacturing
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
CN108401150A (en) A kind of compressed sensing reconstruction algorithm statistic of attribute evaluation method of analog vision subjective perception
CN112508121B (en) Method and system for sensing outside of industrial robot
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant