CN114898360B - Food material image classification model establishment method based on attention and depth feature fusion - Google Patents

Food material image classification model establishment method based on attention and depth feature fusion

Info

Publication number
CN114898360B
CN114898360B
Authority
CN
China
Prior art keywords
feature
convolution
dimension
network
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210342846.8A
Other languages
Chinese (zh)
Other versions
CN114898360A (en)
Inventor
潘丽丽
马俊勇
雷前慧
蒋湘辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University of Forestry and Technology
Original Assignee
Central South University of Forestry and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University of Forestry and Technology filed Critical Central South University of Forestry and Technology
Priority to CN202210342846.8A
Publication of CN114898360A
Application granted
Publication of CN114898360B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for building a food material image classification model based on attention and depth feature fusion. The method comprises: collecting food material image data, including historical image data and image data to be classified; embedding compressed excitation (squeeze-and-excitation, SE) attention into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network that extracts food material image features; inputting the extracted features into a depth feature fusion module to further extract deep food material features; and establishing a food material image classification model that classifies the images to obtain the food material types. By embedding the attention mechanism in the feature extraction networks, the extracted features focus more on the local details of the food materials and the sub-network features become more discriminative for food material classification, which effectively improves classification accuracy. In the feature fusion, network parameters are greatly reduced, and residual-style feature addition prevents gradients from vanishing as the network deepens, making food material classification efficient and fast.

Description

Food material image classification model establishment method based on attention and depth feature fusion
Technical Field
The invention belongs to the field of image processing, and particularly relates to a food material image classification model building method based on attention and depth feature fusion.
Background
As living standards improve, people's health awareness has grown markedly, and expectations and demands for a healthy diet keep rising. Accurate dietary assessment is an important way to evaluate the effectiveness of nutritional food combinations. Food suppliers currently rely mainly on manual methods to sort and evaluate food materials, but this process is tedious, labor-intensive, expensive and subjective. With the rapid development of the Internet industry and multimedia technology, image classification research in the food material field has received increasing attention in multimedia analysis and applications. However, existing food material classification algorithms suffer from problems such as indistinct extracted features and low classification accuracy, and cannot meet practical needs.
Disclosure of Invention
The invention aims to provide a method for establishing a food material image classification model based on attention and depth feature fusion, which enables the network to extract more discriminative food material features, effectively fuses the deep features of two different networks, and improves the accuracy of food material classification.
The invention provides a food material image classification model establishment method based on attention and depth feature fusion, which comprises the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding compressed excitation attention (Squeeze-and-Excitation Attention, SE) into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network to extract food material image features;
S3, inputting the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain the food material types.
In step S2, the ResNet network comprises a Res Block structure block and a first SE attention layer; the Res Block structure block comprises a first convolution layer, a first pooling layer and a first activation layer. The DenseNet121 network comprises a Dense Block structure block and a second SE attention layer; the Dense Block structure block comprises a second convolution layer, a second pooling layer and a second activation layer.
The SE attention layer includes: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps; adopting a sigmoid-form gate mechanism so that the network learns the nonlinear relations among the channels; and adopting a bottleneck structure with two fully connected layers to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU (Rectified Linear Unit) activation. Finally, the learned activation value of each channel is multiplied with the original feature to obtain the final attention feature map y.
The SE attention layer specifically comprises: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps,
z_c = F_GAP(u_c)
where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel;
A sigmoid-form gate mechanism is adopted so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; w denotes the network weights; g(·) denotes the pooling function; σ(·) denotes the sigmoid activation function;
A bottleneck structure with two fully connected layers is adopted to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU activation; finally, the learned activation value of each channel is multiplied with the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the sub-network 1 feature F_in1 and the sub-network 2 feature F_in2.
Step S3 comprises the following steps:
A1. Input the sub-network 1 feature and the sub-network 2 feature, and concatenate them along the third dimension into the concatenated sub-network feature F;
A2. Feed the concatenated sub-network feature F into the 1st branch: apply 3×3 average pooling to obtain the average-pooled feature F_1, then apply a 1×1 convolution to compress the feature dimension to 1024, obtaining F'_1, the feature of the average-pooled feature F_1 after the 1×1 convolution kernel;
A3. Feed the concatenated sub-network feature F into the 2nd branch: first apply a 3×3 convolution to reduce the dimension to 512, then apply 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merge them by concatenation, raising the dimension from 512 to 1024, obtaining F_2, F_21, F_22 and F'_2, where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution kernel; F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. Feed the concatenated sub-network feature F into the 3rd branch: first apply a 3×3 dilated (hole) convolution with dilation rate 2, then apply 1×3 and 3×1 asymmetric convolutions, obtaining F_3, F_31 and F_32, where F_3 denotes the feature of F after the 3×3 dilated convolution with rate 2; F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. Feed the concatenated sub-network feature F into the 4th branch: first apply a 3×3 dilated convolution with dilation rate 3, then apply 1×3 and 3×1 asymmetric convolutions, obtaining F_4, F_41 and F_42, where F_4 denotes the feature of F after the 3×3 dilated convolution with rate 3; F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then apply a 1×1 convolution to reduce the dimension to 1024, obtaining F'_3 and F''_3, where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution, obtaining F''_2 and F'''_2, where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out is the sum of F'_1 and F'''_2, with feature dimension 1024.
Specifically, step S3 comprises the following steps:
A1. Input the sub-network 1 feature F_in1 and the sub-network 2 feature F_in2, and concatenate them along the third dimension into a feature of size 7×7×n: F = Concat[F_in1, F_in2, axis=3], where F denotes the concatenated sub-network feature; Concat[·] denotes the concatenation operation; axis denotes the feature dimension along which concatenation is performed;
A2. Feed the concatenated sub-network feature F into the 1st branch and apply 3×3 average pooling to obtain F_1 = Avg_Pool(F, pool_size=[3,3]), where Avg_Pool(·) denotes the average pooling operation and pool_size denotes the pooling kernel size; then apply a 1×1 convolution to compress the feature dimension to 1024: F'_1 = Conv2d(F_1, fs=1024, size=[1,1]), where F'_1 denotes the feature of the average-pooled feature F_1 after the 1×1 convolution kernel; Conv2d(·) denotes a two-dimensional convolution operation; fs denotes the feature dimension; size denotes the convolution kernel size;
A3. Feed the concatenated sub-network feature F into the 2nd branch: first apply a 3×3 convolution to reduce the dimension to 512, then apply 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merge them by concatenation, raising the dimension from 512 to 1024: F_2 = Conv2d(F, fs=512, size=[3,3]), where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution kernel; F_21 = Conv2d(F_2, size=[1,3]), where F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 = Conv2d(F_2, size=[3,1]), where F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 = Concat[F_21, F_22, axis=3], where F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. Feed the concatenated sub-network feature F into the 3rd branch: first apply a 3×3 dilated (hole) convolution with dilation rate 2, then apply 1×3 and 3×1 asymmetric convolutions: F_3 = Dilated_Conv(F, ratio=2, fs=512, size=[3,3]), where F_3 denotes the feature of F after the 3×3 dilated convolution with rate 2; ratio denotes the dilation rate; Dilated_Conv(·) denotes the dilated convolution; F_31 = Conv2d(F_3, size=[1,3]), where F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 = Conv2d(F_3, size=[3,1]), where F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. Feed the concatenated sub-network feature F into the 4th branch: first apply a 3×3 dilated convolution with dilation rate 3, then apply 1×3 and 3×1 asymmetric convolutions: F_4 = Dilated_Conv(F, ratio=3, fs=512, size=[3,3]), where F_4 denotes the feature of F after the 3×3 dilated convolution with rate 3; F_41 = Conv2d(F_4, size=[1,3]), where F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 = Conv2d(F_4, size=[3,1]), where F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then apply a 1×1 convolution to reduce the dimension to 1024: F'_3 = Concat[F_31, F_32, F_41, F_42, axis=3], where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 = Conv2d(F'_3, fs=1024, size=[1,1]), where F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution: F''_2 = Add[F'_2, F''_3], where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 = Conv2d(F''_2, fs=1024, size=[3,3]), where F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out (7×7×1024) is the sum of F'_1 and F'''_2, with feature dimension 1024: F_out = Add[F'_1, F'''_2].
According to the method for establishing a food material image classification model based on attention and depth feature fusion provided by the invention, the attention mechanism is embedded into the feature extraction networks, so that the extracted features focus more on the local details of the food materials and the sub-network features become more discriminative for food material classification. The depth feature fusion module effectively fuses the features extracted by the two individual networks, combining complementary deep features of the different sub-networks into a feature with stronger representational power, which effectively improves classification accuracy. In the feature fusion, the convolution kernels include asymmetric convolutions and dilated convolutions instead of ordinary convolutions, which greatly reduces the network parameters; at the same time, a residual-style structure adds the features to prevent gradients from vanishing as the network deepens, making food material classification efficient and fast.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a depth feature fusion module according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a method for establishing a food material image classification model based on attention and depth feature fusion, which comprises the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding compressed excitation attention (Squeeze-and-Excitation Attention, SE) into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network to extract food material image features;
S3, inputting the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain the food material types (an illustrative pipeline sketch is given after these steps).
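By way of illustration only, the sketch below assembles steps S2 to S4 in PyTorch: two parallel backbone feature extractors, a pluggable fusion module, and a linear classification head. It is a simplified sketch under stated assumptions: torchvision's ResNet50 and DenseNet121 feature extractors stand in for the two sub-networks, the SE attention layers of step S2 are omitted here (a separate SE sketch follows the SE description below), the fusion module is passed in as an argument (for example, the depth feature fusion sketch given after steps A1-A8), and the fused feature is assumed to have 1024 channels, consistent with F_out.

```python
import torch
import torch.nn as nn
from torchvision import models

class ParallelAttentionClassifier(nn.Module):
    """Illustrative assembly of steps S2-S4: parallel ResNet50/DenseNet121
    feature extractors, a pluggable fusion module, and a linear classifier."""
    def __init__(self, fusion: nn.Module, num_classes: int = 41):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Drop the average-pool and FC head: output is a (B, 2048, 7, 7) feature map
        self.subnet1 = nn.Sequential(*list(resnet.children())[:-2])
        # DenseNet121 feature extractor: output is a (B, 1024, 7, 7) feature map
        self.subnet2 = models.densenet121(weights=None).features
        self.fusion = fusion                              # e.g. DepthFeatureFusion(2048 + 1024)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(1024, num_classes)    # assumes a 1024-channel fused feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_in1 = self.subnet1(x)                   # sub-network 1 feature F_in1
        f_in2 = self.subnet2(x)                   # sub-network 2 feature F_in2
        f_out = self.fusion(f_in1, f_in2)         # fused deep feature F_out
        return self.classifier(self.pool(f_out).flatten(1))
```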
In food material classification, the differences between categories are often very subtle, and the discriminative information usually lies in local regions of the image. For example, bananas and plantains, or shallots and garlic, are very similar in appearance; the human eye cannot tell them apart at a glance and can only distinguish them through subtle differences in shape and color. For the food material classification and recognition problem it is therefore crucial to capture discriminative local features, and the SE attention mechanism is embedded into the ResNet network (sub-feature extraction network 1) and the DenseNet121 network (sub-feature extraction network 2) to extract better local features.
Fig. 2 is a schematic flow chart of an embodiment of the present invention. In step S2, the ResNet network includes a Res Block structure block and a first SE attention layer; the Res Block structure block includes a first convolution layer, a first pooling layer, a first activation layer, and the like. The DenseNet121 network includes a Dense Block structure block and a second SE attention layer; the Dense Block structure block includes a second convolution layer, a second pooling layer, a second activation layer, and the like.
SE attention focuses on the relations among channels and automatically learns the importance of each channel. SE attention first encodes the spatial features of each channel into a global feature, using global average pooling to output the value distribution z_c of the c-th channel feature map of the layer,
z_c = F_GAP(u_c)
where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel;
A sigmoid-form gate mechanism is adopted so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; w denotes the network weights; g(·) denotes the pooling function; σ(·) denotes the sigmoid activation function;
To reduce model complexity and improve generalization, a bottleneck structure with two fully connected layers is adopted to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU (Rectified Linear Unit) activation; finally, the learned activation value of each channel is multiplied with the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the sub-network 1 feature F_in1 (7×7×n1) and the sub-network 2 feature F_in2 (7×7×n2).
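A minimal PyTorch sketch of such an SE attention layer is shown below. This is an illustrative reconstruction, not the patented code: the reduction ratio of 16 and the bias-free fully connected layers are assumptions (the description only states that r is a hyperparameter).

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation attention: global average pooling, a two-FC
    bottleneck with reduction ratio r, ReLU, a sigmoid gate, and channel-wise
    rescaling of the original feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # z_c = F_GAP(u_c)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                                # s_c = sigma(g(z_c, w))
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                   # squeeze to (B, C)
        s = self.excite(z).view(b, c, 1, 1)              # per-channel activation values
        return u * s                                     # y_c = s_c * u_c
```

Such a layer would sit after the Res Block and Dense Block structure blocks of the two sub-networks, as described for step S2.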
Fig. 3 is a schematic diagram of the depth feature fusion module according to an embodiment of the invention. Traditional feature fusion simply concatenates the features along the third dimension, i.e., stacks the channels, which hinders the flow of feature information; directly using the concatenated features of the two parallel sub-networks for subsequent inference leads to problems such as high feature dimensionality and high computational cost. To improve the expressive power of the network and achieve a better food material image recognition effect, a novel depth feature fusion module is proposed, as shown in Fig. 3. It combines the features extracted by the two parallel sub-networks and learns more complementary features, thereby improving the classification accuracy of food material images. To make reasonable use of computing resources, small convolution kernels (1×1, 1×3 and 3×1) are mainly adopted, which helps reduce computational consumption; at the same time, multiple groups of convolution operations are executed in parallel to obtain richer image features.
Step S3 comprises the following steps:
A1. Input the sub-network 1 feature F_in1 (7×7×n1) and the sub-network 2 feature F_in2 (7×7×n2), and concatenate them along the third dimension into a feature of size 7×7×n: F = Concat[F_in1, F_in2, axis=3], where F denotes the concatenated sub-network feature; Concat[·] denotes the concatenation operation; axis denotes the feature dimension along which concatenation is performed;
A2. Feed the concatenated sub-network feature F into the 1st branch and apply 3×3 average pooling to obtain F_1 = Avg_Pool(F, pool_size=[3,3]), where Avg_Pool(·) denotes the average pooling operation and pool_size denotes the pooling kernel size; then apply a 1×1 convolution to compress the feature dimension to 1024: F'_1 = Conv2d(F_1, fs=1024, size=[1,1]), where F'_1 denotes the feature of the average-pooled feature F_1 after the 1×1 convolution kernel; Conv2d(·) denotes a two-dimensional convolution operation; fs denotes the feature dimension; size denotes the convolution kernel size;
A3. Feed the concatenated sub-network feature F into the 2nd branch: first apply a 3×3 convolution to reduce the dimension to 512, then apply 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merge them by concatenation, raising the dimension from 512 to 1024: F_2 = Conv2d(F, fs=512, size=[3,3]), where F_2 denotes the 512-dimensional feature of F after the 3×3 convolution kernel; F_21 = Conv2d(F_2, size=[1,3]), where F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 = Conv2d(F_2, size=[3,1]), where F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 = Concat[F_21, F_22, axis=3], where F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. Feed the concatenated sub-network feature F into the 3rd branch: first apply a 3×3 dilated (hole) convolution with dilation rate 2, then apply 1×3 and 3×1 asymmetric convolutions: F_3 = Dilated_Conv(F, ratio=2, fs=512, size=[3,3]), where F_3 denotes the feature of F after the 3×3 dilated convolution with rate 2; ratio denotes the dilation rate; Dilated_Conv(·) denotes the dilated convolution; F_31 = Conv2d(F_3, size=[1,3]), where F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 = Conv2d(F_3, size=[3,1]), where F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. Feed the concatenated sub-network feature F into the 4th branch: first apply a 3×3 dilated convolution with dilation rate 3, then apply 1×3 and 3×1 asymmetric convolutions: F_4 = Dilated_Conv(F, ratio=3, fs=512, size=[3,3]), where F_4 denotes the feature of F after the 3×3 dilated convolution with rate 3; F_41 = Conv2d(F_4, size=[1,3]), where F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 = Conv2d(F_4, size=[3,1]), where F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. Concatenate F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then apply a 1×1 convolution to reduce the dimension to 1024: F'_3 = Concat[F_31, F_32, F_41, F_42, axis=3], where F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 = Conv2d(F'_3, fs=1024, size=[1,1]), where F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. Add F'_2 and F''_3, then apply a 3×3 convolution: F''_2 = Add[F'_2, F''_3], where F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 = Conv2d(F''_2, fs=1024, size=[3,3]), where F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. The final output fused feature F_out (7×7×1024) is the sum of F'_1 and F'''_2, with feature dimension 1024: F_out = Add[F'_1, F'''_2] (see the code sketch after these steps).
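The four-branch fusion of steps A1-A8 can be sketched in PyTorch as follows. This is a reconstruction under assumptions rather than the patentee's implementation: a channel-first layout is used instead of the axis=3 (channel-last) notation above, the padding values are chosen so that every branch keeps the 7×7 spatial size and the final additions line up, and no normalization or activation layers are shown between the convolutions.

```python
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    """Four-branch depth feature fusion sketch following steps A1-A8."""
    def __init__(self, in_channels: int):
        super().__init__()
        # Branch 1: 3x3 average pooling, then 1x1 conv to 1024 channels (F_1 -> F'_1)
        self.branch1 = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 1024, kernel_size=1),
        )
        # Branch 2: 3x3 conv to 512 (F_2), then parallel 1x3 / 3x1 convs (F_21, F_22)
        self.b2_reduce = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.b2_1x3 = nn.Conv2d(512, 512, kernel_size=(1, 3), padding=(0, 1))
        self.b2_3x1 = nn.Conv2d(512, 512, kernel_size=(3, 1), padding=(1, 0))
        # Branch 3: 3x3 dilated conv, rate 2 (F_3), then 1x3 / 3x1 (F_31, F_32)
        self.b3_dilated = nn.Conv2d(in_channels, 512, kernel_size=3, dilation=2, padding=2)
        self.b3_1x3 = nn.Conv2d(512, 512, kernel_size=(1, 3), padding=(0, 1))
        self.b3_3x1 = nn.Conv2d(512, 512, kernel_size=(3, 1), padding=(1, 0))
        # Branch 4: 3x3 dilated conv, rate 3 (F_4), then 1x3 / 3x1 (F_41, F_42)
        self.b4_dilated = nn.Conv2d(in_channels, 512, kernel_size=3, dilation=3, padding=3)
        self.b4_1x3 = nn.Conv2d(512, 512, kernel_size=(1, 3), padding=(0, 1))
        self.b4_3x1 = nn.Conv2d(512, 512, kernel_size=(3, 1), padding=(1, 0))
        # 2048-channel concatenation of F_31, F_32, F_41, F_42 reduced to 1024 (F''_3)
        self.reduce_2048 = nn.Conv2d(2048, 1024, kernel_size=1)
        # 3x3 conv applied after the residual-style addition (F''_2 -> F'''_2)
        self.post_add = nn.Conv2d(1024, 1024, kernel_size=3, padding=1)

    def forward(self, f_in1: torch.Tensor, f_in2: torch.Tensor) -> torch.Tensor:
        f = torch.cat([f_in1, f_in2], dim=1)          # A1: concatenate sub-network features
        f1 = self.branch1(f)                          # A2: F'_1, 1024 channels
        f2 = self.b2_reduce(f)                        # A3: F_2
        f2_cat = torch.cat([self.b2_1x3(f2), self.b2_3x1(f2)], dim=1)   # F'_2, 1024 channels
        f3 = self.b3_dilated(f)                       # A4: F_3
        f4 = self.b4_dilated(f)                       # A5: F_4
        multi = torch.cat([self.b3_1x3(f3), self.b3_3x1(f3),
                           self.b4_1x3(f4), self.b4_3x1(f4)], dim=1)    # A6: 2048 channels
        f3_reduced = self.reduce_2048(multi)          # F''_3, 1024 channels
        f2_sum = f2_cat + f3_reduced                  # A7: F''_2 = Add[F'_2, F''_3]
        f2_out = self.post_add(f2_sum)                # F'''_2
        return f1 + f2_out                            # A8: F_out = Add[F'_1, F'''_2]
```

For the ResNet50/DenseNet121 pair used in the experiments, in_channels would be 2048 + 1024 = 3072.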
In this example, the invention was tested on the Food-41 food material dataset. Food-41 was collected from Mealcome (MLC dataset), a large food supply chain platform in China. It contains 4100 images of 41 food material types, split into three subsets: a 60% training set, a 20% validation set and a 20% test set. The experiments use the PyTorch deep learning platform; the loss function is the multi-class cross-entropy loss (categorical cross-entropy loss); the network weights are optimized with a stochastic gradient descent (SGD) optimizer; the learning rate γ is multiplied by the decay coefficient α after every decay-step iterations; the base learning rate is set to 0.001, the decay coefficient to 0.94, the momentum to 0.9, and the number of epochs to 30. First, the classical networks VGG16, ResNet50, InceptionV3, DenseNet121 and MobileNetV2, pre-trained on ImageNet, are selected for fine-tuning (the last fully connected layer of each network is rebuilt, from the 1000 classes of ImageNet to the 41 classes of Food-41). Each is trained for 30 epochs on the Food-41 training set and evaluated on the Food-41 test set, averaging three test runs for comparison. Then, the three networks with the highest accuracy, ResNet50, InceptionV3 and DenseNet121, are taken pairwise as the feature extraction networks (ResNet50-InceptionV3, InceptionV3-DenseNet121 and ResNet50-DenseNet121) to construct the network model based on attention and depth feature fusion; each model is trained for 30 epochs on the Food-41 training set and evaluated by averaging three test runs on the Food-41 test set.
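A minimal training-loop sketch consistent with these settings is given below. The function name train_food41 and the data loader handling are illustrative, and the decay step length of one epoch is an assumption (the text does not state it); only the SGD optimizer, base learning rate 0.001, momentum 0.9, decay coefficient 0.94, cross-entropy loss and 30 epochs come from the description.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader

def train_food41(model: nn.Module, train_loader: DataLoader, epochs: int = 30) -> None:
    """Fine-tuning loop with the hyper-parameters quoted in the experiments."""
    criterion = nn.CrossEntropyLoss()                        # multi-class cross-entropy loss
    optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = StepLR(optimizer, step_size=1, gamma=0.94)   # decay step assumed to be 1 epoch
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```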
TABLE 1
Classical network model Accuracy (%)
VGG16 90.60
ResNet50 94.68
InceptionV3 93.90
DenseNet121 93.98
MobileNetV2 93.42
TABLE 2
Table 1 shows the experimental results of the fine-tuned classical networks on the Food-41 test set. Table 2 shows the comparison of the experimental results of the proposed method on the Food-41 test set. As shown in Tables 1 and 2, the experimental results indicate that the method achieves a better effect on the Food-41 dataset. ResNet50-DenseNet121 in Table 2 reaches the highest accuracy of 95.73%, and the models built on a parallel attention feature extraction network composed of two networks all achieve higher accuracy than any single network; for example, ResNet50-InceptionV3 improves by up to 1.89% over InceptionV3 alone. In the method, compressed excitation attention is embedded in the sub-feature extraction networks, so that the local detail features of food material images are better attended to, and the features extracted by the two sub-networks are then fused into a feature with stronger representational power, thereby improving the accuracy of food material image classification.

Claims (1)

1. A food material image classification model building method based on attention and depth feature fusion is characterized by comprising the following steps:
S1, acquiring food material image data, wherein the food material image data comprises historical image data and image data to be classified;
S2, embedding compressed excitation attention into a parallel ResNet network and a parallel DenseNet121 network, the two networks then forming a parallel attention feature extraction network to extract food material image features;
in a particular implementation, the ResNet network includes a Res Block structure block and a first SE attention layer; the Res Block structure block comprises a first convolution layer, a first pooling layer and a first activation layer; the DenseNet121 network includes a Dense Block structure block and a second SE attention layer; the Dense Block structure block comprises a second convolution layer, a second pooling layer and a second activation layer;
wherein the SE attention layer comprises: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps; adopting a sigmoid-form gate mechanism so that the network learns the nonlinear relations among the channels; adopting a bottleneck structure with two fully connected layers to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU activation; finally, multiplying the learned activation value of each channel with the original feature to obtain the final attention feature map y;
specifically, the SE attention layer comprises: encoding the spatial features of each channel into a global feature, using global average pooling to output the value distribution of the layer's c feature maps,
z_c = F_GAP(u_c)
where F_GAP(·) denotes global average pooling and u_c denotes the original feature map of the c-th channel;
adopting a sigmoid-form gate mechanism so that the network learns the nonlinear relations among the channels:
s_c = σ(g(z_c, w))
where s_c denotes the activation value of the c-th channel; z_c denotes the value distribution of the c-th channel feature map; w denotes the network weights; g(·) denotes the pooling function; σ(·) denotes the sigmoid activation function;
adopting a bottleneck structure with two fully connected layers to reduce the feature dimension, the reduction ratio r being a hyperparameter, followed by ReLU activation; finally, multiplying the learned activation value of each channel with the original feature to obtain the final attention feature map y:
y_c = s_c · u_c
where s_c denotes the activation value of the c-th channel and u_c denotes the original feature map of the c-th channel, yielding the sub-network 1 feature F_in1 and the sub-network 2 feature F_in2;
S3, inputting the features extracted by the parallel ResNet and DenseNet121 networks into a depth feature fusion module to further extract deep food material features, specifically comprising the following steps:
A1. inputting the sub-network 1 feature and the sub-network 2 feature, and concatenating them along the third dimension into the concatenated sub-network feature F;
A2. feeding the concatenated sub-network feature F into the 1st branch, applying 3×3 average pooling to obtain the average-pooled feature F_1, then applying a 1×1 convolution to compress the feature dimension to 1024, obtaining F'_1, the feature of the average-pooled feature F_1 after the 1×1 convolution kernel;
A3. feeding the concatenated sub-network feature F into the 2nd branch, first applying a 3×3 convolution to reduce the dimension to 512, then applying 1×3 and 3×1 asymmetric convolutions in two parallel sub-branches, and finally merging them by concatenation, raising the dimension from 512 to 1024, to obtain F_2, F_21, F_22 and F'_2, wherein F_2 denotes the 512-dimensional feature of the concatenated sub-network feature F after the 3×3 convolution kernel; F_21 denotes the feature of F_2 after the 1×3 asymmetric convolution; F_22 denotes the feature of F_2 after the 3×1 asymmetric convolution; F'_2 denotes the channel concatenation of F_21 and F_22 along the 3rd dimension;
A4. feeding the concatenated sub-network feature F into the 3rd branch, first applying a 3×3 dilated convolution with dilation rate 2, then applying 1×3 and 3×1 asymmetric convolutions, to obtain F_3, F_31 and F_32, wherein F_3 denotes the feature of the concatenated sub-network feature F after the 3×3 dilated convolution with rate 2; F_31 denotes the feature of F_3 after the 1×3 asymmetric convolution; F_32 denotes the feature of F_3 after the 3×1 asymmetric convolution;
A5. feeding the concatenated sub-network feature F into the 4th branch, first applying a 3×3 dilated convolution with dilation rate 3, then applying 1×3 and 3×1 asymmetric convolutions, to obtain F_4, F_41 and F_42, wherein F_4 denotes the feature of the concatenated sub-network feature F after the 3×3 dilated convolution with rate 3; F_41 denotes the feature of F_4 after the 1×3 asymmetric convolution; F_42 denotes the feature of F_4 after the 3×1 asymmetric convolution;
A6. concatenating F_31, F_32, F_41 and F_42 along the 3rd dimension to dimension 2048, then applying a 1×1 convolution to reduce the dimension to 1024, to obtain F'_3 and F''_3, wherein F'_3 denotes the multi-branch concatenation of F_31, F_32, F_41 and F_42 along the third dimension; F''_3 denotes the feature of F'_3 after the 1×1 convolution dimension reduction;
A7. adding F'_2 and F''_3, then applying a 3×3 convolution, to obtain F''_2 and F'''_2, wherein F''_2 denotes the element-wise sum of F'_2 and F''_3; F'''_2 denotes the feature of F''_2 after the 3×3 convolution;
A8. the finally output fused feature F_out is the sum of F'_1 and F'''_2, with feature dimension 1024;
S4, establishing a food material image classification model, and classifying the image data to be classified to obtain the food material types.
CN202210342846.8A 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion Active CN114898360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210342846.8A CN114898360B (en) 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210342846.8A CN114898360B (en) 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion

Publications (2)

Publication Number Publication Date
CN114898360A CN114898360A (en) 2022-08-12
CN114898360B true CN114898360B (en) 2024-04-26

Family

ID=82715937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210342846.8A Active CN114898360B (en) 2022-03-31 2022-03-31 Food material image classification model establishment method based on attention and depth feature fusion

Country Status (1)

Country Link
CN (1) CN114898360B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488301A (en) * 2020-12-09 2021-03-12 孙成林 Food inversion method based on multitask learning and attention mechanism
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network
CN113887410A (en) * 2021-09-30 2022-01-04 杭州电子科技大学 Deep learning-based multi-category food material identification system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389078B (en) * 2018-09-30 2022-06-21 京东方科技集团股份有限公司 Image segmentation method, corresponding device and electronic equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488301A (en) * 2020-12-09 2021-03-12 孙成林 Food inversion method based on multitask learning and attention mechanism
CN113486981A (en) * 2021-07-30 2021-10-08 西安电子科技大学 RGB image classification method based on multi-scale feature attention fusion network
CN113887410A (en) * 2021-09-30 2022-01-04 杭州电子科技大学 Deep learning-based multi-category food material identification system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
采用融合卷积网的图像分类算法 (Image classification algorithm using fused convolutional networks); 李聪 (Li Cong) et al.; 计算机工程与科学 (Computer Engineering & Science); 2019-12-15 (No. 12); pp. 89-96 *

Also Published As

Publication number Publication date
CN114898360A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN108256482B (en) Face age estimation method for distributed learning based on convolutional neural network
CN109000930B (en) Turbine engine performance degradation evaluation method based on stacking denoising autoencoder
CN109389171B (en) Medical image classification method based on multi-granularity convolution noise reduction automatic encoder technology
CN110728656A (en) Meta-learning-based no-reference image quality data processing method and intelligent terminal
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN104268593A (en) Multiple-sparse-representation face recognition method for solving small sample size problem
CN111160189A (en) Deep neural network facial expression recognition method based on dynamic target training
CN116645716B (en) Expression recognition method based on local features and global features
CN110264407B (en) Image super-resolution model training and reconstruction method, device, equipment and storage medium
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN107240136A (en) A kind of Still Image Compression Methods based on deep learning model
CN109657707A (en) A kind of image classification method based on observing matrix transformation dimension
CN109767789A (en) A kind of new feature extracting method for speech emotion recognition
CN110458189A (en) Compressed sensing and depth convolutional neural networks Power Quality Disturbance Classification Method
CN111695611B (en) Bee colony optimization kernel extreme learning and sparse representation mechanical fault identification method
CN111368734B (en) Micro expression recognition method based on normal expression assistance
CN114169377A (en) G-MSCNN-based fault diagnosis method for rolling bearing in noisy environment
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN112365139A (en) Crowd danger degree analysis method under graph convolution neural network
Li et al. A deep learning method for material performance recognition in laser additive manufacturing
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
CN108401150A (en) A kind of compressed sensing reconstruction algorithm statistic of attribute evaluation method of analog vision subjective perception
CN112508121B (en) Method and system for sensing outside of industrial robot
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant