CN117935251A - Food identification method and system based on aggregated attention - Google Patents

Food identification method and system based on aggregated attention

Info

Publication number
CN117935251A
CN117935251A (application CN202410330639.XA)
Authority
CN
China
Prior art keywords
food
image
attention
module
average pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410330639.XA
Other languages
Chinese (zh)
Inventor
李忠涛
赵光龙
李雅其
王婉露
张玉璘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202410330639.XA priority Critical patent/CN117935251A/en
Publication of CN117935251A publication Critical patent/CN117935251A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a food identification method and system based on aggregated attention, and relates to the field of computer vision. The invention extracts image features through a constructed attention aggregation module and provides a backbone network with the attention aggregation module as its main component. The backbone network adopts a pyramid structure comprising four stages; between stages the image is downsampled to increase the number of channels and reduce the resolution. Image features are extracted through this backbone network, achieving efficient food identification.

Description

Food identification method and system based on aggregated attention
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a food identification method and system based on aggregated attention.
Background
Vision Transformers have become a popular backbone architecture for various computer vision tasks in recent years. The ViT model comprises two key components: the self-attention layer and the MLP layer. The self-attention mechanism plays a crucial role in feature extraction, dynamically generating an association matrix through similarity computation between Query and Key. This global information aggregation method has remarkable feature-extraction potential and can build strong data-driven models. However, the Vision Transformer encoder design was originally developed for language modeling and presents inherent limitations in downstream computer vision tasks. In particular, computing the global self-attention correlation matrix is challenging due to its quadratic complexity and high memory consumption, which limits its application to high-resolution image features.
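The quadratic complexity noted above comes from the N × N attention matrix itself. A minimal pure-Python sketch (with Query, Key, and Value all taken to be the raw tokens, omitting the learned projections of a real ViT) makes the cost visible:

```python
import math

def self_attention(X):
    """Single-head self-attention over N tokens of dimension d.

    Builds the full N x N score matrix, which is the source of the
    quadratic memory and compute cost discussed above. Query, Key,
    and Value are the tokens themselves here for brevity.
    """
    N, d = len(X), len(X[0])
    # N x N similarity matrix: O(N^2) entries
    scores = [[sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d)
               for k in X] for q in X]
    # row-wise softmax turns scores into attention weights
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # each output token is a weighted sum of all value vectors
    out = [[sum(w * X[j][c] for j, w in enumerate(row)) for c in range(d)]
           for row in weights]
    return out, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens)
```

Doubling the image resolution quadruples the token count N, and the score matrix grows by a factor of sixteen, which is the practical limitation the aggregated attention design targets.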
Disclosure of Invention
The invention provides a food recognition method and system based on aggregated attention, which aim to perform efficient and effective global context modeling and, through an aggregated attention module, attend to the positions where targets exist, so as to improve the food recognition effect.
The invention improves the traditional self-attention mechanism and provides a food identification method based on aggregated attention, comprising the following steps:
S1, acquiring a food video shot by a camera and performing frame extraction on the video to obtain food images to be detected;
S2, constructing a food detection backbone network, wherein the backbone network adopts a four-stage pyramid structure and food images are input into it to extract hierarchical features; image downsampling between two stages is performed with average pooling, and, because average pooling discards a large amount of information, linear projection and activation are applied before the average pooling; after the image features enter stage i, dynamic position encoding is first applied to capture relations between different positions, LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the image sequence, LayerNorm is applied again, and finally a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction of the image sequence;
S3, constructing a food detection model, wherein the model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier produce the predicted output;
S4, inputting the food images to be detected into the food detection model to obtain the detection results.
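Step S1 above (frame extraction from the captured video) can be sketched as simple stride-based sampling; the helper name and stride value below are illustrative assumptions, not part of the patent:

```python
def extract_frames(total_frames, stride):
    """Return the indices of frames to keep when sampling every
    `stride`-th frame from a video with `total_frames` frames."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return list(range(0, total_frames, stride))

# e.g. a 2-second clip at 30 fps, keeping one frame out of every 15
indices = extract_frames(60, 15)
```

Each selected frame would then be passed to the detection model as one food image to be identified.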
Preferably, in the downsampling module of S2, given an input image X, the downsampling process can be represented as X' = AvgPool(σ(Linear(X))): average pooling performs the spatial downsampling and, because average pooling loses a large amount of information, a linear projection and activation σ are applied before it. After the image feature X is input into stage i, position information is recorded by dynamic position encoding: X ← X + DPE(X). LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the sequence, LayerNorm is applied again, and finally the sequence is processed by a convolutional gating module. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection is passed through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction. The result of the element-wise multiplication forms the module output, to which the input image features are added as a residual.
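A minimal per-channel sketch of the downsampling step described above, under stated assumptions: a scalar weight stands in for the learned linear projection, and ReLU stands in for the activation, which the patent does not specify:

```python
def downsample(channel, w=1.0, b=0.0):
    """Downsample one H x W channel (nested lists of floats):
    scalar 'linear projection' w*x + b, activation (ReLU stand-in),
    then 2x2 average pooling halving each spatial dimension."""
    projected = [[max(0.0, w * v + b) for v in row] for row in channel]
    H, W = len(projected), len(projected[0])
    return [[(projected[i][j] + projected[i][j + 1]
              + projected[i + 1][j] + projected[i + 1][j + 1]) / 4.0
             for j in range(0, W, 2)]
            for i in range(0, H, 2)]

pooled = downsample([[1.0, 2.0], [3.0, 4.0]])
```

The point of projecting and activating first is that the pooled average is then taken over a learned, nonlinear transform of the pixels rather than the raw values, so less useful information is averaged away.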
The invention also provides a food recognition system based on aggregated attention, comprising a food image acquisition module and a food recognition module. The image acquisition module is responsible for shooting food videos and performing frame extraction to obtain food images to be detected; the images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; it consists of a backbone network, an average pooling layer, and a fully connected classifier, with the backbone network consisting of downsampling modules and attention aggregation modules. The backbone network adopts a pyramid structure in which image processing is divided into four stages, the i-th stage being formed by stacking N_i aggregated attention modules; hierarchical features are extracted from the images through the backbone network, the features are passed through the average pooling layer, and the food result is finally predicted by the fully connected classifier.
Compared with the prior art, the invention has the following technical effects:
In the technical scheme provided by the invention, the designed aggregated attention module focuses on the positions where targets exist, so global context modeling can be realized efficiently and effectively; meanwhile, the linear projection and activation performed before average pooling reduce the information lost during image downsampling, ensuring the accuracy of the detection results.
Drawings
FIG. 1 is a flow chart of a method for identifying food based on aggregated attention according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a backbone network architecture based on the aggregated attention module according to the present invention.
Detailed Description
The invention aims to provide a food recognition method and system based on aggregated attention. The method extracts image features by constructing an aggregated attention module and provides a backbone network with this module as its main component. The backbone network adopts a pyramid structure comprising four stages; the images are downsampled between stages to increase the number of channels and reduce the resolution, and image features are extracted through the backbone network, achieving efficient food recognition.
Referring to fig. 1, in an embodiment of the present application, a method for identifying food based on aggregated attention is provided:
S1, acquiring a food video shot by a camera and performing frame extraction on the video to obtain food images to be detected;
S2, constructing a food detection backbone network, wherein the backbone network adopts a four-stage pyramid structure and food images are input into it to extract hierarchical features; image downsampling between two stages is performed with average pooling, and, because average pooling discards a large amount of information, linear projection and activation are applied before the average pooling; after the image features enter stage i, dynamic position encoding is first applied to capture relations between different positions, LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the image sequence, LayerNorm is applied again, and finally a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction of the image sequence;
S3, constructing a food detection model, wherein the model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier produce the predicted output;
S4, inputting the food images to be detected into the food detection model to obtain the detection results.
Further, as shown in fig. 2, after a food image is obtained from the camera, the 3-channel, 600×600-resolution food image to be identified is input into the backbone network to extract hierarchical features. The image sequence is downsampled between every two stages, successively increasing the number of channels and reducing the spatial dimensions, and each stage is formed by stacking several aggregated attention modules.
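Under the pyramid described above (resolution halved and channel count doubled between stages), the per-stage feature-map shapes can be computed as below. The initial 4× stem reduction and base width of 64 channels are illustrative assumptions, since the patent fixes only the 600×600×3 input:

```python
def stage_shapes(h, w, c0=64, stem=4, stages=4):
    """Per-stage (H, W, C) for a pyramid backbone: a stem reduces
    resolution by `stem`, then each stage transition halves H and W
    (integer division) and doubles the channel count C."""
    shapes = []
    h, w, c = h // stem, w // stem, c0
    for _ in range(stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

shapes = stage_shapes(600, 600)
```

With these assumptions, the four stages would operate on 150×150, 75×75, 37×37, and 18×18 grids, so the expensive attention computation always runs on progressively fewer tokens.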
Further, in the downsampling module of S2, given the input image X ∈ R^(N×C) (where N is the number of tokens and C the number of channels), the downsampling process may be represented as X' = AvgPool(σ(Linear(X))), in which the average pooling layer acts as the downsampling operator and the linear projection and activation σ are applied beforehand to reduce information loss.
Further, in the aggregated attention module of S2, after the image feature X is input into stage i, position information is recorded by dynamic position encoding: X ← X + DPE(X). LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the sequence, LayerNorm is applied again, and finally the sequence is processed by a convolutional gating module. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection is passed through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction. The result of the element-wise multiplication forms the module output, to which the input image features are added as a residual.
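The convolutional gating module described above can be sketched in a simplified one-dimensional form: one projection passes through a 3-tap depthwise convolution and an activation, the two projections are multiplied element-wise, and the input is added back. The scalar weights, fixed kernel, and ReLU below are placeholders for learned parameters and the unspecified activation:

```python
def conv_gate(x, w_gate=1.0, w_val=1.0, kernel=(0.25, 0.5, 0.25)):
    """Simplified 1-D convolutional gating: the gating projection is
    passed through a 3-tap depthwise convolution (zero padding) and
    an activation (ReLU stand-in), then multiplied element-wise with
    the second projection; the input is added back as a residual."""
    gate = [w_gate * v for v in x]           # projection for the gating branch
    val = [w_val * v for v in x]             # second linear projection
    padded = [0.0] + gate + [0.0]            # zero-pad for the 3-tap conv
    conv = [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]
    activated = [max(0.0, v) for v in conv]  # activation after the conv
    return [a * v + r for a, v, r in zip(activated, val, x)]

out = conv_gate([1.0, 2.0, 3.0])
```

The depthwise convolution lets each gate value depend on a small neighborhood rather than a single position, which is what "enhancing feature extraction before the activation" amounts to in this sketch.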
The embodiment provides a food recognition system based on aggregated attention, comprising a food image acquisition module and a food recognition module. The image acquisition module is responsible for shooting food videos and performing frame extraction to obtain food images to be detected; the images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; it consists of a backbone network, an average pooling layer, and a fully connected classifier, with the backbone network consisting of downsampling modules and attention aggregation modules. The backbone network adopts a pyramid structure in which image processing is divided into four stages, the i-th stage being formed by stacking N_i aggregated attention modules; hierarchical features are extracted from the images through the backbone network, the features are passed through the average pooling layer, and the result is finally predicted by the fully connected classifier.
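The prediction head described above (average pooling over the final feature tokens followed by a fully connected classifier) reduces to a short sketch; the token values, class count, and weights below are illustrative, not taken from the patent:

```python
def classify(features, weights, biases):
    """Global average pooling over N token feature vectors, then a
    fully connected layer producing one logit per food class;
    returns the index of the highest-scoring class."""
    d = len(features[0])
    pooled = [sum(tok[c] for tok in features) / len(features)
              for c in range(d)]
    logits = [sum(w_c * p for w_c, p in zip(w_row, pooled)) + b
              for w_row, b in zip(weights, biases)]
    return max(range(len(logits)), key=lambda i: logits[i])

# two tokens of dimension 2, two hypothetical food classes
pred = classify([[1.0, 0.0], [1.0, 4.0]],
                weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
```

Averaging over all tokens before the classifier makes the head independent of the spatial grid size, so the same classifier works for any input resolution the backbone accepts.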
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art could make modifications and improvements without departing from the inventive concept, and such modifications fall within the scope of the present invention.

Claims (3)

1. A food identification method based on aggregated attention, comprising the following steps:
S1, acquiring a food video shot by a camera and performing frame extraction on the video to obtain food images to be detected;
S2, constructing a food detection backbone network, wherein the backbone network adopts a four-stage pyramid structure and food images are input into it to extract hierarchical features; image downsampling between two stages is performed with average pooling, and, because average pooling discards a large amount of information, linear projection and activation are applied before the average pooling; after the image features enter stage i, dynamic position encoding is first applied to capture relations between different positions, LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the image sequence, LayerNorm is applied again, and finally a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction of the image sequence;
S3, constructing a food detection model, wherein the model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier produce the predicted output;
S4, inputting the food images to be detected into the food detection model to obtain the detection results.
2. The method of claim 1, wherein, in the downsampling module of S2, given the input image X, the downsampling process is represented as X' = AvgPool(σ(Linear(X))); after the image feature X is input into stage i, position information is recorded by dynamic position encoding: X ← X + DPE(X); LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the sequence, LayerNorm is applied again, and finally the sequence is processed by a convolutional gating module, which consists of two linear projections multiplied element-wise, one projection being passed through an activation function with a 3×3 depthwise convolution applied before the activation to enhance feature extraction; the result of the element-wise multiplication forms the module output, to which the input image features are added as a residual.
3. The food recognition system based on aggregated attention, characterized by comprising a food image acquisition module and a food recognition module, wherein the image acquisition module is responsible for shooting food videos and performing frame extraction on the videos to obtain food images to be detected; the images are input into the food recognition module for recognition and classification; a food detection model is built into the food recognition module, the food detection model consisting of a backbone network, an average pooling layer, and a fully connected classifier, the backbone network consisting of downsampling modules and attention aggregation modules, with the average pooling layer and fully connected classifier producing the predicted output; the backbone network adopts a pyramid structure in which image processing is divided into four stages, the i-th stage being formed by stacking N_i aggregated attention modules, and hierarchical features are extracted from the images through the backbone network; the image features are passed through the average pooling layer, and the result is finally predicted by the fully connected classifier.
CN202410330639.XA 2024-03-22 2024-03-22 Food identification method and system based on aggregated attention Pending CN117935251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410330639.XA CN117935251A (en) 2024-03-22 2024-03-22 Food identification method and system based on aggregated attention

Publications (1)

Publication Number Publication Date
CN117935251A 2024-04-26

Family

ID=90752352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410330639.XA Pending CN117935251A (en) 2024-03-22 2024-03-22 Food identification method and system based on aggregated attention

Country Status (1)

Country Link
CN (1) CN117935251A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538347A (en) * 2021-06-29 2021-10-22 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN114973049A (en) * 2022-01-05 2022-08-30 上海人工智能创新中心 Lightweight video classification method for unifying convolution and self attention
CN115565066A (en) * 2022-09-26 2023-01-03 北京理工大学 SAR image ship target detection method based on Transformer
CN116188836A (en) * 2022-12-14 2023-05-30 长沙理工大学 Remote sensing image classification method and device based on space and channel feature extraction
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
CN116824694A (en) * 2023-06-06 2023-09-29 西安电子科技大学 Action recognition system and method based on time sequence aggregation and gate control transducer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhuanzhi (专知): "[ICLR2022] UniFormer: a more efficient spatiotemporal representation learning framework seamlessly integrating Transformers", Zhihu, 17 February 2022 (2022-02-17), pages 3-4 *
Jishi Platform (极市平台): "TransNeXt: yesterday's strongest model is no longer the strongest; TransNeXt-Tiny reaches 84.0% accuracy on ImageNet", Zhihu, 4 December 2023 (2023-12-04), pages 4-7 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination