CN117935251A - Food identification method and system based on aggregated attention - Google Patents

Food identification method and system based on aggregated attention

Info

Publication number
CN117935251A
CN117935251A (application CN202410330639.XA)
Authority
CN
China
Prior art keywords
food
image
attention
module
average pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410330639.XA
Other languages
Chinese (zh)
Inventor
李忠涛
赵光龙
李雅其
王婉露
张玉璘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202410330639.XA priority Critical patent/CN117935251A/en
Publication of CN117935251A publication Critical patent/CN117935251A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a food identification method and system based on aggregated attention, and relates to the field of computer vision. The invention extracts image features through a constructed attention aggregation module and provides a backbone network with the attention aggregation module as its main component. The backbone network adopts a pyramid structure comprising four stages; between stages the image is downsampled to increase the number of channels and reduce the resolution. Image features are extracted through this backbone network, achieving efficient food identification.

Description

Food identification method and system based on aggregated attention
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a food identification method and system based on aggregated attention.
Background
Vision Transformers have become a popular backbone architecture for various computer vision tasks in recent years. The ViT model comprises two key components: the self-attention layer and the MLP layer. The self-attention mechanism plays a crucial role in feature extraction, dynamically generating an association matrix through similarity computation between Query and Key. This global information aggregation method has remarkable feature-extraction potential and can build strong data-driven models. However, the Vision Transformer encoder design was originally developed for language modeling and presents inherent limitations in downstream computer vision tasks. In particular, computing the global self-attention correlation matrix is challenging due to its quadratic complexity and high memory consumption, which limits its application to high-resolution image features.
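The quadratic complexity noted above comes from the N × N attention matrix itself. A minimal pure-Python sketch (with Query, Key, and Value all taken to be the raw tokens, omitting the learned projections of a real ViT) makes the cost visible:

```python
import math

def self_attention(X):
    """Single-head self-attention over N tokens of dimension d.

    Builds the full N x N score matrix, which is the source of the
    quadratic memory and compute cost discussed above. Query, Key,
    and Value are the tokens themselves here for brevity.
    """
    N, d = len(X), len(X[0])
    # N x N similarity matrix: O(N^2) entries
    scores = [[sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d)
               for k in X] for q in X]
    # row-wise softmax turns scores into attention weights
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # each output token is a weighted sum of all value vectors
    out = [[sum(w * X[j][c] for j, w in enumerate(row)) for c in range(d)]
           for row in weights]
    return out, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens)
```

Doubling the image resolution quadruples the token count N, and the score matrix grows by a factor of sixteen, which is the practical limitation the aggregated attention design targets.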
Disclosure of Invention
The invention provides a food recognition method and system based on aggregated attention, which aim to perform efficient and effective global context modeling and, through an aggregated attention module, attend to the positions where targets exist, so as to improve the food recognition effect.
The invention improves the traditional self-attention mechanism and provides a food identification method based on aggregated attention, comprising the following steps:
S1, acquiring a food video shot by a camera and performing frame extraction on the video to obtain food images to be detected;
S2, constructing a food detection backbone network, wherein the backbone network adopts a four-stage pyramid structure and food images are input into it to extract hierarchical features; image downsampling between two stages is performed with average pooling, and, because average pooling discards a large amount of information, linear projection and activation are applied before the average pooling; after the image features enter stage i, dynamic position encoding is first applied to capture relations between different positions, LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the image sequence, LayerNorm is applied again, and finally a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction of the image sequence;
S3, constructing a food detection model, wherein the model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier produce the predicted output;
S4, inputting the food images to be detected into the food detection model to obtain the detection results.
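Step S1 above (frame extraction from the captured video) can be sketched as simple stride-based sampling; the helper name and stride value below are illustrative assumptions, not part of the patent:

```python
def extract_frames(total_frames, stride):
    """Return the indices of frames to keep when sampling every
    `stride`-th frame from a video with `total_frames` frames."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return list(range(0, total_frames, stride))

# e.g. a 2-second clip at 30 fps, keeping one frame out of every 15
indices = extract_frames(60, 15)
```

Each selected frame would then be passed to the detection model as one food image to be identified.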
Preferably, in the downsampling module of S2, given an input image X, the downsampling process can be represented as X' = AvgPool(σ(Linear(X))): average pooling performs the spatial downsampling and, because average pooling loses a large amount of information, a linear projection and activation σ are applied before it. After the image feature X is input into stage i, position information is recorded by dynamic position encoding: X ← X + DPE(X). LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the sequence, LayerNorm is applied again, and finally the sequence is processed by a convolutional gating module. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection is passed through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction. The result of the element-wise multiplication forms the module output, to which the input image features are added as a residual.
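A minimal per-channel sketch of the downsampling step described above, under stated assumptions: a scalar weight stands in for the learned linear projection, and ReLU stands in for the activation, which the patent does not specify:

```python
def downsample(channel, w=1.0, b=0.0):
    """Downsample one H x W channel (nested lists of floats):
    scalar 'linear projection' w*x + b, activation (ReLU stand-in),
    then 2x2 average pooling halving each spatial dimension."""
    projected = [[max(0.0, w * v + b) for v in row] for row in channel]
    H, W = len(projected), len(projected[0])
    return [[(projected[i][j] + projected[i][j + 1]
              + projected[i + 1][j] + projected[i + 1][j + 1]) / 4.0
             for j in range(0, W, 2)]
            for i in range(0, H, 2)]

pooled = downsample([[1.0, 2.0], [3.0, 4.0]])
```

The point of projecting and activating first is that the pooled average is then taken over a learned, nonlinear transform of the pixels rather than the raw values, so less useful information is averaged away.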
The invention also provides a food recognition system based on aggregated attention, comprising a food image acquisition module and a food recognition module. The image acquisition module is responsible for shooting food videos and performing frame extraction to obtain food images to be detected; the images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; it consists of a backbone network, an average pooling layer, and a fully connected classifier, with the backbone network consisting of downsampling modules and attention aggregation modules. The backbone network adopts a pyramid structure in which image processing is divided into four stages, the i-th stage being formed by stacking N_i aggregated attention modules; hierarchical features are extracted from the images through the backbone network, the features are passed through the average pooling layer, and the food result is finally predicted by the fully connected classifier.
Compared with the prior art, the invention has the following technical effects:
In the technical scheme provided by the invention, the designed aggregated attention module focuses on the positions where targets exist, so global context modeling can be realized efficiently and effectively; meanwhile, the linear projection and activation performed before average pooling reduce the information lost during image downsampling, ensuring the accuracy of the detection results.
Drawings
FIG. 1 is a flow chart of a method for identifying food based on aggregated attention according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a backbone network architecture based on the aggregated attention module according to the present invention.
Detailed Description
The invention aims to provide a food recognition method and system based on aggregated attention. The method extracts image features by constructing an aggregated attention module and provides a backbone network with this module as its main component. The backbone network adopts a pyramid structure comprising four stages; the images are downsampled between stages to increase the number of channels and reduce the resolution, and image features are extracted through the backbone network, achieving efficient food recognition.
Referring to fig. 1, in an embodiment of the present application, a method for identifying food based on aggregated attention is provided:
S1, acquiring a food video shot by a camera and performing frame extraction on the video to obtain food images to be detected;
S2, constructing a food detection backbone network, wherein the backbone network adopts a four-stage pyramid structure and food images are input into it to extract hierarchical features; image downsampling between two stages is performed with average pooling, and, because average pooling discards a large amount of information, linear projection and activation are applied before the average pooling; after the image features enter stage i, dynamic position encoding is first applied to capture relations between different positions, LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the image sequence, LayerNorm is applied again, and finally a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction of the image sequence;
S3, constructing a food detection model, wherein the model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier produce the predicted output;
S4, inputting the food images to be detected into the food detection model to obtain the detection results.
Further, as shown in fig. 2, after a food image is obtained from the camera, the 3-channel, 600×600-resolution food image to be identified is input into the backbone network to extract hierarchical features. The image sequence is downsampled between every two stages, successively increasing the number of channels and reducing the spatial dimensions, and each stage is formed by stacking several aggregated attention modules.
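Under the pyramid described above (resolution halved and channel count doubled between stages), the per-stage feature-map shapes can be computed as below. The initial 4× stem reduction and base width of 64 channels are illustrative assumptions, since the patent fixes only the 600×600×3 input:

```python
def stage_shapes(h, w, c0=64, stem=4, stages=4):
    """Per-stage (H, W, C) for a pyramid backbone: a stem reduces
    resolution by `stem`, then each stage transition halves H and W
    (integer division) and doubles the channel count C."""
    shapes = []
    h, w, c = h // stem, w // stem, c0
    for _ in range(stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

shapes = stage_shapes(600, 600)
```

With these assumptions, the four stages would operate on 150×150, 75×75, 37×37, and 18×18 grids, so the expensive attention computation always runs on progressively fewer tokens.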
Further, in the downsampling module of S2, given the input image X ∈ R^(N×C) (where N is the number of tokens and C the number of channels), the downsampling process may be represented as X' = AvgPool(σ(Linear(X))), in which the average pooling layer acts as the downsampling operator and the linear projection and activation σ are applied beforehand to reduce information loss.
Further, in the aggregated attention module of S2, after the image feature X is input into stage i, position information is recorded by dynamic position encoding: X ← X + DPE(X). LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the sequence, LayerNorm is applied again, and finally the sequence is processed by a convolutional gating module. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection is passed through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction. The result of the element-wise multiplication forms the module output, to which the input image features are added as a residual.
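The convolutional gating module described above can be sketched in a simplified one-dimensional form: one projection passes through a 3-tap depthwise convolution and an activation, the two projections are multiplied element-wise, and the input is added back. The scalar weights, fixed kernel, and ReLU below are placeholders for learned parameters and the unspecified activation:

```python
def conv_gate(x, w_gate=1.0, w_val=1.0, kernel=(0.25, 0.5, 0.25)):
    """Simplified 1-D convolutional gating: the gating projection is
    passed through a 3-tap depthwise convolution (zero padding) and
    an activation (ReLU stand-in), then multiplied element-wise with
    the second projection; the input is added back as a residual."""
    gate = [w_gate * v for v in x]           # projection for the gating branch
    val = [w_val * v for v in x]             # second linear projection
    padded = [0.0] + gate + [0.0]            # zero-pad for the 3-tap conv
    conv = [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]
    activated = [max(0.0, v) for v in conv]  # activation after the conv
    return [a * v + r for a, v, r in zip(activated, val, x)]

out = conv_gate([1.0, 2.0, 3.0])
```

The depthwise convolution lets each gate value depend on a small neighborhood rather than a single position, which is what "enhancing feature extraction before the activation" amounts to in this sketch.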
The embodiment provides a food recognition system based on aggregated attention, comprising a food image acquisition module and a food recognition module. The image acquisition module is responsible for shooting food videos and performing frame extraction to obtain food images to be detected; the images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; it consists of a backbone network, an average pooling layer, and a fully connected classifier, with the backbone network consisting of downsampling modules and attention aggregation modules. The backbone network adopts a pyramid structure in which image processing is divided into four stages, the i-th stage being formed by stacking N_i aggregated attention modules; hierarchical features are extracted from the images through the backbone network, the features are passed through the average pooling layer, and the result is finally predicted by the fully connected classifier.
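The prediction head described above (average pooling over the final feature tokens followed by a fully connected classifier) reduces to a short sketch; the token values, class count, and weights below are illustrative, not taken from the patent:

```python
def classify(features, weights, biases):
    """Global average pooling over N token feature vectors, then a
    fully connected layer producing one logit per food class;
    returns the index of the highest-scoring class."""
    d = len(features[0])
    pooled = [sum(tok[c] for tok in features) / len(features)
              for c in range(d)]
    logits = [sum(w_c * p for w_c, p in zip(w_row, pooled)) + b
              for w_row, b in zip(weights, biases)]
    return max(range(len(logits)), key=lambda i: logits[i])

# two tokens of dimension 2, two hypothetical food classes
pred = classify([[1.0, 0.0], [1.0, 4.0]],
                weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
```

Averaging over all tokens before the classifier makes the head independent of the spatial grid size, so the same classifier works for any input resolution the backbone accepts.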
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art could make modifications and improvements without departing from the inventive concept, and such modifications fall within the scope of the present invention.

Claims (3)

1. A food identification method based on aggregated attention, comprising the following steps:
S1, acquiring a food video shot by a camera and performing frame extraction on the video to obtain food images to be detected;
S2, constructing a food detection backbone network, wherein the backbone network adopts a four-stage pyramid structure and food images are input into it to extract hierarchical features; image downsampling between two stages is performed with average pooling, and, because average pooling discards a large amount of information, linear projection and activation are applied before the average pooling; after the image features enter stage i, dynamic position encoding is first applied to capture relations between different positions, LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the image sequence, LayerNorm is applied again, and finally a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction of the image sequence;
S3, constructing a food detection model, wherein the model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier produce the predicted output;
S4, inputting the food images to be detected into the food detection model to obtain the detection results.
2. The method of claim 1, wherein, in the downsampling module of S2, given the input image X, the downsampling process is represented as X' = AvgPool(σ(Linear(X))); after the image feature X is input into stage i, position information is recorded by dynamic position encoding: X ← X + DPE(X); LayerNorm is then applied for normalization, a multi-head self-attention mechanism performs attention computation within the sequence, LayerNorm is applied again, and finally the sequence is processed by a convolutional gating module, which consists of two linear projections multiplied element-wise, one projection being passed through an activation function with a 3×3 depthwise convolution applied before the activation to enhance feature extraction; the result of the element-wise multiplication forms the module output, to which the input image features are added as a residual.
3. The food recognition system based on aggregated attention, characterized by comprising a food image acquisition module and a food recognition module, wherein the image acquisition module is responsible for shooting food videos and performing frame extraction on the videos to obtain food images to be detected; the images are input into the food recognition module for recognition and classification; a food detection model is built into the food recognition module, the food detection model consisting of a backbone network, an average pooling layer, and a fully connected classifier, the backbone network consisting of downsampling modules and attention aggregation modules, with the average pooling layer and fully connected classifier producing the predicted output; the backbone network adopts a pyramid structure in which image processing is divided into four stages, the i-th stage being formed by stacking N_i aggregated attention modules, and hierarchical features are extracted from the images through the backbone network; the image features are passed through the average pooling layer, and the result is finally predicted by the fully connected classifier.
CN202410330639.XA 2024-03-22 2024-03-22 Food identification method and system based on aggregated attention Pending CN117935251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410330639.XA CN117935251A (en) 2024-03-22 2024-03-22 Food identification method and system based on aggregated attention

Publications (1)

Publication Number Publication Date
CN117935251A 2024-04-26

Family

ID=90752352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410330639.XA Pending CN117935251A (en) 2024-03-22 2024-03-22 Food identification method and system based on aggregated attention

Country Status (1)

Country Link
CN (1) CN117935251A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538347A (en) * 2021-06-29 2021-10-22 中国电子科技集团公司电子科学研究院 Image detection method and system based on efficient bidirectional path aggregation attention network
CN114973049A (en) * 2022-01-05 2022-08-30 上海人工智能创新中心 Lightweight video classification method for unifying convolution and self attention
CN115565066A (en) * 2022-09-26 2023-01-03 北京理工大学 SAR image ship target detection method based on Transformer
CN116188836A (en) * 2022-12-14 2023-05-30 长沙理工大学 Remote sensing image classification method and device based on space and channel feature extraction
CN116703980A (en) * 2023-08-04 2023-09-05 南昌工程学院 Target tracking method and system based on pyramid pooling transducer backbone network
CN116824694A (en) * 2023-06-06 2023-09-29 西安电子科技大学 Action recognition system and method based on time sequence aggregation and gate control transducer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhuanzhi (专知): "[ICLR2022] UniFormer: a more efficient spatiotemporal representation learning framework seamlessly integrating Transformers", Zhihu, 17 February 2022 (2022-02-17), pages 3-4 *
Jishi Platform (极市平台): "TransNeXt: yesterday's strongest model is no longer the strongest; TransNeXt-Tiny reaches 84.0% accuracy on ImageNet", Zhihu, 4 December 2023 (2023-12-04), pages 4-7 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination