CN117935251A - Food identification method and system based on aggregated attention - Google Patents
- Publication number
- CN117935251A (Application CN202410330639.XA)
- Authority
- CN
- China
- Prior art keywords
- food
- image
- attention
- module
- average pooling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a food identification method and system based on aggregated attention, relating to the field of computer vision. Image features are extracted by constructing an attention aggregation module, and a backbone network is provided with the attention aggregation module as its main component. The backbone network adopts a four-stage pyramid structure; between stages the image is downsampled to increase the number of channels and reduce the resolution. Extracting image features through this backbone network makes food identification efficient.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a food identification method and system based on aggregated attention.
Background
Vision Transformer has become a popular backbone architecture for a variety of computer vision tasks in recent years. The ViT model comprises two key components: the self-attention layer and the MLP layer. The self-attention mechanism plays a crucial role in feature extraction, dynamically generating an association matrix by computing similarities between Query and Key. This global information aggregation approach has remarkable potential for feature extraction and can build strong data-driven models. However, the Vision Transformer encoder was originally designed for language modeling and has inherent limitations in downstream computer vision tasks. In particular, computing the global self-attention correlation matrix is challenging because of its quadratic complexity and high memory consumption, which limits its application to high-resolution image features.
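The quadratic cost mentioned above can be made concrete with a minimal sketch (not from the patent; a generic single-head illustration with arbitrary dimensions):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over N tokens. The (N, N) score matrix
    built from Query-Key similarities is the step whose compute and memory
    cost grow quadratically with the number of tokens."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # shape (N, N)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 196, 64                              # e.g. a 14x14 patch grid
x = rng.standard_normal((n_tokens, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out = self_attention(x, wq, wk, wv)
```

Doubling the image side length quadruples N and multiplies the size of the `(N, N)` score matrix by sixteen, which is why high-resolution features are problematic.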
Disclosure of Invention
The invention provides a food recognition method and system based on aggregated attention, which aim to perform efficient and effective global context modeling and to attend to the locations where targets exist through an aggregated attention module, thereby improving food recognition.
The invention improves on the traditional self-attention mechanism and provides a food identification method based on aggregated attention, comprising the following steps:
S1, acquiring food video shot by a camera and extracting frames from the video to obtain food images to be detected;
S2, constructing a food detection backbone network. The backbone network adopts a four-stage pyramid structure; food images are input into the backbone network to extract hierarchical features. Image downsampling between two stages is performed with average pooling; because average pooling alone loses a large amount of information, linear projection and activation are applied before it. After image features enter stage i, dynamic position encoding is first applied to capture relationships between different positions; LayerNorm is then applied for normalization; a multi-head self-attention mechanism performs attention computation within the image sequence; LayerNorm is applied again; finally, a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction process for the image sequence;
S3, constructing a food detection model. The model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier predict and output the result;
S4, inputting the food image to be detected into the food detection model to obtain the detection result.
Preferably, for the downsampling module in S2, given the input image X, average pooling performs the spatial downsampling; because average pooling loses a large amount of information, linear projection and activation operations are applied before it. After the image feature X enters stage i, position information is recorded by dynamic position encoding; LayerNorm then normalizes the features; a multi-head self-attention mechanism performs attention computation within the sequence; LayerNorm is applied again for normalization; finally, a convolutional gating module processes the sequence. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection passes through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction; the result of the element-wise multiplication, added to the input image features, forms the module's output.
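The between-stage downsampling described above can be sketched as follows (a minimal illustration; ReLU is an assumed activation and the 2×2 pooling window is an assumption, since the patent names neither):

```python
import numpy as np

def downsample(x, weight, bias):
    """Between-stage downsampling as described: a linear projection and an
    activation are applied before average pooling, so channel mixing happens
    before pooling discards spatial detail.
    x: (H, W, C_in); weight: (C_in, C_out). H and W assumed even."""
    x = x @ weight + bias                    # linear projection (widens channels)
    x = np.maximum(x, 0.0)                   # activation (ReLU assumed)
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))  # 2x2 avg pool

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 8, 16))
w = rng.standard_normal((16, 32)) * 0.1
b = np.zeros(32)
small = downsample(feat, w, b)               # spatial size halves, channels double
```

Applying the projection first means the pooled values already carry mixed-channel information, which is the stated motivation for ordering the operations this way.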
The invention also provides a food recognition system based on aggregated attention, comprising a food image acquisition module and a food recognition module. The food image acquisition module is responsible for shooting food video and extracting frames from the video to obtain a number of food images to be detected. The video images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; the model consists of a backbone network, an average pooling layer, and a fully connected classifier. The backbone network consists of downsampling modules and attention aggregation modules, while the average pooling layer and fully connected classifier predict and output the result. The backbone network adopts a pyramid structure with the image processing divided into four stages; the i-th stage is formed by stacking aggregated attention modules, and hierarchical features are extracted from the image through the backbone network. The image features are then fed into the average pooling layer for average pooling, and finally the food result is predicted by the fully connected classifier.
Compared with the prior art, the invention has the following technical effects:
According to the technical scheme provided by the invention, the designed aggregated attention module focuses on the locations where targets exist, so that global context modeling can be realized efficiently and effectively; at the same time, performing linear projection and activation before average pooling reduces the information lost when the image is downsampled, ensuring the accuracy of the detection result.
Drawings
FIG. 1 is a flow chart of a method for identifying food based on aggregated attention according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a backbone network architecture based on the aggregated attention module according to the present invention.
Detailed Description
The invention aims to provide a food recognition method and system based on aggregated attention. The method extracts image features by constructing an aggregated attention module, and a backbone network is provided with the aggregated attention module as its main component. The backbone network adopts a four-stage pyramid structure; between stages the images are downsampled to increase the number of channels and reduce the resolution. Extracting image features through the backbone network makes food identification efficient.
Referring to fig. 1, in an embodiment of the present application, a method for identifying food based on aggregated attention is provided:
S1, acquiring food video shot by a camera and extracting frames from the video to obtain food images to be detected;
S2, constructing a food detection backbone network. The backbone network adopts a four-stage pyramid structure; food images are input into the backbone network to extract hierarchical features. Image downsampling between two stages is performed with average pooling; because average pooling alone loses a large amount of information, linear projection and activation are applied before it. After image features enter stage i, dynamic position encoding is first applied to capture relationships between different positions; LayerNorm is then applied for normalization; a multi-head self-attention mechanism performs attention computation within the image sequence; LayerNorm is applied again; finally, a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction process for the image sequence;
S3, constructing a food detection model. The model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier predict and output the result;
S4, inputting the food image to be detected into the food detection model to obtain the detection result.
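The per-block order in step S2 (position encoding, then normalized attention, then normalized gating) can be sketched as below. The residual additions are an assumption of this sketch; the patent describes the order of operations but does not spell out skip connections:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the channel (last) dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def aggregated_attention_block(x, pos_fn, attn_fn, gate_fn):
    """One block following step S2: dynamic position encoding, LayerNorm +
    multi-head self-attention, LayerNorm + convolutional gating.  The three
    sub-modules are passed in as callables so the block only fixes the order."""
    x = x + pos_fn(x)                 # dynamic position encoding
    x = x + attn_fn(layer_norm(x))    # normalize, then self-attention
    x = x + gate_fn(layer_norm(x))    # normalize, then convolutional gating
    return x

rng = np.random.default_rng(2)
tokens = rng.standard_normal((49, 32))
stub = lambda t: np.zeros_like(t)     # zero-output stubs: shape check only
out = aggregated_attention_block(tokens, stub, stub, stub)
```

With zero-output stubs each residual branch contributes nothing, so the block reduces to the identity; this is just a wiring check, not a trained model.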
Further, as shown in fig. 2, after a food image is obtained from the camera, the 3-channel food image to be identified at 600×600 resolution is input into the backbone network to extract hierarchical features. The image sequence is downsampled between every two stages, so the number of channels increases and the spatial resolution decreases stage by stage; each stage is formed by stacking several aggregated attention modules.
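The four-stage pyramid just described can be sketched end to end. The per-stage depths and the channel-doubling rule are illustrative assumptions (the patent gives neither), and the attention modules are stubbed out to show only the resolution/channel schedule:

```python
import numpy as np

def pyramid_backbone(x, depths=(2, 2, 6, 2)):
    """Four-stage pyramid sketch: each stage stacks aggregated attention
    modules (stubbed as identity here); between stages the spatial size
    halves and the channel count doubles.  x: (H, W, C) with H, W divisible
    by 8.  Depths are illustrative, not taken from the patent."""
    shapes = []
    for i, depth in enumerate(depths):
        for _ in range(depth):
            pass                              # aggregated attention module stub
        shapes.append(x.shape)
        if i < len(depths) - 1:               # downsample between stages
            h, w, c = x.shape
            x = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))  # avg pool
            x = np.concatenate([x, x], axis=-1)  # stand-in for channel-doubling projection
    return x, shapes

feat = np.ones((32, 32, 16))
out, shapes = pyramid_backbone(feat)
```

Running this, `shapes` traces the pyramid: `(32, 32, 16) → (16, 16, 32) → (8, 8, 64) → (4, 4, 128)`.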
Further, for the downsampling module in S2, given the input image X, where N is the number of tokens, the downsampling process uses the average pooling layer as the downsampling operator.
Further, in the aggregated attention module in S2, after the image feature X enters stage i, position information is recorded by dynamic position encoding. LayerNorm then normalizes the features; a multi-head self-attention mechanism performs attention computation within the sequence; LayerNorm is applied again for normalization; finally, a convolutional gating module processes the sequence. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection passes through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction; the result of the element-wise multiplication, added to the input image features, forms the module's output.
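The convolutional gating module described above can be sketched as follows. GELU is an assumed choice for the activation (the patent only says "activation function"), and the output projection is an assumption of this sketch:

```python
import numpy as np

def depthwise_conv3x3(x, kernel):
    """3x3 depthwise convolution with zero padding.  x: (H, W, C); kernel: (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w] * kernel[i, j]
    return out

def conv_gating(x, w_gate, w_val, w_out, dw_kernel):
    """Two linear projections multiplied element-wise; the gate branch gets a
    3x3 depthwise convolution before its activation, and the product is added
    back to the input features (residual), per the module description."""
    gate = depthwise_conv3x3(x @ w_gate, dw_kernel)
    gate = 0.5 * gate * (1.0 + np.tanh(0.7978845608 * (gate + 0.044715 * gate ** 3)))  # tanh-approx GELU
    gated = gate * (x @ w_val)               # element-wise product of the two branches
    return x + gated @ w_out                 # residual add of the input features

rng = np.random.default_rng(3)
x = rng.standard_normal((7, 7, 16))
w1, w2, w3 = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
k = rng.standard_normal((3, 3, 16)) * 0.1
y = conv_gating(x, w1, w2, w3, k)
```

Placing the depthwise convolution before the activation lets the gate incorporate local spatial context before deciding how strongly to pass each value through.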
This embodiment provides a food recognition system based on aggregated attention, comprising a food image acquisition module and a food recognition module. The food image acquisition module is responsible for shooting food video and extracting frames from the video to obtain a number of food images to be detected. The video images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; the model consists of a backbone network, an average pooling layer, and a fully connected classifier. The backbone network consists of downsampling modules and attention aggregation modules, while the average pooling layer and fully connected classifier predict and output the result. The backbone network adopts a pyramid structure with the image processing divided into four stages; the i-th stage is formed by stacking aggregated attention modules, and hierarchical features are extracted from the image through the backbone network. The image features are then fed into the average pooling layer for average pooling, and finally the result is predicted by the fully connected classifier.
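The prediction head described above (average pooling followed by a fully connected classifier) can be sketched as below; the feature dimensions and the class count of 10 are illustrative, not taken from the patent:

```python
import numpy as np

def predict(features, w_fc, b_fc):
    """Prediction head of the food detection model: global average pooling
    over the backbone's spatial feature map, then a fully connected
    classifier producing one logit per food category."""
    pooled = features.mean(axis=(0, 1))      # (C,) global average pool
    logits = pooled @ w_fc + b_fc            # (num_classes,)
    return int(np.argmax(logits)), logits

rng = np.random.default_rng(4)
feats = rng.standard_normal((4, 4, 128))     # final-stage backbone output
w_fc = rng.standard_normal((128, 10)) * 0.1  # 10 food categories assumed
b_fc = np.zeros(10)
label, logits = predict(feats, w_fc, b_fc)
```

Global average pooling collapses the spatial grid to a single feature vector, so the classifier's parameter count is independent of the input resolution.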
The foregoing is merely a preferred embodiment of the present invention. It should be noted that modifications and improvements can be made by those skilled in the art without departing from the inventive concept, and such modifications and improvements fall within the scope of the present invention.
Claims (3)
1. A food identification method based on aggregated attention, characterized by comprising the following steps:
S1, acquiring food video shot by a camera and extracting frames from the video to obtain food images to be detected;
S2, constructing a food detection backbone network. The backbone network adopts a four-stage pyramid structure; food images are input into the backbone network to extract hierarchical features. Image downsampling between two stages is performed with average pooling; because average pooling alone loses a large amount of information, linear projection and activation are applied before it. After image features enter stage i, dynamic position encoding is first applied to capture relationships between different positions; LayerNorm is then applied for normalization; a multi-head self-attention mechanism performs attention computation within the image sequence; LayerNorm is applied again; finally, a convolutional gating module applies a nonlinear transformation to the sequence, completing the feature extraction process for the image sequence;
S3, constructing a food detection model. The model consists of the backbone network, an average pooling layer, and a fully connected classifier; the backbone network consists of downsampling modules and attention aggregation modules, and the average pooling layer and fully connected classifier predict and output the result;
S4, inputting the food image to be detected into the food detection model to obtain the detection result.
2. The method of claim 1, wherein for the downsampling module in S2, given the input image X, average pooling performs the spatial downsampling. After the image feature X enters stage i, position information is recorded by dynamic position encoding; LayerNorm then normalizes the features; a multi-head self-attention mechanism performs attention computation within the sequence; LayerNorm is applied again for normalization; finally, a convolutional gating module processes the sequence. The convolutional gating module consists of two linear projections that are multiplied element-wise; one projection passes through an activation function, with a 3×3 depthwise convolution applied before the activation to enhance feature extraction; the result of the element-wise multiplication, added to the input image features, forms the module's output.
3. A food recognition system based on aggregated attention, characterized by comprising a food image acquisition module and a food recognition module. The food image acquisition module is responsible for shooting food video and extracting frames from the video to obtain a number of food images to be detected. The video images are input into the food recognition module for recognition and classification. A food detection model is built into the food recognition module; the model consists of a backbone network, an average pooling layer, and a fully connected classifier. The backbone network consists of downsampling modules and attention aggregation modules, while the average pooling layer and fully connected classifier predict and output the result. The backbone network adopts a pyramid structure with the image processing divided into four stages; the i-th stage is formed by stacking aggregated attention modules, and hierarchical features are extracted from the image through the backbone network. The image features are then fed into the average pooling layer for average pooling, and finally the result is predicted by the fully connected classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410330639.XA CN117935251A (en) | 2024-03-22 | 2024-03-22 | Food identification method and system based on aggregated attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410330639.XA CN117935251A (en) | 2024-03-22 | 2024-03-22 | Food identification method and system based on aggregated attention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117935251A true CN117935251A (en) | 2024-04-26 |
Family
ID=90752352
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410330639.XA Pending CN117935251A (en) | 2024-03-22 | 2024-03-22 | Food identification method and system based on aggregated attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117935251A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113538347A (en) * | 2021-06-29 | 2021-10-22 | 中国电子科技集团公司电子科学研究院 | Image detection method and system based on efficient bidirectional path aggregation attention network |
CN114973049A (en) * | 2022-01-05 | 2022-08-30 | 上海人工智能创新中心 | Lightweight video classification method for unifying convolution and self attention |
CN115565066A (en) * | 2022-09-26 | 2023-01-03 | 北京理工大学 | SAR image ship target detection method based on Transformer |
CN116188836A (en) * | 2022-12-14 | 2023-05-30 | 长沙理工大学 | Remote sensing image classification method and device based on space and channel feature extraction |
CN116703980A (en) * | 2023-08-04 | 2023-09-05 | Target tracking method and system based on pyramid pooling Transformer backbone network |
CN116824694A (en) * | 2023-06-06 | 2023-09-29 | Action recognition system and method based on temporal aggregation and gated Transformer |
- 2024-03-22 CN CN202410330639.XA patent/CN117935251A/en active Pending
Non-Patent Citations (2)
Title |
---|
专知: "[ICLR2022] UniFormer: seamlessly integrating Transformers for a more efficient spatio-temporal representation learning framework", Zhihu, 17 February 2022 (2022-02-17), pages 3-4 *
极市平台: "TransNeXt: yesterday's strongest model is no longer the strongest; TransNeXt-Tiny reaches 84.0% accuracy on ImageNet", Zhihu, 4 December 2023 (2023-12-04), pages 4-7 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111639692B (en) | Shadow detection method based on attention mechanism | |
Li et al. | Tea: Temporal excitation and aggregation for action recognition | |
CN111462126B (en) | Semantic image segmentation method and system based on edge enhancement | |
Shih et al. | Real-time object detection with reduced region proposal network via multi-feature concatenation | |
CN112949673B (en) | Feature fusion target detection and identification method based on global attention | |
Javed et al. | Byte-level object identification for forensic investigation of digital images | |
CN113822246B (en) | Vehicle weight identification method based on global reference attention mechanism | |
CN111666948A (en) | Real-time high-performance semantic segmentation method and device based on multi-path aggregation | |
CN112785626A (en) | Twin network small target tracking method based on multi-scale feature fusion | |
Gangwar et al. | Deepirisnet2: Learning deep-iriscodes from scratch for segmentation-robust visible wavelength and near infrared iris recognition | |
CN113033276A (en) | Behavior recognition method based on conversion module | |
CN114463805B (en) | Deep forgery detection method, device, storage medium and computer equipment | |
Li et al. | Event transformer | |
CN114519383A (en) | Image target detection method and system | |
CN117542045A (en) | Food identification method and system based on space-guided self-attention | |
CN117935251A (en) | Food identification method and system based on aggregated attention | |
Zhang et al. | AG-Net: An advanced general CNN model for steganalysis | |
CN114663861B (en) | Vehicle re-identification method based on dimension decoupling and non-local relation | |
CN116246109A (en) | Multi-scale hole neighborhood attention computing backbone network model and application thereof | |
CN113688783B (en) | Face feature extraction method, low-resolution face recognition method and equipment | |
CN115240121A (en) | Joint modeling method and device for enhancing local features of pedestrians | |
CN114863520A (en) | Video expression recognition method based on C3D-SA | |
CN111242229A (en) | Image identification method based on two-stage information fusion | |
Ma et al. | Rtsnet: Real-time semantic segmentation network for outdoor scenes | |
Culurciello et al. | Clustering learning for robotic vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||