CN115761444A - Training method of incomplete information target recognition model and target recognition method - Google Patents

Training method of incomplete information target recognition model and target recognition method

Info

Publication number
CN115761444A
Authority
CN
China
Prior art keywords
image
target
feature vector
vector
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211480465.2A
Other languages
Chinese (zh)
Other versions
CN115761444B (en)
Inventor
张栩铭
姜舜译
闫淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202211480465.2A priority Critical patent/CN115761444B/en
Publication of CN115761444A publication Critical patent/CN115761444A/en
Application granted granted Critical
Publication of CN115761444B publication Critical patent/CN115761444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

A training method and a target recognition method for an incomplete information target recognition model relate to the technical field of image data processing and data recognition, and address the need for a technique that can recognize targets from incomplete information. The training method comprises the following steps: establishing an image video data set comprising a first image and continuous video frames, and manually annotating the target positions and target types on the images to obtain annotation labels; and training the incomplete target detection model according to the image video data set, the first output feature vector, the first fusion feature vector and the annotation labels. The invention improves the accuracy of the target detection model on incomplete targets and reduces the false detection rate of the model; spatial context enhancement makes the algorithm more robust for incomplete information target detection; and, for video data, the method effectively exploits the temporal context and uses relational modeling of different temporal information to improve target detection accuracy.

Description

Training method of incomplete information target recognition model and target recognition method
Technical Field
The invention relates to the technical field of image data processing and data recognition, in particular to a training method and a target recognition method of an incomplete information target recognition model.
Background
Target recognition is one of the core technologies of new-generation information technology for the intelligent processing of video images, and has attracted extensive attention and application in both the civil and defense fields. Current mainstream video image target recognition techniques achieve good recognition performance only under the ideal assumption that the targets in the image are clearly visible and have distinct contour features. In real application scenarios, however, there are many uncooperative, harsh conditions, such as targets being partially occluded or deliberately camouflaged, so that only local target information, i.e. incomplete target information (abbreviated as incomplete information), can be obtained from the video image, which makes target recognition difficult.
Most existing mainstream target detection methods are based on convolutional neural networks and obtain the target position and type directly by extracting low-level and high-level features of the whole image. Their drawback is that the features of an incomplete target differ greatly from those of the complete target, so an ordinary convolutional neural network cannot recognize it accurately. Since an incomplete target contains only part of the structural information of the target, a technique for recognizing incomplete targets is needed to solve the above problem.
Disclosure of Invention
The invention provides a training method of an incomplete information target recognition model and a target recognition method, aiming at solving the problem of accurately recognizing an incomplete target.
The technical scheme adopted by the invention for solving the technical problem is as follows:
a training method of an incomplete information target recognition model comprises the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and the first image and the continuous video frames both have incomplete targets;
step 2, a plurality of second images of continuous frames in the continuous video frames are obtained, and the target positions and the target types on the first images and the second images are manually marked to obtain marking labels;
step 3, training an incomplete information target recognition model according to the image video data set, the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model;
the first output feature vector and the first fusion feature vector are obtained by the following method:
for a first image, extracting the features of the first image according to the label of the first image to obtain a first output feature vector; and for continuous video frames in the image video data set, performing feature extraction on second images of the continuous frames according to the label tags of the second images to obtain second output feature vectors, and performing feature fusion on the second output feature vectors of the second images of the continuous frames to obtain a first fusion feature vector.
The invention has the beneficial effects that:
according to the training method and the target recognition method of the incomplete information target recognition model, the relationships between the target and the whole image and between the local characteristics of the target are effectively modeled, the accuracy of the target recognition model on the incomplete target is improved, and the false detection rate of the model is reduced; the robustness of the algorithm to incomplete information target detection is better through the enhancement of the spatial context; the method can effectively utilize the context relationship of time domains aiming at the video data, and utilizes the relational modeling of different time domain information to improve the target detection accuracy.
Drawings
FIG. 1 is a schematic diagram of the original Transformer-based target recognition algorithm.
FIG. 2 is a schematic diagram of the improved feature coding network structure of the present invention.
FIG. 3 is a schematic diagram of the Transformer-based, spatio-temporal-context-enhanced target recognition algorithm of the present invention for consecutive video frames.
FIG. 4 is a schematic diagram of the Transformer-based, spatio-temporal-context-enhanced target recognition algorithm of the present invention for single images.
FIG. 5 is a schematic diagram of the target distribution of the test data set according to the present invention.
FIG. 6 is a comparison chart of the detection metrics of the Transformer-based target recognition algorithms according to the present invention.
FIG. 7 shows a comparison of example images of the recognition results of the original Transformer method and the method of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
Fig. 1 is a schematic diagram of the original Transformer-based target recognition algorithm. Because the features of an incomplete target differ greatly from those of the complete target, a general convolutional neural network cannot recognize such a target accurately, and the present embodiment is proposed to address this problem.
The embodiment provides a training method of an incomplete information target recognition model, which comprises the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and the first image and the continuous video frames both have incomplete targets;
step 2, images of the continuous video frames are called second images, a plurality of continuous frame second images in the continuous video frames are obtained, and target positions and target types on the first images and the second images are manually marked to obtain marking labels;
step 3, training an incomplete information target recognition model according to the image video data set, the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model;
the first output feature vector and the first fusion feature vector are obtained by the following method:
for the first image, extracting the features of the first image according to the label of the first image to obtain a first output feature vector; and for continuous video frames in the image video data set, performing feature extraction on the second images of the continuous frames according to the label tags of the second images of the continuous frames to obtain second output feature vectors, and performing feature fusion on the second output feature vectors of the second images of the continuous frames to obtain a first fusion feature vector.
The incomplete information target recognition model obtained in this embodiment is not limited to recognizing only incomplete information targets.
Example two
The embodiment provides a training method of an incomplete information target recognition model, which comprises the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and both the first image and the continuous video frames have incomplete targets and also have complete targets.
Step 2, manually marking the target positions and the target types on the first image and the continuous video frames to obtain marking labels, namely obtaining marked image video data sets; the images in successive video frames are referred to as second images. As an embodiment, the second images of all of the consecutive video frames are manually annotated.
For the continuous video frames in the above step 1 and step 2, the image video data set may also include a video and a first image, the continuous video frames are continuous frames extracted in the video for a certain time period or certain time periods, and the frame extraction proportion of the continuous video frames obtained by extracting the frames in the video is not less than 40% of the total number of frames contained in the video. In the first image, the continuous video frame and the second image of the marked continuous frame, the targets simultaneously comprise a complete information target and an occluded incomplete information target.
Step 3, for the first image, extracting the features of the first image according to the manually marked target type and target position to obtain a first output feature vector; for continuous video frames in the image video data set, performing feature extraction on the second images of consecutive frames within a certain period (for example, 16 frames in an adjacent time domain) according to the manually labeled target type and target position to obtain second output feature vectors, and performing feature fusion (feature splicing) on the second output feature vectors of the second images (16 frames) of the consecutive frames to obtain a first fusion feature vector, thereby realizing temporal context enhancement; the second images subjected to feature extraction in step 3 are the second images manually labeled in step 2.
Feature extraction is typically performed for each image, i.e. each first image and each second image. Preferably, the method of feature extraction for the first image is the same as the method of feature extraction for the second image;
and training the incomplete information target recognition model according to the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model.
And 4, training the incomplete information target recognition model according to the image video data set, the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model.
The trained incomplete information target identification model is used for acquiring the target type and the target position of an incomplete target in a video or an image according to the video or the image, and is also used for acquiring the target type and the target position of a complete target in the video or the image according to the video or the image.
The first output feature vector and the second output feature vector both include spatial context information, specifically the structural relationship information between the components of the target and the spatial relationship information between the target and the original image. The first image in the image video data set and the second images in the continuous video frames of the image video data set are both referred to as original images. The fusion of the second output feature vectors realizes the temporal context enhancement.
In step 1, in order to ensure the generalization ability of the method, the number of images in the image video data set is not less than 1,000,000, and the number of labeled target categories is not less than 1,000.
Fig. 2 is a schematic diagram of a feature extraction network structure, and the specific method for feature extraction in step 3 is as follows:
step 3.1, scaling the original image to a first pixel size in a unified manner, wherein the first pixel size is x p1 ×y p1 ,x p1 And y p1 Are all positive integers, and then divide the image of the first pixel size into N 1 ×M 1 An image grid, N 1 And M 1 Are all integers greater than 2, it is worth noting that for values characterizing pixel size, x is a positive integer p1 Need to be greater than N 1 、y p1 Need to be greater than N 1 And the like, and the obvious value and size relationship is not described in detail herein.
For example, with uniform scaling to 1024 × 1024 pixel size, the image is divided into 16 × 16 image grid areas (i.e., 256 grid areas with 16 rows and 16 columns).
Step 3.2, from the N_1 × M_1 image grids, randomly extracting and discarding image grids at a ratio r, leaving N_1 × M_1 × (1-r) image grids, and performing feature extraction on each of the remaining N_1 × M_1 × (1-r) image grids using a first convolutional neural network to obtain a multi-dimensional first feature vector.
For example, grids at a ratio r are randomly extracted from the 256 image grid areas and discarded, and each remaining image grid is subjected to feature extraction using a first convolutional neural network to obtain a 128-dimensional first feature vector, where 0.3 ≤ r ≤ 0.6; in this embodiment r = 0.5, and the first convolutional neural network is ResNet50.
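As an illustration only (this code does not appear in the patent), a minimal PyTorch sketch of steps 3.1–3.2 might look as follows; the class name GridFeatureExtractor and the default hyper-parameters are assumptions, and the backbone's final layer is simply replaced so that it emits a 128-dimensional vector per grid cell.

```python
import torch
import torch.nn as nn
from torchvision import models

class GridFeatureExtractor(nn.Module):
    """Illustrative sketch of steps 3.1-3.2: split the scaled image into an
    N1 x M1 grid, randomly discard a ratio r of the cells, and encode each
    remaining cell into a 128-dimensional first feature vector."""
    def __init__(self, grid=(16, 16), drop_ratio=0.5, out_dim=128):
        super().__init__()
        self.grid = grid
        self.drop_ratio = drop_ratio
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)  # 128-d per cell
        self.backbone = backbone

    def forward(self, image):            # image: (3, 1024, 1024), already scaled
        n, m = self.grid
        c, h, w = image.shape
        cells = image.reshape(c, n, h // n, m, w // m).permute(1, 3, 0, 2, 4)
        cells = cells.reshape(n * m, c, h // n, w // m)       # (256, 3, 64, 64)
        keep = torch.randperm(n * m)[: int(n * m * (1 - self.drop_ratio))]
        return self.backbone(cells[keep])                     # (kept cells, 128)
```

Randomly dropping grid cells during training exposes the network to partial layouts, which is presumably why a drop ratio as high as 0.6 is tolerated.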
Step 3.3, extracting a target area image from the original image according to the manually marked target type and target position, and scaling the target area image to a second pixel size x_p2 × y_p2, where usually x_p2 < x_p1 and y_p2 < y_p1.
Extracting an image of a target area in the original image, namely a target area screenshot (the target screenshot in fig. 2) according to the manual annotation result, and zooming the target area screenshot to 256 × 256 pixels;
step 3.4, carrying out grid division on the zoomed target area image, and dividing the zoomed target area image into N 2 ×M 2 A grid area, N 2 And M 2 Are integers of 2 or more. For example, the division into 8 × 8 grid regions for a total of 64 grids.
Step 3.5, from the N_2 × M_2 grid areas, randomly extracting and discarding grid areas at a ratio f, and performing feature extraction on the remaining N_2 × M_2 × (1-f) grid areas using a second convolutional neural network to obtain a multi-dimensional second feature vector.
For example, a 128-dimensional second feature vector is obtained, where the extraction ratio f satisfies 0.2 ≤ f ≤ 0.7 and the number of second feature vectors is 64 × (1-f) × n, where n is the number of targets; in this embodiment f = 0.4, the number of second feature vectors is 38 × n, the second convolutional neural network is ResNet18, and the last layer of ResNet18 is replaced by a fully connected layer with a 128-dimensional output.
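A companion sketch of the target-region branch in steps 3.3–3.5, again purely illustrative (TargetRegionEncoder and its defaults are assumptions): it crops the annotated box, rescales it to 256 × 256, splits it into an 8 × 8 grid and encodes each kept cell with a ResNet18 whose last layer is replaced by a 128-dimensional fully connected layer.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from torchvision import models

class TargetRegionEncoder(nn.Module):
    """Illustrative sketch of steps 3.3-3.5: crop the annotated target box,
    rescale it to 256x256, split it into an 8x8 grid, keep a ratio (1 - f)
    of the cells, and encode each kept cell into a 128-d second feature vector."""
    def __init__(self, grid=(8, 8), drop_ratio=0.4, out_dim=128):
        super().__init__()
        self.grid = grid
        self.drop_ratio = drop_ratio
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)  # last layer -> 128-d FC
        self.backbone = backbone

    def forward(self, image, box):       # box = (x1, y1, x2, y2) from the annotation
        x1, y1, x2, y2 = box
        crop = TF.resized_crop(image, top=y1, left=x1, height=y2 - y1,
                               width=x2 - x1, size=[256, 256])
        n, m = self.grid
        c, h, w = crop.shape
        cells = crop.reshape(c, n, h // n, m, w // m).permute(1, 3, 0, 2, 4)
        cells = cells.reshape(n * m, c, h // n, w // m)      # (64, 3, 32, 32)
        keep = torch.randperm(n * m)[: int(n * m * (1 - self.drop_ratio))]
        return self.backbone(cells[keep])                    # (e.g. 38 cells, 128)
```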
Step 3.6, for each target, encoding the manually marked target position using a preset encoding rule to obtain a multi-dimensional coding feature vector, namely, expressing the manually marked target position (such as the target center point) as a vector, for example a 128-dimensional coding feature vector, wherein the first feature vector, the second feature vector and the coding feature vector have the same dimension.
The preset encoding rule is as follows:
PE(pos, 2·d_index) = sin(pos / 10000^(2·d_index / d))
PE(pos, 2·d_index + 1) = cos(pos / 10000^(2·d_index / d))
wherein PE represents the position code; pos represents the number of the image grid in which the center of the current target position falls, the grid numbers being assigned according to the row-priority (row-major) criterion; d represents the dimension of the coding feature vector;
d_index = ⌊k / 2⌋
where k is the element position in the coding feature vector, i.e., d_index is the element position in the coding feature vector divided by 2 and rounded down.
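A minimal sketch of this encoding rule, assuming the sinusoidal form given above, might be:

```python
import math
import torch

def encode_target_position(pos: int, d: int = 128) -> torch.Tensor:
    """Illustrative sketch of step 3.6: sinusoidal encoding of the row-major grid
    number `pos` of the target center into a d-dimensional coding feature vector.
    Even element positions use sin, odd ones use cos, with d_index = k // 2."""
    pe = torch.zeros(d)
    for k in range(d):                    # k: element position in the coding feature vector
        d_index = k // 2                  # element position divided by 2, rounded down
        angle = pos / (10000 ** (2 * d_index / d))
        pe[k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

# Example: a target whose center falls in row 3, column 5 of a 16 x 16 grid
# has pos = 3 * 16 + 5 = 53, so its coding feature vector is encode_target_position(53).
```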
Step 3.7, carrying out vector fusion on the obtained coding feature vector and the second feature vector to obtain a third feature vector;
and 3.8, inputting the first feature vector and the third feature vector into a transform Encoder (transform Encoder), and obtaining an output feature vector through a Self-Attention mechanism (Self-Attention). When the original image in step 3.1 is the first image, the output feature vector is the first output feature vector, and when the original image in step 3.1 is the second image, the output feature vector is the second output feature vector. The output feature vector includes spatial context information, and the spatial context information includes a structural relationship of the target component and a spatial relationship between the target whole and the image whole.
Step 3.9, when the original image in step 3.1 is the second image, performing feature fusion on the second output feature vectors of the second images of the consecutive frames to obtain a first fusion feature vector.
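Step 3.9 amounts to splicing the per-frame outputs; a one-line sketch (illustrative only) is:

```python
import torch

def fuse_temporal_context(per_frame_outputs):
    """Illustrative sketch of step 3.9: splice (concatenate) the second output
    feature vectors of the consecutive second images (e.g. 16 adjacent frames)
    along the token axis to form the first fusion feature vector."""
    # per_frame_outputs: list of tensors, one per frame, each of shape (K_t, 128)
    return torch.cat(per_frame_outputs, dim=0)      # shape (sum of K_t, 128)
```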
The training in the step 4 includes:
step 4.1, inputting the first fusion characteristic vector and the first output characteristic vector into a Transformer Decoder (Transformer Decoder), and simultaneously inputting a preset number m of Query key value vectors (Query) into the Transformer Decoder, and decoding to obtain m decoding characteristic outputs, wherein m is an integer not less than 256, and the Query key value vectors are trainable parameters;
step 4.2, calculating each decoding characteristic output through a feed forward neural network (FFN) to obtain a result vector containing a target type and a target position, and analyzing the result vector to obtain an analyzed target type and an analyzed target position;
step 4.3, comparing the analyzed target type and the analyzed target position with the label, calculating a loss function, and updating the network parameters of the incomplete information target identification model through a back propagation algorithm;
the method specifically comprises the following steps: comparing the target category obtained by analysis with a target category label manually marked, comparing the target position obtained by analysis with a target position label manually marked, calculating a loss function according to the comparison result, and updating parameters through a back propagation algorithm;
and 4.4, iteratively updating the network parameters, and finishing model training when the iteration times are finished and/or the preset optimal performance metric value is reached, namely obtaining the trained incomplete information target recognition model, and finishing training the incomplete information target recognition model.
In step 4, the first output feature vector and the first fusion feature vector are obtained according to the image video data set and the annotation labels; the specific acquisition method is the same as that of step 3.
As shown in Fig. 3, if the input data to be detected is a video, i.e. the input of the incomplete information target recognition model is continuous video frames, feature extraction is performed on the images in the continuous video frames according to the manually labeled target type and target position to obtain second output feature vectors, and feature fusion is performed on the second output feature vectors of several frames in the adjacent time domain to obtain the first fusion feature vector; the model then obtains the target type and target position of the incomplete target from the first fusion feature vector. Specifically, the first fusion feature vector is input into the Transformer decoder, which holds the preset number m of query key-value vectors; m decoding feature outputs are obtained by decoding, each decoding feature output is passed through the feed-forward neural network to obtain a result vector containing a target type and a target position, and the result vector is analyzed to obtain the analyzed target type and analyzed target position, which are the target type and target position of the incomplete target detected and output by the model.
The network structure of the algorithm for the first image is shown in Fig. 4. If the input data to be detected is a single image, i.e. the input of the incomplete information target recognition model is a first image, feature extraction is performed on the first image to obtain a first output feature vector, and the model obtains the target type and target position of the incomplete target from the first output feature vector. Specifically, the first output feature vector is input into the Transformer decoder, which holds the preset number m of query key-value vectors; m decoding feature outputs are obtained by decoding, each decoding feature output is passed through the feed-forward neural network to obtain a result vector containing a target type and a target position, and the result vector is analyzed to obtain the analyzed target type and analyzed target position, which are taken as the target type and target position of the incomplete target detected and output by the model.
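Purely for illustration, the parsing of the m result vectors at inference time might look like the following sketch; the score threshold and the "no object" column convention are assumptions carried over from the training sketch above.

```python
import torch

def parse_result_vectors(logits, boxes, score_thresh=0.5):
    """Illustrative parsing of the m result vectors into analyzed target types and
    positions; logits: (m, num_classes + 1), boxes: (m, 4) normalized (cx, cy, w, h)."""
    probs = logits.softmax(dim=-1)
    scores, labels = probs[..., :-1].max(dim=-1)    # drop the assumed "no object" column
    keep = scores > score_thresh                    # assumed confidence threshold
    return labels[keep], boxes[keep], scores[keep]  # target types, target positions, scores
```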
EXAMPLE III
The invention also provides a target identification method of the incomplete information target identification model, which comprises the following steps:
and performing incomplete target detection on the video and/or the image by using the trained incomplete information target identification model in the first embodiment or the second embodiment, outputting the target type and the target position of the incomplete target by using the trained incomplete information target identification model of the video and/or the image, and further outputting the target type and the target position of the complete target by using the model.
The model was tested and verified: 1,000 images were selected as a test data set, and the target recognition results of the method of the present invention were compared with those of the original Transformer-based target recognition method. The data set contains visible-light and infrared images captured by unmanned aerial vehicles; the scenes include mountainous regions, plains/suburbs, oceans and deserts/Gobi, and the target categories include vehicles, aircraft and ships. Complete targets account for 42%, targets with an occlusion rate below 30% account for 43%, and targets with an occlusion rate between 30% and 80% account for 15%. The target distribution is shown in Fig. 5, and the result statistics are shown in Fig. 6. Some image recognition results are shown in Fig. 7.
Compared with the prior art, the technical scheme of the invention has the following advantages:
1) The method effectively models the relationship between the target and the whole image and the relationships among the local features of the target, improves the accuracy of the target recognition model, and reduces the false detection rate of the model. In addition, spatial context enhancement makes the algorithm more robust for incomplete information target detection.
2) For video data, the method effectively exploits the temporal context and uses relational modeling of information from different time instants to improve target detection accuracy by more than 3 percent.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A training method of an incomplete information target recognition model is characterized by comprising the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and the first image and the continuous video frames both have incomplete targets;
step 2, obtaining a plurality of continuous frame second images in continuous video frames, and manually marking the target positions and the target types on the first images and the second images to obtain marking labels;
step 3, training an incomplete information target recognition model according to the image video data set, the first output characteristic vector, the first fusion characteristic vector and the label to obtain the trained incomplete information target recognition model;
the first output feature vector and the first fusion feature vector are obtained by the following method:
for the first image, extracting the features of the first image according to the label of the first image to obtain a first output feature vector; and for continuous video frames in the image video data set, performing feature extraction on second images of the continuous frames according to the label tags of the second images to obtain second output feature vectors, and performing feature fusion on the second output feature vectors of the second images of the continuous frames to obtain a first fusion feature vector.
2. The method as claimed in claim 1, wherein the first image and the second image are both referred to as original images, and the method for extracting features in step 3 comprises:
step 3.1, scaling the original image to a first pixel size x_p1 × y_p1, wherein x_p1 and y_p1 are both positive integers, and then dividing the image of the first pixel size into N_1 × M_1 image grids, wherein N_1 and M_1 are both integers greater than 2;
step 3.2, from the N_1 × M_1 image grids, randomly extracting and discarding image grids at a ratio r, leaving N_1 × M_1 × (1-r) image grids, and performing feature extraction on each of the remaining N_1 × M_1 × (1-r) image grids using a first convolutional neural network to obtain a multi-dimensional first feature vector;
step 3.3, extracting a target area image from the original image according to the manually marked target type and target position, and scaling the target area image to a second pixel size x_p2 × y_p2, wherein usually x_p2 < x_p1 and y_p2 < y_p1;
step 3.4, carrying out grid division on the scaled target area image, dividing it into N_2 × M_2 grid areas, wherein N_2 and M_2 are both integers greater than or equal to 2;
step 3.5, from the N_2 × M_2 grid areas, randomly extracting and discarding grid areas at a ratio f, and performing feature extraction on the remaining N_2 × M_2 × (1-f) grid areas using a second convolutional neural network to obtain a multi-dimensional second feature vector;
3.6, according to the manually marked target positions, coding the target position of each target by adopting a preset coding rule to obtain a multidimensional coding feature vector, wherein the first feature vector, the second feature vector and the coding feature vector have the same dimension;
step 3.7, carrying out vector fusion on the coding feature vector and the second feature vector to obtain a third feature vector;
step 3.8, inputting the first feature vector and the third feature vector into a transform encoder, and obtaining an output feature vector through a self-attention mechanism; when the image in the step 3.1 is a first image, the output characteristic vector is a first output characteristic vector, and when the image in the step 3.1 is a second image, the output characteristic vector is a second output characteristic vector and the step 3.9 is carried out;
and 3.9, performing feature fusion on the second output feature vector of the second image of the continuous frame to obtain a first fusion feature vector.
3. The method as claimed in claim 2, wherein the predetermined coding rule is:
PE(pos, 2·d_index) = sin(pos / 10000^(2·d_index / d))
PE(pos, 2·d_index + 1) = cos(pos / 10000^(2·d_index / d))
wherein PE represents the position code; pos represents the number of the image grid in which the center of the current target position falls, the grid numbers being assigned according to the row-priority (row-major) criterion; d represents the dimension of the coding feature vector;
d_index = ⌊k / 2⌋
where k is the element position in the coding feature vector, i.e., d_index is the element position in the coding feature vector divided by 2 and rounded down.
4. The method as claimed in claim 2, wherein 0.3 ≤ r ≤ 0.6 and 0.2 ≤ f ≤ 0.7.
5. The method as claimed in claim 2, wherein the first convolutional neural network is ResNet50, and the second convolutional neural network is ResNet18.
6. The method as claimed in claim 1, wherein the first output feature vector and the second output feature vector each include information of a spatial relationship between the target and the original image and information of a structural relationship between components of the target.
7. The method as claimed in claim 1, wherein the training of the incomplete information object recognition model comprises:
inputting the first fusion characteristic vector and the first output characteristic vector into a Transformer decoder, inputting a preset number m of query key value vectors into the Transformer decoder, and decoding to obtain m decoding characteristic outputs, wherein m is an integer, and the query key value vectors are trainable parameters;
calculating by each decoding characteristic output through a feedforward neural network to obtain a result vector containing a target type and a target position, and analyzing the result vector to obtain an analyzed target type and an analyzed target position;
and comparing the analyzed target type and the analyzed target position with the label, calculating a loss function, and updating the network parameters of the incomplete information target identification model through a back propagation algorithm.
8. A method of object recognition, comprising:
acquiring an image or video to be identified;
inputting the image or video to be recognized into an incomplete information target recognition model for processing to obtain the target type and the target position output by the incomplete information target recognition model, wherein the incomplete information target recognition model is obtained by training through the training method of any one of claims 1 to 7.
CN202211480465.2A 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method Active CN115761444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211480465.2A CN115761444B (en) 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211480465.2A CN115761444B (en) 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method

Publications (2)

Publication Number Publication Date
CN115761444A true CN115761444A (en) 2023-03-07
CN115761444B CN115761444B (en) 2023-07-25

Family

ID=85336699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211480465.2A Active CN115761444B (en) 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method

Country Status (1)

Country Link
CN (1) CN115761444B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222916A (en) * 2021-04-28 2021-08-06 北京百度网讯科技有限公司 Method, apparatus, device and medium for detecting image using target detection model
CN113191318A (en) * 2021-05-21 2021-07-30 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN114120172A (en) * 2021-10-29 2022-03-01 北京百度网讯科技有限公司 Video-based target detection method and device, electronic equipment and storage medium
CN114973038A (en) * 2022-06-20 2022-08-30 西安微电子技术研究所 Transformer-based airport runway line detection method
CN115294501A (en) * 2022-08-11 2022-11-04 北京字跳网络技术有限公司 Video identification method, video identification model training method, medium and electronic device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ashish Vaswani et al.: "Attention Is All You Need", arXiv:1706.03762v5 [cs.CL], pages 1-15 *
Krishna Kumar Singh et al.: "Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization", arXiv:1704.04232v2 [cs.CV], pages 1-10 *
Lu He et al.: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv:2105.10920v1 [cs.CV], pages 3-7 *
Hong Feng et al.: "Research on detection and tracking algorithms for video target vehicles based on spatio-temporal consistency constraints", Journal of Electronic Measurement and Instrumentation, vol. 36, no. 3, pages 105-112 *

Also Published As

Publication number Publication date
CN115761444B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
Bappy et al. Hybrid lstm and encoder–decoder architecture for detection of image forgeries
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108537119B (en) Small sample video identification method
CN110751018A (en) Group pedestrian re-identification method based on mixed attention mechanism
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN113822246B (en) Vehicle weight identification method based on global reference attention mechanism
CN109919032A (en) A kind of video anomaly detection method based on action prediction
CN110765841A (en) Group pedestrian re-identification system and terminal based on mixed attention mechanism
CN115082966B (en) Pedestrian re-recognition model training method, pedestrian re-recognition method, device and equipment
CN111160096A (en) Method, device and system for identifying poultry egg abnormality, storage medium and electronic device
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN109614933A (en) A kind of motion segmentation method based on certainty fitting
CN114067286A (en) High-order camera vehicle weight recognition method based on serialized deformable attention mechanism
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN113936175A (en) Method and system for identifying events in video
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN110825916A (en) Person searching method based on body shape recognition technology
CN113642685A (en) Efficient similarity-based cross-camera target re-identification method
CN117197727A (en) Global space-time feature learning-based behavior detection method and system
CN116453102A (en) Foggy day license plate recognition method based on deep learning
CN111783570A (en) Method, device and system for re-identifying target and computer storage medium
CN111709442A (en) Multilayer dictionary learning method for image classification task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant