CN117474950A - Cross-modal target tracking method based on visual semantics - Google Patents


Info

Publication number: CN117474950A
Authority: CN (China)
Prior art keywords: target, tracking, frame, moving, algorithm
Prior art date: 2023-11-03
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number: CN202311454366.1A
Other languages: Chinese (zh)
Inventor
赵彦春 (Zhao Yanchun)
李福生 (Li Fusheng)
万优 (Wan You)
段裕龙 (Duan Yulong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2023-11-03
Publication date: 2024-01-30
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202311454366.1A
Publication of CN117474950A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention discloses a cross-modal target tracking method based on visual semantics. A vision module first detects, extracts, identifies and tracks a moving target in an image sequence to obtain its motion parameters, such as position, speed, acceleration and motion trajectory; given the target state in an initial frame, the target state in subsequent frames is predicted from the tracked video sequence. The computer vision techniques provided by the invention apply well to the field of target tracking and effectively improve on existing computer vision technology: images are comprehensively preprocessed by a specific algorithm, and targets are then identified and tracked separately. The complete algorithm scheme ensures the accuracy and convenience of target tracking; the cooperative work of a fixed machine and a moving machine improves the real-time performance and robustness of the whole tracking system in identifying and tracking a moving target, and achieves consistent recognition of the motion of the object under detection within the system.

Description

Cross-modal target tracking method based on visual semantics
Technical Field
The invention relates to the technical field of computer vision, in particular to a cross-modal target tracking method based on visual semantics.
Background
Object tracking is an important and widely studied problem in machine vision, divided into single-object tracking and multi-object tracking. The former tracks a single target in a video; the latter tracks several targets in the video simultaneously to obtain their motion trajectories. Vision-based automatic target tracking has important applications in intelligent surveillance, action and behavior analysis, autonomous driving, and related fields. In an autonomous driving system, for example, the target tracking algorithm tracks the movement of vehicles, pedestrians and animals, and predicts their future positions, speeds and other properties.
Target recognition and individual tracking have been developed for many years and a large number of algorithms have emerged, but the following problems remain:
(1) Environments are complex: the target to be detected may be partially occluded in the image, recognition from a single fixed viewpoint carries a certain error, and recognition under viewpoint changes, prediction and similar conditions remains a difficult problem in the field of target tracking.
(2) To remain effective, a tracking algorithm must be robust; but higher robustness means higher computational complexity and more elaborate tracking strategies, which reduce the algorithm's real-time performance and bring a series of recognition and prediction errors.
Disclosure of Invention
The invention aims to provide a cross-modal target tracking method based on visual semantics so as to solve the problems identified in the background art.
In order to achieve the above purpose, the present invention provides the following technical solution: a cross-modal target tracking method based on visual semantics, comprising the following steps:
(1) First, detect, extract, identify and track a moving target in an image sequence through a vision module to obtain its motion parameters, such as position, speed, acceleration and motion trajectory;
(1.1) input an initial frame and designate the target to be tracked, typically calibrated with a rectangular box; generate a number of candidate boxes in the next frame and extract their features;
(1.2) score the candidate boxes with the observation model;
(1.3) finally, take the highest-scoring candidate box as the predicted target, or fuse several predictions to obtain a better predicted target;
(2) given the target state of the initial frame, predict the target state in subsequent frames of the tracked video sequence;
(3) after the target area of the current frame is obtained, extract the features of the current target and update the observation model online with a model-updating algorithm;
(4) obtain a detection result for each image frame from the extracted features, associate it with the existing tracks, use the observation model to verify the plausibility of the motion region proposed by the motion model, and analyze the candidate boxes generated by the motion model;
(5) the algorithm predicts the second frame from the information of the first frame, and so on for subsequent frames, while updating the model according to the specified rules;
(6) compare the information acquired by the vision module and continuously adjust the predicted target position according to the algorithm, achieving accurate tracking of the target.
Preferably, in step (1), a target tracking algorithm is used for target detection and extraction, removing false detections and recovering missed ones.
Preferably, in step (1), the features used for target detection and extraction include visual features, statistical features, transform-coefficient features and algebraic features.
Preferably, in step (1), the vision module comprises a CCD camera, a cloud processing platform and an embedded processor, together with a fixed machine and a moving machine on which these modules are mounted.
Preferably, in step (2), during prediction, detection-to-track matches are treated as binary variables; an overall objective function is constructed and solved for the optimal variable values, so that the objective function is optimal and the optimal matching between detections and tracks is obtained.
Preferably, in step (2), the MeanShift operation is run on every image frame of the video sequence, with the result from the previous frame used as the initial search window for the MeanShift algorithm on the next frame, and the iteration proceeds in this way.
Preferably, in step (3), the content of the online update is the prediction status of the target-area model, which is fed back to the initial target-extraction and analysis step to judge the accuracy of the model's prediction; the predicted area is updated by the algorithm and iteration continues, and if the target subsequently appears inside a rectangular candidate box of the predicted area, no feedback is performed.
Preferably, the fixed machine and the moving machine both track the target with a computer vision algorithm and, from the tracking result, i.e. the target's position in the image, calculate the real-world displacement difference of the moving tracker relative to the ground-fixed tracker, so as to adjust the attitude of the moving machine and keep the target at the center of the image.
Preferably, the moving machine comprises several communication interfaces to which other peripherals can be attached, enabling information transmission with the fixed machine and continuous adjustment of the predicted target position and its own position.
Compared with the prior art, the invention has the following beneficial effects:
The computer vision techniques provided by the invention apply well to the field of target tracking and effectively improve on existing computer vision technology: images are comprehensively preprocessed by a specific algorithm, and targets are then identified and tracked separately. The complete algorithm scheme ensures the accuracy and convenience of target tracking; the cooperative work of a fixed machine and a moving machine improves the real-time performance and robustness of the whole tracking system in identifying and tracking a moving target, and achieves consistent recognition of the motion of the object under detection within the system.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without inventive effort fall within the scope of the invention.
The cross-modal target tracking method based on visual semantics comprises the following steps:
First, a moving target in an image sequence is detected, extracted, identified and tracked through the vision module to obtain its motion parameters, such as position, speed, acceleration and motion trajectory. A target tracking algorithm is used for detection and extraction, removing false detections and recovering missed ones, which provides a basis for further behavior analysis. Such algorithms include, but are not limited to, the Mean Shift algorithm; Kalman filtering and particle filtering for state prediction; TLD for tracking based on online learning; and KCF, a correlation-filter algorithm. The features used for detection and extraction include visual features (image edges, contours, shapes, textures and regions), statistical features (histograms), transform-coefficient features (Fourier coefficients, autoregressive models) and algebraic features (singular value decomposition of the image matrix). The vision module comprises a CCD camera, a cloud processing platform and an embedded processor, together with a fixed machine and a moving machine on which these modules are mounted. A fixed machine means that the camera does not move during monitoring, so only the target's motion relative to the camera within the field of view is detected; a moving machine means that the camera itself moves during monitoring (translation, rotation, multi-degree-of-freedom motion), producing complex relative motion between the target and the camera.
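As one illustration of the state-prediction options named above, the following is a minimal constant-velocity Kalman filter sketch in Python; the matrices follow the standard textbook form, and the noise covariances and example numbers are illustrative assumptions rather than values from this disclosure.

```python
import numpy as np

dt = 1.0  # one frame between measurements
F = np.array([[1, 0, dt, 0],   # state transition for state [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # only position is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise (assumed)
R = np.eye(2) * 1.0            # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle. x: state (4,), P: covariance (4x4), z: measured (x, y)."""
    # Predict the next state and covariance
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the current frame's detection
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Example: initialize from the first detection with zero velocity
x0, P0 = np.array([100.0, 50.0, 0.0, 0.0]), np.eye(4)
x1, P1 = kalman_step(x0, P0, np.array([104.0, 52.0]))
```

A particle filter would replace the Gaussian state with a weighted sample set; the loop structure (predict, then correct with the detection) stays the same.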
The fixed machine and the moving machine both track the target with a computer vision algorithm and, from the tracking result, i.e. the target's position in the image, calculate the real-world displacement difference of the moving tracker relative to the ground-fixed tracker, so as to adjust the attitude of the moving machine and keep the target at the center of the image. The moving machine comprises several communication interfaces to which other peripherals can be attached, enabling information transmission with the fixed machine and continuous adjustment of the predicted target position and its own position.
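A minimal sketch of how the tracked box's offset from the image center could be turned into attitude corrections for the moving machine; the proportional gain, the (x, y, w, h) box format and the pan/tilt interpretation are illustrative assumptions, not prescribed by this disclosure.

```python
def center_error(box, img_w, img_h):
    """Pixel offset of the box center from the image center."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    return cx - img_w / 2.0, cy - img_h / 2.0

def pose_correction(box, img_w, img_h, kp=0.002):
    """Proportional controller: pixel error -> pan/tilt rate commands (rad/s, assumed)."""
    ex, ey = center_error(box, img_w, img_h)
    return -kp * ex, -kp * ey  # drive the error toward zero

# Example: 640x480 image, target box at (400, 300, 40, 40)
pan_rate, tilt_rate = pose_correction((400, 300, 40, 40), 640, 480)
```

In practice the gain would be tuned (or replaced by a PID loop) against the moving machine's actuation dynamics.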
The information extraction and analysis steps for the moving target are as follows (a minimal code sketch follows the list):
(1) Input an initial frame and designate the target to be tracked, calibrated with a rectangular box; generate a number of candidate boxes in the next frame and extract their features;
(2) the observation model scores the candidate boxes;
(3) finally, take the highest-scoring candidate box as the predicted target, or fuse several predictions to obtain a better predicted target.
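The minimal sketch referenced above implements steps (1)-(3) plus an online template update in Python. The SSD-template observation model, the sampling spread and the learning rate are illustrative assumptions; the disclosure does not fix a particular observation model.

```python
import numpy as np

def crop(img, box):
    x, y, w, h = box
    return img[y:y + h, x:x + w]

def score(template, patch):
    """Negative sum of squared differences: higher means more similar."""
    d = template.astype(np.float32) - patch.astype(np.float32)
    return -float(np.sum(d * d))

def propose(box, img_shape, n=32, spread=8):
    """Candidate boxes sampled around the previous estimate (step 1)."""
    x, y, w, h = box
    H, W = img_shape[:2]
    boxes = []
    for _ in range(n):
        nx = int(np.clip(x + np.random.randint(-spread, spread + 1), 0, W - w))
        ny = int(np.clip(y + np.random.randint(-spread, spread + 1), 0, H - h))
        boxes.append((nx, ny, w, h))
    return boxes

def track(frames, init_box, lr=0.1):
    """frames: list of grayscale images; init_box: (x, y, w, h) in the first frame."""
    box = init_box
    template = crop(frames[0], box).astype(np.float32)
    history = [box]
    for frame in frames[1:]:
        cands = propose(box, frame.shape)                          # step (1)
        scores = [score(template, crop(frame, c)) for c in cands]  # step (2)
        box = cands[int(np.argmax(scores))]                        # step (3)
        # online observation-model update (running average of the template)
        template = (1 - lr) * template + lr * crop(frame, box)
        history.append(box)
    return history
```

The running-average template is one simple model-updating rule; fusing the top-scoring candidates instead of taking the single best is the alternative mentioned in step (3).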
Further processing and analysis then realize behavior understanding of the moving target. Given the target state (position and scale) of the initial (first) frame, the target state in subsequent frames is predicted from the tracked video sequence. During prediction, detection-to-track matches are treated as binary variables; an overall objective function is constructed and solved for the optimal variable values, so that the objective function is optimal and the optimal matching between detections and tracks is obtained. The MeanShift operation is run on every image frame of the video sequence, with the result from the previous frame (the search window's center position and size) used as the initial search window for the MeanShift algorithm on the next frame, and the iteration proceeds in this way.
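Treating each detection-to-track pairing as a binary variable and optimizing an overall objective is the classic linear assignment problem. A hedged sketch using SciPy's Hungarian solver with an IoU-based cost follows; the cost choice and the absence of gating and track management are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

def match(tracks, detections):
    """tracks, detections: lists of (x, y, w, h). Returns (track, detection) index pairs."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            cost[i, j] = 1.0 - iou(t, d)   # low cost for well-overlapping pairs
    rows, cols = linear_sum_assignment(cost)  # minimizes the total cost
    return list(zip(rows, cols))
```

A full tracker would additionally reject matches whose cost exceeds a threshold and spawn or terminate tracks for the unmatched detections and tracks.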
MeanShift searches for the optimal iteration result on a single image; CamShift processes a video sequence by calling MeanShift on each frame to find the optimal result. Because CamShift continually adapts the window size, the algorithm can adaptively adjust the target area and keep tracking when the target's scale changes.
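A hedged OpenCV sketch of the CamShift iteration just described: each frame is back-projected against the target's hue histogram, and the previous frame's window seeds the next search, so the window adapts with the target's scale. The video path and the initial box are assumptions for illustration.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")        # assumed input file
ok, frame = cap.read()
x, y, w, h = 300, 200, 80, 80              # assumed initial target box
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # hue histogram
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # CamShift returns a rotated box and the adapted search window;
    # the window seeds the next frame's iteration.
    rot_box, window = cv2.CamShift(backproj, window, term)

cap.release()
```

Replacing cv2.CamShift with cv2.meanShift in the loop gives the fixed-window variant, which cannot follow scale changes.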
After the target area of the current frame is obtained, the features of the current target are extracted and the observation model is updated online with a model-updating algorithm. The updated content, i.e. the prediction status of the target-area model, is fed back to the initial target-extraction and analysis step to judge the accuracy of the model's prediction; the predicted area is updated by the algorithm and iteration continues, and if the target subsequently appears inside a rectangular candidate box of the predicted area, no feedback is performed. A detection result is obtained for each image frame from the extracted features and associated with the existing tracks; the observation model verifies the plausibility of the motion region proposed by the motion model, and the candidate boxes generated by the motion model are analyzed. The algorithm predicts the second frame from the information of the first frame, and so on for subsequent frames, while updating the model according to the specified rules. Finally, the information acquired by the vision module is compared and the predicted target position is continuously adjusted according to the algorithm, achieving accurate tracking of the target.
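A small sketch of the conditional feedback rule above, where feedback to the target-extraction step happens only when the target does not appear inside a rectangular candidate box of the predicted area; the (x, y, w, h) box format and the helper names are assumptions for illustration.

```python
def contains(box, point):
    """True if the point lies inside the rectangular box."""
    x, y, w, h = box
    px, py = point
    return x <= px <= x + w and y <= py <= y + h

def needs_feedback(predicted_boxes, target_center):
    """True when no predicted candidate box contains the target, i.e. the
    prediction missed and the model must be corrected and re-iterated."""
    return not any(contains(b, target_center) for b in predicted_boxes)

# Example: target at (120, 90) lies inside the first candidate box -> no feedback
print(needs_feedback([(100, 80, 50, 40), (0, 0, 30, 30)], (120, 90)))  # False
```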
Examples:
In this example, the tracked moving target is a motor-driven trolley, the fixed machine is a fixed camera, and the moving machine is an unmanned aerial vehicle equipped with a camera; both are fitted with the associated processors. After preparation, the trolley is started and is tracked within the monitoring range of the fixed camera and the unmanned aerial vehicle.
The onboard CCD camera collects images of the target and its surroundings, identifies and tracks the target, and computes the pixel position difference between the tracked target and the image center point over the continuous image sequence. From this computed pixel difference, the relative displacement between the unmanned aerial vehicle and the target in real three-dimensional space is calculated through a certain mapping relation. The computed relative displacement information is supplied to the flight control module, which adjusts the flight attitude of the unmanned aerial vehicle, finally achieving follow-flight.
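One possible form of the "certain mapping relation" mentioned here: under a pinhole model with a downward-facing camera at known altitude, a pixel offset from the image center scales to a ground-plane displacement by altitude over focal length. The focal length and altitude values below are illustrative assumptions, not values from the embodiment.

```python
def pixel_to_ground(dx_px, dy_px, altitude_m, focal_px):
    """Ground-plane displacement (meters) of the target relative to the point
    directly below the drone, from its pixel offset in the image."""
    return (dx_px * altitude_m / focal_px,
            dy_px * altitude_m / focal_px)

# Example: 100 px offset at 20 m altitude with an 800 px focal length -> 2.5 m
dx_m, dy_m = pixel_to_ground(100, 0, 20.0, 800.0)
```

A tilted camera would additionally require the camera's attitude (from the flight controller) to project the ray onto the ground plane.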
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A cross-modal target tracking method based on visual semantics, characterized by comprising the following steps:
(1) first, detecting, extracting, identifying and tracking a moving target in an image sequence through a vision module to obtain its motion parameters, such as position, speed, acceleration and motion trajectory;
(1.1) inputting an initial frame and designating the target to be tracked, typically calibrated with a rectangular box, generating a number of candidate boxes in the next frame and extracting their features;
(1.2) scoring the candidate boxes with the observation model;
(1.3) finally, taking the highest-scoring candidate box as the predicted target, or fusing several predictions to obtain a better predicted target;
(2) given the target state of the initial frame, predicting the target state in subsequent frames of the tracked video sequence;
(3) after the target area of the current frame is obtained, extracting the features of the current target and updating the observation model online with a model-updating algorithm;
(4) obtaining a detection result for each image frame from the extracted features, associating it with the existing tracks, using the observation model to verify the plausibility of the motion region proposed by the motion model, and analyzing the candidate boxes generated by the motion model;
(5) predicting the second frame from the information of the first frame, and so on for subsequent frames, while updating the model according to the specified rules;
(6) comparing the information acquired by the vision module and continuously adjusting the predicted target position according to the algorithm, to achieve accurate tracking of the target.
2. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (1), a target tracking algorithm is used for target detection and extraction, removing false detections and recovering missed ones.
3. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (1), the features used for target detection and extraction include visual features, statistical features, transform-coefficient features and algebraic features.
4. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (1), the vision module comprises a CCD camera, a cloud processing platform and an embedded processor, together with a fixed machine and a moving machine on which these modules are mounted.
5. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (2), during prediction, detection-to-track matches are treated as binary variables, and an overall objective function is constructed and solved for the optimal variable values, so that the objective function is optimal and the optimal matching between detections and tracks is obtained.
6. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (2), the MeanShift operation is run on every image frame of the video sequence, with the result from the previous frame used as the initial search window for the MeanShift algorithm on the next frame, and the iteration proceeds in this way.
7. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (3), the content of the online update is the prediction status of the target-area model, which is fed back to the initial target-extraction and analysis step to judge the accuracy of the model's prediction; the predicted area is updated by the algorithm and iteration continues, and if the target subsequently appears inside a rectangular candidate box of the predicted area, no feedback is performed.
8. The cross-modal target tracking method based on visual semantics according to claim 4, wherein: the fixed machine and the moving machine both track the target with a computer vision algorithm and, from the tracking result, i.e. the target's position in the image, calculate the real-world displacement difference of the moving tracker relative to the ground-fixed tracker, so as to adjust the attitude of the moving machine and keep the target at the center of the image.
9. The cross-modal target tracking method based on visual semantics according to claim 4, wherein: the moving machine comprises several communication interfaces to which other peripherals can be attached, enabling information transmission with the fixed machine and continuous adjustment of the predicted target position and its own position.
CN202311454366.1A (priority and filing date 2023-11-03) · Cross-modal target tracking method based on visual semantics · Pending · published as CN117474950A (en)

Priority Applications (1)

Application Number: CN202311454366.1A · Priority Date: 2023-11-03 · Filing Date: 2023-11-03 · Title: Cross-modal target tracking method based on visual semantics

Applications Claiming Priority (1)

Application Number: CN202311454366.1A · Priority Date: 2023-11-03 · Filing Date: 2023-11-03 · Title: Cross-modal target tracking method based on visual semantics

Publications (1)

Publication Number: CN117474950A · Publication Date: 2024-01-30

Family

ID=89630712

Family Applications (1)

Application Number: CN202311454366.1A (Pending) · Priority/Filing Date: 2023-11-03 · Title: Cross-modal target tracking method based on visual semantics

Country Status (1)

Country Link
CN (1) CN117474950A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015501A (en) * 2024-04-08 2024-05-10 中国人民解放军陆军步兵学院 Medium-low altitude low-speed target identification method based on computer vision
CN118015501B (en) * 2024-04-08 2024-06-11 中国人民解放军陆军步兵学院 Medium-low altitude low-speed target identification method based on computer vision

Similar Documents

Publication · Title
CN113269098B (en) Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN111932580A (en) Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm
CN112785628B (en) Track prediction method and system based on panoramic view angle detection tracking
CN110738690A (en) unmanned aerial vehicle video middle vehicle speed correction method based on multi-target tracking framework
Premebida et al. A multi-target tracking and GMM-classifier for intelligent vehicles
Ji et al. RGB-D SLAM using vanishing point and door plate information in corridor environment
CN117474950A (en) Cross-modal target tracking method based on visual semantics
WO2022021661A1 (en) Gaussian process-based visual positioning method, system, and storage medium
CN113192105A (en) Method and device for tracking multiple persons and estimating postures indoors
CN114549549A (en) Dynamic target modeling tracking method based on instance segmentation in dynamic environment
CN117011378A (en) Mobile robot target positioning and tracking method and related equipment
CN114998276A (en) Robot dynamic obstacle real-time detection method based on three-dimensional point cloud
Shi et al. Fuzzy dynamic obstacle avoidance algorithm for basketball robot based on multi-sensor data fusion technology
Zhao et al. Dynamic object tracking for self-driving cars using monocular camera and lidar
Atoum et al. Monocular video-based trailer coupler detection using multiplexer convolutional neural network
CN113379795B (en) Multi-target tracking and segmentation method based on conditional convolution and optical flow characteristics
CN108664918B (en) Intelligent vehicle front pedestrian tracking method based on background perception correlation filter
CN108469729B (en) Human body target identification and following method based on RGB-D information
Mancusi et al. Trackflow: Multi-object tracking with normalizing flows
Sun et al. Real-time and fast RGB-D based people detection and tracking for service robots
CN117331071A (en) Target detection method based on millimeter wave radar and vision multi-mode fusion
CN117011341A (en) Vehicle track detection method and system based on target tracking
Tamas et al. Lidar and vision based people detection and tracking
Wang et al. Online drone-based moving target detection system in dense-obstructer environment

Legal Events

Date Code Title Description
PB01 Publication