CN117474950A - Cross-modal target tracking method based on visual semantics - Google Patents


Info

Publication number: CN117474950A
Authority: CN (China)
Prior art keywords: target, tracking, frame, moving, algorithm
Prior art date: 2023-11-03
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number: CN202311454366.1A
Other languages: Chinese (zh)
Inventor
赵彦春 (Zhao Yanchun)
李福生 (Li Fusheng)
万优 (Wan You)
段裕龙 (Duan Yulong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2023-11-03
Publication date: 2024-01-30
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202311454366.1A
Publication of CN117474950A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention discloses a cross-modal target tracking method based on visual semantics. A vision module first detects, extracts, identifies and tracks a moving target in an image sequence to obtain its motion parameters, such as position, speed, acceleration and motion trajectory; given the target state in an initial frame, the target state in subsequent frames is predicted from the tracked video sequence. The computer vision techniques provided by the invention apply well to the field of target tracking and effectively improve on existing computer vision technology: images are comprehensively preprocessed by a specific algorithm, and targets are then identified and tracked separately. The complete algorithm scheme ensures the accuracy and convenience of target tracking; the cooperative work of a fixed machine and a moving machine improves the real-time performance and robustness of the whole tracking system in identifying and tracking a moving target, and achieves consistent recognition of the motion of the object under detection within the system.

Description

Cross-modal target tracking method based on visual semantics
Technical Field
The invention relates to the technical field of computer vision, in particular to a cross-modal target tracking method based on visual semantics.
Background
Object tracking is an important and widely studied problem in machine vision, divided into single-object tracking and multi-object tracking. The former tracks a single target in a video; the latter tracks several targets in the video simultaneously to obtain their motion trajectories. Vision-based automatic target tracking has important applications in intelligent surveillance, action and behavior analysis, autonomous driving, and related fields. In an autonomous driving system, for example, the target tracking algorithm tracks the movement of vehicles, pedestrians and animals, and predicts their future positions, speeds and other properties.
Target recognition and individual tracking have been developed for many years and a large number of algorithms have emerged, but the following problems remain:
(1) Environments are complex: the target to be detected may be partially occluded in the image, recognition from a single fixed viewpoint carries a certain error, and recognition under viewpoint changes, prediction and similar conditions remains a difficult problem in the field of target tracking.
(2) To remain effective, a tracking algorithm must be robust; but higher robustness means higher computational complexity and more elaborate tracking strategies, which reduce the algorithm's real-time performance and bring a series of recognition and prediction errors.
Disclosure of Invention
The invention aims to provide a cross-modal target tracking method based on visual semantics so as to solve the problems identified in the background art.
In order to achieve the above purpose, the present invention provides the following technical solution: a cross-modal target tracking method based on visual semantics, comprising the following steps:
(1) First, detect, extract, identify and track a moving target in an image sequence through a vision module to obtain its motion parameters, such as position, speed, acceleration and motion trajectory;
(1.1) input an initial frame and designate the target to be tracked, typically calibrated with a rectangular box; generate a number of candidate boxes in the next frame and extract their features;
(1.2) score the candidate boxes with the observation model;
(1.3) finally, take the highest-scoring candidate box as the predicted target, or fuse several predictions to obtain a better predicted target;
(2) given the target state of the initial frame, predict the target state in subsequent frames of the tracked video sequence;
(3) after the target area of the current frame is obtained, extract the features of the current target and update the observation model online with a model-updating algorithm;
(4) obtain a detection result for each image frame from the extracted features, associate it with the existing tracks, use the observation model to verify the plausibility of the motion region proposed by the motion model, and analyze the candidate boxes generated by the motion model;
(5) the algorithm predicts the second frame from the information of the first frame, and so on for subsequent frames, while updating the model according to the specified rules;
(6) compare the information acquired by the vision module and continuously adjust the predicted target position according to the algorithm, achieving accurate tracking of the target.
Preferably, in step (1), a target tracking algorithm is used for target detection and extraction, removing false detections and recovering missed ones.
Preferably, in step (1), the features used for target detection and extraction include visual features, statistical features, transform-coefficient features and algebraic features.
Preferably, in step (1), the vision module comprises a CCD camera, a cloud processing platform and an embedded processor, together with a fixed machine and a moving machine on which these modules are mounted.
Preferably, in step (2), during prediction, detection-to-track matches are treated as binary variables; an overall objective function is constructed and solved for the optimal variable values, so that the objective function is optimal and the optimal matching between detections and tracks is obtained.
Preferably, in step (2), the MeanShift operation is run on every image frame of the video sequence, with the result from the previous frame used as the initial search window for the MeanShift algorithm on the next frame, and the iteration proceeds in this way.
Preferably, in step (3), the content of the online update is the prediction status of the target-area model, which is fed back to the initial target-extraction and analysis step to judge the accuracy of the model's prediction; the predicted area is updated by the algorithm and iteration continues, and if the target subsequently appears inside a rectangular candidate box of the predicted area, no feedback is performed.
Preferably, the fixed machine and the moving machine both track the target with a computer vision algorithm and, from the tracking result, i.e. the target's position in the image, calculate the real-world displacement difference of the moving tracker relative to the ground-fixed tracker, so as to adjust the attitude of the moving machine and keep the target at the center of the image.
Preferably, the moving machine comprises several communication interfaces to which other peripherals can be attached, enabling information transmission with the fixed machine and continuous adjustment of the predicted target position and its own position.
Compared with the prior art, the invention has the following beneficial effects:
The computer vision techniques provided by the invention apply well to the field of target tracking and effectively improve on existing computer vision technology: images are comprehensively preprocessed by a specific algorithm, and targets are then identified and tracked separately. The complete algorithm scheme ensures the accuracy and convenience of target tracking; the cooperative work of a fixed machine and a moving machine improves the real-time performance and robustness of the whole tracking system in identifying and tracking a moving target, and achieves consistent recognition of the motion of the object under detection within the system.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without inventive effort fall within the scope of the invention.
The cross-modal target tracking method based on visual semantics comprises the following steps:
First, a moving target in an image sequence is detected, extracted, identified and tracked through the vision module to obtain its motion parameters, such as position, speed, acceleration and motion trajectory. A target tracking algorithm is used for detection and extraction, removing false detections and recovering missed ones, which provides a basis for further behavior analysis. Such algorithms include, but are not limited to, the Mean Shift algorithm; Kalman filtering and particle filtering for state prediction; TLD for tracking based on online learning; and KCF, a correlation-filter algorithm. The features used for detection and extraction include visual features (image edges, contours, shapes, textures and regions), statistical features (histograms), transform-coefficient features (Fourier coefficients, autoregressive models) and algebraic features (singular value decomposition of the image matrix). The vision module comprises a CCD camera, a cloud processing platform and an embedded processor, together with a fixed machine and a moving machine on which these modules are mounted. A fixed machine means that the camera does not move during monitoring, so only the target's motion relative to the camera within the field of view is detected; a moving machine means that the camera itself moves during monitoring (translation, rotation, multi-degree-of-freedom motion), producing complex relative motion between the target and the camera.
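As one illustration of the state-prediction options named above, the following is a minimal constant-velocity Kalman filter sketch in Python; the matrices follow the standard textbook form, and the noise covariances and example numbers are illustrative assumptions rather than values from this disclosure.

```python
import numpy as np

dt = 1.0  # one frame between measurements
F = np.array([[1, 0, dt, 0],   # state transition for state [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # only position is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise (assumed)
R = np.eye(2) * 1.0            # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle. x: state (4,), P: covariance (4x4), z: measured (x, y)."""
    # Predict the next state and covariance
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the current frame's detection
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Example: initialize from the first detection with zero velocity
x0, P0 = np.array([100.0, 50.0, 0.0, 0.0]), np.eye(4)
x1, P1 = kalman_step(x0, P0, np.array([104.0, 52.0]))
```

A particle filter would replace the Gaussian state with a weighted sample set; the loop structure (predict, then correct with the detection) stays the same.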
The fixed machine and the moving machine both track the target with a computer vision algorithm and, from the tracking result, i.e. the target's position in the image, calculate the real-world displacement difference of the moving tracker relative to the ground-fixed tracker, so as to adjust the attitude of the moving machine and keep the target at the center of the image. The moving machine comprises several communication interfaces to which other peripherals can be attached, enabling information transmission with the fixed machine and continuous adjustment of the predicted target position and its own position.
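A minimal sketch of how the tracked box's offset from the image center could be turned into attitude corrections for the moving machine; the proportional gain, the (x, y, w, h) box format and the pan/tilt interpretation are illustrative assumptions, not prescribed by this disclosure.

```python
def center_error(box, img_w, img_h):
    """Pixel offset of the box center from the image center."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    return cx - img_w / 2.0, cy - img_h / 2.0

def pose_correction(box, img_w, img_h, kp=0.002):
    """Proportional controller: pixel error -> pan/tilt rate commands (rad/s, assumed)."""
    ex, ey = center_error(box, img_w, img_h)
    return -kp * ex, -kp * ey  # drive the error toward zero

# Example: 640x480 image, target box at (400, 300, 40, 40)
pan_rate, tilt_rate = pose_correction((400, 300, 40, 40), 640, 480)
```

In practice the gain would be tuned (or replaced by a PID loop) against the moving machine's actuation dynamics.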
The information extraction and analysis steps for the moving target are as follows (a minimal code sketch follows the list):
(1) Input an initial frame and designate the target to be tracked, calibrated with a rectangular box; generate a number of candidate boxes in the next frame and extract their features;
(2) the observation model scores the candidate boxes;
(3) finally, take the highest-scoring candidate box as the predicted target, or fuse several predictions to obtain a better predicted target.
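The minimal sketch referenced above implements steps (1)-(3) plus an online template update in Python. The SSD-template observation model, the sampling spread and the learning rate are illustrative assumptions; the disclosure does not fix a particular observation model.

```python
import numpy as np

def crop(img, box):
    x, y, w, h = box
    return img[y:y + h, x:x + w]

def score(template, patch):
    """Negative sum of squared differences: higher means more similar."""
    d = template.astype(np.float32) - patch.astype(np.float32)
    return -float(np.sum(d * d))

def propose(box, img_shape, n=32, spread=8):
    """Candidate boxes sampled around the previous estimate (step 1)."""
    x, y, w, h = box
    H, W = img_shape[:2]
    boxes = []
    for _ in range(n):
        nx = int(np.clip(x + np.random.randint(-spread, spread + 1), 0, W - w))
        ny = int(np.clip(y + np.random.randint(-spread, spread + 1), 0, H - h))
        boxes.append((nx, ny, w, h))
    return boxes

def track(frames, init_box, lr=0.1):
    """frames: list of grayscale images; init_box: (x, y, w, h) in the first frame."""
    box = init_box
    template = crop(frames[0], box).astype(np.float32)
    history = [box]
    for frame in frames[1:]:
        cands = propose(box, frame.shape)                          # step (1)
        scores = [score(template, crop(frame, c)) for c in cands]  # step (2)
        box = cands[int(np.argmax(scores))]                        # step (3)
        # online observation-model update (running average of the template)
        template = (1 - lr) * template + lr * crop(frame, box)
        history.append(box)
    return history
```

The running-average template is one simple model-updating rule; fusing the top-scoring candidates instead of taking the single best is the alternative mentioned in step (3).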
Further processing and analysis then realize behavior understanding of the moving target. Given the target state (position and scale) of the initial (first) frame, the target state in subsequent frames is predicted from the tracked video sequence. During prediction, detection-to-track matches are treated as binary variables; an overall objective function is constructed and solved for the optimal variable values, so that the objective function is optimal and the optimal matching between detections and tracks is obtained. The MeanShift operation is run on every image frame of the video sequence, with the result from the previous frame (the search window's center position and size) used as the initial search window for the MeanShift algorithm on the next frame, and the iteration proceeds in this way.
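Treating each detection-to-track pairing as a binary variable and optimizing an overall objective is the classic linear assignment problem. A hedged sketch using SciPy's Hungarian solver with an IoU-based cost follows; the cost choice and the absence of gating and track management are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

def match(tracks, detections):
    """tracks, detections: lists of (x, y, w, h). Returns (track, detection) index pairs."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            cost[i, j] = 1.0 - iou(t, d)   # low cost for well-overlapping pairs
    rows, cols = linear_sum_assignment(cost)  # minimizes the total cost
    return list(zip(rows, cols))
```

A full tracker would additionally reject matches whose cost exceeds a threshold and spawn or terminate tracks for the unmatched detections and tracks.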
MeanShift searches for the optimal iteration result on a single image; CamShift processes a video sequence by calling MeanShift on each frame to find the optimal result. Because CamShift continually adapts the window size, the algorithm can adaptively adjust the target area and keep tracking when the target's scale changes.
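A hedged OpenCV sketch of the CamShift iteration just described: each frame is back-projected against the target's hue histogram, and the previous frame's window seeds the next search, so the window adapts with the target's scale. The video path and the initial box are assumptions for illustration.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")        # assumed input file
ok, frame = cap.read()
x, y, w, h = 300, 200, 80, 80              # assumed initial target box
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # hue histogram
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # CamShift returns a rotated box and the adapted search window;
    # the window seeds the next frame's iteration.
    rot_box, window = cv2.CamShift(backproj, window, term)

cap.release()
```

Replacing cv2.CamShift with cv2.meanShift in the loop gives the fixed-window variant, which cannot follow scale changes.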
After the target area of the current frame is obtained, the features of the current target are extracted and the observation model is updated online with a model-updating algorithm. The updated content, i.e. the prediction status of the target-area model, is fed back to the initial target-extraction and analysis step to judge the accuracy of the model's prediction; the predicted area is updated by the algorithm and iteration continues, and if the target subsequently appears inside a rectangular candidate box of the predicted area, no feedback is performed. A detection result is obtained for each image frame from the extracted features and associated with the existing tracks; the observation model verifies the plausibility of the motion region proposed by the motion model, and the candidate boxes generated by the motion model are analyzed. The algorithm predicts the second frame from the information of the first frame, and so on for subsequent frames, while updating the model according to the specified rules. Finally, the information acquired by the vision module is compared and the predicted target position is continuously adjusted according to the algorithm, achieving accurate tracking of the target.
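A small sketch of the conditional feedback rule above, where feedback to the target-extraction step happens only when the target does not appear inside a rectangular candidate box of the predicted area; the (x, y, w, h) box format and the helper names are assumptions for illustration.

```python
def contains(box, point):
    """True if the point lies inside the rectangular box."""
    x, y, w, h = box
    px, py = point
    return x <= px <= x + w and y <= py <= y + h

def needs_feedback(predicted_boxes, target_center):
    """True when no predicted candidate box contains the target, i.e. the
    prediction missed and the model must be corrected and re-iterated."""
    return not any(contains(b, target_center) for b in predicted_boxes)

# Example: target at (120, 90) lies inside the first candidate box -> no feedback
print(needs_feedback([(100, 80, 50, 40), (0, 0, 30, 30)], (120, 90)))  # False
```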
Examples:
In this example, the tracked moving target is a motor-driven trolley, the fixed machine is a fixed camera, and the moving machine is an unmanned aerial vehicle equipped with a camera; both are fitted with the associated processors. After preparation, the trolley is started and is tracked within the monitoring range of the fixed camera and the unmanned aerial vehicle.
The onboard CCD camera collects images of the target and its surroundings, identifies and tracks the target, and computes the pixel position difference between the tracked target and the image center point over the continuous image sequence. From this computed pixel difference, the relative displacement between the unmanned aerial vehicle and the target in real three-dimensional space is calculated through a certain mapping relation. The computed relative displacement information is supplied to the flight control module, which adjusts the flight attitude of the unmanned aerial vehicle, finally achieving follow-flight.
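One possible form of the "certain mapping relation" mentioned here: under a pinhole model with a downward-facing camera at known altitude, a pixel offset from the image center scales to a ground-plane displacement by altitude over focal length. The focal length and altitude values below are illustrative assumptions, not values from the embodiment.

```python
def pixel_to_ground(dx_px, dy_px, altitude_m, focal_px):
    """Ground-plane displacement (meters) of the target relative to the point
    directly below the drone, from its pixel offset in the image."""
    return (dx_px * altitude_m / focal_px,
            dy_px * altitude_m / focal_px)

# Example: 100 px offset at 20 m altitude with an 800 px focal length -> 2.5 m
dx_m, dy_m = pixel_to_ground(100, 0, 20.0, 800.0)
```

A tilted camera would additionally require the camera's attitude (from the flight controller) to project the ray onto the ground plane.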
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A cross-modal target tracking method based on visual semantics, characterized by comprising the following steps:
(1) first, detecting, extracting, identifying and tracking a moving target in an image sequence through a vision module to obtain its motion parameters, such as position, speed, acceleration and motion trajectory;
(1.1) inputting an initial frame and designating the target to be tracked, typically calibrated with a rectangular box, generating a number of candidate boxes in the next frame and extracting their features;
(1.2) scoring the candidate boxes with the observation model;
(1.3) finally, taking the highest-scoring candidate box as the predicted target, or fusing several predictions to obtain a better predicted target;
(2) given the target state of the initial frame, predicting the target state in subsequent frames of the tracked video sequence;
(3) after the target area of the current frame is obtained, extracting the features of the current target and updating the observation model online with a model-updating algorithm;
(4) obtaining a detection result for each image frame from the extracted features, associating it with the existing tracks, using the observation model to verify the plausibility of the motion region proposed by the motion model, and analyzing the candidate boxes generated by the motion model;
(5) predicting the second frame from the information of the first frame, and so on for subsequent frames, while updating the model according to the specified rules;
(6) comparing the information acquired by the vision module and continuously adjusting the predicted target position according to the algorithm, to achieve accurate tracking of the target.
2. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (1), a target tracking algorithm is used for target detection and extraction, removing false detections and recovering missed ones.
3. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (1), the features used for target detection and extraction include visual features, statistical features, transform-coefficient features and algebraic features.
4. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (1), the vision module comprises a CCD camera, a cloud processing platform and an embedded processor, together with a fixed machine and a moving machine on which these modules are mounted.
5. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (2), during prediction, detection-to-track matches are treated as binary variables, and an overall objective function is constructed and solved for the optimal variable values, so that the objective function is optimal and the optimal matching between detections and tracks is obtained.
6. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (2), the MeanShift operation is run on every image frame of the video sequence, with the result from the previous frame used as the initial search window for the MeanShift algorithm on the next frame, and the iteration proceeds in this way.
7. The cross-modal target tracking method based on visual semantics according to claim 1, wherein: in step (3), the content of the online update is the prediction status of the target-area model, which is fed back to the initial target-extraction and analysis step to judge the accuracy of the model's prediction; the predicted area is updated by the algorithm and iteration continues, and if the target subsequently appears inside a rectangular candidate box of the predicted area, no feedback is performed.
8. The cross-modal target tracking method based on visual semantics according to claim 4, wherein: the fixed machine and the moving machine both track the target with a computer vision algorithm and, from the tracking result, i.e. the target's position in the image, calculate the real-world displacement difference of the moving tracker relative to the ground-fixed tracker, so as to adjust the attitude of the moving machine and keep the target at the center of the image.
9. The cross-modal target tracking method based on visual semantics according to claim 4, wherein: the moving machine comprises several communication interfaces to which other peripherals can be attached, enabling information transmission with the fixed machine and continuous adjustment of the predicted target position and its own position.
CN202311454366.1A (priority and filing date 2023-11-03) · Cross-modal target tracking method based on visual semantics · Pending · published as CN117474950A (en)

Priority Applications (1)

Application Number: CN202311454366.1A · Priority Date: 2023-11-03 · Filing Date: 2023-11-03 · Title: Cross-modal target tracking method based on visual semantics

Applications Claiming Priority (1)

Application Number: CN202311454366.1A · Priority Date: 2023-11-03 · Filing Date: 2023-11-03 · Title: Cross-modal target tracking method based on visual semantics

Publications (1)

Publication Number: CN117474950A · Publication Date: 2024-01-30

Family

ID=89630712

Family Applications (1)

Application Number: CN202311454366.1A (Pending) · Priority/Filing Date: 2023-11-03 · Title: Cross-modal target tracking method based on visual semantics

Country Status (1)

Country Link
CN (1) CN117474950A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015501A (en) * 2024-04-08 2024-05-10 中国人民解放军陆军步兵学院 Medium-low altitude low-speed target identification method based on computer vision
CN118015501B (en) * 2024-04-08 2024-06-11 中国人民解放军陆军步兵学院 Medium-low altitude low-speed target identification method based on computer vision

Similar Documents

Publication · Title
CN113269098B (en) Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN111932580A (en) Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm
CN112785628B (en) Track prediction method and system based on panoramic view angle detection tracking
CN110738690A (en) unmanned aerial vehicle video middle vehicle speed correction method based on multi-target tracking framework
Premebida et al. A multi-target tracking and GMM-classifier for intelligent vehicles
Ji et al. RGB-D SLAM using vanishing point and door plate information in corridor environment
CN117474950A (en) Cross-modal target tracking method based on visual semantics
WO2022021661A1 (en) Gaussian process-based visual positioning method, system, and storage medium
CN113192105A (en) Method and device for tracking multiple persons and estimating postures indoors
CN114549549A (en) Dynamic target modeling tracking method based on instance segmentation in dynamic environment
CN117011378A (en) Mobile robot target positioning and tracking method and related equipment
CN114998276A (en) Robot dynamic obstacle real-time detection method based on three-dimensional point cloud
Shi et al. Fuzzy dynamic obstacle avoidance algorithm for basketball robot based on multi-sensor data fusion technology
Zhao et al. Dynamic object tracking for self-driving cars using monocular camera and lidar
Atoum et al. Monocular video-based trailer coupler detection using multiplexer convolutional neural network
CN113379795B (en) Multi-target tracking and segmentation method based on conditional convolution and optical flow characteristics
CN108664918B (en) Intelligent vehicle front pedestrian tracking method based on background perception correlation filter
CN108469729B (en) Human body target identification and following method based on RGB-D information
Mancusi et al. Trackflow: Multi-object tracking with normalizing flows
Sun et al. Real-time and fast RGB-D based people detection and tracking for service robots
CN117331071A (en) Target detection method based on millimeter wave radar and vision multi-mode fusion
CN117011341A (en) Vehicle track detection method and system based on target tracking
Tamas et al. Lidar and vision based people detection and tracking
Wang et al. Online drone-based moving target detection system in dense-obstructer environment

Legal Events

Date Code Title Description
PB01 Publication