CN116895032A - Target tracking method, device, terminal equipment and storage medium


Publication number
CN116895032A
Authority
CN
China
Prior art keywords
video
detected
image frames
target
visual
Prior art date
Legal status
Pending
Application number
CN202210316572.5A
Other languages
Chinese (zh)
Inventor
胡文颖
王峰
郑宜海
陈松平
章照中
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Jiangxi Co Ltd
Nanchang Hangkong University
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Jiangxi Co Ltd
Nanchang Hangkong University
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Jiangxi Co Ltd, Nanchang Hangkong University


Abstract

The embodiment of the invention provides a target tracking method, a target tracking device, terminal equipment and a storage medium, relating to the field of security video monitoring. A frame extraction frequency matched to the dynamic change of the visual target is obtained, which reduces the amount of computation, achieves accurate tracking, and effectively guarantees the security quality of the monitored area. The method comprises the following steps: setting a set number of image frames to be extracted per preset time length according to a feature characterizing the dynamic change of a visual target in a video to be detected; extracting the set number of image frames from each video clip to be detected of the preset time length; and performing target tracking on the visual target in the video to be detected based on the set number of image frames.

Description

Target tracking method, device, terminal equipment and storage medium
[Technical Field]
The embodiment of the invention relates to the field of security video monitoring, and in particular to a target tracking method, a device, terminal equipment and a storage medium.
[Background Art]
With the rapid development of video detection and network transmission technologies, cameras are installed at locations where events need to be monitored, such as factories, streets, intersections and stations. Visual targets (e.g., buildings, people, vehicles, animals) in the video are tracked manually or by machine, and subsequent analysis is performed based on the motion trajectories of the visual targets tracked in the video.
Existing methods for tracking visual targets have the following problems: for objects that pass through quickly, the tracked target is easily lost and the tracking accuracy drops sharply, or the bandwidth and the amount of computation become excessive.
[Summary of the Invention]
The embodiment of the invention provides a target tracking method, a device, terminal equipment and a storage medium, which obtain a frame extraction frequency matched to the dynamic change of the visual target and combine coincidence-rate and feature-matching methods, thereby reducing the amount of computation, achieving accurate tracking, and effectively guaranteeing the security quality of the monitored area.
In a first aspect, an embodiment of the present invention provides a target tracking method, applied to an electronic terminal device, the method comprising: setting a set number of image frames to be extracted per preset time length according to a feature characterizing the dynamic change of a visual target in a video to be detected; extracting the set number of image frames from each video clip to be detected of the preset time length; and performing target tracking on the visual target in the video to be detected based on the set number of image frames.
According to the target tracking method, the frame extraction frequency for the video to be detected is dynamically adjusted according to a feature characterizing the dynamic change of the visual target, so that the number of extracted image frames matches that dynamic change. It can be understood that when the visual target in the video to be detected changes rapidly, the number of frames extracted per preset time length is increased, and when it changes slowly, that number is reduced. The visual target is therefore tracked closely while it changes rapidly, and the amount of computation does not grow excessively while it does not. Accurate tracking is thus achieved with a reduced amount of computation, and the security quality of the monitored area is effectively guaranteed.
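For illustration only, the following Python sketch outlines the three steps of the first aspect as a loop over video clips; every helper function is a hypothetical placeholder, not the claimed implementation.

```python
# Non-authoritative sketch of the claimed three-step flow; every helper
# function passed in here is a hypothetical placeholder.

def track_targets(video_clips, estimate_dynamics, choose_set_number,
                  extract_frames, track):
    """Each element of video_clips spans one preset time length."""
    results = []
    for clip in video_clips:
        # Step 1: set the number of frames to extract per preset time length
        # from a feature characterizing how fast the visual targets change.
        set_number = choose_set_number(estimate_dynamics(clip))
        # Step 2: extract that many image frames from the clip.
        frames = extract_frames(clip, set_number)
        # Step 3: perform target tracking on the extracted frames.
        results.append(track(frames))
    return results
```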
In one possible implementation manner, setting a set number of extracted image frames per preset time length according to a feature representing dynamic change of a visual target in a video to be detected includes:
extracting a standard number of image frames from the video clips to be detected with each preset time length;
detecting the category of the visual target in the video to be detected based on the standard number of image frames;
and setting the set number of extracted image frames per preset time length according to the category of the visual target in the video to be detected.
In one possible implementation manner, setting the set number of image frames to be extracted per preset time length according to the category of the visual target in the video to be detected includes:
when the visual target in the video to be detected is of the vehicle category, increasing the standard number to obtain the set number;
and when the visual target in the video to be detected is of the pedestrian category, reducing the standard number to obtain the set number.
In one possible implementation manner, setting a set number of extracted image frames per preset time length according to a feature representing dynamic change of a visual target in a video to be detected includes:
extracting a standard number of image frames from each video clip to be detected of the preset time length;
detecting the number of visual targets in the video to be detected based on the standard number of image frames;
and setting the set number of extracted image frames per preset time length according to the number of visual targets in the video to be detected.
In one possible implementation manner, setting the set number of image frames to be extracted per preset time length according to the number of visual targets in the video to be detected includes:
when the number of visual targets in the video to be detected is greater than a first preset threshold, increasing the standard number to obtain the set number;
and when the number of visual targets in the video to be detected is smaller than the first preset threshold, reducing the standard number to obtain the set number.
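A minimal Python sketch of the two adjustment rules above, under the assumption of an illustrative standard number, threshold, and scaling factors (none of which the method fixes):

```python
# Illustrative values only; the patent specifies neither the standard number,
# nor the first preset threshold, nor how much to increase or decrease.
STANDARD_NUMBER = 50
FIRST_PRESET_THRESHOLD = 10

def adjust_set_number(category: str, target_count: int) -> int:
    set_number = STANDARD_NUMBER
    # Category rule: vehicles change position faster than pedestrians.
    if category == "vehicle":
        set_number = int(set_number * 1.4)   # increase the standard number
    elif category == "pedestrian":
        set_number = int(set_number * 0.6)   # reduce the standard number
    # Count rule: many targets make relative-position changes more likely.
    if target_count > FIRST_PRESET_THRESHOLD:
        set_number = int(set_number * 1.2)
    elif target_count < FIRST_PRESET_THRESHOLD:
        set_number = int(set_number * 0.8)
    return set_number
```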
In one possible implementation manner, performing target tracking on the visual target in the video to be detected based on the set number of image frames includes:
labeling the different visual targets displayed in each of the set number of image frames by using a multi-layer neural network for target detection, to obtain a plurality of image frames carrying labels;
sequentially taking each of the plurality of image frames carrying labels as the current image frame;
when the coincidence rate between a first specific position area corresponding to any label in the current image frame and a first visual target to be tracked in the previous image frame is greater than a second preset threshold, determining that the first visual target to be tracked is successfully tracked, and deleting the label corresponding to the first specific position area;
performing feature comparison between the current image frame with part of its labels deleted and the previous image frame;
and performing target tracking, according to the feature comparison result, on the position areas whose labels are retained in the current image frame with part of its labels deleted.
In one possible implementation manner, performing target tracking, according to the feature comparison result, on the position areas whose labels are retained includes:
when the similarity between a second specific position area corresponding to any label in the current image frame with part of its labels deleted and the features of a second visual target to be tracked in the previous image frame is greater than a third preset threshold, determining that the second visual target to be tracked is successfully tracked, and deleting the label of the second specific position area.
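The two-stage matching described in these implementations can be sketched as follows; the box and feature representations, the helper functions, and the threshold values are assumptions for illustration, not the claimed implementation. Labels matched by coincidence rate in the first stage are deleted before the second stage, which is where the computation saving arises.

```python
# Sketch of one tracking step between the previous frame's targets and the
# current frame's labels. `iou` and `feature_similarity` are assumed helpers;
# t2 and t3 stand in for the second and third preset thresholds.

def track_frame_pair(prev_targets, current_labels, iou, feature_similarity,
                     t2=0.5, t3=0.8):
    remaining = dict(current_labels)          # labels not yet matched
    matched = {}
    # Stage 1: coincidence rate. A labeled area overlapping a tracked target
    # beyond t2 is a successful track; its label is deleted from the pool.
    for tid, target in prev_targets.items():
        for lid, label in list(remaining.items()):
            if iou(target["box"], label["box"]) > t2:
                matched[tid] = label
                del remaining[lid]
                break
    # Stage 2: feature comparison, run only on the labels that survived
    # stage 1, so the deleted labels cost nothing here.
    for tid, target in prev_targets.items():
        if tid in matched:
            continue
        for lid, label in list(remaining.items()):
            if feature_similarity(target["feat"], label["feat"]) > t3:
                matched[tid] = label
                del remaining[lid]
                break
    return matched, remaining
```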
In a second aspect, an embodiment of the present invention provides a target tracking device, provided in an electronic terminal device, the device including:
the setting module is used for setting the set number of extracted image frames per preset time length according to the characteristic representing the dynamic change of the visual target in the video to be detected;
the extraction module is used for extracting the set number of image frames from the video clips to be detected with each preset time length;
and the tracking module is used for tracking the target of the visual target in the video to be detected based on the set number of image frames.
In one possible implementation manner, the setting module includes:
the first extraction sub-module is used for extracting a standard number of image frames from the video clips to be detected with each preset time length;
the first detection sub-module is used for detecting the category of the visual target in the video to be detected based on the standard number of image frames;
the first setting sub-module is used for setting the set number of extracted image frames per preset time length according to the category of the visual target in the video to be detected.
In one possible implementation manner, the first setting submodule is specifically configured to: increase the standard number to obtain the set number when the visual target in the video to be detected is of the vehicle category; and reduce the standard number to obtain the set number when the visual target in the video to be detected is of the pedestrian category.
In one possible implementation manner, the setting module includes:
the second extraction sub-module is used for extracting the standard number of image frames from the video clips to be detected with each preset time length;
the second detection sub-module is used for detecting the number of visual targets in the video to be detected based on the standard number of image frames;
and the second setting sub-module is used for setting the set number of extracted image frames per preset time length according to the number of visual targets in the video to be detected.
In one possible implementation manner, the second setting submodule is specifically configured to: increase the standard number to obtain the set number when the number of visual targets in the video to be detected is greater than a first preset threshold; and reduce the standard number to obtain the set number when the number of visual targets in the video to be detected is smaller than the first preset threshold.
In one possible implementation manner, the tracking module includes:
the marking sub-module is used for respectively marking different visual targets displayed by each image frame in the set number of image frames by utilizing a multi-layer neural network for executing target detection, so as to obtain a plurality of image frames carrying the marking;
The image frame determining sub-module is used for sequentially taking each image frame in the plurality of image frames carrying the labels as a current image frame;
the deleting sub-module is used for determining that the first visual target to be tracked is successfully tracked when the coincidence rate of a first specific position area corresponding to any mark in the current image frame and a first visual target to be tracked in a previous frame image of the current image frame is larger than a second preset threshold value, and deleting the mark corresponding to the first specific position area;
the comparison sub-module is used for performing feature comparison on the current image frame with the deleted part of the labels and the image of the previous frame of the current image frame;
and the tracking sub-module is used for carrying out target tracking on the position area reserved for the mark in the current image frame with the part marked deleted according to the characteristic comparison result.
In one possible implementation manner, the tracking submodule is specifically configured to: when the similarity between a second specific position area corresponding to any label in the current image frame with part of its labels deleted and the features of a second visual target to be tracked in the previous image frame is greater than a third preset threshold, determine that the second visual target to be tracked is successfully tracked, and delete the label of the second specific position area.
In a third aspect, an embodiment of the present invention provides a target tracking system, provided in an electronic terminal device, where the target tracking system includes:
a dynamic frame extraction unit, a target detection unit and a target tracking unit; wherein:
the dynamic frame extraction unit is used for extracting the set number of image frames from each video clip to be detected of the preset time length according to the feature characterizing the dynamic change of the visual target in the video to be detected;
the target detection unit is used for detecting the dynamic change characteristics of the visual target in the video to be detected;
the target tracking unit is used for tracking the target of the visual target in the video to be detected based on the set number of image frames.
In one possible implementation manner, the target tracking unit includes a coincidence rate calculating unit and a feature matching unit, and the system further includes a database;
the coincidence rate calculating unit is used for calculating the coincidence rate of every two adjacent image frames;
the feature matching unit is used for performing feature matching on every two adjacent image frames;
the database is used for storing the structured data of the video to be detected;
the target tracking unit is further used for connecting the visual targets across different image frames according to the calculation results of the feature matching unit to obtain track data, and for searching the structured data according to the track data to obtain the trajectory along which the visual target moves.
In a fourth aspect, an embodiment of the present invention provides a terminal device, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor invoking the program instructions capable of performing the method provided in the first aspect.
In a fifth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method provided in the first aspect.
It should be understood that the second to fifth aspects of the embodiments of the present invention are consistent with the technical solutions of the first aspect of the embodiments of the present invention, and the beneficial effects obtained by each aspect and the corresponding possible implementation manner are similar, and are not repeated.
[Description of the Drawings]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present specification, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart illustrating steps of a target tracking method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an example of setting the set number of extracted image frames per preset time length according to the present invention;
FIG. 3 is a flowchart of another example of setting the set number of extracted image frames per preset time length according to the present invention;
FIG. 4 is a flow chart of object tracking according to another embodiment of the present invention;
FIG. 5 is a flow chart of an example of the present invention for acquiring a target trajectory to be tracked;
FIG. 6 is a schematic diagram of a target tracking system according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a target tracking system according to another embodiment of the present invention;
FIG. 8 is a functional block diagram of a target tracking apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic terminal device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure.
[Detailed Description]
For a better understanding of the technical solutions of the present specification, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
It should be understood that the described embodiments are only some, but not all, of the embodiments of the present description. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present disclosure.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Fig. 1 is a flowchart of steps of a target tracking method according to an embodiment of the present invention, as shown in fig. 1, the steps include:
s101: and setting the set number of extracted image frames per preset time length according to the characteristic representing the dynamic change of the visual target in the video to be detected.
The preset time length may be, for example, a unit time length, or a set time length such as 1 ms or 2 ms.
The feature characterizing the dynamic change of the visual target in the video to be detected may be a feature representing the moving speed of a single visual target, or a feature representing the relative change speed of different visual targets. For example, it may be a feature representing the walking speed of human body A in the video, or a feature representing the relative action frequency of human body A and human body B.
S102: and extracting the set number of image frames from the video clips to be detected with each preset time length.
The larger the set number, the more image frames are extracted per preset time length; a high frame extraction rate means that many image frames are extracted per preset time length.
S103: and carrying out target tracking on the visual target in the video to be detected based on the set number of image frames.
And carrying out target tracking on the visual target in the video to be detected according to the extracted image frame.
For example, when the visual target in the video to be detected moves quickly, M1 image frames are extracted from each video clip to be detected of the preset time length; when the visual target in the video to be detected moves slowly, M2 image frames are extracted, where M1 > M2.
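A minimal frame-extraction sketch using OpenCV, assuming uniform sampling within each window; the patent does not prescribe how the extracted frames are positioned within the preset time length.

```python
# Assumes OpenCV (cv2) is available; uniform sampling is an assumption.
import cv2

def extract_frames(video_path: str, set_number: int, window_seconds: float = 1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0         # fall back if fps is unknown
    frames_per_window = max(1, int(fps * window_seconds))
    step = max(1, frames_per_window // set_number)  # keep ~set_number per window
    extracted, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            extracted.append(frame)
        index += 1
    cap.release()
    return extracted
```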
The target tracking method provided by the embodiment of the invention can be suitable for scenes with a large number of visual targets and large variability of the visual targets, such as security scenes, monitoring scenes and the like. For example, when a large number of targets exist at a road intersection, particularly when pedestrian and vehicle flows are large, the number of targets is large, and the moving speed is high. The target tracking method provided by the embodiment of the invention can track the target of the visual target in the video to be detected based on more image frames when the number of the visual targets is large and the moving speed is high, thereby ensuring accurate target tracking and high tracking success rate. And when the number of visual targets is small and the moving speed is low, the target tracking is performed on the visual targets in the video to be detected based on fewer image frames, so that the calculated amount and the hardware resource consumption are reduced.
An embodiment of the invention proposes that the feature characterizing the dynamic change of the visual target in the video to be detected may be the category of the visual targets or the number of visual targets. In this embodiment, setting the set number of extracted image frames per preset time length according to this feature may include the following steps:
s1021: and extracting a standard number of image frames from the video clips to be detected for each preset time length.
S1022: and detecting the category of the visual target in the video to be detected based on the standard number of image frames.
Extracting a standard number of image frames from a video segment to be detected for each preset time length can be understood as extracting video frames of the video to be detected by adopting a default frame extracting frequency.
The visual target movement speeds of the different categories are generally different. For example, the visual targets of the video to be detected, which are collected by the video photographing device disposed at the intersection of roads, include pedestrians and vehicles. Generally, the moving speed of the pedestrian is lower than the moving speed of the vehicle. Therefore, the embodiment of the invention detects the visual target category in the video to be detected, and determines the extraction frequency according to the visual target category in the video to be detected.
S1023: and setting the set number of extracted image frames per preset time length according to the category of the visual target in the video to be detected.
Wherein, in S1023, setting the set number of extracted image frames per preset time length according to the category of the visual target in the video to be detected includes: when the visual target in the video to be detected is of the vehicle category, increasing the standard number to obtain the set number; and when the visual target in the video to be detected is of the pedestrian category, reducing the standard number to obtain the set number.
The above process of setting the set number of extracted image frames per preset time length according to the category of the visual target in the video to be detected can be regarded as dynamically adjusting the frame extraction frequency. Referring to FIG. 2, which is a flowchart of an example of setting the set number of extracted image frames per preset time length according to the present invention, dynamically adjusting the frame extraction frequency includes the following process:
K201: The camera device is connected to the system. Frames are extracted from the video to be detected at a default frame extraction rate of 50 frames per second to obtain a plurality of image frames.
K202: The image frames are detected with a preset target detection algorithm, and it is determined that the visual targets displayed in the video to be detected are vehicles. The preset target detection algorithm may be Faster R-CNN, RetinaNet, CenterNet, or the like.
K203: The frame extraction rate is adjusted: a rate higher than the default 50 frames per second, for example 70 frames per second, is adopted to extract frames from the video to be detected, obtaining the image frames for target tracking.
K204: The image frames for target tracking are input into a vehicle target detection model for target tracking. The vehicle target detection model outputs image frames in which the vehicle visual targets are labeled.
Alternatively, dynamically adjusting the frame extraction frequency includes the following steps:
K201: The camera device is connected to the system. Frames are extracted from the video to be detected at the default frame extraction rate of 50 frames per second to obtain a plurality of image frames.
K202: The image frames are detected with a preset target detection algorithm, and it is determined that the visual targets displayed in the video to be detected are pedestrians. The preset target detection algorithm may be Faster R-CNN, RetinaNet, CenterNet, or the like.
K203: The frame extraction rate is adjusted: a rate lower than the default 50 frames per second is adopted to extract frames from the video to be detected, obtaining the image frames for target tracking.
K204: The image frames for target tracking are input into a human body target detection model for target tracking. The human body target detection model outputs image frames in which the human body visual targets are labeled.
Alternatively, dynamically adjusting the frame extraction frequency includes the following steps:
K201: The camera device is connected to the system. Frames are extracted from the video to be detected at the default frame extraction rate of 50 frames per second to obtain a plurality of image frames.
K202: The image frames are detected with a preset target detection algorithm, and it is determined that the visual targets displayed in the video to be detected include both pedestrians and vehicles. The preset target detection algorithm may be Faster R-CNN, RetinaNet, CenterNet, or the like.
K203: The frame extraction rate is adjusted: a rate higher than the default 50 frames per second, for example 70 frames per second, is adopted to extract frames from the video to be detected, obtaining the image frames for target tracking.
K204: The image frames for target tracking are input into a vehicle target detection model for target tracking, and the vehicle target detection model outputs image frames in which the vehicle visual targets are labeled; after frame-dropping processing, the image frames for target tracking are input into a human body target detection model, which outputs image frames in which the human body visual targets are labeled. Target tracking is then performed separately based on the image frames labeled with vehicle targets and the image frames labeled with human body targets.
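The K201-K204 flows can be summarized in one sketch; the detector interface and the lowered rate of 30 frames per second are assumptions (only the default 50 and the raised 70 appear in the examples above).

```python
# Sketch of the probe-then-adjust flow; detect_categories and extract_at_rate
# are assumed helpers, and 30 fps for the pedestrian-only case is an assumption.
DEFAULT_FPS = 50

def dynamic_frame_extraction(video_path, detect_categories, extract_at_rate):
    # K201: extract frames at the default rate.
    probe_frames = extract_at_rate(video_path, DEFAULT_FPS)
    # K202: run a preset detector (e.g. Faster R-CNN, RetinaNet, CenterNet).
    categories = detect_categories(probe_frames)
    # K203: raise the rate if vehicles appear; lower it for pedestrians only.
    fps = 70 if "vehicle" in categories else 30
    # K204: the re-extracted frames feed the per-category detection models.
    return extract_at_rate(video_path, fps), categories
```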
In this embodiment, according to the feature characterizing the dynamic change of the visual target in the video to be detected, setting the set number of extracted image frames per preset time length may further include the steps of:
S102-1: and extracting a standard number of image frames from the video clips to be detected for each preset time length.
S102-2: and detecting the number of visual targets in the video to be detected based on the standard number of image frames.
When the number of visual targets is large, the probability that the relative positions of different visual targets change increases. Therefore, the embodiment of the invention detects the number of visual targets in the video to be detected and determines the frame extraction frequency according to that number.
S102-3: and setting the set number of extracted image frames per preset time length according to the number of visual targets in the video to be detected.
Wherein, S102-3 sets a set number of extracted image frames per preset time length according to the number of visual objects in the video to be detected, including:
when the number of visual targets in the video to be detected is greater than a first preset threshold, increasing the standard number to obtain the set number; and when the number of visual targets in the video to be detected is smaller than the first preset threshold, reducing the standard number to obtain the set number.
FIG. 3 is a flowchart of another example of setting the set number of extracted image frames per preset time length. As shown in FIG. 3, the process of setting the set number of extracted image frames per preset time length to complete dynamic frame extraction includes:
K301: A current frame image of the video to be detected is acquired.
K302: Pedestrians and vehicles in the video to be detected are detected by the target detection algorithm, and detection results including the categories and the number of the visual targets are output. The models executing the target detection algorithm include a vehicle target detection model and a pedestrian target detection model.
K303: The detection results are fed back to the dynamic frame extraction stage, and the frame extraction rate is dynamically adjusted according to the visual target category and the number of visual targets.
K304: Frames are extracted from the video to be detected at the adjusted frame rate to obtain the set number of image frames, and target tracking is performed based on the set number of image frames.
In order to further reduce the amount of computation, another embodiment of the present invention proposes an implementation for performing target tracking on the visual targets in the video to be detected. FIG. 4 is a flowchart of target tracking according to another embodiment of the present invention. As shown in FIG. 4, this embodiment sequentially calculates the coincidence rate of the visual targets for every two adjacent image frames; when the coincidence rate of any visual target is greater than a second preset threshold T1, it is determined that the visual target is successfully tracked, and the label of that visual target is deleted. Feature comparison is then performed on the image frames with part of their labels deleted; because part of the labels have already been deleted before the feature comparison stage, the amount of computation in that stage is reduced.
In the feature comparison performed on the image frames after the coincidence rate calculation, when the similarity of a visual target between two adjacent image frames is greater than a third preset threshold T3, it is determined that the visual target is successfully tracked, and the label of that visual target is deleted.
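The coincidence rate is commonly computed as intersection-over-union (IoU) of the two labeled areas; the exact formula and the per-category threshold values below are assumptions, not fixed by the method.

```python
# IoU over axis-aligned boxes (x1, y1, x2, y2); the formula is an assumption.

def coincidence_rate(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Per-category second preset thresholds (T1 for human body, T2 for vehicle,
# as in the example below); the numeric values are illustrative assumptions.
SECOND_PRESET_THRESHOLD = {"human": 0.4, "vehicle": 0.3}
```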
S103 includes the following sub-steps:
S1031: The different visual targets displayed in each of the set number of image frames are labeled by a multi-layer neural network for target detection, obtaining a plurality of image frames carrying labels.
The different visual targets displayed in an image frame can each be labeled by box selection: the multi-layer neural network for target detection detects the visual targets in the image frames, selects each visual target with a target box, and outputs the image frames carrying the target boxes.
S1032: Each of the plurality of image frames carrying labels is taken as the current image frame in turn.
S1033: When the coincidence rate between the first specific position area corresponding to any label in the current image frame and the first visual target to be tracked in the previous image frame is greater than a second preset threshold, it is determined that the first visual target to be tracked is successfully tracked, and the label corresponding to the first specific position area is deleted.
When the coincidence rate between the first specific position area selected by a target box and the first visual target to be tracked in the previous image frame is greater than the second preset threshold, the first specific position area can be determined to be the area of the first visual target to be tracked in the current image frame, and the first visual target to be tracked is determined to be successfully tracked. The target box labeling the first specific position area is then deleted, so that the area does not need to be computed again in the subsequent feature comparison, saving computation resources.
Different second preset thresholds may be set for different categories of visual targets.
In one example of the present invention, the coincidence rate is calculated between the current image frame Q1 and the previous image frame Q2. For visual targets of the human body category, the second preset threshold is set to T1; for visual targets of the vehicle category, the second preset threshold is set to T2. When the visual target category is human body and the coincidence rate between any area in image frame Q1 and a labeled visual target in image frame Q2 is greater than T1, it is determined that the visual target is successfully tracked. When the visual target category is vehicle and the coincidence rate between any area in image frame Q1 and a labeled visual target in image frame Q2 is greater than T2, it is determined that the visual target is successfully tracked.
S1034: Feature comparison is performed between the current image frame with part of its labels deleted and the previous image frame.
S1035: According to the feature comparison result, target tracking is performed on the position areas whose labels are retained in the current image frame with part of its labels deleted.
Performing target tracking on the position areas whose labels are retained in the current image frame with part of its labels deleted includes: when the similarity between a second specific position area corresponding to any label in the current image frame and the features of a second visual target to be tracked in the previous image frame is greater than a third preset threshold, determining that the second visual target to be tracked is successfully tracked, and deleting the label of the second specific position area.
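One common choice for the feature comparison is cosine similarity between the feature maps of the two areas; this particular metric is an assumption and is not mandated by the method.

```python
# Cosine similarity between two feature maps; the metric is an assumption.
import numpy as np

def feature_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    a, b = feat_a.ravel(), feat_b.ravel()
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b)) / denom if denom > 0 else 0.0
```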
FIG. 5 is a flowchart of an example of acquiring the trajectory of a target to be tracked. As shown in FIG. 5, this example acquires structured data of the video from the track data of the visual target to be tracked and stores the structured data in a database. When target tracking is performed on the visual target in the video to be detected, the position of the visual target is tracked in each of the image frames extracted from the video, and the change of the visual target's position across the different frames yields the track data. The track data and the feature data of the visual target to be tracked are then used to search the structured data of the video to be detected, obtaining the trajectory along which the tracked target moves.
The embodiment of the invention also provides a target tracking system, and fig. 6 is a schematic structural diagram of the target tracking system according to the embodiment of the invention, and as shown in fig. 6, the target tracking system comprises a dynamic frame extraction unit, a target detection unit and a target tracking unit.
The dynamic frame extraction unit is used for extracting the set number of image frames from each video clip to be detected of the preset time length according to the feature characterizing the dynamic change of the visual target in the video to be detected;
the target detection unit is used for detecting the dynamic change characteristics of the visual target in the video to be detected;
the target tracking unit is used for tracking the target of the visual target in the video to be detected based on the set number of image frames.
The dynamic frame extraction unit extracts the set number of image frames from each video clip to be detected of the preset time length according to the dynamic change feature of the visual target in the video to be detected; the frame extraction can be dynamically adjusted as shown in FIG. 2.
The target detection unit includes a human body target detection unit and a vehicle target detection unit. The human body target detection unit is used for detecting and marking visual targets of human body categories in the image frames, and the human body target detection unit is also used for identifying the number of the visual targets of the human body categories in the image frames. The vehicle target detection unit is used for detecting and marking visual targets of vehicle categories in the image frames, and the vehicle target detection unit is also used for identifying the number of the visual targets of the vehicle categories in the image frames.
FIG. 7 is a schematic structural diagram of a target tracking system according to another embodiment of the present invention. As shown in FIG. 7, the target tracking unit includes a coincidence rate calculating unit and a feature matching unit. The coincidence rate calculating unit is used for calculating the coincidence rate between every two adjacent image frames; the labels of successfully tracked visual targets can be deleted, and the image frames with part of their labels deleted are input into the feature matching unit.
The feature matching unit is used for performing feature matching on every two adjacent image frames.
The coincidence rate calculating unit calculates the coincidence rate between every two adjacent image frames, and the image frames are then input into the feature matching unit. The process by which the feature matching unit performs feature matching on every two adjacent image frames can refer to the visual target tracking method shown in FIG. 4: the coincidence rate of the box-selected labeled image areas in every two adjacent image frames is calculated; an image area whose coincidence rate is greater than the second preset threshold is determined to be the visual target to be tracked, and the label of that area is deleted; feature comparison is then performed on the image frames after the coincidence rate calculation. The feature comparison consists of extracting the feature maps of two adjacent image frames and calculating the similarity of the feature maps.
The target tracking system further includes a database. The database is used for storing the structured data of the video to be detected. The target tracking unit is further used for connecting the visual targets across different image frames according to the calculation results of the feature matching unit to obtain track data, and for searching the structured data according to the track data to obtain the trajectory along which the visual target moves.
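A sketch of trajectory assembly and retrieval, assuming the structured data is keyed by target and frame index; the dict-based storage schema is an illustrative assumption.

```python
# Chain per-frame matches into track data, then query the structured data.
# The dict "database" and its (target_id, frame_index) keys are assumptions.

def build_tracks(per_frame_matches):
    """per_frame_matches: list (in frame order) of {target_id: box} dicts."""
    tracks = {}
    for frame_index, matches in enumerate(per_frame_matches):
        for target_id, box in matches.items():
            tracks.setdefault(target_id, []).append((frame_index, box))
    return tracks

def search_structured_data(database, target_id, tracks):
    """Collect the stored structured records along one target's track."""
    return [database.get((target_id, frame_index))
            for frame_index, _box in tracks.get(target_id, [])]
```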
FIG. 8 is a functional block diagram of a target tracking device according to an embodiment of the present invention. The target tracking device is disposed in an electronic terminal device and, as shown in FIG. 8, includes:
the setting module 81 is configured to set a set number of extracted image frames per preset time length according to a characteristic representing dynamic change of a visual target in a video to be detected;
the extracting module 82 is configured to extract the set number of image frames from the video segments to be detected for each preset time length;
the tracking module 83 is configured to track a target of a visual target in the video to be detected based on the set number of image frames.
The embodiment shown in fig. 8 provides a target tracking device that can be used to implement the technical solutions of the method embodiments shown in fig. 1 to fig. 7 in the present specification, and the implementation principle and technical effects may be further referred to in the related description of the method embodiments.
Optionally, the setting module includes:
the first extraction sub-module is used for extracting a standard number of image frames from the video clips to be detected with each preset time length;
the first detection sub-module is used for detecting the category of the visual target in the video to be detected based on the standard number of image frames;
the first setting sub-module is used for setting the set number of extracted image frames per preset time length according to the category of the visual target in the video to be detected.
Optionally, the first setting submodule is specifically configured to: increase the standard number to obtain the set number when the visual target in the video to be detected is of the vehicle category; and reduce the standard number to obtain the set number when the visual target in the video to be detected is of the pedestrian category.
Optionally, the setting module includes:
the second extraction sub-module is used for extracting the standard number of image frames from the video clips to be detected with each preset time length;
the second detection sub-module is used for detecting the number of visual targets in the video to be detected based on the standard number of image frames;
and the second setting sub-module is used for setting the set number of extracted image frames per preset time length according to the number of visual targets in the video to be detected.
Optionally, the second setting submodule is specifically configured to: increase the standard number to obtain the set number when the number of visual targets in the video to be detected is greater than a first preset threshold; and reduce the standard number to obtain the set number when the number of visual targets in the video to be detected is smaller than the first preset threshold.
Optionally, the tracking module includes:
the marking sub-module is used for respectively marking different visual targets displayed by each image frame in the set number of image frames by utilizing a multi-layer neural network for executing target detection, so as to obtain a plurality of image frames carrying the marking;
the image frame determining sub-module is used for sequentially taking each image frame in the plurality of image frames carrying the labels as a current image frame;
the deleting sub-module is used for determining that the first visual target to be tracked is successfully tracked when the coincidence rate of a first specific position area corresponding to any mark in the current image frame and a first visual target to be tracked in a previous frame image of the current image frame is larger than a second preset threshold value, and deleting the mark corresponding to the first specific position area;
The comparison sub-module is used for performing feature comparison on the current image frame with the deleted part of the labels and the image of the previous frame of the current image frame;
and the tracking sub-module is used for carrying out target tracking on the position area reserved for the mark in the current image frame with the part marked deleted according to the characteristic comparison result.
Optionally, the tracking submodule is specifically configured to: when the similarity between a second specific position area corresponding to any label in the current image frame with part of its labels deleted and the features of the second visual target to be tracked in the previous image frame is greater than a third preset threshold, determine that the second visual target to be tracked is successfully tracked, and delete the label of the second specific position area.
The device provided by the above-described embodiment is used for executing the technical scheme of the above-described method embodiment, and its implementation principle and technical effects may further refer to the related description in the method embodiment, which is not repeated herein.
The device provided by the above-described embodiment may be, for example, a chip or a chip module.
Each module/unit included in each apparatus described in the above embodiments may be a software module/unit, a hardware module/unit, or partly software and partly hardware. For example, for an apparatus applied to or integrated in a chip, each module/unit it contains may be implemented in hardware such as a circuit; or at least some modules/units may be implemented as a software program running on a processor integrated in the chip, with the remaining modules/units implemented in hardware such as a circuit. For an apparatus applied to or integrated in a chip module, each module/unit it contains may be implemented in hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip or a circuit module) of the chip module or in different components; or at least some modules/units may be implemented as a software program running on a processor integrated in the chip module, with the remaining modules/units implemented in hardware such as a circuit. For an apparatus applied to or integrated in an electronic terminal device, each module/unit it contains may be implemented in hardware such as a circuit, and different modules/units may be located in the same component (e.g., a chip or a circuit module) of the electronic terminal device or in different components; or at least some modules/units may be implemented as a software program running on a processor integrated in the electronic terminal device, with the remaining (if any) modules/units implemented in hardware such as a circuit.
FIG. 9 is a schematic structural diagram of an electronic terminal device provided in an embodiment of the present invention. The electronic terminal device 900 includes a processor 910, a memory 911, and a computer program stored in the memory 911 and executable on the processor 910; the steps in the foregoing method embodiments are implemented when the processor 910 executes the program. The electronic terminal device provided in this embodiment may be used to execute the technical solutions of the foregoing method embodiments; for the implementation principles and technical effects, reference may be made to the related descriptions in the method embodiments, which are not repeated herein.
Fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure, where, as shown in fig. 10, the terminal device may include at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by a processor that invokes the program instructions to perform the object tracking method provided in the embodiments shown in fig. 1 to 7 of the present specification.
It is to be understood that the structure illustrated in the embodiments of the present invention does not constitute a specific limitation on the terminal device 100. In other embodiments of the invention, the terminal device 100 may include more or fewer components than illustrated, may combine certain components, may split certain components, or may arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
As shown in fig. 10, the terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a mobile communication module 150, a wireless communication module 160, an indicator 192, a camera 193, a display 194, and the like.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can generate operation control signals according to instruction operation codes and timing signals, completing the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The processor 110 executes various functional applications and data processing by running programs stored in the internal memory 121, for example, implementing the object tracking method provided by the embodiments of the present invention shown in fig. 1 to 7.
The wireless communication function of the terminal device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the terminal device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The terminal device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The terminal device 100 may implement a photographing function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, the terminal device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used to process digital signals, and can process other digital signals in addition to digital image signals. For example, when the terminal device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, and so on.
Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs, so that the terminal device 100 can play or record video in various encoding formats, for example: moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and so on.
The internal memory 121 may be used to store computer executable program code including instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data (such as audio data, phonebook, etc.) created during use of the terminal device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications of the terminal device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the target tracking method provided in the embodiments shown in FIG. 1 to FIG. 7 of the present specification. The non-transitory computer-readable storage medium may refer to a non-volatile computer storage medium.
The non-transitory computer-readable storage media described above may employ any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out the operations of the present specification may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, via the Internet using an Internet service provider).
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the description of embodiments of the present invention, reference to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present specification. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine the different embodiments or examples described in this specification, and the features thereof, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present specification, the meaning of "plurality" means at least two, for example, two, three, etc., unless explicitly defined otherwise.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. Alternative implementations are included within the scope of the preferred embodiments of the present specification, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art of the embodiments of the present specification.
Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined," "in response to determining," "when (the stated condition or event) is detected," or "in response to detecting (the stated condition or event)," depending on the context.
It should be noted that, the terminal according to the embodiment of the present invention may include, but is not limited to, a personal computer (personal computer, PC), a personal digital assistant (personal digital assistant, PDA), a wireless handheld device, a tablet computer (tablet computer), a mobile phone, an MP3 player, an MP4 player, and the like.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in each embodiment of the present specification may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods described in the embodiments of the present specification. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims (10)

1. A target tracking method, applied to an electronic terminal device, comprising:
setting a set number of image frames to be extracted per preset time length according to a characteristic representing the dynamic change of a visual target in a video to be detected;
extracting the set number of image frames from each to-be-detected video clip of the preset time length; and
performing target tracking on the visual target in the video to be detected based on the set number of image frames.
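For orientation only, the following is a minimal Python sketch of the claimed pipeline, not an implementation from the disclosure; `set_count` and `track` are hypothetical stand-ins for the characteristic-detection and tracking stages elaborated in claims 2 to 7.

```python
from typing import Callable, List, Sequence

def extract_frames(segment: Sequence, n: int) -> List:
    """Pick n roughly evenly spaced frames from one video clip
    of the preset time length."""
    if n <= 0 or not segment:
        return []
    step = max(len(segment) // n, 1)
    return list(segment[::step])[:n]

def track_video(segments: Sequence[Sequence],
                set_count: Callable[[Sequence], int],
                track: Callable[[List], None]) -> None:
    """Claim 1 as a loop: choose a per-segment frame count from the
    visual target's dynamic characteristics, decimate, then track."""
    for segment in segments:
        n = set_count(segment)               # set number per preset time length
        frames = extract_frames(segment, n)  # extract the set number of frames
        track(frames)                        # track visual targets on them
```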
2. The method of claim 1, wherein setting the set number of image frames to be extracted per preset time length according to the characteristic representing the dynamic change of the visual target in the video to be detected comprises:
extracting a standard number of image frames from each to-be-detected video clip of the preset time length;
detecting the category of the visual target in the video to be detected based on the standard number of image frames; and
setting the set number of image frames to be extracted per preset time length according to the category of the visual target in the video to be detected.
3. The method of claim 2, wherein setting the set number of image frames to be extracted per preset time length according to the category of the visual target in the video to be detected comprises:
when the visual target in the video to be detected is of a vehicle category, increasing the standard number to obtain the set number; and
when the visual target in the video to be detected is of a pedestrian category, reducing the standard number to obtain the set number.
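As a hedged illustration of claims 2 and 3, the category-driven adjustment might look like the sketch below; the standard count of 5 and the ±2 offsets are assumptions, since the claims do not fix concrete values.

```python
def set_count_by_category(category: str, standard: int = 5) -> int:
    """Claims 2-3: sample faster-moving vehicle targets more densely
    and slower pedestrian targets more sparsely. Values are illustrative."""
    if category == "vehicle":
        return standard + 2          # increase the standard number
    if category == "pedestrian":
        return max(standard - 2, 1)  # reduce it, keeping at least one frame
    return standard                  # other categories keep the standard
```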
4. The method of claim 1, wherein setting the set number of image frames to be extracted per preset time length according to the characteristic representing the dynamic change of the visual target in the video to be detected comprises:
extracting a standard number of image frames from each to-be-detected video clip of the preset time length;
detecting the number of visual targets in the video to be detected based on the standard number of image frames; and
setting the set number of image frames to be extracted per preset time length according to the number of visual targets in the video to be detected.
5. The method of claim 4, wherein setting the set number of image frames to be extracted per preset time length according to the number of visual targets in the video to be detected comprises:
when the number of visual targets in the video to be detected is larger than a first preset threshold, increasing the standard number to obtain the set number; and
when the number of visual targets in the video to be detected is smaller than the first preset threshold, reducing the standard number to obtain the set number.
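Claims 4 and 5 swap the category test for a head count against the first preset threshold. A sketch under the same caveats applies; the threshold of 10 and the offsets are assumed, not disclosed.

```python
def set_count_by_target_count(num_targets: int, threshold: int = 10,
                              standard: int = 5) -> int:
    """Claims 4-5: crowded scenes get more frames per segment,
    sparse scenes fewer. Threshold and offsets are illustrative."""
    if num_targets > threshold:
        return standard + 2          # more targets, denser sampling
    if num_targets < threshold:
        return max(standard - 2, 1)  # fewer targets, sparser sampling
    return standard
```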
6. The method of claim 1, wherein performing target tracking on the visual target in the video to be detected based on the set number of image frames comprises:
respectively annotating the different visual targets displayed in each of the set number of image frames by using a multi-layer neural network for target detection, to obtain a plurality of image frames carrying annotations;
taking each of the plurality of image frames carrying annotations in turn as a current image frame;
when the coincidence rate between a first specific position area corresponding to any annotation in the current image frame and a first visual target to be tracked in the previous image frame of the current image frame is larger than a second preset threshold, determining that the first visual target to be tracked is tracked successfully, and deleting the annotation corresponding to the first specific position area;
performing feature comparison between the current image frame, from which part of the annotations have been deleted, and the previous image frame of the current image frame; and
performing target tracking, according to the feature comparison result, on the position areas whose annotations remain in the current image frame from which part of the annotations have been deleted.
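Reading the "coincidence rate" of claim 6 as intersection-over-union between bounding boxes, a minimal matching pass might look as follows; the IoU interpretation and the 0.5 threshold are assumptions, not values from the disclosure.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, standing
    in for the coincidence rate of claim 6."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def drop_tracked_annotations(prev_targets, annotated_boxes, threshold=0.5):
    """Delete an annotation once its position area overlaps a target from
    the previous frame above the second preset threshold (assumed 0.5);
    whatever remains goes on to the feature comparison of claim 7."""
    remaining = list(annotated_boxes)
    for target in prev_targets:
        for box in list(remaining):
            if iou(target, box) > threshold:
                remaining.remove(box)  # tracked successfully, drop its label
                break
    return remaining
```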
7. The method of claim 6, wherein performing target tracking, according to the feature comparison result, on the position areas whose annotations remain in the current image frame comprises:
when the feature similarity between a second specific position area corresponding to any remaining annotation in the current image frame and a second visual target to be tracked in the previous image frame of the current image frame is larger than a third preset threshold, determining that the second visual target to be tracked is tracked successfully, and deleting the annotation of the second specific position area.
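Claim 7 then resolves the leftover annotations by appearance rather than position. In the sketch below, `embed` and `cosine` are hypothetical feature-extraction and similarity functions, and the 0.7 threshold is assumed.

```python
def match_remaining_by_features(remaining_boxes, prev_targets,
                                embed, cosine, threshold=0.7):
    """Claim 7: a remaining annotation whose feature similarity to a
    to-be-tracked target exceeds the third preset threshold counts as
    tracked, and its annotation is deleted."""
    unmatched = []
    for box in remaining_boxes:
        feature = embed(box)
        if any(cosine(feature, embed(t)) > threshold for t in prev_targets):
            continue  # tracked successfully, annotation deleted
        unmatched.append(box)
    return unmatched
```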
8. A target tracking apparatus, provided in an electronic terminal device, the apparatus comprising:
a setting module, used for setting a set number of image frames to be extracted per preset time length according to a characteristic representing the dynamic change of a visual target in a video to be detected;
an extraction module, used for extracting the set number of image frames from each to-be-detected video clip of the preset time length; and
a tracking module, used for performing target tracking on the visual target in the video to be detected based on the set number of image frames.
9. A terminal device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein,
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the method of any one of claims 1 to 7.
CN202210316572.5A 2022-03-28 2022-03-28 Target tracking method, device, terminal equipment and storage medium Pending CN116895032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210316572.5A CN116895032A (en) 2022-03-28 2022-03-28 Target tracking method, device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210316572.5A CN116895032A (en) 2022-03-28 2022-03-28 Target tracking method, device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116895032A true CN116895032A (en) 2023-10-17

Family

ID=88309778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210316572.5A Pending CN116895032A (en) 2022-03-28 2022-03-28 Target tracking method, device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116895032A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117456097A (en) * 2023-10-30 2024-01-26 南通海赛未来数字科技有限公司 Three-dimensional model construction method and device
CN117456097B (en) * 2023-10-30 2024-05-14 南通海赛未来数字科技有限公司 Three-dimensional model construction method and device

Similar Documents

Publication Publication Date Title
CN113810587B (en) Image processing method and device
CN114679607B (en) Video frame rate control method and device, electronic equipment and storage medium
CN110457974B (en) Image superposition method and device, electronic equipment and readable storage medium
WO2021179804A1 (en) Image processing method, image processing device, storage medium, and electronic apparatus
CN115061770B (en) Method and electronic device for displaying dynamic wallpaper
WO2020000382A1 (en) Motion-based object detection method, object detection apparatus and electronic device
CN113066048A (en) Segmentation map confidence determination method and device
CN116895032A (en) Target tracking method, device, terminal equipment and storage medium
CN113538227A (en) Image processing method based on semantic segmentation and related equipment
CN111177460A (en) Method and device for extracting key frame
JP4110323B2 (en) Information output method and apparatus, program, and computer-readable storage medium storing information output program
CN117237761A (en) Training method of object re-recognition model, object re-recognition method and device
CN116048765B (en) Task processing method, sample data processing method and electronic equipment
CN111260697A (en) Target object identification method, system, device and medium
CN116778415A (en) Crowd counting network model for unmanned aerial vehicle and counting method
CN108495038B (en) Image processing method, image processing device, storage medium and electronic equipment
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN113792174B (en) Picture display method, device, terminal equipment and storage medium
CN114863392A (en) Lane line detection method, lane line detection device, vehicle, and storage medium
CN116137671A (en) Cover generation method, device, equipment and medium
CN114596580B (en) Multi-human-body target identification method, system, equipment and medium
CN116862945A (en) Panoramic video export method, device, terminal equipment and storage medium
CN116862946A (en) Motion video generation method, device, terminal equipment and storage medium
CN116828099B (en) Shooting method, medium and electronic equipment
CN117132648B (en) Visual positioning method, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination