CN108230352B - Target object detection method and device and electronic equipment


Info

Publication number
CN108230352B
Authority
CN
China
Prior art keywords
target object
motion state
video frame
result
state
Prior art date
Legal status
Active
Application number
CN201710059806.1A
Other languages
Chinese (zh)
Other versions
CN108230352A (en)
Inventor
余锋伟
李文博
闫俊杰
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201710059806.1A
Publication of CN108230352A
Application granted
Publication of CN108230352B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Image Analysis (AREA)

Abstract

Embodiments of the present invention provide a target object detection method and apparatus and an electronic device. The target object detection method includes: predicting, according to the feature points of at least one target object in a first video frame, the motion state of the target object in a second video frame by using a first neural network to obtain a motion state prediction result, where the first video frame is a current video frame and the second video frame is a subsequent video frame of the current video frame; predicting, according to the feature points of the target object, the position of the target object in the second video frame by using a second neural network to obtain a position prediction result; matching the position prediction result with the position detection result; and determining the motion state of the target object according to the matching result and the motion state prediction result. The embodiments of the present invention can effectively determine whether a target object disappears from a surveillance video, reduce target object tracking errors, and improve the accuracy of image detection.

Description

Target object detection method and device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a target object detection method and device and electronic equipment.
Background
Image detection is the process of extracting and detecting a feature region or detection target of interest in an image. In recent years, with the development of multi-target tracking technology, multi-target tracking has been increasingly applied to scenarios such as video surveillance and live video streaming for image detection and subsequent image processing.
Multi-target tracking is a technique that uses a computer to determine the position, size and complete motion trajectory of each moving object of interest that has certain distinctive visual features in a video sequence. The key to conventional multi-target tracking methods is efficient data association, i.e., pairing measurement data from a single sensor or multiple sensors with known or established trajectories. However, such data association methods lack a fine-grained judgment mechanism and cannot accurately determine whether a target object has been lost in the video, which leads to target object tracking errors and reduces the accuracy of image detection.
Disclosure of Invention
The embodiment of the invention provides a target object detection scheme.
According to an aspect of the embodiments of the present invention, there is provided a target object detection method, including: predicting the motion state of at least one target object in a second video frame by using a first neural network according to the feature points of the target object in a first video frame to obtain a motion state prediction result, where the first video frame is a current video frame and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in the second video frame by using a second neural network according to the feature points of the target object to obtain a position prediction result; matching the position prediction result with a position detection result of the corresponding target object in the second video frame; and determining the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, matching the position prediction result with the position detection result of the corresponding target object in the second video frame includes: associating the position prediction result with the position detection result of the corresponding target object in the second video frame in the appearance feature space, and determining the matching result between the position prediction result and the position detection result in the second video frame according to the association result.
Optionally, associating the position prediction result with the position detection result of the corresponding target object in the second video frame in the appearance feature space includes: determining the difference between the position prediction result and the position detection result, and performing the association in the appearance feature space according to the difference.
Optionally, the motion state of the target object includes at least one of: a tracked state, a transient loss state, a long-term loss state, and a disappearance state; where the tracked state indicates that the position prediction result of the target object is associated with the position detection result in the corresponding video frame; the transient loss state indicates that the position prediction result of the target object is not associated with the position detection result in the corresponding video frame; the long-term loss state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a first set number of consecutive video frames; the disappearance state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a second set number of consecutive video frames; and the first set number is less than the second set number.
Optionally, the motion state further comprises a generation state indicating that the target object appears in the first video frame for the first time.
Optionally, determining the motion state of the target object according to the matching result and the motion state prediction result includes: in response to the matching result indicating that the position prediction result is associated with the position detection result, marking the motion state of the target object as the tracked state.
Optionally, determining the motion state of the target object according to the matching result and the motion state prediction result further includes: in response to the matching result indicating that the position prediction result is not associated with the position detection result, obtaining the motion state prediction result of the target object.
Optionally, after obtaining the motion state prediction result of the target object, the method further includes: in response to the motion state prediction result indicating that the target object is in the tracked state, marking the motion state of the target object as the transient loss state.
Optionally, after obtaining the motion state prediction result of the target object, the method further includes: in response to the motion state prediction result indicating that the target object is in the transient loss state, judging whether the number of times the target object has been consecutively marked as the transient loss state reaches N-1; if so, marking the motion state of the target object as the long-term loss state; if not, marking the motion state of the target object as the transient loss state; where N represents the first set number.
Optionally, the method further includes: in response to the motion state prediction result indicating that the target object is in the long-term loss state, judging whether the number of times the target object has been consecutively marked as the long-term loss state reaches M-1; if so, marking the motion state of the target object as the disappearance state; if not, marking the motion state of the target object as the long-term loss state; where M represents the second set number.
Optionally, matching the position prediction result with the position detection result of the corresponding target object in the second video frame includes: matching the position prediction result with the position detection result of the corresponding target object in the second video frame detected by a target detector.
Optionally, after determining the motion state of the target object according to the matching result and the motion state prediction result, the method further includes: identifying the action of the target object according to the motion state.
Optionally, after determining the motion state of the target object according to the matching result and the motion state prediction result, the method further includes: counting the target objects according to the motion state.
Optionally, after determining the motion state of the target object according to the matching result and the motion state prediction result, the method further includes: counting the target objects according to the motion state, and analyzing the flow of the target objects according to the counting result.
Optionally, after determining the motion state of the target object according to the matching result and the motion state prediction result, the method further includes: detecting an abnormal target object according to the motion state, and raising an alarm for the abnormal target object.
Optionally, after determining the motion state of the target object according to the matching result and the motion state prediction result, the method further includes: recommending information to the target object according to the motion state.
Optionally, the first neural network is a recurrent neural network RNN, and/or the second neural network is a recurrent neural network RNN.
According to another aspect of the embodiments of the present invention, there is also provided a target object detection apparatus, including: the prediction module is used for predicting the motion state of at least one target object in a second video frame by using a first neural network according to the feature point of the target object in a first video frame to obtain a motion state prediction result, wherein the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in a second video frame by using a second neural network according to the characteristic point of the target object to obtain a position prediction result; the matching module is used for matching the position prediction result with a position detection result of the corresponding target object in a second video frame; and the determining module is used for determining the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the matching module is configured to associate the position prediction result with the position detection result of the corresponding target object in the second video frame in the appearance feature space, and determine the matching result between the position prediction result and the position detection result in the second video frame according to the association result.
Optionally, the matching module is configured to determine the difference between the position prediction result and the position detection result, perform the association in the appearance feature space according to the difference, and determine the matching result between the position prediction result and the position detection result in the second video frame according to the association result.
Optionally, the motion state of the target object includes at least one of: a tracked state, a transient loss state, a long-term loss state, and a disappearance state; where the tracked state indicates that the position prediction result of the target object is associated with the position detection result in the corresponding video frame; the transient loss state indicates that the position prediction result of the target object is not associated with the position detection result in the corresponding video frame; the long-term loss state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a first set number of consecutive video frames; the disappearance state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a second set number of consecutive video frames; and the first set number is less than the second set number.
Optionally, the motion state further comprises a generation state indicating that the target object appears in the first video frame for the first time.
Optionally, the determining module includes: an association submodule configured to mark the motion state of the target object as the tracked state in response to the matching result indicating that the position prediction result is associated with the position detection result.
Optionally, the determining module further includes: a non-association submodule configured to obtain the motion state prediction result of the target object in response to the matching result indicating that the position prediction result is not associated with the position detection result.
Optionally, the non-association sub-module is further configured to, after obtaining the motion state prediction result of the target object, mark the motion state of the target object as a transient loss state in response to the motion state prediction result indicating that the target object is in a tracked state.
Optionally, the non-associated sub-module is further configured to, in response to the motion state prediction result indicating that the target object is in a transient loss state, determine whether the number of times that the target object is continuously marked as the transient loss state reaches N-1 times, and if so, mark the motion state of the target object as a long-term loss state; if not, marking the motion state of the target object as a transient loss state; wherein N represents the first set number.
Optionally, the non-association submodule is further configured to, in response to the motion state prediction result indicating that the target object is in the long-term loss state, determine whether the number of times the target object has been consecutively marked as the long-term loss state reaches M-1, and if so, mark the motion state of the target object as the disappearance state; if not, mark the motion state of the target object as the long-term loss state; where M represents the second set number.
Optionally, the matching module is configured to match the position prediction result with a position detection result of the target object in the second video frame detected by the target detector.
Optionally, the apparatus further comprises: and the first operation module is used for identifying the action of the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the apparatus further comprises: and the second operation module is used for counting the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the apparatus further comprises: and the third operation module is used for counting the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result, and analyzing the flow of the target object according to the counting result.
Optionally, the apparatus further comprises: and the fourth operation module is used for detecting an abnormal target object according to the motion state and alarming the abnormal target object after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the apparatus further comprises: and the fifth operation module is used for recommending information to the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the first neural network is a recurrent neural network RNN, and/or the second neural network is a recurrent neural network RNN.
According to still another aspect of an embodiment of the present invention, there is also provided an electronic apparatus including: the system comprises a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element are communicated with each other through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the target object detection method.
According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing: executable instructions for predicting the motion state of at least one target object in a second video frame by using a first neural network according to the feature points of the target object in a first video frame to obtain a motion state prediction result, where the first video frame is a current video frame and the second video frame is a subsequent video frame of the current video frame, and for predicting the position of the target object in the second video frame by using a second neural network according to the feature points of the target object to obtain a position prediction result; executable instructions for matching the position prediction result with a position detection result of the corresponding target object in the second video frame; and executable instructions for determining the motion state of the target object according to the matching result and the motion state prediction result.
According to the target object detection scheme provided by the embodiments of the present invention, based on the image feature points of a target object in the current video frame, the motion state and the position of the target object in a subsequent video frame are predicted by a first neural network and a second neural network respectively. The actual motion state of the target object in that subsequent video frame is then determined from the comparison between the position prediction result and the position detection result in that frame, together with the motion state prediction result, and the detection result for the target object in the video frame can be determined from the actual motion state. In the scheme provided by the embodiments of the present invention, the motion state of the target object is represented as a state based on image feature points. Because the image feature points of different target objects differ considerably, if a target object disappears from the video, its feature-point-based predicted motion state is unlikely to resemble the motion states of other target objects. The scheme is therefore sensitive to the disappearance of a target object from the video, can effectively determine whether a target object has disappeared from a surveillance video, reduces target object tracking errors, and improves the accuracy of image detection.
Drawings
Fig. 1 is a flowchart illustrating steps of a method for detecting a target object according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a method for detecting a target object according to a second embodiment of the present invention;
fig. 3 is a block diagram of a target object detection apparatus according to a third embodiment of the present invention;
fig. 4 is a block diagram of a target object detection apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings (like numerals indicate like elements throughout the several views) and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present invention are used merely to distinguish one element, step, device, module, or the like from another element, and do not denote any particular technical or logical order therebetween.
Example one
Referring to fig. 1, a flowchart illustrating steps of a method for detecting a target object according to a first embodiment of the present invention is shown.
The detection method of the target object of the embodiment comprises the following steps:
step S102: and predicting the motion state of the target object in the second video frame by using the first neural network according to the characteristic point of at least one target object in the first video frame to obtain a motion state prediction result.
The first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame. The subsequent video frame may be a video frame next to the current video frame, or a video frame that is temporally subsequent and is not adjacent to the current video frame. Frame-by-frame prediction can be achieved by prediction of the next video frame adjacent to the current video frame; non-frame-by-frame prediction may be achieved by prediction of a subsequent video frame that is not adjacent to the current video frame.
In the embodiment of the present invention, the neural network used for motion state prediction and position prediction may be a CNN (Convolutional Neural Network) with a motion state prediction function and/or a position prediction function, or may be an RNN (Recurrent Neural Network). An RNN is an artificial neural network whose nodes are connected directionally into cycles, so that its internal state can exhibit dynamic temporal behavior; its essential characteristic is that the processing units have both internal feedback connections and feedforward connections. From a systems point of view, it is a feedback dynamical system that embodies the dynamics of the process in its computation, and it has stronger dynamic behavior and computational capability than a feedforward neural network. It can therefore be applied to the motion state prediction and position prediction of the target object in the embodiments of the present invention.
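For reference, the standard textbook recurrence of a simple RNN, which is what allows the prediction at frame t to depend on the whole history of earlier frames, can be written as (this is the generic form, not a formulation given by the invention):

```latex
h_t = \tanh\!\left(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\right), \qquad
y_t = W_{hy}\,h_t + b_y
```

where x_t is the per-frame input (here, the target object's feature representation), h_t is the hidden state carried from frame to frame, and y_t is the per-frame output (a motion state prediction for the first network, a position prediction for the second).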
Step S104: and predicting the position of the target object in the second video frame by using the second neural network according to the characteristic points of the target object to obtain a position prediction result.
In the embodiment of the invention, the motion state and the position of the target object in the subsequent video frame are predicted based on the characteristic point of the target object in the current video frame.
The prediction of the motion state of the target object by the first neural network and the prediction of the position of the target object by the second neural network can be executed in no sequence or in parallel.
The first neural network and the second neural network are both trained neural networks, where the first neural network has the function of predicting the motion state of the target object and the second neural network has the function of predicting the position of the target object. The first and second neural networks may be trained in any suitable manner; the embodiments of the present invention do not limit the training method. For example, for the first neural network the training samples are sequences of motion states, and for the second neural network the training samples are sequences of target positions. During training, the input to each network at each frame is the training-sample data of the current frame, and the output at each frame is the prediction for the next frame or another subsequent frame. During testing, the input to the first neural network at each frame is the motion state of the target object in the previous frame and the output is the predicted motion state for the next frame or another subsequent frame; the input to the second neural network at each frame is the position of the target object in the previous frame and the output is the predicted position of the target object in the next frame or another subsequent frame.
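As an illustration of this per-frame input/output contract, the following is a minimal sketch using PyTorch GRU cells. The architecture, feature dimension, and five-state output are assumptions made for illustration; the invention does not prescribe a specific network structure.

```python
# Minimal sketch of the two predictors (assumed PyTorch GRU-cell architecture;
# dimensions and the 5-state output are illustrative, not the patent's design).
import torch
import torch.nn as nn

NUM_STATES = 5  # e.g. generated, tracked, transient loss, long-term loss, disappeared


class StatePredictorRNN(nn.Module):
    """'First neural network': per-frame motion-state prediction."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, NUM_STATES)

    def forward(self, feat, h):
        h = self.cell(feat, h)          # recurrent update for the current frame
        return self.head(h), h          # logits over motion states for the next frame


class PositionPredictorRNN(nn.Module):
    """'Second neural network': per-frame position prediction."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, 4)  # (x, y, w, h) predicted for the next frame

    def forward(self, feat, h):
        h = self.cell(feat, h)
        return self.head(h), h


# Per-frame usage: feed the target's current-frame feature, keep the hidden states.
rnn1, rnn2 = StatePredictorRNN(), PositionPredictorRNN()
feat = torch.randn(1, 128)              # feature representation of one target, frame t
h1 = torch.zeros(1, 64)
h2 = torch.zeros(1, 64)
state_logits, h1 = rnn1(feat, h1)       # motion-state prediction for frame t+1
pos_pred, h2 = rnn2(feat, h2)           # position prediction for frame t+1
```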
Step S106: and matching the position prediction result with the position detection result of the corresponding target object in the second video frame.
The position detection result of the target object in the second video frame may be obtained in any suitable manner, including but not limited to a convolutional neural network CNN manner, a target detector manner, and the like.
Step S108: and determining the motion state of the target object according to the matching result and the motion state prediction result.
Because the position of the target object in the second video frame is predicted from the feature points of the target object in the first video frame, the prediction may or may not match the actual position detection result. The actual motion state of the target object therefore needs to be determined from the matching result, and this actual motion state can represent the detection result for the target object.
According to the target object detection method provided by this embodiment, based on the feature points of a target object in the current video frame, the motion state and the position of the target object in a subsequent video frame are predicted by the first neural network and the second neural network respectively. The actual motion state of the target object in that frame is then determined from the comparison between the position prediction result and the position detection result in that frame, together with the motion state prediction result, and the detection result for the target object in the video frame can be determined from the actual motion state. In the method provided by this embodiment, the motion state of the target object is represented as a state based on feature points. Because the feature points of different target objects differ considerably, if a target object disappears from the video, its feature-point-based predicted motion state is unlikely to resemble the motion states of other target objects. The method is therefore sensitive to the disappearance of a target object from the video, can effectively determine whether a target object has disappeared from a surveillance video, reduces target object tracking errors, and improves the accuracy of image detection.
It should be noted that the detection scheme of the target object in the embodiment of the present invention can be applied to the detection of a single target object, so as to improve the detection accuracy; the method can also be applied to multi-target object detection to reduce the missing detection and error rate of the multi-target object detection.
The detection method of the target object of the present embodiment may be performed by any suitable device having data processing capability, including but not limited to: mobile terminals, PCs, servers and other electronic devices having data processing capabilities.
Example two
Referring to fig. 2, a flowchart illustrating steps of a method for detecting a target object according to a second embodiment of the present invention is shown.
In this embodiment, both the first and second neural networks are RNNs, and the target object detection scheme of the present invention is described by taking as an example the case where the subsequent video frame is the video frame immediately following the current video frame. It should be clear to those skilled in the art that the target object detection scheme of the present invention can be implemented with reference to this embodiment for other subsequent video frames that are not adjacent to the current video frame.
The detection method of the target object of the embodiment comprises the following steps:
step S202: and acquiring a current video frame to be detected.
Step S204: and acquiring the characteristic points of the target object in the current video frame.
The target object may be any suitable movable object, including but not limited to: humans, animals, etc. The target object may be one or may include a plurality of objects.
The feature points of the target object may be obtained in any suitable manner, including but not limited to a CNN with a feature point extraction function, or a suitable target detector such as a FeatureDetector. With a target detector, the feature points of the target object can be acquired simply and effectively.
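For instance, if the target detector route is taken, one possible realization is to run an OpenCV keypoint detector inside the target's detected bounding box. The sketch below is only an illustration of this idea under assumed choices (ORB detector, (x1, y1, x2, y2) box format); it is not the specific detector used by the invention.

```python
# Hypothetical example: obtaining feature points of a detected target with an
# OpenCV ORB detector restricted to the target's bounding box.
import cv2
import numpy as np

def target_feature_points(frame_bgr, box, max_points=200):
    x1, y1, x2, y2 = box
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mask = np.zeros(gray.shape, dtype=np.uint8)
    mask[y1:y2, x1:x2] = 255                      # only detect inside the target region
    orb = cv2.ORB_create(nfeatures=max_points)
    keypoints, descriptors = orb.detectAndCompute(gray, mask)
    return keypoints, descriptors                 # feature points + their descriptors
```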
Step S206: and according to the characteristic point of the target object in the current video frame, predicting the motion state of the target object in the second video frame by using the first RNN to obtain a motion state prediction result, and predicting the position of the target object in the second video frame by using the second RNN to obtain a position prediction result.
In this embodiment, the motion state of the target object includes: a tracked state, a transient loss state, a long-term loss state, and a disappearance state. The tracked state indicates that the position prediction result of the target object is associated with the position detection result in the corresponding video frame, for example, the position prediction result of the target object for the next video frame is associated with the position detection result of the target object in the next video frame. The transient loss state indicates that the position prediction result of the target object is not associated with the position detection result in the corresponding video frame, for example, the position prediction result of the target object for the next video frame is not associated with the position detection result of the target object in the next video frame. The long-term loss state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a first set number of consecutive video frames, for example, the position prediction results of the target object for frames A1, A2 and A3, predicted in sequence by the second RNN, are all not associated with the position detection results of the target object in A1, A2 and A3. The disappearance state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a second set number of consecutive video frames, for example, the position prediction results of the target object for frames A1, A2, ..., A10, predicted in sequence by the second RNN, are all not associated with the position detection results of the target object in A1, A2, ..., A10. The first set number is less than the second set number.
Optionally, the motion state further comprises a generation state for indicating that the target object appears in the first video frame for the first time for state discrimination and labeling.
In this embodiment, whether two objects are associated can be determined by any suitable association algorithm. The input of the association algorithm is a similarity matrix and the output is the association result. The similarity matrix contains the degree of similarity between a target object in the first video frame and a target object in the second video frame, generally measured by the difference between information such as position information and appearance information. This is not limiting; in practical applications, other algorithms may also be used to determine whether two objects are associated, such as a bipartite graph matching algorithm or a K-nearest-neighbor association algorithm.
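A concrete instance of such an association algorithm is sketched below: build a similarity matrix from box overlap between predicted and detected positions and solve the assignment with the Hungarian algorithm. The IoU measure, the 0.3 threshold and SciPy's solver are illustrative assumptions, not requirements of the invention.

```python
# Sketch: similarity matrix + Hungarian assignment for prediction/detection
# association.  IoU-based similarity and the 0.3 threshold are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, min_similarity=0.3):
    """Return a list of (prediction index, detection index) pairs."""
    sim = np.array([[iou(p, d) for d in detected_boxes] for p in predicted_boxes])
    rows, cols = linear_sum_assignment(-sim)        # maximise total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_similarity]
```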
Step S208: and matching the position prediction result with the position detection result of the corresponding target object in the second video frame.
In one possible manner, the matching between the position prediction result and the position detection result may be implemented by determining whether the two are associated. In this manner, the position prediction result and the position detection result of the corresponding target object in the second video frame may be associated in the appearance feature space, and the matching result between the position prediction result and the position detection result in the second video frame may be determined according to the association result. Appearance features are visual feature information of the object inside the detection box, for example features represented by the color histogram of the image, or more complex, higher-precision image features extracted by a convolutional neural network. Of course, in practical applications, other ways of obtaining appearance features are also applicable. The appearance feature space contains the appearance feature information of the target object, and in this embodiment this information also includes the position of the target object. The association in the appearance feature space may be performed by determining the difference between the position prediction result and the position detection result and performing the association in the appearance feature space according to the difference, using the association manner described in step S206, which is not repeated here.
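To make the appearance feature concrete, one simple assumed descriptor is a per-channel color histogram of the image patch inside the detection box, compared by Euclidean distance; a CNN embedding, as noted above, would be a higher-precision alternative. The bin count and distance below are illustrative choices only.

```python
# Sketch of an appearance feature and its distance, using a per-channel color
# histogram of the detection patch.  Bin count and distance are assumptions.
import numpy as np

def appearance_descriptor(patch_bgr, bins=8):
    """patch_bgr: H x W x 3 uint8 crop of the detection box."""
    hists = [np.histogram(patch_bgr[..., c], bins=bins, range=(0, 256),
                          density=True)[0] for c in range(3)]
    return np.concatenate(hists)                 # length 3 * bins feature vector

def appearance_distance(desc_a, desc_b):
    """Smaller distance = more similar appearance (used to build the similarity)."""
    return float(np.linalg.norm(desc_a - desc_b))
```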
The position detection result of the target object in the second video frame may be obtained in any suitable manner, including but not limited to: by a trained CNN with position acquisition function, or by a target detector, etc.
Step S210: and determining the motion state of the target object according to the matching result and the motion state prediction result.
For example, if the matching result indicates that the position prediction result is associated with the position detection result, the motion state of the target object is marked as the tracked state.
For another example, if the matching result indicates that the position prediction result is not associated with the position detection result, the motion state prediction result of the target object is obtained. If the motion state prediction result indicates that the target object is in the tracked state, the motion state of the target object is marked as the transient loss state. If the motion state prediction result indicates that the target object is in the transient loss state, it is judged whether the number of times the target object has been consecutively marked as the transient loss state reaches N-1; if so, the motion state of the target object is marked as the long-term loss state; if not, the motion state of the target object is marked as the transient loss state, where N represents the first set number. If the motion state prediction result indicates that the target object is in the long-term loss state, it is judged whether the number of times the target object has been consecutively marked as the long-term loss state reaches M-1; if so, the motion state of the target object is marked as the disappearance state; if not, the motion state of the target object is marked as the long-term loss state, where M represents the second set number. N and M are integers, with N less than M, N greater than or equal to 3, and M greater than or equal to 4. In a preferred embodiment, N is 3 and M is 10.
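The branching just described can be summarized as a small state-transition function. The sketch below uses the preferred values N = 3 and M = 10 and an assumed counter of how many consecutive frames the target has already carried its current loss state; the enum names and counter are illustrative, not the invention's required data layout.

```python
# State-transition sketch for one target in one frame, following the rules above.
from enum import Enum

class MotionState(Enum):
    GENERATED = 0        # target appears for the first time
    TRACKED = 1
    TRANSIENT_LOSS = 2
    LONG_TERM_LOSS = 3
    DISAPPEARED = 4

N, M = 3, 10             # first and second set numbers (preferred embodiment)

def update_state(predicted_state, matched, times_in_state):
    """predicted_state: the motion state prediction for this frame;
    matched: whether the position prediction is associated with a detection;
    times_in_state: consecutive frames already marked with predicted_state."""
    if matched:
        return MotionState.TRACKED
    if predicted_state == MotionState.TRACKED:
        return MotionState.TRANSIENT_LOSS
    if predicted_state == MotionState.TRANSIENT_LOSS:
        # long-term loss once the target has been marked transient loss N-1 times
        return (MotionState.LONG_TERM_LOSS if times_in_state >= N - 1
                else MotionState.TRANSIENT_LOSS)
    if predicted_state == MotionState.LONG_TERM_LOSS:
        # disappearance once the target has been marked long-term loss M-1 times
        return (MotionState.DISAPPEARED if times_in_state >= M - 1
                else MotionState.LONG_TERM_LOSS)
    return predicted_state
```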
Step S212: and determining the operation on the target object according to the motion state of the target object.
Wherein the operation on the target object includes but is not limited to at least one of the following:
operation one: and determining a detection result of the target object according to the motion state of the target object.
If the motion state of the target object is determined, the detection result of the target object can be determined according to the motion state.
For example, if the motion state of the target object is the tracked state, it can be determined that the target object to be detected appears in two consecutive video frames and has not disappeared; if the motion state of the target object is the transient loss state, it can be determined that the target object to be detected has briefly disappeared within the sequence of consecutive video frames; if the motion state of the target object is the long-term loss state, it can be determined that the target object to be detected has disappeared for a long time within the sequence of consecutive video frames; if the motion state of the target object is the disappearance state, it can be determined that the target object to be detected has completely disappeared from the sequence of consecutive video frames.
And operation II: and identifying the action of the target object according to the motion state of the target object.
For example, if the motion state of the target object remains the tracked state, the basic motion trajectory of the target object can be further obtained in a suitable manner according to its motion state, and the action of the target object can be identified.
Operation three: and counting the target objects according to the motion states of the target objects.
For example, when there are multiple target objects, the target objects may be counted according to their motion states, such as counting the target objects in the tracked state, or counting the target objects in the transient loss state or the long-term loss state.
And operation four: and counting the target object according to the motion state of the target object, and analyzing the flow of the target object according to the counting result.
After the number of the target objects is obtained, that is, after the counting result is obtained, the flow analysis of the target objects can be performed according to the counting result.
And operation five: and detecting an abnormal target object according to the motion state of the target object, and alarming the abnormal target object.
For example, in the process of monitoring a certain target object, when the target object is found to be in a transient loss state, a long-term loss state, or a disappearance state, an abnormal alarm corresponding to the motion state may be performed.
And operation six: and recommending information to the target object according to the motion state of the target object.
For example, the motion trajectory of a target object in the tracked state may be analyzed and information recommended according to the analysis result; if the target object frequently drives a vehicle, information about relevant vehicles can be recommended.
Hereinafter, a detection process of the target object in the present embodiment will be described with a specific example.
The detection process of the target object in this example includes: training two recurrent neural networks (RNN1 and RNN2) so that RNN1 can predict the future motion state of the target object and RNN2 can predict the future position of the target object; using RNN1 and RNN2 to predict, respectively, the motion state and the position in frame t of each target object existing in frame t-1; matching the position prediction results with the currently observed position detection results in frame t, where a FeatureDetector may be adopted and the position and feature points of the target object can be obtained from the detection response; performing association in the appearance feature space between a target object A1 currently labeled as the tracked state, a target object A2 in the long-term loss state, and the corresponding currently observed actual results: if A1 or A2 is associated with an actual result, the associated target object is labeled as the tracked state; if A1 is not associated with any actual result, A1 is labeled as the transient loss state; if A2 is not associated with any actual result, A2 is labeled as the long-term loss state; marking a target object that has been labeled as the long-term loss state and lost for more than 10 frames as the disappearance state; continuing to detect target objects in subsequent video frames, using RNN1 and RNN2 to predict, respectively, the motion state and the position in frame t+1 of each target object existing in frame t; and iterating in this way until the video frame sequence ends.
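Putting the pieces of this example together, one per-frame update could look like the sketch below. It assumes the helper functions sketched earlier in this section (prediction, association, state update) and an illustrative track-record layout; it is only a sketch of the flow, not the invention's reference implementation.

```python
# Per-frame tracking step for the example above.  `predict_state`, `predict_box`,
# `associate` and `update_state` stand for the pieces sketched earlier; the track
# dictionary layout is an assumption for illustration.
def run_frame(tracks, detections, predict_state, predict_box, associate, update_state):
    # 1. Predict each existing target's state and position for the current frame.
    for trk in tracks:
        trk['pred_state'] = predict_state(trk)
        trk['pred_box'] = predict_box(trk)

    # 2. Match predicted positions against detected positions in this frame.
    matches = associate([t['pred_box'] for t in tracks],
                        [d['box'] for d in detections])
    matched_rows = {row for row, _ in matches}

    # 3. Label each target's actual motion state from the matching result.
    for i, trk in enumerate(tracks):
        new_state = update_state(trk['pred_state'], i in matched_rows,
                                 trk.get('times_in_state', 0))
        trk['times_in_state'] = (trk.get('times_in_state', 0) + 1
                                 if new_state == trk.get('state') else 1)
        trk['state'] = new_state

    # 4. Targets marked as disappeared can be dropped; unmatched detections
    #    would start new tracks in the 'generated' state (omitted here).
    return tracks
```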
Through the embodiment, the judgment on whether the target object disappears in the monitored video can be effectively realized, the occurrence of errors of multi-target object tracking is reduced, and the accuracy of image detection is improved.
The detection method of the target object of the present embodiment may be performed by any suitable device having data processing capability, including but not limited to: mobile terminals, PCs, etc.
EXAMPLE III
Referring to fig. 3, a block diagram of a target object detection apparatus according to a third embodiment of the present invention is shown.
The detection device of the target object of the present embodiment includes: the prediction module 302 is configured to predict a motion state of a target object in a second video frame by using a first neural network according to a feature point of at least one target object in a first video frame, to obtain a motion state prediction result, where the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in the second video frame by using a second neural network according to the characteristic points of the target object to obtain a position prediction result; a matching module 304, configured to match the position prediction result with a position detection result of a corresponding target object in the second video frame; and a determining module 306, configured to determine a motion state of the target object according to the matching result and the motion state prediction result.
According to the target object detection apparatus provided by this embodiment, based on the image feature points of a target object in the current video frame, the motion state and the position of the target object in a subsequent video frame are predicted by the first neural network and the second neural network respectively. The actual motion state of the target object in that frame is then determined from the comparison between the position prediction result and the position detection result in that frame, together with the motion state prediction result, and the detection result for the target object in the video frame can be determined from the actual motion state. In this embodiment, the motion state of the target object is represented as a state based on image feature points. Because the image feature points of different target objects differ considerably, if a target object disappears from the video, its feature-point-based predicted motion state is unlikely to resemble the motion states of other target objects. The apparatus is therefore sensitive to the disappearance of a target object from the video, can effectively determine whether a target object has disappeared from a surveillance video, reduces target object tracking errors, and improves the accuracy of image detection.
Example four
Referring to fig. 4, a block diagram of a target object detection apparatus according to a fourth embodiment of the present invention is shown.
The detection device of the target object of the present embodiment includes: a prediction module 402, configured to predict, according to a feature point of at least one target object in a first video frame, a motion state of the target object in a second video frame by using a first neural network, to obtain a motion state prediction result, where the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in the second video frame by using a second neural network according to the characteristic points of the target object to obtain a position prediction result; a matching module 404, configured to match the position prediction result with a position detection result of a corresponding target object in the second video frame; and a determining module 406, configured to determine a motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the matching module 404 is configured to associate the position prediction result with the position detection result of the corresponding target object in the second video frame in the appearance feature space, and determine the matching result between the position prediction result and the position detection result in the second video frame according to the association result.
Optionally, the matching module 404 is configured to determine the difference between the position prediction result and the position detection result, perform the association in the appearance feature space according to the difference, and determine the matching result between the position prediction result and the position detection result in the second video frame according to the association result.
Optionally, the motion state of the target object includes at least one of: a tracked state, a transient loss state, a long-term loss state, and a disappearance state; where the tracked state indicates that the position prediction result of the target object is associated with the position detection result in the corresponding video frame; the transient loss state indicates that the position prediction result of the target object is not associated with the position detection result in the corresponding video frame; the long-term loss state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a first set number of consecutive video frames; the disappearance state indicates that none of the position prediction results of the target object is associated with the corresponding position detection result in a second set number of consecutive video frames; and the first set number is less than the second set number.
Optionally, the motion state further comprises a generation state for indicating that the target object is first present in the first video frame.
Optionally, the determining module 406 includes: the associating sub-module 4062 is configured to label the motion state of the target object as the tracked state in response to the matching result indicating that the position prediction result is associated with the position detection result.
Optionally, the determining module 406 further includes: the non-association submodule 4064 is configured to, in response to the matching result indicating that the position prediction result is not associated with the position detection result, obtain a motion state prediction result of the target object.
Optionally, the non-associating sub-module 4064 is further configured to, after obtaining the motion state prediction result of the target object, mark the motion state of the target object as a transient loss state in response to the motion state prediction result indicating that the target object is in a tracked state.
Optionally, the non-association submodule 4064 is further configured to, in response to the motion state prediction result indicating that the target object is in the transient loss state, determine whether the number of times the target object has been consecutively marked as the transient loss state reaches N-1, and if so, mark the motion state of the target object as the long-term loss state; if not, mark the motion state of the target object as the transient loss state; where N represents the first set number.
Optionally, the non-association submodule 4064 is further configured to, in response to the motion state prediction result indicating that the target object is in the long-term loss state, determine whether the number of times the target object has been consecutively marked as the long-term loss state reaches M-1, and if so, mark the motion state of the target object as the disappearance state; if not, mark the motion state of the target object as the long-term loss state; where M represents the second set number.
Optionally, the matching module 404 is configured to match the position prediction result with a position detection result of a corresponding target object in the second video frame detected by the target detector.
Optionally, the detection apparatus of the target object of this embodiment further includes: a first operation module 408, configured to identify an action of the target object according to the motion state after the determination module 406 determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the detection apparatus of the target object of this embodiment further includes: a second operation module 410, configured to count the target object according to the motion state after the determination module 406 determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the detection apparatus of the target object of this embodiment further includes: and a third operation module 412, configured to count the target object according to the motion state after the determination module 406 determines the motion state of the target object according to the matching result and the motion state prediction result, and perform flow analysis on the target object according to the count result.
Optionally, the detection apparatus of the target object of this embodiment further includes: a fourth operation module 414, configured to detect an abnormal target object according to the motion state and alarm the abnormal target object after the determination module 406 determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the detection apparatus of the target object of this embodiment further includes: a fifth operation module 416, configured to perform information recommendation on the target object according to the motion state after the determination module 406 determines the motion state of the target object according to the matching result and the motion state prediction result.
Optionally, the first neural network is an RNN, and/or the second neural network is an RNN.
The target object detection apparatus of this embodiment is used to implement the corresponding target object detection methods in the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
EXAMPLE five
The fifth embodiment of the present invention provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to fig. 5, there is shown a schematic structural diagram of an electronic device 500 suitable for use as a terminal device or server for implementing embodiments of the present invention. As shown in fig. 5, the electronic device 500 includes one or more processors, communication elements, and the like, for example: one or more central processing units (CPUs) 501 and/or one or more graphics processing units (GPUs) 513, which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 502 or loaded from a storage section 508 into a random access memory (RAM) 503. The communication elements include a communication component 512 and/or a communication interface 509. The communication component 512 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 509 includes a network interface card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.
The processor may communicate with the read-only memory 502 and/or the random access memory 503 to execute the executable instructions, connect with the communication component 512 through the communication bus 504, and communicate with other target devices through the communication component 512, so as to complete the operation corresponding to any one of the methods for detecting a target object provided by the embodiments of the present invention, for example, predict a motion state of the target object in a second video frame by using a first neural network according to a feature point of at least one target object in a first video frame, and obtain a motion state prediction result, where the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in the second video frame by using a second neural network according to the characteristic points of the target object to obtain a position prediction result; matching the position prediction result with a position detection result of a corresponding target object in the second video frame; and determining the motion state of the target object according to the matching result and the motion state prediction result.
In addition, the RAM 503 may also store various programs and data necessary for the operation of the apparatus. The CPU 501 or GPU 513, the ROM 502, and the RAM 503 are connected to each other through a communication bus 504. In the case where the RAM 503 is present, the ROM 502 is an optional module. The RAM 503 stores executable instructions, or executable instructions are written into the ROM 502 at runtime, and the executable instructions cause the processor to perform the operations corresponding to the above-described method. An input/output (I/O) interface 505 is also connected to the communication bus 504. The communication component 512 may be integrated, or may be configured with multiple sub-modules (e.g., multiple IB cards) linked over the communication bus.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage portion 508 including a hard disk and the like; and a communication interface 509 including a network interface card such as a LAN card, a modem, or the like. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.
It should be noted that the architecture shown in fig. 5 is only an optional implementation. In specific practice, the number and types of the components in fig. 5 may be selected, deleted, added, or replaced according to actual needs; for different functional component arrangements, separate or integrated arrangements may also be used; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication element may be arranged separately or integrated on the CPU or the GPU, and so on. These alternative embodiments are all within the scope of the present invention.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present invention includes a computer program product including a computer program tangibly embodied on a machine-readable medium, where the computer program includes program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided in the embodiment of the present invention, for example, predicting a motion state of a target object in a second video frame by using a first neural network according to a feature point of at least one target object in the first video frame, and obtaining a motion state prediction result, where the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in the second video frame by using a second neural network according to the characteristic points of the target object to obtain a position prediction result; matching the position prediction result with a position detection result of a corresponding target object in the second video frame; and determining the motion state of the target object according to the matching result and the motion state prediction result. In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 511. When the computer program is executed by the processor, the functions defined above in the method of the embodiment of the present invention are performed.
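As a further illustration of the "determining the motion state" step recited above (and detailed in claims 1, 5 and 6 below), the following sketch realizes it as a small per-object state machine. The state names follow the claims; the counting convention and the example values of the first set number N and the second set number M are assumptions of this sketch.

```python
# Illustrative sketch only. State names follow the claims; the counting
# convention and the example values of N and M are assumptions, not values
# disclosed by the patent.
TRACKED, TRANSIENT_LOSS, LONG_TERM_LOSS, DISAPPEARED = (
    "tracked", "transient_loss", "long_term_loss", "disappeared")

class TrackState:
    def __init__(self, first_set_number=5, second_set_number=30):
        self.n = first_set_number    # N: consecutive misses before the long-term loss state
        self.m = second_set_number   # M: consecutive misses before the disappearance state
        self.state = TRACKED
        self.missed = 0              # consecutive frames with no associated detection

    def update(self, matched: bool) -> str:
        if matched:                  # position prediction associated with a detection
            self.state, self.missed = TRACKED, 0
            return self.state
        self.missed += 1
        if self.missed < self.n:
            self.state = TRANSIENT_LOSS
        elif self.missed < self.m:
            self.state = LONG_TERM_LOSS
        else:
            self.state = DISAPPEARED  # the object may then be removed from tracking
        return self.state

# Example: an object is matched twice, then never again.
track = TrackState(first_set_number=3, second_set_number=6)
for matched in [True, True, False, False, False, False, False, False]:
    print(track.update(matched))
# tracked, tracked, transient_loss, transient_loss, long_term_loss,
# long_term_loss, long_term_loss, disappeared
```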
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.
The above-described method according to an embodiment of the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the method described herein may be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The above embodiments are only for illustrating the embodiments of the present invention and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

Claims (27)

1. A method of detecting a target object, comprising:
predicting the motion state of at least one target object in a second video frame by using a first neural network according to the characteristic point of the target object in a first video frame to obtain a motion state prediction result, wherein the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame;
predicting the position of the target object in a second video frame by using a second neural network according to the characteristic point of the target object to obtain a position prediction result;
matching the position prediction result with a position detection result of the corresponding target object in a second video frame;
determining the motion state of the target object according to the matching result and the motion state prediction result, wherein the motion state of the target object comprises at least one of the following: tracked state, transient loss state, long-term loss state, and vanishing state; the tracked state is used for indicating that the position prediction result of the target object is associated with the position detection result in the corresponding video frame; the transient loss state is used for indicating that the prediction result of the target object is not associated with the position detection result in the corresponding video frame; the long-term loss state is used for indicating that all position prediction results are not associated with corresponding position detection results in a first set number of video frame sequences of the target object; the disappearance state is used for indicating that all position prediction results are not associated with corresponding position detection results in a second set number of video frame sequences of the target object; wherein the first set number is less than the second set number,
wherein the determining the motion state of the target object according to the matching result and the motion state prediction result comprises:
in response to the matching result indicating that the position prediction result is associated with the position detection result, labeling a motion state of the target object as a tracked state;
wherein the determining the motion state of the target object according to the matching result and the motion state prediction result further comprises:
in response to the matching result indicating that the position prediction result is not associated with the position detection result, obtaining a motion state prediction result of the target object;
in response to the motion state prediction result indicating that the target object is in a tracked state, marking the motion state of the target object as a transient loss state.
2. The method of claim 1, wherein matching the position prediction result with a position detection result of the corresponding target object in a second video frame comprises:
and performing association in an expression feature space between the position prediction result and a position detection result of the corresponding target object in the second video frame, and determining a matching result between the position prediction result and the position detection result in the second video frame according to the association result.
3. The method of claim 2, wherein correlating the position prediction results with position detection results of the corresponding target object in the second video frame in a representational feature space comprises:
and determining the difference between the position prediction result and the position detection result, and performing correlation in the expression feature space according to the difference.
4. The method of any of claims 1-3, wherein the motion state further comprises a generation state indicating that a target object first appears in a first video frame.
5. The method of claim 4, wherein after the obtaining the motion state prediction of the target object, the method further comprises:
in response to the motion state prediction result indicating that the target object is in a transient loss state, determining whether the number of times the target object has been continuously marked as the transient loss state reaches N-1 times; if so, marking the motion state of the target object as a long-term loss state; if not, marking the motion state of the target object as a transient loss state; wherein N represents the first set number.
6. The method of claim 5, wherein the method further comprises: in response to the motion state prediction result indicating that the target object is in a long-term loss state, determining whether the number of times the target object has been continuously marked as the long-term loss state reaches M-1 times; if so, marking the motion state of the target object as a disappearance state; if not, marking the motion state of the target object as a long-term loss state; wherein M represents the second set number.
7. The method of any of claims 1-3, wherein matching the position prediction result with a position detection result of the corresponding target object in a second video frame comprises:
and matching the position prediction result with a position detection result of the corresponding target object in the second video frame detected by the target detector.
8. The method according to any one of claims 1-3, wherein after said determining the motion state of the target object based on the matching result and the motion state prediction result, the method further comprises:
and identifying the action of the target object according to the motion state.
9. The method according to any one of claims 1-3, wherein after said determining the motion state of the target object based on the matching result and the motion state prediction result, the method further comprises:
and counting the target objects according to the motion state.
10. The method according to any one of claims 1-3, wherein after said determining the motion state of the target object based on the matching result and the motion state prediction result, the method further comprises:
and counting the target object according to the motion state, and analyzing the flow of the target object according to a counting result.
11. The method according to any one of claims 1-3, wherein after said determining the motion state of the target object based on the matching result and the motion state prediction result, the method further comprises:
and detecting an abnormal target object according to the motion state, and alarming the abnormal target object.
12. The method according to any one of claims 1-3, wherein after said determining the motion state of the target object based on the matching result and the motion state prediction result, the method further comprises:
and recommending information to the target object according to the motion state.
13. The method according to any of claims 1-3, wherein the first neural network is a Recurrent Neural Network (RNN) and/or the second neural network is a Recurrent Neural Network (RNN).
14. A target object detection apparatus comprising:
the prediction module is used for predicting the motion state of at least one target object in a second video frame by using a first neural network according to the feature point of the target object in a first video frame to obtain a motion state prediction result, wherein the first video frame is a current video frame, and the second video frame is a subsequent video frame of the current video frame; predicting the position of the target object in a second video frame by using a second neural network according to the characteristic point of the target object to obtain a position prediction result;
the matching module is used for matching the position prediction result with a position detection result of the corresponding target object in a second video frame;
a determining module, configured to determine a motion state of the target object according to the matching result and the motion state prediction result, where the motion state of the target object includes at least one of: tracked state, transient loss state, long-term loss state, and vanishing state; the tracked state is used for indicating that the position prediction result of the target object is associated with the position detection result in the corresponding video frame; the transient loss state is used for indicating that the prediction result of the target object is not associated with the position detection result in the corresponding video frame; the long-term loss state is used for indicating that all position prediction results are not associated with corresponding position detection results in a first set number of video frame sequences of the target object; the disappearance state is used for indicating that all position prediction results are not associated with corresponding position detection results in a second set number of video frame sequences of the target object; wherein the first set number is less than the second set number,
wherein the determining module comprises:
a correlation submodule for labeling a motion state of the target object as a tracked state in response to the matching result indicating that the position prediction result is correlated with the position detection result,
a non-correlation submodule, configured to obtain a motion state prediction result of the target object in response to the matching result indicating that the position prediction result is not correlated with the position detection result; and the non-correlation sub-module is further used for marking the motion state of the target object as a transient loss state in response to the motion state prediction result indicating that the target object is in a tracked state after the motion state prediction result of the target object is obtained.
15. The apparatus according to claim 14, wherein the matching module is configured to perform an association in a representation feature space between the position prediction result and a position detection result of the corresponding target object in the second video frame, and determine a matching result between the position prediction result and the position detection result in the second video frame according to the association result.
16. The apparatus of claim 15, wherein the matching module is configured to determine a difference between the location prediction result and the location detection result, and perform the correlation in the representational feature space based on the difference; and determining a matching result between the position prediction result and the position detection result in the second video frame according to the correlation result.
17. The apparatus of any of claims 14-16, wherein the motion state further comprises a generation state indicating that a target object first appears in a first video frame.
18. The apparatus according to claim 17, wherein the non-association sub-module is further configured to, in response to the motion state prediction result indicating that the target object is in a transient loss state, determine whether the number of times that the target object is continuously labeled as the transient loss state reaches N-1 times, and if so, label the motion state of the target object as a long-term loss state; if not, marking the motion state of the target object as a transient loss state; wherein N represents the first set number.
19. The apparatus according to claim 18, wherein the non-association sub-module is further configured to, in response to the motion state prediction result indicating that the target object is in a long-term loss state, determine whether the number of times that the target object is continuously labeled as the long-term loss state reaches M-1 times, and if so, label the motion state of the target object as a disappearance state; if not, label the motion state of the target object as a long-term loss state; wherein M represents the second set number.
20. The apparatus according to any of claims 14-16, wherein the matching module is configured to match the position prediction result with a position detection result of the corresponding target object in the second video frame detected by the target detector.
21. The apparatus of any of claims 14-16, wherein the apparatus further comprises:
and the first operation module is used for identifying the action of the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
22. The apparatus of any of claims 14-16, wherein the apparatus further comprises:
a second operation module, configured to count the target objects according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
23. The apparatus of any of claims 14-16, wherein the apparatus further comprises:
and the third operation module is used for counting the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result, and analyzing the flow of the target object according to the counting result.
24. The apparatus of any of claims 14-16, wherein the apparatus further comprises:
and the fourth operation module is used for detecting an abnormal target object according to the motion state and alarming the abnormal target object after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
25. The apparatus of any of claims 14-16, wherein the apparatus further comprises:
and the fifth operation module is used for recommending information to the target object according to the motion state after the determination module determines the motion state of the target object according to the matching result and the motion state prediction result.
26. The apparatus of any one of claims 14-16, wherein the first neural network is a Recurrent Neural Network (RNN) and/or the second neural network is a Recurrent Neural Network (RNN).
27. An electronic device, comprising: the system comprises a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element are communicated with each other through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform a method for detecting a target object according to any one of claims 1-13.
CN201710059806.1A 2017-01-24 2017-01-24 Target object detection method and device and electronic equipment Active CN108230352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710059806.1A CN108230352B (en) 2017-01-24 2017-01-24 Target object detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710059806.1A CN108230352B (en) 2017-01-24 2017-01-24 Target object detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108230352A CN108230352A (en) 2018-06-29
CN108230352B true CN108230352B (en) 2021-02-26

Family

ID=62656467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710059806.1A Active CN108230352B (en) 2017-01-24 2017-01-24 Target object detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108230352B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837766B (en) * 2018-08-17 2023-05-05 北京市商汤科技开发有限公司 Gesture recognition method, gesture processing method and device
CN110009662B (en) * 2019-04-02 2021-09-17 北京迈格威科技有限公司 Face tracking method and device, electronic equipment and computer readable storage medium
CN110298306B (en) * 2019-06-27 2022-08-05 北京百度网讯科技有限公司 Method, device and equipment for determining motion information of target object
CN110414443A (en) * 2019-07-31 2019-11-05 苏州市科远软件技术开发有限公司 A kind of method for tracking target, device and rifle ball link tracking
CN111652043A (en) * 2020-04-15 2020-09-11 北京三快在线科技有限公司 Object state identification method and device, image acquisition equipment and storage medium
CN111479061B (en) * 2020-04-15 2021-07-30 上海摩象网络科技有限公司 Tracking state determination method and device and handheld camera
CN112257587B (en) * 2020-10-22 2023-12-22 无锡禹空间智能科技有限公司 Target object detection effect evaluation method, device, storage medium and equipment
CN113095183A (en) * 2021-03-31 2021-07-09 西北工业大学 Micro-expression detection method based on deep neural network
CN114241011A (en) * 2022-02-22 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Target detection method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635835A (en) * 2008-07-25 2010-01-27 深圳市信义科技有限公司 Intelligent video monitoring method and system thereof
CN103077539A (en) * 2013-01-23 2013-05-01 上海交通大学 Moving object tracking method under complicated background and sheltering condition
CN103472445A (en) * 2013-09-18 2013-12-25 电子科技大学 Detecting tracking integrated method for multi-target scene

Also Published As

Publication number Publication date
CN108230352A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108230352B (en) Target object detection method and device and electronic equipment
US9767570B2 (en) Systems and methods for computer vision background estimation using foreground-aware statistical models
CN108133172B (en) Method for classifying moving objects in video and method and device for analyzing traffic flow
CN108230357B (en) Key point detection method and device, storage medium and electronic equipment
US10325160B2 (en) Movement state estimation device, movement state estimation method and program recording medium
US9811755B2 (en) Object monitoring system, object monitoring method, and monitoring target extraction program
EP2352128B1 (en) Mobile body detection method and mobile body detection apparatus
US10853949B2 (en) Image processing device
CN110909712B (en) Moving object detection method and device, electronic equipment and storage medium
CN110930434B (en) Target object following method, device, storage medium and computer equipment
US11170512B2 (en) Image processing apparatus and method, and image processing system
CN111209774A (en) Target behavior recognition and display method, device, equipment and readable medium
KR20180138558A (en) Image Analysis Method and Server Apparatus for Detecting Object
CN104219488A (en) Method and device of generating target image as well as video monitoring system
CN110969645A (en) Unsupervised abnormal track detection method and unsupervised abnormal track detection device for crowded scenes
KR101214858B1 (en) Moving object detecting apparatus and method using clustering
CN113052019A (en) Target tracking method and device, intelligent equipment and computer storage medium
US10916016B2 (en) Image processing apparatus and method and monitoring system
Walsh et al. Detecting tracking failures from correlation response maps
CN112784691B (en) Target detection model training method, target detection method and device
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
Mahmood et al. Anomaly event detection and localization of video clips using global and local outliers
CN115294172A (en) Target detection method and device, electronic equipment and storage medium
Reddy et al. MRF-based background initialisation for improved foreground detection in cluttered surveillance videos
Aribilola et al. Afom: Advanced flow of motion detection algorithm for dynamic camera videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant