CN112634329B - Scene target activity prediction method and device based on space-time and or graph - Google Patents

Info

Publication number
CN112634329B
CN112634329B (application CN202011570370.0A)
Authority
CN
China
Prior art keywords
target
activity
scene
sub
targets
Prior art date
Legal status
Active
Application number
CN202011570370.0A
Other languages
Chinese (zh)
Other versions
CN112634329A
Inventor
吴炜
蒋成亮
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202011570370.0A priority Critical patent/CN112634329B/en
Publication of CN112634329A publication Critical patent/CN112634329A/en
Application granted granted Critical
Publication of CN112634329B publication Critical patent/CN112634329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene target activity prediction method and device based on a space-time and or graph, together with an electronic device and a storage medium. The method comprises the following steps: acquiring a scene video of a preset scene; detecting and tracking targets in the scene video to generate a space and or graph model of the preset scene, the space and or graph model characterizing the spatial position relationships of the targets in the scene video; applying a sub-activity extraction algorithm to the space and or graph model to obtain a sub-activity label set representing the activity state of each target of interest; and inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of the future activity of the targets of interest in the preset scene, the time and or graph model being obtained by using a pre-established activity corpus of targets of the preset scene. The method introduces the space-time and or graph into the field of target activity prediction for the first time, and can effectively predict the activity of targets in the preset scene.

Description

Scene target activity prediction method and device based on space-time and or graph
Technical Field
The invention belongs to the field of prediction, and particularly relates to a scene target activity prediction method and device based on space-time and or graph.
Background
Activity backgrounds are increasingly complex due to the non-Markovian activities of targets such as humans. In order to ensure safety, the behavior of targets in various scenarios is typically monitored with video monitoring devices.
The current behavior of a target can be obtained by detecting and analyzing the monitoring video. However, such detection is after-the-fact: the activity of the target at a future moment cannot be predicted, so no corresponding response can be given in some scenes according to the target's future activity, and the occurrence of safety events such as car accidents and theft cannot be avoided in time.
Therefore, how to effectively predict the activity of the target in the scene is a problem to be solved in the art.
Disclosure of Invention
The embodiment of the invention aims to provide a scene target activity prediction method, device, electronic equipment and storage medium based on space-time sum or graph, so as to realize the aim of effectively predicting the activity of a target in a scene. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a scene target activity prediction method based on space-time and or graph, where the method includes:
Acquiring a scene video aiming at a preset scene;
generating a space and or graph model of the preset scene by detecting and tracking targets in the scene video; the space and or graph model characterizes the space position relation of the target in the scene video;
obtaining a sub-activity label set representing the activity state of the target of interest by using a sub-activity extraction algorithm for the space and/or graph model;
inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of future activity of the target of interest in the preset scene; the time and or graph model is obtained by utilizing a pre-established active corpus of targets of the preset scene.
Optionally, the generating the spatial and/or graphical model of the preset scene by detecting and tracking the target in the scene video includes:
detecting targets in the scene video by utilizing a target detection network obtained through pre-training to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the object;
Based on the attribute information corresponding to each target in each frame of image, matching the same targets in each frame of image of the scene video by using a preset multi-target tracking algorithm;
determining the actual spatial distance between different targets in each frame of image;
and generating a space and or graph model of the preset scene by using the attribute information of the target corresponding to each frame of matched image and the actual space distance.
Optionally, the object detection network comprises a yolo_v3 network; the preset multi-target tracking algorithm comprises a DeepSort algorithm.
Optionally, the determining the actual spatial distance between different targets in each frame of image includes:
in each frame of image, determining the pixel coordinates of each target;
calculating the corresponding actual coordinates of the pixel coordinates of each target in a world coordinate system by using a monocular vision positioning ranging technology;
and for each frame image, obtaining the actual space distance between each two targets in the frame image by using the actual coordinates of each two targets in the frame image.
Optionally, the obtaining, by using a sub-activity extraction algorithm for the spatial and/or graph model, a sub-activity tag set that characterizes an activity state of the object of interest includes:
Determining paired targets with the actual space distance smaller than a preset distance threshold value in the space sum or graph model as concerned targets;
for each frame of image, determining the actual space distance of each pair of the concerned targets and the speed value of each concerned target;
obtaining distance change information representing the actual spatial distance change condition of each pair of concerned targets and speed change information representing the speed value change condition of each concerned target by sequentially comparing the next frame image and the previous frame image;
and describing the distance change information and the speed change information which are sequentially obtained by each concerned object by using semantic tags, and generating a sub-activity tag set for representing the activity state of each concerned object.
Optionally, the construction process of the time and or graph model includes:
observing a sample scene video of the preset scene, extracting corpus of various events about a target in the sample scene video, and establishing an active corpus of the target of the preset scene;
for an active corpus of targets of the preset scene, learning a symbol grammar structure of each event by using an grammar induction algorithm based on ADIOS, and taking sub-activities as terminal nodes of a time and or graph to obtain the time and or graph model; the activity state of the target is represented by sub-activity labels in an activity corpus of the target of the preset scene, and the event is composed of a set of the sub-activities.
Optionally, the inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of the future activity of the target of interest in the preset scene includes:
inputting the sub-activity label set into the time and or graph model, and obtaining a prediction result of the future activity of the target of interest in the preset scene by using an online symbol prediction algorithm of an Earley analyzer, wherein the prediction result comprises the sub-activity label of the future of the target of interest and an occurrence probability value.
In a second aspect, an embodiment of the present invention provides a scene target activity prediction apparatus based on space-time and or graph, where the apparatus includes:
the scene video acquisition module is used for acquiring scene videos aiming at preset scenes;
the space and or graph model generation module is used for generating a space and or graph model of the preset scene by detecting and tracking targets in the scene video; the space and or graph model characterizes the space position relation of the target in the scene video;
the sub-activity extraction module is used for obtaining a sub-activity tag set for representing the activity state of the concerned target by using a sub-activity extraction algorithm for the space and/or graph model;
The target activity prediction module is used for inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of the future activity of the target of interest in the preset scene; the time and or graph model is obtained by utilizing a pre-established active corpus of targets of the preset scene.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where,
the memory is used for storing a computer program;
the processor is used for realizing the steps of the scene target activity prediction method based on space-time and or graph provided by the embodiment of the invention when executing the program stored on the memory.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of the space-time and or graph-based scene target activity prediction method provided by the embodiments of the present invention.
In the scheme provided by the embodiment of the invention, the space-time sum or the graph is introduced into the field of target activity prediction for the first time. Firstly, target detection and tracking are carried out on a scene video of a preset scene, a space and or graph model of the preset scene is generated, and the space and or graph is used for representing the space position relationship between targets. And performing sub-activity extraction on the space and or graph model to obtain a sub-activity tag set of the concerned target, thereby realizing the advanced semantic extraction of the scene video. The sub-activity tag set is then used as input to a pre-obtained time and or graph model, and the prediction of the next sub-activity is obtained through the time grammar of the time and or graph. Therefore, the aim of effectively predicting the activity of the target in the preset scene can be fulfilled. The scheme provided by the embodiment of the invention can be universally applied to various scenes and has wide applicability.
Drawings
FIG. 1 is a schematic flow chart of a scene target activity prediction method based on space-time and or graph according to an embodiment of the present invention;
FIG. 2 is an exemplary prior-art and or graph;
FIG. 3 is a parse graph of FIG. 2;
FIG. 4 is a spatial and/or diagram of a traffic intersection as an example of an embodiment of the present invention;
FIG. 5 is a graph of the results of traffic intersection target detection and speed calculation according to an exemplary embodiment of the present invention;
FIG. 6 is a result diagram of a traffic intersection time grammar (T-AOG) as an example of an embodiment of the present invention;
FIG. 7 is a schematic diagram of a predictive parse tree for a traffic intersection, according to an example embodiment of the present invention;
FIG. 8 is a diagram of a change in target activity in an actual video of an access control as an example in accordance with an embodiment of the present invention;
FIG. 9 is a confusion matrix diagram of predicted sub-activities and actual sub-activities for a traffic intersection according to an exemplary embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a scene target activity prediction device based on space-time and or graph according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to achieve the purpose of effectively predicting the activity of a target in a scene, the embodiment of the invention provides a scene target activity prediction method based on space-time and or graph.
It should be noted that, the execution subject of the scene target activity prediction method based on space-time and or graph provided by the embodiment of the present invention may be a scene target activity prediction device based on space-time and or graph, and the device may be operated in an electronic apparatus. The electronic device may be a server or a terminal device, or an image processing device, but is not limited thereto.
In a first aspect, a method for predicting scene target activity based on space-time and or graph provided by an embodiment of the present invention is described.
As shown in fig. 1, the method for predicting scene target activity based on space-time and or graph provided by the embodiment of the invention may include the following steps:
s1, acquiring a scene video aiming at a preset scene.
In the embodiment of the present invention, the preset scene refers to a scene at least including a moving target, where the target may be a human, a vehicle, an animal, etc. For example, the preset scene may include a traffic intersection, school, park, etc.
The scene video may be obtained by a video capturing apparatus disposed at the preset scene. The video capturing device may include a camera, a video camera, a mobile phone, etc.; for example, a scene video of a traffic intersection may be captured by a camera disposed on an overpass of the traffic intersection.
According to the embodiment of the invention, the scene video aiming at the preset scene can be acquired from the video shooting equipment in a communication mode. The communication method is not limited to wireless communication, optical fiber communication, and the like.
It will be appreciated that the acquired scene video contains multiple frames of images.
S2, detecting and tracking targets in the scene video to generate a space and or graph model of the preset scene.
In the embodiment of the invention, the spatial and/or graphic model characterizes the spatial position relation of the target in the scene video.
To facilitate an understanding of the present solution, concepts related to and or graphs are first described. The And Or Graph (AOG) is a hierarchical compositional model of a stochastic context-sensitive grammar (SCSG) that represents a hierarchical decomposition from the top level down to leaf nodes by a set of terminal and non-terminal nodes, outlining the basic concepts in image grammar. An And node represents the decomposition of a target into parts, while an Or node represents alternative sub-configurations. Referring to fig. 2, fig. 2 is an exemplary prior-art and or graph. An and or graph includes three types of nodes: And nodes (solid circles in fig. 2); Or nodes (dashed circles in fig. 2); and Terminal nodes (rectangles in fig. 2). An And node represents the decomposition of an entity into parts. It corresponds to grammar rules such as B→ab and C→cd shown in fig. 2. Horizontal links between the child nodes of an And node represent spatial positional relationships and constraints. An Or node acts as a "switch" between alternative sub-structures, and represents class labels at various levels, such as scene, object and part categories. It corresponds to rules such as A→B|C shown in fig. 2. Because of this recursive definition, the and or graphs of many object or scene categories can be combined into one larger and or graph; theoretically, all scene and object categories can be represented by one huge and or graph. A Terminal node, also called a leaf node, is a high-level semantic visual dictionary based on pixels. Due to the scaling property, terminal nodes may appear at all levels of the and or graph. Each terminal node obtains instances from a specific set, called a dictionary, that contains various complex image patches. Elements in the set may be indexed by variables such as their type, geometric transformation, deformation, appearance change, etc. As shown in fig. 2, there are four visual dictionary entries a, b, c, d for the leaf nodes constituting the rectangle A. The image-representation grammar defined in this way is context-sensitive: the terminal nodes are its visual vocabulary, and the And nodes and Or nodes are its production rules.
The and or graph contains all possible parse graphs (pg); a parse graph is one possible configuration generated by the and or graph and is interpreted as an image. The parse graph pg consists of a hierarchical parse tree pt and a set of relationships E (defined as "horizontal edges"):
pg=(pt,E) (1)
The parse tree pt is an And tree in which the non-terminal nodes are And nodes. The production rules that decompose each And node into its parts no longer generate strings but generate a configuration; see fig. 3, which is a parse graph of fig. 2 and yields the configuration relationship r: B → C = ⟨a, b⟩, where C represents the configuration. As to the probability model in the and or graph, probabilities are learned mainly at the Or nodes, so that a generated configuration accounts for the probability of that configuration occurring. Fig. 2 also has another parse graph containing c and d, which is not shown here.
For And Or graphs, a small part dictionary is used to represent objects in an image through And nodes And Or node layering of the And Or graph, and the model can embody a Spatial combination structure of the objects in the image And can also be called as a Spatial And Or graph (S-AOG) model. The spatial and or graphical model represents the object by layering components of the object in different spatial configurations based on the spatial positional relationship of the object. Therefore, the method can be used for analyzing the position relation of each target in image analysis, thereby realizing specific applications such as target positioning and tracking. For example, the target recognition and tracking of complex scenes such as traffic intersections, squares and the like can be realized.
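As an illustration of the structure described above, the following minimal Python sketch (not part of the patent embodiment; the class name AOGNode and the branch probability values are chosen only for illustration) represents And, Or and terminal nodes and assembles the small grammar of fig. 2 (A→B|C, B→ab, C→cd):

    # Minimal sketch of an And-Or graph node structure (illustrative only).
    class AOGNode:
        def __init__(self, name, node_type, children=None, probs=None):
            self.name = name
            self.node_type = node_type      # "AND", "OR" or "TERMINAL"
            self.children = children or []  # decomposition (AND) or alternatives (OR)
            self.probs = probs or []        # branch probabilities, used at OR nodes

    # Terminal nodes: the visual dictionary a, b, c, d of fig. 2.
    a, b, c, d = (AOGNode(x, "TERMINAL") for x in "abcd")
    # And nodes: B -> ab, C -> cd (decomposition into parts).
    B = AOGNode("B", "AND", [a, b])
    C = AOGNode("C", "AND", [c, d])
    # Or node: A -> B | C (alternative sub-configurations; probabilities are illustrative).
    A = AOGNode("A", "OR", [B, C], probs=[0.6, 0.4])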
Specifically, for S2, the following steps may be included:
firstly, detecting targets in a scene video, and determining the category and the position of each target in each frame of image; the categories include people, vehicles, animals, etc. to distinguish the category of the object to which the target belongs; locations such as the area range and coordinates of the object in the image, etc.
Any target detection method, such as a traditional front-back background segmentation, a target clustering algorithm, or the like, or a target detection method based on deep learning, or the like, can be adopted, which is reasonable.
Secondly, the same target in different frame images is determined by using a target tracking technology.
The object tracking aims at locating the position of an object in each frame of video image and generating an object motion track. The object tracking task for an image is to determine the size and position of an object in a subsequent frame given the size and position of the object in an initial frame of a video sequence.
Any target tracking technique in the prior art, such as a tracking method based on correlation filtering (Correlation Filter) or Convolutional Neural Network (CNN), etc., may be used in the embodiments of the present invention.
Again, the positional relationship between the targets in each frame image, such as distance, front-rear orientation relationship, and the like, is determined.
And finally, carrying out spatial relationship decomposition on each target in the frame image to obtain a space and or graph of the frame image, and integrating the space and or graph corresponding to each frame image in the scene video to obtain a space and or graph model of the preset scene.
In an alternative embodiment, S2 may include S21 to S24:
s21, detecting targets in the scene video by utilizing a target detection network obtained through pre-training, and obtaining attribute information corresponding to each target in each frame of image of the scene video.
The object detection network of the embodiment of the invention can comprise: R-CNN, SPP Net, fast R-CNN, YOLO (You Only Look Once, YOLO), SSD (Single Shot MultiBox Detector), etc.
In an alternative embodiment, the object detection network comprises a yolo_v3 network.
The yolo_v3 network comprises a backbone network and three prediction branches. The backbone network is a Darknet-53 network, and yolo_v3 is a fully convolutional network that uses a number of residual skip connections; to reduce the negative effects of pooling on gradients, pooling is abandoned and downsampling is realized by strided convolutions. In this network architecture, convolutions with stride 2 are used for downsampling. Meanwhile, in order to improve the accuracy of the algorithm on small-target detection, yolo_v3 adopts an upsampling and feature-fusion method similar to FPN (Feature Pyramid Networks) and performs detection on feature maps at multiple scales. The three prediction branches adopt a fully convolutional structure. Compared with traditional target detection algorithms, performing target detection with a pre-trained yolo_v3 network improves the accuracy and efficiency of target detection, meeting the goals of prediction accuracy and real-time performance.
For the structure and specific detection procedure of the yolo_v3 network, please refer to the related description of the prior art, and the description is omitted here.
Through a pre-trained yolo_v3 network, attribute information corresponding to each target in each frame of image of the scene video can be obtained. Wherein the attribute information includes position information of a bounding box containing the object. The positional information of the bounding box of the object is represented by (x, y, w, h), where (x, y) represents the center positional coordinates of the current bounding box, w and h represent the width and height of the bounding box, and as will be understood by those skilled in the art, the attribute information includes, in addition to the positional information of the bounding box, the confidence of the bounding box, which reflects the confidence level of the object contained in the bounding box, and the accuracy of the bounding box in predicting the object. Confidence is defined as:
confidence = Pr(object) × IOU (2)
If the bounding box contains no target, Pr(object) = 0 and the confidence is 0; if it contains a target, Pr(object) = 1, so the confidence equals the intersection-over-union (IOU) of the real bounding box and the predicted bounding box.
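A minimal sketch of the confidence computation described above, assuming axis-aligned boxes given as (x, y, w, h) with (x, y) the box center; the function names are illustrative, not part of the embodiment:

    # Illustrative sketch: bounding-box IOU and detection confidence.
    def iou(box_a, box_b):
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        # Convert center/width/height boxes to corner coordinates.
        ax1, ay1, ax2, ay2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
        bx1, by1, bx2, by2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    def confidence(pr_object, pred_box, truth_box):
        # confidence = Pr(object) * IOU(truth, pred); Pr(object) is 0 or 1.
        return pr_object * iou(pred_box, truth_box)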
As will be appreciated by those skilled in the art, the attribute information also includes category information of the object. The category information indicates a category of the object, such as a person, a car, an animal, and the like.
It should be noted that, since a video image may often include a plurality of objects, some objects are far away, or are too tiny, or do not belong to "interested objects" in a preset scene, and these objects are not objects with detection purpose. For example, in traffic intersection scenarios, moving vehicles and humans are of interest, while roadside hydrants are among non-interested targets. Thus, in a preferred embodiment, the preset number of targets can be detected for one frame of image by controlling and adjusting the yolo_v3 network setting in advance in the pre-training link, for example, the preset number can be 30, 40, and the like. And meanwhile, training the YOLO_v3 network by using a labeled training sample with detection purpose, so that the YOLO_v3 network has autonomous learning performance, the trained YOLO_v3 network can be used for unknown scene videos serving as test samples, and the attribute information corresponding to a preset number of targets with detection purpose in each frame of image can be obtained, so that the target detection efficiency and the detection pertinence are improved.
Before S21, the yolo_v3 network needs to be trained in advance for the preset scene, and it will be understood by those skilled in the art that the sample data used in the pre-training is the sample scene video and sample attribute information in the scene, where the sample attribute information includes the category information of the target in each frame of image of the sample scene video and the position information of the bounding box containing the target.
The pre-training process can be summarized as follows:
1) And taking the attribute information of the target corresponding to each frame of image of the sample scene video as the true value corresponding to the frame of image, and training each frame of image and the corresponding true value through a yolo_v3 network to obtain a training result of each frame of image.
2) And comparing the training result of each frame image with the true value corresponding to the frame image to obtain the output result corresponding to the frame image.
3) And calculating the loss value of the network according to the output result corresponding to each frame of image.
4) And (3) according to the loss value, adjusting the parameters of the network, and repeating the steps 1) -3) until the loss value of the network reaches a certain convergence condition, namely the loss value reaches the minimum, wherein at the moment, the training result of each frame of image is consistent with the true value corresponding to the frame of image, so that the training of the network is completed, and the pre-trained YOLO_v3 network is obtained.
For example, for a preset scene of a traffic intersection, the pre-trained yolo_v3 network is trained on the MARS dataset and the vehicle re-identification dataset Vehicle Re-ID Datasets Collection. As will be appreciated by those skilled in the art, both the MARS dataset and the Vehicle Re-ID Datasets Collection are open-source datasets. The MARS dataset (Motion Analysis and Re-identification Set) is for pedestrians, and the Vehicle Re-ID Datasets Collection is for vehicles.
For other scenes, a large number of sample scene videos are required to be obtained in advance, manual or machine labeling is carried out, category information of targets corresponding to each frame of image in each sample scene video and position information of a boundary box containing the targets are obtained, and the yolo_v3 network has target detection performance under the scene through a pre-training process.
S22, based on attribute information corresponding to each target in each frame of image, matching the same targets in each frame of image of the scene video by using a preset multi-target tracking algorithm.
Early target detection and tracking was realized mainly through pedestrian detection: detection relied on traditional feature-point methods, and tracking was realized by filtering matched feature points, for example pedestrian detection based on the histogram of oriented gradients (HOG) feature. Early pedestrian detection suffered from various problems such as missed detections, false alarms, and repeated detections. With the development of deep convolutional neural networks in recent years, various methods have appeared that perform target detection and tracking based on high-precision detection results.
Because a plurality of targets appear in the preset scene aimed at by the embodiment of the invention, the target tracking needs to be realized by utilizing a multi-target tracking (Multiple Object Tracking, MOT) algorithm. The multi-objective tracking problem can be seen as a data correlation problem with the aim of correlating the cross-frame detection results in a sequence of video frames. By utilizing a preset multi-target tracking algorithm to track and detect targets in the scene video, a boundary box of the same target in different frame images of the front frame and the rear frame of the scene video and an ID (Identity document, identity number) of the target can be obtained, namely, the matching of the same target in each frame image is realized.
In an alternative embodiment, the preset multi-target tracking algorithm may include: SORT (Simple Online and Realtime Tracking) algorithm.
The SORT algorithm uses a TBD (tracking-by-detection) approach: tracking is realized by Kalman filtering to estimate the target motion state, and the Hungarian assignment algorithm is used for position matching. The SORT algorithm does not use any object appearance features in the tracking process, but uses only the location and size of the bounding box for motion estimation and data association of the object. Therefore, the complexity of the SORT algorithm is low, the tracker can reach a speed of 260 Hz, the target tracking and detection speed is high, and the real-time requirement of scene video can be met.
Because the SORT algorithm does not consider the shielding problem, and does not carry out target re-recognition through the appearance characteristics of the target, the method is suitable for being applied to a preset scene without shielding of the target.
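The data-association step of this family of trackers can be illustrated with the following sketch, which matches existing track boxes to detection boxes by maximizing bounding-box IOU with the Hungarian assignment (scipy.optimize.linear_sum_assignment); the iou function from the earlier sketch is reused, and the threshold value is only an assumption:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(tracks, detections, iou_threshold=0.3):
        # SORT-style association sketch: tracks and detections are lists of (x, y, w, h) boxes.
        if not tracks or not detections:
            return [], list(range(len(tracks))), list(range(len(detections)))
        cost = np.zeros((len(tracks), len(detections)))
        for i, t in enumerate(tracks):
            for j, d in enumerate(detections):
                cost[i, j] = 1.0 - iou(t, d)   # Hungarian algorithm minimizes total cost
        rows, cols = linear_sum_assignment(cost)
        matches = [(i, j) for i, j in zip(rows, cols) if 1.0 - cost[i, j] >= iou_threshold]
        matched_t = {i for i, _ in matches}
        matched_d = {j for _, j in matches}
        unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
        unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
        return matches, unmatched_tracks, unmatched_dets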
In an alternative embodiment, the preset multi-target tracking algorithm may include: deepSort (Simple online and realtime tracking with a deep association metric) algorithm.
DeepSort is an improvement on SORT target tracking. The algorithm performs track preprocessing and state estimation with a Kalman filtering algorithm and performs association with the Hungarian algorithm; on top of the SORT algorithm it also introduces a deep learning model trained offline on a pedestrian re-identification dataset. When tracking targets in real-time video, nearest-neighbor matching is performed by extracting deep appearance features of the targets, in order to mitigate occlusion and the frequent switching of target IDs. The core idea of DeepSort is to use recursive Kalman filtering and frame-by-frame data association for tracking. DeepSort adds a deep association metric (Deep Association Metric) to SORT for pedestrians, and also adds appearance information (Appearance Information) to enable tracking of targets occluded for longer periods. The algorithm is faster and more accurate than SORT in real-time multi-target tracking.
For specific tracking procedures of the SORT algorithm and the deep SORT algorithm, please refer to the related art for understanding, and detailed description is omitted herein.
S23, determining the actual space distance between different targets in each frame of image.
The position information of each target of each frame of image in the scene video can be obtained by carrying out target detection and tracking in the previous steps, but the position information of each target is insufficient to represent the relationship of each target in the preset scene. Therefore, this step requires determining the actual spatial distance between different objects in each frame of image, and defining the spatial composition relationship of the objects using the actual spatial distance between the two objects. In this way, accurate results can be obtained only when the constructed space and or graph model is used for prediction in the follow-up process.
In an alternative embodiment, the principle of equal scale scaling may be used to determine the actual distance between two objects in an image. Specifically, the actual spatial distance between the two test targets may be measured in a preset scene, a frame of image including the two test targets is captured, and then the pixel distance between the two test targets in the image is calculated, so as to obtain the number of pixels corresponding to the unit length in practice, for example, the number of pixels corresponding to 1 meter in practice. Then, for two new targets needing to detect the actual space distance, the pixel distance of the two targets in a frame of image shot in the scene can be scaled in equal proportion by using a formula by taking the pixel number corresponding to the unit length in the actual as a factor, so as to obtain the actual space distance of the two targets.
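Under the no-distortion assumption described above, the equal-proportion scaling can be sketched as follows; the calibration value is a placeholder measured once for the scene, not a value from the embodiment:

    # Equal-proportion scaling sketch: convert a pixel distance to meters (assumes a distortion-free image).
    PIXELS_PER_METER = 50.0   # placeholder: measured once from two test targets a known 1 m apart

    def pixel_distance(p1, p2):
        return ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5

    def actual_distance_m(p1, p2):
        # Scale the pixel distance by the measured pixels-per-meter factor.
        return pixel_distance(p1, p2) / PIXELS_PER_METER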
It will be appreciated that this solution is simple and feasible, but is more suitable for situations where the image is not distorted. In the case of a distorted image, pixel coordinates and physical coordinates are not in one-to-one correspondence, and the distortion needs to be corrected, for example by correcting the image with cvInitUndistortMap and cvRemap to eliminate the distortion. The implementation of such equal-proportion scaling and the specific procedure of image distortion correction can be understood with reference to the related art, and are not described herein.
In an alternative way, a monocular ranging approach may be used to determine the actual spatial distance between two objects in the image.
The monocular image model can be considered approximately as a pinhole model. Namely, the ranging is realized by using the principle of small-hole imaging. Alternatively, a similar triangle may be constructed by the spatial positional relationship between the camera and the actual object and the positional relationship of the targets in the image, and then the actual spatial distance between the targets is calculated.
Alternatively, the horizontal distance d_x and vertical distance d_y between the actual position of a pixel point of the target and the video capture device (camera) can be calculated from the pixel coordinates of that pixel point by using a related monocular-ranging algorithm of the prior art, which realizes monocular ranging. Then, from the known actual coordinates of the video capture device together with d_x and d_y, the actual coordinates of the pixel point can be derived. For two targets in the image, the actual spatial distance of the two targets can then be calculated from their actual coordinates.
In an alternative embodiment, the actual spatial distance between two targets in the image may be determined by calculating the actual coordinate points corresponding to the pixel points of the targets.
The actual coordinate point corresponding to the pixel point of the calculation target is the actual coordinate of the calculation pixel point.
Alternatively, a monocular vision positioning ranging technique may be employed to obtain the actual coordinates of the pixels.
The monocular vision positioning and ranging technology has the advantages of low cost and quick calculation. Specifically, two modes can be included:
1) And obtaining the actual coordinates of each pixel by using positioning measurement interpolation.
Taking into account the equal magnification of the pinhole imaging model, the measurement can be performed by directly printing paper covered with an equidistant array of dots. Equidistant array points (such as a calibration plate) are measured at a greater distance, interpolated, and amplified in equal proportion to obtain the actual ground coordinates corresponding to each pixel point. This eliminates the need to manually paint and measure marks on the ground. After measuring the dot spacing on the paper, the ground coordinates corresponding to the pixels are obtained by amplifying by the height ratio H/h. To prevent the keystone distortion at the upper edge of the image from becoming so severe that the mark points on the printing paper are hard to identify, this approach requires equidistant-array dot charts prepared for different distances.
2) And calculating the actual coordinates of the pixel points according to the similar triangular proportion.
The main idea of the method is still a small-hole imaging model. However, the requirement for calibrating video shooting equipment (a camera/a camera) is high, and meanwhile, the distortion caused by a lens is required to be small, but the portability and the practicability of the mode are high. The video capturing device may be calibrated first, for example, by MATLAB or OPENCV, and then the conversion calculation of the pixel coordinates in the image may be performed.
An alternative implementation is described below; S23 may include S231 to S233:
S231, in each frame of image, determining the pixel coordinates of each target;
for example, the pixel coordinates of all pixel points in the bounding box containing the target can be determined as the pixel coordinates of the target; or a pixel point on or within the bounding box may be selected as the pixel coordinate of the object, i.e. the object is represented by the pixel coordinate of the object, e.g. the center position coordinate of the bounding box may be selected as the pixel coordinate of the object, etc.
S232, calculating the corresponding actual coordinates of the pixel coordinates of each target in a world coordinate system by utilizing a monocular vision positioning and ranging technology aiming at each target;
The pixel coordinates of any pixel in the image are known. The imaging process of the camera involves four coordinate systems: world coordinate system, camera coordinate system, physical image coordinate system (also called imaging plane coordinate system), pixel coordinate system, and conversion of these four coordinate systems. The conversion relations between these four coordinate systems are known in the art to be derivable. The actual coordinates of the pixel points in the image corresponding to the world coordinate system can be calculated by using a coordinate system transformation formula and other methods, for example, a plurality of open algorithm programs in OPENCV language are utilized to obtain the actual coordinates in the world coordinate system from the pixel coordinates. Specifically, for example, the corresponding world coordinates are obtained by inputting the camera's internal parameters, rotation vectors, translation vectors, pixel coordinates, and the like in some OPENCV programs, and using a correlation function.
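One way to realize the pixel-to-world conversion mentioned above with OpenCV is a ground-plane homography computed from four reference points whose world coordinates were measured by hand; the following is only a sketch under that assumption (the reference values are placeholders), not the exact procedure of the embodiment:

    import cv2
    import numpy as np

    # Four reference points: pixel coordinates and their measured ground coordinates in meters (placeholders).
    pixel_pts = np.float32([[100, 400], [500, 400], [520, 200], [80, 200]])
    world_pts = np.float32([[0, 0], [6, 0], [6, 15], [0, 15]])
    H = cv2.getPerspectiveTransform(pixel_pts, world_pts)   # ground-plane homography

    def pixel_to_world(u, v):
        # Map a pixel (u, v) on the ground plane to world coordinates (X, Y) in meters.
        p = np.float32([[[u, v]]])
        X, Y = cv2.perspectiveTransform(p, H)[0, 0]
        return float(X), float(Y)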
Assume that for target A and target B, the obtained actual coordinates in the world coordinate system corresponding to the center position coordinates of the bounding box representing target A are (X_A, Y_A), and the obtained actual coordinates corresponding to the center position coordinates of the bounding box representing target B are (X_B, Y_B). Further, if target A has an actual height h, the actual coordinates of target A are corrected to (X_A·(H−h)/H, Y_A·(H−h)/H), where h is the actual height of target A and H is the height of the video capture device.
S233, for each frame image, the actual space distance between every two targets in the frame image is obtained by using the actual coordinates of every two targets in the frame image.
Calculating the distance between two points from their actual coordinates follows the prior art. For the above example, ignoring the actual height of the targets, the actual spatial distance D between targets A and B is: D = sqrt((X_A − X_B)^2 + (Y_A − Y_B)^2). The case that takes the actual height of the targets into account is similar.
Alternatively, if the pixel coordinates of each of the targets a and B are obtained in S231, the actual distances of the targets a and B may be calculated by using the pixel coordinates, and then one of the actual distances may be selected as the actual spatial distance of the targets a and B according to a certain selection criterion, for example, the minimum actual distance is selected as the actual spatial distance of the targets a and B, which is reasonable.
Specific details of the above aspects can be found in computer vision (computer vision) and related concepts related to camera Calibration (camera Calibration), world coordinate system, camera coordinate system, physical image coordinate system (also called imaging plane coordinate system), pixel coordinate system, LABVIEW vision development, OPENCV related algorithm, LABVIEW example, and Calibration example, which will not be described herein.
In an alternative embodiment, determining the actual spatial distance between different targets in each frame of image may also be implemented using a binocular camera optical image ranging method.
A binocular camera works like a person's two eyes: because of the difference in angle and position, the images of the same object captured by the two cameras differ, and this difference is called parallax. The size of the parallax is related to the distance between the object and the camera, so the target can be located according to this principle. Binocular-camera optical-image ranging calculates the parallax between the two images captured by the left and right cameras. The specific method is similar to monocular-camera optical-image ranging, but provides more accurate ranging and positioning information than a monocular camera. The binocular ranging method involves operations such as image rectification and epipolar matching on the two images; for the specific ranging process of the binocular-camera optical-image ranging method, reference is made to the related prior art, and it is not described herein.
In an alternative embodiment, determining the actual spatial distance between different targets in each frame of image may also include:
and obtaining the actual space distance between the two targets in each frame image by using a depth camera ranging method.
The depth camera ranging method can directly obtain the depth information of the target from the image, and can accurately and rapidly obtain the actual spatial distance between the target and the video shooting equipment without coordinate calculation, so that the actual spatial distance between the two targets is determined, and the accuracy and timeliness are higher. For a specific ranging procedure of the depth camera ranging method, please refer to the related art, and detailed description thereof is omitted herein.
S24, generating a space and or graph model of the preset scene by utilizing the attribute information of the target corresponding to each frame of matched image and the actual space distance.
In this step, for each frame image, the detected object and the attribute information of the object are taken as leaf nodes of the space and or graph, and the actual space distance between different objects is taken as the space constraint of the space and or graph, thereby generating the space and or graph of the frame image. And forming a space and or graph model of the preset scene by the space and or graphs of all the frame images.
Taking a scene of a traffic intersection as an example, please refer to fig. 4, fig. 4 is a space and/or diagram of the traffic intersection as an example in an embodiment of the present invention.
The upper graph in fig. 4 represents a frame of image of a traffic intersection, which is a root node of a space and or graph as a preset scene. Three targets were detected by the above method, respectively, the left, middle and right three figures below in fig. 4. The left image is a pedestrian, category information is marked in the image, the person is represented, and a boundary box of the pedestrian is marked; the middle graph is a car, the image is marked with category information 'car', which indicates the car, and a boundary box of the car is marked; the right image is a bus, and the image is marked with category information 'bus', which indicates the bus, and is also marked with a boundary box of the bus. The above category information and the positional information of the bounding box are attribute information of the object. Meanwhile, if the same object in different frame images, such as the car, is aimed at, the car is also marked with the ID so as to distinguish the car from other objects in different frame images, such as the ID of the car can be represented by numbers or symbols.
Pedestrians, cars and buses, and the three targets and the corresponding attribute information are leaf nodes of a space and or graph. Wherein the actual spatial distance between each two objects serves as a spatial constraint of a spatial and/or map (not shown in fig. 4).
The generation process of a space sum or diagram may be specifically referred to the description of the related art, and will not be described herein.
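To make the construction in S24 concrete, the following sketch assembles one frame's space and or graph from the tracked targets and their pairwise actual distances; the data layout and names are illustrative assumptions, not the embodiment's data structures:

    from itertools import combinations

    def build_frame_saog(frame_id, targets, distance_fn):
        # targets: list of dicts like {"id": 3, "class": "car", "bbox": (x, y, w, h), "world": (X, Y)}.
        # Returns a simple per-frame S-AOG: the frame as root, targets as leaf nodes,
        # and pairwise actual distances as spatial constraints (horizontal edges).
        leaves = [{"id": t["id"], "class": t["class"], "bbox": t["bbox"]} for t in targets]
        constraints = {}
        for a, b in combinations(targets, 2):
            constraints[(a["id"], b["id"])] = distance_fn(a["world"], b["world"])
        return {"frame": frame_id, "leaves": leaves, "spatial_constraints": constraints}

    # The space and or graph model of the scene is then the sequence of per-frame graphs, e.g.:
    # saog_model = [build_frame_saog(i, frame_targets[i], euclidean) for i in range(num_frames)]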
Further, after the space and or graph model of the preset scene is generated, a new scene and a new spatial position relationship between the targets can be generated by using the space and or graph model of the preset scene. For example, the space and or graph models of two preset scenes can be integrated to obtain a new space and or graph model containing the two preset scenes, so as to realize scene expansion.
S3, obtaining a sub-activity label set representing the activity state of the target of interest by using a sub-activity extraction algorithm on the space and or graph model.
S1-S2 realize the detection of the leaf nodes of the space AND or graph. The step is to extract sub-activities to obtain an event sequence of sub-activity combination so as to express the whole event represented by the scene video. It should be noted that, the sub-activity extracted in this step is actually the activity of the target, and the sub-activity is described in terms of the leaf node of the graph.
In an alternative embodiment, S3 may include S31 to S34:
before S31, the sub-activity tab set subactivsts=null, which is a string array, may be initialized to store sub-activity tabs. Then, S31 to S34 are performed.
S31, determining the paired targets with the actual space distance smaller than a preset distance threshold value in the space and or graph model as the target of interest.
Optionally, in the space and or graph model, a pair of targets, of which the actual space distance between the space and or in the graph corresponding to the first frame image is smaller than a preset distance threshold, is determined as the target of interest.
If the actual spatial distance between two targets is small, it may be shown that there are more active contacts of the two targets, such as approach, collision, etc., so it is necessary to continuously observe the two targets as the target of interest, and predict the future activity of the two targets; conversely, if the actual spatial distance between two targets is large, this means that there is less likelihood that the two targets will have an active intersection, and therefore no corresponding active prediction is necessary.
Therefore, in the first frame image, the actual spatial distance d between different targets is calculated, and pairs of targets whose actual spatial distance d is smaller than the preset distance threshold minDis are determined as targets of interest. For different preset scenes, preset distance thresholds minDis with different sizes can be set, for example, under a traffic intersection scene, the safety distance between targets (vehicles or people) is concerned, and the minDis can be 200 meters and the like.
Alternatively, for S31, it may be:
and determining paired targets with the actual spatial distance smaller than a preset distance threshold value in the spatial and/or image model except the last frame of image corresponding to each frame of image as the target of interest.
That is, except for the last frame of image, the operation of determining the target of interest is performed in each frame of image, so that more targets of interest can be found in time.
S32, determining the actual space distance of each pair of the objects of interest and the speed value of each object of interest for each frame image.
In this step, starting from the first frame image, the actual spatial distance d of each pair of targets of interest (smaller than the preset distance threshold minDis) may be saved in Distance[x]; Distance[x] is a multidimensional array that stores the actual spatial distances d between different targets. Here x represents the sequence number of the corresponding image; for example x = 1 represents the first frame image.
Meanwhile, a speed value of the same object of interest in each frame image, which refers to the speed of the object of interest in the current frame of the scene video, may be calculated. The calculation method of the velocity value of the target is briefly described below:
Calculating the speed value of a target requires the distance s and time t over which the target moves between the previous and the following frame images. First, the frame rate FPS of the camera is obtained. Specifically, in the development software OpenCV, the frames per second (FPS) of the video may be obtained using the built-in get(CAP_PROP_FPS) and get(CV_CAP_PROP_FPS) methods.
Sampling once every k frames gives:
t = k / FPS (s) (3)
Thus, the velocity value v of the target can be calculated by:
v = sqrt((X_2 − X_1)^2 + (Y_2 − Y_1)^2) / t (4)
where (X_1, Y_1) and (X_2, Y_2) respectively represent the actual coordinates of the target in the previous frame image and the following frame image; the actual coordinates of the target can be obtained through step S232. Since calculating the velocity value of the target in the current frame image requires both the previous frame image and the current frame image, it is understood that velocity values of the target are obtained starting from the second frame image.
The speed of the target of interest in the video can be calculated by the method, and referring to fig. 5, fig. 5 is a graph of the traffic intersection target detection and speed calculation result by taking the embodiment of the invention as an example. Wherein a corresponding velocity value, such as 9.45m/s, is identified alongside the bounding box of each object of interest.
For the same object of interest, the velocity value in the first frame image may be denoted by v1, the velocity value in the second frame image may be denoted by v2, …, and so on.
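The speed computation above can be sketched as follows, reading the frame rate with OpenCV and sampling world coordinates every k frames; the video path, the value of k and the source of the coordinates are assumptions of the sketch:

    import cv2

    cap = cv2.VideoCapture("scene.mp4")     # placeholder video path
    fps = cap.get(cv2.CAP_PROP_FPS)         # frames per second of the scene video
    k = 5                                   # sample once every k frames (illustrative)
    t = k / fps                             # elapsed time between samples, formula (3)

    def speed(prev_xy, curr_xy, t):
        # Formula (4): speed in m/s from world coordinates of the same target in two sampled frames.
        dx = curr_xy[0] - prev_xy[0]
        dy = curr_xy[1] - prev_xy[1]
        return (dx * dx + dy * dy) ** 0.5 / t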
S33, obtaining distance change information representing the actual space distance change condition of each pair of concerned targets and speed change information representing the speed value change condition of each concerned target by comparing the next frame image and the previous frame image in sequence.
For example, for two objects of interest E and F, the actual spatial distance between the two in the previous frame image is 30, and the actual spatial distance between the two in the subsequent frame image is 20, and then the comparison knows that the actual spatial distance between the two is reduced, which is the distance change information of the two. Similarly, if the speed value of E in the preceding frame image is 8m/s and the speed value in the following frame image is 10m/s, the comparison shows that the speed of E becomes faster, which is the speed change information thereof.
And obtaining the distance change information and the speed change information of each concerned object, which correspond to each frame of image and occur sequentially, until the images of all frames are traversed.
S34, describing distance change information and speed change information which are sequentially obtained by each concerned object by using semantic tags, and generating a sub-activity tag set representing the activity state of each concerned object.
The step is to describe the distance change information and the speed change information into text forms such as acceleration, deceleration, approaching, separating and the like by using language to obtain sub-activity labels representing the activity state of the concerned target, and finally obtaining a sub-activity label set by sub-activity labels which correspond to each frame of image and sequentially occur. The sub-active tag set embodies a sequence of sub-events of the scene video. The embodiment of the invention realizes the description of the scene video by utilizing the sub-activity tag set, namely, the semantic description of the whole video is obtained through the combination of different sub-activities of each target in the video, and the semantic extraction of the scene video is realized.
The sub-activity definitions in embodiments of the present invention may refer to the manner in which sub-activity labels are defined in the CAD-120 dataset; shorter label patterns help generalize the nodes of the and or graph. Sub-activity tags of interest may be defined specifically for different preset scenes.
In this step, a complete sub-activity tag set subactivsts can be obtained.
For different preset scenes, embodiments of the present invention define the sub-activities (i.e., sub-events) of the scene when analyzing target activities (events), and the label of each sub-activity is obtained through the target detection, tracking, and speed calculation methods described above. The sub-activity tags differ across preset scenes. Taking a traffic intersection scene as an example, the following sub-activity tags may be defined:
parking (car_stopping), person stationary (person_stopping), person away (away), car acceleration (accelerate), car deceleration (decelerate), car uniform velocity (moving_uniformly), person-car approaching (closing), no person or car (None), person passing the zebra crossing (walking, running), collision (crash).
It will be appreciated that if, in S31, the operation of determining targets of interest is performed on every frame image except the last one, the sub-activity tag sets obtained through S32 to S34 may cover a larger number of targets of interest; for example, some targets of interest are first determined in the second frame image, and so on.
S4, inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of future activities of the concerned targets in the preset scene.
The time and or graph model is obtained by utilizing an active corpus of targets of a preset scene, which is built in advance.
The targets of interest differ across scenes, so different scenes need to be modeled separately to represent their target activities (events). To construct the time and or graph (T-AOG), an activity corpus of the targets of the preset scene must be obtained; this corpus can be regarded as prior knowledge of videos of the preset scene, and the more comprehensively it covers the target activities (events), the more accurate the constructed T-AOG model.
The construction process of the time and or graph model of the embodiment of the invention comprises the following steps:
(1) Observing a sample scene video of the preset scene, extracting corpora of the various events involving the targets in the sample scene video, and establishing an activity corpus of the targets of the preset scene.
In the activity corpus of the targets of the preset scene, the activity state of a target is represented by sub-activity labels, and an event is composed of a set of sub-activities.
A corpus of events is obtained by analyzing videos of different samples of the preset scene; each corpus entry is a possible combination of leaf nodes occurring in time order. Taking the traffic intersection scene as an example, the following entry can represent one video: "closing person_closing moving_uniformly walking away", which can be read as: the person and the vehicle approach, the person stays still, the vehicle passes at a constant speed, the vehicle stops, the person crosses, and the person and the vehicle move apart.
Embodiments of the invention require the obtained scene corpus to cover the events of the scene as completely as possible, which makes the subsequent target activity prediction more accurate.
(2) For the activity corpus of the targets of the preset scene, learning the symbol grammar structure of each event by using a grammar induction algorithm based on ADIOS, and taking the sub-activities as terminal nodes of the time and or graph to obtain the time and or graph model.
Specifically, the ADIOS-based grammar induction algorithm learns And nodes and Or nodes by generating significant patterns and equivalence classes. The algorithm first loads the activity corpus onto a graph whose vertices are the sub-activities, augmented with two special symbols (start and end). Each event sample is represented by a separate path on the graph. Candidate patterns are then generated by traversing different search paths. In each iteration, each sub-path is tested for statistical significance according to a context-sensitive criterion; a significant pattern is identified as an And node. The algorithm then finds equivalence classes by looking for units that are interchangeable in a given context; an equivalence class is identified as an Or node. At the end of the iteration, the significant pattern is added to the graph as a new node, replacing the sub-path it contains. The original sequence data of symbolic sub-activities can be obtained from the activity corpus of the targets of the preset scene, and the symbol grammar structure of each event can be learned from this sequence data with the ADIOS-based grammar induction algorithm. Embodiments of the present invention tend to use shorter significant patterns so that basic syntax elements can be captured. As an example, the T-AOG generated from the traffic intersection corpus is shown in fig. 6; fig. 6 is a result diagram of the traffic intersection temporal grammar (T-AOG) according to an embodiment of the present invention. Double-line circles and single-line circles denote And nodes and Or nodes, respectively. The numbers (fractions less than 1) on the branch edges of an Or node represent branch probabilities, and the numbers on the edges of an And node represent the temporal expansion order.
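The full ADIOS procedure, with its context-sensitive significance test, is more involved than can be shown here; the following toy sketch only illustrates the two underlying ideas, namely that frequently recurring contiguous sub-sequences become candidate And nodes and that sub-activities interchangeable in the same context become branches of an Or node. The corpus entries and all names are hypothetical.

# Toy sketch of the two ideas behind ADIOS-style induction used to build the
# T-AOG: (1) frequent contiguous sub-sequences of sub-activities become
# candidate And nodes; (2) sub-activities interchangeable in the same
# left/right context become branches of an Or node. The statistical
# significance test of the real ADIOS algorithm is omitted.
from collections import Counter, defaultdict

def candidate_and_nodes(corpus, n=2, min_count=2):
    """corpus: list of event sentences, each a list of sub-activity labels."""
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return [pat for pat, c in counts.items() if c >= min_count]

def candidate_or_nodes(corpus):
    """Group sub-activities that occur between the same neighbours."""
    contexts = defaultdict(set)
    for sent in corpus:
        padded = ['<s>'] + sent + ['</s>']
        for left, mid, right in zip(padded, padded[1:], padded[2:]):
            contexts[(left, right)].add(mid)
    return [alts for alts in contexts.values() if len(alts) > 1]

corpus = [
    ['closing', 'person_stopping', 'moving_uniformly', 'walking', 'away'],
    ['closing', 'person_stopping', 'car_stopping', 'walking', 'away'],
]
print(candidate_and_nodes(corpus))   # e.g. ('closing', 'person_stopping'), ('walking', 'away')
print(candidate_or_nodes(corpus))    # e.g. {'moving_uniformly', 'car_stopping'}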
After obtaining the time and or graph model, for S4, the following steps may be included:
inputting the sub-activity label set into the time and or graph model, and obtaining a prediction result of the future activity of the target of interest in the preset scene by using an online symbol prediction algorithm of the Earley parser, wherein the prediction result comprises the future sub-activity label of the target of interest and its occurrence probability value.
The sub-activity label represents the positional relationship or motion state of the paired targets of interest at a future time. For S4, the sub-activity tag set containing every pair of targets of interest may be input into the time and or graph model, in which case the prediction result includes the future sub-activity label and occurrence probability value for each pair. It is equally reasonable to input the sub-activity tag set of a single pair of targets of interest into the time and or graph model to obtain the future sub-activity label and occurrence probability value for that pair.
The embodiment of the invention constructs the T-AOG from the activity corpus of targets of the preset scene, uses the sub-activity tag set obtained from the S-AOG as the input of the T-AOG, and then predicts the next possible sub-activity on the T-AOG with an online symbol prediction algorithm based on the Earley parser. The Earley parser is an algorithm for parsing sentences of a given context-free language and is designed on dynamic programming ideas.
The symbol prediction algorithm of the Earley parser is described below. The Earley parser reads the terminal symbols in order, creating a set of all pending derivations (states) consistent with the input read so far. Given the next input symbol, the parser iteratively performs one of three basic operations (predict, scan, and complete) on each state in the current state set.
In the following description, α, β, and γ denote arbitrary strings of terminal and/or non-terminal symbols (possibly empty), A1 and B1 denote single non-terminal symbols, and T denotes a terminal symbol.
Earley's "·" notation marks the progress of parsing: for a production A1→αβ, the state A1→α·β indicates that α has already been parsed and β is still to be predicted.
The input position n is defined as the position after accepting the n-th symbol, and input position 0 is the position before any input. At each input position m, the parser generates a state set S(m). Each state is a tuple (A1→α·β, i) consisting of:
(1) the production currently being matched (A1→αβ);
(2) the dot "·" marking the current parse position: α has been parsed and β remains to be predicted;
(3) i, the position in the input at which matching of this production began. For the span [i, j] of an analyzed substring, the integer i marks the start of the state (the start of the analyzed substring) and j marks its end (the end of the analyzed substring), with i ≤ j.
The parser repeatedly performs three operations: prediction, scanning, and completion:
Prediction (Predictor): for each state in S(m) of the form (A1→α·B1β, i), where the dot is followed by the non-terminal B1, add a state (B1→·γ, m) to S(m) for every production B1→γ of the grammar;
Scanning (Scanner): for each state in S(m) of the form (A1→α·Tβ, i), if T is the next symbol in the input stream, move the dot one position to the right over the terminal T, i.e., add (A1→αT·β, i) to S(m+1);
Completion (Completer): for each state in S(m) of the form (A1→γ·, j), find every state in S(j) of the form (B1→α·A1β, i) and add (B1→αA1·β, i) to S(m);
during this process, duplicate states are not added to a state set. The three operations are repeated until no new state can be added to the state set.
The execution steps of the Earley parser's symbol prediction algorithm may include:
Let the input sentence contain n words; the positions between words can be written as 0, 1, …, n, i.e., n+1 positions are generated.
Step one: the start rule of the T-AOG, such as S→α, is added to chart[0], forming the state (S→·α, [0, 0]).
Step two: for each state in chart[i], if the current state is an "incomplete state" and the dot is not followed by a terminal symbol T, execute the Predictor; if the current state is an "incomplete state" and the dot is followed by a terminal symbol T, execute the Scanner; if the current state is a "complete state", execute the Completer.
Step three: if i is less than n, jump to step two; otherwise end the analysis.
Step four: if a state (S→α·, [0, n]) is finally obtained, the input string is accepted as a legal sentence of the grammar; otherwise the analysis fails.
In an embodiment of the present invention, using the Earley parser's symbol prediction algorithm, the current sentence of the sub-activity is used as input to the Earley parser and all pending states are scanned to find the next possible end node (sub-activity).
For details of the symbol prediction algorithm of the Earley parser, please refer to the description of the related art.
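As a rough illustration of this prediction step, the sketch below runs a plain (unweighted) Earley recognizer over a small hypothetical T-AOG fragment written as context-free rules and reports which terminal sub-activities could come next; the real model additionally carries Or-branch probabilities, which are omitted here, and the grammar shown is an assumption for illustration only.

# Minimal sketch of next-sub-activity prediction with an Earley parser.
# GRAMMAR is a made-up T-AOG fragment: each Or branch is a separate production.
GRAMMAR = {
    'Event':    [['Approach', 'Cross', 'away']],
    'Approach': [['closing', 'person_stopping'], ['closing']],
    'Cross':    [['moving_uniformly', 'walking'], ['walking']],
}
TERMINALS = {'closing', 'person_stopping', 'moving_uniformly', 'walking', 'away'}

def earley_states(words, start='Event'):
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))        # (lhs, rhs, dot, origin)
    for m in range(len(words) + 1):
        changed = True
        while changed:                                  # Predictor/Completer closure
            changed = False
            for (lhs, rhs, dot, org) in list(chart[m]):
                if dot < len(rhs) and rhs[dot] not in TERMINALS:   # Predictor
                    for alt in GRAMMAR[rhs[dot]]:
                        new = (rhs[dot], tuple(alt), 0, m)
                        changed |= new not in chart[m]
                        chart[m].add(new)
                elif dot == len(rhs):                               # Completer
                    for (l2, r2, d2, o2) in list(chart[org]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            changed |= new not in chart[m]
                            chart[m].add(new)
        if m < len(words):                                          # Scanner
            for (lhs, rhs, dot, org) in chart[m]:
                if dot < len(rhs) and rhs[dot] == words[m]:
                    chart[m + 1].add((lhs, rhs, dot + 1, org))
    return chart

def predict_next(words):
    """Terminal sub-activities that may follow the observed prefix."""
    final = earley_states(words)[-1]
    return {rhs[dot] for (_, rhs, dot, _) in final
            if dot < len(rhs) and rhs[dot] in TERMINALS}

print(predict_next(['closing', 'moving_uniformly', 'walking']))   # {'away'} under this toy grammar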
In summary, in an embodiment of the present invention, a space-time and or graph (ST-AOG) is used to represent target activity. The space-time and or graph (ST-AOG) consists of a space and or graph (S-AOG) and a time and or graph (T-AOG); it can be understood as being built by using the root nodes of the space and or graph as the leaf nodes of the time and or graph. The S-AOG represents the state of a scene: it expresses the spatial relationships among targets hierarchically through the targets and their attributes, and represents the minimal sub-events (such as the sub-event labels person stationary, car acceleration, and person-car approaching) through the spatial position relationships obtained from target detection. The root node of the S-AOG is a sub-activity label and its terminal nodes are the targets and the relationships between targets. The T-AOG is a stochastic temporal grammar that models target activity as a hierarchy decomposing an event into several sub-events; its root node is the activity (event) and its terminal nodes are sub-activities (sub-events).
The learning of the ST-AOG can be broken down into two main parts: the first is to learn the symbol grammar structure (T-AOG) of each event/task; the second is to learn the parameters of the ST-AOG, including the branch probabilities of Or nodes. Further details of the ST-AOG are not described here.
Further, after obtaining the prediction result of the future activity of the target of interest, the embodiment of the present invention may further perform subsequent processing, such as outputting, displaying, sending, etc., based on the prediction result.
In an alternative implementation mode, when the predicted result meets the preset alarm condition, control information for alarming can be sent to the alarm device to control the alarm device to send an alarm signal.
For example, when the predicted result is a collision, control information may be sent to the alarm device to control the alarm device to send an alarm signal. The alarm signal may comprise an acoustic and/or optical signal or the like.
In an alternative embodiment, when the prediction result is that the distance between the two targets is smaller than a preset distance value representing the safe distance, control information may be sent to a warning device, such as a broadcasting device, to control the warning device to send a warning signal, to remind the targets to avoid collision, etc. In an alternative implementation mode, when the prediction result is that the pedestrian approaches the zebra crossing of the intersection, control information can be sent to the traffic light control equipment, the traffic light is changed to be red, and the vehicle is reminded to stop running and the like.
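A minimal sketch of this kind of post-processing is given below; the condition names, the safe-distance threshold, the confidence threshold, and the send_control_info helper are all illustrative assumptions, not part of the original method.

# Illustrative sketch: checking the prediction result against preset alarm
# conditions and sending control information to the corresponding device.
SAFE_DISTANCE_M = 5.0   # hypothetical safe-distance threshold, in meters
MIN_CONFIDENCE = 0.2    # hypothetical minimum occurrence probability to act on

def handle_prediction(predicted_label, predicted_prob, pair_distance_m, send_control_info):
    if predicted_prob < MIN_CONFIDENCE:
        return  # prediction too uncertain to act on
    if predicted_label == 'crash':
        # predicted collision: trigger the alarm device (sound and/or light signal)
        send_control_info(device='alarm', signal='alarm')
    elif pair_distance_m < SAFE_DISTANCE_M:
        # targets predicted to come within the safe distance: broadcast a warning
        send_control_info(device='broadcast', signal='warning')
    elif predicted_label == 'walking':
        # pedestrian predicted to approach the zebra crossing: switch the light to red
        send_control_info(device='traffic_light', signal='red')

# Example usage with a stand-in sender that just prints the control information
handle_prediction('crash', 0.8, 3.2, lambda **kw: print('control info:', kw))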
In order to illustrate the prediction results and effects of the embodiments of the present invention, experimental results are summarized below for different preset scenes; the preset scenes may include traffic intersections, entrance guards, fences, parks, schools, and the like. The first two preset scenes are selected for explanation:
1) Traffic intersection
Defining sub-activities in the scene includes:
parking (car_stopping), person stationary (person_stopping), person away (away), car acceleration (accelerate), car deceleration (decelerate), car uniform velocity (moving_uniformly), person-car approaching (closing), no person or car (None), person passing the zebra crossing (walking, running), collision (crash).
According to the video corpus of the traffic intersection, a T-AOG model is constructed by using the method, and all events in the scene can be found in the T-AOG model.
Two targets of interest, person29 and car22, are identified through S1 and S2. A sub-activity tag set, i.e., a sentence representing the sub-events, is obtained through the sub-activity extraction algorithm of S3.
The sub-activity tag set is input into the T-AOG model, i.e., the input event sentence that combines the sub-activities is as follows:
sentence='closing moving_uniformly person_stopping walking'
An online symbol prediction algorithm based on the Earley parser is then used to predict the next possible sub-activity in the T-AOG model.
The program output result may be:
['closing','moving_uniformly','person_stopping','walking']
(['away'],[0.28])
The obtained prediction parse tree is shown in fig. 7; fig. 7 is a schematic diagram of the prediction parse tree for the traffic intersection example according to an embodiment of the invention. In the program output, the first row is the previously observed event sentence composed of sub-activities, i.e., the sub-activity tag set. The second row gives the predicted string (sub-activity label) and its probability. In the parse tree, the lowermost character "walking" is the observation at the current time, and the character "away" on the right is the character predicted by the T-AOG model. The predicted next sub-activity is therefore "away". Considering how the actual spatial position relationship of the targets changes between earlier and later frames of the video, person29 and car22 first approach and then move away from each other, so the sub-activities between the targets in the actual video coincide with the predicted sub-activity. The prediction result of the embodiment of the invention is consistent with the change of the targets in the video, which shows good prediction accuracy.
Further, if the next sub-activity of the two objects of interest is predicted to be a collision "crash", it is determined whether the distance between them is less than the pre-warning distance threshold. If so, an early warning signal may be output.
2) Access control
The entrance guard in embodiments of the present invention covers any access area with a door-like component, such as a bank gate or the gate of a restricted military area. In such a scene, the position of the entrance in the video is fixed and may, for example, be calibrated manually.
Defining sub-activities in the scene includes:
unmanned (None), person stationary (person_stopping), person approaching (closing), person away (away), person walking (walking, running), person passing (passing).
Similarly, according to the video corpus at the entrance guard, a T-AOG model is constructed by using the method, and all events in the scene can be found in the T-AOG model.
For example, a pair of targets of interest, a pedestrian and the entrance guard, is determined through S1 and S2. A sub-activity tag set, i.e., a sentence representing the sub-events, is obtained through the sub-activity extraction algorithm of S3. The sub-activity tag set is input into the T-AOG model, i.e., the input event sentence that combines the sub-activities is as follows:
sentence='closing walking person_stopping'
An online symbol prediction algorithm using the Earley parser predicts the next possible sub-activity in the T-AOG model.
The program output result may be:
['closing','walking','person_stopping']
(['passing'],[0.5])
The predicted next sub-activity label is thus a person passing (passing) through the entrance guard.
In this scene, the previously observed sub-activities at the entrance guard are that a person approaches the entrance guard and then stops at it, and the next sub-activity label is predicted to be passing. Referring to fig. 8, which illustrates the change of target activity in an actual entrance guard video according to an embodiment of the present invention, a comparison of the earlier and later frame images shows that the predicted result is consistent with the relationship between the person and the entrance guard in the actual images.
The prediction process for the remaining preset scenes is similar to the above example, and will not be described herein.
In addition, in the sub-activity prediction experiments, embodiments of the present invention extract and analyze multi-target sub-activities in different scenes and then compare them with the sub-activities in the actual videos. The accuracy of the sub-activity results obtained with the activity prediction method herein is evaluated using confusion matrix analysis.
Taking the traffic intersection as an example, a confusion matrix may be used to compare the actual spatial position changes between targets with the detected position changes. As shown in Table 1, conventional methods, such as an SVM-based target classification detector, a trained two-layer LSTM model, the VGG-16 network of R-CNN, the KGS Markov random field model, and ATCRF, reach an accuracy of at most about 87% for sub-activity extraction on the CAD-120 dataset.
TABLE 1 Comparison of accuracy of traditional target detection methods in sub-activity extraction
Method    SVM    LSTM   VGG-16   KGS    ATCRF
P/R (%)   33.4   42.3   -        83.9   87
Referring to fig. 9, fig. 9 is a confusion matrix diagram of predicted sub-activities versus actual sub-activities at the traffic intersection according to an embodiment of the present invention. In fig. 9, the abscissa represents the true sub-activity and the ordinate the predicted sub-activity; it can be seen from the figure that the predicted sub-activity labels substantially match the actual sub-activities. The prediction accuracy reaches about 90%, higher than that of first obtaining sub-activity labels with a traditional target detection method and then predicting. This shows that the sub-activity prediction results of the embodiment of the invention are highly accurate.
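As a small illustration of this evaluation step, the sketch below builds a confusion matrix of predicted versus actual sub-activity labels and computes overall accuracy; the label sequences used are made-up placeholders rather than experimental data.

# Illustrative sketch: confusion matrix and accuracy of sub-activity prediction.
from collections import Counter

def confusion_matrix(actual, predicted):
    return Counter(zip(actual, predicted))   # (true_label, predicted_label) -> count

def accuracy(actual, predicted):
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ['closing', 'walking', 'away', 'closing']
predicted = ['closing', 'walking', 'away', 'away']
print(confusion_matrix(actual, predicted))
print(f'accuracy: {accuracy(actual, predicted):.2f}')   # 0.75 on this toy example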
In the scheme provided by the embodiment of the invention, the space-time sum or the graph is introduced into the field of target activity prediction for the first time. Firstly, target detection and tracking are carried out on a scene video of a preset scene, a space and or graph model of the preset scene is generated, and the space and or graph is used for representing the space position relationship between targets. And performing sub-activity extraction on the space and or graph model to obtain a sub-activity tag set of the concerned target, thereby realizing the advanced semantic extraction of the scene video. The sub-activity tag set is then used as input to a pre-obtained time and or graph model, and the prediction of the next sub-activity is obtained through the time grammar of the time and or graph. Therefore, the aim of effectively predicting the activity of the target in the preset scene can be fulfilled. The scheme provided by the embodiment of the invention can be universally applied to various scenes and has wide applicability.
In a second aspect, corresponding to the above method embodiment, the embodiment of the present invention further provides a scene-based target activity prediction apparatus, as shown in fig. 10, where the apparatus includes:
a scene video acquisition module 1001, configured to acquire a scene video for a preset scene;
the space and or graph model generation module 1002 is configured to generate a space and or graph model of a preset scene by detecting and tracking a target in a scene video; the space and or graph model characterizes the space position relation of the target in the scene video;
a sub-activity extraction module 1003, configured to obtain, for the spatial and/or graph model, a sub-activity tag set that characterizes an activity state of the target of interest by using a sub-activity extraction algorithm;
the target activity prediction module 1004 is configured to input the sub-activity label set into a time and/or graph model obtained in advance, to obtain a prediction result of a future activity of the target of interest in a preset scene; the time and or graph model is obtained by utilizing an active corpus of targets of a preset scene, which is built in advance.
Optionally, the spatial and or graph model generating module 1002 includes:
the target detection sub-module is used for detecting targets in the scene video by utilizing a target detection network obtained through pre-training to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the object;
The target tracking sub-module is used for matching the same targets in each frame image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame image;
the distance calculation sub-module is used for determining the actual space distance between different targets in each frame of image;
and the model generation sub-module is used for generating a space and or graph model of the preset scene by utilizing the attribute information of the target corresponding to each frame of matched image and the actual space distance.
Optionally, the object detection network comprises a yolo_v3 network; the preset multi-target tracking algorithm comprises the DeepSORT algorithm.
Optionally, the distance calculation sub-module is specifically configured to:
in each frame of image, determining the pixel coordinates of each target;
for each target, calculating the corresponding actual coordinates of the pixel coordinates of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and for each frame image, obtaining the actual space distance between each two targets in the frame image by using the actual coordinates of each two targets in the frame image.
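One common way to realize such a distance calculation is sketched below, under the assumption that a ground-plane homography has been calibrated for the fixed camera so that pixel coordinates can be mapped to world coordinates in meters; the homography values shown are placeholders, not calibration results from the method.

# Illustrative sketch of the distance calculation sub-module: map a pixel
# (e.g. the bottom center of a bounding box) to world coordinates via a
# calibrated ground-plane homography, then take the Euclidean distance.
import numpy as np

H = np.array([[0.02, 0.0,   -5.0],
              [0.0,  0.05,  -8.0],
              [0.0,  0.001,  1.0]])   # hypothetical calibration result

def pixel_to_world(u, v):
    """Map a pixel (u, v) to world coordinates (x, y) on the ground plane."""
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w

def actual_distance(pixel_a, pixel_b):
    xa, ya = pixel_to_world(*pixel_a)
    xb, yb = pixel_to_world(*pixel_b)
    return float(np.hypot(xa - xb, ya - yb))   # Euclidean distance in meters

# Example: distance between two targets given their bounding-box bottom centers
print(actual_distance((320, 400), (500, 420)))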
Optionally, the sub-activity extraction module 1003 is specifically configured to:
determining pairs of targets in the space and or graph model whose actual spatial distance is smaller than a preset distance threshold as targets of interest;
For each frame of image, determining the actual space distance of each pair of objects of interest and the speed value of each object of interest;
obtaining distance change information representing the actual spatial distance change condition of each pair of concerned targets and speed change information representing the speed value change condition of each concerned target by sequentially comparing the next frame image and the previous frame image;
and describing distance change information and speed change information which are sequentially obtained by each concerned object by using semantic tags, and generating a sub-activity tag set for representing the activity state of each concerned object.
Optionally, the construction process of the time and or graph model comprises the following steps:
observing a sample scene video of a preset scene, extracting corpus of various events about a target in the sample scene video, and establishing an active corpus of the target of the preset scene;
for an active corpus of targets of a preset scene, learning the symbol grammar structure of each event by using a grammar induction algorithm based on ADIOS, and taking the sub-activities as terminal nodes of a time and or graph to obtain a time and or graph model; in the activity corpus of targets of the preset scene, the activity state of a target is represented by sub-activity labels, and an event is composed of a set of sub-activities.
Optionally, the target activity prediction module 1004 is specifically configured to:
inputting the sub-activity label set into a time and or graph model, and obtaining a prediction result of the future activity of the target under a preset scene by using an online symbol prediction algorithm of an Earley analyzer, wherein the prediction result comprises the sub-activity label of the future of the target and an occurrence probability value.
For specific execution of each module, please refer to the method steps of the first aspect, which are not described herein.
In the scheme provided by the embodiment of the invention, the space-time sum or the graph is introduced into the field of target activity prediction for the first time. Firstly, target detection and tracking are carried out on a scene video of a preset scene, a space and or graph model of the preset scene is generated, and the space and or graph is used for representing the space position relationship between targets. And performing sub-activity extraction on the space and or graph model to obtain a sub-activity tag set of the concerned target, thereby realizing the advanced semantic extraction of the scene video. The sub-activity tag set is then used as input to a pre-obtained time and or graph model, and the prediction of the next sub-activity is obtained through the time grammar of the time and or graph. Therefore, the aim of effectively predicting the activity of the target in the preset scene can be fulfilled. The scheme provided by the embodiment of the invention can be universally applied to various scenes and has wide applicability.
In a third aspect, an embodiment of the present invention further provides an electronic device, as shown in fig. 11, including a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102, and the memory 1103 complete communication with each other through the communication bus 1104,
a memory 1103 for storing a computer program;
the processor 1101 is configured to implement the steps of the scene target activity prediction method based on space-time sum or graph according to the first aspect when executing the program stored on the memory 1103.
The electronic device may be: desktop computers, portable computers, intelligent mobile terminals, servers, etc. Any electronic device capable of implementing the present invention is not limited herein, and falls within the scope of the present invention.
The communication bus mentioned above for the electronic device may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In a fourth aspect, corresponding to the space-time and graph-based scene target activity prediction method provided in the first aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the space-time and graph-based scene target activity prediction method provided in the embodiment of the present invention are implemented.
The computer readable storage medium stores an application program for executing the scene target activity prediction method based on space-time and or graph provided by the embodiment of the invention at the time of running.
It should be noted that, the apparatus, the electronic device, and the storage medium according to the embodiments of the present invention are the apparatus, the electronic device, and the storage medium to which the above-described space-time and/or graph-based scene target activity prediction method is applied, and all the embodiments of the above-described space-time and/or graph-based scene target activity prediction method are applicable to the apparatus, the electronic device, and the storage medium, and the same or similar beneficial effects can be achieved.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
The foregoing is merely illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (7)

1. A scene target activity prediction method based on space-time and or graph, comprising:
acquiring a scene video aiming at a preset scene;
generating a space and or graph model of the preset scene by detecting and tracking targets in the scene video; the space and or graph model characterizes the space position relation of the target in the scene video;
obtaining a sub-activity label set representing the activity state of the target of interest by using a sub-activity extraction algorithm for the space and/or graph model;
inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of future activity of the target of interest in the preset scene; the time and or graph model is obtained by utilizing a pre-established active corpus of targets of the preset scene;
the generating the space and or graph model of the preset scene by detecting and tracking the target in the scene video comprises the following steps:
Detecting targets in the scene video by utilizing a target detection network obtained through pre-training to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the object;
based on the attribute information corresponding to each target in each frame of image, matching the same targets in each frame of image of the scene video by using a preset multi-target tracking algorithm;
determining the actual spatial distance between different targets in each frame of image;
generating a space and or graph model of the preset scene by using the attribute information of the target corresponding to each frame of matched image and the actual space distance;
wherein the obtaining a sub-activity tag set characterizing an activity state of the object of interest for the spatial and/or graph model using a sub-activity extraction algorithm comprises:
determining paired targets with the actual space distance smaller than a preset distance threshold value in the space and or graph model as concerned targets;
for each frame of image, determining the actual space distance of each pair of the concerned targets and the speed value of each concerned target;
Obtaining distance change information representing the actual spatial distance change condition of each pair of concerned targets and speed change information representing the speed value change condition of each concerned target by sequentially comparing the next frame image and the previous frame image;
describing the distance change information and the speed change information which are sequentially obtained by each concerned object by using semantic tags, and generating a sub-activity tag set for representing the activity state of each concerned object;
the construction process of the time and or graph model comprises the following steps:
observing a sample scene video of the preset scene, extracting corpus of various events about a target in the sample scene video, and establishing an active corpus of the target of the preset scene;
for an active corpus of targets of the preset scene, learning a symbol grammar structure of each event by using a grammar induction algorithm based on ADIOS, and taking sub-activities as terminal nodes of a time and or graph to obtain the time and or graph model; the activity state of the target is represented by sub-activity labels in an activity corpus of the target of the preset scene, and the event is composed of a set of the sub-activities.
2. The method according to claim 1, wherein,
the object detection network comprises a yolo_v3 network; the preset multi-target tracking algorithm comprises the DeepSORT algorithm.
3. The method according to claim 1 or 2, wherein said determining the actual spatial distance between different objects in each frame of images comprises:
in each frame of image, determining the pixel coordinates of each target;
for each target, calculating the corresponding actual coordinates of the pixel coordinates of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and for each frame image, obtaining the actual space distance between each two targets in the frame image by using the actual coordinates of each two targets in the frame image.
4. The method of claim 3, wherein inputting the sub-activity tag set into a pre-obtained time and or graph model to obtain a predicted result of the future activity of the target of interest in the preset scene comprises:
inputting the sub-activity label set into the time and or graph model, and obtaining a prediction result of the future activity of the target of interest in the preset scene by using an online symbol prediction algorithm of an Earley parser, wherein the prediction result comprises the future sub-activity label of the target of interest and an occurrence probability value.
5. A scene-based target activity prediction apparatus, comprising:
the scene video acquisition module is used for acquiring scene videos aiming at preset scenes;
the space and or graph model generation module is used for generating a space and or graph model of the preset scene by detecting and tracking targets in the scene video; the space and or graph model characterizes the space position relation of the target in the scene video;
the sub-activity extraction module is used for obtaining a sub-activity tag set for representing the activity state of the concerned target by using a sub-activity extraction algorithm for the space and/or graph model;
the target activity prediction module is used for inputting the sub-activity label set into a pre-obtained time and or graph model to obtain a prediction result of the future activity of the target of interest in the preset scene; the time and or graph model is obtained by utilizing a pre-established active corpus of targets of the preset scene;
the generating the space and or graph model of the preset scene by detecting and tracking the target in the scene video comprises the following steps:
detecting targets in the scene video by utilizing a target detection network obtained through pre-training to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the object;
Based on the attribute information corresponding to each target in each frame of image, matching the same targets in each frame of image of the scene video by using a preset multi-target tracking algorithm;
determining the actual spatial distance between different targets in each frame of image;
generating a space and or graph model of the preset scene by using the attribute information of the target corresponding to each frame of matched image and the actual space distance;
wherein the obtaining a sub-activity tag set characterizing an activity state of the object of interest for the spatial and/or graph model using a sub-activity extraction algorithm comprises:
determining paired targets with the actual space distance smaller than a preset distance threshold value in the space and or graph model as concerned targets;
for each frame of image, determining the actual space distance of each pair of the concerned targets and the speed value of each concerned target;
obtaining distance change information representing the actual spatial distance change condition of each pair of concerned targets and speed change information representing the speed value change condition of each concerned target by sequentially comparing the next frame image and the previous frame image;
describing the distance change information and the speed change information which are sequentially obtained by each concerned object by using semantic tags, and generating a sub-activity tag set for representing the activity state of each concerned object;
The construction process of the time and or graph model comprises the following steps:
observing a sample scene video of the preset scene, extracting corpus of various events about a target in the sample scene video, and establishing an active corpus of the target of the preset scene;
for an active corpus of targets of the preset scene, learning a symbol grammar structure of each event by using a grammar induction algorithm based on ADIOS, and taking sub-activities as terminal nodes of a time and or graph to obtain the time and or graph model; the activity state of the target is represented by sub-activity labels in an activity corpus of the target of the preset scene, and the event is composed of a set of the sub-activities.
6. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method steps of any of claims 1-4 when executing a program stored on the memory.
7. A computer-readable storage medium comprising,
The computer readable storage medium has stored therein a computer program which, when executed by a processor, carries out the method steps of any of claims 1-4.