CN109117771B

CN109117771B - System and method for detecting violence events in image based on anchor nodes

Info

Publication number: CN109117771B
Application number: CN201810863149.0A
Authority: CN
Inventors: 殷光强; 田玲; 张栗粽; 包益全
Original assignee: Sichuan Dianke Weiyun Information Technology Co ltd
Current assignee: Sichuan Dianke Weiyun Information Technology Co ltd
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2022-05-27
Anticipated expiration: 2038-08-01
Also published as: CN109117771A

Abstract

The invention relates to video image processing technology and discloses an anchor node-based method for detecting violent events in images, which automatically analyzes and detects violent behavior in a video stream, improves the real-time performance and accuracy of detection, and adapts to a variety of monitoring environments. The method comprises the following steps: a. obtaining sample features from all feature vectors of the image data set and solving for the anchor nodes from the sample features; b. calculating the hash code corresponding to each sample in the data set according to the similarity between the anchor nodes and the samples in the image data set; c. taking the hash code corresponding to each sample of the image data set as input and using a trained SVM model to perform image prediction, so as to judge whether a violent event exists in the image. In addition, the invention also discloses an anchor node-based system for detecting violent events in images, which is suitable for a variety of application scenarios.

Description

System and method for detecting violence events in image based on anchor nodes

Technical Field

The invention relates to a video image processing technology, in particular to a system and a method for detecting a violent event in an image based on an anchor node.

Background

With the heavy use of surveillance systems, video data has seen explosive growth. The monitoring system is used for detecting targets and abnormal behaviors. With the rapid growth of data, the traditional way of relying on manual monitoring has become increasingly difficult and inefficient. Therefore, the research of monitoring systems relying on artificial intelligence has become a hot spot. Among them, the detection of violent behavior of a human is a very important research direction.

Because violent behavior involves far more complicated motion than simple running or jumping, detecting it remains a difficulty in the related research. At present, traditional violent behavior detection mainly relies on hand-crafted features. Although the recognition accuracy of these methods is high, they have certain defects: for example, they cannot run in real time and are easily affected by noise.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a system and a method for detecting violent events in images based on anchor nodes, which automatically analyze and detect violent behavior in a video stream, improve the real-time performance and accuracy of detection, and adapt to a variety of monitoring environments.

The technical scheme adopted by the invention to solve the technical problem is as follows: an in-image violence event detection system based on anchor nodes, comprising an anchor node extraction module, a hash code calculation module and an image prediction module;

the anchor node extraction module is used for obtaining sample characteristics according to all the characteristic vectors of the image data set and solving anchor nodes according to the sample characteristics;

the hash code calculation module is used for calculating the hash codes corresponding to the samples in the data set according to the similarity between the anchor node and the samples in the image data set;

the image prediction module is used for taking the hash code corresponding to each sample of the image data set as input and adopting a trained SVM model to carry out image prediction so as to judge whether a violent event exists in the image.

As a further optimization, the anchor node extraction module is specifically configured to obtain joint point data of people in the image on the image data set by using a human skeleton feature extraction method, compare joint points between two adjacent frames to obtain a displacement vector of each joint point as a sample feature, and use a clustering algorithm to obtain the anchor node.

As a further optimization, the hash code calculation module is specifically configured to calculate an approximate similarity matrix between the anchor nodes and all samples of the image data set, approximate the similarity matrix through the approximate similarity matrix, obtain an auxiliary matrix that replaces the similarity matrix, calculate the eigenvalues and eigenvectors of the auxiliary matrix, and finally calculate the hash codes corresponding to the samples in the data set according to the eigenvalues and eigenvectors.

As a further optimization, using the trained SVM model to perform image prediction so as to judge whether a violent event exists in the image specifically includes:

inputting the hash code corresponding to each sample of the image data set to be predicted into the trained SVM model; if the output for a sample is 1, judging the sample to be a violent frame; if the output is 0, judging the sample to be a non-violent frame; and finally judging whether a violent event exists according to whether the proportion of violent frames exceeds a preset threshold.

As further optimization, the anchor node extraction module, the hash code calculation module and the image prediction module are all deployed on the same server; or, the anchor node extraction module and the hash code calculation module are deployed on the same server, and the image prediction module is deployed on another server.

In addition, based on the system, the invention also provides a method for detecting the violence incident in the image based on the anchor node, which comprises the following steps:

a. obtaining sample features according to all feature vectors of the image data set, and solving for the anchor nodes according to the sample features;

b. calculating hash codes corresponding to the samples in the data set according to the similarity between the anchor nodes and the samples in the image data set;

c. taking the hash code corresponding to each sample of the image data set as input and using a trained SVM model to perform image prediction, so as to judge whether a violent event exists in the image.

As a further optimization, step a specifically includes:

the human skeleton feature extraction method is adopted to obtain the joint point data of people in the image on the image data set, then the joint points between two adjacent frames are compared to obtain the displacement vector of each joint point as the sample feature, and the clustering algorithm is used to calculate the anchor node.
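Step a can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the joint points have already been extracted by some skeleton estimator and are given as an array of (x, y) coordinates per frame, and it uses plain k-means as the clustering algorithm, which the text does not specify. The function names `displacement_features` and `kmeans_anchors` are hypothetical.

```python
import numpy as np

def displacement_features(joints):
    """Turn per-frame joint coordinates into displacement-vector features.

    joints: array of shape (frames, num_joints, 2) holding (x, y) per joint
    (assumed to come from an external skeleton extractor).
    Returns one flattened feature row per pair of adjacent frames.
    """
    joints = np.asarray(joints, dtype=float)
    disp = joints[1:] - joints[:-1]            # displacement of each joint
    return disp.reshape(disp.shape[0], -1)     # (frames - 1, num_joints * 2)

def kmeans_anchors(features, num_anchors, iters=20, seed=0):
    """Pick anchor nodes as k-means centroids (k-means is an assumption)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=num_anchors, replace=False)
    anchors = features[start].copy()
    for _ in range(iters):
        # Squared distance from every sample to every anchor.
        d2 = ((features[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(num_anchors):
            members = features[labels == k]
            if len(members):                   # keep old anchor if cluster is empty
                anchors[k] = members.mean(axis=0)
    return anchors
```

The number of anchors is a free parameter here; the description only states that a small number of anchor nodes is obtained.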

As a further optimization, step b specifically includes:

calculating an approximate similarity matrix between the anchor nodes and all samples of the image data set, approximating the similarity matrix through the approximate similarity matrix, solving for an auxiliary matrix that replaces the similarity matrix, calculating the eigenvalues and eigenvectors of the auxiliary matrix, and finally calculating the hash codes corresponding to the samples in the data set according to the eigenvalues and eigenvectors.
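Step b can be sketched as follows. The text does not give formulas, so this sketch assumes the published Anchor Graph Hashing formulation, which matches the description: a sparse sample-to-anchor matrix Z, an implied similarity matrix Z Λ⁻¹ Zᵀ that is never built, and a small m × m auxiliary matrix M = Λ^(-1/2) Zᵀ Z Λ^(-1/2) whose eigenvectors yield the codes. The Gaussian kernel, the parameter `s` (number of nearest anchors per sample) and the function names are assumptions.

```python
import numpy as np

def anchor_similarity(samples, anchors, s=2, bandwidth=None):
    """Approximate similarity matrix Z (n x m) between samples and anchors.

    Each row keeps only its s nearest anchors (Gaussian kernel, an assumed
    choice) and is normalised to sum to 1.
    """
    X = np.asarray(samples, dtype=float)
    U = np.asarray(anchors, dtype=float)
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :s]
    rows = np.arange(len(X))[:, None]
    if bandwidth is None:
        bandwidth = d2[rows, nearest].mean() + 1e-12
    Z = np.zeros_like(d2)
    Z[rows, nearest] = np.exp(-d2[rows, nearest] / bandwidth)
    Z /= Z.sum(axis=1, keepdims=True)
    return Z

def hash_codes(Z, num_bits):
    """Binary codes from the small auxiliary matrix M instead of the n x n matrix."""
    n, _ = Z.shape
    lam = Z.sum(axis=0)                                   # anchor degrees (diagonal of Lambda)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(lam, 1e-12))
    M = inv_sqrt[:, None] * (Z.T @ Z) * inv_sqrt[None, :]  # m x m auxiliary matrix
    eigvals, eigvecs = np.linalg.eigh(M)                  # eigenvalues in ascending order
    # Skip the largest (trivial) eigenvalue and keep the next num_bits.
    order = np.argsort(eigvals)[::-1][1:num_bits + 1]
    sigma = np.maximum(eigvals[order], 1e-12)
    W = np.sqrt(n) * inv_sqrt[:, None] * eigvecs[:, order] / np.sqrt(sigma)[None, :]
    Y = (Z @ W > 0).astype(np.uint8)                      # one row of bits per sample
    return Y, W
```

Under the same assumption, a stored projection `W` together with the stored anchors is enough to hash new samples as `anchor_similarity(new_samples, anchors, s, bandwidth) @ W > 0`; passing an explicit `bandwidth` keeps the kernel width identical between training and prediction.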

As a further optimization, in step c, the method for training the SVM model is:

taking the training image data set as the image data set, executing steps a and b, and training an SVM model on the obtained hash codes to obtain the trained SVM model.
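The training step can be sketched as follows. The text does not state the SVM kernel, solver or library, so this sketch assumes a linear SVM trained by full-batch subgradient descent on the regularised hinge loss; the function names, the hyperparameters and the toy hash codes are illustrative only.

```python
import numpy as np

def train_linear_svm(codes, labels, reg=0.01, epochs=200, lr=0.1):
    """Train a linear SVM on hash codes (assumed linear kernel).

    codes:  (n, bits) array of 0/1 hash bits
    labels: (n,) array with 1 for violent frames and 0 for non-violent frames
    """
    X = 2.0 * np.asarray(codes, dtype=float) - 1.0    # map bits {0, 1} to {-1, +1}
    y = 2.0 * np.asarray(labels, dtype=float) - 1.0   # map labels {0, 1} to {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                        # samples violating the margin
        grad_w = reg * w - (y[active][:, None] * X[active]).sum(axis=0) / len(X)
        grad_b = -y[active].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_frames(codes, w, b):
    """Return 1 (violent frame) or 0 (non-violent frame) per hash code."""
    X = 2.0 * np.asarray(codes, dtype=float) - 1.0
    return (X @ w + b > 0).astype(int)

# Toy data: the first bit alone decides the label (illustrative, not real codes).
toy_codes = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
toy_labels = np.array([1, 1, 0, 0])
w, b = train_linear_svm(toy_codes, toy_labels)
```

A library SVM with a non-linear kernel would fit the text equally well; the linear form is chosen here only to keep the sketch self-contained.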

As a further optimization, step c specifically includes:

inputting the hash code corresponding to each sample of the image data set to be predicted into the trained SVM model; if the output for a sample is 1, judging the sample to be a violent frame; if the output is 0, judging the sample to be a non-violent frame; and finally judging whether a violent event exists according to whether the proportion of violent frames exceeds a preset threshold.
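The final decision in step c can be sketched as follows. The threshold value is not given in the text; the default of 0.5 and the function name `has_violent_event` are assumptions for illustration.

```python
def has_violent_event(frame_labels, threshold=0.5):
    """Decide whether a video contains a violent event.

    frame_labels: per-frame SVM outputs, 1 for a violent frame, 0 otherwise.
    threshold:    preset proportion of violent frames (0.5 is an assumed value).
    """
    labels = list(frame_labels)
    if not labels:                      # no frames, nothing to report
        return False
    ratio = sum(1 for v in labels if v == 1) / len(labels)
    return ratio > threshold
```

For example, `has_violent_event([1, 1, 1, 0])` reports an event because three of four frames are violent, while `has_violent_event([1, 0, 0, 0])` does not.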

The invention has the beneficial effects that:

Existing classified images are processed and extracted into samples, the samples are used to train a model by a machine learning method, and the image to be predicted is then processed in the same way and predicted with the model.

For many different types of monitored places, the method can automatically analyze the video stream. If violent behavior is detected, the alarm device is triggered immediately, so that a manager is notified at once and can respond accordingly; the method therefore works in real time. When the image is processed, the obtained anchor nodes are used as discriminative feature vectors, which improves the efficiency and accuracy of violent behavior prediction.

Drawings

FIG. 1 is a flow chart of a method of violent event detection in the present invention;

FIG. 2 is a schematic diagram of non-real-time distributed handling of violent event detection in example 1;

FIG. 3 is a schematic diagram of the real-time centralized processing of violent event detection in example 2.

Detailed Description

The invention aims to provide a system and a method for detecting a violent event in an image based on an anchor node, which can automatically analyze and detect the violent behavior in a video stream, improve the real-time performance and the accuracy of detection and can adapt to various different monitoring environments. In the invention, after processing the existing classified images and extracting the images into samples, the samples are trained into a model by a machine learning method, and then the model is used for predicting after processing the image to be detected.

For the sake of understanding, the technical terms that may appear in the present invention are first explained:

1. Anchor node: a node selected from among many ordinary nodes, usually by a clustering algorithm or because of some other prominent feature, so that it serves as a kind of beacon.

2. SVM: the support vector machine is a supervised learning model in the field of machine learning, which aims to find a function that separates data with different labels to the greatest possible extent.

3. Similarity matrix: an n × n square matrix over n samples that stores the similarity or distance between every pair of samples.

4. Hash code: the result obtained by a hash algorithm. It is not guaranteed to be unique: the algorithm lets objects of the same class have different hash codes as far as possible according to their differing characteristics, but this does not mean that the hash codes of different objects are always different.

The system for detecting violent events in images based on anchor nodes comprises an anchor node extraction module, a hash code calculation module and an image prediction module, as follows:

The anchor node extraction module is used for obtaining sample features according to all the feature vectors of the image data set and solving for the anchor nodes according to the sample features. Specifically: first, joint point data of the people in the images are obtained on the original image data set by a human skeleton feature extraction method; then the joint points of two adjacent frames are compared to obtain a displacement vector for each joint point; finally, the displacement vectors of the joint points are used as sample features, and a small number of anchor nodes are obtained by a clustering algorithm.

The hash code calculation module is used for calculating the hash codes corresponding to the samples in the data set according to the similarity between the anchor nodes and the samples in the image data set. Specifically: first, an approximate similarity matrix Z between the anchor nodes and all samples of the data set is calculated; once Z is obtained, a similarity matrix A between the samples can be estimated from it; the hash codes could then be obtained by calculating the eigenvectors and eigenvalues of the Laplacian matrix L of the similarity matrix A. To reduce space consumption, however, a small substitute matrix M is calculated instead, the eigenvalues and eigenvectors of M are calculated, and finally the hash code corresponding to each sample of the data set is calculated and used as the input of the next step.

The image prediction module is used for taking the hash code corresponding to each sample of the image data set as input and using a trained SVM model to perform image prediction, so as to judge whether a violent event exists in the image. Specifically: the hash code corresponding to each sample of the image data set to be predicted is input into the trained SVM model; if the output for a sample is 1, the sample is judged to be a violent frame; if the output is 0, the sample is judged to be a non-violent frame; finally, whether a violent event exists is judged according to whether the proportion of violent frames exceeds a preset threshold.

Referring to fig. 1, the method for detecting a violent event in an image based on anchor nodes in the invention comprises the following steps:

1. inputting a training image data set;

in this step, the training image dataset is derived from classified image samples, including video images containing violent events and video images without violent events.

2. Obtaining joint point data of human skeleton characteristics;

in this step, joint point data of a person in an image is obtained on an image data set by adopting a human skeleton feature extraction method.

3. Comparing two adjacent frames to obtain a frame difference;

in this step, the joint points between two adjacent frames are compared to obtain the displacement vector of each joint point as the sample feature.

4. Calculating an anchor node;

in the step, the anchor nodes are calculated by using a clustering algorithm according to the sample characteristics.

5. Obtaining an approximate similarity matrix Z;

in this step, an approximate similarity matrix between the anchor node and all samples of the image dataset is calculated by a distance calculation method.

6. Calculating to obtain a substituted small matrix M through an approximate similarity matrix Z;

in this step, the similarity matrix A is approximated by means of the approximate similarity matrix Z, and a small matrix M that replaces the similarity matrix A is obtained as an auxiliary matrix.

7. Calculating eigenvalues and eigenvectors;

in this step, the eigenvalue and eigenvector of the auxiliary matrix M are calculated.

8. Calculating a hash code;

in this step, the hash code corresponding to each sample in the data set is calculated according to the eigenvalues and eigenvectors.

9. Training a sample to obtain an SVM model;

in the step, the SVM model is trained by using the obtained hash code as input, so that the trained SVM model is obtained.

10. Obtaining a hash code of a predicted video image;

in this step, a video image to be predicted is input, and its hash code is obtained by executing steps 2 to 8;

11. predicting the hash code of the predicted image sample by using an SVM model;

in the step, the hash code of the obtained prediction video image is input into a trained SVM model, so that whether the image frame is a violent frame (namely, a frame containing violent behaviors in the image) is predicted;

12. outputting a result according to the ratio of the violent frames;

in this step, if the proportion of violent frames among all frames of the input video exceeds a preset threshold, it is judged that a violent event exists in the video; otherwise, it is judged that no violent event exists.

Example 1:

The scenario described in this embodiment is a distributed application scenario that does not require real-time processing. For example, cameras in a number of streets capture images and transmit them to hosts that store the image files. Each host processes the images of the small number of cameras assigned to it, obtains prediction results, and stores them locally for the time being. Every fixed number of hours, the temporarily stored prediction results are uploaded to the respective databases. A reverse proxy server provides a query function to clients, so that a client can directly access the data of a large number of cameras without being aware of the individual databases. As shown in fig. 2, the specific steps are as follows:

Before the system is deployed, a large number of image data sets for training are collected from the network or by other means. The data sets are sorted, all image data are accurately classified, and a corresponding label is added to each class: one class consists of violent event images and the other of non-violent event images. After classification, and with the classification preserved, all images are processed by a human skeleton feature extraction method, the resulting human joint point data of each frame are taken as samples, and anchor nodes are obtained from the samples by a clustering algorithm. The anchor nodes need to be stored for the long term so that they can be used when the system runs. Then, with the classification still unchanged, the hash code corresponding to each sample is obtained through the steps in fig. 1. At the same time, the auxiliary matrix used to solve the hash code matrix is saved separately for long-term storage, so that the hash codes of samples to be predicted can be computed later. Finally, the hash codes of the whole training set and their corresponding classes are used as training samples for the SVM model, and the trained SVM model is stored for the long term for prediction when the system runs.

After the system is deployed, each camera captures images in real time and transmits them to the server corresponding to that camera. To reduce server load, each server stores the received images locally for the time being. At fixed intervals (for example, every few hours), the video accumulated from one camera is taken out and processed in a batch: the corresponding hash codes are obtained through the stored anchor nodes and auxiliary matrix, and the hash codes are classified with the SVM model.

After the prediction is finished, the detection results obtained for that period and the images used are uploaded to the database corresponding to the server for long-term storage, where clients can query them.

Because the results are respectively stored in different databases, a reverse proxy server is arranged to receive the query request of the client and then access the detection result in the corresponding database, and convenient functions such as time-interval statistical data can be provided.

Example 2:

The scenario described in this embodiment is a special place where images need to be processed centrally and in real time: the area is small, the total number of cameras is low, and the requirement for real-time monitoring of violent events is high, so the images need to be predicted centrally in real time. Each video streaming server is connected to only a small number of cameras. These servers are responsible only for storing the images and performing a small amount of processing on them to obtain hash codes, which are tagged and then sent to a single detection server. The detection server quickly classifies the hash codes and at the same time stores them in a database. A client can poll the database and raise a warning when a violent event is found. Referring to fig. 3, the specific steps are as follows:

Before the system is deployed, a large number of image data sets for training are collected from the network or by other means. The data sets are sorted, all image data are accurately classified, and a corresponding label is added to each class: one class consists of violent event images and the other of non-violent event images. After classification, and with the classification preserved, all images are processed by human skeleton feature extraction, the resulting human joint point data of each frame are taken as samples, and anchor nodes are obtained from the samples by a clustering algorithm. The anchor nodes need to be stored for the long term so that they can be used when the system runs. Then, with the classification still unchanged, the hash code corresponding to each sample is obtained through the steps in fig. 1. At the same time, the auxiliary matrix used to solve the hash code matrix is saved separately for long-term storage, so that the hash codes of samples to be predicted can be computed later. Finally, the hash codes of the whole training set and their corresponding classes are used as training samples for the SVM model, and the trained SVM model is stored for the long term for prediction when the system runs.

After the system is deployed, each camera captures images in real time and transmits them to the server corresponding to that camera. Each server stores the anchor nodes and the auxiliary matrix. To reduce the load on the prediction server, each server computes the hash codes for the received video data at short intervals (for example, every half minute) and transmits only the hash codes, whose data volume is far smaller than that of the video, to the prediction server.

The prediction server is connected to all video streaming servers in the area and stores the SVM model trained before the system runs. It classifies the received hash code sequence with the model, judges the nature of the video according to the proportion of violent frames, and stores the result in the database. The client checks whether a violent event has occurred by continuously polling the database, and raises an alarm if one has.

In the two embodiments, the database stores the detection results in the following manner:

violence event detection storage table based on anchor nodes

The table has a total of five fields, respectively:

id, i.e. a serial number, for identifying the number of the detected video stream;

tStamp, i.e. a timestamp, used to identify the time at which violent behavior detection was performed on the video stream;

cameraId, i.e. the camera ID, used to identify the number of the camera from which the video stream originates;

HashCode, i.e. the hash code, which is calculated for the samples by the hash algorithm according to the similarity between the extracted anchor nodes and the samples of the image data set, and is used as the input of the prediction model;

isViolent, i.e. whether the video stream is violent or not, is used for identifying the detection result of the violent event of the video stream, wherein 1 indicates that the video stream contains the violent event, and 0 indicates that the video stream does not contain the violent event.

When the client polls the storage table, it queries the violent behavior detection result of the video stream of a given camera by camera ID and judges whether a violent event exists from the value of the isViolent field. If the value is 1, a violent event is indicated and the corresponding alarm device is triggered automatically, so that an administrator can notify the staff on duty in the corresponding area to go to the location and deal with the event.
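The storage table and the polling query can be sketched as follows. This is an illustrative sketch only: the text names the five fields but not the database engine, the column types, or the table name, so SQLite, the types, the table name `detection_result`, the spelling `cameraId`, and the sample rows are all assumptions.

```python
import sqlite3

def create_table(conn):
    # Five fields as listed in the description; types are assumed.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS detection_result (
               id        INTEGER PRIMARY KEY AUTOINCREMENT,
               tStamp    TEXT    NOT NULL,
               cameraId  INTEGER NOT NULL,
               HashCode  TEXT    NOT NULL,
               isViolent INTEGER NOT NULL CHECK (isViolent IN (0, 1))
           )"""
    )

def store_result(conn, t_stamp, camera_id, hash_code, is_violent):
    """Record one detection result for a video stream."""
    conn.execute(
        "INSERT INTO detection_result (tStamp, cameraId, HashCode, isViolent) "
        "VALUES (?, ?, ?, ?)",
        (t_stamp, camera_id, hash_code, int(is_violent)),
    )
    conn.commit()

def poll_camera(conn, camera_id):
    """Return True if the latest result for this camera reports a violent event."""
    row = conn.execute(
        "SELECT isViolent FROM detection_result "
        "WHERE cameraId = ? ORDER BY id DESC LIMIT 1",
        (camera_id,),
    ).fetchone()
    return row is not None and row[0] == 1

# Made-up rows for illustration; an in-memory database stands in for the real one.
conn = sqlite3.connect(":memory:")
create_table(conn)
store_result(conn, "2018-08-01T12:00:00", 7, "0110", 0)
store_result(conn, "2018-08-01T12:00:30", 7, "1011", 1)
```

A client would call `poll_camera` periodically for each camera it watches and trigger the alarm device when the call returns True.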

Because the hash code and the detection result isViolent computed for each video stream are recorded in the storage table, a large amount of historical data can be extracted from the database to train the SVM model. In other words, the training samples of the SVM model are continuously enriched, and SVM prediction becomes more and more accurate.

Claims

1. An in-image violence event detection system based on anchor nodes,

characterized by comprising: an anchor node extraction module, a hash code calculation module and an image prediction module;

the anchor node extraction module is used for obtaining the joint point data of people in the image on the image data set by adopting a human skeleton feature extraction method, then comparing the joint points between two adjacent frames to obtain the displacement vector of each joint point as a sample feature, and solving the anchor node by using a clustering algorithm;

the hash code calculation module is used for calculating an approximate similarity matrix between the anchor nodes and all samples of the image data set, approximating the similarity matrix through the approximate similarity matrix, solving for an auxiliary matrix that replaces the similarity matrix, calculating the eigenvalues and eigenvectors of the auxiliary matrix, and finally calculating the hash codes corresponding to the samples in the data set according to the eigenvalues and eigenvectors;

the image prediction module is used for inputting the hash codes corresponding to the samples of the image data set and adopting the trained SVM model to perform image prediction so as to judge whether a violent event exists in the image.

2. The system for detecting the violent events in the image based on the anchor nodes as claimed in claim 1, wherein the image prediction is performed by using a trained SVM model to judge whether the violent events exist in the image, and the method specifically comprises the following steps:

3. The system for detecting the violent events in the images based on the anchor nodes as claimed in claim 1 or 2, wherein the anchor node extracting module, the hash code calculating module and the image predicting module are all deployed on the same server; or, the anchor node extraction module and the hash code calculation module are deployed on the same server, and the image prediction module is deployed on another server.

4. A method for detecting violent events in an image based on an anchor node is characterized by comprising the following steps:

a. obtaining sample features according to all feature vectors of the image data set, and solving for the anchor nodes according to the sample features;

c. taking the hash code corresponding to each sample of the image data set as input and using a trained SVM model to perform image prediction, so as to judge whether a violent event exists in the image;

the step a specifically comprises the following steps:

acquiring joint point data of people in the image on the image data set by adopting a human skeleton feature extraction method, comparing joint points between two adjacent frames to obtain a displacement vector of each joint point as a sample feature, and solving an anchor node by using a clustering algorithm;

the step b specifically comprises the following steps:

5. The method for detecting the violent events in the image based on the anchor nodes as claimed in claim 4, wherein in the step c, the method for training the SVM model is as follows:

6. The method for detecting the violent events in the image based on the anchor node as claimed in claim 4 or 5, wherein the step c specifically comprises the following steps: