CN114863370A - Complex scene high altitude parabolic identification method and system - Google Patents


Info

Publication number
CN114863370A
CN114863370A
Authority
CN
China
Prior art keywords
picture
sequence
training
parabola
altitude
Prior art date
Legal status
Granted
Application number
CN202210796750.9A
Other languages
Chinese (zh)
Other versions
CN114863370B
Inventor
孙俊
康凯
刘海峰
艾坤
王子磊
Current Assignee
Hefei Zhongke Leinao Intelligent Technology Co ltd
Original Assignee
Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Zhongke Leinao Intelligent Technology Co ltd filed Critical Hefei Zhongke Leinao Intelligent Technology Co ltd
Priority to CN202210796750.9A
Publication of CN114863370A
Application granted
Publication of CN114863370B
Legal status: Active


Classifications

    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06V10/26: Segmentation of patterns in the image field; clustering-based techniques; detection of occlusion
    • G06V10/764: Image or video recognition using classification, e.g. of video objects
    • G06V10/82: Image or video recognition using neural networks


Abstract

The invention belongs to the technical field of intelligent identification and provides a complex-scene high-altitude parabolic identification method and system. An image sequence set SI and an object picture set SO are synthesized into a training picture sequence set DI containing parabolas; a time-sequence block segmentation network is trained with DI to obtain TPSNet; and the trained TPSNet performs parabola recognition prediction on test video. By synthesizing data according to the characteristics of high-altitude parabolas, the method removes the need for the large amounts of multi-dimensional labeled high-altitude parabola data that deep learning algorithms in this field otherwise require. Through a higher down-sampling rate, the network ignores specific shallow features when extracting features and focuses on sequence features with relative semantics, solving the problem that seamless migration to each scene cannot otherwise be achieved at low cost.

Description

Complex scene high altitude parabolic identification method and system
Technical Field
The invention belongs to the technical field of intelligent identification, and particularly relates to a complex scene high-altitude parabolic identification method and system.
Background
With the continuous development of society, people's range of activity has expanded from the ground alone to the ground and above, and one major problem this brings is objects being thrown or dropped from height. When residents of high-rise buildings, high-rise cleaning crews, or electric power workers climbing poles and towers neglect standard practice and simply discard tools or materials during work at height, casualty accidents are likely to follow. High-altitude parabolic recognition therefore helps both to normalize the behavior of the people involved and to trace problems back to their source.
Unlike traditional feature-extraction methods, deep learning algorithms offer good robustness and strong transferability and can extract features that better represent images. Typically a Convolutional Neural Network (CNN) is used to extract the information in a picture, but a CNN cannot directly handle tasks involving a time sequence; a Recurrent Neural Network (RNN), which processes input data cyclically during modeling, compensates well for this deficiency. Deep learning algorithms need enough labeled data for continual iterative training to improve performance, so data is a key consideration when applying them. In addition, deep learning models usually carry huge numbers of parameters, and model performance is related to parameter count, so for tasks with strict real-time requirements the trade-off between performance and efficiency must be weighed comprehensively.
Chinese patent application CN112308000A discloses a high-altitude parabolic detection method based on spatio-temporal information. Based on the idea of pixel segmentation, it collects a large number of high-altitude parabola sequence pictures (various background interference scenes, lighting conditions, and shooting angles), marks the parabola on each picture with a rectangular frame, and trains a convolutional neural network. At inference time, N consecutive video frames are fed into the network to predict a map on which the parabola sequence appears, and the presence of a high-altitude parabola is decided by checking whether the sequence contains a vertically connected region of a certain length. The problems with this method are: 1) the demand for data is too high; enough data must not only be collected but each piece must also be labeled; 2) being segmentation-based, its overall computation is too heavy to meet real-time requirements; 3) it cannot suppress false alarms from objects thrown upward.
Chinese patent application CN113223081A discloses a high-altitude parabolic detection method and system based on background modeling and deep learning. A background model is built with a Gaussian mixture model; once it is established, a difference image is obtained by subtracting the background picture from each newly input video frame, image and position information of foreground targets is obtained through image preprocessing, a CNN filters noise out of the foreground and extracts features, the foreground- and feature-extraction steps are repeated, the CNN features are matched to form a foreground set, and an LSTM then performs time-sequence analysis to decide whether the object is a high-altitude parabola. The problems with this method are: 1) background modeling fails in complex scenes; 2) for the CNN to distinguish background from foreground across different application scenarios, its features must be discriminative (when matching two detections, non-discriminative features cannot tell whether they belong to the same object), which requires collecting enough training data; yet background and foreground are dynamic concepts, not fixed ones.
Chinese patent application CN111931719A discloses a high-altitude parabolic detection method and device in which multiple video frames are fed into a trained detection model consisting of a convolutional neural network and a recurrent neural network in series: the CNN identifies objects at height, and the RNN decides whether an identified object is a high-altitude parabola. The problems with this method are: 1) because it reasons about object classes, the model must be retrained when transferred to other objects; 2) training the CNN and RNN requires large amounts of real data and labeling, at very high cost.
In summary, existing high-altitude parabola detection methods have the following problems: 1) traditional algorithms lack robustness against complex backgrounds, while deep learning algorithms need large amounts of data whose collection and labeling are costly; 2) deep-learning-based algorithms are constrained by their modeling approach and are difficult to migrate seamlessly, i.e., at low cost, to each scene; 3) it is difficult to maintain both high speed and good results.
Disclosure of Invention
In order to solve the above problems, in one aspect, the present invention discloses a complex scene high altitude parabola identification method, including:
synthesizing the image sequence set SI and the object picture set SO to obtain a training picture sequence set DI with a parabola;
training a time-sequence block segmentation network with the training picture sequence set DI to obtain TPSNet;
performing parabola recognition prediction on the test video with the trained TPSNet, and judging whether a high-altitude parabola exists in the test video.
Further, the method for acquiring the image sequence set SI includes:
collecting N1 video segments, and extracting each frame of image in the video;
dividing all images in each video into a plurality of image sequence segments S with the length T according to time sequence;
all the image sequence segments obtained by the division constitute an image sequence set SI.
Further, the image sequence set SI specifically includes:
SI = {(0, 1, …, T−1), (sep, 1+sep, …, T+sep−1), …}, where (sep, 1+sep, …, T+sep−1) represents the numbered set of images within one image sequence segment S, sep represents the sampling interval, and T represents the length of a single image sequence segment S;
the image sequence set SI includes 1 + Floor((VL − T)/sep) image sequence segments, where Floor represents rounding down and VL represents the number of pictures extracted per video.
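As a hypothetical sketch (function and variable names are my own), the sliding-window split described above can be written as:

```python
def split_into_segments(frame_ids, T, sep):
    """Split VL extracted frames into length-T segments sampled every `sep` frames."""
    VL = len(frame_ids)
    return [frame_ids[start:start + T] for start in range(0, VL - T + 1, sep)]

# For VL = 10, T = 4, sep = 2 this yields 1 + floor((10 - 4) / 2) = 4 segments,
# matching the count given in the text.
segments = split_into_segments(list(range(10)), T=4, sep=2)
print(len(segments))  # 4
```

Each segment overlaps its neighbor by T − sep frames, so no parabola spanning a segment boundary is entirely missed.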
Further, the method for acquiring the object picture set SO includes:
collecting N2 object pictures, and segmenting and labeling the object pictures;
and obtaining a segmented and labeled object picture set SO.
Further, the segmenting and labeling the object picture specifically includes:
marking the required objects in the object picture to obtain a segmentation label label_i for each required object;
generating a label set corresponding to each object picture as {(im_1, label_1), (im_2, label_2), …, (im_n, label_n)}, where im_1, im_2, …, im_n are picture index numbers, label_1, label_2, …, label_n are the labels of the n required objects, and (im_n, label_n) represents the n-th group of labels in the label set;
after all the object pictures are segmented and labeled, integrating the label sets of all object pictures to obtain the picture and segmentation label set SO, expressed as {(im_i, label_i) | i ∈ [1, m]}, where (im_i, label_i) represents the i-th group of labels in the set and m indicates that there are m required objects across all the object pictures.
Further, the method for synthesizing the training picture sequence set DI comprises:
randomly selecting an object picture from the object picture set SO as a foreground picture, and selecting an image sequence segment from the image sequence set SI as a background picture;
synthesizing the selected foreground picture and the selected background picture according to the characteristic of the high-altitude parabola to obtain a training picture sequence D with the parabola;
and repeatedly synthesizing to obtain a plurality of training picture sequences D with parabolas, wherein the plurality of training picture sequences D form a training picture sequence set DI.
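A minimal sketch of the repeated synthesis loop, assuming SO is a list of (picture, label) pairs and SI a list of image sequence segments; the `synthesize` callable stands in for the parabola-characteristic synthesis detailed in the following paragraphs:

```python
import random

def build_training_set(SO, SI, synthesize, num_sequences, rng=random):
    """Repeatedly pair a random foreground object with a random background
    segment and synthesize one training picture sequence D per iteration."""
    DI = []
    for _ in range(num_sequences):
        foreground = rng.choice(SO)   # random object picture
        background = rng.choice(SI)   # random image sequence segment
        DI.append(synthesize(foreground, background))
    return DI

DI = build_training_set(["obj_a", "obj_b"], [["f0", "f1"], ["f2", "f3"]],
                        lambda fg, bg: (fg, bg), num_sequences=5)
print(len(DI))  # 5
```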
Further, the characteristics of the high altitude parabola specifically include:
presetting a longitudinal interval sequence Fall_base conforming to the law of free fall and comprising k longitudinal interval values, Fall_base = (fb_0, fb_1, fb_2, …, fb_{k−1}), where fb_{k−1} represents the k-th longitudinal interval value in Fall_base;
presetting a group of transverse disturbances XOffset = (xf_0, xf_1, xf_2, …, xf_n), where xf_n represents the (n+1)-th lateral disturbance value in XOffset;
presetting a set of area ratios Ratios = (r_0, r_1, r_2, …, r_n), where r_n represents the (n+1)-th area ratio in Ratios;
presetting a group of ordinate scaling coefficients YScale = (ys_0, ys_1, ys_2, …, ys_n), where ys_n represents the (n+1)-th ordinate scaling coefficient in YScale.
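As an illustration of the free-fall law the longitudinal interval sequence follows, one hypothetical way to generate Fall_base (g, the frame interval dt, and the pixels-per-metre factor ppm are assumed values, not from the text):

```python
def make_fall_base(k, g=9.8, dt=0.04, ppm=50.0):
    """Longitudinal drop during the i-th frame interval under free fall:
    g/2 * ((i+1)^2 - i^2) * dt^2 = g/2 * (2i + 1) * dt^2, converted to pixels.
    Successive intervals therefore grow linearly (constant second difference)."""
    return [0.5 * g * (2 * i + 1) * dt * dt * ppm for i in range(k)]

fb = make_fall_base(5)
# The difference between consecutive intervals is constant (= g * dt^2 * ppm),
# the signature of uniform acceleration.
print([round(v, 3) for v in fb])
```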
Further, the synthesizing the selected foreground picture and the selected background picture to obtain the training picture sequence D with a parabola specifically includes:
in the selected background picture, determining a position information set of the foreground picture through characteristic transformation of a high-altitude parabola;
determining the size information of the foreground picture according to the areas of the foreground picture and the background picture;
and according to the position information set and the size information of the foreground picture, sequentially combining the foreground picture and a group of background pictures into a group of training pictures with parabolas to form a training picture sequence D.
Further, the determining the position information set of the foreground picture in the selected background picture through the characteristic transformation of the high altitude parabola specifically includes:
randomly selecting a coordinate within the coverage range of the selected background picture as the initial coordinate loc_base;
randomly selecting a position index ind from the interval [0, T−3) as the index at which the parabola first appears, and randomly selecting from the longitudinal interval sequence Fall_base a consecutive sub-sequence of length T−ind−1, (fb_j, …, fb_{j+T−ind−2}), where fb_j represents the (j+1)-th longitudinal interval value of Fall_base and fb_{j+T−ind−2} represents the (j+T−ind−1)-th longitudinal interval value of Fall_base;
randomly selecting a scaling coefficient ys from the ordinate scaling coefficients YScale to obtain a new longitudinal interval sequence (ys·fb_j, …, ys·fb_{j+T−ind−2});
randomly selecting an abscissa perturbation sequence (xf_{k_0}, xf_{k_1}, …, xf_{k_{T−ind−1}}) from the transverse disturbances XOffset, where k_i ∈ N (i = 0, 1, 2, …, T−ind−1) and xf_{k_{T−ind−1}} denotes the k_{T−ind−1}-th abscissa perturbation value in XOffset;
adding the new interval sequence and the abscissa perturbation sequence to the corresponding coordinate values of the initial coordinate loc_base to obtain the position sequence of the foreground picture in the background pictures of one image sequence segment: loc = (loc_base + xf_{k_0}, loc_base + xf_{k_1} + ys·fb_j, …, loc_base + xf_{k_{T−ind−2}} + ys·fb_{j+T−ind−3}, loc_base + xf_{k_{T−ind−1}} + ys·fb_{j+T−ind−2}).
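The position sequence above can be sketched as follows (names follow the text; loc_base is treated as an (x, y) pixel coordinate, and one longitudinal interval term is applied per step, as written):

```python
def make_position_sequence(loc_base, xoff, ys, fall_base, j, n):
    """Return n = T - ind positions: the first carries only its abscissa
    perturbation, each later one also adds the scaled longitudinal interval."""
    x0, y0 = loc_base
    locs = [(x0 + xoff[0], y0)]
    for step in range(1, n):
        locs.append((x0 + xoff[step], y0 + ys * fall_base[j + step - 1]))
    return locs

locs = make_position_sequence((100, 50), [0, 1, -1], 1.0, [5, 15, 25, 35], j=0, n=3)
print(locs)  # [(100, 50), (101, 55.0), (99, 65.0)]
```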
Further, the determining the size information of the foreground picture according to the areas of the foreground picture and the background picture specifically includes:
calculating the object area A_fi in the foreground picture and the picture area A_bi of the background picture;
randomly selecting a ratio r from the area ratios to obtain the new object area n_A_fi = A_bi · r;
further obtaining the scaling factor of the object, fr = sqrt(n_A_fi / A_fi);
generating, from the segmentation label label_i in the foreground picture, the area mask_i covered by each required object;
scaling the object picture im_i and the covered area mask_i according to the scaling factor fr to obtain a new object picture n_im_i with a parabola and its mask n_mask_i.
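A sketch of the size computation, assuming the linear scale factor is the square root of the area ratio (areas scale with the square of linear size); the garbled formula in the source text is reconstructed under that assumption:

```python
import math

def object_scale(A_fi, A_bi, r):
    """Scale factor fr that makes the object's area equal A_bi * r."""
    n_A_fi = A_bi * r          # new object area as a ratio r of the background area
    return math.sqrt(n_A_fi / A_fi)

print(object_scale(A_fi=100, A_bi=40000, r=0.01))  # 2.0
```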
Further, the sequentially synthesizing the foreground picture and the set of background pictures into the set of training pictures with the parabola specifically includes:
establishing index sequence numbers (0, 1, 2, …, T−1) for the background pictures of an image sequence segment;
sequentially selecting the background picture and the position coordinates in the position sequence loc;
judging whether the index sequence number of the background picture is smaller than the position index ind;
skipping the background picture when the background picture index sequence number is smaller than the position index ind;
when the index number of the background picture is not less than the position index ind, taking the new object picture n_im_i with a parabola and the covered area n_mask_i and performing data transformation on them; then superposing the new object picture onto the corresponding background picture at the position coordinate from the position sequence loc to synthesize one training picture.
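The per-frame compositing logic above can be sketched as follows, with `paste` standing in for the overlay-plus-data-transformation step:

```python
def synthesize_sequence(background_frames, ind, loc, paste):
    """Frames with index below ind are left unchanged; from ind onward the
    object is pasted at the matching coordinate from the position sequence."""
    out = []
    for t, frame in enumerate(background_frames):   # index sequence 0..T-1
        if t < ind:
            out.append(frame)                       # parabola not yet visible
        else:
            out.append(paste(frame, loc[t - ind]))
    return out

frames = synthesize_sequence(["f0", "f1", "f2", "f3"], ind=2,
                             loc=[(1, 1), (2, 2)],
                             paste=lambda f, p: f + str(p))
print(frames)  # ['f0', 'f1', 'f2(1, 1)', 'f3(2, 2)']
```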
Further, the time sequence block segmentation network comprises a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer;
the feature extraction layer is based on a CNN and is used for extracting picture features: the input picture sequence of length T with 3 channels is spliced in order into data of length 1 with 3T channels, and the number of input channels of the first convolution layer of the backbone network used is changed to 3T;
the feature conversion layer is used for converting the picture features extracted by the feature extraction layer into the acceptable input of the time sequence feature extraction layer according to the number of the divided blocks;
the time sequence feature extraction layer is used for inputting the transformed picture features into a recurrent neural network to obtain the features on the time sequence dimension;
and the classification layer is used for classifying the extracted time sequence characteristics and judging whether a parabola exists.
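A pure-Python shape walkthrough of the four layers (all sizes are assumed for illustration; no deep-learning framework is needed to see the channel splicing and block split):

```python
def tpsnet_shapes(T=8, H=512, W=512, down=32, C_feat=256, P=16):
    """Trace tensor shapes through the time-sequence block segmentation network."""
    stacked = (3 * T, H, W)                # feature extraction: T RGB frames spliced on channels
    fmap = (C_feat, H // down, W // down)  # backbone output after heavy down-sampling
    # feature transform: regroup the map into P blocks acceptable to the RNN
    block_feat = (fmap[1] * fmap[2] // P) * C_feat
    rnn_input = (P, block_feat)            # P steps for the time-sequence feature layer
    scores = (P,)                          # classification: one parabola score per block
    return stacked, fmap, rnn_input, scores

print(tpsnet_shapes())  # ((24, 512, 512), (256, 16, 16), (16, 4096), (16,))
```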
Further, the method for obtaining the TPSNet comprises the following steps:
performing data enhancement transformation on the training picture sequence set DI, inputting it into the time-sequence block segmentation network, and optimizing with a loss function to obtain the converged network TPSNet.
Further, the performing parabola identification prediction on the test video and judging whether a high altitude parabola exists in the test video specifically includes:
acquiring a test video, dividing the test video into a plurality of sequences with the length of T, and identifying and predicting by using TPSNet;
obtaining the score of each sequence of high altitude parabolas, and recording the score;
and judging whether a high-altitude parabolic object exists in the test video or not according to the score, and if the high-altitude parabolic object exists in the test video, alarming and recording a corresponding video image segment.
Further, obtaining a score of each sequence of high altitude parabolas, and recording the score specifically includes:
deriving the scores predicting whether each block of each sequence contains a parabola, scores = (score_0, score_1, …, score_{P−1}), where P is the number of blocks into which the input picture is divided;
screening the score of each block: if a score is smaller than a set first threshold, it is filtered out and the score value is not recorded; if a score is not less than the first threshold, the score value is recorded; finally a qualified effective score sequence set is obtained.
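The screening step above can be sketched as:

```python
def screen_scores(block_scores, threshold1):
    """Keep only block scores not less than the first threshold; the result
    forms the effective score sequence set for one sequence."""
    return [s for s in block_scores if s >= threshold1]

print(screen_scores([0.1, 0.6, 0.9, 0.3], threshold1=0.5))  # [0.6, 0.9]
```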
Further, the determining whether a high altitude parabola exists in the test video specifically includes:
judging whether the number of entries in the effective score sequence set is greater than a preset sequence length, and whether the number of entries exceeding a second threshold is greater than a preset count;
when the number of entries in the effective score sequence set is greater than the preset sequence length and the number of entries exceeding the second threshold is greater than the preset count, alarming and recording the corresponding video image segment.
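A minimal sketch of the two-condition alarm decision (the threshold and count values in the example call are assumptions):

```python
def should_alarm(valid_scores, min_len, threshold2, min_count):
    """Alarm only when the effective score set is long enough AND enough of
    its entries exceed the second threshold."""
    if len(valid_scores) <= min_len:
        return False
    strong = sum(1 for s in valid_scores if s > threshold2)
    return strong > min_count

print(should_alarm([0.7, 0.8, 0.9, 0.95], min_len=3, threshold2=0.85, min_count=1))  # True
```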
In another aspect, the invention further discloses a complex scene high altitude parabolic recognition system, which comprises:
the training picture generation module is used for synthesizing the image sequence set SI and the object picture set SO to obtain a training picture sequence set DI with a parabola;
the time sequence patch segmentation network training module is used for training a time sequence block segmentation network by using a training picture sequence set DI to obtain TPSNet;
and the detection and identification module is used for carrying out parabolic identification prediction on the test video by using the TPSNet obtained after training and judging whether high-altitude parabolic exists in the test video.
Further, the training picture generation module specifically includes:
the video frame extracting unit is used for collecting N1 video segments and extracting each frame image in the video;
the background image dividing unit is used for dividing all images in each video into a plurality of image sequence segments S with the length T according to time sequence; all the image sequence segments obtained by division form an image sequence set SI;
the object picture labeling unit is used for collecting N2 object pictures and carrying out segmentation and labeling on the object pictures; obtaining a segmented and labeled object picture set SO;
the image selecting unit is used for randomly selecting an object image from the object image set SO as a foreground image and selecting an image sequence segment from the image sequence set SI as a background image;
the picture synthesis unit is used for synthesizing the selected foreground picture and the selected background picture according to the characteristic of the high-altitude parabola to obtain a training picture sequence D with the parabola; and repeatedly synthesizing to obtain a plurality of training picture sequences D with parabolas, wherein the plurality of training picture sequences D form a training picture sequence set DI.
Further, the time-sequence patch segmentation network training module is specifically configured to perform data enhancement transformation on the training picture sequence set DI, input the result into the time-sequence block segmentation network, and optimize with a loss function to obtain the converged network TPSNet;
the time sequence block segmentation network comprises a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer;
the feature extraction layer is based on a CNN and is used for extracting picture features: the input picture sequence of length T with 3 channels is spliced in order into data of length 1 with 3T channels, and the number of input channels of the first convolution layer of the backbone network used is changed to 3T;
the feature conversion layer is used for converting the picture features extracted by the feature extraction layer into the acceptable input of the time sequence feature extraction layer according to the number of the divided blocks;
the time sequence feature extraction layer is used for inputting the transformed picture features into a recurrent neural network to obtain the features on the time sequence dimension;
and the classification layer is used for classifying the extracted time sequence characteristics and judging whether a parabola exists.
Further, the detection and identification module specifically includes:
the test video processing unit is used for acquiring a test video, dividing the test video into a plurality of sequences with the length of T, and performing identification prediction by using TPSNet; obtaining the score of each sequence of high altitude parabolas, and recording the score;
and the identification and judgment unit is used for judging whether the high-altitude parabolic object exists in the test video according to the score, and alarming and recording the corresponding video image segment if the high-altitude parabolic object exists in the test video.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a complex scene high-altitude parabolic recognition method, which provides a data synthesis method and a new modeling mode in the execution process of the recognition method, avoids the requirement on large-scale collection of data, has good generalization in different scenes and is high in speed:
1. By synthesizing data according to the characteristics of high-altitude parabolas, the method removes the need for the large amounts of multi-dimensional labeled high-altitude parabola data that deep learning algorithms in this field otherwise require.
2. Through a higher down-sampling rate, the network ignores specific shallow features during feature extraction and focuses on sequence features with relative semantics, solving the problem that seamless, low-cost migration to each scene cannot otherwise be achieved.
3. Unlike the commonly adopted strong localization, the method adopts weak localization, which lowers the localization requirement on the model while still facilitating subsequent manual checking, so an efficient model can be constructed. Sequence information is coupled into the feature dimension at input and features are then extracted, ensuring efficient image feature extraction; the feature dimension is afterwards converted back into the sequence dimension, so a sequence model can integrate the extracted semantic information over time, ensuring the final performance. This solves the problem that high speed and good results are difficult to maintain simultaneously.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 illustrates a data synthesis and model training flow diagram of an embodiment of the present invention;
fig. 2 shows a flow chart of the parabolic identification prediction according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The high-altitude parabolic identification method based on data synthesis and time sequence block segmentation mainly comprises the following steps:
the data synthesis and model training process shown in fig. 1 includes the following steps 1-4; and a parabolic identification prediction flow shown in fig. 2, as follows step 5.
Step 1: collecting N1 video segments, extracting each frame in the video, and dividing an image sequence in each video according to the length T to obtain an image sequence set SI;
step 2: collecting N2 object pictures, and carrying out segmentation and labeling to obtain a segmented and labeled object picture set SO;
and step 3: randomly selecting a picture from the SO, randomly selecting a sequence S from the image sequence set SI, synthesizing a picture sequence D with a parabola according to the characteristics of the high altitude parabola (the longitudinal coordinate basically accords with the law of free fall, the rotation occurs in the falling process, and a certain transverse offset exists in the falling process), and repeatedly synthesizing to finally obtain a picture sequence set DI with the parabola;
and 4, step 4: training a time sequence block segmentation network with the synthesized image sequence set DI, positioning in a parabolic weak positioning mode, namely locating the vertical region in which a parabola appears in a sequence; the input sequence dimension is integrated into the feature dimension while the features of the picture are extracted, a feature conversion layer splits the sequence information out of the feature dimension, and a sequence model performs sequence modeling on the features rich in semantic information, finally obtaining TPSNet (Time-sequence Patch Segmentation Network); the sequence model refers to a model that can extract the correlation of consecutive sequences in the time dimension.
And 5: inputting a test video, predicting with the trained TPSNet to obtain the high-altitude parabola prediction scores for each sequence, recording the scores with a recorder, judging according to the sensitivity whether a high-altitude thrown object is present, and if so, alarming and recording the corresponding video image.
Specifically, step 1 comprises the following steps:
step 1.1, collecting N1 video segments, performing frame extraction on each video, and numbering the frames of each video (assumed to consist of VL pictures) from 0, obtaining 1 + Floor((VL-T)/sep) groups of image sequence segments {(0, 1, ..., T-1), (sep, 1+sep, ..., T+sep-1), ......}, wherein sep represents the sampling interval, and (sep, 1+sep, ..., T+sep-1) represents the number set of images in one group of image sequence segments S; T represents the length of a single sequence, and Floor represents rounding down; finally the set SI of the image sequences of all videos is obtained;
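The grouping rule of step 1.1 can be sketched in code as follows; this is an illustrative fragment, not part of the patent, and the names (extract_sequences, frames) are assumptions:

```python
# Split VL extracted frames into overlapping index groups of length T with
# sampling interval sep: {(0,...,T-1), (sep,...,T+sep-1), ...},
# giving 1 + floor((VL-T)/sep) groups, as described in step 1.1.
import math

def extract_sequences(frames, T, sep):
    """Return the list of frame-number groups for one video."""
    VL = len(frames)
    n_groups = 1 + math.floor((VL - T) / sep)
    return [list(range(start, start + T))
            for start in range(0, n_groups * sep, sep)]

groups = extract_sequences(list(range(10)), T=5, sep=1)
# 1 + floor((10-5)/1) = 6 groups; the last group is (5, 6, 7, 8, 9)
```

With T = 5 and sep = 1 (the values of the embodiment), consecutive groups overlap by T-1 frames, so every frame participates in several training sequences.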
for step 2, the following steps are included:
step 2.1: collecting N2 object pictures and labeling the required objects in the pictures to obtain the segmentation label label_i of each required object. Assuming that there are n required objects in a picture im1 (im1 here is a numerical value used to index a specific picture), the annotation set generated for that picture is {(im_1, label_1), (im_2, label_2), ......, (im_n, label_n)}, wherein im_1, im_2, ..., im_n all denote picture im1, and label_1, label_2, ..., label_n refer to the labels of the n required objects. Assuming that there are m required objects in all the N2 object pictures, the set SO of all pictures and segmentation labels is finally obtained as {(im_i, label_i) | i ∈ [1, m]}, wherein (im_i, label_i) represents the i-th group of labels in the segmentation label set; a segmentation label is a series of point coordinate sets, im_i represents the picture index number in the i-th group of labels, and label_i represents the segmentation label of the i-th required object; the points in a point coordinate set are connected in sequence to frame the region where the required object is located, or the points in a point coordinate set are connected in sequence to form the outer contour curve of the required object.
For step 3, the following steps are included:
step 3.1: randomly selecting an object label (im_i, label_i) from SO as foreground picture FI_i, where FI is the set of all foreground pictures {(im_0, label_0), (im_1, label_1), ......, (im_i, label_i)} and FI_i denotes the picture of the i-th object label. Randomly selecting a group of image sequences (b, b+1, ..., b+T-1) from SI, where b denotes the start number of the selected segment, as background picture BI_j; BI = {BI_0, BI_1, ..., BI_n} comprises n groups of sequences in total, and BI_j represents the j-th sequence. A longitudinal interval sequence Fall_base = (fb_0, fb_1, fb_2, ......, fb_{k-1}) conforming to the free-fall law and comprising k coordinates is designed in advance, where fb_{k-1} denotes the k-th longitudinal interval value in Fall_base; a group of transverse perturbations XOffset = (xf_0, xf_1, xf_2, ......, xf_n) is preset, where xf_n represents the (n+1)-th transverse perturbation value in XOffset; a group of area ratios ratios = (r_0, r_1, r_2, ......, r_n) is preset, where r_n represents the (n+1)-th area ratio in ratios; a group of ordinate scaling coefficients YScale = (ys_0, ys_1, ys_2, ......, ys_n) is preset, where ys_n represents the (n+1)-th ordinate scaling coefficient in YScale.
Step 3.2: calculating the picture coverage range of background picture BI_j, and randomly selecting a coordinate within this range as the initial coordinate loc_base; the initial coordinate loc_base is a point coordinate with an abscissa and an ordinate. A position index ind is randomly selected from the interval [0, T-3) as the index of the first parabolic appearance; the value range [0, T-3) ensures that at least three consecutive pictures can be selected. From Fall_base, a longitudinal interval sequence (fb_j, ..., fb_{j+T-ind-2}) of length T-ind-1 is randomly selected, where fb_j denotes the (j+1)-th longitudinal interval value in Fall_base and fb_{j+T-ind-2} denotes the (j+T-ind-1)-th longitudinal interval value; a scaling coefficient ys is then randomly selected from YScale to obtain a new interval sequence (ys·fb_j, ..., ys·fb_{j+T-ind-2}). Subsequently, an abscissa perturbation sequence (xf_{k_0}, xf_{k_1}, ..., xf_{k_{T-ind-1}}) is randomly selected from XOffset, where k_i ∈ N (i = {0, 1, 2, ..., T-ind-1}) and xf_{k_{T-ind-1}} denotes the k_{T-ind-1}-th abscissa perturbation value in XOffset. The new interval sequence and the abscissa perturbations are added correspondingly to loc_base, obtaining the position sequence of the foreground picture in the background pictures of one image sequence segment: loc = (loc_base + xf_{k_0}, loc_base + xf_{k_1} + ys·fb_j, ..., loc_base + xf_{k_{T-ind-2}} + ys·fb_{j+T-ind-3}, loc_base + xf_{k_{T-ind-1}} + ys·fb_{j+T-ind-2});
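The trajectory construction of step 3.2 can be sketched as follows; this is a minimal illustration, not the patent's code, and the helper name make_loc and the exact indexing are assumptions (the first position carries only a lateral perturbation, each later position adds a scaled free-fall displacement):

```python
# Compose the per-frame position sequence loc from an initial coordinate,
# a free-fall displacement sequence Fall_base (fb_i proportional to i^2),
# an ordinate scaling coefficient ys, and abscissa perturbations from XOffset.
def make_loc(loc_base, fb, ys, xoff):
    """loc[0] = loc_base + (xoff[0], 0); loc[i] adds ys*fb[i-1] vertically."""
    x0, y0 = loc_base
    loc = [(x0 + xoff[0], y0)]
    for i, f in enumerate(fb):
        loc.append((x0 + xoff[i + 1], y0 + ys * f))
    return loc

fall_base = [i * i for i in range(8)]  # [0, 1, 4, 9, ..., 49], as in the embodiment
loc = make_loc(loc_base=(100, 50), fb=fall_base[:4], ys=5,
               xoff=[0, -2, 1, 3, -1])
# ordinates grow quadratically: 50, 50, 55, 70, 95 (free fall)
```

Because the ordinate offsets follow i² while the abscissa only jitters within a small preset range, the synthesized object visually reproduces a falling trajectory.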
Step 3.3: calculating the area A_fi of the object in foreground picture FI_i and the area A_bi of the picture in background picture BI_j; a ratio r is randomly selected from ratios, the new object area A_new is calculated according to A_new = r·A_bi, and the object scaling factor is then calculated as fr = √(A_new / A_fi).
Step 3.4: generating a mask mask_i of the area covered by each object according to the foreground picture segmentation label label_i; mask_i represents the area covered by the i-th object. According to the object scaling factor fr found in step 3.3, the object picture im_i and the object coverage mask mask_i are zoomed, obtaining a new object picture n_im_i with a parabola and a new mask n_mask_i;
Step 3.5: establishing an index (0, 1, ..., T-1) for the background pictures, and selecting the current parabola coordinate and background picture in sequence; if the background picture index is less than ind it is skipped, otherwise the n_im_i and n_mask_i from step 3.4 are transformed by data transformation and the object picture is superposed into the background picture at the corresponding coordinate in loc, obtaining a picture containing a parabola; finally a group of training picture sequence sets DI containing parabolas is obtained. The data transformation includes random flipping, Gaussian noise, random rotation, and the like.
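Steps 3.3 to 3.5 can be sketched together in a few lines; this is a hedged illustration under assumed names (object_scale, paste), and real pixel scaling would use an image library rather than the shape-only arrays shown here:

```python
# Derive the object scaling factor from a target area ratio (step 3.3),
# then overlay the scaled object onto a background frame wherever its
# mask is set (direct-cover superposition, as in the embodiment).
import numpy as np

def object_scale(area_fg, area_bg, r):
    """fr = sqrt(new_area / old_area), with new_area = r * background area."""
    new_area = r * area_bg
    return (new_area / area_fg) ** 0.5

def paste(background, obj, mask, top, left):
    """Copy obj pixels into background at (top, left) where mask > 0."""
    out = background.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    region[mask > 0] = obj[mask > 0]  # region is a view, so out is modified
    return out

fr = object_scale(area_fg=100.0, area_bg=40000.0, r=0.0025)  # ~1.0
bg = np.zeros((6, 6), dtype=np.uint8)
obj = np.full((2, 2), 255, dtype=np.uint8)
mask = np.array([[1, 0], [1, 1]], dtype=np.uint8)
frame = paste(bg, obj, mask, top=2, left=3)
```

The square root appears because the area scales with the square of the linear zoom factor; the direct-cover paste corresponds to the embodiment's default superposition mode, with Poisson fusion named as an alternative.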
For step 4, the following steps are included:
step 4.1: designing a time sequence block segmentation network, wherein the time sequence block segmentation network consists of four parts, namely a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer.
Step 4.2: the feature extraction layer is a CNN-based network and mainly aims to extract features of pictures. In order to ensure the overall efficiency in the last run, we first splice the picture sequence with channel 3 and input length T into data with channel 3 and length 1 in sequence, then the commonly used backbone network (a neural network model for extracting the characteristics of the high, middle and low layers of the image) is changed, the number of input channels of the first layer convolution is 3T, the rest is maintained unchanged, so that only the calculated amount is increased on the first layer (almost negligible), the subsequent calculated amount is consistent with the calculated amount when one image is input, moreover, a plurality of down-sampling steps are added on the premise of keeping the number of the network layers unchanged (because a weak positioning mode is adopted, down-sampling can be greatly adopted without worrying about the influence effect too much), the operation efficiency of the network is further ensured, and the size of the finally obtained feature image is C x H W;
step 4.3: the feature conversion layer converts the features extracted by the feature extraction layer into an input acceptable to the time-sequence feature extraction layer, according to the number of blocks into which the input picture is divided. By way of example, for the feature C × H × W extracted in step 4.2, a pooling layer with kernel size (k1, k2) and step size (s1, s2) is used to down-sample the W direction, and the feature in the height direction is globally pooled (since weak positioning is used, the height spatial information can be globally pooled here without much worry about the effect), obtaining a new feature C × 1 × P; this is then converted into a feature of dimension P × T × (C/T), where C is an integer multiple of T and P is the number of blocks into which the input picture is divided;
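The feature-conversion reshaping of step 4.3 can be sketched in numpy; this is a shape-level illustration under assumed parameters (non-overlapping width pooling with stride s2, mean pooling), not the patent's implementation:

```python
# Convert a C x H x W feature map into P blocks, each carrying a length-T
# sequence of (C/T)-dimensional features: global mean pooling over height,
# strided mean pooling over width, then a channel split into (T, C/T).
import numpy as np

def convert(feat, T, s2):
    """feat: (C, H, W) -> (P, T, C//T), with P = W // s2."""
    C, H, W = feat.shape
    P = W // s2
    pooled = feat.mean(axis=1)  # global pool over height -> (C, W)
    pooled = pooled[:, :P * s2].reshape(C, P, s2).mean(axis=2)  # -> (C, P)
    # split the channel axis back into (T, C//T) and move blocks first
    return pooled.reshape(T, C // T, P).transpose(2, 0, 1)  # (P, T, C//T)

feat = np.random.rand(40, 8, 16).astype(np.float32)
out = convert(feat, T=5, s2=2)  # shape (8, 5, 8): 8 blocks, T=5, C/T=8
```

Each of the P blocks then behaves as an independent length-T sequence for the recurrent layer, which is what makes the per-block (patch) parabola scores of step 5 possible.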
step 4.4: the time sequence feature extraction layer inputs the converted features into a recurrent neural network to obtain features on time sequence dimensions, and then a classification layer is used for classifying the extracted time sequence features to judge whether parabolas exist or not;
step 4.5: performing data enhancement transformations on the parabola-containing picture sequence set DI acquired in step 3, inputting it into the time sequence block segmentation network, and optimizing with BCE (binary cross entropy) as the loss function to finally obtain the converged network TPSNet. The data enhancement transformations comprise Gaussian blur, random cropping, size change, random flipping, and the like.
For step 5, the following steps are included:
step 5.1: for the input video, sequences of length T are input and prediction is performed using TPSNet, yielding scores = (score_0, score_1, ..., score_p) predicting whether each block of each sequence contains a parabola, where P is the number of blocks (patches) into which the input picture is divided. The score of each patch is screened: if it is smaller than a set first threshold thr1 it is filtered out, otherwise it is recorded. Supposing that the scores of a certain block i continuously meet the condition, the recorded effective scores are record_score_i = (score_i^0, score_i^1, ..., score_i^K), where K denotes the recorded sequence length. If the number of scores in record_score_i greater than a second threshold thr2 is greater than a preset number HighNum, and the sequence length K of record_score_i is greater than a preset number SeqNum, the object is regarded as a high-altitude thrown object, an alarm is output and the corresponding video clip is recorded. It should be noted that HighNum, SeqNum, thr1 and thr2 control the sensitivity of the overall method; their values can be changed if different model sensitivities are desired.
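The per-block decision rule of step 5.1 can be sketched as follows; the function names (update, is_parabola) are illustrative assumptions, and the numeric thresholds mirror those used later in the embodiment:

```python
# Keep a per-block run of scores that stay above thr1; raise an alarm when
# the run is longer than SeqNum frames and contains more than HighNum
# scores above thr2.
def update(record, score, thr1):
    """Extend the recorded run while score >= thr1, otherwise reset it."""
    return record + [score] if score >= thr1 else []

def is_parabola(record, thr2, high_num, seq_num):
    high = sum(1 for s in record if s > thr2)
    return len(record) > seq_num and high > high_num

record = []
for s in [0.7, 0.95, 0.96, 0.98, 0.94]:
    record = update(record, s, thr1=0.6)
alarm = is_parabola(record, thr2=0.93, high_num=3, seq_num=3)  # True
```

Raising thr1/thr2 or the count thresholds makes the detector stricter (fewer false alarms), while lowering them makes it more sensitive, which matches the sensitivity control described above.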
An embodiment of the high-altitude parabolic identification method based on data synthesis and time sequence block segmentation is as follows:
step 1: collecting N1 video segments, performing frame extraction on each video, and numbering the frames extracted from each video (assumed to consist of VL pictures) from 0, obtaining 1 + Floor((VL-T)/sep) groups of image sequence segments {(0, 1, ..., T-1), (sep, 1+sep, ..., T+sep-1), ......}, wherein sep represents the sampling interval and (sep, 1+sep, ..., T+sep-1) represents the number set of images in one group of image sequence segments S; T represents the length of a single sequence, Floor represents rounding down, and finally the set SI of the image sequences of all videos is obtained; in this embodiment, N1 is 2, sep is 1, and T is 5;
step 2: collecting N2 object pictures and labeling the required objects in the pictures to obtain the segmentation label label_i of each required object. Assuming that there are n required objects in a picture im1, the annotation set generated for that picture is {(im_1, label_1), (im_2, label_2), ......, (im_n, label_n)}, wherein im_1, im_2, ..., im_n all denote picture im1, and label_1, label_2, ..., label_n refer to the labels of the n required objects. Assuming that there are m required objects in all the N2 object pictures, the set SO of all pictures and segmentation labels is finally obtained as {(im_i, label_i) | i ∈ [1, m]}, wherein (im_i, label_i) represents the i-th group of labels in the segmentation label set; in this embodiment, N2 is 6;
and step 3: randomly selecting a picture from the SO, randomly selecting a sequence S from the image sequence set SI, synthesizing a picture sequence D with a parabola, and repeatedly synthesizing to finally obtain an image sequence set DI with the parabola;
specifically, step 3.1: randomly selecting an object label (im_i, label_i) from SO as foreground picture FI_i, and randomly selecting a group of image sequences (b, b+1, ..., b+T-1) from SI, where b denotes the start number of the selected segment, as background picture BI_j. The longitudinal interval sequence Fall_base = (fb_0, fb_1, fb_2, ......, fb_{k-1}) conforming to the free-fall law and comprising k coordinates is designed in advance; a group of transverse perturbations XOffset = (xf_0, xf_1, xf_2, ......) is preset; a group of area ratios ratios = (r_0, r_1, r_2, ......) is preset; a group of ordinate scaling coefficients YScale = (ys_0, ys_1, ys_2, ......) is preset. In this embodiment Fall_base is [0, 1, 4, 9, 16, 25, 36, 49], XOffset is [-20, -19, ..., 19, 20], ratios are uniformly distributed in the interval (0.0001, 0.005), and YScale is [5, 6, 7, ..., 50].
Step 3.2: calculating the picture coverage range of background picture BI_j, and randomly selecting a coordinate within this range as the initial coordinate loc_base; the initial coordinate loc_base is a point coordinate with an abscissa and an ordinate. A position index ind is randomly selected from the interval [0, T-3) as the index of the first parabolic appearance; from Fall_base, a longitudinal interval sequence (fb_j, ..., fb_{j+T-ind-2}) of length T-ind-1 is randomly selected, and a scaling coefficient ys is then randomly selected from YScale to obtain a new interval sequence (ys·fb_j, ..., ys·fb_{j+T-ind-2}); an abscissa perturbation sequence (xf_{k_0}, xf_{k_1}, ..., xf_{k_{T-ind-1}}) is then randomly selected from XOffset, where k_i ∈ N (i = {0, 1, 2, ..., T-ind-1}); the new interval sequence and the abscissa perturbations are added correspondingly to loc_base, obtaining the position sequence of the foreground picture in the background pictures of one image sequence segment: loc = (loc_base + xf_{k_0}, loc_base + xf_{k_1} + ys·fb_j, ..., loc_base + xf_{k_{T-ind-2}} + ys·fb_{j+T-ind-3}, loc_base + xf_{k_{T-ind-1}} + ys·fb_{j+T-ind-2});
Step 3.3: calculating foreground picture FI i Area A of the middle object fi Background picture BI j Area A of picture in bi Randomly selecting a ratio r from the ratios according to
Figure 288988DEST_PATH_IMAGE016
=A bi Calculating to obtain the area of new object
Figure 150764DEST_PATH_IMAGE016
Calculating the scaling fr =ofthe object
Figure 785008DEST_PATH_IMAGE017
Step 3.4: generating a mask mask_i of the area covered by each object according to the foreground picture segmentation label label_i; according to the object scaling factor fr obtained in step 3.3, the object picture im_i and mask_i are zoomed, obtaining a new object picture n_im_i with a parabola and a new mask n_mask_i;
Step 3.5: establishing an index (0, 1, ..., T-1) for the background pictures, and selecting the current parabola coordinate and background picture in sequence; if the background picture index is less than ind it is skipped, otherwise the n_im_i and n_mask_i from step 3.4 are transformed by data transformation and superposed into the background picture at the corresponding coordinate in loc, obtaining a picture containing a parabola; finally a group of sequence pictures DI containing parabolas is obtained. In this embodiment, the transformation includes random flipping with probability 0.5, Gaussian noise with probability 0.5 and random rotation with probability 1, and the superposition mode is that the parabola content directly covers the background picture content; of course, the transformation and superposition mode may also adopt other methods, for example the superposition may adopt Poisson fusion;
and 4, step 4: the time sequence block segmentation network is trained with the synthesized parabola sequence set DI to obtain TPSNet (Time-sequence Patch Segmentation Network).
For step 4, specifically:
step 4.1: designing a time sequence block segmentation network, wherein the time sequence block segmentation network consists of four parts, namely a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer.
And 4.2: the feature extraction layer is a CNN-based network whose main purpose is to extract features of the pictures. To ensure overall efficiency at run time, the picture sequence with channel 3 and input length T is first spliced, in order, into data with channel 3T and length 1; the number of input channels of the first convolution layer of the commonly used backbone network is then changed to 3T, with the rest kept unchanged, so that the computation increases (almost negligibly) only in the first layer and the subsequent computation is consistent with that of a single input picture; moreover, several down-sampling steps are added while keeping the number of network layers unchanged, further ensuring the running efficiency of the network. In this embodiment, T is 5 and the backbone is MobileNetV2 with down-sampling of 1/128 and with the final pooling layer and fully connected layer removed; the down-sampling can be modified according to the effect and speed actually required, and the backbone can be replaced by other networks such as ResNet/DenseNet/LiteHRNet as the base model;
step 4.3: the feature conversion layer converts the features extracted by the feature extraction layer into an input acceptable to the time-sequence feature extraction layer, according to the number of blocks into which the input picture is divided. By way of example, for the feature C × H × W extracted in step 4.2, an average pooling layer (other pooling, such as max pooling, may also be used) with kernel size (1, 2) and step size (1, 2) is first used to down-sample the W direction (the kernel size and step size control the size of the final output patches, and can be increased accordingly if each patch is desired to cover a larger area), and the feature in the height direction is globally average pooled (other pooling may also be used), obtaining a new feature C × 1 × (W/2); this is then converted into a feature of dimension (W/2) × T × (C/T), where C is an integer multiple of T and W/2 is the number of blocks into which the input picture is divided. In this embodiment T is 5, and W/2 is the width of the input picture divided by 256 (obtained from the 1/128 down-sampling in step 4.2 together with the down-sampling in the W direction, for an overall down-sampling factor of 256);
step 4.4: the time sequence feature extraction layer inputs the converted features into a recurrent neural network to obtain features in the time sequence dimension, and a classification layer is then used to classify the extracted time sequence features and judge whether a parabola exists; in this embodiment, the recurrent neural network used is LSTM (long short-term memory), but other networks such as GRU (gated recurrent unit) may also be used.
Step 4.5: performing data enhancement transformations on the parabola-containing sequence picture set DI obtained in step 3, inputting it into the time sequence block segmentation network, and optimizing with BCE as the loss function to finally obtain the converged network TPSNet. In this embodiment, the data enhancement transformations are Gaussian blur with probability 0.5, random cropping with probability 0.5, area-preserving aspect-ratio change with probability 1, size change with probability 1 (the length of the picture is adjusted to 512 and the width is changed according to the original aspect ratio; the value may be adjusted according to actual requirements), padding of length and width to a multiple of 256, and random sequence reversal with probability 0.2 (falling objects become rising objects); these data enhancement transformations may use different probabilities, different parameters, different combination orders, and other transformations such as color transformation;
and 5: for an input video, 5 consecutive frames are input, the size is changed (the length is adjusted to 512 and the width is changed according to the original aspect ratio) and the length and width are padded to a multiple of 256, and prediction is performed using TPSNet, obtaining scores = (score_0, score_1, ..., score_p) predicting whether each block of each sequence contains a parabola, where p is the number of blocks into which the input picture is divided (the calculation method is described above). The score of each block is screened: if it is smaller than the set threshold 0.6 it is filtered out, otherwise it is recorded. Supposing that the scores of a certain block i continuously meet the condition, the recorded effective scores are record_score_i = (score_i^0, score_i^1, ..., score_i^K); if the number of scores in record_score_i greater than the preset threshold 0.93 is greater than the preset number 3, and the sequence length K of record_score_i is greater than the preset number 3, the object is regarded as a high-altitude thrown object, an alarm is output and the corresponding video image is recorded.
The invention provides a complex scene high-altitude parabola recognition method, which introduces a data synthesis method and a new modeling mode in the execution of the recognition method; it avoids the need for large-scale data collection, generalizes well across different scenes, and runs at high speed:
1. synthesizing data according to the characteristics of the high-altitude parabolas (the longitudinal coordinate basically accords with the free falling body law, the rotation occurs in the falling process, and a certain transverse offset exists in the falling process) to solve the problem that a deep learning algorithm for the high-altitude parabolas field needs a large amount of multi-dimensional high-altitude parabolas labeling data.
2. Through higher down sampling, specific shallow features are ignored when the network extracts features, sequence features with relative semantics are concerned, and the problem that seamless migration to each scene cannot be achieved at low cost is solved.
3. Different from the commonly adopted strong positioning (positioning to the position of a parabola in an image), a weak positioning mode (positioning to the position of the parabola in a certain vertical area of the image) is adopted, so that the subsequent manual checking is facilitated, and meanwhile, the positioning requirement on the model is reduced, and the efficient model can be constructed; the sequence information is coupled to the feature dimension in the input process, and then the features are extracted, so that the efficiency of extracting the image features is ensured; the characteristic dimension is decoded into the sequence dimension, and the extracted semantic information can be subjected to sequence integration by using a sequence model, so that the final performance is ensured; the problem that the high speed and the good effect are difficult to maintain simultaneously is solved.
In order to support the smooth execution of the method, a complex scene high-altitude parabolic recognition system is correspondingly arranged, and the system comprises:
the training picture generation module is used for synthesizing the image sequence set SI and the object picture set SO to obtain a training picture sequence set DI with a parabola;
the time sequence patch segmentation network training module is used for training a time sequence block segmentation network by using a training picture sequence set DI to obtain TPSNet;
and the detection and identification module is used for carrying out parabolic identification and prediction on the test video by using the TPSNet obtained after training.
Specifically, the training picture generation module specifically includes the following units:
and the video frame extracting unit is used for collecting N1 video segments and extracting each frame of image in the video.
The background image dividing unit is used for dividing all images in each video into a plurality of image sequence segments S with the length T according to time sequence; all the image sequence segments obtained by division form an image sequence set SI.
The object picture labeling unit is used for collecting N2 object pictures and carrying out segmentation and labeling on the object pictures; and obtaining a segmented and labeled object picture set SO.
And the picture selecting unit is used for randomly selecting an object picture from the object picture set SO as a foreground picture and selecting an image sequence segment from the image sequence set SI as a background picture.
The picture synthesis unit is used for synthesizing the selected foreground picture and the selected background picture according to the characteristic of the high-altitude parabola to obtain a training picture sequence D with the parabola; and repeatedly synthesizing to obtain a plurality of training picture sequences D with parabolas, wherein the plurality of training picture sequences D form a training picture sequence set DI.
Specifically, the time series patch segmentation network training module is specifically configured to perform data enhancement transformation on a training picture sequence set DI, input the data enhancement transformation into a time series block segmentation network, and perform optimization by using a loss function to obtain a converged network TPSNet;
the time sequence block segmentation network comprises a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer;
the feature extraction layer is based on a CNN network and is used for extracting the features of pictures: it splices picture sequences with input length T and channel 3 into data with length 1 and channel 3T in sequence, and the number of input channels of the first convolution layer of the backbone network used is then changed to 3T;
the feature conversion layer is used for converting the picture features extracted by the feature extraction layer into the acceptable input of the time sequence feature extraction layer according to the number of the divided blocks;
the time sequence feature extraction layer is used for inputting the transformed picture features into a recurrent neural network to obtain the features on the time sequence dimension;
and the classification layer is used for classifying the extracted time sequence characteristics and judging whether a parabola exists.
Specifically, the detection and identification module comprises the following units:
the system comprises a test video processing unit, a data processing unit and a data processing unit, wherein the test video processing unit is used for acquiring a test video, dividing the test video into a plurality of sequences with the length of T, and performing identification prediction by using TPSNet; a score is obtained for each sequence of high altitude parabolas and recorded.
And the identification and judgment unit is used for judging whether the high-altitude parabolic object exists in the test video according to the score, and alarming and recording the corresponding video image segment if the high-altitude parabolic object exists in the test video.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (20)

1. A complex scene high altitude parabola identification method is characterized by comprising the following steps:
synthesizing the image sequence set SI and the object picture set SO to obtain a training picture sequence set DI with a parabola;
training a time sequence block segmentation network by using a training picture sequence set DI to obtain TPSNet;
and (4) carrying out parabolic recognition prediction on the test video by using the TPSNet obtained after training, and judging whether a high-altitude parabolic exists in the test video.
2. The complex scene high altitude parabola identification method according to claim 1, wherein the method for acquiring the image sequence set SI comprises:
collecting N1 video segments, and extracting each frame of image in the video;
dividing all images in each video into a plurality of image sequence segments S with the length T according to time sequence;
all the image sequence segments obtained by the division constitute an image sequence set SI.
3. The complex scene high-altitude parabolic recognition method according to claim 2, wherein the image sequence set SI is specifically:
SI = {(0, 1, ..., T-1), (sep, 1+sep, ..., T+sep-1), ...}, wherein (sep, 1+sep, ..., T+sep-1) represents the set of image numbers within one image sequence segment S; sep represents the sampling interval; T represents the length of a single image sequence segment S;
the image sequence set SI comprises 1 + Floor((VL - T)/sep) image sequence segments, where Floor represents rounding down and VL represents the number of pictures extracted per video.
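The segment enumeration of claims 2-3 can be sketched as follows (a minimal illustration; the function name and the example values are ours, not the patent's):

```python
def segment_indices(VL, T, sep):
    """Enumerate the image-number sets of the image sequence set SI:
    one segment of length T starts every `sep` frames, giving
    1 + Floor((VL - T) / sep) segments in total."""
    n_segments = 1 + (VL - T) // sep
    return [tuple(range(i * sep, i * sep + T)) for i in range(n_segments)]

# A video of VL=10 extracted frames, segment length T=4, sampling interval sep=2:
segments = segment_indices(VL=10, T=4, sep=2)
# segments -> [(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
```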
4. The complex scene high-altitude parabolic recognition method according to claim 2, wherein the method for acquiring the object picture set SO comprises:
collecting N2 object pictures, and segmenting and labeling the object pictures;
and obtaining a segmented and labeled object picture set SO.
5. The complex scene high-altitude parabolic recognition method according to claim 4, wherein the segmenting and labeling of the object picture specifically comprises:
marking the required objects in the object picture to obtain a segmentation annotation label_i for each required object;
each object picture correspondingly generates a label set {(im_1, label_1), (im_2, label_2), ..., (im_n, label_n)}, wherein im_1, im_2, ..., im_n refer to picture index numbers; label_1, label_2, ..., label_n refer to the labels of the n required objects, and (im_n, label_n) represents the nth group of labels in the label set;
after all the object pictures are segmented and labeled, the label sets of the obtained object pictures are integrated to obtain a set of pictures and segmentation annotations SO, denoted {(im_i, label_i) | i ∈ [1, m]}, wherein (im_i, label_i) represents the ith group of labels in the set, im_i represents the picture index number in the ith group of labels, label_i represents the segmentation label of the ith required object, and m represents that there are m required objects in all the object pictures.
6. The complex scene high altitude parabola identification method according to claim 4, wherein the method for synthesizing the training picture sequence set DI comprises:
randomly selecting an object picture from the object picture set SO as a foreground picture, and selecting an image sequence segment from the image sequence set SI as a background picture;
synthesizing the selected foreground picture and the selected background picture according to the characteristic of the high-altitude parabola to obtain a training picture sequence D with the parabola;
and repeatedly synthesizing to obtain a plurality of training picture sequences D with parabolas, wherein the plurality of training picture sequences D form a training picture sequence set DI.
7. The complex scene high-altitude parabola identification method according to claim 6, wherein the characteristics of the high-altitude parabola specifically include:
presetting a longitudinal interval sequence Fall_base = (fb_0, fb_1, fb_2, ..., fb_{k-1}) comprising k values and conforming to the law of free fall, wherein fb_{k-1} represents the kth longitudinal interval value in Fall_base;
presetting a group of transverse disturbances XOffset = (xf_0, xf_1, xf_2, ..., xf_n), wherein xf_n represents the (n+1)th lateral disturbance value in XOffset;
presetting a group of area ratios = (r_0, r_1, r_2, ..., r_n), wherein r_n represents the (n+1)th area ratio in the group of area ratios;
presetting a group of ordinate scaling coefficients YScale = (ys_0, ys_1, ys_2, ..., ys_n), wherein ys_n represents the (n+1)th ordinate scaling coefficient in YScale.
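As an illustration of how a Fall_base conforming to the free-fall law might be preset (the frame rate, gravity constant, and pixel scale below are assumptions, not values from the patent): under free fall the drop between consecutive frames grows linearly with the frame index, since the displacement over the ith frame interval is g·dt²·(i + 1/2).

```python
def free_fall_intervals(k, dt=0.04, g=9.8, px_per_m=50.0):
    """Generate k per-frame vertical drop values (in pixels) for an object
    in free fall sampled every dt seconds: the displacement over frame
    interval i is g * dt**2 * (i + 0.5) meters, converted to pixels."""
    return [g * dt * dt * (i + 0.5) * px_per_m for i in range(k)]

fall_base = free_fall_intervals(k=8)
# The intervals increase linearly, as expected under constant acceleration.
```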
8. The complex scene high-altitude parabolic recognition method according to claim 7, wherein the synthesizing the selected foreground picture and the selected background picture to obtain the training picture sequence D with the parabola specifically comprises:
in the selected background picture, determining a position information set of the foreground picture through characteristic transformation of a high-altitude parabola;
determining the size information of the foreground picture according to the areas of the foreground picture and the background picture;
and according to the position information set and the size information of the foreground picture, sequentially combining the foreground picture and a group of background pictures into a group of training pictures with parabolas to form a training picture sequence D.
9. The method for identifying the high-altitude parabola in the complex scene according to claim 8, wherein the determining the position information set of the foreground picture in the selected background picture through the characteristic transformation of the high-altitude parabola specifically comprises:
randomly selecting a coordinate within the coverage of the selected background picture as an initial coordinate loc_base;
randomly selecting a position index ind from the interval [0, T-3) as the index at which the parabola first appears, and randomly selecting from the longitudinal interval sequence Fall_base a consecutive subsequence of length T-ind-1, (fb_j, ..., fb_{j+T-ind-2}), wherein fb_j represents the (j+1)th longitudinal interval value in Fall_base and fb_{j+T-ind-2} represents the (j+T-ind-1)th longitudinal interval value in Fall_base;
randomly selecting a scaling coefficient ys from the ordinate scaling coefficients YScale to obtain a new longitudinal interval sequence (ys*fb_j, ..., ys*fb_{j+T-ind-2});
randomly selecting an abscissa disturbance sequence (xf_{k_0}, xf_{k_1}, ..., xf_{k_{T-ind-1}}) from the transverse disturbances XOffset, wherein k_i ∈ N (i = 0, 1, 2, ..., T-ind-1) and xf_{k_{T-ind-1}} denotes the (k_{T-ind-1})th abscissa perturbation value in XOffset;
adding the new longitudinal interval sequence and the abscissa perturbation sequence to the corresponding coordinate values of the initial coordinate loc_base to obtain the position sequence of the foreground picture in the background pictures of one image sequence segment: loc = (loc_base + xf_{k_0}, loc_base + xf_{k_1} + ys*fb_j, ..., loc_base + xf_{k_{T-ind-2}} + ys*fb_{j+T-ind-3}, loc_base + xf_{k_{T-ind-1}} + ys*fb_{j+T-ind-2}).
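A sketch of the position-sequence construction in claim 9 (the variable names, and the accumulation of the scaled vertical intervals from frame to frame, are our reading of the claim rather than its literal wording):

```python
import random

def parabola_positions(loc_base, fall_base, xoffset, yscale, T):
    """Pick the first-appearance index ind, a length-(T-ind-1) slice of the
    vertical interval sequence, a vertical scale ys, and random horizontal
    perturbations, then add them onto the initial coordinate loc_base."""
    ind = random.randrange(0, T - 3)          # index of the initial parabola appearance
    n = T - ind - 1                           # number of vertical intervals used
    j = random.randrange(0, len(fall_base) - n + 1)
    ys = random.choice(yscale)
    intervals = [ys * fb for fb in fall_base[j:j + n]]
    xs = [random.choice(xoffset) for _ in range(n + 1)]
    x0, y0 = loc_base
    locs = [(x0 + xs[0], y0)]
    y = y0
    for dx, dy in zip(xs[1:], intervals):
        y += dy                               # vertical drops accumulate frame to frame
        locs.append((x0 + dx, y))
    return ind, locs
```

With positive interval values the ordinate is non-decreasing, matching a falling object.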
10. The complex scene high-altitude parabolic recognition method according to claim 9, wherein the determining the size information of the foreground picture according to the areas of the foreground picture and the background picture specifically comprises:
calculating the object area A_fi in the foreground picture and the picture area A_bi of the background picture;
randomly selecting a ratio r from the area ratios to obtain a new object area A_new = A_bi * r;
further obtaining the scaling factor of the object fr = sqrt(A_new / A_fi);
generating, according to the segmentation label label_i in the foreground picture, the area mask_i covered by each required object;
scaling the object picture im_i and the area mask_i covered by the object according to the scaling factor fr to obtain a new object picture n_im_i with a parabola and a new mask n_mask_i.
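The area-driven scaling of claim 10 can be sketched as follows; the square-root formula for fr is our reconstruction of the garbled expression, on the assumption that scaling both sides of the object by fr multiplies its area by fr²:

```python
import math

def object_scale(area_object, area_background, r):
    """Pick a target object area as the fraction r of the background picture
    area, then derive the linear scaling factor fr applied to both the
    object picture and its mask."""
    a_new = area_background * r
    fr = math.sqrt(a_new / area_object)
    return a_new, fr

# A 400-pixel object pasted into a 1920x1080 background at ratio r = 0.001:
a_new, fr = object_scale(area_object=400.0, area_background=1920 * 1080, r=0.001)
# fr**2 * 400 == a_new, i.e. the scaled object occupies the chosen area.
```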
11. The method for identifying high-altitude parabolas in complex scenes according to claim 10, wherein the step of sequentially synthesizing a foreground picture and a group of background pictures into a group of training pictures with parabolas specifically comprises:
establishing index sequence numbers (0, 1, 2, ..., T-1) for the background pictures of an image sequence segment;
sequentially selecting the background picture and the position coordinates in the position sequence loc;
judging whether the index sequence number of the background picture is smaller than the position index ind;
skipping the background picture when the background picture index sequence number is smaller than the position index ind;
when the index sequence number of the background picture is not less than the position index ind, taking the new object picture n_im_i with a parabola and the area n_mask_i covered by the object and simultaneously carrying out data transformation; and superposing the new object picture onto the corresponding background picture according to the position coordinate in the position sequence loc to synthesize a training picture.
12. The complex scene high altitude parabola identification method as claimed in any one of claims 1-11, wherein said time sequence block segmentation network comprises a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer;
the feature extraction layer is based on a CNN and is used for extracting picture features; picture sequences with input length T and 3 channels are spliced in order into data with length 1 and 3T channels, and the number of input channels of the first convolution layer of the backbone network used is changed to 3T;
the feature conversion layer is used for converting the picture features extracted by the feature extraction layer into the acceptable input of the time sequence feature extraction layer according to the number of the divided blocks;
the time sequence feature extraction layer is used for inputting the transformed picture features into a recurrent neural network to obtain the features on the time sequence dimension;
and the classification layer is used for classifying the extracted time sequence characteristics and judging whether a parabola exists.
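The channel splicing described for the feature extraction layer can be illustrated with plain nested lists (a shape-only sketch; a real implementation would use a tensor library and then set the backbone's first convolution to 3T input channels):

```python
def splice_channels(frames):
    """Concatenate a sequence of T frames, each holding 3 channel planes,
    into a single input with 3*T channel planes (length 1, channels 3T)."""
    spliced = []
    for frame in frames:        # frame: a list of 3 H x W channel planes
        spliced.extend(frame)
    return spliced

T, H, W = 4, 2, 2
frames = [[[[0.0] * W for _ in range(H)] for _ in range(3)] for _ in range(T)]
x = splice_channels(frames)
# x now holds 3*T = 12 channel planes, each of shape H x W.
```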
13. The complex scene high altitude parabola identification method according to claim 12, wherein the method for obtaining the TPSNet comprises:
and performing data enhancement transformation on the training picture sequence set DI, inputting the training picture sequence set DI into a time sequence block segmentation network, and optimizing by using a loss function to obtain a converged network TPSNet.
14. The complex scene high-altitude parabolic recognition method according to any one of claims 1 to 11, wherein the performing parabolic recognition prediction on the test video and judging whether a high-altitude parabolic exists in the test video specifically comprises:
acquiring a test video, dividing the test video into a plurality of sequences with the length of T, and identifying and predicting by using TPSNet;
obtaining the score of each sequence of high altitude parabolas, and recording the score;
and judging whether a high-altitude parabola exists in the test video or not according to the score, and if the high-altitude parabola exists in the test video, alarming and recording a corresponding video image segment.
15. The method for identifying high altitude parabolas in complex scenes as claimed in claim 14, wherein obtaining a score of each sequence of high altitude parabolas, and recording the score specifically comprises:
deriving scores = (score_0, score_1, ..., score_P) predicting whether each block of each sequence contains a parabola, wherein P is the number of blocks into which the input picture is divided;
screening the score of each patch, and if the score is smaller than a set first threshold, filtering, and not recording the score value; and if the score is not less than the set first threshold, recording the score value, and finally obtaining a qualified effective score sequence set.
16. The complex scene high-altitude parabola identification method according to claim 15, wherein the judging whether the high-altitude parabola exists in the test video specifically comprises:
judging whether the number of data in the effective score sequence set is greater than a preset sequence length and whether the number of data in the effective score sequence set greater than a second threshold exceeds a preset number;
and when the number of data in the effective score sequence set is greater than the preset sequence length and the number of data greater than the second threshold exceeds the preset number, alarming and recording the corresponding video image segment.
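The two-stage decision of claims 15-16 can be sketched as follows (the threshold values, minimum sequence length, and minimum count below are illustrative assumptions, not the patent's parameters):

```python
def detect_parabola(score_sequences, first_thr=0.3, second_thr=0.6,
                    min_len=3, min_count=2):
    """Filter each sequence's block scores by the first threshold to build
    the effective score set, then alarm only when that set is longer than
    min_len AND more than min_count of its scores exceed the second threshold."""
    effective = [s for scores in score_sequences for s in scores if s >= first_thr]
    strong = sum(1 for s in effective if s > second_thr)
    return len(effective) > min_len and strong > min_count

alarm = detect_parabola([[0.1, 0.7, 0.8], [0.4, 0.9, 0.05], [0.65, 0.2]])
# effective = [0.7, 0.8, 0.4, 0.9, 0.65]; four scores exceed 0.6 -> alarm is True
```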
17. A complex scene high altitude parabola recognition system, characterized in that the system comprises:
the training picture generation module is used for synthesizing the image sequence set SI and the object picture set SO to obtain a training picture sequence set DI with a parabola;
the time sequence patch segmentation network training module is used for training a time sequence block segmentation network by using a training picture sequence set DI to obtain TPSNet;
and the detection and identification module is used for carrying out parabolic identification and prediction on the test video by using the TPSNet obtained after training.
18. The complex scene high altitude parabola recognition system of claim 17, wherein the training picture generation module specifically comprises:
the video frame extracting unit is used for collecting N1 video segments and extracting each frame image in the video;
the background image dividing unit is used for dividing all images in each video into a plurality of image sequence segments S with the length T according to time sequence; all the image sequence segments obtained by division form an image sequence set SI;
the object picture labeling unit is used for collecting N2 object pictures and carrying out segmentation and labeling on the object pictures; obtaining a segmented and labeled object picture set SO;
the image selecting unit is used for randomly selecting an object image from the object image set SO as a foreground image and selecting an image sequence segment from the image sequence set SI as a background image;
the picture synthesis unit is used for synthesizing the selected foreground picture and the selected background picture according to the characteristic of the high-altitude parabola to obtain a training picture sequence D with the parabola; and repeatedly synthesizing to obtain a plurality of training picture sequences D with parabolas, wherein the plurality of training picture sequences D form a training picture sequence set DI.
19. The complex scene high altitude parabola identification system of claim 18, wherein said timing patch segmentation network training module is specifically configured to perform data enhancement transformation on a training picture sequence set DI, input the transformed training picture sequence set DI into a timing block segmentation network, and perform optimization using a loss function to obtain a converged network TPSNet;
the time sequence block segmentation network comprises a feature extraction layer, a feature transformation layer, a time sequence feature extraction layer and a classification layer;
the feature extraction layer is based on a CNN and is used for extracting picture features; picture sequences with input length T and 3 channels are spliced in order into data with length 1 and 3T channels, and the number of input channels of the first convolution layer of the backbone network used is changed to 3T;
the feature conversion layer is used for converting the picture features extracted by the feature extraction layer into the acceptable input of the time sequence feature extraction layer according to the number of the divided blocks;
the time sequence feature extraction layer is used for inputting the transformed picture features into a recurrent neural network to obtain the features on the time sequence dimension;
and the classification layer is used for classifying the extracted time sequence characteristics and judging whether a parabola exists.
20. The complex scene high altitude parabola identification system of claim 17, wherein the detection identification module specifically comprises:
the test video processing unit is used for acquiring a test video, dividing the test video into a plurality of sequences with the length of T, and performing identification prediction by using TPSNet; obtaining the score of each sequence of high altitude parabolas, and recording the score;
and the identification and judgment unit is used for judging whether the high-altitude parabolic object exists in the test video according to the score, and alarming and recording the corresponding video image segment if the high-altitude parabolic object exists in the test video.
CN202210796750.9A 2022-07-08 2022-07-08 Complex scene high altitude parabolic identification method and system Active CN114863370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210796750.9A CN114863370B (en) 2022-07-08 2022-07-08 Complex scene high altitude parabolic identification method and system


Publications (2)

Publication Number Publication Date
CN114863370A true CN114863370A (en) 2022-08-05
CN114863370B CN114863370B (en) 2022-10-25

Family

ID=82626249


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862376A (en) * 2017-10-30 2018-03-30 中山大学 A human body image action recognition method based on a two-stream neural network
CN108615011A (en) * 2018-04-24 2018-10-02 东南大学 Non- trimming video behavior identification prediction method based on multi-scale sliding window mouth
CN110020596A (en) * 2019-02-21 2019-07-16 北京大学 A kind of video content localization method based on Fusion Features and cascade study
CN110796087A (en) * 2019-10-30 2020-02-14 江西赣鄱云新型智慧城市技术研究有限公司 Method and system for quickly generating high-altitude parabolic training sample
CN111723654A (en) * 2020-05-12 2020-09-29 中国电子***技术有限公司 High-altitude parabolic detection method and device based on background modeling, YOLOv3 and self-optimization
CN113076809A (en) * 2021-03-10 2021-07-06 青岛海纳云科技控股有限公司 High-altitude falling object detection method based on visual Transformer
CN113506315A (en) * 2021-07-26 2021-10-15 上海智眸智能科技有限责任公司 Method and device for detecting moving object and storage medium
WO2022105609A1 (en) * 2020-11-19 2022-05-27 中科智云科技有限公司 High-altitude parabolic object detection method and apparatus, computer device, and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI WANG: "TPSNet: Thin-Plate-Spline Representation for Arbitrary Shape Scene Text", 《ARXIV:2110.12826V1》 *

Also Published As

Publication number Publication date
CN114863370B (en) 2022-10-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant