CN112613516A - Semantic segmentation method for aerial video data - Google Patents
Semantic segmentation method for aerial video data
- Publication number
- CN112613516A (application CN202011459565.8A)
- Authority
- CN
- China
- Prior art keywords
- semantic segmentation
- histogram
- video data
- aerial video
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
Abstract
The application discloses a semantic segmentation method for aerial video data. An aerial video data set is trained and identified through a shot boundary detection algorithm to obtain the key frames in the data set, which form a key frame data set; the key frame data set is then semantically segmented through a semantic segmentation algorithm based on a fully convolutional network. The method reduces the amount of computation by preprocessing the data and extracting key frames, so that no large data set is needed to drive model learning; it addresses the model's sensitivity to optical-flow changes caused by shadows by combining color and texture features; and it uses a convolutional neural network to learn local and global features in an end-to-end manner to optimize the segmentation result, thereby improving the accuracy and reliability of subsequent extended analysis.
Description
Technical Field
The application relates to a semantic segmentation method for aerial video data.
Background
Analyzing video captured by drones has a wide range of applications, such as vehicle tracking, object detection, anomaly detection, and the like. For most applications, spatial and contextual information needs to be inferred from the image frames of the video. For example, tracking vehicles is easier with knowledge about roads. Semantic segmentation is one of the tools used to divide an image into different semantic regions and classify these regions into predefined classes. Semantic segmentation helps to understand the layout of a scene, and it is therefore becoming an increasingly important factor for anomaly detection, autonomous vehicles, object detection, and the like. Semantic segmentation remains challenging due to intra-class variation of objects, loss of perspective, scene context, the presence of noise, and changes in lighting. Current semantic segmentation can be achieved by using traditional machine learning methods such as Conditional Random Fields (CRF) and by deep Convolutional Neural Networks (CNN).
CRF-based algorithms are widely used for their ability to capture contextual information, and the framework is usually composed of unary and pairwise potentials. The unary potential captures local features that depend on the pixel itself, and the pairwise potential captures spatial information. Capturing the different potentials of various features (e.g., texture, color, location) requires manual encoding into the model. However, these hand-crafted functions may not capture all of the variation in the data.
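As an illustration of the unary/pairwise decomposition described above (a minimal NumPy sketch using a Potts pairwise term over a 4-neighborhood; this is a generic textbook CRF energy, not the patent's own formulation):

```python
import numpy as np

def crf_energy(unary, labels, lam=1.0):
    """Energy of a labeling under a simple grid CRF.
    unary:  (H, W, C) per-pixel class costs (the unary potential)
    labels: (H, W) integer labeling
    Pairwise term: Potts penalty `lam` for each pair of unequal
    4-connected neighbors (the pairwise potential)."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    e_unary = unary[rows, cols, labels].sum()
    # Count disagreeing horizontal and vertical neighbor pairs
    e_pair = lam * ((labels[:, 1:] != labels[:, :-1]).sum()
                    + (labels[1:, :] != labels[:-1, :]).sum())
    return float(e_unary + e_pair)
```

A smoother labeling (fewer disagreeing neighbor pairs) yields a lower energy, which is what CRF inference minimizes.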
The success of automated systems for anomaly detection, event detection, and the like in aerial video relies heavily on scene understanding for greater accuracy. In addition, owing to the lack of available data sets, research on semantic segmentation of drone video is limited.
Therefore, how to more effectively perform semantic segmentation on drone aerial video, and then use the results in analysis, is a technical problem that urgently needs to be solved.
Disclosure of Invention
It is an object of the present application to overcome the above problems or to at least partially solve or mitigate the above problems.
According to one aspect of the application, a semantic segmentation method for aerial video data is provided: an aerial video data set is trained and identified through a shot boundary detection algorithm to obtain the key frames in the data set, which form a key frame data set; the key frame data set is then semantically segmented through a semantic segmentation algorithm based on a fully convolutional network.
Optionally, the shot boundary detection algorithm identifies shot boundaries between consecutive frames in the aerial video data set by calculating the histogram difference of the consecutive frames and comparing it with a set threshold.
Optionally, the shot boundary detection algorithm divides each frame into non-overlapping grids and combines per-grid histogram difference calculation to identify the shot boundary of each frame.
Optionally, when the shot boundary detection algorithm identifies shot boundaries using non-overlapping grids combined with histogram difference calculation, each frame is divided into non-overlapping grids of size 16 × 16; the histogram difference of corresponding grids in two adjacent frames is then calculated using the chi-square distance; the average histogram difference between the two consecutive frames is computed; and finally the average histogram difference is compared with a set threshold T_shot to identify shot boundaries.
Optionally, the equation for calculating the histogram difference of corresponding grids in two adjacent frames using the chi-square distance is:

D_k(H_i, H_{i+1}) = Σ_I [H_i(I) − H_{i+1}(I)]² / [H_i(I) + H_{i+1}(I)]

where H_i represents the histogram of the i-th frame, H_{i+1} represents the histogram of the (i+1)-th frame, and I indicates the image block at the same position in the two frames.
Optionally, the calculation formula for the average histogram difference between two consecutive frames is:

D = (1/N) Σ_{k=1}^{N} D_k

where D is the average histogram difference of the two consecutive frames, D_k is the chi-square difference between the k-th image blocks, and N represents the total number of image blocks in the image.
Optionally, the formula for comparing the average histogram difference with the set threshold T_shot is:

B(i, i+1) = 1, if D(i, i+1) > T_shot; 0, otherwise

where i and i+1 represent two consecutive frames, D(i, i+1) is the average histogram difference between them, and B(i, i+1) = 1 indicates a shot boundary.
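For illustration (not part of the original disclosure), the three quantities above can be sketched in NumPy; the small epsilon guarding empty histogram bins is an added implementation assumption:

```python
import numpy as np

def chi_square_diff(h1, h2, eps=1e-10):
    """Chi-square distance between two block histograms (D_k)."""
    h1 = h1.astype(np.float64)
    h2 = h2.astype(np.float64)
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def mean_histogram_diff(frame_a, frame_b, grid=16, bins=256):
    """Average chi-square difference D over corresponding
    non-overlapping grid-by-grid blocks of two grayscale frames."""
    h, w = frame_a.shape
    diffs = []
    for y in range(0, h - grid + 1, grid):
        for x in range(0, w - grid + 1, grid):
            ha, _ = np.histogram(frame_a[y:y+grid, x:x+grid],
                                 bins=bins, range=(0, 256))
            hb, _ = np.histogram(frame_b[y:y+grid, x:x+grid],
                                 bins=bins, range=(0, 256))
            diffs.append(chi_square_diff(ha, hb))
    return float(np.mean(diffs))

def is_shot_boundary(d, t_shot):
    """B(i, i+1) = 1 when the average difference exceeds T_shot."""
    return 1 if d > t_shot else 0
```

Identical frames give D = 0, while frames from different shots give a large D, which is what the threshold comparison exploits.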
Optionally, a U-Net model is adopted for the semantic segmentation of the key frame data set by the fully-convolutional-network-based algorithm. The U-Net model comprises a contracting path and a symmetric expanding path; features in the key frame are convolved along the contracting path and extracted through a ReLU activation function, max pooling is applied to the extracted features to identify the relevant features, and a Softmax activation is applied to the last layer of the U-Net model to obtain the pixel probability of each class.
Optionally, the key frames processed by the U-Net model are 256 × 256 color images, and padding is applied at every layer of the U-Net model so that the features most relevant to the key frame are preserved.
In particular, the present invention also provides a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
The invention also provides a computer-readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements a method as described above.
The invention also provides a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method as described above.
According to the semantic segmentation method for aerial video data, key frames are extracted through data preprocessing, which reduces the amount of computation, so that no large data set is needed to drive model learning; the model's sensitivity to optical-flow changes caused by shadows is addressed by combining color and texture features; and a convolutional neural network is used to learn local and global features in an end-to-end manner to optimize the semantic segmentation result, improving the accuracy and reliability of subsequent extended analysis.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram of a method for semantic segmentation of aerial video data according to one embodiment of the present application;
FIG. 2 is a block diagram of a computing device according to another embodiment of the present application;
FIG. 3 is a structural diagram of a computer-readable storage medium according to another embodiment of the present application.
Detailed Description
According to the scheme, as shown in FIG. 1, an aerial video data set is trained and identified through a shot boundary detection algorithm to obtain the key frames in the data set, which form a key frame data set; the key frame data set is then semantically segmented through a semantic segmentation algorithm based on a fully convolutional network.
The shot boundary detection algorithm identifies shot boundaries between consecutive frames in the aerial video data set by calculating their histogram difference and comparing it with a set threshold. Further, the shot boundary of each frame is identified by dividing non-overlapping grids and combining histogram difference calculation.
Specifically, when the shot boundary detection algorithm identifies shot boundaries using non-overlapping grids combined with histogram difference calculation, each frame is first divided into non-overlapping grids of size 16 × 16, and the histogram difference of corresponding grids in two adjacent frames is calculated using the chi-square distance:

D_k(H_i, H_{i+1}) = Σ_I [H_i(I) − H_{i+1}(I)]² / [H_i(I) + H_{i+1}(I)]

where H_i represents the histogram of the i-th frame, H_{i+1} represents the histogram of the (i+1)-th frame, and I indicates the image block at the same position in the two frames.
The average histogram difference between the two consecutive frames is then calculated:

D = (1/N) Σ_{k=1}^{N} D_k

where D is the average histogram difference of the two consecutive frames, D_k is the chi-square difference between the k-th image blocks, and N represents the total number of image blocks in the image.
Finally, the average histogram difference is compared with the set threshold T_shot to identify shot boundaries:

B(i, i+1) = 1, if D(i, i+1) > T_shot; 0, otherwise

where i and i+1 represent two consecutive frames. The threshold T_shot may be determined according to the specific working conditions. In this embodiment, T_shot is determined from the peaks and valleys of the histogram-difference curve; preferably, T_shot corresponds to the minimum value between two peaks of the selected histogram, and may be tuned according to experimental performance. When determining shot boundaries, e.g., when D_{i+1} − D_i > T_shot, the indicator takes the value 1 and a shot boundary is declared; otherwise the frame pair is a non-boundary.
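The shot-boundary stage just described can be sketched end to end in NumPy (an illustrative sketch; choosing the middle frame of each shot as its key frame is an assumption made here for illustration, since the text does not fix a key-frame selection rule):

```python
import numpy as np

def _block_mean_diff(fa, fb, grid=16, bins=256, eps=1e-10):
    # Average chi-square histogram difference over non-overlapping blocks.
    h, w = fa.shape
    diffs = []
    for y in range(0, h - grid + 1, grid):
        for x in range(0, w - grid + 1, grid):
            ha, _ = np.histogram(fa[y:y+grid, x:x+grid],
                                 bins=bins, range=(0, 256))
            hb, _ = np.histogram(fb[y:y+grid, x:x+grid],
                                 bins=bins, range=(0, 256))
            ha = ha.astype(np.float64)
            hb = hb.astype(np.float64)
            diffs.append(np.sum((ha - hb) ** 2 / (ha + hb + eps)))
    return float(np.mean(diffs))

def detect_shot_boundaries(frames, t_shot):
    """Indices i such that a boundary falls between frame i and i+1."""
    return [i for i in range(len(frames) - 1)
            if _block_mean_diff(frames[i], frames[i + 1]) > t_shot]

def extract_key_frames(frames, t_shot):
    """One key frame per detected shot (middle frame; rule assumed)."""
    bounds = detect_shot_boundaries(frames, t_shot)
    starts = [0] + [b + 1 for b in bounds]
    ends = bounds + [len(frames) - 1]
    return [(s + e) // 2 for s, e in zip(starts, ends)]
```

On a sequence of five dark frames followed by five bright frames, the only boundary falls between frames 4 and 5, and one key frame is picked from each resulting shot.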
Optionally, a U-Net model is adopted for the semantic segmentation of the key frame data set by the fully-convolutional-network-based algorithm. The U-Net model comprises a contracting path and a symmetric expanding path; features in the key frame are convolved along the contracting path and extracted through a ReLU activation function, max pooling is applied to the extracted features to identify the relevant features, and a Softmax activation is applied to the last layer of the U-Net model to obtain the pixel probability of each class. Generally, a picture includes several semantic classes, such as "road", "lawn", and "house"; once the per-class pixel probabilities are obtained in this embodiment, the semantic class of each pixel can be determined, that is, the semantics of the picture can be analyzed.
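As a hypothetical illustration of this last step (the class names are invented for the example), the Softmax over the final layer's per-pixel scores and the resulting label map can be computed as:

```python
import numpy as np

def softmax_per_pixel(logits):
    """logits: (num_classes, H, W) -> per-pixel class probabilities."""
    z = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def label_map(logits):
    """Index of the most probable class at each pixel."""
    return softmax_per_pixel(logits).argmax(axis=0)

# Hypothetical 3-class example ("road", "lawn", "house")
classes = ["road", "lawn", "house"]
logits = np.zeros((3, 2, 2))
logits[1, 0, 0] = 5.0   # top-left pixel scores strongly for "lawn"
labels = label_map(logits)
```

Each pixel's probabilities sum to 1, and the argmax turns the probability volume into a semantic label per pixel.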
In this embodiment, the U-Net model is modified correspondingly to process aerial images. The key frames processed by the U-Net model are 256 × 256 color images; padding is applied at every layer, so that the input to each layer, convolved from the layer above, is enriched and the features most relevant to the key frame are preserved.
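The dimension bookkeeping behind this modification can be checked with a small helper (illustrative only; four pooling levels with "same"-padded convolutions is the standard U-Net configuration, assumed here since the text does not state the depth):

```python
def unet_spatial_sizes(input_size=256, depth=4):
    """Spatial size at each level of a U-Net with 'same'-padded 3x3
    convolutions and 2x2 max pooling: sizes halve down the contracting
    path and double back up the symmetric expanding path, so the
    output resolution matches the input."""
    down = [input_size // (2 ** d) for d in range(depth + 1)]  # 256..16
    up = list(reversed(down[:-1]))                             # 32..256
    return down, up
```

For a 256 × 256 key frame this gives 256 → 128 → 64 → 32 → 16 on the contracting path and 32 → 64 → 128 → 256 on the expanding path, so the per-class probability map is produced at full key-frame resolution.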
Embodiments also provide a computing device, referring to FIG. 2, comprising a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110. The computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 of any of the methods according to the invention.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 3, the computer readable storage medium comprises a storage unit for program code provided with a program 1131' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. The semantic segmentation method for the aerial video data is characterized in that an aerial video data set is trained and identified through a shot boundary detection algorithm to obtain key frames in the aerial video data set and form a key frame data set, and then the key frame data set is subjected to semantic segmentation through a semantic segmentation algorithm based on a full convolution network.
2. The method of claim 1, wherein the shot boundary detection algorithm identifies shot boundaries for successive frames in the set of aerial video data by calculating histogram differences for successive frames and comparing the histogram differences to a set threshold.
3. The method for semantic segmentation of aerial video data according to claim 2, wherein the shot boundary detection algorithm identifies shot boundaries for successive frames in the aerial video data set by partitioning non-overlapping meshes in combination with histogram difference calculation.
4. The method of claim 3, wherein when the shot boundary detection algorithm identifies the shot boundary of each frame using non-overlapping grids combined with histogram difference calculation, each frame is divided into non-overlapping grids of size 16 × 16, the chi-square distance is then used to calculate the histogram difference of corresponding grids in two adjacent frames, the histogram average difference between the two consecutive frames is then calculated, and finally the histogram average difference is compared with a set threshold T_shot to identify shot boundaries.
5. The semantic segmentation method for aerial video data according to claim 4, wherein the formula for calculating the histogram difference of corresponding grids in two adjacent frames using the chi-square distance is:

D_k(H_i, H_{i+1}) = Σ_I [H_i(I) − H_{i+1}(I)]² / [H_i(I) + H_{i+1}(I)]

wherein H_i represents the histogram of the i-th frame, H_{i+1} represents the histogram of the (i+1)-th frame, and I indicates the image block at the same position in the two frames.
6. The method of semantic segmentation for aerial video data of claim 5, wherein the histogram average difference between two consecutive frames is calculated as:

D = (1/N) Σ_{k=1}^{N} D_k

wherein D is the average histogram difference of the two consecutive frames, D_k is the chi-square difference between the k-th image blocks, and N represents the total number of image blocks in the image.
8. The semantic segmentation method for aerial video data according to claim 2, wherein a U-Net model is adopted in performing semantic segmentation on the key frame data set through the semantic segmentation algorithm based on a fully convolutional network; the U-Net model comprises a contracting path and a symmetric expanding path; features in a key frame are convolved along the contracting path and extracted through a ReLU activation function; max pooling is applied to the extracted features to identify relevant features; and a Softmax activation is applied to the last layer of the U-Net model to obtain the pixel probability of each class.
9. The method of claim 8, wherein the key frames processed by the U-Net model are 256 × 256 color images, and padding is applied at each layer of the U-Net model, preserving the features most relevant to the key frame.
10. A computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011459565.8A CN112613516A (en) | 2020-12-11 | 2020-12-11 | Semantic segmentation method for aerial video data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011459565.8A CN112613516A (en) | 2020-12-11 | 2020-12-11 | Semantic segmentation method for aerial video data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112613516A true CN112613516A (en) | 2021-04-06 |
Family
ID=75233598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011459565.8A Pending CN112613516A (en) | 2020-12-11 | 2020-12-11 | Semantic segmentation method for aerial video data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112613516A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023000159A1 (en) * | 2021-07-20 | 2023-01-26 | 海南长光卫星信息技术有限公司 | Semi-supervised classification method, apparatus and device for high-resolution remote sensing image, and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590442A (en) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | A kind of video semanteme Scene Segmentation based on convolutional neural networks |
CN108182421A (en) * | 2018-01-24 | 2018-06-19 | 北京影谱科技股份有限公司 | Methods of video segmentation and device |
CN109753913A (en) * | 2018-12-28 | 2019-05-14 | 东南大学 | Calculate efficient multi-mode video semantic segmentation method |
CN109919044A (en) * | 2019-02-18 | 2019-06-21 | 清华大学 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
CN110782469A (en) * | 2019-10-25 | 2020-02-11 | 北京达佳互联信息技术有限公司 | Video frame image segmentation method and device, electronic equipment and storage medium |
CN110852961A (en) * | 2019-10-28 | 2020-02-28 | 北京影谱科技股份有限公司 | Real-time video denoising method and system based on convolutional neural network |
Non-Patent Citations (1)
Title |
---|
GIRISHA, S., et al.: "Semantic segmentation of UAV aerial videos using convolutional neural networks", IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering, pages 21-27 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111010590B (en) | Video clipping method and device | |
CN107274433B (en) | Target tracking method and device based on deep learning and storage medium | |
CN109035304B (en) | Target tracking method, medium, computing device and apparatus | |
JP6474854B2 (en) | Method and apparatus for updating a background model | |
AU2009243442B2 (en) | Detection of abnormal behaviour in video objects | |
US10068137B2 (en) | Method and device for automatic detection and tracking of one or multiple objects of interest in a video | |
CN111311475A (en) | Detection model training method and device, storage medium and computer equipment | |
Girisha et al. | Semantic segmentation of UAV aerial videos using convolutional neural networks | |
CN110287877B (en) | Video object processing method and device | |
CN113191180B (en) | Target tracking method, device, electronic equipment and storage medium | |
CN109859250B (en) | Aviation infrared video multi-target detection and tracking method and device | |
CN113205138B (en) | Face and human body matching method, equipment and storage medium | |
CN111753590A (en) | Behavior identification method and device and electronic equipment | |
CN115511920A (en) | Detection tracking method and system based on deep sort and deep EMD | |
CN110795599B (en) | Video emergency monitoring method and system based on multi-scale graph | |
CN115761655A (en) | Target tracking method and device | |
Mishra | Video shot boundary detection using hybrid dual tree complex wavelet transform with Walsh Hadamard transform | |
CN110969645A (en) | Unsupervised abnormal track detection method and unsupervised abnormal track detection device for crowded scenes | |
CN112613516A (en) | Semantic segmentation method for aerial video data | |
JP2014110020A (en) | Image processor, image processing method and image processing program | |
CN115187884A (en) | High-altitude parabolic identification method and device, electronic equipment and storage medium | |
KR20170095599A (en) | System and method for video searching | |
CN110956649A (en) | Method and device for tracking multi-target three-dimensional object | |
CN113762027B (en) | Abnormal behavior identification method, device, equipment and storage medium | |
CN112686828B (en) | Video denoising method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||