CN113378717A - Video identification method based on key object splicing, device, storage medium and terminal


Info

Publication number
CN113378717A
Authority
CN
China
Prior art keywords
frame
image data
frame image
key object
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110652794.XA
Other languages
Chinese (zh)
Other versions
CN113378717B (en)
Inventor
宋卓然
鲁恒
景乃锋
梁晓峣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110652794.XA
Publication of CN113378717A
Application granted
Publication of CN113378717B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video identification method based on key object splicing, together with a device, a storage medium and a terminal. The method comprises: decoding a target video; acquiring an I-type frame image recognition result; acquiring all key object rectangular frames in the P-type frame image data and in the B-type frame image data through an object tracking algorithm; aggregating the acquired key object rectangular frames into synthesized frames through an object aggregation algorithm and inputting the synthesized frames into a preset deep neural network to obtain synthesized frame recognition results; and splitting the synthesized frame recognition results through an object splitting algorithm and returning the split results to the original image data. By squeezing non-key information out of the input to the preset deep neural network, the method reduces the redundant computation associated with each video frame, greatly reduces the computational workload of the target video recognition task, and improves both the processing speed and the recognition accuracy of the recognition task.

Description

Video identification method based on key object splicing, device, storage medium and terminal
Technical Field
The invention relates to the technical field of neural networks, and in particular to a video identification method based on key object splicing, together with a device, a storage medium and a terminal.
Background
Deep convolutional neural networks are widely used in image recognition tasks such as image classification, detection and segmentation. As these networks have matured, their application has gradually been extended to the video domain.
In general, a video recognition task based on a deep neural network can treat each video frame as an independent picture and feed it into the deep neural network for recognition; that is, video recognition is reduced to an image recognition task performed frame by frame. However, directly applying a network model designed for image recognition to every video frame incurs enormous computation and energy overhead. Moreover, such networks are good at processing static objects but cannot capture the motion of objects between video frames, which leads to low video recognition accuracy.
Researchers have therefore proposed deep neural network models dedicated to the video recognition task, which exploit the temporal locality between video frames to further improve recognition accuracy. Caelles et al. propose a two-stream FCN model that segments the foreground and contour of each frame, but the two-stream FCN still has to be applied to every frame, so the method consumes a great deal of time and energy, and because it does not exploit the temporal locality between video frames, its recognition accuracy is hard to guarantee. For higher accuracy, Cheng et al. propose SegFlow, which uses a neural network to extract inter-frame temporal locality in the form of optical flow and then uses the optical flow to assist the per-frame recognition network in producing the final result. However, this method spends too much effort extracting the optical flow, so the speedup it achieves on a Titan X GPU is limited.
Disclosure of Invention
The technical problem the invention aims to solve is that existing neural-network-based video recognition typically has to process every frame of the video, which is time-consuming and energy-intensive, while the recognition accuracy is difficult to guarantee and the recognition speed is difficult to improve.
In order to solve the technical problem, the invention provides a real-time video identification method based on key object splicing, which comprises the following steps:
decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data, B-type frame image data, a motion vector table and an intra-frame prediction mode table of the target video;
inputting the I-type frame image data into a preset deep neural network to obtain an I-type frame image recognition result;
acquiring all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image identification result, the motion vector table and the intra-frame prediction mode table;
aggregating the acquired key object rectangular frames through an object aggregation algorithm to obtain a plurality of synthesized frames, and inputting all the synthesized frames into the preset deep neural network to obtain a plurality of synthesized frame identification results;
and splitting all the synthesized frame identification results through an object splitting algorithm, and returning the splitting results to the B-type frame image data and the P-type frame image data to obtain a P-type frame image identification result and a B-type frame image identification result.
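For illustration only, the flow of these five steps can be sketched as follows. This is an assumed skeleton, not an implementation of the claimed method: all of the callables (decode, dnn, track, aggregate, split) and the frame attributes (index, pixels) are hypothetical placeholders standing in for the operations described above.

```python
from typing import Callable, Dict, Tuple

def recognize_video(
    target_video: bytes,
    decode: Callable,      # step 1: preset video decoder
    dnn: Callable,         # the preset deep neural network
    track: Callable,       # step 3: object tracking algorithm
    aggregate: Callable,   # step 4: object aggregation algorithm
    split: Callable,       # step 5: object splitting algorithm
) -> Tuple[Dict, Dict]:
    """Orchestrates the five steps above; every callable is an assumed placeholder."""
    # Step 1: decode into I/P/B frame image data plus the motion vector table
    # and intra-frame prediction mode table.
    i_frames, p_frames, b_frames, mv_table, intra_table = decode(target_video)

    # Step 2: run the preset deep neural network on I-type frames only.
    i_results = {f.index: dnn(f.pixels) for f in i_frames}

    # Step 3: track key object rectangular frames in the P-type and B-type frames.
    key_boxes = track(p_frames, b_frames, i_results, mv_table, intra_table)

    # Step 4: aggregate the key object boxes into synthesized frames and
    # recognize only the synthesized frames.
    synthesized = aggregate(key_boxes)
    synth_results = [dnn(s.pixels) for s in synthesized]

    # Step 5: split the synthesized-frame results and return them to the
    # original P-type and B-type frames.
    pb_results = split(synth_results, key_boxes)
    return i_results, pb_results
```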
Preferably, the step of obtaining all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image recognition result, the motion vector table and the intra-frame prediction mode table includes:
sequentially acquiring a temporary identification result of each frame of P frame image data and a temporary identification result of each frame of B frame image data according to a preset tracking sequence through a preset recovery operation on the basis of the I-type frame image identification result, the motion vector table and the intra-frame prediction mode table;
sequentially traversing the temporary identification result of each frame of P frame image data and the temporary identification result of each frame of B frame image data, acquiring a key segmentation small block in each frame of P frame image data and a key segmentation small block in each frame of B frame image data, and acquiring a key object identification frame in each frame of P frame image data and a key object identification frame in each frame of B frame image data based on the key segmentation small block in each frame of P frame image data and the key segmentation small block in each frame of B frame image data;
the preset tracking sequence is the decoding order of the target video after the I-type frame image data has been removed.
Preferably, the acquiring of the temporary recognition result of the single-frame P-frame image data or the temporary recognition result of the single-frame B-frame image data by the preset restoration operation includes:
taking the P frame image data or B frame image data whose temporary identification result is to be obtained as target image data;
acquiring a first-class reference segmentation small block of a part of segmentation small blocks in the target image data based on the I-class frame image recognition result, the preorder frame image temporary recognition result set and the motion vector table, and respectively copying the segmentation recognition result of the first-class reference segmentation small block to the corresponding segmentation small block in the target image data to obtain a first recognition result of the target image data;
acquiring a second type of reference segmentation small blocks of other segmentation small blocks in the target image data based on the first identification result of the target image data and the intra-frame prediction mode table, and respectively copying the segmentation identification results of the second type of reference segmentation small blocks to corresponding segmentation small blocks in the target image data to obtain a temporary identification result of the target image data;
wherein the preamble frame image temporary recognition result set includes temporary recognition results of all currently acquired image data.
Preferably, traversing the temporary recognition result of the single frame image data, acquiring the key segmentation small block in the frame image data, and acquiring the key object recognition frame in the frame image data based on the key segmentation small block in the frame image data comprises:
traversing a temporary recognition result of single frame image data, taking a segmentation small block containing a preset color pixel in the temporary recognition result of the frame image data as a temporary segmentation small block, and taking a segmentation small block corresponding to the temporary segmentation small block in the image data as a key segmentation small block;
and taking the minimum rectangular frame containing all the key segmentation small blocks in the frame of image data as a key object rectangular frame, and recording the original position information of the key object rectangular frame.
Preferably, the aggregating the obtained key object rectangular boxes by the object aggregation algorithm to obtain a plurality of synthesized frames includes:
sequentially arranging all the key object rectangular frames to form an updated object list;
sequentially placing all key object rectangular frames in the updated object list into a plurality of idle frames to form a plurality of synthesized frames;
wherein, the step of placing the key object rectangle frame in the updated object list into an idle frame to form a composite frame comprises:
constructing an idle frame as an idle frame to be placed, and gathering idle areas in the idle frame to be placed into an idle area list;
and sequentially placing the key object rectangular frames in the updated object list into the idle areas in the idle area list according to a preset placing mode until the key object rectangular frames to be placed cannot select the placeable idle areas from the idle area list, and completing synthesis of the idle frames to be placed to form a synthesized frame.
Preferably, the sequentially placing the key object rectangular frames in the updated object list in the free area list according to a preset placing mode includes:
determining the first key object rectangular frame in the updated object list as a key object rectangular frame to be placed;
screening out an idle area with the length and the width respectively larger than those of the rectangular frame of the key object to be placed from the idle area list, and taking the idle area as an idle area to be placed;
placing the rectangular frame of the key object to be placed in the upper left corner of the idle area to be placed, and recording the placement position information of the rectangular frame of the key object to be placed;
dividing the idle area to be placed, acquiring new idle areas and storing them in the idle area list;
and removing the idle area to be placed from the idle area list, removing the rectangular frame of the key object to be placed from the updated object list, and re-determining the rectangular frame of the key object to be placed.
Preferably, the dividing the region to be placed includes:
acquiring the height difference and the width difference between the rectangular frame of the key object to be placed and the idle area to be placed;
when the height difference is larger than the width difference, dividing the to-be-placed idle area in which the to-be-placed key object rectangular frame is placed along a straight line where the outer edge of the bottom edge of the to-be-placed key object rectangular frame is located;
and when the height difference is smaller than the width difference, dividing the to-be-placed idle area in which the to-be-placed key object rectangular frame is placed along the straight line where the right outer edge of the to-be-placed key object rectangular frame is located.
Preferably, the step of splitting all the synthesized frame recognition results by an object splitting algorithm and returning the split results to the B-class frame image data and the P-class frame image data to obtain the P-class frame image recognition results and the B-class frame image recognition results includes:
splitting all the synthesized frame identification results respectively based on the placement position information of all the key object rectangular frames to obtain key object identification results corresponding to all the key object rectangular frames;
and sequentially returning the key object identification results corresponding to all the key object rectangular frames to the B-type frame image data and the P-type frame image data based on the original position information of all the key object rectangular frames to obtain a P-type frame image identification result and a B-type frame image identification result.
In order to solve the technical problem, the invention also provides a real-time video identification device based on key object splicing, which comprises a decoding module, an I-type frame image identification result acquisition module, a key object rectangular frame acquisition module, an aggregation module and a splitting and returning module which are sequentially connected;
the decoding module is used for decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data, B-type frame image data, a motion vector table and an intra-frame prediction mode table of the target video;
the I-type frame image recognition result acquisition module is used for inputting the I-type frame image data into a preset deep neural network to obtain an I-type frame image recognition result;
the key object rectangular frame obtaining module is configured to obtain all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image recognition result, the motion vector table, and the intra-frame prediction mode table;
the aggregation module is used for aggregating the acquired key object rectangular frames through an object aggregation algorithm to obtain a plurality of synthesized frames, and inputting all the synthesized frames into the preset deep neural network to obtain a plurality of synthesized frame identification results;
the splitting and returning module is used for splitting all the synthesized frame identification results through an object splitting algorithm and returning the splitting results to the B-type frame image data and the P-type frame image data to obtain a P-type frame image identification result and a B-type frame image identification result.
In order to solve the above technical problem, the present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the video identification method based on key object splicing according to any one of claims 1 to 8.
In order to solve the above technical problem, the present invention further provides a terminal, including: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored by the memory to enable the terminal to execute the video identification method based on key object splicing according to any one of claims 1 to 7.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
By applying the real-time video identification method based on key object splicing provided by the embodiment of the invention, the target video is decoded and the decoded I-type frame image data is fed into the preset deep neural network to obtain the I-type frame image recognition result. The key object rectangular frames in the P-type and B-type frame image data are then obtained based on the I-type frame image recognition result, the decoded motion vector table and the intra-frame prediction mode table, and are aggregated into synthesized frames; only the synthesized frames are fed into the preset deep neural network for recognition, and the recognition results are split and distributed back to the frames to which they belong, completing the recognition task for the target video. By aggregating the key objects of several consecutive video frames and using the synthesized frames as the input of the deep neural network, the method reduces the amount of data fed to the network; in other words, by squeezing non-key information out of the input to the preset deep neural network, it reduces the redundant computation associated with the video frames, greatly reduces the computational workload of the target video recognition task, and improves both the processing speed and the recognition accuracy of the recognition task.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram illustrating steps of a method for real-time video recognition based on key object splicing according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an exemplary process of a method for real-time video recognition based on key object splicing according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method for real-time video identification based on key object splicing according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an exemplary process of an object tracking algorithm according to an embodiment of the invention;
FIG. 5 is a diagram illustrating an example of a composite frame formation process in accordance with one embodiment of the present invention;
FIG. 6 is a diagram illustrating a specific process of an object splitting algorithm according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a real-time video recognition apparatus based on key object splicing according to a second embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.
Deep convolutional neural networks are widely used in image recognition tasks such as image classification, detection and segmentation, and their application has gradually been extended to the video domain. Researchers have proposed deep neural network models dedicated to the video recognition task that exploit the temporal locality between video frames to further improve recognition accuracy. Caelles et al. propose a two-stream FCN model that segments the foreground and contour of each frame, but the two-stream FCN still has to be applied to every frame, so the method consumes a great deal of time and energy, and because it does not exploit the temporal locality between video frames, its recognition accuracy is hard to guarantee. For higher accuracy, Cheng et al. propose SegFlow, which uses a neural network to extract inter-frame temporal locality in the form of optical flow and then uses the optical flow to assist the per-frame recognition network in producing the final result. However, this method spends too much effort extracting the optical flow, so the speedup it achieves on a Titan X GPU is limited.
Example one
In order to solve the technical problems in the prior art, the embodiment of the invention provides a real-time video identification method based on key object splicing.
FIG. 1 is a schematic diagram illustrating steps of a method for real-time video recognition based on key object splicing according to an embodiment of the present invention; FIG. 2 is a diagram illustrating an exemplary process of a method for real-time video recognition based on key object matching according to an embodiment of the present invention; FIG. 3 is a flow chart of a method for real-time video identification based on key object splicing according to an embodiment of the present invention; referring to fig. 1 to 3, a method for identifying a real-time video based on key object splicing according to an embodiment of the present invention includes the following steps.
In the task of recognizing a target video, each frame of image data may contain several objects to be recognized. For simplicity of description, the following takes a single-object video segmentation task as an example; the method can of course also be applied to multi-object video segmentation tasks and video detection tasks. If several objects need to be recognized or detected, either all of them can be set as key objects and the target video recognized with the method below, or the objects can be taken as the key object one at a time and the method repeated for each. The latter of the two approaches is more efficient when there are multiple recognition objects.
Step S101, decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data, B-type frame image data, a motion vector table and an intra-frame prediction mode table of the target video.
Specifically, the video coding standard of the target video in the embodiment of the present invention must distinguish I-frame, B-frame and P-frame image data, divide each frame of image data into blocks of a preset size, and provide a motion vector table and an intra-frame prediction mode table. For example, the target video may be encoded with H.265, in which case the divided small blocks are coding tree blocks; it may also be encoded with H.264, in which case the divided small blocks are macroblocks. A motion vector is the quantity by which the video decoder expresses the motion trajectory of a divided small block, recorded in the code stream as a dependency relationship. The video decoder calls the frames and small blocks that are depended on reference frames and reference small blocks; the motion vector table therefore records, for each B-frame and P-frame image data in the target video, the reference frames it depends on, and for each small block in the B-frame and P-frame image data, the reference small blocks it depends on.
It should be noted that each frame of image data is divided using the divided small block as the basic unit; a typical size of a divided small block is 8 × 8 pixels. During decoding, the video decoder decompresses the bitstream back into successive video frames in a specified decoding order. For I-frame image data, the divided small blocks are decoded intra-frame; for P-frame and B-frame image data, the divided small blocks are decoded intra-frame or inter-frame using reference divided small blocks, motion vectors and residuals. The specific decoding process for I-type, P-type and B-type frame image data is as follows. For I-type frame image data, each divided small block selects, according to the intra-frame prediction mode, a divided small block in a certain direction (for example above, below, to the left or to the right) and adds the residual between the two blocks to obtain the final value of the block. For P-type frame image data, each divided small block may be coded intra-frame or predicted inter-frame; when decoding a divided small block of a P-type frame, the video decoder therefore first determines from the block's information whether the decoding mode is inter-frame or intra-frame. If it is intra-frame prediction, the block is decoded intra-frame; if it is inter-frame prediction, the video decoder locates the reference divided small block in a preceding reference frame according to the motion vector and adds the residual between the two blocks to obtain the final value of the block. For B-type frame image data, the decoder likewise determines whether the decoding mode is inter-frame or intra-frame; if it is inter-frame prediction, the preset decoder locates the reference divided small block, according to the motion vector, in a preceding or following reference frame in the video playing order and adds the residual between the blocks to obtain the final value of each divided small block.
Therefore, by decoding the target video with the preset video decoder, not only the I-type, P-type and B-type frame image data of the target video but also the corresponding motion vector table and intra-frame prediction mode table can be obtained. The H.265 coding standard defines 35 intra prediction modes.
The video decoder records the decoding order of the frames according to the inter-frame dependency relationships during decoding, so the decoding order and the playing order of the frames in the target video are usually different. For example, suppose (I0, B1, B2, B3, P4, I5, B6, P7) is the playing order of the video; then (I0, P4, B3, B2, B1, I5, P7, B6) is the actual decoding order, because B3 depends on I0 and P4. The video decoder then converts the code stream back into a conventional sequence of frames according to this decoding order. It should be noted that all decoded I, P and B frames are written back to global storage or a buffer for display.
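As an illustration, the decoder outputs described above might be represented as follows. The field layout of the two table entries follows the 6-tuple motion vector and 4-tuple prediction mode examples used later in the text; everything else (names, types) is an assumption for the sketch.

```python
from dataclasses import dataclass

@dataclass
class MotionVector:
    """One motion vector table entry, e.g. (P4, I0, 501, 120, 512, 128): the block
    at target coordinates (dst_x, dst_y) in target_frame points to the reference
    block at (ref_x, ref_y) in reference_frame."""
    target_frame: str
    reference_frame: str
    ref_x: int
    ref_y: int
    dst_x: int
    dst_y: int

@dataclass
class IntraPredictionMode:
    """One intra-frame prediction mode table entry, e.g. (P4, 480, 120, 0): the block
    at (dst_x, dst_y) in frame uses prediction mode `mode` (H.265 defines 35 modes)."""
    frame: str
    dst_x: int
    dst_y: int
    mode: int

# Playing order versus decoding order, as in the example above: B3 depends on
# I0 and P4, so P4 must be decoded before B1, B2 and B3.
playing_order = ["I0", "B1", "B2", "B3", "P4", "I5", "B6", "P7"]
decoding_order = ["I0", "P4", "B3", "B2", "B1", "I5", "P7", "B6"]
```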
Step S102, inputting the I-type frame image data into a preset deep neural network to obtain an I-type frame image recognition result.
Specifically, the I-type frame image data is input into a preset deep neural network corresponding to the recognition task to obtain an I-type frame image recognition result in the target video.
Step S103, based on the I-type frame image identification result, the motion vector table and the intra-frame prediction mode table, all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data are obtained through an object tracking algorithm.
In this step, the motion vectors and intra-frame prediction modes generated during video decoding are mainly used to track the key objects in the P-type and B-type frame image data. The object tracking algorithm in this embodiment consists of two operations: a recovery operation and a classification operation. The output of the recovery operation is an approximate recognition result (i.e. a temporary recognition result) of the P-type and B-type frame image data roughly outlined by the motion vectors; this result is fed into the classification operation, which locates the specific coordinates of the key objects in the P-type and B-type frame image data with rectangular frames.
Specifically, the preset recovery operation in this embodiment copies, in a copy-and-paste manner, the recognition results of the reference divided small blocks of the P-type and B-type frame image data into the P-type and B-type frame image data to obtain their temporary recognition results. Based on this, obtaining the key objects in the P-type and B-type frame image data includes: acquiring the temporary recognition result of each frame of P-frame image data and of each frame of B-frame image data through the preset recovery operation, based on the motion vector table and intra-frame prediction mode table obtained by decoding the target video and on the acquired I-type frame image recognition result. The temporary recognition result of each frame of P-frame image data and of each frame of B-frame image data is acquired sequentially according to a preset tracking order, where the preset tracking order is the decoding order of the target video with the I-type frame image data removed.
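For example, assuming frames are labelled by type as in the decoding-order example above, the preset tracking order can be derived with a small sketch like this (the function name is illustrative):

```python
def preset_tracking_order(decoding_order):
    """Preset tracking order: the decoding order with I-type frames removed."""
    return [name for name in decoding_order if not name.startswith("I")]

# ['I0', 'P4', 'B3', 'B2', 'B1', 'I5', 'P7', 'B6'] -> ['P4', 'B3', 'B2', 'B1', 'P7', 'B6']
print(preset_tracking_order(["I0", "P4", "B3", "B2", "B1", "I5", "P7", "B6"]))
```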
Further, the process of obtaining the temporary identification result of the P-type frame image data and the temporary identification result of the B-type frame image data is a process of repeatedly obtaining a plurality of temporary identification results of single frame image data through the same operation. In order to describe the embodiments of the present invention in more detail, the following description will take the process of acquiring the temporary recognition result of the single-frame P-frame image data or the temporary recognition result of the single-frame B-frame image data as an example. Firstly, supposing P frame image data or B frame image data of a temporary identification result to be obtained as target image data; secondly, the identification results of the image data of the preorder frame before the target image data in the preset tracking sequence, namely the temporary identification results of all the currently acquired image data are collected into a preorder frame image temporary identification result set to be used as an acquisition basis for acquiring the temporary identification results of the target image data. It should be noted that, each time the process of acquiring the temporary identification result of one frame of image data is completed, the temporary identification result can be stored in the identification result set of the image data of the preceding frame, so as to be used as the basis for acquiring the temporary identification result of the image data of the subsequent frame. And when the target image data is the first frame image data in the preset tracking sequence, the temporary identification result set of the preorder frame image is an empty set.
Since the P-frame image data or the B-frame image data have inter-frame prediction and intra-frame prediction in the decoding process, the process of acquiring the temporary identification result of single-frame P-frame image data or single-frame B-frame image data specifically includes: acquiring a first-class reference segmentation small block of a part of segmentation small blocks in target image data based on the acquired I-class frame image identification result, the preceding frame image temporary identification result set and the motion vector table, and respectively copying the segmentation identification result of the first-class reference segmentation small block to the corresponding segmentation small block in the target image data to obtain a first identification result of the target image data; and acquiring a second type of reference segmentation small blocks of the other segmentation small blocks in the target image data based on the first identification result of the target image data and the intra-frame prediction mode table, and respectively copying the segmentation identification results of the second type of reference segmentation small blocks to the corresponding segmentation small blocks in the target image data to obtain the temporary identification result of the target image data.
FIG. 4 is a diagram illustrating an exemplary process of the object tracking algorithm in an embodiment of the invention. To show how the temporary recognition results of the P-frame and B-frame image data are obtained, the following takes the recovery process of the P4 frame image data as an example. The recovery operation consists of inter-frame recovery followed by intra-frame recovery. As shown in fig. 4, in the inter-frame recovery mode the P4 frame has three motion vectors. The divided small block with target coordinates (dstx, dsty) = (512, 128) has the motion vector (P4, I0, 501, 120, 512, 128), i.e. the divided small block at (512, 128) in the P4 frame image data points to the divided small block at (501, 120) in the I0 frame image data; we can therefore locate the segmentation recognition result of the corresponding reference divided small block in I0, take it out and write it back to the position with coordinates (512, 128) in the P4 frame image data. For the motion vector (P4, I0, 625, 302, 616, 328), the divided small block is located in the P4 frame image data at coordinates (616, 328) and points to coordinates (625, 302) in I0; we locate the segmentation recognition result of the corresponding reference divided small block in I0, take it out and write it back to the position with coordinates (616, 328) in the P4 frame. All remaining motion vectors that take the P4 frame image data as the target frame are processed in the same way, which completes the inter-frame recovery. In the intra-frame recovery mode, the P4 frame image data contains three prediction modes. The divided small block with target coordinates (dstx, dsty) = (480, 120) has the prediction mode (P4, 480, 120, 0), where 0 indicates the 0th prediction mode, meaning that the reference divided small block of the current block is the block directly above it; we can therefore locate the segmentation recognition result of the reference block directly above the current block in the P4 frame image data, take it out and write it back to the position with coordinates (480, 120). For the prediction mode (P4, 360, 80, 7), where 7 indicates the 7th prediction mode, the reference divided small block of the current block is the block on its right side; we locate the segmentation recognition result of the reference block on the right of the current block in the P4 frame image data, take it out and write it back to the position with coordinates (360, 80). All remaining prediction modes that take the P4 frame image data as the target frame are processed in the same way, which completes the intra-frame recovery.
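A minimal sketch of the recovery operation described above is given below. It assumes 8 × 8-pixel divided small blocks and per-pixel label arrays for the recognition results; the mapping of prediction modes 0 and 7 to the "above" and "right" reference blocks follows the worked example rather than the actual H.265 mode definitions, and the frame size is assumed.

```python
import numpy as np

BLOCK = 8  # assumed size of a divided small block, in pixels

# Direction of the reference block for the two prediction modes used in the
# example above (an illustrative assumption, not the H.265 definition).
INTRA_OFFSET = {0: (0, -BLOCK),   # mode 0: block directly above
                7: (BLOCK, 0)}    # mode 7: block to the right

def recover_inter(results, mv):
    """Inter-frame recovery: copy the recognition result of the reference block
    to the target block. `results` maps frame name -> 2-D label array,
    `mv` is (target_frame, ref_frame, ref_x, ref_y, dst_x, dst_y)."""
    tgt, ref, rx, ry, dx, dy = mv
    results[tgt][dy:dy + BLOCK, dx:dx + BLOCK] = \
        results[ref][ry:ry + BLOCK, rx:rx + BLOCK]

def recover_intra(results, entry):
    """Intra-frame recovery: copy the result of the reference block inside the
    same frame, chosen by the prediction mode, to the target block.
    `entry` is (frame, dst_x, dst_y, mode)."""
    frame, dx, dy, mode = entry
    ox, oy = INTRA_OFFSET[mode]
    results[frame][dy:dy + BLOCK, dx:dx + BLOCK] = \
        results[frame][dy + oy:dy + oy + BLOCK, dx + ox:dx + ox + BLOCK]

# Worked example from the text (frame size assumed for illustration):
results = {"I0": np.zeros((720, 1280), np.uint8),
           "P4": np.zeros((720, 1280), np.uint8)}
results["I0"][120:128, 501:509] = 255                      # key pixels in the reference block
recover_inter(results, ("P4", "I0", 501, 120, 512, 128))   # writes back to (512, 128) in P4
recover_intra(results, ("P4", 480, 120, 0))                # copies from the block above, at (480, 112)
```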
After the temporary identification result of each frame of P frame image data and the temporary identification result of each frame of B frame image data are obtained, the key object identification frame in each frame of P frame image data and the key object identification frame in each frame of B frame image data still need to be acquired. This work specifically includes: successively traversing the temporary identification result of each frame of P frame image data and the temporary identification result of each frame of B frame image data, acquiring the key segmentation small blocks in each frame of P frame image data and in each frame of B frame image data, and acquiring the key object identification frame in each frame of P frame image data and in each frame of B frame image data based on those key segmentation small blocks.
The process of acquiring the key object identification frame in each frame of P-frame image data and in each frame of B-frame image data repeats the same operation frame by frame. The following therefore describes the process for a single frame of image data; repeating it for every P-frame and B-frame yields all the key object identification frames of the P-frame and B-frame image data.
Traversing the temporary recognition result of the single frame of image data, acquiring the key segmentation small blocks in the frame of image data, and acquiring the key object recognition frame in the frame of image data based on the key segmentation small blocks in the frame of image data comprises: traversing a temporary recognition result of single frame image data, taking a segmentation small block containing a preset color pixel in the temporary recognition result of the frame image data as a temporary segmentation small block, and taking a segmentation small block corresponding to the temporary segmentation small block in the image data as a key segmentation small block; and taking the minimum rectangular frame containing all key segmentation small blocks in the frame of image data as a key object rectangular frame, and recording the original position information of the key object rectangular frame so as to facilitate subsequent recovery.
It should be noted that the process of acquiring the key object identification frame in the P-frame image data and the B-frame image data does not have to be performed after acquiring the temporary identification results of all the P-frame image data and all the B-frame image data. The temporary recognition result obtaining process and the key object recognition result obtaining process may be performed simultaneously, that is, after the temporary recognition result of one frame of image data is obtained, the key object recognition frame of the frame of image data may be obtained based on the temporary recognition result of the frame of image data.
Fig. 4 also shows the procedure by which the temporary recognition results of the P4 frame image data and the B3 frame image data are combined to form a composite frame. Referring to fig. 4, in the classification operation we traverse the temporary recognition results of the P4 and B3 frame image data respectively and check the recognition result of each pixel. Suppose the recognition result pixels corresponding to key objects are white (pixel value 255) and those corresponding to non-key objects are black (pixel value 0); pixels can then be divided into key pixels and non-key pixels according to their values in the recognition result. Finally, all key pixels are connected, a rectangular frame covering all key pixels is drawn, and the diagonal coordinates of the rectangular frame are recorded. Note that because the recovery operation is performed in units of divided small blocks, in practice the classification operation is also performed in units of divided small blocks. Specifically, the temporary recognition results of the P4 and B3 frame image data are traversed and each divided small block is checked for white pixels in its recognition result: a block containing a white pixel is regarded as a key block, and a block without one as a non-key block. Finally, all key blocks are connected, a rectangular frame covering all key blocks is drawn, and the coordinate information of the rectangular frame within the current P-frame or B-frame image data is recorded as the original position information (the original position information is later used by the object splitting algorithm to split the recognized results back into the frames to which the original images belong). The reason for locating the key objects in the P-frame and B-frame image data with rectangular frames is to facilitate the subsequent object aggregation operation.
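A sketch of the classification operation under the same assumptions (8 × 8 blocks, white = 255 for key pixels) might look like this; the function name and the (x0, y0, x1, y1) box representation are illustrative.

```python
import numpy as np

BLOCK = 8  # assumed size of a divided small block, in pixels

def key_object_rectangle(temp_result: np.ndarray):
    """Classification operation: scan the temporary recognition result block by
    block, treat every block containing a white (255) pixel as a key block, and
    return the smallest rectangle covering all key blocks as (x0, y0, x1, y1)
    original position information, or None if the frame has no key object."""
    h, w = temp_result.shape
    xs, ys = [], []
    for by in range(0, h, BLOCK):
        for bx in range(0, w, BLOCK):
            if (temp_result[by:by + BLOCK, bx:bx + BLOCK] == 255).any():
                xs.extend((bx, min(bx + BLOCK, w)))
                ys.extend((by, min(by + BLOCK, h)))
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)

# Continuing the recovery example above, the original position information of
# the key object in P4 would be recorded as:
# original_position["P4"] = key_object_rectangle(results["P4"])
```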
It should be noted that the original location information may be stored in the updated object list after being acquired.
Step S104, aggregating the acquired key object rectangular frames through an object aggregation algorithm to obtain a plurality of synthesized frames, and inputting all the synthesized frames into a preset deep neural network to obtain a plurality of synthesized frame identification results.
Specifically, after the object tracking algorithm has been applied to the P-type and B-type frame image data, the key objects in them can be located. To obtain the recognition results of the P-type and B-type frame image data, the original image regions covered by all key object rectangular frames are aggregated, like pieces of a jigsaw puzzle, into several synthesized frames, and all the synthesized frames are then input into the preset deep neural network to obtain the synthesized frame recognition results.
Further, the process of obtaining a plurality of composite frames comprises: sequentially arranging all the key object rectangular frames to form an updated object list; and sequentially placing all the key object rectangular frames in the updated object list into a plurality of idle frames to form a plurality of synthesized frames. Wherein, the step of placing the key object rectangle frame in the updated object list into an idle frame to form a composite frame comprises: constructing an idle frame as an idle frame to be placed, and gathering idle areas in the idle frame to be placed into an idle area list; and sequentially placing the key object rectangular frames in the updated object list into the idle areas in the idle area list according to a preset placing mode until the key object rectangular frames to be placed cannot select the placeable idle areas from the idle area list, and completing synthesis of the idle frames to be placed to form a synthesized frame. Repeating the above single composite frame synthesis process for multiple times can achieve the purpose of sequentially placing all the acquired key object identification frames in multiple idle frames.
Further, sequentially placing the key object rectangular frames in the updated object list in the free area list according to a preset placing mode comprises: determining a first key object rectangular frame in the updated object list as a key object rectangular frame to be placed; screening out an idle area with the length and the width respectively larger than those of the rectangular frame of the key object to be placed from the idle area list in sequence to serve as an idle area to be placed; placing a rectangular frame of the key object to be placed in the upper left corner of the idle area to be placed, and recording the placement position information of the rectangular frame of the key object to be placed; dividing the area to be placed, acquiring a new free area and storing the new free area into a free area list; and removing the idle area to be placed from the idle area list, removing the current key object rectangular frame to be placed from the updated object list, re-determining the key object rectangular frame to be placed, repeating the process until the key object rectangular frame to be placed can not select the free area to be placed from the idle area list, and completing the synthesis of the idle frame to be placed to form a synthesized frame. The placing position information comprises the coordinate information of the upper left corner and the coordinate information of the lower right corner of the rectangular frame of the placed key object to be placed in the idle frame to be placed.
Preferably, the segmenting the region to be placed comprises: acquiring the height difference and the width difference between a rectangular frame of the key object to be placed and the idle area to be placed; when the height difference is larger than the width difference, dividing the to-be-placed idle area in which the to-be-placed key object rectangular frame is placed along a straight line where the outer edge of the bottom edge of the to-be-placed key object rectangular frame is located; and when the height difference is smaller than the width difference, dividing the idle area to be placed, in which the rectangular frame to be placed with the key object is placed, along the straight line where the outer edge of the right side of the rectangular frame to be placed with the key object is located.
FIG. 5 is a diagram illustrating an example of a composite frame formation process in an embodiment of the present invention. Referring to FIG. 5, for a key object rectangular frame to be placed of size h_O × w_O (call it key object rectangular frame I), we need to find from the free area list a free area of size h_R × w_R satisfying h_O < h_R and w_O < w_R. At this point the free area list contains only one complete idle frame A; since idle frame A satisfies h_O < h_R and w_O < w_R, we place key object rectangular frame I in the upper left corner of this free area to be placed and record the placement position information of key object rectangular frame I in the updated object list.
The idle frame A in which key object rectangular frame I has been placed is then divided, in the manner described above for dividing the free area to be placed, into a free area A1 and a free area A2, maximizing the area difference between free area A1 and free area A2; idle frame A is removed from the free area list, free areas A1 and A2 are stored in the free area list, and key object rectangular frame I is removed from the updated object list.
Suppose the next key object rectangular frame to be placed is key object rectangular frame II. We find from the free area list a free area of size h_R × w_R satisfying h_O < h_R and w_O < w_R, place key object rectangular frame II in the upper left corner of free area A2, record its placement position information in the updated object list, and perform the corresponding removal and update of the free areas in the free area list. At this point the next key object rectangular frame to be placed can no longer find a placeable free area in the current free area list, and the synthesis of this composite frame is complete.
In the object aggregation algorithm it would be ideal to fit all the key object rectangular frames into a synthesized frame precisely and seamlessly. In practice, however, completely seamless placement cannot be achieved because the key object rectangular frames being aggregated are of different sizes.
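The placement and splitting rules described above amount to a simple guillotine-style packing. A minimal sketch is given below; the idle frame size, the Rect record and the (box_id, width, height) input format are illustrative assumptions, not part of the claimed method.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rect:
    x: int
    y: int
    w: int
    h: int

def pack_into_frame(boxes: List[Tuple[str, int, int]], frame_w: int, frame_h: int):
    """Place key object rectangular frames (box_id, width, height) in order into one
    idle frame. Each box goes into the upper-left corner of the first free area whose
    width and height both exceed the box's; the used free area is then split along the
    bottom or right outer edge of the box, whichever leftover dimension is larger.
    Returns ([(box_id, x, y)], remaining_boxes); the remaining boxes start the next
    synthesized frame."""
    free = [Rect(0, 0, frame_w, frame_h)]
    placements = []
    for i, (box_id, w, h) in enumerate(boxes):
        slot = next((r for r in free if w < r.w and h < r.h), None)
        if slot is None:
            return placements, boxes[i:]   # this synthesized frame is finished
        placements.append((box_id, slot.x, slot.y))
        free.remove(slot)
        if (slot.h - h) > (slot.w - w):
            # split along the line through the bottom outer edge of the box
            free.append(Rect(slot.x + w, slot.y, slot.w - w, h))
            free.append(Rect(slot.x, slot.y + h, slot.w, slot.h - h))
        else:
            # split along the line through the right outer edge of the box
            free.append(Rect(slot.x, slot.y + h, w, slot.h - h))
            free.append(Rect(slot.x + w, slot.y, slot.w - w, slot.h))
    return placements, []

def aggregate(boxes, frame_w=1280, frame_h=720):
    """Repeat the single-frame packing until every key object box is placed."""
    synthesized = []
    remaining = list(boxes)
    while remaining:
        placements, remaining = pack_into_frame(remaining, frame_w, frame_h)
        if not placements:
            raise ValueError("a key object rectangular frame is larger than the idle frame")
        synthesized.append(placements)
    return synthesized
```

In the Fig. 5 example, the first call to pack_into_frame places rectangular frames I and II and returns the frame that could not be placed, which then starts the next idle frame.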
It should be noted that the placement position information may be stored in the updated object list after being acquired, that is, each time a key object rectangular frame in the updated object list is removed, a corresponding placement position information is stored.
Step S105, splitting all the synthesized frame identification results through an object splitting algorithm, and returning the split results to the B-type frame image data and the P-type frame image data to obtain the P-type frame image identification results and the B-type frame image identification results.
Fig. 6 shows a schematic diagram of the specific process of the object splitting algorithm in an embodiment of the present invention. Referring to fig. 6, to obtain the final recognition results of the P-type and B-type frame image data, an object splitting algorithm is needed to split the recognition results of the synthesized frames and distribute the split results to the corresponding original frame image data. Specifically, all the synthesized frame recognition results are split based on the placement position information of all the key object rectangular frames, yielding the key object recognition results corresponding to each key object rectangular frame; these key object recognition results are then returned to the B-type and P-type frame image data based on the original position information of the key object rectangular frames, thereby obtaining the P-type frame image recognition results and the B-type frame image recognition results.
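Under the same illustrative representations as the earlier sketches (placement positions recorded as (synthesized-frame index, x, y) and original positions as (frame name, x0, y0, x1, y1)), the split-and-return step might look like this; all names are assumptions.

```python
import numpy as np

def split_and_return(synth_results, placement_position, original_position, frame_results):
    """Object splitting sketch: cut each key object's result out of the
    synthesized-frame recognition result at its placement position and paste it
    back into the owning P-type or B-type frame at its original position.

    synth_results      : list of 2-D label arrays, one per synthesized frame
    placement_position : box_id -> (synth_index, x, y)
    original_position  : box_id -> (frame_name, x0, y0, x1, y1)
    frame_results      : frame_name -> 2-D label array to be filled in
    """
    for box_id, (s, px, py) in placement_position.items():
        frame_name, x0, y0, x1, y1 = original_position[box_id]
        h, w = y1 - y0, x1 - x0
        frame_results[frame_name][y0:y1, x0:x1] = synth_results[s][py:py + h, px:px + w]
    return frame_results
```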
The real-time video identification method based on key object splicing provided by the embodiment of the invention decodes the target video and feeds the decoded I-type frame image data into a preset deep neural network to obtain the I-type frame image recognition result. The key object rectangular frames in the P-type and B-type frame image data are then obtained based on the I-type frame image recognition result, the decoded motion vector table and the intra-frame prediction mode table, and are aggregated into synthesized frames; only the synthesized frames are fed into the preset deep neural network for recognition, and the recognition results are split and distributed back to the frames to which they belong, completing the recognition task for the target video. By aggregating the key objects of several consecutive video frames and using the synthesized frames as the input of the deep neural network, the method reduces the amount of data fed to the network; that is, by squeezing non-key information out of the input to the preset deep neural network, it reduces the redundant computation associated with the video frames, greatly reduces the computational workload of the target video recognition task, and improves both the processing speed and the recognition accuracy of the recognition task.
Example two
In order to solve the technical problems in the prior art, embodiments of the present invention provide a real-time video recognition apparatus based on key object splicing.
Fig. 7 is a schematic structural diagram of a real-time video recognition apparatus based on key object splicing according to a second embodiment of the present invention, and referring to fig. 7, the real-time video recognition apparatus based on key object splicing according to the second embodiment of the present invention includes a decoding module, an I-type frame image recognition result obtaining module, a key object rectangular frame obtaining module, an aggregation module, and a splitting and returning module, which are sequentially connected.
The decoding module is used for decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data, B-type frame image data, a motion vector table and an intra-frame prediction mode table of the target video;
the I-type frame image identification result acquisition module is used for inputting I-type frame image data into a preset deep neural network to obtain an I-type frame image identification result;
the key object rectangular frame acquisition module is used for acquiring all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image identification result, the motion vector table and the intra-frame prediction mode table;
the aggregation module is used for aggregating the acquired key object rectangular frames through an object aggregation algorithm to obtain a plurality of synthesized frames, and inputting all the synthesized frames into a preset deep neural network to obtain a plurality of synthesized frame identification results;
the splitting and returning module is used for splitting all the synthesized frame identification results through an object splitting algorithm and returning the splitting results to the B-type frame image data and the P-type frame image data to obtain a P-type frame image identification result and a B-type frame image identification result.
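Purely as a structural illustration of how the five modules connect, a minimal Python sketch follows; the class name, method names and module call signatures are assumptions for the sketch, not the patent's implementation.

class KeyObjectSplicingRecognizer:
    """Illustrative composition of the five modules; each constructor argument is assumed
    to be a callable implementing the corresponding module described above."""

    def __init__(self, decoder, dnn, tracker, aggregator, splitter):
        self.decode = decoder        # decoding module
        self.dnn = dnn               # preset deep neural network
        self.track = tracker         # key object rectangular frame acquisition module
        self.aggregate = aggregator  # aggregation module
        self.split = splitter        # splitting and returning module

    def recognize(self, target_video):
        i_frames, p_frames, b_frames, mv_table, intra_table = self.decode(target_video)
        i_results = [self.dnn(f) for f in i_frames]
        key_boxes = self.track(i_results, mv_table, intra_table, p_frames, b_frames)
        composites, placements = self.aggregate(key_boxes)
        composite_results = [self.dnn(c) for c in composites]
        p_results, b_results = self.split(composite_results, placements, p_frames, b_frames)
        return i_results, p_results, b_results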
The real-time video recognition device based on key object splicing provided by the embodiment of the invention completes the recognition task of the target video in the same way: the target video is decoded, and the decoded I-type frame image data is fed into the preset deep neural network to obtain the I-type frame image recognition result; the key object rectangular frames in the P-type frame image data and the B-type frame image data are then obtained based on the I-type frame image recognition result, the decoded motion vector table and the intra-frame prediction mode table, and are aggregated into composite frames; only the composite frames are fed into the preset deep neural network for recognition, and the recognition results are split and distributed back to the frames to which they belong. By aggregating the key objects of several consecutive video frames and using the composite frames as the input of the deep neural network, the device reduces the amount of data input to the network, squeezes out the non-key information that would otherwise be fed to the preset deep neural network, and removes the corresponding redundant computation, which greatly reduces the computation workload of the target video recognition task and improves both the processing speed and the recognition accuracy of the recognition task.
Example three
In order to solve the above technical problems in the prior art, an embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, implements all the steps of the real-time video identification method based on key object splicing in the first embodiment.
The specific steps of the real-time video identification method based on key object splicing and the beneficial effects obtained by applying the readable storage medium provided by the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
It should be noted that the storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Example four
In order to solve the technical problems in the prior art, the embodiment of the invention also provides a terminal.
Fig. 8 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention. Referring to Fig. 8, the terminal according to this embodiment includes a processor and a memory which are connected to each other; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the terminal can implement all the steps of the real-time video identification method based on key object splicing.
The specific steps of the real-time video identification method based on key object splicing and the beneficial effects obtained by applying the terminal provided by the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
It should be noted that the memory may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. Similarly, the processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A video identification method based on key object splicing comprises the following steps:
decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data, B-type frame image data, a motion vector table and an intra-frame prediction mode table of the target video;
inputting the I-type frame image data into a preset deep neural network to obtain an I-type frame image recognition result;
acquiring all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image identification result, the motion vector table and the intra-frame prediction mode table;
aggregating the acquired key object rectangular frames through an object aggregation algorithm to obtain a plurality of synthesized frames, and inputting all the synthesized frames into the preset deep neural network to obtain a plurality of synthesized frame identification results;
and splitting all the synthesized frame identification results through an object splitting algorithm, and returning the splitting results to the B-type frame image data and the P-type frame image data to obtain a P-type frame image identification result and a B-type frame image identification result.
2. The method according to claim 1, wherein the step of obtaining all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image identification result, the motion vector table, and the intra-frame prediction mode table comprises:
sequentially acquiring a temporary identification result of each frame of P frame image data and a temporary identification result of each frame of B frame image data according to a preset tracking sequence through a preset recovery operation on the basis of the I-type frame image identification result, the motion vector table and the intra-frame prediction mode table;
sequentially traversing the temporary identification result of each frame of P frame image data and the temporary identification result of each frame of B frame image data, acquiring a key segmentation small block in each frame of P frame image data and a key segmentation small block in each frame of B frame image data, and acquiring a key object rectangular frame in each frame of P frame image data and a key object rectangular frame in each frame of B frame image data based on the key segmentation small block in each frame of P frame image data and the key segmentation small block in each frame of B frame image data;
the preset tracking sequence is the sequence of the target video after I-type frame image data is eliminated.
3. The method according to claim 2, wherein the obtaining of the temporary recognition result of the single-frame P-frame image data or the temporary recognition result of the single-frame B-frame image data by the preset restoration operation comprises:
taking P frame image data or B frame image data whose temporary identification result is to be obtained as target image data;
acquiring a first-class reference segmentation small block for a part of the segmentation small blocks in the target image data based on the I-type frame image recognition result, the preceding frame image temporary recognition result set and the motion vector table, and respectively copying the segmentation recognition result of each first-class reference segmentation small block to the corresponding segmentation small block in the target image data to obtain a first recognition result of the target image data;
acquiring a second type of reference segmentation small blocks of other segmentation small blocks in the target image data based on the first identification result of the target image data and the intra-frame prediction mode table, and respectively copying the segmentation identification results of the second type of reference segmentation small blocks to corresponding segmentation small blocks in the target image data to obtain a temporary identification result of the target image data;
wherein the preceding frame image temporary recognition result set includes the temporary recognition results of all currently acquired image data.
4. The method of claim 2, wherein traversing the temporary recognition result of the single frame of image data, obtaining the key segmentation small block in the frame of image data, and obtaining the key object recognition frame in the frame of image data based on the key segmentation small block in the frame of image data comprises:
traversing a temporary recognition result of single frame image data, taking a segmentation small block containing a preset color pixel in the temporary recognition result of the frame image data as a temporary segmentation small block, and taking a segmentation small block corresponding to the temporary segmentation small block in the image data as a key segmentation small block;
and taking the minimum rectangular frame containing all the key segmentation small blocks in the frame of image data as a key object rectangular frame, and recording the original position information of the key object rectangular frame.
5. The method of claim 4, wherein aggregating the obtained key object rectangular boxes by an object aggregation algorithm to obtain a plurality of composite frames comprises:
sequentially arranging all the key object rectangular frames to form an updated object list;
sequentially placing all key object rectangular frames in the updated object list into a plurality of idle frames to form a plurality of synthesized frames;
wherein, the step of placing the key object rectangle frame in the updated object list into an idle frame to form a composite frame comprises:
constructing an idle frame as an idle frame to be placed, and gathering idle areas in the idle frame to be placed into an idle area list;
and sequentially placing the key object rectangular frames in the updated object list into the idle areas in the idle area list according to a preset placing mode until the key object rectangular frames to be placed cannot select the placeable idle areas from the idle area list, and completing synthesis of the idle frames to be placed to form a synthesized frame.
6. The method of claim 5, wherein sequentially placing the key object rectangular frames in the updated object list in the free area list according to a preset placement mode comprises:
determining the first key object rectangular frame in the updated object list as a key object rectangular frame to be placed;
screening out an idle area with the length and the width respectively larger than those of the rectangular frame of the key object to be placed from the idle area list, and taking the idle area as an idle area to be placed;
placing the rectangular frame of the key object to be placed in the upper left corner of the idle area to be placed, and recording the placement position information of the rectangular frame of the key object to be placed;
dividing the area to be placed, acquiring a new idle area and storing the new idle area in the idle area list;
and removing the idle area to be placed from the idle area list, removing the rectangular frame of the key object to be placed from the updated object list, and re-determining the rectangular frame of the key object to be placed.
7. The method of claim 6, wherein dividing the area to be placed comprises:
acquiring the height difference and the width difference between the rectangular frame of the key object to be placed and the idle area to be placed;
when the height difference is larger than the width difference, dividing the to-be-placed idle area in which the to-be-placed key object rectangular frame is placed along a straight line where the outer edge of the bottom edge of the to-be-placed key object rectangular frame is located;
and when the height difference is smaller than the width difference, dividing the to-be-placed idle area in which the to-be-placed key object rectangular frame is placed along the straight line where the right outer edge of the to-be-placed key object rectangular frame is located.
8. A video identification device based on key object splicing is characterized by comprising a decoding module, an I-type frame image identification result acquisition module, a key object rectangular frame acquisition module, an aggregation module and a splitting and returning module which are sequentially connected;
the decoding module is used for decoding a target video through a preset video decoder to obtain I-type frame image data, P-type frame image data, B-type frame image data, a motion vector table and an intra-frame prediction mode table of the target video;
the I-type frame image recognition result acquisition module is used for inputting the I-type frame image data into a preset deep neural network to obtain an I-type frame image recognition result;
the key object rectangular frame obtaining module is configured to obtain all key object rectangular frames in the P-type frame image data and all key object rectangular frames in the B-type frame image data through an object tracking algorithm based on the I-type frame image recognition result, the motion vector table, and the intra-frame prediction mode table;
the aggregation module is used for aggregating the acquired key object rectangular frames through an object aggregation algorithm to obtain a plurality of synthesized frames, and inputting all the synthesized frames into the preset deep neural network to obtain a plurality of synthesized frame identification results;
the splitting and returning module is used for splitting all the synthesized frame identification results through an object splitting algorithm and returning the splitting results to the B-type frame image data and the P-type frame image data to obtain a P-type frame image identification result and a B-type frame image identification result.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for video recognition based on key object splicing according to any one of claims 1 to 7.
10. A terminal, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored by the memory to enable the terminal to execute the video identification method based on key object splicing according to any one of claims 1 to 7.
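To make the recovery operation of claim 3 easier to follow, a minimal Python sketch is given below. The 16-pixel block size, the layout of the motion vector table (one entry per inter-coded block pointing into a previously recovered frame) and of the intra-frame prediction mode table (one entry per intra-coded block pointing at a reference block in the same frame), and the integer label maps are all assumptions made for illustration; boundary clipping is omitted for brevity, and the claim itself does not prescribe these details.

import numpy as np

BLOCK = 16  # assumed segmentation small-block size in pixels

def recover_temporary_result(target_id, mv_table, intra_table, prior_results, shape):
    """prior_results: dict frame_id -> label map of an already recovered frame
    (the I-type frame recognition result plus the temporary results of preceding frames)."""
    result = np.zeros(shape, dtype=np.int32)

    # Step 1: for inter-coded blocks, copy the segmentation result of the first-class
    # reference block located through the motion vector table (first recognition result).
    for (by, bx), (ref_id, dy, dx) in mv_table[target_id].items():
        y, x = by * BLOCK, bx * BLOCK
        ref = prior_results[ref_id]
        result[y:y + BLOCK, x:x + BLOCK] = ref[y + dy:y + dy + BLOCK, x + dx:x + dx + BLOCK]

    # Step 2: for the remaining (intra-coded) blocks, copy the result of the second-class
    # reference block that the intra-frame prediction mode table points to in the same frame.
    for (by, bx), (ry, rx) in intra_table[target_id].items():
        y, x = by * BLOCK, bx * BLOCK
        result[y:y + BLOCK, x:x + BLOCK] = result[ry * BLOCK:(ry + 1) * BLOCK,
                                                  rx * BLOCK:(rx + 1) * BLOCK]
    return result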
CN202110652794.XA 2021-06-11 2021-06-11 Video identification method based on key object splicing, device storage medium and terminal Active CN113378717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652794.XA CN113378717B (en) 2021-06-11 2021-06-11 Video identification method based on key object splicing, device storage medium and terminal

Publications (2)

Publication Number Publication Date
CN113378717A true CN113378717A (en) 2021-09-10
CN113378717B CN113378717B (en) 2022-08-30

Family

ID=77574119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652794.XA Active CN113378717B (en) 2021-06-11 2021-06-11 Video identification method based on key object splicing, device storage medium and terminal

Country Status (1)

Country Link
CN (1) CN113378717B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030123544A1 (en) * 1999-04-17 2003-07-03 Pulsent Corporation Encoding system using motion vectors to represent frame-to-frame changes, wherein a decoder uses predictions of motion vectors in decoding
CN101257627A (en) * 2008-03-20 2008-09-03 Shanghai Jiaotong University Video encoding and decoding method for combining LDPC channel code and H.264
US20110083156A1 (en) * 2009-10-07 2011-04-07 Canon Kabushiki Kaisha Network streaming of a video stream over multiple communication channels
CN108810620A (en) * 2018-07-18 2018-11-13 Tencent Technology (Shenzhen) Co., Ltd. Method, computer device and storage medium for identifying key time points in a video
WO2021042957A1 (en) * 2019-09-06 2021-03-11 Huawei Technologies Co., Ltd. Image processing method and device
CN111985456A (en) * 2020-09-10 2020-11-24 Shanghai Jiaotong University Video real-time identification, segmentation and detection architecture

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHUORAN SONG et al.: "DRQ: Dynamic region-based quantization for deep neural network acceleration", IEEE *
YU XIAOHAN et al.: "Radar video moving target detection based on feature fusion", Radar Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537162A (en) * 2021-09-15 2021-10-22 Beijing Tuoke Network Technology Co., Ltd. Video processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN113378717B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN103618900B (en) Video area-of-interest exacting method based on coding information
CN112534818A (en) Machine learning based adaptation of coding parameters for video coding using motion and object detection
CN104378643A (en) Intra-frame prediction mode selection method and system of 3D (3-dimension) video plus depth image
CN110620924A (en) Method and device for processing coded data, computer equipment and storage medium
CN113378717B (en) Video identification method based on key object splicing, device storage medium and terminal
WO2023065891A1 (en) Multimedia data processing method and apparatus, device, computer-readable storage medium and computer program product
WO2020227911A1 (en) Method for accelerating coding/decoding of hevc video sequence
Zhu et al. Video coding with spatio-temporal texture synthesis and edge-based inpainting
CN116112683A (en) Video compression method, apparatus, computer device and storage medium
WO2022104678A1 (en) Video encoding and decoding methods and apparatuses, mobile platform and storage medium
US20200351518A1 (en) Prediction direction selection method and apparatus in image encoding, and storage medium
EP2309452A1 (en) Method and arrangement for distance parameter calculation between images
CN113079373A (en) Video coding method based on HEVC-SCC
CN112084949B (en) Video real-time identification segmentation and detection method and device
Zuo et al. Bi-layer texture discriminant fast depth intra coding for 3D-HEVC
Guan et al. VVC fast ME algorithm based on spatial texture features and time correlation
CN112954350B (en) Video post-processing optimization method and device based on frame classification
CN105578194A (en) JPEG (Joint Photographic Experts Group) image decoding method and decoder
Ndjiki-Nya et al. Perception-oriented video coding based on texture analysis and synthesis
CN1113638A (en) Memory system for use in a motion compensated video recoder
CN113255564B (en) Real-time video identification accelerator based on key object splicing
CN107483936B (en) A kind of light field video inter-prediction method based on macro pixel
CN113902000A (en) Model training, synthetic frame generation, video recognition method and device and medium
CN112929662B (en) Coding method for solving object overlapping problem in code stream structured image coding method
US20240155137A1 (en) Multimedia data processing method and apparatus, computer device, computer-readable storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant