CN111985456A - Video real-time identification, segmentation and detection architecture


Info

Publication number
CN111985456A
Authority
CN
China
Prior art keywords
frame
image data
frame image
video
segmentation
Prior art date
Legal status
Granted
Application number
CN202010945694.1A
Other languages
Chinese (zh)
Other versions
CN111985456B (en)
Inventor
景乃锋
宋卓然
吴飞洋
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010945694.1A
Publication of CN111985456A
Application granted
Publication of CN111985456B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0895Caches characterised by their organisation or structure of parts of caches, e.g. directory or tag array
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/123Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/114Adapting the group of pictures [GOP] structure, e.g. number of B-frames between two anchor frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation characterised by memory arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/513Processing of motion vectors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/455Image or video data
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video real-time identification, segmentation and detection architecture comprising a main memory and, connected to it over buses, a video decoder, a video identification processing module, and a neural network processing module. The video identification processing module stores the motion vectors of B-class frames from the motion vector table and, following the decoding order, reads from the main memory the image segmentation results in each B frame's reference frames together with the reconstruction results of already-acquired B-frame image data (or, for detection, the image detection results of the reference frames), and processes them to obtain the reconstruction result of the B-class frames. By tightly coupling the video decoder with the neural network, the architecture achieves higher performance while maintaining accuracy, solving the problem that existing methods for video identification tasks cannot reduce computation and energy consumption while guaranteeing high precision.

Description

Video real-time identification, segmentation and detection architecture
Technical Field
The invention relates to the technical field of neural networks, and in particular to a video real-time identification, segmentation and detection architecture.
Background
Deep convolutional neural networks are widely used in image recognition tasks such as classification, detection, and segmentation of images. As deep convolutional neural networks have developed, their application has gradually been extended to the video domain.
Deep learning is well suited to image recognition tasks. For target segmentation, fully convolutional networks (FCNs) have been applied extensively in the field, while for target detection the R-CNN family is dominant. However, directly applying an image recognition model to every frame of a video incurs an unacceptable amount of computation and energy. Given this limitation of image recognition, researchers have proposed many neural networks for video recognition: OSVOS proposes a two-stream FCN model handling foreground and contour separately; for better performance, FAVOS proposes local segmentation based on tracked target objects, followed by ROI SegNet, a robust but still large network for segmenting targets. These techniques, however, achieve high accuracy at the cost of heavy computation and energy consumption.
Furthermore, it is known that image content changes slowly between video frames, and this data redundancy can be exploited to reduce computation. To achieve real-time video segmentation, DFF proposes a deep feature flow method, the first to directly combine optical flow with key features: the optical flow is extracted by a neural network, while key features are extracted by a large convolutional neural network on key frames. However, the key frames are chosen at a fixed frame interval, which affects recognition accuracy, and extracting the optical flow is itself costly.
Disclosure of Invention
The technical problem to be solved by the invention is that existing methods for video identification tasks consume a large amount of computation and energy when high recognition accuracy is required, and sacrifice recognition accuracy when computation and energy are reduced.
In order to solve the technical problem, the invention provides a video real-time identification, segmentation and detection architecture, which comprises a main memory, and a video decoder, a video identification processing module and a neural network processing module which are respectively connected with the main memory through buses;
the video decoder is used for decoding the target video to obtain a decoding sequence and obtaining I frame type image data, P frame type image data and a motion vector table of the target video;
the video identification processing module is used for storing the motion vectors of B-class frames based on the motion vector table and, based on those motion vectors, sequentially reading from the main memory, in decoding order, the image segmentation results in the reference frames of the B-frame image data together with the reconstruction results of already-acquired B-frame image data (or the image detection results of the reference frames of the B-frame image data), and processing them to acquire the reconstruction result of the B-class frames;
the neural network processing module is used for segmenting the I-frame and P-frame image data with a first preset neural network to obtain I-frame and P-frame video segmentation results, detecting the I-frame and P-frame image data with a third preset neural network to obtain I-frame and P-frame video detection results, segmenting the B-class frame reconstruction results together with the I-frame and P-frame image segmentation results with a second preset neural network to obtain the image segmentation results of the B-class frames, and detecting the B-class frame reconstruction results together with the set I-frame and set P-frame image detection results with the second preset neural network to obtain the image detection results of the B-class frames;
the main memory is used for storing I frame type image data, P frame type image data, I frame type video segmentation results, P frame type video segmentation results, I frame type video detection results, P frame type video detection results, B frame type reconstruction results, set I frame type image detection results, set P frame type image detection results, B frame type video segmentation results and B frame type video detection results;
the video coding and decoding standard of the target video is classification of I-frame image data, B-frame image data and P-frame image data, a motion vector table is provided, and each frame of image data is divided into a plurality of small divided blocks according to a preset mode.
Preferably, the video identification processing module comprises a motion vector storage unit, a control unit, a to-be-set box ordering unit and a processing unit;
the motion vector storage unit is used for storing the motion vector of the B-type frame based on the motion vector table;
the control unit is used for sequentially storing the index numbers and the corresponding state information of the I-frame image data and the P-frame image data in a first queue according to a decoding sequence and sequentially storing the index numbers and the corresponding state information of the B-frame image data in a second queue according to the decoding sequence;
the frame ordering unit to be set is used for acquiring image detection results of all reference frames of the B frame image data, and classifying and ordering all detected frames in the image detection results of all reference frames of the B frame image data to obtain the frame ordering to be set of the B frame image data; and setting the image detection result of the reference frame of the B frame image data based on the frame ordering to be set of the B frame image data, acquiring the image detection result of the set reference frame of the B frame image data, and merging all sub-reconstruction results of the B frame image data after a processing unit acquires all sub-reconstruction results of the B frame image data to obtain the reconstruction result of the B frame image data.
The processing unit is used for acquiring the reconstruction result of the B frame image data according to all image segmentation results in the reference frame of the B frame image data and the reconstruction results of all acquired B frame image data, or acquiring the sub-reconstruction result of the B frame image data based on the image detection result of the set reference frame corresponding to the B frame image data, the motion vector of the B frame and the reconstruction results of the acquired B frame image data.
Preferably, the initial state of I-frame and P-frame image data is set to s_ready, and the state of the image segmentation result and image detection result of I-frame and P-frame image data is set to s_done; meanwhile, the initial state of B-frame image data is set to s_unready, the state of a B-frame reconstruction result to s_ready, and the states of the image segmentation result and B-class video detection result of B-frame image data to s_done.
Preferably, the video identification processing module further comprises a merging unit;
the merging unit is used for sequentially acquiring the motion vectors of the M access requests to be sent from the motion vector storage unit and merging the acquired access requests of the M motion vectors according to a preset merging mode to obtain at least one merged access request;
wherein, the preset merging mode is as follows: combining access requests corresponding to the motion vectors with the same reference frame number and source ordinate in the M motion vectors into a combined access request; m is a positive integer.
Preferably, the video identification processing module further includes a buffer configured to prefetch from the main memory the image segmentation result or image detection result of a reference frame of B-frame image data and the reconstruction result of completed B-frame image data, to acquire, based on the prefetched results, the reference partitioned blocks corresponding to all motion vectors whose reference frame numbers are stored in the motion vector storage unit, and to send the acquired reference partitioned blocks to the processing unit;
the buffer prefetching the image segmentation result or image detection result of the reference frame and the reconstruction result of completed B-frame image data from the main memory, acquiring the corresponding reference partitioned blocks, and sending them to the processing unit comprises:
the prefetch pointer set in the buffer points to the first motion vector in the motion vector storage unit, which serves as the prefetch motion vector; the prefetch pointer obtains from the main memory, and stores, the image segmentation result or image detection result of the reference frame named in the prefetch motion vector together with the reconstruction result of completed B-frame image data, and the reference frame number in the prefetch motion vector is sent to the search pointer set in the buffer;
the search pointer finds, in the motion vector storage unit, all motion vectors whose reference frame number equals that of the prefetch motion vector, forming a same-reference motion vector group, and, based on the obtained image segmentation or detection result of the reference frame and the reconstruction result of completed B-frame image data, sequentially acquires the reference partitioned blocks of all motion vectors in the group and sends them to the processing unit;
all motion vectors in the same-reference group are then deleted from the motion vector storage unit, and it is judged whether motion vectors remain in the motion vector storage unit; if so, the prefetch pointer points to the first remaining motion vector and the working processes of the prefetch and search pointers repeat; otherwise, prefetching ends.
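As an illustrative, non-limiting sketch of the two-pointer scheme above (all names are hypothetical): the prefetch pointer fixes on the reference frame of the first pending motion vector, one main-memory fetch then serves every motion vector with that reference frame, and the served group is deleted before the loop repeats.

```python
# Minimal sketch (hypothetical names) of the prefetch/search pointer walk.
# A motion vector is modeled as a tuple whose first element is the reference
# frame number; fetch_frame and serve stand in for the main-memory read and
# the hand-off of reference partitioned blocks to the processing unit.
def prefetch_loop(motion_vectors, fetch_frame, serve):
    pending = list(motion_vectors)
    while pending:                        # motion vectors remain in the storage unit
        ref = pending[0][0]               # prefetch pointer: first MV's reference frame
        frame_data = fetch_frame(ref)     # one prefetch serves the whole group
        for mv in pending:                # search pointer: gather same-reference MVs
            if mv[0] == ref:
                serve(mv, frame_data)     # send the reference block onward
        pending = [mv for mv in pending if mv[0] != ref]   # delete the served group

# Usage: frames 0 and 4 are each fetched from main memory exactly once.
log = []
prefetch_loop([(0, 240, 136), (4, 392, 232), (0, 8, 16)],
              fetch_frame=lambda ref: f"frame{ref}",
              serve=lambda mv, data: log.append((mv, data)))
```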
Preferably, the video identification processing module further comprises a two-dimensional cache memory for obtaining and caching the segmentation result of the reference frame of the B-frame image data from the main memory;
the two-dimensional cache memory comprises at least one bank memory structure, at least one group memory structure is arranged in the bank memory structure, and at least one way memory unit is arranged in the group memory structure.
Preferably, the two-dimensional image data stored in the main memory all conform to a preset memory mapping rule;
the preset storage mapping rule is as follows: the two-dimensional image data are mapped into the main memory in three levels, and each level of mapping is used for segmenting the two-dimensional image data in different granularities;
the number of the secondary blocks in the secondary mapping is the same as the number of the physical storage structures in the main memory, and the size of the tertiary block in the tertiary mapping is the product of the burst length of the main memory and the width of the data bus divided by 8; and the two-dimensional image data comprises I-frame type image data, P-frame type image data, I-frame type video segmentation results, P-frame type video segmentation results, I-frame type video detection results, P-frame type video detection results, B-frame type reconstruction results, B-frame type video segmentation results and B-frame type video detection results.
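A worked instance of the tertiary-block sizing rule above, under assumed DRAM parameters (burst length 8 and a 64-bit data bus; both values are illustrative, not taken from the disclosure):

```python
# Tertiary block size = burst length x data bus width / 8 (bits -> bytes).
burst_length = 8            # beats per burst (assumed)
bus_width_bits = 64         # data bus width in bits (assumed)
tertiary_block_bytes = burst_length * bus_width_bits // 8
assert tertiary_block_bytes == 64   # one burst delivers exactly one block
```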
Preferably, the address of a reference partitioned block in the main memory is generated as follows:
acquiring the main memory page storing the reference partitioned block based on its frame number and intra-frame coordinates;
obtaining the row address of the reference partitioned block based on the main memory page storing it;
obtaining the index of the bank storage structure of the reference partitioned block in the main memory based on its intra-frame coordinates;
calculating the index value of the reference partitioned block within a secondary block based on its intra-frame coordinates;
calculating the column address of the reference partitioned block based on its index value within the secondary block.
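The following sketch walks through the five steps above under an assumed, purely illustrative geometry (8×8-pixel blocks; the frame size, bank count and page size are hypothetical); the real mapping depends on the main memory configuration:

```python
# Illustrative address generation for a reference partitioned block, under
# assumed geometry: blocks_per_row x blocks_per_col blocks per frame,
# n_banks bank storage structures, blocks_per_page blocks per memory page.
def block_address(frame_no, bx, by, blocks_per_row=240, blocks_per_col=135,
                  n_banks=8, blocks_per_page=512):
    linear = by * blocks_per_row + bx                       # block index in frame
    page = (frame_no * blocks_per_row * blocks_per_col + linear) // blocks_per_page
    row = page                                              # row address from the page
    bank = linear % n_banks                                 # bank index from coordinates
    col = linear // n_banks % (blocks_per_page // n_banks)  # column within secondary block
    return row, bank, col
```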
Preferably, the two-dimensional cache address structure comprises an X address and a Y address; the X address and the Y address each comprise a tag, an index, and a bank storage structure of a two-dimensional cache memory;
the high order bits of the memory structure address, the row address and the column address of the main memory are distributed in the tags of the X address and the Y address, the tags of the X address and the Y address comprise the low order bits of the main memory row address, and the tags of the X address and the Y address respectively comprise one bit of the main memory column address.
Preferably, the intra-set addressing and block replacement policy of the two-dimensional cache is to replace the least recently used blocks along each of the two dimensions, applying the least-recently-used rule to each dimension respectively.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
by applying the video real-time identification, segmentation and detection framework provided by the embodiment of the invention, a video decoder is arranged to decode a target video; processing the I/P frame image data by setting a neural network processing module to obtain an I/P frame video segmentation result and an I/P frame video detection result, providing a realization basis for obtaining a reconstruction result of a B frame, and obtaining an image detection result of the B frame based on the reconstruction result of the B frame; the video identification processing module is arranged to acquire the reconstruction result of the B-type frame, and in order to improve the acquisition efficiency of the reconstruction result of the B-type frame, optimization schemes of rearranging a decoding sequence into two queues, combining motion vector access requests, caching reference frames by using a buffer and a two-dimensional cache memory and the like are further provided; and storing the processing results in the modules by setting a main memory. The structure of the invention realizes higher performance while maintaining accuracy by closely connecting the video decoder and the neural network, and solves the problem that the existing video identification task processing method can not reduce the calculation amount and energy consumption on the basis of ensuring higher precision.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram illustrating an architecture of a video real-time recognition, segmentation and detection architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a rescheduling method of a merging unit according to a first embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a video real-time identification, segmentation and detection architecture according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a video real-time recognition, segmentation and detection architecture according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a structure of a two-dimensional cache memory according to a third embodiment of the present invention;
FIG. 6 shows three cases in which reference partitioned blocks cross basic block boundaries in the main memory in the third embodiment of the present invention;
FIG. 7 is a diagram illustrating a mapping of two-dimensional image data to a main memory according to a third embodiment of the present invention;
FIG. 8 is a diagram showing the translation of a main memory address into a two-dimensional cache memory address according to a third embodiment of the present invention;
FIG. 9 is a schematic diagram of a main memory address format conforming to the main memory address design in the third embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention is provided with reference to the drawings and examples, so that how technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, unless they conflict, the embodiments and the features of the embodiments may be combined with one another, and all resulting technical solutions fall within the scope of the present invention.
In current deep-learning-based recognition, segmentation and detection of images, fully convolutional networks are heavily used for target segmentation, and the R-CNN family dominates target detection. However, if an image recognition model is applied directly to every frame of a video, the computation and energy consumption are unaffordable and the recognition accuracy is not high. Based on this limitation of image recognition, researchers have proposed many neural networks for video recognition, such as the two-stream FCN model of OSVOS and the local segmentation based on tracked target objects of FAVOS, but their high accuracy comes at the cost of heavy computation and energy consumption. Further, to achieve real-time video segmentation, DFF proposes a deep feature flow method that directly combines optical flow and key features; however, its key frames are chosen at a fixed frame interval, which affects recognition accuracy, and extracting the optical flow carries a large overhead.
Example one
In order to solve the technical problems in the prior art, the embodiment of the invention provides a video real-time identification, segmentation and detection architecture.
FIG. 1 is a schematic diagram illustrating an architecture of a video real-time recognition segmentation and detection architecture according to an embodiment of the present invention; referring to fig. 1, the video real-time identification, segmentation and detection architecture of the present invention includes a main memory, and a video decoder, a video identification processing module and a neural network processing module respectively connected to the main memory through a bus.
The video decoder is mainly used for decoding the target video to obtain a decoding sequence and obtaining I-frame image data, P-frame image data and a motion vector table of the target video.
The video codec standard of the target video classifies image data into I frames, B frames and P frames, divides each frame of image data into blocks of a preset size, and provides a motion vector table. For example, the video codec standard of the target video may be H.265, in which case the partitioned blocks are coding tree blocks; it may also be H.264, in which case the partitioned blocks are macroblocks. A motion vector is the quantity by which the video decoder expresses the motion trajectory of a partitioned block, recording the dependency relationship in the code stream. The video decoder refers to the depended-on frame and partitioned block as the reference frame and reference partitioned block; the motion vector table includes the reference frames on which each B-frame and P-frame image data of the target video depends, and the reference partitioned blocks on which each partitioned block in the B-frame and P-frame image data depends, of which the motion vectors of the B-class frames are the ones required here.
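As a minimal sketch (not part of the original disclosure), one motion vector table entry can be modeled after the 7-tuple format used in the worked example later in this description; all field names are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical model of one motion vector table entry, following the 7-tuple
# (current frame, reference frame, dstx, dsty, srcx, srcy, flag) used in the
# embodiment below; the final flag presumably marks bidirectional prediction.
@dataclass
class MotionVector:
    cur: int      # index of the B/P frame being reconstructed
    ref: int      # index of the reference frame it depends on
    dstx: int     # x coordinate of the partitioned block in the current frame
    dsty: int     # y coordinate of the partitioned block in the current frame
    srcx: int     # x coordinate of the reference block in the reference frame
    srcy: int     # y coordinate of the reference block in the reference frame
    bidir: bool   # True for the second block of a bidirectional prediction

# Example drawn from the embodiment below: two vectors for one block of B1.
mv_a = MotionVector(cur=1, ref=0, dstx=256, dsty=108, srcx=240, srcy=136, bidir=False)
mv_b = MotionVector(cur=1, ref=4, dstx=256, dsty=108, srcx=392, srcy=232, bidir=True)
```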
It should be noted that each frame of image data may be divided in units of partitioned blocks, a typical partitioned-block size being 8×8 pixels. The decoding process of the video decoder for I-frame, P-frame and B-frame image data has the following characteristics. For I-frame image data, each partitioned block undergoes intra-frame prediction through 14 prediction modes; the video decoder runs a different prediction algorithm in each mode and computes the sum of absolute differences between the current partitioned block and the already-computed partitioned block, then selects the mode and partitions the blocks so as to minimize that sum. For P-frame image data, the video decoder searches over a larger range covering the already-coded partitioned blocks of the current frame (intra prediction) and of preceding frames (inter prediction), again selecting the mode and partitioning the blocks so as to minimize the sum of absolute differences. For B-frame image data, the search range further includes already-coded partitioned blocks of succeeding frames, with the mode selected and the blocks partitioned by the same criterion.
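A minimal sketch of the mode-selection criterion described above, i.e., choosing the candidate prediction that minimizes the sum of absolute differences; the two candidate modes shown are illustrative stand-ins, not the 14 modes of the codec:

```python
import numpy as np

def sad(block_a: np.ndarray, block_b: np.ndarray) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum())

def select_mode(current_block: np.ndarray, candidates: dict):
    """Pick the candidate prediction (mode name -> predicted block) that
    minimizes the sum of absolute differences against the current block."""
    best_mode = min(candidates, key=lambda m: sad(current_block, candidates[m]))
    return best_mode, candidates[best_mode]

# Toy usage on an 8x8 block (a typical partitioned-block size per the text).
cur = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
cands = {"dc": np.full((8, 8), cur.mean(), dtype=np.uint8),   # flat prediction
         "horizontal": np.tile(cur[:, :1], (1, 8))}           # copy left column
mode, pred = select_mode(cur, cands)
```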
Meanwhile, to prevent deadlock during decoding (e.g., the B1 frame depends on the P1 frame but would otherwise be decoded first), the video decoder generates a decoding order according to the dependencies between frames and decodes frame by frame strictly in that order.
The video decoder records the decoding order according to the inter-frame dependencies, because B-frame image data may reference both preceding and succeeding frames. For example, suppose (I0, B1, B2, B3, P4, I5, B6, P7) is the playing order of the video; then (I0, P4, B3, B2, B1, I5, P7, B6) will be the actual decoding order, since B3 depends on I0 and P4. Still further, the video decoder converts the code stream back into a conventional frame sequence according to this decoding order. Specifically, for I-frame image data the partitioned blocks are restored using intra prediction, based on the selected prediction mode, the partitioned blocks referenced in the motion vectors, and the corresponding residuals. For both P-frame and B-frame image data, the video decoder uses intra-frame as well as inter-frame prediction.
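Assuming the dependency sets of the example above, a legal decoding order can be derived as a topological order of the inter-frame dependency graph; this is a sketch of the idea, not the decoder's actual algorithm:

```python
from graphlib import TopologicalSorter

# Each frame maps to the frames it references (its decode prerequisites).
# Play order: I0 B1 B2 B3 P4 I5 B6 P7; the B frames reference both neighbours.
deps = {
    "I0": [], "P4": ["I0"],
    "B3": ["I0", "P4"], "B2": ["I0", "P4"], "B1": ["I0", "P4"],
    "I5": [], "P7": ["I5"], "B6": ["I5", "P7"],
}
order = list(TopologicalSorter(deps).static_order())
# Any topological order is a legal decoding order; the one in the text,
# (I0, P4, B3, B2, B1, I5, P7, B6), is one such order (ties may differ).
```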
The video identification processing module is mainly used for storing the motion vectors of B-class frames based on the motion vector table and, based on those motion vectors, sequentially reading from the main memory, in decoding order, the reference-frame image segmentation results of the B-frame image data together with the reconstruction results of already-acquired B-frame image data (or the reference-frame image detection results of the B-frame image data), and processing them to acquire the reconstruction result of the B-class frames. The video identification processing module comprises a motion vector storage unit, a control unit, a to-be-set box ordering unit and a processing unit.
The motion vector storage unit is used for processing the motion vector table acquired by the video decoder to extract the motion vectors of the B-class frames and then storing them; that is, it stores the motion vectors of every partitioned block in all B-frame image data.
Regarding the order in which the neural networks are executed for I-frame, P-frame and B-frame image data, the invention first considers a design in which the execution order for I, P and B frames matches the decoding order. For example, if the decoding order is (I0, P4, B3, B2, B1, I5, P7, B6), the frames execute their neural networks in that same order. However, this design can trigger frequent switching between the first or third preset neural network and the second preset neural network, and frequent switching causes the weights stored in the processing unit to be refreshed frequently, incurring a large access overhead.
Therefore, a control unit is designed in the video identification processing module, mainly to store the index numbers and corresponding state information of I-frame and P-frame image data sequentially, in decoding order, in a first queue, and those of B-frame image data sequentially, in decoding order, in a second queue. Further, to divide the queues more clearly and keep them from interfering with each other during operation, a first control unit and a second control unit are designed as shown in FIG. 1: the first control unit stores the index numbers and corresponding state information of I-frame and P-frame image data sequentially, in decoding order, in the first queue (the ip_Q queue) and maintains that queue; the second control unit stores the index numbers and corresponding state information of B-frame image data sequentially, in decoding order, in the second queue (the b_Q queue) and maintains that queue.
Preferably, while executing the neural networks, the invention first executes the frames in the first queue (the ip_Q queue, i.e., the first or third preset neural network); once all frames in the first queue have been executed, it switches to the second queue (the b_Q queue) and begins executing its frames (the second preset neural network), switching back to the first queue when all frames in the second queue have been executed.
Since the reconstruction operation of B-frame image data depends on the image segmentation result or image detection result of its reference frames and on the reconstruction results of already-acquired B-frame image data, B-frame image data can only execute the second preset neural network after its own reconstruction result exists. The invention therefore designs different flag bits to mark the state of each frame. Specifically, the initial state of I-frame and P-frame image data is set to s_ready, and the state of the image segmentation and image detection results of I-frame and P-frame image data is set to s_done; meanwhile, the initial state of B-frame image data is set to s_unready, the state of a B-frame reconstruction result to s_ready, and the states of the image segmentation result and B-class video detection result to s_done. It should be noted that when B-frame image data is reconstructed, the states of its reference frames must be checked; only when every reference frame is in state s_done can the results of its first or third preset neural network be used to reconstruct the B-frame image data.
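A minimal sketch of the flag-bit discipline (names hypothetical): a B frame is only reconstructed once every one of its reference frames has reached s_done.

```python
from enum import Enum

class FrameState(Enum):
    S_UNREADY = 0   # B frame not yet reconstructed
    S_READY = 1     # raw data / reconstruction available for the neural network
    S_DONE = 2      # segmentation or detection result written to main memory

def can_reconstruct(reference_frames, state):
    """A B frame may be reconstructed only when every reference frame is in
    state s_done, i.e. its neural network result is available."""
    return all(state[ref] == FrameState.S_DONE for ref in reference_frames)

# Usage: B3 references I0 and P4; P4's result is not written yet.
state = {"I0": FrameState.S_DONE, "P4": FrameState.S_READY}
assert not can_reconstruct(["I0", "P4"], state)
```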
The to-be-set box ordering unit is used for acquiring the image detection results of all reference frames of the B-frame image data and for classifying and ordering all detected boxes in those results to obtain the to-be-set box ordering of the B-frame image data; it then sets the image detection results of the reference frames of the B-frame image data based on that ordering, acquires the set reference-frame image detection results, and, after the processing unit has acquired all sub-reconstruction results of the B-frame image data, merges them to obtain the reconstruction result of the B-frame image data.
It should be noted that this unit is only used when the architecture of the invention performs real-time video detection. Further, the to-be-set box ordering of B-frame image data is obtained as follows: the first B-frame image data in decoding order is taken as the target B-frame image data; its reference frames are acquired based on the motion vectors of the B frames, all detected boxes in all reference frames corresponding to the target B-frame image data are classified according to a preset classification requirement, and the classified boxes are ordered according to a preset class order, yielding the to-be-set box ordering of the target B-frame image data. Note that the reference frames of B-frame image data may be I-frame, P-frame or B-frame image data, and there may be multiple detected boxes on each reference frame. The preset classification requirement is that the contents of the detected boxes are the same; classifying and ordering therefore means sorting all detected boxes of all reference frames of the target B-frame image data, after classification, according to the preset class order set in advance.
The process by which the to-be-set box ordering unit sets the image detection result of a reference frame of the B-frame image data based on that ordering is as follows: the first class of detected boxes in the ordering is taken as the boxes to be set, and a corresponding sub-reconstruction result is then constructed for them. In the image detection results of all reference frames containing a box to be set, the box to be set and the pixels inside it are set to a color of the first class and all other pixels outside it to a color of the second class, yielding the set reference-frame image detection result of the B-frame image data for that box, i.e., the partially set reference-frame image detection result. That result is then stored in the main memory, and the processing unit acquires the sub-reconstruction result of the B-frame image data based on it. The set reference-frame image detection results include the set I-frame image detection results and the set P-frame image detection results.
It should be noted that there may be one or more reference frames containing the current box to be set. If there is only one, the box to be set and its interior pixels are set to a first-class color and the other pixels outside it to a second-class color in that reference frame's image detection result; if there are several, the same setting is applied in each of them. The colors of the first class all differ from the color of the second class; the first class contains several colors, and detected boxes of different classes (and their interior pixels) across all reference-frame image data corresponding to a given B-frame image data are given different colors. That is, if the image detection results of all reference frames of the B-frame image data contain N classes of detected boxes, the N classes of boxes and their interior pixels are set to N distinct colors, and all pixels outside the boxes to an (N+1)-th color, where the first N colors belong to the first class and the (N+1)-th to the second class. This ensures that, in every set I-frame or set P-frame image detection result, boxes of different classes and their interior pixels differ in color from one another and from everything outside the boxes.
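A minimal sketch (hypothetical names) of this color-setting step, painting N classes of detected boxes and their interiors with N colors and everything else with the (N+1)-th background color; note the real unit operates per to-be-set box class rather than on all classes at once:

```python
import numpy as np

def set_reference_frame(shape, boxes_per_class):
    """Build a 'set' reference-frame detection result: boxes of class k
    (k = 0..N-1) and their interior pixels get color k, all remaining
    pixels get the background color N."""
    n_classes = len(boxes_per_class)
    out = np.full(shape, n_classes, dtype=np.uint8)   # background = color N
    for color, boxes in enumerate(boxes_per_class):
        for (x0, y0, x1, y1) in boxes:
            out[y0:y1, x0:x1] = color                 # paint box interior
    return out

# Two classes of detected boxes on a 1080x1920 reference frame.
result = set_reference_frame((1080, 1920),
                             [[(100, 50, 260, 210)], [(400, 300, 520, 420)]])
```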
Meanwhile, the to-be-set box ordering unit also judges whether the sub-reconstruction result obtained by the processing unit corresponds to the last class of detected boxes in the to-be-set box ordering of the current B-frame image data; if so, it merges all sub-reconstruction results of that B-frame image data to obtain its reconstruction result.
In implementing real-time video identification and segmentation, the processing unit acquires the reconstruction result of B-frame image data from all image segmentation results in its reference frames and from the reconstruction results of all already-acquired B-frame image data; in implementing real-time video detection, it acquires the sub-reconstruction results of B-frame image data based on the set reference-frame image detection results corresponding to the B-frame image data, the motion vectors of the B frames, and the reconstruction results of already-acquired B-frame image data.
In particular, to obtain the reconstruction result of B_X frame image data, it is usually necessary to take, in turn, from the motion vectors of the partitioned blocks of the B_X frame image data, the block at (srcx, srcy) in the reference frame S_ref, and then place the extracted block into the current-frame reconstruction result at coordinates (dstx, dsty). To better explain the process, the following shows how the partitioned block at (dstx, dsty) = (256, 108) of the B1 frame image data is reconstructed: the block has two motion vectors, (B1, I0, 256, 108, 240, 136, F) and (B1, P4, 256, 108, 392, 232, T), so it must be reconstructed using two reference partitioned blocks; the two motion vectors locate the reference blocks, in this example at (240, 136) in reference frame I0 and (392, 232) in reference frame P4. The data of the different reference blocks are then averaged pixel by pixel with a mean filter to obtain the reconstruction result of the partitioned block at (256, 108) of the B1 frame image data. Here S_cur denotes the image segmentation result of the current frame, S_ref denotes the image segmentation result of the reference frame, srcx and srcy denote the x, y coordinates of the partitioned block in S_ref, and dstx and dsty denote the x, y coordinates of the partitioned block in S_cur.
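The worked example above can be sketched as follows (hypothetical names; zero-filled frames stand in for real segmentation results):

```python
import numpy as np

def reconstruct_block(mvs, frames, block=8):
    """Reconstruct one partitioned block of a B frame: fetch the reference
    block named by each motion vector (ref, srcx, srcy) and, for
    bidirectional prediction, average them pixel by pixel (mean filter)."""
    refs = [frames[ref][srcy:srcy + block, srcx:srcx + block]
            for (ref, srcx, srcy) in mvs]
    return np.mean(refs, axis=0).astype(np.uint8)

frames = {0: np.zeros((1080, 1920), np.uint8),   # I0 segmentation result
          4: np.zeros((1080, 1920), np.uint8)}   # P4 segmentation result
patch = reconstruct_block([(0, 240, 136), (4, 392, 232)], frames)
# patch is written into the B1 result at (dstx, dsty) = (256, 108).
```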
The processing unit receives the reference partitioned blocks corresponding to the motion vectors and acquires the reconstruction result of a given partitioned block of the B-frame image data from them. Specifically, the processing unit places each received reference partitioned block at the target position given by the target abscissa and ordinate of its motion vector, then checks the motion vector to judge whether the block is the second block of a bidirectional prediction: if it is, the processing unit performs mean filtering; if not, the processing unit acts as a buffer holding the reference partitioned block. The corresponding motion vector is then deleted from the motion vector storage unit.
To obtain the reconstruction result of the B-class frames, the reconstruction result of each frame of B-frame image data must first be obtained; these per-frame results together form the B-class frame reconstruction result. Obtaining each frame's reconstruction result in turn requires the reconstruction result of every partitioned block in that frame, of which the per-frame result is composed. The processing unit thus builds the reconstruction result of each partitioned block in turn, thereby obtaining the reconstruction result of each frame and finally that of the B-class frames.
It should be noted that, in implementing real-time video identification and segmentation, the reference partitioned blocks of the processing unit come from the image segmentation results in the reference frames of the B-frame image data or from the reconstruction results of already-acquired B-frame image data; in implementing real-time video detection, they come from the set reference-frame image detection results corresponding to the B-frame image data or from the reconstruction results of already-acquired B-frame image data.
The neural network processing module comprises a first preset neural network, a second preset neural network and a third preset neural network.
The first preset neural network is used for acquiring an I frame video segmentation result and a P frame video segmentation result based on the I frame image data and the P frame image data; preferably, the first predetermined neural network is a large deep neural network for video segmentation. The third preset neural network is used for acquiring an I frame video detection result and a P frame video detection result based on the I frame image data and the P frame image data; preferably, the third predetermined neural network is a large deep neural network for video detection.
The second preset neural network is used for processing the B-class frame reconstruction results together with the I-frame and P-frame image segmentation results to obtain the image segmentation results of the B-class frames, which comprise the image segmentation results of all B-frame image data, acquired frame by frame. Specifically, the reconstruction result of a single frame of B-frame image data, the image segmentation result of the nearest preceding I-frame or P-frame image data in playing order, and the image segmentation result of the nearest succeeding I-frame or P-frame image data are taken together as one input group; all input groups are then fed in turn into the second preset neural network, yielding the image segmentation result of the B-frame image data corresponding to each group; finally, the image segmentation results of all B-frame image data are collected to obtain the image segmentation results of the B-class frames.
The second preset neural network is also used for detection: it processes the B-class frame reconstruction results together with the set I-frame and set P-frame image detection results to obtain the image detection results of the B-class frames, which likewise comprise the image detection results of all B-frame image data, acquired frame by frame. Specifically, the reconstruction result of a single frame of B-frame image data, the set image detection result of the nearest preceding I-frame or P-frame image data in playing order, and the set image detection result of the nearest succeeding I-frame or P-frame image data are taken together as one input group; all input groups are fed in turn into the second preset neural network, yielding the image detection result of the B-frame image data corresponding to each group; finally, these are collected to obtain the image detection results of the B-class frames. It should be noted that the set I-frame and set P-frame image detection results may be obtained by the to-be-set box ordering unit and stored in the main memory.
It should be noted that the second preset neural network is a convolutional network of three layers, namely a convolutional layer, a pooling layer and an activation layer, the first layer having three input channels. Preferably, for each input group, the reconstruction result of the B-frame image data is fed into the second input channel of the first layer, while the image segmentation result (or, for detection, the set image detection result) of the nearest preceding I-frame or P-frame image data in playing order and that of the nearest succeeding I-frame or P-frame image data are fed into the first and third input channels respectively.
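A minimal sketch of how one input group maps onto the three input channels of the first layer (hypothetical names and frame size):

```python
import numpy as np

def make_input_group(prev_ip_result, b_reconstruction, next_ip_result):
    """Assemble one input group for the second preset neural network:
    channel 0 = result of the nearest preceding I/P frame (play order),
    channel 1 = reconstruction result of the B frame,
    channel 2 = result of the nearest succeeding I/P frame."""
    return np.stack([prev_ip_result, b_reconstruction, next_ip_result], axis=0)

# Shape (3, H, W), ready for a first conv layer with three input channels.
group = make_input_group(np.zeros((1080, 1920), np.float32),
                         np.zeros((1080, 1920), np.float32),
                         np.zeros((1080, 1920), np.float32))
```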
The main memory is mainly used for storing I frame type image data, P frame type image data, I frame type video segmentation results, P frame type video segmentation results, I frame type video detection results, P frame type video detection results, set I frame type image detection results, set P frame type image detection results, B frame type reconstruction results, B frame type video segmentation results and B frame type video detection results.
In an embodiment, we propose a motion vector rescheduling method that merges multiple motion-vector access requests to main memory; to this end, a merging unit is designed in the video identification processing module.
The merging unit is used to sequentially acquire the motion vectors of M access requests to be sent from the motion vector storage unit and to merge the acquired M access requests according to a preset merging mode into at least one merged access request. The preset merging mode is: access requests whose motion vectors share the same reference frame number and source ordinate are merged into one merged access request; M is a positive integer. Fig. 2 is a schematic diagram of the rescheduling method of the merging unit according to the first embodiment of the present invention. Referring to fig. 2, assume the merging unit collects 8 motion vectors with pending access requests, originating from B_1, B_2 and B_3 (cur) and pointing to different positions of the same reference frame I_0. According to the reference frame number (ref) and source ordinate (src) of each motion vector, the merging unit merges the access requests of these 8 motion vectors into four merged access requests: (0,11), (0,13), (0,2) and (0,5). Requests are merged by the row number of the reference frame they access because data in main memory is generally laid out row-first, so issuing requests to main memory in row units better exploits the burst characteristic of main memory. Since an entire row of data is requested each time but not all of it is used in reconstruction, once the requested data returns, the merging unit dispatches it to the processing unit according to the current frame number (cur), destination abscissa (dstx) and destination ordinate (dsty) specified in each motion vector.
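As an illustration of this merging rule, the sketch below groups pending requests by (reference frame number, source ordinate); the record fields ref, src_y, cur, dst_x and dst_y mirror the quantities named above, but the data layout and the destination coordinates in the example are assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MotionVector:
    ref: int     # reference frame number
    src_y: int   # source ordinate (row in the reference frame)
    cur: int     # current frame number
    dst_x: int   # destination abscissa
    dst_y: int   # destination ordinate

def merge_requests(mvs):
    """Merge access requests that share (reference frame, source row).

    Main memory is laid out row-first, so one merged request fetches a
    whole row and serves every motion vector attached to it.
    """
    merged = defaultdict(list)
    for mv in mvs:
        merged[(mv.ref, mv.src_y)].append(mv)
    return merged

# The 8 requests of Fig. 2 collapse into 4 row requests:
mvs = [MotionVector(0, 11, 1, 0, 0), MotionVector(0, 13, 1, 8, 0),
       MotionVector(0, 11, 2, 0, 8), MotionVector(0, 2, 2, 8, 8),
       MotionVector(0, 13, 3, 0, 0), MotionVector(0, 2, 3, 8, 0),
       MotionVector(0, 5, 3, 0, 8), MotionVector(0, 5, 3, 8, 8)]
print(sorted(merge_requests(mvs)))   # [(0, 2), (0, 5), (0, 11), (0, 13)]
```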
In the video real-time identification, segmentation and detection architecture provided by this embodiment of the invention, a video decoder decodes the target video; a neural network processing module processes the I/P-frame image data to obtain the I/P-frame-class video segmentation and detection results, which provide the basis for reconstructing the B-class frames, whose image segmentation and detection results are in turn obtained from the reconstruction results; a video identification processing module acquires the reconstruction results of the B-class frames, with optimizations such as rearranging the decoding sequence into two queues and merging motion-vector access requests provided to improve the efficiency of this acquisition; and a main memory stores the processing results of the modules. By tightly coupling the video decoder with the neural networks, the architecture achieves higher performance while maintaining accuracy, and solves the problem that existing processing methods for video identification tasks cannot reduce computation and energy consumption while guaranteeing high precision.
Example two
In order to solve the technical problems in the prior art, the embodiment of the invention provides a video real-time identification, segmentation and detection architecture.
The video real-time identification, segmentation and detection architecture of this embodiment is further defined on the basis of the first embodiment with the merging unit removed.
In order to speed up the indexing of reference frames during reconstruction, a reference frame can be read from the main memory into a buffer in advance; as long as subsequent consecutive motion vectors all point to that reference frame, a large number of main-memory accesses can be avoided. The details are as follows:
FIG. 3 is a schematic structural diagram of the video real-time identification, segmentation and detection architecture according to the second embodiment of the present invention; referring to fig. 3, the video identification processing module of the architecture of this embodiment further includes a buffer.
The buffer is used to prefetch from the main memory the image segmentation result or image detection result of a reference frame of B-frame image data together with the reconstruction results of completed B-frame image data, to acquire, based on the prefetched reference-frame result and completed reconstructions, the reference segmentation small blocks corresponding to all motion vectors in the motion vector storage unit that carry the stored reference frame number, and to send the acquired reference segmentation small blocks to the processing unit.
Specifically, in this embodiment a prefetch pointer and a search pointer are set in the buffer. The prefetch pointer first points to the first motion vector in the motion vector storage unit, which serves as the prefetch motion vector; the prefetch pointer then acquires from the main memory and stores the image segmentation result or image detection result of the reference frame named in the prefetch motion vector together with the reconstruction results of completed B-frame image data, and sends the reference frame number in the prefetch motion vector to the search pointer.
The search pointer searches the motion vector storage unit for all motion vectors whose reference frame number equals that of the prefetch motion vector, and these motion vectors form a homogeneous motion vector group. It should be noted that the homogeneous motion vector group includes the first motion vector currently in the motion vector storage unit. Based on the image segmentation result or image detection result of the reference frame stored in the buffer and the reconstruction results of completed B-frame image data, the search pointer then sequentially acquires the reference segmentation small blocks of all motion vectors in the group and sends them to the processing unit. All motion vectors in the homogeneous motion vector group are then deleted from the motion vector storage unit.
Whether motion vectors remain in the current motion vector storage unit is then judged: if so, the prefetch pointer points to the first motion vector in the current motion vector storage unit, and the working processes of the prefetch pointer and the search pointer are repeated; otherwise, prefetching ends.
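The two-pointer workflow just described can be summarized in the following sketch; main_memory.load, ref_data.tile_at and processing_unit.submit are hypothetical stand-ins for the main-memory access, the reference-tile lookup and the dispatch to the processing unit, not interfaces defined by the patent.

```python
def prefetch_loop(mv_store, main_memory, processing_unit):
    """Sketch of the buffer's two-pointer prefetch scheme.

    mv_store: list of motion vectors, each with a .ref field (the
    reference frame number); main_memory.load(ref) stands in for
    fetching the reference frame's segmentation/detection result and
    the completed B-frame reconstructions in one go.
    """
    while mv_store:
        # Prefetch pointer: take the first pending motion vector.
        lead = mv_store[0]
        ref_data = main_memory.load(lead.ref)   # one main-memory access

        # Search pointer: gather every motion vector with the same
        # reference frame number (the homogeneous motion vector group).
        group = [mv for mv in mv_store if mv.ref == lead.ref]
        for mv in group:
            tile = ref_data.tile_at(mv)         # reference tile lookup
            processing_unit.submit(tile)

        # Remove the served group and repeat until the store is empty.
        mv_store[:] = [mv for mv in mv_store if mv.ref != lead.ref]
```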
In the video real-time identification, segmentation and detection architecture provided by this embodiment of the invention, a video decoder decodes the target video; a neural network processing module processes the I/P-frame image data to obtain the I/P-frame-class video segmentation and detection results, which provide the basis for reconstructing the B-class frames, whose image segmentation and detection results are in turn obtained from the reconstruction results; a video identification processing module acquires the reconstruction results of the B-class frames, with optimizations such as rearranging the decoding sequence into two queues and caching reference frames in a buffer provided to improve the efficiency of this acquisition; and a main memory stores the processing results of the modules. By tightly coupling the video decoder with the neural networks, the architecture achieves higher performance while maintaining accuracy, and solves the problem that existing processing methods for video identification tasks cannot reduce computation and energy consumption while guaranteeing high precision.
Example three
In order to solve the technical problems in the prior art, the embodiment of the invention provides a video real-time identification, segmentation and detection architecture.
The video real-time identification, segmentation and detection architecture of this embodiment is further defined on the basis of the first embodiment with the merging unit removed. Because several B-frame image data in a video frame sequence may reference the same reference frame, this embodiment uses a two-dimensional cache memory to temporarily store reference frames, which reduces a large number of repeated memory-access requests and thus the overhead that such accesses introduce into the system. The details are as follows:
FIG. 4 is a schematic structural diagram of the video real-time identification, segmentation and detection architecture according to the third embodiment of the present invention; referring to fig. 4, the video identification processing module of the architecture of this embodiment further includes a two-dimensional cache memory.
FIG. 5 is a diagram illustrating the structure of the two-dimensional cache memory according to the third embodiment of the present invention; referring to fig. 5, the two-dimensional cache memory is used to obtain and cache, from the main memory, the segmentation results of the reference frames of B-frame image data. The two-dimensional cache memory comprises at least one bank storage structure; each bank storage structure contains at least one group storage structure, and each group storage structure contains at least one way storage unit.
Data in a traditional cache memory is stored in one-dimensional form; for a two-dimensional image, such a one-dimensional data organization is clearly inefficient in both design and access, so the invention designs a two-dimensional cache memory. The whole two-dimensional cache adopts a multi-bank, multi-way set-associative design: the outermost storage structure is the two-dimensional cache bank storage structure (cbank), each bank storage structure contains several group storage structures, and each group storage structure contains several way storage units. A cache line is located within a group using two-dimensional tag comparison.
Fig. 6 shows three cases in which a reference segmentation small block crosses basic-block boundaries in main memory according to the third embodiment of the present invention. To improve the access efficiency of adjacent pixel blocks in an image, the invention maps the four spatially adjacent third-level blocks within each second-level block in main memory into different two-dimensional cache bank storage structures, which facilitates parallel reads in the three boundary-crossing cases shown in fig. 6. Under this mapping rule, 1/4 of the third-level blocks in each main-memory bank storage structure map into the same two-dimensional cache bank storage structure, and hence 1/4 of the reference segmentation small blocks in each main-memory page map into the same two-dimensional cache bank storage structure; to reduce the replacement frequency of cache lines, the capacity of a group storage structure in the two-dimensional cache should therefore be larger than 1/4 of the number of blocks in one main-memory page. The bank storage structure index of the two-dimensional cache is associated with the main-memory page number, and the 1/4 portion of the reference segmentation small blocks of each main-memory page is mapped into one bank of the two-dimensional cache.
The two-dimensional cache memory address structure includes an X address and a Y address; the X address and the Y address each comprise a tag, an index and a bank field of the two-dimensional cache memory. The bank storage structure address of the main memory and the high-order bits of the main-memory row address and column address are distributed across the tags of the X address and the Y address; the indexes of the X address and the Y address are drawn from the low-order bits of the main-memory row address; and the bank fields of the X address and the Y address each take one bit of the main-memory column address.
Data in the cache is indexed by two-dimensional coordinates, so the invention uses the address pair (xAddr, yAddr) to determine the location of data in the cache. The two-dimensional cache memory contains 4 bank storage structures, addressed with two binary bits, 1 bit in the x dimension and 1 bit in the y dimension; the bank storage structure address of the two-dimensional cache can be generated from the main-memory column address. The group storage structure index uses two-dimensional index coordinates; as noted above, it is associated with the main-memory page number, so the index field can be generated from the main-memory row address, with the number of index bits determined by the number of group storage structures in a two-dimensional cache bank storage structure: assuming a bank storage structure contains Cb_x · Cb_y groups, the x-dimension index has ⌈log2 Cb_x⌉ bits and, similarly, the y-dimension index has ⌈log2 Cb_y⌉ bits. Within a group storage structure, the position of a cache line is determined by two-dimensional tag comparison. To distinguish the several third-level blocks that belong to the same second-level block but are distributed in the same two-dimensional cache bank storage structure, the cache line must be identified by the column address; third-level blocks at the same position in different main-memory bank storage structures are allocated to the same group of the same two-dimensional cache bank storage structure, so they must additionally be distinguished by the main-memory bank storage structure index; moreover, different groups of pictures may be allocated to the same group storage structure of the same bank, so the main-memory row address must also be part of the two-dimensional cache tag. Allocating these address bits to the x and y dimensions yields the two-dimensional tag of the cache line.
FIG. 8 is a diagram showing the translation of a main-memory address into a two-dimensional cache memory address according to the third embodiment of the present invention. Each two-dimensional cache bank storage structure has 16 × 16 groups, so the group index is represented by 4 binary bits in each dimension; since different main-memory pages in a group of pictures are mapped into different two-dimensional cache bank storage structures and a unit group of pictures spans 256 main-memory pages, the lower 8 bits of the main-memory row address are taken as the group index within the bank storage structure. For the intra-group structure, the 8 horizontal units can exactly hold the data of the 8 main-memory bank storage structures, so the main-memory bank storage structure number serves as the tag bits of the x dimension; the data buffered in each column of the group is related to the main-memory column address and row address, so after removing the redundant and already-used bits of the main-memory address, the upper 3 bits of the column address and the upper 7 bits of the row address are spliced together as the tag bits of the y dimension. In summary, the two-dimensional cache memory address is obtained entirely from the main-memory address without complex computation or extra hardware, so the translation is highly efficient.
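As a rough illustration of this translation, the sketch below slices a main-memory (row, column, bank) triple into cache-bank, group-index and tag fields; the exact bit positions, and in particular how the 8-bit group index is split between the two dimensions, are assumptions consistent with the example above rather than the patent's verified layout.

```python
def to_2d_cache_address(row, col, bank):
    """Hedged sketch of the Fig. 8 address translation.

    Inputs are main-memory coordinates: a 15-bit row address, a 10-bit
    column address (lower 6 bits zero) and a 3-bit bank number.
    """
    # One column-address bit per dimension selects among 4 cache banks.
    cbank_x = (col >> 6) & 0x1
    cbank_y = (col >> 7) & 0x1

    # Lower 8 bits of the row address index the 16x16 groups
    # (assumed split: one 4-bit nibble per dimension).
    index_x = row & 0xF
    index_y = (row >> 4) & 0xF

    # tag_x: main-memory bank number; tag_y: upper 3 column-address
    # bits spliced with the upper 7 row-address bits.
    tag_x = bank
    tag_y = (((col >> 7) & 0x7) << 7) | ((row >> 8) & 0x7F)

    return (cbank_x, cbank_y), (index_x, index_y), (tag_x, tag_y)
```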
Data is stored linearly in main memory. A traditional data mapping mechanism stores an image in main-memory cells in row (or column) order, while a read of reference pixels usually requests a block of data corresponding to a local two-dimensional pixel block of the image; as a result, many of the pixels read into the row buffer each time are unused redundant data, and access efficiency is low. To solve this problem, the invention designs a preset storage mapping rule for two-dimensional image data: the image is mapped into main memory in three levels, each level partitioning the image at a different granularity, so that the image can be written into main memory block by block at a suitable size and almost no redundancy remains when a reference block is read.
FIG. 7 is a diagram illustrating a mapping of two-dimensional image data to a main memory according to a third embodiment of the present invention; referring to fig. 7, it is set that the two-dimensional image data stored in the main memory all conform to the preset memory mapping rule; the preset storage mapping rule is as follows: the two-dimensional image data is mapped into a main memory in three levels, and each level of mapping is used for segmenting the two-dimensional image data with different granularities; the number of the second-level blocks in the second-level mapping is the same as that of the physical storage structures in the main memory, and the size of the third-level block in the third-level mapping is the product of the burst length of the main memory and the width of the data bus divided by 8; and the two-dimensional image data comprises I-frame type image data, P-frame type image data, I-frame type video segmentation results, P-frame type video segmentation results, I-frame type video detection results, P-frame type video detection results, B-frame type reconstruction results, B-frame type video segmentation results and B-frame type video detection results.
A DDRx SDRAM returns BL · BUS_WIDTH/8 bytes of data continuously in one burst (BL and BUS_WIDTH are defined by the main memory), so the minimum partition block (cache line) of the two-dimensional image data should also contain BL · BUS_WIDTH/8 bytes. Assume the resolution of the input image is P_x · P_y. The first-level mapping divides the two-dimensional image data by rows (or columns) into a number of small rectangular blocks of size x_1 · y_1; the second-level mapping cuts each first-level block into N_bank rectangular blocks of uniform size (N_bank being the number of bank storage structures in the main memory), which enables parallel reads and improves memory-access throughput; the third-level mapping then partitions each second-level block into (x_1 · y_1)/[N_bank · (x_3 · y_3)] rectangular blocks of size x_3 · y_3, where x_3 · y_3 = BL · BUS_WIDTH/8. Referring to fig. 7, we select an input image with a resolution of 854 × 480; the first-level mapping divides the image by columns into rectangular blocks of 128 × 64. The main memory selected here contains 8 bank storage structures, so the second level divides each 128 × 64 rectangular block into 8 square blocks of size 32 × 32. The burst length of the main memory is 8 and the data bus width is 64 bits, so one burst can access 64 bytes of data continuously; if pixels are stored with a data length of 8 bits, the third level can divide out 16 third-level blocks of 8 × 8, each holding 64 bytes, and each third-level block is mapped to a burst-aligned 64-byte unit in one main-memory row. Thus each read of a reference segmentation small block returns one complete data block without multiple reads or cutting.
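The three-level partition of the 854 × 480 example can be illustrated as follows; the block dimensions come from the text, while the function name and the column-major ordering of blocks within each level are assumptions.

```python
def tile_location(x, y, x1=128, y1=64, bank_w=32, bank_h=32, tile=8):
    """Locate the 8x8 tile holding pixel (x, y) under the three-level
    mapping of the 854x480 example.

    Returns (first-level block, bank inside the block, tile index
    inside the bank); each tile is 64 bytes, i.e. one burst.
    """
    blk1 = (x // x1, y // y1)                  # 128x64 first-level block
    bx, by = x % x1, y % y1                    # offset inside the block
    bank = (bx // bank_w) * (y1 // bank_h) + (by // bank_h)   # 1 of 8
    tx, ty = bx % bank_w, by % bank_h
    tile_idx = (tx // tile) * (bank_h // tile) + (ty // tile)  # 1 of 16
    return blk1, bank, tile_idx

print(tile_location(130, 70))   # ((1, 1), 0, 0)
```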
A main-memory address contains channel, rank, bank, row and column fields; the order of the fields can be combined freely, and their lengths are determined by the size of the main memory or by design requirements. According to the JESD79-3 specification, the lower 3 bits (c_2, c_1, c_0) of the main-memory column address are used for the burst ordering function; in general, the sequential read/write mode takes {c_2, c_1, c_0} = {0, 0, 0}. To align to the burst length BL, the lower log2(BL) bits of the main-memory address are set to 0; overlapping these zero bits, the lower max(log2(BL), 3) bits of the main-memory address are all 0. To avoid frequent main-memory page switching when accessing data within the same main-memory bank storage structure, the x_1 · y_1 bytes covered by the second-level mapping are stored in one row of the main-memory array; reading a small block within one main-memory bank then requires loading that row into the row buffer only once, and with the main memory set to an open-page policy, only the high-order column address needs to be sent for gating, greatly reducing the overhead of opening/closing main-memory pages. Different first-level blocks are stored in different rows of the main-memory array, and within the same first-level block the second-level blocks assigned to different main-memory bank storage structures are stored at the same row address of their respective banks. FIG. 9 is a diagram illustrating a main-memory address format conforming to this design according to the third embodiment of the present invention: the lower 6 bits of the address are all 0; the upper 7 bits of the column address are used for decoding, supporting addressing of the 16 small 8 × 8 blocks within the second-level block of fig. 7; the row address is 15 bits long, supporting addressing of up to 32768 first-level blocks; the main memory contains 8 bank storage structures, addressed with 3 binary bits; with an address line width of 32 bits, the remaining 1 bit is allocated to the channel or rank, its use and value determined by the actual situation.
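To make the field packing concrete, the sketch below assembles a 32-bit main-memory address from the fields of fig. 9; the field order and the placement of the spare channel/rank bit are assumptions, since the text allows the fields to be combined freely.

```python
def pack_address(row, bank, col_hi, spare=0):
    """Pack a 32-bit main-memory address: 6 zero low bits, 7 decoded
    column bits, a 15-bit row address, a 3-bit bank, 1 spare bit."""
    assert row < (1 << 15) and bank < (1 << 3) and col_hi < (1 << 7)
    addr = 0                      # lower 6 bits stay 0 (burst aligned)
    addr |= col_hi << 6           # upper column bits: which 8x8 tile
    addr |= row << 13             # row address: which first-level block
    addr |= bank << 28            # main-memory bank storage structure
    addr |= spare << 31           # remaining bit: channel or rank
    return addr
```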
Based on the above data mapping scheme and main-memory address format, the generation of a reference segmentation small block address is discussed here. Specifically, assume the reference segmentation small block belongs to frame number f_i and has intra-frame coordinates (x_i, y_i). The main-memory page Pg_i that stores the reference segmentation small block can then be obtained from its frame number and intra-frame coordinates, see formula (1):

[Formula (1) appears only as an image in the original document.]

where PagesInCol denotes the number of main-memory pages contained in one column of the image;
Then the row address of the reference segmentation small block is obtained from the main-memory page that stores it, see formula (2):

R_i = f_i · PagesInFrame + Pg_i    (2)

where PagesInFrame denotes the total number of pages contained in one image;
Within the second-level block, the main-memory bank storage structure index of the reference segmentation small block can be obtained from its intra-frame coordinates, see formula (3):

B_i = ⌊(x_i mod x_1) / x_2⌋ · BanksInCol + ⌊(y_i mod y_1) / y_2⌋    (3)

where BanksInCol denotes the number of second-level blocks contained in one column of a first-level block, and mod denotes the modulo operator;
The index value of the reference segmentation small block within its second-level block is calculated from its intra-frame coordinates. Specifically, since all reference segmentation small blocks in the same main-memory bank storage structure are stored in the same row of the main-memory array, these blocks must be addressed by column address, and the intra-block index is computed first, see formula (4):

Blk_i = ⌊(x_i mod x_1 mod x_2) / x_3⌋ · BlocksInCol + ⌊(y_i mod y_1 mod y_2) / y_3⌋    (4)

where BlocksInCol denotes the number of third-level blocks contained in one column of a second-level block;
Further, the column address of the reference segmentation small block is calculated from its index value within the second-level block, see formula (5):

[Formula (5) appears only as an image in the original document.]
If the operands of the multiplication, division and modulo operations above are all powers of 2, the results can be obtained directly by shift operations or by bit operations with masks, so the hardware implementation cost is low.
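For illustration, formulas (2) through (4) translate directly into the following sketch; since formulas (1) and (5) appear only as images in the original, the page number Pg_i is taken as an input and the final column-address step is omitted.

```python
def row_address(f, pg, pages_in_frame):
    """Formula (2): R_i = f_i * PagesInFrame + Pg_i."""
    return f * pages_in_frame + pg

def bank_index(x, y, x1, x2, y1, y2, banks_in_col):
    """Formula (3): main-memory bank holding the reference tile."""
    return (x % x1) // x2 * banks_in_col + (y % y1) // y2

def block_index(x, y, x1, x2, x3, y1, y2, y3, blocks_in_col):
    """Formula (4): index of the tile inside its second-level block."""
    return (x % x1 % x2) // x3 * blocks_in_col + (y % y1 % y2) // y3

# With power-of-2 block sizes, every // and % here reduces to a
# shift or a mask in hardware, as noted in the text.
```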
The intra-group addressing and block replacement strategy of the two-dimensional cache replaces the least recently used blocks along both dimensions, applying the least-recently-used rule to each dimension separately. For the x dimension of a two-dimensional cache group, which corresponds to the bank storage structures in main memory, data blocks from the same main-memory bank storage structure are mounted on the same column, while data from different main-memory bank storage structures fall under different column tags (tag_x); for the y dimension, a one-dimensional intra-group addressing and block-replacement update strategy is applied to the data of the same main-memory bank storage structure (which may belong to different video frames). This design balances the statistics of cache-line access frequency and access time, effectively improving the hit rate of the two-dimensional cache; the time complexity of intra-group addressing is O(n), clearly better than an element-by-element comparison of a two-dimensional array.
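A minimal sketch of such a two-dimensional LRU group is given below, using recency-ordered dictionaries for the two dimensions; the class and method names are illustrative assumptions.

```python
from collections import OrderedDict

class TwoDLRUGroup:
    """Sketch of a cache group that evicts along both dimensions.

    Columns correspond to main-memory banks (tag_x); each column holds
    ways keyed by tag_y. Both levels keep recency order, so lookup and
    replacement stay O(n) instead of pairwise comparison of a 2D array.
    """
    def __init__(self, ways):
        self.ways = ways
        self.cols = OrderedDict()      # tag_x -> OrderedDict(tag_y -> line)

    def access(self, tag_x, tag_y, fill):
        col = self.cols.setdefault(tag_x, OrderedDict())
        self.cols.move_to_end(tag_x)   # x-dimension recency update
        if tag_y in col:
            col.move_to_end(tag_y)     # y-dimension recency update
            return col[tag_y]
        if len(col) >= self.ways:
            col.popitem(last=False)    # evict LRU way in this column
        col[tag_y] = fill()            # miss: fetch the line
        return col[tag_y]
```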
To complement the design goal of reducing main-memory page-switching overhead in the storage system, the invention provides a main-memory-page-based read-request scheduling algorithm for the read requests that the video decoder generates to access reference pixel blocks. A main-memory page field is added to each motion vector information item (MV Item), and the main-memory page Pg_i to which each reference segmentation small block belongs is calculated when the motion vector table is first scanned; on each subsequent lookup, the pixel blocks in the same main-memory page are scheduled preferentially for read requests, and the next polling round starts only after all requests within one main-memory page have been sent. These operations repeat until the request queue of the video decoder is emptied.
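A software sketch of this page-based scheduling might look as follows; page_of stands in for the page computation of formula (1), and the polling order over pages is an assumption.

```python
from collections import defaultdict

def schedule_by_page(mv_items, page_of):
    """Sketch of the main-memory-page-based read scheduler.

    On the first scan, the page Pg_i of each reference tile is
    recorded; each round then drains all requests of one page before
    polling the next, so open rows are reused and page switches are
    minimized.
    """
    by_page = defaultdict(list)
    for item in mv_items:               # first scan: annotate pages
        by_page[page_of(item)].append(item)
    schedule = []
    for page in sorted(by_page):        # one page per polling round
        schedule.extend(by_page[page])  # send all requests in the page
    return schedule
```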
In the video real-time identification, segmentation and detection architecture provided by this embodiment of the invention, a video decoder decodes the target video; a neural network processing module processes the I/P-frame image data to obtain the I/P-frame-class video segmentation and detection results, which provide the basis for reconstructing the B-class frames, whose image segmentation and detection results are in turn obtained from the reconstruction results; a video identification processing module acquires the reconstruction results of the B-class frames, with optimizations such as rearranging the decoding sequence into two queues and caching reference frames in a two-dimensional cache memory provided to improve the efficiency of this acquisition; and a main memory stores the processing results of the modules. By tightly coupling the video decoder with the neural networks, the architecture achieves higher performance while maintaining accuracy, and solves the problem that existing processing methods for video identification tasks cannot reduce computation and energy consumption while guaranteeing high precision.
Although the embodiments of the present invention have been described above, the embodiments are merely for facilitating understanding of the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A video real-time identification, segmentation and detection architecture is characterized by comprising a main memory, and a video decoder, a video identification processing module and a neural network processing module which are respectively connected with the main memory through buses;
the video decoder is used for decoding the target video to obtain a decoding sequence and obtaining I frame type image data, P frame type image data and a motion vector table of the target video;
the video identification processing module is used for storing the motion vectors of the B-class frames based on the motion vector table, and, according to the decoding sequence and based on the motion vectors of the B-class frames, sequentially reading from the main memory and processing the image segmentation results in the reference frames of B-frame image data together with the acquired B-frame image data reconstruction results, or the image detection results of the reference frames of B-frame image data, to acquire the B-class frame reconstruction results;
the neural network processing module is used for segmenting I-frame image data and P-frame image data by using a first preset neural network to obtain an I-frame-class video segmentation result and a P-frame-class video segmentation result, detecting the I-frame image data and the P-frame image data by using a third preset neural network to obtain an I-frame-class video detection result and a P-frame-class video detection result, processing the B-class frame reconstruction results, the I-frame image segmentation results and the P-frame image segmentation results by using a second preset neural network to obtain the image segmentation results of the B-class frames, and processing the B-class frame reconstruction results, the set I-frame-class image detection results and the set P-frame-class image detection results by using the second preset neural network to obtain the image detection results of the B-class frames;
the main memory is used for storing I frame type image data, P frame type image data, I frame type video segmentation results, P frame type video segmentation results, I frame type video detection results, P frame type video detection results, B frame type reconstruction results, set I frame type image detection results, set P frame type image detection results, B frame type video segmentation results and B frame type video detection results;
the video coding and decoding standard of the target video is classification of I-frame image data, B-frame image data and P-frame image data, a motion vector table is provided, and each frame of image data is divided into a plurality of small divided blocks according to a preset mode.
2. The architecture of claim 1, wherein the video recognition processing module comprises a motion vector storage unit, a control unit, a frame ordering unit to be set, and a processing unit;
the motion vector storage unit is used for storing the motion vector of the B-type frame based on the motion vector table;
the control unit is used for sequentially storing the index numbers and the corresponding state information of the I-frame image data and the P-frame image data in a first queue according to a decoding sequence and sequentially storing the index numbers and the corresponding state information of the B-frame image data in a second queue according to the decoding sequence;
the frame ordering unit to be set is used for acquiring image detection results of all reference frames of the B frame image data, and classifying and ordering all detected frames in the image detection results of all reference frames of the B frame image data to obtain the frame ordering to be set of the B frame image data; and setting the image detection result of the reference frame of the B frame image data based on the frame ordering to be set of the B frame image data, acquiring the image detection result of the set reference frame of the B frame image data, and merging all sub-reconstruction results of the B frame image data after the processing unit acquires all sub-reconstruction results of the B frame image data to obtain the reconstruction result of the B frame image data.
The processing unit is used for acquiring the reconstruction result of the B frame image data according to all image segmentation results in the reference frame of the B frame image data and the reconstruction results of all acquired B frame image data, or acquiring the sub-reconstruction result of the B frame image data based on the image detection result of the set reference frame corresponding to the B frame image data, the motion vector of the B frame and the reconstruction results of the acquired B frame image data.
3. The architecture according to claim 2, wherein the initial state of the I-frame image data and the P-frame image data is set to s_ready, and the state of the image segmentation result of the I-frame image data, the image detection result of the I-frame image data, the image segmentation result of the P-frame image data and the image detection result of the P-frame image data is set to s_done; meanwhile, the initial state of the B-frame image data is set to s_unready, the state of the reconstruction result of the B-frame image data to s_ready, and the states of the image segmentation result of the B-frame image data and the B-frame-class video detection result to s_done.
4. The architecture of claim 2, wherein the video recognition processing module further comprises a merging unit;
the merging unit is used for sequentially acquiring the motion vectors of the M access requests to be sent from the motion vector storage unit and merging the access requests of the acquired M motion vectors according to a preset merging mode to obtain at least one merged access request;
wherein, the preset merging mode is as follows: combining access requests corresponding to the motion vectors with the same reference frame number and source ordinate in the M motion vectors into a combined access request; m is a positive integer.
5. The framework of claim 2, wherein the video recognition processing module further comprises a buffer, and the buffer is configured to pre-fetch an image segmentation result or an image detection result of a reference frame of B-frame image data and a reconstruction result of completed B-frame image data from the main memory, and obtain reference segmented small blocks corresponding to all motion vectors with reference frame numbers stored in the motion vector storage unit based on the image segmentation result or the image detection result of the pre-fetched reference frame and the reconstruction result of completed B-frame image data, and send the obtained reference segmented small blocks to the processing unit;
the buffer prefetches from the main memory the image segmentation result or image detection result of a reference frame of B-frame image data and the reconstruction results of completed B-frame image data, acquires, based on the prefetched reference-frame result and completed reconstructions, the reference segmentation small blocks corresponding to all motion vectors with the stored reference frame numbers in the motion vector storage unit, and sends the acquired reference segmentation small blocks to the processing unit, wherein the buffer comprises:
the prefetch pointer set in the buffer points to the first motion vector in the motion vector storage unit, which serves as the prefetch motion vector; the prefetch pointer acquires from the main memory and stores the image segmentation result or image detection result of the reference frame in the prefetch motion vector and the reconstruction results of completed B-frame image data, and sends the reference frame number in the prefetch motion vector to the search pointer set in the buffer;
the search pointer searches the motion vector storage unit for the motion vectors whose reference frame numbers all equal the reference frame number of the prefetch motion vector to form a homogeneous motion vector group, and, based on the acquired image segmentation result or image detection result of the reference frame in the prefetch motion vector and the reconstruction results of completed B-frame image data, sequentially acquires the reference segmentation small blocks of all motion vectors in the homogeneous motion vector group and sends them to the processing unit;
all motion vectors in the homogeneous motion vector group are deleted from the motion vector storage unit, and whether motion vectors remain in the current motion vector storage unit is judged; if so, the prefetch pointer points to the first motion vector in the current motion vector storage unit, and the working processes of the prefetch pointer and the search pointer are repeated; otherwise, the prefetch is ended.
6. The architecture according to claim 2, wherein said video recognition processing module further comprises a two-dimensional cache memory for fetching and caching from said main memory the segmentation results of the reference frames of B-frame image data;
the two-dimensional cache memory comprises at least one bank memory structure, at least one group memory structure is arranged in the bank memory structure, and at least one way memory unit is arranged in the group memory structure.
7. The architecture according to claim 6, characterized in that the two-dimensional image data stored in said main memory each comply with preset memory mapping rules;
the preset storage mapping rule is as follows: the two-dimensional image data are mapped into the main memory in three levels, and each level of mapping is used for segmenting the two-dimensional image data with different granularities;
the number of the secondary blocks in the secondary mapping is the same as the number of the physical storage structures in the main memory, and the size of the tertiary block in the tertiary mapping is the product of the burst length of the main memory and the width of the data bus divided by 8; and the two-dimensional image data comprises I-frame type image data, P-frame type image data, I-frame type video segmentation results, P-frame type video segmentation results, I-frame type video detection results, P-frame type video detection results, B-frame type reconstruction results, B-frame type video segmentation results and B-frame type video detection results.
8. The architecture of claim 7, wherein the address of a reference segmentation small block in the main memory is generated by:
acquiring the main-memory page storing the reference segmentation small block based on its frame number and intra-frame coordinates;
acquiring the row address of the reference segmentation small block based on the main-memory page storing it;
acquiring the bank storage structure index of the reference segmentation small block in the main memory based on its intra-frame coordinates;
calculating the index value of the reference segmentation small block within its second-level block based on its intra-frame coordinates;
calculating the column address of the reference segmentation small block based on its index value within the second-level block.
9. The architecture of claim 6, wherein the two-dimensional cache address structure comprises an X address and a Y address; the X address and the Y address each comprise a tag, an index, and a bank storage structure of a two-dimensional cache memory;
the bank storage structure address of the main memory and the high-order bits of the main-memory row address and column address are distributed in the tags of the X address and the Y address, the indexes of the X address and the Y address comprise the low-order bits of the main-memory row address, and the bank fields of the X address and the Y address each comprise one bit of the main-memory column address.
10. The architecture of claim 6, wherein the intra-set addressing and block replacement policy of the two-dimensional cache is to replace a least recently used block from two dimensions using a least recently used rule, respectively.
CN202010945694.1A 2020-09-10 2020-09-10 Video real-time identification, segmentation and detection architecture Active CN111985456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010945694.1A CN111985456B (en) 2020-09-10 2020-09-10 Video real-time identification, segmentation and detection architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010945694.1A CN111985456B (en) 2020-09-10 2020-09-10 Video real-time identification, segmentation and detection architecture

Publications (2)

Publication Number Publication Date
CN111985456A true CN111985456A (en) 2020-11-24
CN111985456B CN111985456B (en) 2022-08-30

Family ID: 73449588

Country Status (1)
CN (1) CN111985456B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101163244A (en) * 2007-11-22 2008-04-16 上海交通大学 Picture element caching method and system in movement compensation process of video decoder
CN102067608A (en) * 2008-06-25 2011-05-18 高通股份有限公司 Fragmented reference in temporal compression for video coding
CN101308543A (en) * 2008-07-04 2008-11-19 刘显福 Segmenting and recognizing method of image frame of data stream and apparatus thereof
WO2016188251A1 (en) * 2015-05-26 2016-12-01 珠海全志科技股份有限公司 Video decoding data storage method and motion vector data computation method
CN111263166A (en) * 2018-11-30 2020-06-09 华为技术有限公司 Video image prediction method and device
CN110796662A (en) * 2019-09-11 2020-02-14 浙江大学 Real-time semantic video segmentation method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
IRENA KOPRINSK ET AL: "Video Segmentation of MPEG Compressed Data", 《IEEE》 *
REN YANG ET AL: "DECODER-SIDE HEVC QUALITY ENHANCEMENT WITH SCALABLE CONVOLUTIONAL NEURAL NETWORK", 《PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)》 *
JING NAIFENG ET AL: "A Placement Algorithm for Heterogeneous Reconfigurable Arrays Based on Multi-Stage Simulated Annealing", 《MICROELECTRONICS & COMPUTER》 *
LI HU: "Design of Parallel Algorithms for an H.264 Decoder and Implementation Based on CUDA", 《CHINA MASTER'S THESES FULL-TEXT DATABASE》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714336A (en) * 2020-12-01 2021-04-27 万兴科技集团股份有限公司 Video segmentation method and device, electronic equipment and computer readable storage medium
CN112949446A (en) * 2021-02-25 2021-06-11 山东英信计算机技术有限公司 Object identification method, device, equipment and medium
CN112949446B (en) * 2021-02-25 2023-04-18 山东英信计算机技术有限公司 Object identification method, device, equipment and medium
CN113255564A (en) * 2021-06-11 2021-08-13 上海交通大学 Real-time video recognition accelerator architecture based on key object splicing
CN113378717A (en) * 2021-06-11 2021-09-10 上海交通大学 Video identification method based on key object splicing, device storage medium and terminal
CN113255564B (en) * 2021-06-11 2022-05-06 上海交通大学 Real-time video identification accelerator based on key object splicing
CN113378717B (en) * 2021-06-11 2022-08-30 上海交通大学 Video identification method based on key object splicing, device storage medium and terminal
CN115529473A (en) * 2021-06-24 2022-12-27 北京金山云网络技术有限公司 Video stream processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN111985456B (en) 2022-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant