CN112422983B - Universal multi-core parallel decoder system and application thereof - Google Patents


Info

Publication number
CN112422983B
CN112422983B
Authority
CN
China
Prior art keywords
decoding, line, core, macro block, full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011154537.5A
Other languages
Chinese (zh)
Other versions
CN112422983A (en)
Inventor
雷理
张云
韦虎
贾大中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mouxin Technology Shanghai Co ltd
Original Assignee
Mouxin Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mouxin Technology Shanghai Co ltd filed Critical Mouxin Technology Shanghai Co ltd
Priority to CN202011154537.5A priority Critical patent/CN112422983B/en
Publication of CN112422983A publication Critical patent/CN112422983A/en
Application granted granted Critical
Publication of CN112422983B publication Critical patent/CN112422983B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/436 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation using parallelised computational arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44 Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a universal multi-core parallel decoder system and an application thereof, relating to the technical field of video decoding. The system comprises decoding firmware and a multi-core hardware decoding accelerator in communication connection, with Slice-level data in the video code stream serving as the interaction unit between them. The decoding firmware parses the non-entropy-coded data of the upper layer of the video code stream, while the multi-core hardware decoding accelerator handles the decoding tasks of the macroblock layer. The multi-core hardware decoding accelerator comprises a preprocessor module, which performs the entropy decoding tasks of the video code stream, and at least two homogeneous full-function hardware decoders. Each full-function hardware decoder is responsible for decoding one row of macroblocks, and the macroblocks being decoded in two adjacent rows are kept at least two macroblocks apart. The parallelism of the invention is not limited by the number of Slice/Tile partitions, general decoding is realized, expandability is good, and parallel processing efficiency is high.

Description

Universal multi-core parallel decoder system and application thereof
Technical Field
The invention relates to the technical field of video decoding, in particular to a universal multi-core parallel decoder system and application thereof.
Background
In the "Ultra-High Definition Video Industry Development Action Plan (2019-2022)" issued in 2019, it was explicitly proposed to advance the ultra-high definition video industry and its applications in related fields along the general technical route of "4K first, with 8K in parallel". Accelerating the development of the ultra-high definition video industry is of great significance for driving the intelligent transformation of industries such as 5G and security, in which video is an important carrier, and for improving the overall strength of China's information and cultural industries. Ultra-high definition video naturally places higher demands on the video codec process. For high-definition video, the current mainstream video compression standards are Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC). Software decoding incurs significant CPU and power overhead, so the industry usually employs a dedicated hardware accelerator as the Video Decoder (VDEC). Taking a common single-core hardware decoder as an example, it mostly adopts a pipeline design with the macroblock (denoted MB, i.e. Macro Block, in AVC; denoted CTB, i.e. Coding Tree Block, in HEVC) as the pipeline unit. Taking HEVC as an example, its main pipeline stage division is shown in fig. 1, which comprises 4 stages with the following functions: first stage, Entropy Dec: entropy decoding (CABAC/CAVLC); second stage, IQT/IDCT: inverse quantization and inverse DCT transform; third stage, IPred/Rec: intra/inter prediction and image reconstruction; fourth stage, Dblock/SAO: deblocking filtering and sample adaptive offset.
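The throughput benefit of the 4-stage pipeline described above can be sketched with a small model. This is an illustrative assumption, not taken from the patent: it assumes one macroblock advances one stage per cycle, so N macroblocks drain through S stages in S + N - 1 cycles.

```python
# Hypothetical model of the single-core macroblock pipeline in fig. 1.
# Stage names follow the description above; the one-stage-per-cycle
# timing model is an assumption for illustration only.

STAGES = ["EntropyDec", "IQT/IDCT", "IPred/Rec", "Dblock/SAO"]

def pipeline_cycles(num_macroblocks: int, num_stages: int = len(STAGES)) -> int:
    """Total cycles for a fully utilized pipeline: fill latency of
    num_stages cycles, then one macroblock retired per cycle."""
    if num_macroblocks == 0:
        return 0
    return num_stages + (num_macroblocks - 1)

# A pipelined single core approaches 1 macroblock/cycle: 8 macroblocks
# through 4 stages take 11 cycles rather than 32 sequential ones.
```

The point of the model is that even an ideal single-core pipeline retires at most one macroblock per cycle, which motivates the multi-core designs discussed next.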
For ultra-high definition video, the decoder is required to achieve 8K@30fps/4K@120fps performance, which a single-core hardware decoder can hardly meet within a 600 MHz clock frequency. Therefore, the prior art proposes multi-core video decoder designs to realize efficient parallel decoding, whose good scalability can meet still higher performance requirements. Current parallel decoding schemes are generally developed around the Frame, Slice, Tile and MB/CTB line (i.e. macroblock line) levels; the mainstream schemes are as follows.
First, frame-level parallel decoding. AVC/HEVC frames are divided into I, P and B frames, where B frames are often used as non-reference frames; since no other frame depends on such B frames, they can be decoded in parallel.
Second, Slice-level parallel decoding. Each AVC/HEVC image frame can be divided into one or more Slices, a Slice consisting of an integer number of consecutive macroblocks. The purpose of the Slice is to enhance robustness against transmission errors: when an error occurs, the Slices without errors are not affected.
Third, Tile-level parallel decoding. To improve parallelism, HEVC introduces the Tile partitioning mode, in which a frame is divided into M×N rectangular regions that can be computed in parallel (where M and N are integers, M ≥ 1, N ≥ 1).
Fourth, WPP parallel decoding. To improve parallelism, HEVC provides a dedicated Wavefront Parallel Processing (WPP) scheme: during encoding, the entropy coding state is re-initialized at the head of each CTB line without waiting for the entropy decoding of the previous line to finish completely, so that the inter-line serial dependency of entropy-decoded macroblocks is broken and parallel computation over macroblocks becomes possible.
However, the above schemes have the following drawbacks. For the first scheme, the number of B frames between P frames is usually small, so the parallelism is low; moreover, some protocol standards allow B frames to be used as reference frames, and such B frames cannot be frame-parallel decoded. For the second scheme, although the Slices of one frame are usually independent of each other, deblocking filtering may cross Slice boundaries, in which case Slice-based parallel decoding fails; in addition, the parallelism is directly determined by the number of Slices in a frame, making the dependence on the Slice partitioning too strong. For the third scheme, as with Slice parallel decoding, some protocol standards still allow deblocking filtering to cross Tile boundaries, in which case Tile-based parallel decoding likewise fails; and the parallelism again depends directly on the number of Tiles in a frame. For the fourth scheme, the per-line initialization increases the video bit rate, which limits the application of WPP parallel decoding to a certain extent.
On the other hand, in AVC/HEVC single-thread decoding, the macroblocks within a frame (MBs in AVC, CTBs in HEVC) are generally processed left to right, top to bottom. During entropy decoding, intra prediction, inter prediction, deblocking filtering, etc., a macroblock typically depends on the information of its left and upper neighbouring macroblocks. To process macroblocks in parallel within a frame, data-independent macroblocks must be found. The prior art provides the 2D-Wave algorithm, which is similar to the HEVC wavefront idea. Specifically, under the protocol constraints, the nearest macroblock independent of a given macroblock is located at the relative position "right 2, up 1". The typical dependency relationship is shown in fig. 2, where the diagonally hatched macroblocks (x, y) and (x+2, y-1) are mutually independent, and the macroblocks that (x, y) depends on are the dotted macroblocks (x, y-1) and (x+1, y-1). Thus, as long as a synchronization constraint is imposed between the threads so that each upstream decoding thread leads its downstream thread by at least two macroblocks, the rows can be decoded simultaneously. Referring to fig. 3, macroblock 10 is a decoded macroblock, macroblock 20 is a macroblock being decoded, and blank macroblock 30 is a macroblock not yet decoded. During video decoding, each thread is responsible for decoding one macroblock row, and the upper and lower rows progress rightwards in a staggered wavefront manner, realizing macroblock-level parallel processing. 2D-Wave has good scalability: its parallelism is not limited by the number of Slice/Tile (i.e. stripe/slice) partitions in a frame, and the maximum parallelism (i.e. the number of macroblocks that can be processed simultaneously) is related to the width and height of the frame. It can generally be determined by the formula ParMB_max,2D = min(mb_width/2, mb_height), where ParMB_max,2D denotes the maximum parallelism, mb_width the number of macroblocks across the frame width, and mb_height the number of macroblocks across the frame height. For example, for an image frame of 1920×1080 (width×height) pixels, the frame contains 120×68 (width×height) macroblocks, and the formula gives a maximum parallelism ParMB_max,2D of 120/2 = 60 (the minimum of 120/2 and 68). In 4K/8K ultra-high definition application scenarios, the 2D-Wave algorithm can thus greatly increase the parallelism and accelerate the decoding of high-definition video.
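The formula and the worked example above can be checked with a short sketch. The helper names are illustrative assumptions; 16×16 macroblocks and rounding partial macroblocks up follow the 1920×1080 → 120×68 example in the text.

```python
def mb_dims(width_px: int, height_px: int, mb_size: int = 16):
    """Frame size in macroblocks, rounding partial macroblocks up
    (1080 px / 16 = 67.5, counted as 68 rows as in the example above)."""
    return (-(-width_px // mb_size), -(-height_px // mb_size))

def par_mb_max_2d(mb_width: int, mb_height: int) -> int:
    """Maximum 2D-Wave parallelism: ParMB_max,2D = min(mb_width/2, mb_height)."""
    return min(mb_width // 2, mb_height)

# 1920x1080 -> 120x68 macroblocks -> maximum parallelism 60.
```

For a 3840×2160 (4K) frame the same sketch gives 240×135 macroblocks and a maximum parallelism of 120, illustrating why 2D-Wave scales well to ultra-high definition.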
However, the 2D-Wave algorithm has the following drawbacks. On the one hand, since entropy decoding is completely serial (the entropy decoding of each macroblock in a Slice depends on its preceding macroblock), the first macroblock of the next line can only obtain the required context information after the last macroblock of the previous line has been decoded; on the other hand, the code stream data within a Slice is also completely serial, and the position of the code stream data corresponding to the first macroblock of the next line is likewise known only after the last macroblock of the previous line has been decoded, which affects the parallel processing efficiency to a certain extent.
In summary, no existing parallel decoding scheme is optimal in all of generality, scalability, parallelism and parallel processing efficiency, and most are limited to multi-core parallelism over code streams encoded in a particular way. How to provide a general multi-core parallel decoding scheme with good scalability and strong universality, whose parallelism is not limited by the number of Slice/Tile partitions in a frame, is therefore a technical problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a universal multi-core parallel decoder system and an application thereof. According to the characteristics of the code stream data, the Video Decoder (VDEC) is divided into decoding firmware (VDEC_FW, short for Video FirmWare) and a multi-core hardware decoding accelerator (VDEC_MCORE, short for VDEC Multi-Core), which respectively parse the non-entropy-coded data and decode the macroblock layer; this parallel processing of software and hardware improves the parallel processing efficiency. Meanwhile, the multi-core hardware decoding accelerator adopts a preprocessor and full-function hardware decoders, and a multi-core synchronization mechanism is set up to realize synchronous parallel processing of upper and lower macroblock rows. The invention can process the macroblock rows of a frame of any code stream in parallel, with parallelism not limited by the number of Slice/Tile (i.e. stripe/slice) partitions in a frame, thereby realizing general decoding with good scalability and high parallel processing efficiency.
In order to achieve the above object, the present invention provides the following technical solutions:
a general multi-core parallel decoder system comprises decoding firmware and a multi-core hardware decoding accelerator which are in communication connection;
the decoding firmware is used for analyzing the non-entropy coding data of the upper layer of the video code stream; the multi-core hardware decoding accelerator is used for processing decoding tasks of a macro block layer in the video code stream; the Slice level data in the video code stream is used as an interaction unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is carried out through Slice Queue;
the multi-core hardware decoding accelerator comprises a preprocessor module and at least two isomorphic full-function hardware decoders; the preprocessor module is used for entropy decoding task of the video code stream; each of the fully functional hardware decoders is responsible for decoding a row of macroblocks including inverse DCT transform, inverse quantization, intra-frame inter-prediction and pixel reconstruction steps, and spacing the macroblocks being decoded in two adjacent upstream and downstream rows by at least two macroblocks.
Further, when the Slice parallel processing is performed through the Slice Queue, the decoding firmware is configured to: after the upper-layer parsing of the video code stream is completed, pack the Slice upper-layer parameter information and push it into the Slice Queue;
the multi-core hardware decoding accelerator is configured to: inquiring ready information of Slice Queue data, after reading a Queue and completing configuration, analyzing a current Slice inner macro block until analysis of the Slice inner macro block is completed, and sending an interrupt signal after analysis is completed to release the Slice Queue.
Further, the preprocessor module is configured to:
and only executing the entropy decoding task of the video code stream, recording necessary information at the Line head of each macro block Line, and pressing the necessary information into the Line Queue of each full-function hardware decoder in a Command Queue mode.
Further, the multi-core hardware decoding accelerator comprises a first full-function hardware decoder for decoding the 2N-th macroblock rows and a second full-function hardware decoder for decoding the (2N+1)-th macroblock rows, where N is a non-negative integer; the two full-function hardware decoders share one group of line caches, forming a dual-core shared cache in which the dependency information between macroblocks is stored;
the full function hardware decoder is configured to: checking whether instructions exist in the respective Line Queue, and starting decoding of a macro block Line according to the content of the instructions when the instructions are included; and monitoring the working position of the full-function hardware decoding corresponding to the last macroblock line in the decoding process, and enabling the processing position of the full-function hardware decoding to be at least two macroblocks later than the processing position of the last macroblock line.
Further, a line cache arbiter grants the line cache ready flag to realize dual-core synchronous decoding;
the multi-core hardware decoding accelerator is configured to: at the start of each macroblock, first read the line cache information through the assigned full-function hardware decoder, and coordinate the processing speeds of the full-function hardware decoders of the upper and lower macroblock rows, so that for any two adjacent rows being decoded, the processing position of the lower row always lags that of the upper row by at least two macroblocks; in the vertical direction, when the first full-function hardware decoder starts decoding a new 2(N+1)-th macroblock row, it is guaranteed that the processing position of the 2(N+1)-th row lags that of the preceding (2N+1)-th row by at least two macroblocks.
Further, a single-core line cache prefetch module is provided for each full-function hardware decoder, comprising a ping-pong prefetch cache consisting of a first cache and a second cache, through which the line cache data of the previous macroblock row is staged;
the single-core line cache prefetch module is configured to: starting from the leftmost macroblock of a macroblock row, first read the required data from the dual-core shared line cache into the ping-pong prefetch cache, filling both the first cache and the second cache completely on the initial fill; after one macroblock in the decoder is decoded, update the corresponding first or second cache and write the information recorded in that cache back to the shared line cache; after the write-back is finished, recycle the cache and prefetch the line cache data required by the next macroblock, repeating this operation until the rightmost side of the image is reached.
Further, the non-entropy-coded data includes one or more of the video parameter set VPS, sequence parameter set SPS, picture parameter set PPS and Slice header information.
Further, the decoding tasks of the macroblock layer include the full-hardware code stream reading, entropy decoding, inverse DCT transform, inverse quantization, intra/inter prediction, motion compensation, pixel reconstruction and deblocking filtering steps.
The invention also provides a video decoding method, which comprises the following steps:
receiving video code stream data;
parsing the non-entropy-coded data of the upper layer of the video code stream by decoding firmware, and processing the decoding tasks of the macroblock layer in the video code stream by a multi-core hardware decoding accelerator, with the Slice-level data in the video code stream serving as the interaction unit between the decoding firmware and the multi-core hardware decoding accelerator and Slice parallel processing performed through the Slice Queue;
wherein,
the multi-core hardware decoding accelerator comprises a preprocessor module and at least two isomorphic full-function hardware decoders, entropy decoding of video code streams is carried out through the preprocessor module, each full-function hardware decoder is responsible for decoding one row of macro blocks including inverse DCT transformation, inverse quantization, intra-frame inter-frame prediction and pixel reconstruction steps, and at least two macro blocks are separated from each other in two adjacent uplink and downlink macro blocks so as to realize multi-core synchronous decoding.
Further, the video is AVC video or HEVC video.
Compared with the prior art, taking the above technical scheme as an example, the invention has the following advantages and positive effects:
1) Based on the current mainstream video compression standards AVC and HEVC, the video decoder (VDEC) is divided into two parts, decoding firmware and a multi-core hardware decoding accelerator, according to the characteristics of the code stream data. Firmware is software embedded in a hardware device, typically a program stored in EPROM (erasable programmable read-only memory) or EEPROM (electrically erasable programmable read-only memory). The decoding firmware, as the software part, parses the non-entropy-coded data of the upper layer of the video code stream, while the multi-core hardware decoding accelerator, as the hardware part, centrally processes all decoding work of the macroblock layer. The Slice (i.e. stripe) level of the video stream serves as the interaction unit between VDEC_FW and VDEC_MCORE, with data exchanged through the Slice Queue (i.e. stripe queue) inside the VDEC, so that parallel processing of software and hardware improves the parallel processing efficiency.
2) VDEC_MCORE adopts a multi-core design of a preprocessor plus full-function hardware decoders, uses a Line Queue strategy to enable parallel processing of upper and lower macroblock rows, and sets up a multi-core synchronization mechanism to realize their synchronous parallel processing.
3) A data sharing and synchronization management mechanism for the Line buffer (i.e. line cache) among the multiple cores improves the accuracy and stability of macroblock-row parallel decoding.
The technical scheme provided by the invention can process the macroblock rows within one frame of a code stream in any coding mode in parallel, with parallelism not limited by the number of Slice/Tile partitions in a frame, thereby realizing general decoding with good scalability and high parallel processing efficiency.
Drawings
FIG. 1 is a schematic diagram of a pipeline design of a prior art single core hardware decoder.
Fig. 2 is a schematic diagram of the dependency relationship of a macroblock in the prior art.
FIG. 3 is a schematic diagram illustrating the operation of a parallel decoding process performed by a plurality of threads according to the prior art.
Fig. 4 is a schematic diagram of interaction between module structures of a system according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a dual-core structure of a multi-core hardware decoding accelerator according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an operation of dual-core parallel decoding according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of information transmission of setting dual-core synchronization by the line cache arbiter according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a macro block processing position of dual-core coordination according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of a hardware decoder for setting ping-pong prefetch buffer according to an embodiment of the present invention.
Fig. 10 is a schematic flow chart of performing ping-pong prefetching according to an embodiment of the present invention.
Reference numerals illustrate:
decoded macroblock 10, in-decoding macroblock 20, and not decoded macroblock 30.
Detailed Description
The general multi-core parallel decoder system and its application disclosed in the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the technical features or combinations of technical features described in the following embodiments should not be regarded as being isolated, and they may be combined with each other to achieve a better technical effect. In the drawings of the embodiments described below, like reference numerals appearing in the various drawings represent like features or components and are applicable to the various embodiments. Thus, once an item is defined in one drawing, no further discussion thereof is required in subsequent drawings.
It should be noted that the structures, proportions and sizes shown in the drawings are used only in conjunction with the disclosure of the specification and are not intended to limit the applicable scope of the invention. The scope of the preferred embodiments of the invention includes additional implementations in which functions may be performed out of the order described or discussed, including in a substantially simultaneous manner or in reverse order, depending on the function involved, as would be understood by those skilled in the art to which the embodiments pertain.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate. In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
Examples
Referring to fig. 4, a generic multi-core parallel decoder system provided by the present invention is shown. The system includes a communicatively coupled decoding firmware (VDEC_FW, labelled Video FirmWare in fig. 4) and a multi-core hardware decoding accelerator (VDEC_MCORE, labelled VDEC Multi-Core in fig. 4).
The decoding firmware is used for analyzing the non-entropy coding data of the upper layer of the video code stream.
The multi-core hardware decoding accelerator is used for processing the decoding tasks of the macroblock layer in the video code stream. In this embodiment, the multi-core hardware decoding accelerator includes a preprocessor module and at least two homogeneous full-function hardware decoders.
The preprocessor module is used for the entropy decoding task of the video code stream.
Each full-function hardware decoder is responsible for decoding one row of macroblocks, including the inverse DCT transform, inverse quantization, intra/inter prediction and pixel reconstruction steps, and the macroblocks being decoded in two adjacent rows are kept at least two macroblocks apart to realize multi-core synchronous decoding.
Slice-level data in the video code stream serves as the interaction unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is performed through the Slice Queue.
The code streams of AVC and HEVC video adopt a layered structure: most of the syntax shared at the GOP layer and the Slice layer is separated out into the video parameter set VPS (Video Parameter Set), sequence parameter set SPS (Sequence Parameter Set), picture parameter set PPS (Picture Parameter Set), etc., which account for a small proportion of the data and are simple to parse, making them very suitable for software parsing. Based on these characteristics of the code stream data, the decoder system of this embodiment divides the video decoder VDEC into two parts, the decoding firmware VDEC_FW and the multi-core hardware decoding accelerator VDEC_MCORE: the decoding firmware, as the software part, parses the non-entropy-coded data of the upper layer of the video code stream, and the multi-core hardware decoding accelerator, as the hardware part, centrally processes all decoding operations of the macroblock layer.
Preferably, the decoding firmware VDEC_FW is responsible for parsing the non-entropy-coded data at the upper layer of the code stream, such as the video parameter set VPS, sequence parameter set SPS, picture parameter set PPS and Slice header information; the multi-core hardware decoding accelerator centrally processes all decoding work of the macroblock layer, which may include the full-hardware code stream reading, entropy decoding, inverse DCT transform, inverse quantization, intra/inter prediction, pixel reconstruction, deblocking filtering and other steps.
With continued reference to fig. 4, in this embodiment, the software/hardware uses the Slice level in the video bitstream as an interaction unit, and performs data interaction through a Slice Queue (i.e., a stripe Queue) inside the video decoder. The interaction flow of the decoding firmware vdec_fw and the multi-core hardware decoding accelerator vdec_mcore may be as follows:
1) After the decoding firmware VDEC_FW completes the upper-layer parsing task of the code stream, it packs the Slice upper-layer parameter information and pushes it into the Slice Queue, i.e. the information is pushed into the queue to wait. The downward arrow in fig. 4 indicates the push into the Slice Queue.
At this time, the decoding firmware is configured to: after the upper layer analysis of the video code stream is completed, the Slice upper layer parameter information is packed and pressed into Slice Queue.
2) The multi-core hardware decoding accelerator VDEC_MCORE inquires ready information (ready state information) of Slice Queue data, reads Queue information and completes configuration, then the whole hardware analyzes macro blocks in the current Slice until finishing, and sends an interrupt signal when finishing to release Slice Queue, namely corresponding information in a Queue of the Slice Queue is released (pop). The upward arrow in fig. 4 indicates the operation of releasing the Slice Queue.
At this time, the multi-core hardware decoding accelerator is configured to: inquiring ready information of Slice Queue data, after reading a Queue and completing configuration, analyzing a current Slice inner macro block until analysis of the Slice inner macro block is completed, and sending an interrupt signal after analysis is completed to release the Slice Queue.
In this way, the software/hardware partition combined with the software queue achieves parallel processing of software and hardware, which can significantly save software processing time and thereby improve parallel processing efficiency.
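The firmware/accelerator handshake through the Slice Queue described above can be sketched as a simple producer-consumer model. This is a behavioral sketch only; all names (`SliceQueue`, `firmware_parse`, `accelerator_run`, the slice-parameter fields) are illustrative assumptions, not identifiers from the patent's implementation, and the hardware's interrupt is stood in for by a log entry.

```python
# Behavioral model of the Slice Queue interaction between the decoding
# firmware (software producer) and the multi-core hardware decoding
# accelerator (hardware consumer). Names and fields are hypothetical.
from collections import deque

class SliceQueue:
    """FIFO carrying per-slice upper-layer parameters from firmware to hardware."""
    def __init__(self):
        self._q = deque()

    def push(self, slice_params):        # firmware side: pack and enqueue
        self._q.append(slice_params)

    def ready(self):                     # hardware side: poll ready state
        return len(self._q) > 0

    def pop(self):                       # hardware side: release entry after decode
        return self._q.popleft()

def firmware_parse(slice_headers, q):
    """Software part: parse non-entropy-coded upper layers and push per-slice params."""
    for hdr in slice_headers:
        q.push({"first_mb": hdr["first_mb"], "qp": hdr["qp"]})

def accelerator_run(q, log):
    """Hardware part: drain the queue, decode each slice's macroblock layer,
    then 'send an interrupt' (modelled here as a log entry) and release the entry."""
    while q.ready():
        params = q.pop()
        log.append(("decoded", params["first_mb"]))  # stands in for MB-layer decode
```

A usage example: the firmware can run ahead and queue several slices while the accelerator is still busy, which is exactly the software/hardware overlap the text describes.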
The multi-core hardware decoding accelerator VDEC_MCORE may include a Pre-Parser module and a plurality of homogeneous Full-function hardware Decoders. Each full-function hardware decoder is at least capable of handling the inverse DCT transform, inverse quantization, intra/inter prediction, and pixel reconstruction steps necessary for macroblock-row decoding.
Two or more homogeneous full-function hardware decoders may be provided: with two, the device is called a dual-core hardware decoding accelerator; with three, a tri-core hardware decoding accelerator; with four, a quad-core hardware decoding accelerator; and so on. Each full-function hardware decoder is responsible for decoding one macroblock row, so the dual-core accelerator can decode two macroblock rows in parallel, the tri-core accelerator three rows, and so on.
Preferably, the preprocessor module performs only the entropy decoding task on the video bitstream. Specifically, the preprocessor module may record the necessary information at the head of each macroblock Line and push it, in the form of a Command Queue, into the Line Queue of the corresponding full-function hardware decoder, i.e. the information is pushed into the Line Queue to wait in line.
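The pre-parser's role can be sketched as a serial scan that records, at the head of each macroblock row, where that row starts in the bitstream and what entropy-decoder state applies there, then dispatches a command to the owning core's Line Queue (even rows to core 0, odd rows to core 1 in the dual-core case). This is an illustrative assumption about the command contents: `pre_parse`, the per-row bit counts, and the toy `entropy_state` stand in for a real CABAC/CAVLC context, which is far richer.

```python
# Illustrative sketch of the pre-parser: walk the entropy-coded data serially,
# record per-row start offset and carried entropy state, and push a command
# into the Line Queue of the core that owns the row. Names are hypothetical.
from collections import deque

def pre_parse(mb_row_bits, num_cores=2):
    """mb_row_bits: bit length of each macroblock row (stand-in for real parsing)."""
    line_queues = [deque() for _ in range(num_cores)]
    offset, state = 0, 0
    for row, row_bits in enumerate(mb_row_bits):
        cmd = {"row": row, "start_offset": offset, "entropy_state": state}
        line_queues[row % num_cores].append(cmd)  # even rows -> core 0, odd -> core 1
        offset += row_bits                        # serial scan advances the offset
        state = (state + row_bits) & 0xFF         # toy model of carried entropy state
    return line_queues
```

Because this stage is serial, running it on a dedicated module clocked higher than the decoder cores (as the text suggests for the Pre-Parser) keeps the command queues from starving.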
The present embodiment is described in detail below taking a dual-core hardware decoding accelerator as a preferred example.
Referring to fig. 5, the dual-core hardware decoding accelerator includes a Pre-Parser module and two homogeneous Full-function hardware Decoders, specifically a first full-function hardware decoder (Full-function Decoder Core 0) and a second full-function hardware decoder (Full-function Decoder Core 1).
The Pre-Parser module is motivated by the serial nature of entropy decoding: before a full-function hardware decoder can start decoding, it must be given the start address and the entropy-decoder start state of each macroblock row in the bitstream, so the bitstream entropy decoding task is handled by a dedicated Pre-Parser module.
The Pre-Parser module simply executes the bitstream entropy decoding task, records the necessary information at the head of each macroblock row, and then pushes it into the Line Queue of each full-function hardware decoder in the form of a Command Queue. Because the Pre-Parser is responsible only for entropy decoding, with careful design it can easily run at a high clock frequency to raise the entropy decoding throughput and thereby match the performance of the multi-core Full-function Decoders.
By way of example and not limitation, the first full-function hardware decoder may be used to decode the 2N-th macroblock row and the second full-function hardware decoder to decode the (2N+1)-th macroblock row, where N is an integer greater than 0, as shown in fig. 6.
When performing a decoding task, each full-function hardware decoder may check whether there is an instruction in its own Line Queue and then start decoding a macroblock row according to the content of that instruction. Meanwhile, during decoding, each full-function hardware decoder monitors the working position of the decoder handling the macroblock row above; its own processing position must lag the processing position of that upper row by at least two macroblocks.
The two full-function hardware decoders may share one set of line buffers, forming a dual-core shared buffer in which all dependency information between macroblocks is stored. The two full-function hardware decoders jointly maintain the line buffers, including read and update operations on them.
At this point, the full-function hardware decoder is configured to: check whether there is an instruction in its own Line Queue and, when there is, start decoding a macroblock row according to the content of the instruction; and, during decoding, monitor the working position of the full-function hardware decoder handling the macroblock row above, keeping its own processing position at least two macroblocks behind the processing position of that upper row.
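The two-macroblock lag constraint amounts to a wavefront schedule: a core may decode macroblock x of its row only once the row above has advanced past macroblock x+1 (or finished). The following is a minimal scheduling simulation of that rule; the function name `wavefront_order`, the round-robin scan, and the `lag` parameter are illustrative assumptions, not the patent's arbitration logic.

```python
# Minimal simulation of the two-macroblock wavefront constraint between
# vertically adjacent macroblock rows. Scheduling details are illustrative.
def wavefront_order(rows, cols, lag=2):
    """Return the order (row, mb) in which macroblocks become decodable
    when each row must stay at least `lag` macroblocks behind the row above."""
    progress = [0] * rows            # next macroblock index per row
    order = []
    while any(p < cols for p in progress):
        for r in range(rows):
            if progress[r] >= cols:
                continue             # this row is finished
            above_done = (r == 0) or (progress[r - 1] >= cols)
            # decode only if the row above is done or at least `lag` MBs ahead
            if above_done or progress[r - 1] - progress[r] >= lag:
                order.append((r, progress[r]))
                progress[r] += 1
    return order
```

For a 2-row, 4-column picture this yields the staircase pattern of fig. 6: row 1 starts only after row 0 has completed two macroblocks, and never catches up closer than the lag.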
In a specific implementation, the processing positions of the two full-function hardware decoders can be coordinated by a multi-core synchronization mechanism so that the decoding positions of the upper and lower macroblock rows stay at least two macroblocks apart. In this embodiment, dual-core synchronous decoding is preferably achieved by having a line buffer arbiter grant line buffer ready flags: since a full-function hardware decoder must first read the line buffer information at the start of every macroblock, withholding the grant holds a core back. The dual-core synchronization mechanism is further described in connection with figs. 6, 7, and 8.
Referring to fig. 7, the first and second full-function hardware decoders are both connected to the line buffer through the line buffer arbiter, and when a full-function hardware decoder needs to read line buffer information, it must first obtain the line buffer ready flag granted by the arbiter. The processing speeds of the two decoders can therefore be coordinated by presetting the arbitration rule of the line buffer arbiter.
At this point, the multi-core hardware decoding accelerator is configured to: at the start of each macroblock, first read the line buffer information through the assigned full-function hardware decoder, and coordinate the processing speeds of the decoders of the upper and lower macroblock rows so that, for any two adjacent rows being decoded, the processing position of the lower row always lags the processing position of the upper row by at least two macroblocks. In this solution, besides ensuring in the horizontal direction that the processing position of the first full-function hardware decoder (Core0 in fig. 6) stays at least two macroblocks ahead of the processing position of the second full-function hardware decoder (Core1 in fig. 6), it must also be ensured in the vertical direction that, when Core0 starts decoding a new 2(N+1)-th macroblock row (i.e. Core0 enters the next round of row processing), the processing position in the 2(N+1)-th row lags the processing position in the preceding (2N+1)-th row by at least two macroblocks. The (2N+1)-th row is at that moment being processed by Core1: because Core0 led Core1 by at least two macroblocks during the previous round, Core1 is still finishing its previous row when Core0 enters the next round, and the row being processed by Core1 (the (2N+1)-th row) is the row immediately above the row being processed by Core0 (the 2(N+1)-th row). In fig. 8, the dotted macroblocks represent macroblocks processed by Core1 (belonging to the (2N+1)-th row), the diagonally hatched macroblocks represent macroblocks of the new row that Core0 has started processing (the 2(N+1)-th row), and the macroblocks processed by Core0 are located behind those processed by Core1.
In this embodiment, considering that arbitration of the line buffer ready flag may introduce latency because of the multi-core synchronization mechanism, a set of Ping-Pong prefetch buffers may be provided for each full-function hardware decoder to store the line buffer data of the row above, eliminating this time overhead.
Referring to fig. 9, a single-core line cache prefetch module is provided in each full-function hardware decoder. The single-core line cache prefetch module is a ping-pong prefetch buffer consisting of a first buffer (buf0) and a second buffer (buf1), and the line buffer data of the macroblock row above is stored through this ping-pong prefetch buffer.
Specifically, the first and second buffers are switched alternately: while the first buffer serves the current command, the second buffer prefetches the data needed by the next command; when the current command finishes, execution switches to the second buffer for the next command. Alternating between the two buffers hides the prefetch latency and saves time in the hardware execution flow.
At this point, the single-core line cache prefetch module is configured to: starting from the leftmost macroblock of a macroblock row, first read the required data from the dual-core shared line buffer into the ping-pong prefetch buffer, filling both the first and second buffers on the first fill; after a macroblock in the decoder finishes decoding, update the corresponding first or second buffer and write the information recorded in that buffer back to the shared line buffer; and, after write-back completes, reclaim the buffer and prefetch the line buffer data needed by the next macroblock, repeating until the rightmost side of the picture.
Fig. 10 illustrates that, when starting from the leftmost side of a row, the required data is first read from the dual-core shared line buffer into the Ping-Pong prefetch buffer.
On the first fill, both the first buffer (buf0) and the second buffer (buf1) may be filled. Otherwise, a decoder-internal access is performed through one buffer, e.g. the first buffer, reading first and then writing. After a macroblock in the decoder finishes decoding, the corresponding buffer is updated and the information recorded in it is written back to the shared line buffer (update bufx to line buf; bufx in fig. 10 denotes the x-th buffer, where x can be 0 or 1, i.e. buf0 or buf1); after write-back, the buffer is reclaimed and the line buffer data needed by the next macroblock is prefetched (prefetch line buf to bufx). This repeats until the right side of the picture.
It should be noted that the steps update bufx to line buf and prefetch line buf to bufx in fig. 10 are consecutive operations on the same bufx and together form an atomic operation of the Ping-Pong prefetch buffer.
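The ping-pong alternation and the atomic write-back-then-prefetch step can be sketched as follows. This is a behavioral model under stated assumptions: `decode_row`, the dictionary-based `line_cache`, and the `"decoded"` flag are all hypothetical stand-ins for the real neighbour-dependency data held in the shared line buffer.

```python
# Sketch of the ping-pong prefetch cache: while one buffer serves the current
# macroblock's internal read-then-write access, the other already holds the
# next macroblock's line-cache data; after decode, the used buffer is written
# back and immediately refilled (write-back + prefetch as one atomic step).
def decode_row(line_cache, cols):
    """line_cache: dict mapping macroblock column -> dependency info (illustrative)."""
    bufs = [None, None]
    bufs[0] = dict(line_cache[0])                 # first fill: load both buffers
    if cols > 1:
        bufs[1] = dict(line_cache[1])
    trace = []
    for mb in range(cols):
        x = mb % 2                                # alternate buf0 / buf1 per macroblock
        bufs[x]["decoded"] = True                 # internal access: read, then write
        line_cache[mb] = bufs[x]                  # "update bufx to line buf"
        nxt = mb + 2                              # "prefetch line buf to bufx" (atomic
        bufs[x] = dict(line_cache[nxt]) if nxt < cols else None  # with the write-back)
        trace.append((mb, x))
    return trace
```

Note how each buffer is reused only every second macroblock, which is what hides the shared-line-buffer access latency behind the decode of the other macroblock.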
In another embodiment of the present invention, a video decoding method using the universal multi-core parallel decoder system of the foregoing embodiment is also provided. The method comprises the following steps:
Step 100: receiving video bitstream data.
Step 200: parsing the non-entropy-coded data of the upper layers of the video bitstream with the decoding firmware, and processing the decoding tasks of the macroblock layer in the video bitstream with the multi-core hardware decoding accelerator, wherein Slice-level data in the video bitstream serves as the interaction unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is performed through the Slice Queue.
The multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders. Entropy decoding of the video bitstream is performed by the preprocessor module; each full-function hardware decoder is responsible for decoding one macroblock row, including the inverse DCT transform, inverse quantization, intra/inter prediction, and pixel reconstruction steps; and any two vertically adjacent macroblock rows being decoded are kept at least two macroblocks apart to achieve multi-core synchronous decoding.
In this embodiment, the video may be AVC video or HEVC video.
Other technical features are described in the previous embodiments and are not described in detail here.
The technical solution provided by the present invention can be used for multi-core parallel decoding of AVC or HEVC video. By dividing the video decoder into software and hardware parts and combining a Slice Queue strategy, software/hardware parallel processing is achieved. Meanwhile, a preprocessor and multiple full-function hardware decoder cores, combined with a Line Queue strategy, achieve parallel processing of upper and lower macroblock rows. In addition, a line buffer data sharing and synchronization management scheme among the cores improves the accuracy and stability of parallel macroblock-row decoding.
The above description is not intended to limit the present disclosure to the aspects described. Rather, the components may be selectively and operatively combined in any number within the scope of the present disclosure. In addition, terms like "comprising," "including," and "having" should by default be construed as inclusive or open-ended rather than exclusive or closed-ended, unless expressly defined to the contrary. All technical, scientific, or other terms used herein have the meaning commonly understood by one of ordinary skill in the art to which this invention belongs. Common terms found in dictionaries should not be interpreted in an overly idealized or impractical way in the context of the relevant technical document unless this disclosure expressly so defines them. Any alterations and modifications made by those of ordinary skill in the art based on the above disclosure are intended to fall within the scope of the appended claims.

Claims (8)

1. A universal multi-core parallel decoder system, characterized by: the system comprises decoding firmware and a multi-core hardware decoding accelerator which are in communication connection;
the decoding firmware is used for analyzing the non-entropy coding data of the upper layer of the video code stream; the multi-core hardware decoding accelerator is used for processing decoding tasks of a macro block layer in the video code stream; the Slice level data in the video code stream is used as an interaction unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is carried out through Slice Queue;
the multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders, wherein the at least two homogeneous full-function hardware decoders simultaneously perform parallel decoding of at least two macroblock rows; the at least two homogeneous full-function hardware decoders share a group of line buffers to form multi-core sharing, and dependency information between macroblocks is stored in the line buffers; the preprocessor module is used for the entropy decoding task of the video bitstream; each full-function hardware decoder is responsible for decoding one macroblock row, including the inverse DCT transform, inverse quantization, intra/inter prediction and pixel reconstruction steps, while keeping the macroblocks being decoded in two vertically adjacent rows at least two macroblocks apart;
wherein the preprocessor module is configured to: execute only the entropy decoding task of the video bitstream, record the necessary information at the head of each macroblock Line, and push the necessary information into the Line Queue of each full-function hardware decoder in the form of a Command Queue;
when Slice parallel processing is performed through the Slice Queue, the decoding firmware is configured to: after completing the upper-layer parsing of the video bitstream, pack the Slice upper-layer parameter information and push it into the Slice Queue; the multi-core hardware decoding accelerator is configured to: poll the ready state of the Slice Queue data; after reading the queue entry and completing configuration, parse the macroblocks in the current Slice until parsing is complete, and send an interrupt signal on completion to release the Slice Queue.
2. The universal multi-core parallel decoder system of claim 1, wherein: the multi-core hardware decoding accelerator comprises a first full-function hardware decoder and a second full-function hardware decoder, the first full-function hardware decoder being used to decode the 2N-th macroblock row and the second full-function hardware decoder being used to decode the (2N+1)-th macroblock row, where N is an integer greater than 0; the two full-function hardware decoders share a group of line buffers to form dual-core sharing, and dependency information between macroblocks is stored in the line buffers;
the full-function hardware decoder is configured to: check whether there is an instruction in its own Line Queue and, when there is, start decoding a macroblock row according to the content of the instruction; and, during decoding, monitor the working position of the full-function hardware decoder handling the macroblock row above, keeping its own processing position at least two macroblocks behind the processing position of that upper row.
3. The universal multi-core parallel decoder system of claim 2, wherein: a line buffer arbiter grants the line buffer ready flag to achieve dual-core synchronous decoding;
the multi-core hardware decoding accelerator is configured to: at the start of each macroblock, first read the line buffer information through the assigned full-function hardware decoder, and coordinate the processing speeds of the full-function hardware decoders of the upper and lower macroblock rows so that, for any two adjacent macroblock rows being decoded, the processing position of the lower row always lags the processing position of the adjacent upper row by at least two macroblocks; in the vertical direction, when the first full-function hardware decoder starts decoding a new 2(N+1)-th macroblock row, it is guaranteed that the processing position in the 2(N+1)-th macroblock row lags the processing position in the preceding (2N+1)-th macroblock row by at least two macroblocks.
4. The universal multi-core parallel decoder system according to claim 3, wherein: a single-core line cache prefetch module is provided for each full-function hardware decoder, the single-core line cache prefetch module comprising a ping-pong prefetch buffer consisting of a first buffer and a second buffer, the line buffer data of the macroblock row above being stored through the ping-pong prefetch buffer;
the single-core line cache prefetch module is configured to: starting from the leftmost macroblock of a macroblock row, first read the required data from the dual-core shared line buffer into the ping-pong prefetch buffer, filling both the first and second buffers on the first fill; after a macroblock in the decoder finishes decoding, update the corresponding first or second buffer and write the information recorded in that buffer back to the shared line buffer; and, after write-back completes, reclaim the buffer and prefetch the line buffer data needed by the next macroblock, repeating until the rightmost side of the picture.
5. The universal multi-core parallel decoder system of any of claims 1 through 4, wherein: the non-entropy-coded data comprises one or more of the video parameter set VPS, sequence parameter set SPS, picture parameter set PPS, and Slice header information.
6. The universal multi-core parallel decoder system of any of claims 1 through 4, wherein: the decoding tasks of the macroblock layer comprise the fully hardware steps of bitstream reading, entropy decoding, inverse DCT transform, inverse quantization, intra/inter prediction, motion compensation, pixel reconstruction, and deblocking filtering.
7. A method of video decoding in accordance with the system of claim 1, comprising the steps of:
receiving video code stream data;
parsing the non-entropy-coded data of the upper layers of the video bitstream with decoding firmware, and processing the decoding tasks of the macroblock layer in the video bitstream with a multi-core hardware decoding accelerator, wherein Slice-level data in the video bitstream serves as the interaction unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is performed through the Slice Queue;
wherein,
the multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders; entropy decoding of the video bitstream is performed by the preprocessor module; each full-function hardware decoder is responsible for decoding one macroblock row, including the inverse DCT transform, inverse quantization, intra/inter prediction and pixel reconstruction steps; and the macroblocks being decoded in two vertically adjacent rows are kept at least two macroblocks apart.
8. The video decoding method of claim 7, wherein: the video is an AVC video or an HEVC video.
CN202011154537.5A 2020-10-26 2020-10-26 Universal multi-core parallel decoder system and application thereof Active CN112422983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011154537.5A CN112422983B (en) 2020-10-26 2020-10-26 Universal multi-core parallel decoder system and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011154537.5A CN112422983B (en) 2020-10-26 2020-10-26 Universal multi-core parallel decoder system and application thereof

Publications (2)

Publication Number Publication Date
CN112422983A CN112422983A (en) 2021-02-26
CN112422983B true CN112422983B (en) 2023-05-23

Family

ID=74840251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011154537.5A Active CN112422983B (en) 2020-10-26 2020-10-26 Universal multi-core parallel decoder system and application thereof

Country Status (1)

Country Link
CN (1) CN112422983B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113660496B (en) * 2021-07-12 2024-06-07 珠海全志科技股份有限公司 Video stream decoding method and device based on multi-core parallelism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011166273A (en) * 2010-02-05 2011-08-25 Mitsubishi Electric Corp Moving image encoder and moving image encoding method
CN105992008A (en) * 2016-03-30 2016-10-05 南京邮电大学 Multilevel multitask parallel decoding algorithm on multicore processor platform

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070217702A1 (en) * 2006-03-14 2007-09-20 Sung Chih-Ta S Method and apparatus for decoding digital video stream
JP4789200B2 (en) * 2006-08-07 2011-10-12 ルネサスエレクトロニクス株式会社 Functional module for executing either video encoding or video decoding and semiconductor integrated circuit including the same
CN101179720B (en) * 2007-11-16 2010-09-01 海信集团有限公司 Video decoding method
CN101466037A (en) * 2007-12-21 2009-06-24 北京中电华大电子设计有限责任公司 Method for implementing video decoder combining software and hardware
CN107277505B (en) * 2017-05-19 2020-06-16 北京大学 AVS-2 video decoder device based on software and hardware partition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011166273A (en) * 2010-02-05 2011-08-25 Mitsubishi Electric Corp Moving image encoder and moving image encoding method
CN105992008A (en) * 2016-03-30 2016-10-05 南京邮电大学 Multilevel multitask parallel decoding algorithm on multicore processor platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
parallel video decoding in the emerging hevc standard;mauricio alvarez-mesa;《2012 IEEE international conference on acoustics speech and signal processing》;全文 *

Also Published As

Publication number Publication date
CN112422983A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
USRE49727E1 (en) System and method for decoding using parallel processing
US9210421B2 (en) Memory management for video decoding
CA2760425C (en) Method and system for parallel encoding of a video
US8213518B1 (en) Multi-threaded streaming data decoding
KR102144881B1 (en) Transmitting apparatus and method thereof for video processing
US20080152014A1 (en) Method and apparatus for encoding and decoding of video streams
CN104025594A (en) Tile size in video coding
US9948941B2 (en) Circuit, method and video decoder for video decoding
US9020286B2 (en) Apparatus for dividing image data and encoding and decoding image data in parallel, and operating method of the same
KR20170077295A (en) Coding concept allowing parallel processing, transport demultiplexer and video bitstream
JP2013102298A (en) Image encoder, image encoding method and program, image decoder, and image decoding method and program
Chi et al. Improving the parallelization efficiency of HEVC decoding
WO2016210177A1 (en) Parallel intra-prediction
US8090028B2 (en) Video deblocking memory utilization
Zhang et al. Implementation and improvement of wavefront parallel processing for HEVC encoding on many-core platform
US7680351B2 (en) Video deblocking method and apparatus
CN112422983B (en) Universal multi-core parallel decoder system and application thereof
CN112422984B (en) Code stream preprocessing device, system and method of multi-core decoding system
CN102075753B (en) Method for deblocking filtration in video coding and decoding
Baaklini et al. H. 264 macroblock line level parallel video decoding on embedded multicore processors
CN113660496A (en) Multi-core parallel-based video stream decoding method and device
KR102171119B1 (en) Enhanced data processing apparatus using multiple-block based pipeline and operation method thereof
KR100821922B1 (en) Local memory controller for mpeg decoder
CN113115043A (en) Video encoder, video encoding system and video encoding method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant