CN112422983A

CN112422983A - Universal multi-core parallel decoder system and application thereof

Info

Publication number: CN112422983A
Application number: CN202011154537.5A
Authority: CN
Inventors: 雷理; 张云; 韦虎; 贾大中
Original assignee: Mouxin Technology Shanghai Co ltd
Current assignee: Mouxin Technology Shanghai Co ltd
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2021-02-26
Anticipated expiration: 2040-10-26
Also published as: CN112422983B

Abstract

The invention discloses a general multi-core parallel decoder system and an application thereof, relating to the technical field of video decoding. The system comprises decoding firmware and a multi-core hardware decoding accelerator that are communicatively connected, with Slice-level data in the video code stream serving as the unit of interaction between them. The decoding firmware parses the non-entropy-coded data in the upper layer of the video code stream, and the multi-core hardware decoding accelerator handles the decoding tasks of the macroblock layer. The multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders; the preprocessor module performs the entropy decoding task of the video code stream, and each full-function hardware decoder is responsible for decoding one macroblock row, keeping the macroblocks being decoded in two vertically adjacent rows at least two macroblocks apart. The parallelism of the invention is not limited by the number of Slice/Tile partitions; the system achieves universal decoding, good scalability, and high parallel processing efficiency.

Description

Universal multi-core parallel decoder system and application thereof

Technical Field

The invention relates to the technical field of video decoding, in particular to a general multi-core parallel decoder system and application thereof.

Background

The Action Plan for the Development of the Ultra-High-Definition Video Industry (2019-2022), published in 2019, explicitly proposes vigorously promoting the ultra-high-definition video industry and its applications in related fields along the general technical route of "4K first, with 8K in parallel". Accelerating the development of the ultra-high-definition video industry is of great significance for driving the intelligent transformation of industries that rely on video as an important carrier, such as 5G and security, and for improving the overall strength of China's information and cultural industries. Ultra-high-definition video technology naturally places higher demands on video encoding and decoding. For high-definition video, the current mainstream video compression standards include Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC). Software decoding incurs large overheads in CPU load and power consumption, so the industry generally uses a dedicated hardware accelerator as the Video Decoder (VDEC) to perform video decoding. Taking the commonly used single-core hardware decoder as an example, it mostly adopts a pipelined design with macroblocks as the pipeline unit (denoted MB, Macro Block, in AVC, and CTB, Coding Tree Block, in HEVC). Taking HEVC as an example, the main pipeline stages are shown in fig. 1, which includes four stages: stage 1, Entropy Dec: entropy decoding (CABAC/CAVLC); stage 2, IQT, IDCT: inverse quantization and inverse DCT transform; stage 3, IPred, ReC: intra/inter prediction and image reconstruction; stage 4, Dblock, SAO: deblocking filtering and sample adaptive offset.

For ultra-high-definition video, the decoder needs to reach a performance of 8K@30fps or 4K@120fps, which a single-core hardware decoder can hardly deliver at a clock frequency of 600 MHz. The prior art therefore proposes multi-core video decoder designs to achieve efficient parallel decoding, where the good scalability of a multi-core design can meet higher performance requirements. Current parallel decoding schemes are generally built around the frame, the Slice (i.e., strip), the Tile, or the MB/CTB row (i.e., macroblock row), and the mainstream schemes are as follows.

First, Frame parallel decoding. AVC/HEVC frames are divided into I, P, and B frames. B frames are often used as non-reference frames, and because no other frame data depends on such B frames, they can be processed in parallel.

Second, Slice parallel decoding. Each AVC/HEVC picture can be divided into one or more Slices, and each Slice consists of an integer number of consecutively arranged macroblocks. The purpose of the Slice is to improve robustness against transmission errors: when a transmission error occurs, Slices without errors are not affected.

Third, Tile parallel decoding. To improve parallelism, HEVC introduces Tile partitioning, in which a frame is divided into M x N rectangular regions that can be computed in parallel (where M and N are integers, M ≥ 1, and N ≥ 1).

Fourth, WPP parallel decoding. HEVC provides Wavefront Parallel Processing (WPP), a dedicated scheme for improving parallelism. During encoding, the entropy coding state is initialized directly at the head of each CTB row without waiting for the entropy coding of the previous row to finish completely, which breaks the inter-row serial dependency of entropy-coded macroblocks and enables parallel computation of macroblock rows.

However, the above decoding schemes have the following disadvantages. For the first scheme, parallelism is low because the number of B frames between P frames is usually small; moreover, in some standards B frames may be used as reference frames, and such B frames cannot be decoded with Frame parallel decoding. For the second scheme, although the Slices of a picture are usually independent, deblocking filtering can cross Slice boundaries, and in that case Slice-based parallel decoding fails; in addition, the parallelism of Slice parallel decoding is directly determined by the number of Slices in a frame, so it depends too heavily on the Slice partitioning. For the third scheme, as with Slice parallel decoding, deblocking filtering in some standards still has an option to cross Tile boundaries, and in that case Tile-based parallel decoding also fails; its parallelism is likewise directly determined by the number of Tiles in a frame, so it depends too heavily on the Tile partitioning. For the fourth scheme, the per-row initialization increases the video bit rate, which limits the application of WPP parallel decoding to some extent.

On the other hand, in AVC/HEVC, when single-threaded decoding is used, the macroblocks (MB in AVC, CTB in HEVC) within a frame are generally processed from left to right and top to bottom. During entropy decoding, intra prediction, inter prediction, deblocking filtering, and so on, a macroblock usually depends on information from its left and upper neighboring macroblocks. To process macroblocks within a frame in parallel, macroblocks without data dependencies must be found. The prior art provides a 2D-Wave algorithm, whose idea is similar to the HEVC wavefront. Specifically, according to the applicable standard, the 2D-Wave algorithm locates the nearest macroblock that is independent of a given macroblock at the relative position "2 to the right, 1 up". The dependency relationship of a typical macroblock is shown in fig. 2, where the hatched macroblock (x, y) and the hatched macroblock (x+2, y-1) are mutually independent, and the macroblocks that macroblock (x, y) depends on are the dotted macroblock (x, y-1) and the dotted macroblock (x+1, y-1). Therefore, as long as a synchronization constraint between threads ensures that the thread decoding the upper row stays at least two macroblocks ahead of the thread decoding the lower row, the rows can be decoded simultaneously. Referring to fig. 3, decoded macroblocks 10 have been completed, macroblocks 20 currently being decoded are shown with a lattice pattern, and undecoded macroblocks 30 are shown blank. During video decoding, each thread is responsible for decoding one macroblock row, and the upper and lower rows advance to the right in a staggered, wave-like fashion, so that macroblocks are processed in parallel.
The 2D-Wave algorithm has good scalability: its parallelism is not limited by the number of Slice/Tile partitions in a frame, and its maximum parallelism (namely the number of macroblocks that can be processed simultaneously) is related to the width and height of the frame. In general, the maximum parallelism is given by the formula $\mathrm{ParMB}_{\max,2D} = \min(\mathit{mb\_width}/2,\ \mathit{mb\_height})$, where $\mathrm{ParMB}_{\max,2D}$ denotes the maximum parallelism, mb_width denotes the number of macroblocks across the frame width, and mb_height denotes the number of macroblocks across the frame height. By way of example, an image frame of 1920 × 1080 (width × height) pixels contains 120 × 68 (width × height) MBs, and the above formula gives $\mathrm{ParMB}_{\max,2D} = \min(120/2,\ 68) = 60$. In 4K/8K ultra-high-definition application scenarios, the 2D-Wave algorithm can greatly improve parallelism and accelerate high-definition video decoding.
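The formula and the 1920 × 1080 example can be checked with a short sketch. It assumes 16 × 16 macroblocks (as in AVC), rounds partial macroblocks and odd widths up (the text does not specify rounding), and models every macroblock as taking one time unit; the function names are illustrative only.

```python
import math
from collections import Counter

MB_SIZE = 16  # assumption: 16x16 macroblocks, as in AVC


def mb_dimensions(width_px, height_px, mb_size=MB_SIZE):
    """Return (mb_width, mb_height), counting partial macroblocks as whole ones."""
    return math.ceil(width_px / mb_size), math.ceil(height_px / mb_size)


def max_parallelism_2d(mb_width, mb_height):
    """ParMB_max,2D = min(mb_width / 2, mb_height).

    Rounding mb_width / 2 up for odd widths is an assumption.
    """
    return min(math.ceil(mb_width / 2), mb_height)


def peak_wavefront(mb_width, mb_height):
    """Brute-force check: schedule macroblock (x, y) at time x + 2*y, which
    keeps each row two macroblocks behind the row above, and return the
    largest number of macroblocks sharing one time slot."""
    slots = Counter(x + 2 * y for y in range(mb_height) for x in range(mb_width))
    return max(slots.values())


if __name__ == "__main__":
    w, h = mb_dimensions(1920, 1080)
    print(w, h, max_parallelism_2d(w, h), peak_wavefront(w, h))
```

The brute-force count agrees with the closed-form value for the 120 × 68 case, which is a useful sanity check on the "right 2, top 1" stagger.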

However, the 2D-Wave algorithm has the following drawbacks. On one hand, because entropy decoding is completely serial (the entropy decoding of each macroblock in a Slice depends on the previous macroblock), the first macroblock of the next row can obtain its context information only after the last macroblock of the previous row has been decoded. On the other hand, the code stream data within a Slice is completely serial, so the code stream data corresponding to the first macroblock of the next row becomes available only after the last macroblock of the previous row has been decoded. Both effects reduce parallel processing efficiency to a certain extent.

In summary, conventional parallel decoding schemes cannot simultaneously deliver generality, scalability, parallelism, and parallel processing efficiency, and can often achieve multi-core parallelism only for code streams encoded in a particular way within a frame. How to provide a general multi-core parallel decoding scheme that scales well, applies universally, and whose parallelism is not limited by the number of Slice/Tile partitions in a frame is a technical problem that currently needs to be solved.

Disclosure of Invention

The object of the invention is to overcome the defects of the prior art and provide a general multi-core parallel decoder system and an application thereof. According to the data characteristics of the code stream, the Video Decoder (VDEC) is divided into decoding firmware (VDEC_FW, short for Video FirmWare) and a multi-core hardware decoding accelerator (VDEC_MCORE, short for VDEC Multi-Core), which perform non-entropy-coded data parsing and macroblock-layer decoding respectively; running software and hardware in parallel improves parallel processing efficiency. In addition, the multi-core hardware decoding accelerator adopts a multi-core design consisting of a preprocessor and full-function hardware decoders, with a multi-core synchronization mechanism that keeps the upper and lower macroblock rows processed in parallel and in step. The invention can process in parallel the macroblock rows within a frame of a code stream encoded in any mode, and its parallelism is not limited by the number of Slice/Tile partitions in a frame, thereby achieving universal decoding, good scalability, and high parallel processing efficiency.

In order to achieve the above object, the present invention provides the following technical solutions:

a general multi-core parallel decoder system comprising communicatively coupled decoding firmware and a multi-core hardware decoding accelerator;

the decoding firmware is used for analyzing non-entropy coding data on the upper layer of the video code stream; the multi-core hardware decoding accelerator is used for processing decoding tasks of a macro block layer in a video code stream; slice level data in the video code stream is used as an interactive unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is carried out through Slice Queue;

the multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders; the preprocessor module is used for the entropy decoding task of the video code stream; each full-function hardware decoder is responsible for decoding one macroblock row, including the steps of inverse DCT transformation, inverse quantization, intra-frame and inter-frame prediction, and pixel reconstruction, and keeps the macroblocks being decoded in two vertically adjacent rows at least two macroblocks apart.

Further, when Slice parallel processing is performed through the Slice Queue, the decoding firmware is configured to: after parsing the upper layer of the video code stream, pack the Slice upper-layer parameter information and push it into the Slice Queue;

the multi-core hardware decoding accelerator is configured to: query the ready information of the Slice Queue data; after reading the queue and completing configuration, parse the macroblocks in the current Slice until all macroblocks in the Slice have been parsed; send an interrupt signal once parsing is finished; and release the Slice Queue.

Further, the preprocessor module is configured to:

execute only the entropy decoding task of the video code stream, record the necessary information at the head of each macroblock row, and push that information into the Line Queue of each full-function hardware decoder in the form of a Command Queue.

Further, the multi-core hardware decoding accelerator comprises a first full-function hardware decoder and a second full-function hardware decoder, wherein the first full-function hardware decoder handles the decoding of the 2N-th macroblock row, the second full-function hardware decoder handles the decoding of the (2N+1)-th macroblock row, and N is an integer greater than 0; the two full-function hardware decoders share one set of line caches, forming a dual-core shared structure, and the dependency information between macroblocks is stored in the line caches;

the full-function hardware decoder is configured to: check whether its Line Queue contains an instruction, and when it does, start decoding a macroblock row according to the content of the instruction; and, during decoding, monitor the working position of the full-function hardware decoder handling the previous macroblock row, keeping its own processing position at least two macroblocks behind the processing position of the previous macroblock row.

Further, a line cache arbiter authorizes a line cache ready flag to realize dual-core synchronous decoding;

the multi-core hardware decoding accelerator is configured to: when each macroblock is started, have the assigned full-function hardware decoder read the line cache information, and coordinate the processing speeds of the full-function hardware decoders of the upper and lower macroblock rows, so that for any two vertically adjacent macroblock rows being decoded, the processing position of the lower row always lags the processing position of the adjacent upper row by at least two macroblocks; in the vertical direction, when the first full-function hardware decoder starts decoding a new 2(N+1)-th macroblock row, it is guaranteed that the processing position of the 2(N+1)-th macroblock row lags the processing position of the preceding (2N+1)-th macroblock row by at least two macroblocks.
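The synchronization rule above can be illustrated with a small lock-step simulation. This is a sketch under stated assumptions rather than the hardware design: every macroblock takes one time step, rows are assigned to cores by row index modulo the number of cores, and the name `simulate_wavefront` and its parameters are hypothetical.

```python
def simulate_wavefront(mb_width, mb_height, num_cores=2, lag=2):
    """Simulate multi-core row decoding with a minimum lag between rows.

    Returns a list of time steps; each step lists (core, row, x) tuples for
    the macroblocks decoded in that step.
    """
    done = [0] * mb_height                 # macroblocks completed per row
    current = list(range(num_cores))       # row currently owned by each core
    trace = []
    while any(row < mb_height for row in current):
        snapshot = list(done)              # progress visible at the start of the step
        step = []
        for core, row in enumerate(current):
            if row >= mb_height:
                continue                   # this core has no rows left
            x = snapshot[row]
            # The upper row must be at least `lag` macroblocks ahead,
            # or already finished when near the right edge.
            upper_ok = row == 0 or snapshot[row - 1] >= min(x + lag, mb_width)
            if upper_ok:
                done[row] += 1
                step.append((core, row, x))
                if done[row] == mb_width:
                    current[core] = row + num_cores   # move on to the next owned row
        trace.append(step)
    return trace


if __name__ == "__main__":
    for t, step in enumerate(simulate_wavefront(6, 4)):
        print(t, step)
```

In this model the second core idles for the first two steps, and when the first core wraps around to a new even row it again waits on the odd row above it, which mirrors the vertical-direction condition in the text.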

Furthermore, a single-core line cache prefetch module is provided for each full-function hardware decoder; it comprises a ping-pong prefetch cache consisting of a first cache and a second cache, and the line cache data of the previous macroblock row is stored in the ping-pong prefetch cache;

the single-core line cache prefetch module is configured to: starting from the leftmost macroblock of a macroblock row, read the required data from the dual-core shared line cache into the ping-pong prefetch cache, filling both the first cache and the second cache completely on the initial fill; after a macroblock in the decoder has been decoded, update the corresponding first or second cache and write the information recorded in that cache back to the shared line cache; and, once the write-back is complete, reclaim that cache and prefetch the line cache data required by the next macroblock, repeating this operation up to the rightmost side of the image.
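The ping-pong sequence described above (initial fill, update, write-back, reclaim and prefetch) can be sketched in software as follows. The class name, the one-entry-per-macroblock layout of the shared line cache, and the `decode_mb` callback are assumptions made for illustration; the actual cache contents and granularity are not specified here.

```python
class PingPongPrefetcher:
    """Two-slot prefetch buffer in front of a shared line cache (illustrative)."""

    def __init__(self, shared_line_cache):
        self.shared = shared_line_cache    # one entry per macroblock column
        self.buffers = [None, None]        # first cache, second cache

    def decode_row(self, decode_mb):
        """Decode one macroblock row.

        decode_mb(x, upper_info) returns the info to store for column x.
        """
        width = len(self.shared)
        # Initial fill: load both buffers starting from the leftmost macroblock.
        for x in range(min(2, width)):
            self.buffers[x % 2] = self.shared[x]
        for x in range(width):
            slot = x % 2
            upper_info = self.buffers[slot]
            # Update the buffer with the info produced by this macroblock.
            self.buffers[slot] = decode_mb(x, upper_info)
            # Write the recorded info back to the shared line cache.
            self.shared[x] = self.buffers[slot]
            # Reclaim the buffer and prefetch data for a later macroblock.
            nxt = x + 2
            if nxt < width:
                self.buffers[slot] = self.shared[nxt]


if __name__ == "__main__":
    cache = [10, 20, 30, 40, 50]
    PingPongPrefetcher(cache).decode_row(lambda x, upper: upper + 1)
    print(cache)
```

Because a slot is refilled only after its write-back, the prefetched entry for column x + 2 still holds the previous row's data when it is read, which is the property the alternating buffers are meant to preserve.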

Further, the non-entropy-coded data comprises one or more of a video parameter set VPS, a sequence parameter set SPS, a picture parameter set PPS, and Slice header information.

Further, the decoding task of the macroblock layer comprises the steps of code stream reading, entropy decoding, inverse DCT (discrete cosine transform), inverse quantization, intra-frame and inter-frame prediction, motion compensation, pixel reconstruction, and deblocking filtering, all performed in hardware.

The invention also provides a video decoding method, which comprises the following steps:

receiving video code stream data;

analyzing non-entropy coding data on the upper layer of the video code stream through a decoding firmware, processing a decoding task of a macro block layer in the video code stream through a multi-core hardware decoding accelerator, and performing Slice parallel processing through Slice Queue by taking Slice level data in the video code stream as an interactive unit between the decoding firmware and the multi-core hardware decoding accelerator;

wherein,

the multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders; the preprocessor module performs entropy decoding of the video code stream; each full-function hardware decoder decodes one macroblock row, the decoding including the steps of inverse DCT (discrete cosine transform), inverse quantization, intra-frame and inter-frame prediction, and pixel reconstruction; and the macroblocks being decoded in two vertically adjacent rows are kept at least two macroblocks apart so as to realize multi-core synchronous decoding.

Further, the video is an AVC video or an HEVC video.

Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects as examples:

1) Based on the existing mainstream video compression standards AVC and HEVC, the video decoder (VDEC) is divided into two parts, decoding firmware and a multi-core hardware decoding accelerator, according to the characteristics of the code stream data. Firmware is software embedded in a hardware device, usually a program written into EPROM (erasable programmable read-only memory) or EEPROM (electrically erasable programmable read-only memory). The decoding firmware, as the software part, parses the non-entropy-coded data in the upper layer of the video code stream, and the multi-core hardware decoding accelerator, as the hardware part, handles all decoding work of the macroblock layer. The Slice (i.e. strip) level of the video stream serves as the unit of interaction between VDEC_FW and VDEC_MCORE, data is exchanged through the Slice Queue (i.e. strip queue) inside the VDEC, and running software and hardware in parallel improves parallel processing efficiency.

2) VDEC_MCORE adopts a multi-core design consisting of a preprocessor and full-function hardware decoders and uses a Line Queue (i.e. row queue) strategy, so that upper and lower macroblock rows can be processed in parallel, with a multi-core synchronization mechanism keeping that parallel processing in step.

3) By utilizing a data sharing and synchronous management mechanism of Line buffers among multiple cores, the accuracy and stability of parallel decoding of macro block lines are improved.

The technical scheme provided by the invention can carry out parallel processing on the macro block line in one frame of the code stream in any coding mode, and the parallelism degree of the macro block line is not limited by the dividing quantity of Slice/Tile in one frame, thereby realizing universal decoding and having the advantages of good expandability and high parallel processing efficiency.

Drawings

FIG. 1 is a schematic diagram of a pipeline design of a prior art single core hardware decoder.

Fig. 2 is a schematic diagram illustrating dependency relationships of macro blocks in the prior art.

Fig. 3 is a schematic diagram illustrating the operation of parallel decoding processing performed by multiple threads in the prior art.

Fig. 4 is a schematic diagram of interaction of module structures of the system according to the embodiment of the present invention.

Fig. 5 is a schematic diagram of a dual-core structure of a multi-core hardware decoding accelerator according to an embodiment of the present invention.

Fig. 6 is a schematic diagram illustrating an operation of dual-core parallel decoding according to an embodiment of the present invention.

Fig. 7 is a schematic diagram of information transmission for setting dual core synchronization of a line cache arbiter according to an embodiment of the present invention.

Fig. 8 is a schematic diagram of a dual-core coordinated macroblock processing location according to an embodiment of the present invention.

Fig. 9 is a schematic structural diagram of a hardware decoder configured with a ping-pong prefetch buffer according to an embodiment of the present invention.

Fig. 10 is a flowchart illustrating a ping-pong pre-fetching process according to an embodiment of the invention.

Description of reference numerals:

decoded macroblock 10, macroblock being decoded 20, undecoded macroblock 30.

Detailed Description

The general multi-core parallel decoder system and the application thereof disclosed by the invention are further explained in detail in the following by combining the figures and the specific embodiments. It should be noted that technical features or combinations of technical features described in the following embodiments should not be considered as being isolated, and they may be combined with each other to achieve better technical effects. In the drawings of the embodiments described below, the same reference numerals appearing in the respective drawings denote the same features or components, and may be applied to different embodiments. Thus, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.

It should be noted that the structures, proportions, sizes, and other dimensions shown in the drawings and described in the specification are only for the purpose of understanding and reading the present disclosure, and are not intended to limit the scope of the invention, which is defined by the claims, and any modifications of the structures, changes in the proportions and adjustments of the sizes and other dimensions, should be construed as falling within the scope of the invention unless the function and objectives of the invention are affected. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that described or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

Examples

Referring to fig. 4, the present invention provides a general multi-core parallel decoder system. The system includes communicatively coupled decoding firmware (VDEC_FW in fig. 4, short for Video FirmWare) and a multi-core hardware decoding accelerator (VDEC_MCORE in fig. 4, short for VDEC Multi-Core).

The decoding firmware is used for parsing the non-entropy-coded data in the upper layer of the video code stream.

The multi-core hardware decoding accelerator is used for processing decoding tasks of a macro block layer in a video code stream. In this embodiment, the multi-core hardware decoding accelerator includes a preprocessor module and at least two homogeneous full-function hardware decoders.

The preprocessor module is used for entropy decoding tasks of video code streams.

Each full-function hardware decoder is responsible for decoding a row of macroblock rows, including steps of inverse DCT (discrete cosine transform), inverse quantization, intra-frame inter-frame prediction and pixel reconstruction, and enabling the decoded macroblock in two adjacent upper and lower rows to be separated by at least two macroblocks so as to realize multi-core synchronous decoding.

Slice level data in the video code stream is used as an interactive unit between the decoding firmware and the multi-core hardware decoding accelerator, and Slice parallel processing is carried out through Slice Queue.

The code streams of AVC and HEVC video adopt a layered structure in which most of the syntax shared at the GOP layer and the Slice layer is extracted to form the Video Parameter Set (VPS), the Sequence Parameter Set (SPS), and the Picture Parameter Set (PPS). Because this data accounts for a small proportion of the stream and is simple to parse, it is well suited to software parsing. According to these characteristics of the code stream data, the decoder system of this embodiment divides the video decoder VDEC into two parts, the decoding firmware VDEC_FW and the multi-core hardware decoding accelerator VDEC_MCORE: the decoding firmware, as the software part, parses the non-entropy-coded data in the upper layer of the video code stream, and the multi-core hardware decoding accelerator, as the hardware part, handles all decoding work of the macroblock layer.

Preferably, the non-entropy-coded data in the upper layer of the code stream, such as the video parameter set VPS, the sequence parameter set SPS, the picture parameter set PPS, and the Slice header information (i.e., Slice header), is parsed by the decoding firmware VDEC_FW. The multi-core hardware decoding accelerator handles all decoding work of the macroblock layer in the video code stream, which may include code stream reading, entropy decoding, inverse DCT (discrete cosine transform), inverse quantization, intra-frame and inter-frame prediction, pixel reconstruction, deblocking filtering, and so on, all performed in hardware.
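As an illustration of this software/hardware split for HEVC, the sketch below routes NAL units by type. The NAL unit type values (VPS = 32, SPS = 33, PPS = 34, types below 32 for coded slice data) and the header bit layout come from the HEVC specification; the routing function and its return labels are hypothetical and only mirror the division of work described above.

```python
VPS_NUT, SPS_NUT, PPS_NUT = 32, 33, 34   # HEVC parameter-set NAL unit types


def hevc_nal_unit_type(first_header_byte):
    """Extract nal_unit_type from the first byte of the 2-byte HEVC NAL header.

    Layout of that byte: forbidden_zero_bit (1 bit), nal_unit_type (6 bits),
    then the top bit of nuh_layer_id.
    """
    return (first_header_byte >> 1) & 0x3F


def route_nal_unit(nal_type):
    """Return the (hypothetical) processing stages that handle a NAL unit."""
    if nal_type in (VPS_NUT, SPS_NUT, PPS_NUT):
        return ("firmware",)               # parameter sets: parsed in software
    if nal_type < 32:
        # Slice data: the firmware parses the Slice header, and the
        # accelerator decodes the macroblock (CTB) layer.
        return ("firmware", "accelerator")
    return ()                              # other NAL units: not covered here


if __name__ == "__main__":
    for byte in (0x40, 0x42, 0x44, 0x26):
        t = hevc_nal_unit_type(byte)
        print(hex(byte), t, route_nal_unit(t))
```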

With continued reference to fig. 4, in the present embodiment, the software and hardware use the Slice level in the video code stream as the unit of interaction and exchange data through the Slice Queue (i.e., strip queue) inside the video decoder. The interaction flow of the decoding firmware VDEC_FW and the multi-core hardware decoding accelerator VDEC_MCORE may be as follows:

1) After the decoding firmware VDEC_FW finishes the upper-layer parsing task of the code stream, it packs the Slice upper-layer parameter information and pushes it into the Slice Queue, where it waits in line. The downward arrow in fig. 4 represents the push operation on the Slice Queue.

At this time, the decoding firmware is configured to: after parsing the upper layer of the video code stream, pack the Slice upper-layer parameter information and push it into the Slice Queue.

2) The multi-core hardware decoding accelerator VDEC_MCORE queries the ready information (ready state information) of the Slice Queue data, reads the queue information, and completes configuration; the hardware then parses the macroblocks in the current Slice until the Slice ends, sends an interrupt signal at the end, and releases the Slice Queue entry, i.e., pops the corresponding information from the Slice Queue. The upward arrow in fig. 4 represents the release operation on the Slice Queue.

At this point, the multi-core hardware decoding accelerator is configured to: query the ready information of the Slice Queue data; after reading the queue and completing configuration, parse the macroblocks in the current Slice until all macroblocks in the Slice have been parsed; send an interrupt signal once parsing is finished; and release the Slice Queue.

Therefore, Slice parallel processing is realized by combining software and hardware division with Slice queue, software processing time can be obviously saved by software and hardware parallel processing, and then parallel processing efficiency is improved.
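The push/query/pop handshake in steps 1) and 2) can be sketched as follows. This is a single-threaded illustration only (in the described system the firmware and the accelerator run concurrently), and the `SliceParams` fields, class names, and interrupt callback are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SliceParams:
    """Packed Slice upper-layer parameters (fields are illustrative)."""
    slice_id: int
    first_mb: int
    mb_count: int


class SliceQueue:
    def __init__(self):
        self._entries = deque()

    def push(self, params):        # firmware side
        self._entries.append(params)

    def ready(self):               # accelerator polls this
        return bool(self._entries)

    def front(self):
        return self._entries[0]

    def pop(self):                 # release after the Slice is done
        return self._entries.popleft()


def firmware_submit(queue, parsed_slices):
    """Step 1: push packed parameters for each parsed Slice."""
    for params in parsed_slices:
        queue.push(params)


def accelerator_run(queue, on_interrupt):
    """Step 2: query ready, configure, decode the Slice, interrupt, release."""
    decoded = []
    while queue.ready():
        params = queue.front()                         # read queue and configure
        end = params.first_mb + params.mb_count
        decoded.extend(range(params.first_mb, end))    # stand-in for macroblock decoding
        on_interrupt(params.slice_id)                  # signal completion
        queue.pop()                                    # release the entry
    return decoded


if __name__ == "__main__":
    q = SliceQueue()
    firmware_submit(q, [SliceParams(0, 0, 3), SliceParams(1, 3, 2)])
    print(accelerator_run(q, lambda sid: print("interrupt for slice", sid)))
```

Keeping the entry in the queue until the interrupt has been raised matches the order given in step 2), where release happens only after the Slice has been fully parsed.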

The multi-core hardware decoding accelerator VDEC_MCORE may include a preprocessor (Pre-Parser) module and a plurality of homogeneous full-function hardware decoders (Full-function Decoders). A full-function hardware decoder can at least perform the steps of inverse DCT transformation, inverse quantization, intra-frame and inter-frame prediction, and pixel reconstruction that are necessary for decoding a macroblock row.

Two or more homogeneous full-function hardware decoders can be provided. With two decoders the accelerator is called a dual-core hardware decoding accelerator, with three a tri-core hardware decoding accelerator, with four a quad-core hardware decoding accelerator, and so on. Each full-function hardware decoder is responsible for decoding one macroblock row, so a dual-core hardware decoding accelerator can decode two macroblock rows in parallel, a tri-core hardware decoding accelerator can decode three macroblock rows in parallel, and so on.

Preferably, the preprocessor module is configured to perform only the entropy decoding task of the video code stream. Specifically, the preprocessor module may record the necessary information at the head of each macroblock line and push that information, in the form of a Command Queue, into the Line Queue of the corresponding full-function hardware decoder, where it waits to be processed.

The present embodiment is described in detail below, taking a dual-core hardware decoding accelerator as the preferred example.

Referring to fig. 5, the dual-core hardware decoding accelerator includes a preprocessor (Pre-Parser) module and two homogeneous full-function hardware decoders (Full-function Decoders), specifically a first full-function decoder Core0 and a second full-function decoder Core1.

The Pre-Parser module is set based on the serial nature of entropy decoding. Due to the serial characteristic of entropy decoding, before a full-function hardware decoder performs decoding work, information such as the starting address and the entropy decoding starting state of each macro block row in a code stream needs to be provided for the full-function hardware decoder, and therefore a Pre-processor (Pre-Parser) module is arranged to execute a code stream entropy decoding task.

The Pre-Parser module executes only the code stream entropy decoding task, records the necessary information at the head of each macroblock line, and then pushes that information into the Line Queue of the corresponding full-function hardware decoder in a Command Queue manner. Because the Pre-Parser module is responsible only for entropy decoding, it can, with careful design, readily run at a high clock frequency to raise the entropy decoding speed and thereby match the performance of the multi-core full-function decoders.

By way of example and not limitation, the first full-function hardware decoder may be configured to decode the 2N-th macroblock lines, and the second full-function hardware decoder may be configured to decode the (2N+1)-th macroblock lines, where N is an integer greater than 0, as shown in fig. 6.
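The dispatch of line-head information into per-core Line Queues can be sketched as follows. This is a minimal sketch under stated assumptions: the `LineCommand` fields, the representation of the entropy state as an opaque integer, and the `mb_row % num_cores` rule (which reduces to the 2N / 2N+1 split for two cores) are illustrative choices, not details given in the disclosure.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LineCommand:
    """Hypothetical record the Pre-Parser stores at the head of a line."""
    mb_row: int
    bitstream_offset: int   # start address of the line in the code stream
    entropy_state: int      # opaque entropy-decoding start state


def dispatch_rows(row_heads, num_cores=2):
    """Push one LineCommand per macroblock line into the owning core's queue.

    row_heads is a list of (bitstream_offset, entropy_state) tuples, one per
    macroblock line, in the order the Pre-Parser encounters them.
    """
    line_queues = [deque() for _ in range(num_cores)]
    for mb_row, (offset, state) in enumerate(row_heads):
        owner = mb_row % num_cores   # even lines -> Core0, odd -> Core1
        line_queues[owner].append(LineCommand(mb_row, offset, state))
    return line_queues
```

For example, five line heads dispatched to two cores leave lines 0, 2 and 4 in the first queue and lines 1 and 3 in the second.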

When performing the decoding task, each full-function hardware decoder may check whether its own Line Queue contains an instruction and, if so, start decoding a macroblock line according to the contents of that instruction. Meanwhile, during decoding, each full-function hardware decoder can monitor the working position of the full-function hardware decoder that handles the previous macroblock line, and must keep its own processing position at least two macroblocks behind the processing position of the previous line.

The two full-function hardware decoders may share one set of line buffers, forming a dual-core shared line buffer. All dependency information among the macroblocks is stored in the line buffer. The two full-function hardware decoders jointly maintain the line buffer, which may include, for example, read and update operations on the line buffer.

Accordingly, each full-function hardware decoder is configured to: check whether its Line Queue contains an instruction and, when it does, start decoding a macroblock line according to the contents of the instruction; and, during decoding, monitor the working position of the full-function hardware decoder that handles the previous macroblock line, keeping its own processing position at least two macroblocks behind the processing position of the previous macroblock line.
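The lag rule can be expressed as a simple predicate. The sketch below assumes one particular reading of "at least two macroblocks behind": a macroblock in column `mb_x` may start only once the line above has finished columns `mb_x` and `mb_x + 1`, or has finished entirely. The function name and parameters are hypothetical.

```python
def may_start_macroblock(mb_x, upper_row_done, row_width, min_lag=2):
    """Return True if the macroblock in column mb_x may start decoding.

    upper_row_done is the number of macroblocks already finished in the
    line above, or None when there is no line above (topmost line).
    """
    if upper_row_done is None:
        return True                      # topmost line has no dependency
    if upper_row_done >= row_width:
        return True                      # the line above is complete
    return upper_row_done - mb_x >= min_lag
```

Under this reading, a decoder at the start of a line waits until the line above has finished two macroblocks, and the last macroblock of a line waits until the line above is complete.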

In a specific implementation, the processing positions of the two full-function hardware decoders can be coordinated by a multi-core synchronization mechanism so that the macroblocks being decoded in the upper and lower lines remain separated by at least two macroblocks. In this embodiment, dual-core synchronous decoding is preferably implemented by having a line buffer (line cache) arbiter grant a line buffer ready flag, since a full-function hardware decoder must first read the line buffer information when it starts each macroblock. The dual-core synchronization mechanism is further described in conjunction with figs. 6, 7 and 8.

Referring to fig. 7, the first full-function hardware decoder and the second full-function hardware decoder are both communicatively connected to the line cache through the line cache arbiter, and when the full-function hardware decoder needs to read the line cache information, it needs to obtain a line cache ready flag granted by the line cache arbiter. Therefore, the processing speed of the first full-function hardware decoder and the second full-function hardware decoder can be coordinated by presetting the arbitration rule of the line cache arbiter.

Accordingly, the multi-core hardware decoding accelerator is configured so that, when each macroblock is started, the assigned full-function hardware decoder reads the line cache information, and the processing speeds of the full-function hardware decoders handling the upper and lower macroblock lines are coordinated such that, for any two vertically adjacent macroblock lines being decoded, the processing position of the lower macroblock line always lags behind the processing position of the adjacent upper macroblock line by at least two macroblocks.

In this technical solution, the line buffer must guarantee two things. First, the processing position of the first full-function hardware decoder (Core0 in fig. 6) must lead the processing position of the second full-function hardware decoder (Core1 in fig. 6) by at least two macroblocks in the horizontal direction. Second, when Core0 starts decoding a new 2(N+1)-th macroblock line (that is, Core0 enters the next round of macroblock line processing), the processing position in the 2(N+1)-th macroblock line must lag the processing position in the preceding (2N+1)-th macroblock line by at least two macroblocks in the vertical direction. At that time the (2N+1)-th macroblock line is being processed by Core1: since Core0 led Core1 by at least two macroblock positions in the previous round, Core1 is still working on the previous round when Core0 enters the next round, and the line processed by Core1 (the (2N+1)-th line) is the line directly above the line processed by Core0 (the 2(N+1)-th line).

In fig. 8, a dotted macroblock indicates a macroblock processed by Core1 (belonging to the (2N+1)-th line), a diagonally hatched macroblock indicates a macroblock of the new line (the 2(N+1)-th line) that Core0 starts processing, and the macroblock processed by Core0 lies behind the macroblock processed by Core1.
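To illustrate how the same lag rule covers both the within-round case and the case where a core wraps around to a new line, the following sketch simulates the cores in discrete ticks and then checks the resulting schedule. It is a simplified model under stated assumptions: every macroblock takes one tick, each core sees the state at the start of the tick, and the names `simulate` and `lag_respected` are hypothetical.

```python
def simulate(rows, width, num_cores=2, min_lag=2):
    """Return a list of ticks; each tick lists (core, row, column) decoded."""
    done = [0] * rows                    # macroblocks finished in each line
    current = list(range(num_cores))     # line each core is working on
    trace = []
    while any(row < rows for row in current):
        snapshot = list(done)            # state seen by all cores this tick
        tick = []
        for core, row in enumerate(current):
            if row >= rows:
                continue                 # this core has no lines left
            x = snapshot[row]
            upper = snapshot[row - 1] if row > 0 else width
            # Proceed if the line above is finished or leads by min_lag.
            if upper >= width or upper - x >= min_lag:
                done[row] += 1
                tick.append((core, row, x))
                if done[row] == width:
                    current[core] = row + num_cores   # next round of lines
        trace.append(tick)
    return trace


def lag_respected(trace, width, min_lag=2):
    """Check that each macroblock started only after the line above had
    finished columns x .. x + min_lag - 1 (clamped to the line width)."""
    finished_at = {}
    for t, tick in enumerate(trace):
        for _, row, x in tick:
            finished_at[(row, x)] = t
    for (row, x), t in finished_at.items():
        if row == 0:
            continue
        for ux in range(min(width, x + min_lag)):
            if finished_at.get((row - 1, ux), t) >= t:
                return False
    return True
```

In this model, when Core0 finishes line 2N and moves on to line 2(N+1), it is treated exactly like any other lower line: it stalls until Core1, still on line 2N+1, is at least two macroblocks ahead.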

In this embodiment, the multi-core synchronization mechanism may delay the line buffer ready flag granted by the arbiter. To eliminate the time overhead caused by this delay, each full-function hardware decoder may further be provided with a set of Ping-Pong prefetch buffers (Ping-Pong buffers) that store the line buffer data of the previous line.

Referring to fig. 9, in each full-function hardware decoder, a single-core line cache prefetch module is provided. The single-core line cache prefetching module is a ping-pong prefetching cache which is composed of a first cache (buf0) and a second cache (buf1), and line cache data of a previous macro block line are stored through the ping-pong prefetching cache.

Specifically, the first cache and the second cache are used in alternation. While the current command executes out of the first cache, the second cache prefetches the data required by the next command; when the current command completes, execution switches to the second cache for the next command, and so on. Alternating between the two caches in this way saves time overhead during hardware execution.

Accordingly, the single-core line cache prefetch module is configured to: starting from the leftmost macroblock of a macroblock line, read the required data from the dual-core shared line cache into the ping-pong prefetch cache, filling both the first cache and the second cache completely on the first fill; after the decoder finishes decoding a macroblock, update the corresponding first or second cache and write the information recorded in that cache back to the shared line cache; after the write-back is completed, reclaim that cache and prefetch the line cache data required by the next macroblock; and repeat these operations until the rightmost side of the image is reached.

Referring to fig. 10, starting from the leftmost side of a line, the required data is first read from the dual-core shared line buffer (buffer may be abbreviated as buf) into the Ping-Pong prefetch buffer (Ping-Pong buffer).

When data is written for the first time, both the first cache (buf0) and the second cache (buf1) are filled completely. Otherwise, a decoder internal access operation is performed through one of the caches, for example the first cache; an internal access reads first and then writes. After a macroblock has been decoded in the decoder, the corresponding cache (the first cache in this example) is updated and the information recorded in it is written back to the shared line buffer (update buffer to line buffer; bufx in fig. 10 denotes the x-th cache, where x is 0 or 1, i.e. buf0 or buf1). After the write-back is completed, that cache is reclaimed and used to prefetch the line buffer data required by the next macroblock. This is repeated until the rightmost side of the image is reached.

It should be noted that the update-to-line-buf step and the subsequent prefetch-line-buf-to-bufx step in fig. 10 operate on the same bufx and together form one atomic operation of the Ping-Pong prefetch buffer (Ping-Pong buffer).
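The ping-pong sequence described above can be sketched as follows. This is a sequential model for illustration only: real hardware would overlap the prefetch with decoding, and the shared line buffer is modelled as a plain list with one entry per macroblock column. The class name, the `decode_mb` callback and the per-column layout are assumptions of the sketch.

```python
class PingPongPrefetch:
    """Two-entry prefetch cache (buf0, buf1) in front of a shared line buffer."""

    def __init__(self, line_buffer):
        self.line_buffer = line_buffer   # shared; one entry per MB column
        self.bufs = [None, None]         # buf0 and buf1
        # First fill: load both caches starting from the leftmost macroblock.
        for idx in range(2):
            self._prefetch(idx, idx)

    def _prefetch(self, idx, column):
        """Load the entry for `column` into bufs[idx], if the column exists."""
        in_range = column < len(self.line_buffer)
        self.bufs[idx] = self.line_buffer[column] if in_range else None

    def decode_row(self, decode_mb):
        """Decode one macroblock line left to right.

        decode_mb(x, upper_info) returns the updated entry for column x.
        """
        for x in range(len(self.line_buffer)):
            idx = x % 2                          # alternate buf0 / buf1
            upper_info = self.bufs[idx]          # internal access: read first
            self.bufs[idx] = decode_mb(x, upper_info)   # ... then write
            # Write-back and prefetch use the same bufx and are treated as
            # one atomic step in this sketch.
            self.line_buffer[x] = self.bufs[idx]
            self._prefetch(idx, x + 2)
```

While one cache serves the macroblock being decoded, the other already holds the data for the next column, so the decoder does not have to wait on the shared line buffer at every macroblock.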

The invention also provides a video decoding method using the general multi-core parallel decoder system in the previous embodiment. The method comprises the following steps:

Step 100, receiving video code stream data.

Step 200, parsing the non-entropy-coded data of the upper layer of the video code stream through the decoding firmware, processing the decoding task of the macroblock layer of the video code stream through the multi-core hardware decoding accelerator, and performing Slice parallel processing between the decoding firmware and the multi-core hardware decoding accelerator through a Slice Queue, with Slice-level data in the video code stream used as the interactive unit.

The multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders. The preprocessor module performs entropy decoding of the video code stream. Each full-function hardware decoder is responsible for decoding one macroblock line, the decoding comprising the steps of inverse DCT (discrete cosine transform), inverse quantization, intra-frame and inter-frame prediction, and pixel reconstruction. The macroblocks being decoded in two vertically adjacent lines are kept at least two macroblocks apart so as to realize multi-core synchronous decoding.

In this embodiment, the video may be an AVC video or an HEVC video.

Other technical features are described in the previous embodiment and are not described in detail herein.

The technical scheme provided by the invention can be used for multi-core parallel decoding of AVC (Advanced Video Coding) video or HEVC (High Efficiency Video Coding) video. Software and hardware parallel processing is realized by partitioning the video decoder into software and hardware and combining this with a Slice Queue strategy. Parallel processing of vertically adjacent macroblock lines is realized by a preprocessor and multi-core full-function hardware decoders combined with a Line Queue strategy. In addition, a scheme for line buffer data sharing and synchronization management among the cores is provided, which can improve the accuracy and stability of parallel macroblock-line decoding.

In the foregoing description, the disclosure of the present invention is not intended to limit itself to these aspects. Rather, the various components may be selectively and operatively combined in any number within the intended scope of the present disclosure. In addition, terms like "comprising," "including," and "having" should be interpreted as inclusive or open-ended, rather than exclusive or closed-ended, by default, unless explicitly defined to the contrary. All technical, scientific, or other terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. Common terms found in dictionaries should not be interpreted too ideally or too realistically in the context of related art documents unless the present disclosure expressly limits them to that. Any changes and modifications of the present invention based on the above disclosure will be within the scope of the appended claims.

Claims

1. A general multi-core parallel decoder system, characterized by: comprising decoding firmware and a multi-core hardware decoding accelerator that are communicatively connected;

2. The generalized multi-core parallel decoder system according to claim 1, wherein: when Slice parallel processing is performed by Slice Queue,

the decoding firmware is configured to: after the upper layer of the video code stream is analyzed, the Slice upper layer parameter information is packed and pressed into Slice Queue;

3. The generalized multi-core parallel decoder system according to claim 1, wherein: the preprocessor module is configured to:

4. The generalized multi-core parallel decoder system according to claim 3, wherein: the multi-core hardware decoding accelerator comprises a first full-function hardware decoder and a second full-function hardware decoder, wherein the first full-function hardware decoder is used for decoding the 2N-th macroblock lines, the second full-function hardware decoder is used for decoding the (2N+1)-th macroblock lines, and N is an integer greater than 0; the two full-function hardware decoders share a group of line caches to form dual-core sharing, and dependency relationship information between macroblocks is stored in the line caches;

5. The generalized multi-core parallel decoder system according to claim 4, wherein: the line cache arbiter is used for authorizing a line cache ready mark to realize dual-core synchronous decoding;

6. The generalized multi-core parallel decoder system according to claim 5, wherein: a single-core line cache prefetching module is arranged corresponding to each full-function hardware decoder and comprises a ping-pong prefetching cache consisting of a first cache and a second cache, and line cache data of a previous macro block line are stored through the ping-pong prefetching cache;

7. The generalized multi-core parallel decoder system according to any of claims 1-6, wherein: the non-entropy-coded data comprises one or more of a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), and Slice header information.

8. The generalized multi-core parallel decoder system according to any of claims 1-6, wherein: the decoding task of the macroblock layer comprises the following steps performed entirely in hardware: code stream reading, entropy decoding, inverse DCT (discrete cosine transform), inverse quantization, intra-frame and inter-frame prediction, motion compensation, pixel reconstruction and deblocking filtering.

9. A video decoding method, characterized by comprising the steps of:

receiving video code stream data;

wherein,

the multi-core hardware decoding accelerator comprises a preprocessor module and at least two homogeneous full-function hardware decoders, the preprocessor module is used for entropy decoding of the video code stream, each full-function hardware decoder is used for decoding one macroblock line, the decoding comprising the steps of inverse DCT (discrete cosine transform), inverse quantization, intra-frame and inter-frame prediction, and pixel reconstruction, and the macroblocks being decoded in two vertically adjacent lines are separated by at least two macroblocks.

10. The video decoding method of claim 9, wherein: the video is an AVC video or an HEVC video.