CN108848384A

CN108848384A - An efficient parallel transcoding method for multi-core platforms

Info

Publication number: CN108848384A
Application number: CN201810628187.8A
Authority: CN
Inventors: 张为华; 李弋; 鲁云萍
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2018-06-19
Filing date: 2018-06-19
Publication date: 2018-11-20

Abstract

The invention belongs to the field of computer technology and specifically concerns an efficient parallel transcoding method for multi-core platforms. In the invention, video transcoding comprises two stages, decoding and encoding. Function-level parallelism covers the two modules of decoding and encoding, and data-level parallelism covers GOP-level and frame-level parallelism. The system keeps a buffer that stores images arranged in display order; an encoding thread takes a contiguous segment (a coding unit) from it, encodes it independently, and generates an intermediate temporary file. Finally, the temporary files are merged into the target video. After a video is input, threads are woken up and execute the transcoding task. During transcoding, threads pass through four stages: slicing, decoding, encoding, and merging, and threads in different stages run in parallel as a pipeline. The result produced by each stage is supplied to the next stage and is managed by dedicated data structures. The invention makes full use of the computing resources of the underlying multi-core hardware to improve transcoding efficiency while guaranteeing video quality.

Description

An efficient parallel transcoding method for multi-core platforms

Technical field

The invention belongs to the field of computer technology and in particular relates to an efficient parallel transcoding method for multi-core platforms, which makes full use of the computing resources of the underlying multi-core hardware to improve transcoding efficiency while guaranteeing video quality.

Background art

With the rapid development of the Internet and multimedia information, data volumes have begun to grow explosively, and digital video accounts for the major part of the massive data transmitted over the Internet every day. According to the network traffic report published by Cisco in 2017, total network traffic in 2016 was 1.15 ZB (1 ZB = 1024³ TB), of which video accounted for 72%; by 2021, total traffic is projected to reach 3.33 ZB, with the share of video rising to 82%.

The spread of digital video has enriched people's lives: people can watch video online on mobile phones, computers, and other devices. However, playback must take into account compatibility issues such as resolution, bit rate, and coding format. For example, to play a video on devices with different screen sizes, it must be scaled accordingly; to play it under poor network bandwidth, the bit rate must be reduced; to play it in a specific player, the coding format must be converted, for example to H.264 or MPEG-4. Video transcoding technology was developed to solve these problems.

To let users watch video in different environments, a service provider first converts the local video into videos of certain specifications and then transmits them to users over the network. Taking Netflix as an example, one video must be transcoded into 120 target video files with different parameters before being delivered to users. In video transcoding, a source generally has to be converted into multiple target video files with different resolutions, bit rates, and coding formats. Transcoding must also guarantee low latency; for example, 25 frames per second or more is needed for a good user experience. In addition, transcoding is inherently a compute-intensive application. All of this poses a huge challenge to video service providers, so accelerating video transcoding is very necessary.

Transcoding improves the compatibility of video across different scenarios: with different parameters, the same video can be converted into target videos in multiple formats. For ordinary users, transcoding an input video into a target video of one format is very common; this is single-source, single-target transcoding. Video service providers need to convert a high-definition video into multiple target videos with different transcoding parameters (resolution, bit rate, coding format); this is the single-source, multi-target transcoding scenario. In either scenario, transcoding generally requires low latency for a good user experience, for example at least 25 frames per second, and it usually also needs to keep the quality of the target video constant.

The computational cost of transcoding is high: the transcoding frame rate of a single-core processor is usually below ten frames per second, which cannot meet user demand. The emergence of multi-core technology offers an opportunity to accelerate transcoding, and parallel techniques have already been applied to it. These can be divided mainly into GOP-level (Group of Pictures) and frame-level parallelism. GOP-level parallelism scales well, but it is still immature and suffers from degraded target video quality. Frame-level parallelism remains the mainstream parallel scheme and is used by mainstream codecs such as FFmpeg and x264, but its parallel scalability is poor and it cannot make full use of the computing resources of a multi-core platform.

The present invention analyzes the parallelism of GOP-based transcoding and devises an efficient parallel transcoding method that solves the problem of low computing resource utilization.

Summary of the invention

The purpose of the present invention is to provide an efficient parallel transcoding method for multi-core platforms with high computing resource utilization.

The efficient parallel transcoding method for multi-core platforms provided by the invention exploits the independence of encoding and decoding video GOPs. While guaranteeing video image quality, it uncovers the parallelism of video transcoding and makes full use of the computing resources of the underlying multi-core hardware to accelerate the transcoding process.

Video transcoding comprises two stages, decoding and encoding. Function-level parallelism mainly covers the coarse-grained modules of decoding and encoding, and data-level parallelism mainly covers GOP-level and frame-level parallelism. GOP-level parallel transcoding requires the video to be cut into closed GOPs in advance; decoding threads fetch different closed GOPs and decode them into raw image sequences. The system keeps a buffer that stores images arranged in display order; an encoding thread takes a contiguous segment (a coding unit) from it, encodes it independently, and generates an intermediate temporary file. Finally, the temporary files are merged into the target video.

There is no data dependence between closed GOPs, so this form of parallelism scales well. The present invention implements an efficient parallel transcoding system based on closed GOPs.

The framework of the efficient parallel transcoding method for multi-core platforms provided by the invention is shown in Fig. 1. The invention manages computing resources with a thread pool: when there is no transcoding task, the threads sleep; after a video is input, the threads are woken up and execute the transcoding task. During transcoding, threads pass through four stages: slicing, decoding, encoding, and merging, and threads in different stages run in parallel as a pipeline. Adjacent stages stand in a producer-consumer relationship: the result produced by one stage is supplied to the next stage and is managed by a dedicated data structure. For example, the closed GOP queue stores the video segment information produced by slicing.

After the transcoding threads are woken up, they first enter the slicing stage. The thread in the slicing stage cuts the video, with the closed GOP as the unit, into independently decodable segments; the other threads immediately enter the decoding stage. The system uses a slicing state flag to ensure that only one thread performs slicing; the implementation is described in detail in the section on inter-stage parallelism. The slicing thread puts the segment information describing each closed GOP into the closed GOP queue, and decoding threads take segment information from that queue and decode it, so the two can execute in parallel.

During decoding, a thread puts the raw images it produces into a coding unit. As in existing parallel transcoding systems, a coding unit is a data structure that stores a contiguous run of raw images; once filled, it is encoded as a whole. Because the intermediate data produced by decoding occupies a large amount of memory, the efficient parallel transcoding system manages coding units uniformly with a circular queue. Decoding threads and encoding threads are scheduled dynamically according to the state of the circular queue, so that high computing resource utilization is maintained throughout encoding and decoding.

After encoding the raw image frames in a coding unit, the encoding thread writes the segment out as a temporary file. If a contiguous run of temporary files has been generated, the encoding thread merges them in advance, so that the final merge does not take too long. The invention uses a reorder table to record which temporary files have been generated, which helps the encoding threads merge in advance. After all encoding tasks are complete, the final merge is performed. Once the target file has been generated, the transcoding task ends, and the threads are reclaimed by the thread pool and go to sleep, waiting for the next transcoding task.

In the present invention, parallel transcoding mainly means that the four transcoding stages run in parallel as a pipeline: a stage need not finish completely before the next one starts; instead, adjacent stages can execute at the same time. Pipelined parallelism eliminates the waste of computing resources caused by serial slicing, by decoding threads frequently going to sleep, and by serial merging.

First, the slicing stage can execute in parallel with decoding. In the efficient parallel transcoding system, all threads execute the transcoding task through the same entry point. However, the slicing stage is not suited to parallel execution, so to ensure that only one thread slices the video, the system maintains a slicing state flag. The flag has three states: not yet sliced, slicing in progress, and slicing completed, denoted c₀, c₁, and c₂ respectively, as shown in Fig. 5.

The flag is initialized to c₀ when the transcoding task starts. If a thread reads the flag and finds it to be c₀, that thread sets it to c₁ and executes the slicing task. When other threads read the flag, it has already been set to c₁ or c₂, so they enter the decoding stage directly, as shown in Fig. 4. Reads and modifications of the slicing state flag must be protected by a lock; only then is the read-and-update operation atomic. Once each thread has determined its task, the slicing thread scans the video and puts the segment information of each closed GOP into the closed GOP queue, while decoding threads take closed GOP information from the queue and decode the video segment it describes. The slicing thread can therefore execute in parallel with the decoding threads. When the slicing thread finishes, it sets the slicing state to c₂ and enters the decoding stage.
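The lock-protected state flag can be illustrated with a minimal Python sketch. The names `ClipStatus`, `ClipFlag`, `try_claim`, and `run_workers` are hypothetical and chosen for illustration; the patent's own implementation is in C on top of POSIX threads, so this only mirrors the logic described above.

```python
import threading
from enum import Enum


class ClipStatus(Enum):
    NOT_YET = 0   # c0: slicing has not started
    CLIPPING = 1  # c1: slicing in progress
    FINISHED = 2  # c2: slicing completed


class ClipFlag:
    """Slicing state flag whose reads and writes are protected by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.status = ClipStatus.NOT_YET

    def try_claim(self):
        """Return True for exactly one caller, which becomes the slicing thread."""
        with self._lock:
            if self.status is ClipStatus.NOT_YET:
                self.status = ClipStatus.CLIPPING
                return True
            return False

    def finish(self):
        with self._lock:
            self.status = ClipStatus.FINISHED


def run_workers(n):
    """Start n threads through the same entry point and record each one's role."""
    flag = ClipFlag()
    roles = []
    roles_lock = threading.Lock()

    def worker():
        role = "slice" if flag.try_claim() else "decode"
        if role == "slice":
            flag.finish()  # a real slicer would scan the video before this
        with roles_lock:
            roles.append(role)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return roles, flag.status
```

Because the check and the update happen under the same lock, only one thread can observe `NOT_YET`; every other thread falls through to decoding.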

The decoding and encoding stages are scheduled dynamically according to the state of the circular queue of coding units. Because the raw image data produced by decoding occupies a large amount of memory, a circular queue of coding units is used to store the raw images uniformly. A coding unit has two states, idle and saturated, as shown in Fig. 6. The idle state means the coding unit holds no raw images; the saturated state means the coding unit has been filled with raw images. The state of a coding unit switches dynamically during encoding and decoding: after a decoding thread fills a coding unit, the unit is set to the saturated state; after the encoding task for the coding unit ends, the unit is set back to the idle state.
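A minimal single-threaded sketch of such a circular queue follows. The class and method names are hypothetical, and the `claimed` list is an assumption added here so that a unit already handed to an encoder is not handed out twice; a real multi-threaded version would also guard these fields with a lock.

```python
from enum import Enum


class UnitState(Enum):
    IDLE = 0       # holds no raw images
    SATURATED = 1  # filled with raw images, waiting to be encoded


class CodingUnitRing:
    def __init__(self, num_units):
        self.frames = [None] * num_units
        self.states = [UnitState.IDLE] * num_units
        self.claimed = [False] * num_units  # assumption: marks units being encoded
        self.write_pos = 0
        self.read_pos = 0

    def fill(self, frames):
        """Decoder side: store frames in the next unit, or return None if it is busy."""
        i = self.write_pos
        if self.states[i] is not UnitState.IDLE:
            return None
        self.frames[i] = list(frames)
        self.states[i] = UnitState.SATURATED
        self.write_pos = (i + 1) % len(self.states)
        return i

    def take_saturated(self):
        """Encoder side: claim the next saturated unit, or return None."""
        i = self.read_pos
        if self.states[i] is not UnitState.SATURATED or self.claimed[i]:
            return None
        self.claimed[i] = True
        self.read_pos = (i + 1) % len(self.states)
        return i

    def release(self, i):
        """Encoder side: encoding finished, so the unit becomes idle again."""
        self.frames[i] = None
        self.claimed[i] = False
        self.states[i] = UnitState.IDLE
```

A `None` from `fill` corresponds to decoding running ahead of encoding, and a `None` from `take_saturated` to the opposite case; these are the two conditions that drive the dynamic scheduling.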

In the present invention, parallel transcoding also includes parallelism within each transcoding stage. A video consists of a series of video frames and can be divided into multiple closed GOPs. The decoding of closed GOPs is mutually independent, so they can be decoded in parallel. The decoded images are divided into different image sequences in display order, and these can be encoded in parallel. Finally, adjacent encoded sequences produced by encoding can be merged in parallel.

Slicing stage: if the video has no frame index, slicing must scan the video progressively and divide it into closed GOPs. If the video file contains a frame index, the frame index data is read, the video is split into several segments of similar duration according to the number of cuts, and the cut positions are sent to the next stage, decoding.
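As one possible illustration of index-based slicing, the sketch below picks cut positions from a list of keyframe timestamps so that the segments have similar duration. The function name, the nearest-keyframe rule, and the input representation are assumptions for illustration; the text only states that the video is split into segments of similar duration according to the number of cuts. The keyframe list is assumed to be sorted and non-empty.

```python
from bisect import bisect_left


def choose_cut_points(keyframe_pts, duration, num_segments):
    """Pick cut timestamps at keyframes closest to evenly spaced targets."""
    cuts = []
    for k in range(1, num_segments):
        target = duration * k / num_segments
        j = bisect_left(keyframe_pts, target)
        # Candidates: the keyframe just before and just after the target.
        candidates = keyframe_pts[max(j - 1, 0):j + 1]
        best = min(candidates, key=lambda p: abs(p - target))
        # Skip the start of the video and any repeated cut position.
        if best > 0 and (not cuts or best > cuts[-1]):
            cuts.append(best)
    return cuts
```

When keyframes are sparse, two targets may map to the same keyframe, in which case fewer segments than requested are produced.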

Decoding stage: to prevent data dependence between decoding threads, the invention uses the closed GOP as the cutting unit, and adjacent closed GOPs overlap by one I-frame: the last frame of one closed GOP and the first frame of the next closed GOP are the same frame, as shown in Fig. 2. After a closed GOP has been completely decoded, its final I-frame is discarded, which guarantees that the overlapping I-frame is kept only by the following decoded GOP. The I-frame at the end of the last decoded GOP does not overlap with any other decoded GOP, so that frame must be kept.
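The overlap rule can be sketched in a few lines. The function name and the representation of frames as list items are illustrative assumptions.

```python
def stitch_decoded_gops(decoded_gops):
    """Concatenate decoded closed GOPs, dropping each trailing overlapped I-frame.

    Each GOP ends with the I-frame that also starts the next GOP, so every GOP
    except the last one discards its final frame.
    """
    frames = []
    for idx, gop in enumerate(decoded_gops):
        is_last = idx == len(decoded_gops) - 1
        frames.extend(gop if is_last else gop[:-1])
    return frames
```

For example, three GOPs `[I0, P1, I2]`, `[I2, P3, I4]`, `[I4, P5, I6]` yield seven frames with `I2` and `I4` each appearing once.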

Merging stage: parallel encoding generates many temporary files, and the invention combines them by merging. To reduce the number of disk reads and writes, the temporary files are merged in two levels, as shown in Fig. 3. A level-one temporary file is a temporary file produced from a coding unit; a level-two temporary file is the file obtained after one round of merging.

The invention further includes parallel bit rate allocation: the bit rate distribution of the input video is used to guide bit rate allocation in parallel encoding. Specifically, an algorithm is used that allocates bit rate by SATD (Sum of Absolute Transformed Differences, the sum of the absolute values of the 4×4 prediction residual after a Hadamard transform).

Since SATD uses half-precision residual data, computing SATD during encoding would first require completing the complex prediction process, which brings a large performance cost. Because decoding already produces the residual after inverse quantization and inverse transformation, the invention places the SATD computation in the decoding stage, so that only a simple Hadamard transform and an absolute-value sum are needed and the complex prediction process is avoided. The SATD-based bit rate allocation process is shown in Fig. 7: the invention first decodes the input video; then computes the SATD value of every frame and uses it as the measure of complexity; then allocates a corresponding bit rate to each coding unit according to the encoding requirements; and finally re-encodes the video frames at the allocated bit rates.

With the bit rate allocation method proposed by the invention, SATD is computed during decoding. Because decoded images occupy a large amount of memory, the video cannot be fully decoded before being re-encoded, so the SATD distribution of the entire video cannot be computed before encoding. Since encoding starts only after the circular queue has been filled, the invention initially sets the average bit rate over the circular queue window to the target bit rate and then allocates bit rates according to the SATD distribution. As encoding proceeds, the number of encoded coding units grows and the SATD distribution curve becomes more and more complete; the bit rate allocation algorithm only needs to ensure that the average bit rate at encoding time meets the target.
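For reference, a 4×4 SATD over an already available residual block can be sketched as below. This is a plain, unnormalized Hadamard transform followed by an absolute-value sum; actual encoders may apply an additional scaling factor, and the function names are illustrative.

```python
# 4x4 Hadamard matrix; it is symmetric, so it equals its own transpose.
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(residual):
    """Sum of absolute values of H4 * residual * H4 (no normalization applied)."""
    transformed = matmul4(matmul4(H4, residual), H4)
    return sum(abs(v) for row in transformed for v in row)
```

A frame-level SATD would then be the sum of `satd_4x4` over all 4×4 residual blocks of the frame.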
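One simple way to satisfy the constraint described above is to give each coding unit a bit rate proportional to its SATD relative to the mean SATD of the units seen so far. The proportional rule and the function name are assumptions made for illustration; the text only requires that the average bit rate match the target.

```python
def allocate_bitrates(unit_satds, target_bitrate):
    """Allocate per-unit bit rates proportional to SATD, averaging to the target.

    unit_satds holds the SATD totals of the coding units known so far, for
    example those currently in the circular queue window.
    """
    mean_satd = sum(unit_satds) / len(unit_satds)
    if mean_satd == 0:
        # No complexity information: fall back to a flat allocation.
        return [float(target_bitrate)] * len(unit_satds)
    return [target_bitrate * s / mean_satd for s in unit_satds]
```

As more units are decoded, the call can be repeated with the longer SATD list, so later allocations reflect a more complete distribution.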

Brief description of the drawings

Fig. 1 is the efficient parallel transcoding framework.

Fig. 2 is the closed GOP.

Fig. 3 is video merging.

Fig. 4 is a schematic of parallel slicing and decoding.

Fig. 5 is the table mapping slicing states to symbols.

Fig. 6 is the coding unit state transition diagram.

Fig. 7 is a schematic of SATD-based parallel bit rate allocation.

Fig. 8 is the parallel implementation of slicing and decoding.

Fig. 9 is the dynamic scheduling process of decoding and encoding.

Fig. 10 is the process of merging encoded files.

Fig. 11 is the process of allocation by SATD distribution.

Detailed description of embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings and examples. It should first be noted that the terms and words used in this specification and the claims must not be interpreted narrowly according to their ordinary or dictionary meanings; they should be interpreted as meanings and concepts consistent with the technical idea of the invention, on the principle that the inventor may appropriately define the concepts of terms in order to explain the invention in the best way. Accordingly, the embodiments described in this specification and the structures shown in the drawings are only preferred embodiments and do not fully represent the technical idea of the invention; it should therefore be understood that various equivalents and variations may exist that could replace them.

FFmpeg is a powerful open-source multimedia processing framework that can convert and edit many kinds of audio and video files and is very widely used. Here we describe how to implement an efficient parallel transcoding system based on the FFmpeg codec framework and the POSIX multithreaded programming model.

1. Parallel transcoding

Within the FFmpeg framework, we implement four functions, clip_video, decode_closed_gop, scale_and_encode, and concatenate, corresponding to the four stages of parallel transcoding: slicing, decoding, encoding, and merging. We also implement the launch_transcoding function as the entry point executed by a thread after it is woken up; each of the functions above can be called by setting the trans_ctx parameter.

To run slicing and decoding in parallel, a variable clip_status is defined in trans_ctx to record the slicing state. The variable has three states: CLIP_NOT_YET, CLIPPING, and CLIP_FIN. Because reads and writes of this variable must be atomic, it is protected with a mutex. The execution flow after a thread enters launch_transcoding is shown in Fig. 8. If, after acquiring the lock, a thread reads clip_status as CLIP_NOT_YET, the thread calls the clip_video function to perform slicing, and calls decode_closed_gop to decode once slicing is complete. Threads that do not need to perform slicing go straight to decoding. In the implementation of clip_video, we use the av_seek_frame function provided by FFmpeg to locate I-frames and record the pts (Presentation Time Stamp) obtained after each I-frame is decoded; when slicing is complete, clip_status is set to CLIP_FIN.

After a thread enters the decoding stage, it traverses the closed GOP queue and decodes the corresponding segments, as shown in Fig. 9. If a closed GOP is obtained successfully and there are enough idle coding units, the thread traverses the closed GOP and calls the avcodec_decode_video2 function to decode it. If obtaining a closed GOP fails and slicing has already completed, no new closed GOPs will be produced, so the thread enters the encoding stage; this is the first entry from decoding into encoding. If a closed GOP is obtained successfully but there are not enough idle coding units, decoding is running too fast, so the thread enters the encoding stage; this is the second entry from decoding into encoding.
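The decisions in this paragraph can be summarized as a small function. The name and string results are illustrative, and the `"wait"` branch (no closed GOP available while slicing is still running) is an assumption, since the text does not say what the thread does in that case.

```python
def next_action_in_decode(got_closed_gop, slicing_finished, has_idle_unit):
    """Decide what a thread in the decoding stage does next."""
    if got_closed_gop:
        # Second entry into encoding: decoding is ahead of encoding.
        return "decode" if has_idle_unit else "encode"
    if slicing_finished:
        # First entry into encoding: no more closed GOPs will arrive.
        return "encode"
    # Assumption: the slicer is still producing GOPs, so try again later.
    return "wait"
```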

After a thread enters the encoding stage, it traverses the saturated coding units and calls avcodec_encode_video2 to encode them. In the multi-target case, an additional inner loop is needed to traverse the multiple encoding contexts, so that each frame is encoded with every set of target parameters. If obtaining a saturated coding unit fails, there are two cases. In the first, decoding has finished and no new saturated coding units will be produced, so the thread enters the merging stage. In the second, decoding has not finished and the circular queue is merely empty, so the encoding thread is rescheduled into the decoding stage.

After a coding unit has produced a temporary file, the thread does not immediately fetch the next saturated coding unit; it first checks whether a run of reordered temporary files can be merged in advance. As shown in Fig. 10, after encoding generates a temporary file, the thread enters the merging stage and marks the corresponding entry in the reorder table as completed. If every entry in the reorder sub-interval containing it is set, those temporary files are merged in advance into a level-two temporary file. Finally, the level-two temporary files are merged into the target video.
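A reorder table of this kind can be sketched as a list of completion flags partitioned into sub-intervals. The fixed `group_size`, the class name, and the return convention are assumptions for illustration; the text does not specify how sub-intervals are sized.

```python
class ReorderTable:
    """Tracks which level-one temporary files exist and when a group can merge."""

    def __init__(self, num_units, group_size):
        self.done = [False] * num_units
        self.group_size = group_size  # assumption: equal-sized sub-intervals
        self.merged_groups = set()

    def mark_done(self, unit_index):
        """Mark one temporary file as generated.

        Returns the unit indices of its sub-interval if that sub-interval has
        just become complete (ready to merge into a level-two file), else None.
        """
        self.done[unit_index] = True
        group = unit_index // self.group_size
        start = group * self.group_size
        end = min(start + self.group_size, len(self.done))
        if group not in self.merged_groups and all(self.done[start:end]):
            self.merged_groups.add(group)
            return list(range(start, end))
        return None
```

The thread that completes a sub-interval receives the list of files to merge, so early merging is spread across encoding threads instead of being left to a single final pass.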

2. Implementation of SATD

The SATD of every frame is computed in the decoding stage; the SATD values are then summed per coding unit, and the share of each sum in the total SATD of the video is analyzed in order to allocate bit rate. Although the total SATD of the input video cannot be obtained in advance, threads begin encoding only after the circular queue has been filled, so sufficient distribution information is still available. Moreover, as transcoding proceeds, more and more SATD information is obtained, so bit rates can be allocated to coding units reasonably.

The process of allocation by SATD distribution is shown in Fig. 11. In H.264 decoding, SATD is computed by applying a Hadamard transform to the prediction residual and then summing the absolute values. In libavcodec, the codec library called by FFmpeg, the function that generates macroblock residuals through the inverse transform stores the residuals in the residual array of the H264Context structure. Implementing SATD therefore only requires applying a Hadamard transform to the residual array of the H264Context structure.

Claims

1. An efficient parallel transcoding method for multi-core platforms, characterized in that video transcoding comprises two stages, decoding and encoding; function-level parallelism covers the two modules of decoding and encoding, and data-level parallelism covers GOP-level and frame-level parallelism; GOP-level parallel transcoding requires the video to be cut into closed GOPs in advance, and decoding threads fetch different closed GOPs and decode them into raw image sequences; the system is provided with a buffer that stores images arranged in display order, and an encoding thread takes a contiguous segment, i.e. a coding unit, from it, encodes it independently, and generates an intermediate temporary file; finally, the temporary files are merged into the target video.

2. The efficient parallel transcoding method for multi-core platforms according to claim 1, characterized in that after a video is input, threads are woken up and execute the transcoding task; during transcoding, threads pass through four stages: slicing, decoding, encoding, and merging, and threads in different stages run in parallel as a pipeline; the result produced by each stage is supplied to the next stage and is managed by a dedicated data structure;

after the transcoding threads are woken up, they first enter the slicing stage; the thread in the slicing stage cuts the video, with the closed GOP as the unit, into independently decodable segments, and the other threads immediately enter the decoding stage; the system uses a slicing state flag to ensure that only one thread performs slicing; the slicing thread puts the segment information describing each closed GOP into the closed GOP queue, and decoding threads take segment information from that queue and decode it, the two executing in parallel;

during decoding, a thread puts the raw images produced by decoding into a coding unit; a coding unit is a data structure that stores a contiguous run of raw images and, once filled, is encoded as a whole; the system manages coding units uniformly with a circular queue; decoding threads and encoding threads are scheduled dynamically according to the state of the circular queue, so that high computing resource utilization is maintained during encoding and decoding;

after encoding the raw image frames in a coding unit, the encoding thread writes the segment out as a temporary file; if a contiguous run of temporary files has been generated, the encoding thread merges them in advance; a reorder table records which temporary files have been generated, to help the encoding threads merge in advance; after all encoding tasks are complete, the final merge is performed; once the target file has been generated, the transcoding task ends, and the threads are reclaimed by the thread pool and go to sleep, waiting for the next transcoding task.

3. The efficient parallel transcoding method for multi-core platforms according to claim 2, characterized in that the four stages of transcoding run in parallel as a pipeline, i.e. adjacent stages execute at the same time;

first, the slicing stage can execute in parallel with decoding; in the system, all threads execute the transcoding task through the same entry point; however, since the slicing stage is not suited to parallel execution, to ensure that only one thread slices the video, the system maintains a slicing state flag, which has three states: not yet sliced, slicing in progress, and slicing completed, denoted c₀, c₁, and c₂ respectively;

the flag is initialized to c₀ when the transcoding task starts; if a thread reads the flag and finds it to be c₀, that thread sets it to c₁ and executes the slicing task; when other threads read the flag, it has already been set to c₁ or c₂, so they enter the decoding stage directly; reads and modifications of the slicing state flag are protected by a lock; once each thread has determined its task, the slicing thread scans the video and puts the segment information of each closed GOP into the closed GOP queue, while decoding threads take closed GOP information from the queue and decode the video segment it describes, i.e. the slicing thread and the decoding threads execute in parallel; when the slicing thread finishes, it sets the slicing state to c₂ and enters the decoding stage;

the decoding and encoding stages are scheduled dynamically according to the state of the circular queue of coding units; because the raw image data produced by decoding occupies a large amount of memory, a circular queue of coding units is used to store the raw images uniformly; a coding unit has two states, idle and saturated; the idle state means the coding unit holds no raw images; the saturated state means the coding unit has been filled with raw images; the state of a coding unit switches dynamically during encoding and decoding: after a decoding thread fills a coding unit, the unit is set to the saturated state; after the encoding task for the coding unit ends, the unit is set to the idle state.

4. The efficient parallel transcoding method for multi-core platforms according to claim 3, characterized in that it further comprises parallelism within each transcoding stage;

slicing stage: if the video has no frame index, slicing must scan the video progressively and divide it into closed GOPs; if the video file contains a frame index, the frame index data is read, the video is split into several segments of similar duration according to the number of cuts, and the cut positions are sent to the next stage, decoding;

decoding stage: the closed GOP is used as the cutting unit, and adjacent closed GOPs overlap by one I-frame, the last frame of one closed GOP and the first frame of the next closed GOP being the same frame; after a closed GOP has been completely decoded, its final I-frame is discarded, which guarantees that the overlapping I-frame is kept only by the following decoded GOP; the I-frame at the end of the last decoded GOP does not overlap with any other decoded GOP, so that frame must be kept;

merging stage: parallel encoding generates many temporary files, which are combined by merging; to reduce the number of disk reads and writes, the temporary files are merged in two levels; a level-one temporary file is a temporary file produced from a coding unit; a level-two temporary file is the file obtained after one round of merging.

5. The efficient parallel transcoding method for multi-core platforms according to claim 4, characterized in that it further comprises parallel bit rate allocation, in which the bit rate distribution of the input video is used to guide bit rate allocation in parallel encoding, using an algorithm that allocates bit rate by SATD, with the following specific operations:

before the data is encoded, the SATD value of every video frame is first computed, and SATD is then used as the measure of complexity to allocate a corresponding bit rate to each coding unit; the SATD computation is placed in the decoding stage; since encoding starts only after the circular queue has been filled, the average bit rate over the circular queue window is initially set to the target bit rate, and bit rates are then allocated according to the SATD distribution; as encoding proceeds, the number of encoded coding units grows and the SATD distribution curve becomes more and more complete, and the bit rate allocation algorithm only needs to ensure that the average bit rate at encoding time meets the target.