CN114374848A

CN114374848A - Video coding optimization method and system

Info

Publication number: CN114374848A
Application number: CN202111565742.5A
Authority: CN
Inventors: 李日; 朱建国; 廖义; 谢亚光; 孙彦龙
Original assignee: Hangzhou Arcvideo Technology Co., Ltd.
Current assignee: Hangzhou Arcvideo Technology Co., Ltd.
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-04-19
Anticipated expiration: 2041-12-20
Also published as: CN114374848B

Abstract

The invention relates to video coding technology and discloses a video coding optimization method and system. The video coding optimization method is applied to a NUMA architecture and comprises the following steps: detecting the CPU hardware of a server and acquiring the total number of CPU nodes and the logical cores included in each CPU node, wherein the total number of CPU nodes is M; creating coding kernels according to the detected total number of CPU nodes; encoding the video on the encoder according to the gop id of the current video frame; and splicing the video code streams output by the encoder into a complete code stream in ascending display order of the I frames in the code streams. The invention improves coding parallelism and avoids the remote memory access of conventional software encoding. Coding speed is thereby improved without any loss of coding quality.

Description

Video coding optimization method and system

Technical Field

The invention relates to video coding technology, and in particular to a video coding optimization method and a video coding optimization system applied to a NUMA (non-uniform memory access) architecture.

Background

At present, software encoders for ultra-high-definition videos all adopt a combination of multiple parallel encoding strategies to realize ultra-high-definition real-time transcoding. Common parallel coding strategies are:

(1) intra-frame line-level parallel encoding (multi-line simultaneous encoding within a frame is achieved by using multiple threads);

(2) frame-level parallel encoding (multi-frame simultaneous encoding is realized by utilizing multithreading);

(3) GOP level parallel encoding (multiple GOP simultaneous encoding is achieved with multiple threads).

Multithreaded parallel coding allows the computing resources of a multi-core CPU to be used as fully as possible. To meet the computational demands of real-time ultra-high-definition video coding, a server with a many-core CPU is generally used as the hardware.

The NUMA architecture, namely 'non-uniform memory access', is commonly adopted by current multi-core CPUs. It removes the performance bottleneck caused by accessing memory through the traditional north bridge on a multi-core CPU. In a NUMA architecture, a server is divided into several nodes (sockets), each with its own CPU and memory. A CPU accesses local physical addresses directly through its memory controller, with high speed and low latency; physical addresses on other nodes are accessed remotely over QPI links.

Multithreaded software running on a NUMA-architecture CPU should minimize remote memory access. Ordinary multithreaded coding, especially 4K/8K ultra-high-definition coding, accesses a large amount of memory, of which the reconstructed frame data and the original frame data occupy the largest share.

The original frame is mainly used for calculating image characteristics and coding complexity in a pre-analysis stage so as to determine a quantization parameter and other parameters of coding; the reconstructed frame mainly provides reference pixels for intra-frame and inter-frame prediction. In addition, in the encoding mode selection stage, it is necessary to calculate encoding distortion using the reconstructed image and the original image. Therefore, the memory storing the reconstructed frame and the original frame is undoubtedly the most frequently accessed portion in the encoding.

When 4K/8K ultra-high-definition coding runs on a multi-core CPU of a NUMA architecture, the coding threads depend on one another and the dependency conditions are not always satisfied, so threads often enter a WAIT state and return to the RUN state once the condition is satisfied. The CPU scheduler carries out these thread state switches, and during a switch a thread may migrate from one NUMA node to another.

Therefore, it cannot be guaranteed that an encoding thread and the reconstructed frame memory and original frame memory it accesses are on the same node at all times. A large amount of remote memory access is inevitable, so the computing capacity of the multi-core CPU cannot be fully exploited.

Video coding standards generally define three types of coded frames: I frame, P frame and B frame. A GOP (group of pictures) is commonly used in video coding to denote the group of pictures between two I frames in a video coding sequence, and the GOP length, which is the number of frames in the GOP, is an important encoder parameter.

GOP parallel coding is a common approach to ultra-high-definition real-time coding. With a closed-GOP frame structure, adjacent GOPs can be encoded independently, so parallelism can be multiplied. The encoder then repacks the GOP code streams in GOP order to form the final complete code stream. A coding kernel is a software module with complete coding capability; its input is an original video image sequence and its output is a coded video code stream.

For example, the prior art includes patent application numbers CN202011644043.5 and 201910600394.7 (patent title: "A method and system for real-time coding of 8K ultra high definition video"; application date: 2020-12-31). On a NUMA-architecture CPU, that approach confines the coding task to a single node and therefore cannot fully exploit the computing power of a multi-node CPU.

Disclosure of Invention

The invention provides a video coding optimization method and a video coding optimization system that address the problem in the prior art that, on a NUMA-architecture CPU, coding tasks are confined to a single node and the computing capacity of a multi-node CPU cannot be fully exploited.

In order to solve the technical problem, the invention is solved by the following technical scheme:

A video coding optimization method, applied to a NUMA architecture, comprises the following steps:

detecting the server, namely detecting the CPU hardware of the server and acquiring the total number of CPU nodes and the logical cores included in each CPU node, wherein the total number of CPU nodes is M;

creating the encoder, namely creating coding kernels according to the detected total number of CPU nodes;

encoding the video, namely encoding the video on the encoder according to the gop id of the current video frame;

and synthesizing the code stream, namely splicing the video code streams output by the encoder into a complete code stream in ascending display order of the I frames in the code streams.
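As a rough illustration of the code stream synthesis step, the sketch below orders per-GOP code streams by the display order of their leading I frames and concatenates them. The names `GopStream`, `i_frame_display_order` and `splice_streams` are hypothetical and do not come from the patent, and the short byte strings merely stand in for real encoder output.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GopStream:
    """Encoded output of one closed GOP (hypothetical container)."""
    i_frame_display_order: int  # display order of the GOP's leading I frame
    payload: bytes              # encoded code stream for the whole GOP


def splice_streams(gops: Iterable[GopStream]) -> bytes:
    """Concatenate GOP code streams in ascending I-frame display order."""
    ordered = sorted(gops, key=lambda g: g.i_frame_display_order)
    return b"".join(g.payload for g in ordered)


# GOPs may finish out of order because the coding kernels run in parallel.
finished = [
    GopStream(60, b"GOP2"),
    GopStream(0, b"GOP0"),
    GopStream(30, b"GOP1"),
]
complete_stream = splice_streams(finished)
```

Sorting on the I-frame display order is sufficient here only because closed GOPs are independent of one another; the sketch does not attempt any bitstream-level rewriting.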

Preferably, the method for creating the encoder includes:

step 1, initializing a node index variable i to be 0;

step 2, calling pthread_setaffinity_np to set the thread of the encoder to run on the logical cores of node i;

step 3, creating a coding kernel i, wherein the coding kernel is bound on the node i;

step 4, initializing the coding kernel, including the allocation of an original frame memory and a reconstructed frame memory;

and step 5, adding 1 to the node index variable, namely i = i + 1; returning to step 2 when i is less than M, otherwise the creation of the encoder is complete.
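The five steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: `CodingKernel`, `create_encoder` and the `set_affinity` parameter are hypothetical names, `os.sched_setaffinity` (Linux only) is used as a Python stand-in for the thread-affinity call in step 2, and Python gives no guarantee about which node the buffers physically land on. A production encoder would more likely do this in native code.

```python
import os
from typing import Callable, Dict, List, Set


class CodingKernel:
    """Stand-in for a coding kernel (hypothetical; performs no real encoding)."""

    def __init__(self, node: int, frame_bytes: int) -> None:
        self.node = node
        # Step 4: buffers are allocated while the caller is pinned to `node`.
        self.original_frame = bytearray(frame_bytes)
        self.reconstructed_frame = bytearray(frame_bytes)


def default_set_affinity(cpus: Set[int]) -> None:
    """Pin the caller to the given logical processors (Linux only)."""
    os.sched_setaffinity(0, cpus)


def create_encoder(
    node_cpus: Dict[int, Set[int]],
    frame_bytes: int,
    set_affinity: Callable[[Set[int]], None] = default_set_affinity,
) -> List[CodingKernel]:
    """Create one coding kernel per node, following steps 1 to 5."""
    kernels: List[CodingKernel] = []
    m = len(node_cpus)
    i = 0                                             # step 1
    while i < m:
        set_affinity(node_cpus[i])                    # step 2
        kernels.append(CodingKernel(i, frame_bytes))  # steps 3 and 4
        i += 1                                        # step 5
    return kernels


# Dry run with a recording stub instead of the real affinity call,
# so the sketch also runs on machines without these CPU ids.
calls: List[Set[int]] = []
kernels = create_encoder({0: {0, 1}, 1: {2, 3}}, frame_bytes=16,
                         set_affinity=calls.append)
```

Passing the affinity function as a parameter is a choice made for this sketch so that the loop can be exercised without touching the real scheduler.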

Preferably, the video encoding method includes:

acquiring the gop id of the current video frame and updating the gop id, wherein the gop id is initialized to be 0 and the maximum value is M-1;

and for the video frame with the current gop id i, sending the current frame to a coding kernel on the node i, copying the original image to an original frame memory of the coding kernel, and then starting coding.

Preferably, the method for updating the gop id is as follows: after the first frame, every time an I frame or IDR frame is received, the gop id is incremented by 1, and if the gop id is equal to M, the gop id is assigned to 0.
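The update rule can be written as a small function. The following is a minimal sketch; the names `next_gop_id` and `assign_gop_ids` and the string frame-type labels are assumptions made for illustration.

```python
from typing import List, Sequence


def next_gop_id(gop_id: int, is_first_frame: bool, is_i_or_idr: bool, m: int) -> int:
    """Return the gop id for the current frame, given the previous gop id."""
    if not is_first_frame and is_i_or_idr:
        gop_id += 1          # a new GOP starts at every I/IDR frame after the first
        if gop_id == m:
            gop_id = 0       # wrap around once every node has been used
    return gop_id


def assign_gop_ids(frame_types: Sequence[str], m: int) -> List[int]:
    """Assign a gop id in the range 0..m-1 to each frame of a sequence."""
    gop_id = 0               # initialized to 0
    ids: List[int] = []
    for index, frame_type in enumerate(frame_types):
        gop_id = next_gop_id(gop_id, index == 0, frame_type in ("I", "IDR"), m)
        ids.append(gop_id)
    return ids


# With M = 2 nodes, consecutive GOPs alternate between gop id 0 and 1.
ids = assign_gop_ids(["IDR", "P", "B", "I", "P", "I", "P"], m=2)
# ids == [0, 0, 0, 1, 1, 0, 0]
```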

In order to solve the technical problem, the invention also provides a video coding optimization system, which is applied to the NUMA architecture and comprises a server detection module, an encoder creation module, a video coding module and a code stream synthesis module;

the server detection module detects CPU hardware of the server and obtains the total number of nodes of the CPU and a logic core included by the nodes of the CPU, wherein the total number of the nodes of the CPU is M;

the encoder creating module is used for creating an encoding kernel according to the detected total number of the CPU nodes;

the video coding module codes the video on the coder according to the gop id of the current video frame;

and the code stream synthesis module splices the video code streams output by the encoder into a complete code stream in ascending display order of the I frames in the code streams.

Preferably, the encoder creating module includes: the system comprises a node initialization module, a thread running setting module, a coding kernel creating module, a coding kernel initialization module and an index variable updating module;

the node initialization module is used for initializing a node index variable i to be 0;

the thread running setting module calls pthread_setaffinity_np to set the thread of the encoder to run on the logical cores of node i;

the coding kernel creating module is used for creating a coding kernel i and binding the coding kernel to node i;

the coding kernel initialization module initializes the coding kernel, including the allocation of an original frame memory and a reconstructed frame memory;

and the index variable updating module adds 1 to the node index variable, namely i = i + 1.

Preferably, the video encoding module includes:

the video frame acquisition module is used for acquiring the gop id of the current video frame and updating the gop id, wherein the gop id is initialized to be 0, and the maximum value is M-1;

the video frame coding module is used for copying an original frame image to an original frame memory of a coding kernel of the node i for a video frame with the current gop id being i; the encoding kernel on scheduling node i encodes the current video frame.

In order to solve the above technical problem, the present invention also provides an electronic device, including: at least one processor and memory; the memory stores computer-executable instructions; the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the method for video coding optimization.

In order to solve the above technical problem, the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the video coding optimization method is implemented.

Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:

the invention improves coding parallelism and avoids the remote memory access of conventional software encoding.

When the invention is applied to the HEVC and AVS3 encoders of ArcVideo, the overall transcoding speed of ultra-high-definition video such as 4K/8K improves by about 10%, with no loss of coding quality.

Drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 is a flow chart of the creation of the encoder of the present invention.

Fig. 3 is a schematic diagram of the coding framework of the present invention.

Wherein:

NUMA: Non-Uniform Memory Access;

I frame: intra-coded picture;

P frame: predictive-coded picture;

B frame: bidirectionally predicted picture;

GOP: group of pictures, the group of pictures between two I frames.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

Example 1

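The encoder is created by steps 1 to 5 described above, beginning with step 1, initializing the node index variable i to 0.

The video is encoded by the video coding method described above, in which each frame is sent to the coding kernel on the node that matches its gop id.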

The method for updating the gop id is as follows: after the first frame, every time an I frame or IDR frame is received, the gop id is incremented by 1, and if the gop id is equal to M, the gop id is assigned to 0.

Example 2

On the basis of embodiment 1, the hardware of the server CPU can be detected as follows. On a Linux system, for example, the /proc/cpuinfo file is read and filtered for the keywords "processor" and "physical id", which yields the cpu id and physical id of each entry. The cpu id is the id of the logical processor, and the physical id is treated as the NUMA node on which that logical processor is located. For example, a dual-node Intel(R) Xeon(R) Gold 6258R server has 112 logical processors and 2 nodes: node 0 (physical id 0) includes logical processors 0-27 and 56-83, and node 1 (physical id 1) includes logical processors 28-55 and 84-111.
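The filtering described above can be sketched as a small parser that groups logical processors by physical id. The function name `parse_cpuinfo` and the abbreviated sample text are assumptions for illustration; a real /proc/cpuinfo has many more fields per entry, and on some platforms it may not expose a "physical id" field at all.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def parse_cpuinfo(text: str) -> Dict[int, List[int]]:
    """Map each physical id to the sorted logical processor ids it contains."""
    nodes: Dict[int, List[int]] = defaultdict(list)
    current_cpu: Optional[int] = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue                      # blank separator lines have no colon
        key = key.strip()
        if key == "processor":
            current_cpu = int(value)      # id of the logical processor
        elif key == "physical id" and current_cpu is not None:
            nodes[int(value)].append(current_cpu)
            current_cpu = None
    return {node: sorted(cpus) for node, cpus in sorted(nodes.items())}


# Abbreviated, made-up sample in the /proc/cpuinfo layout.
SAMPLE = """\
processor\t: 0
model name\t: Example CPU
physical id\t: 0

processor\t: 1
model name\t: Example CPU
physical id\t: 1

processor\t: 2
model name\t: Example CPU
physical id\t: 0
"""

topology = parse_cpuinfo(SAMPLE)   # {0: [0, 2], 1: [1]}
total_nodes = len(topology)        # M in the text
```

On a Linux machine the same function could be fed the contents of /proc/cpuinfo, for example `parse_cpuinfo(open("/proc/cpuinfo").read())`.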

To create the encoder on a Linux system with a dual-node Intel(R) Xeon(R) Gold 6258R server (M = 2), the pthread_setaffinity_np function is used to set the affinity of the calling thread to a set of logical processors; subsequent code on that thread then runs only on those logical processors. pthread_setaffinity_np is first called so that subsequent code runs on all logical processors of node 0; coding kernel 0 is then created and its initialization function is called to complete initialization. Because the coding kernel is a module with complete coding functionality, its initialization function allocates the original frame memory, the reconstructed frame memory and the other memory of the coding kernel. This ensures that coding kernel 0 runs only on node 0 and that the original frame, reconstructed frame and other memory it accesses lies in the physical address range of node 0. Similarly, pthread_setaffinity_np is called so that subsequent code runs on all logical processors of node 1, coding kernel 1 is created, and its initialization function is called to complete initialization.

When a new frame is to be coded, the current gop id is acquired first. The gop id is initialized to 0 and its maximum value is M-1 (M is the number of CPU nodes). The gop id update strategy is as follows: after the first frame, every time an I frame or IDR frame is received, the gop id is incremented by 1, and if the gop id is equal to M, the gop id is reset to 0. If the gop id of the current frame is equal to i, the current frame is sent to coding kernel i on node i, which copies the original image into its own original frame memory and then encodes it normally.
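A minimal sketch of this dispatch step follows, assuming one kernel object per node. `CodingKernel`, `encode` and `dispatch` are hypothetical names, the "encoding" only records which frames each kernel received, and the loop is sequential, whereas a real encoder would run the kernels concurrently on their own nodes.

```python
from typing import List, Sequence, Tuple


class CodingKernel:
    """Stand-in for the coding kernel bound to one node (hypothetical)."""

    def __init__(self, node: int, frame_bytes: int) -> None:
        self.node = node
        self.original_frame = bytearray(frame_bytes)  # kernel-local buffer
        self.encoded: List[int] = []                  # indices of frames received

    def encode(self, index: int, image: bytes) -> None:
        # Copy the original image into the kernel's own original frame memory,
        # so later accesses during encoding stay within this kernel's buffer.
        self.original_frame[:len(image)] = image
        self.encoded.append(index)


def dispatch(frames: Sequence[Tuple[str, bytes]], kernels: List[CodingKernel]) -> None:
    """Send each frame to the kernel whose index equals the frame's gop id."""
    m = len(kernels)
    gop_id = 0
    for index, (frame_type, image) in enumerate(frames):
        if index > 0 and frame_type in ("I", "IDR"):
            # Increment, then reset to 0 on reaching M (same as modulo M).
            gop_id = (gop_id + 1) % m
        kernels[gop_id].encode(index, image)


frames = [("IDR", b"aaaa"), ("P", b"bbbb"), ("I", b"cccc"),
          ("P", b"dddd"), ("I", b"eeee")]
kernels = [CodingKernel(0, 4), CodingKernel(1, 4)]
dispatch(frames, kernels)
# kernels[0].encoded == [0, 1, 4]; kernels[1].encoded == [2, 3]
```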

Example 3

On the basis of the above embodiments, this embodiment provides a video coding optimization system, which is applied to a NUMA architecture and includes a server detection module, an encoder creation module, a video coding module and a code stream synthesis module;

The encoder creation module includes: the system comprises a node initialization module, a thread running setting module, a coding kernel creating module, a coding kernel initialization module and an index variable updating module;

The video encoding module includes the video frame acquisition module and the video frame coding module described above.

Example 4

On the basis of the above embodiment, the present embodiment provides an electronic device, which includes: at least one processor and memory;

the memory stores computer-executable instructions; the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the video coding optimization method.

Example 5

On the basis of the above embodiments, the present embodiment provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the video coding optimization method is implemented.

Claims

1. A video coding optimization method is applied to a NUMA architecture, and is characterized by comprising the following steps:

2. The method of claim 1, wherein the encoder is created by a method comprising:

step 1, initializing a node index variable i to be 0;

3. The method of claim 1, wherein the video coding method comprises:

4. The video coding optimization method of claim 3, wherein the gop id update mode is as follows: after the first frame, every time an I frame or IDR frame is received, the gop id is incremented by 1, and if the gop id is equal to M, the gop id is assigned to 0.

5. A video coding optimization system is applied to a NUMA architecture, and is characterized in that: the system comprises a server detection module, an encoder creation module, a video coding module and a code stream synthesis module;

and the code stream synthesis module is used for splicing the code streams into complete code streams in sequence according to the display sequence of the I frames in the code streams from low to high for the coded output video code streams.

6. The video coding optimization system of claim 5, wherein the encoder creation module comprises: the system comprises a node initialization module, a thread running setting module, a coding kernel creating module, a coding kernel initialization module and an index variable updating module;

the coding kernel creating module is used for creating a coding kernel i, and the coding kernel is bound on the node i at the moment;

7. The video coding optimization system of claim 5, wherein the video coding module comprises:

8. An electronic device, comprising: at least one processor and memory; the memory stores computer-executable instructions; the at least one processor executing computer-executable instructions stored by the memory causes the at least one processor to perform a method of video coding optimization as claimed in any one of claims 1 to 4.

9. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement a video coding optimization method as claimed in any one of claims 1 to 4.