CN112804471A - Video conference method, conference terminal, server and storage medium - Google Patents


Info

Publication number
CN112804471A
Authority
CN
China
Prior art keywords
video
code stream
decoding
conference
server
Legal status
Pending
Application number
CN201911115565.3A
Other languages
Chinese (zh)
Inventor
曹泊
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Application filed by ZTE Corp
Priority application: CN201911115565.3A
Related PCT application: PCT/CN2020/129049 (published as WO2021093882A1)
Publication: CN112804471A
Legal status: Pending

Classifications

    • H04N 7/15: Conference systems
    • H04L 12/1827: Network arrangements for conference optimisation or adaptation
    • H04N 21/234309: Reformatting of video elementary streams by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
    • H04N 21/234345: Reformatting performed only on part of the stream, e.g. a region of the image or a time segment
    • H04N 21/4312: Generation of visual interfaces involving specific graphical features, e.g. screen layout
    • H04N 21/440218: Reformatting for real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4


Abstract

Embodiments of the invention provide a video conference method, a conference terminal, a server and a storage medium. During a video conference, after the server receives the video code streams sent by the video sources in the conference, it decodes, picture-synthesizes and re-encodes only the video code streams of some of the video sources to form a composite code stream, which it sends to the conference terminals; at the same time, the server sends the video code streams of the remaining video sources to each conference terminal as independent code streams, so that the conference terminals decode and display both the composite code stream and the independent code streams. This lowers the requirement on the coding and decoding capability of the server side. Because video code streams beyond the server's processing capacity are sent directly to the conference terminals, the processing resources on the conference terminal side are fully utilized, the delay of video pictures on the conference terminal side is reduced, the fluency of the video conference is improved, and the user experience is enhanced.

Description

Video conference method, conference terminal, server and storage medium
Technical Field
The present invention relates to the field of video conferencing technologies, and in particular, to a video conferencing method, a conference terminal, a server, and a storage medium.
Background
In a video conference in a cloud conference system, each conference terminal first encodes its local video data and sends the encoded data to an MCU (Multipoint Control Unit) server. After receiving the video code streams of the conference terminals, the MCU server first decodes them, then performs multi-picture synthesis, and finally encodes the video data of the multi-picture composition and sends it to each conference terminal participating in the video conference. Each conference terminal receives the video code stream from the MCU server, decodes the data and displays the picture.
It is foreseeable that when many conference terminals participate in a video conference, the MCU server has to process the video code streams of multiple video sources simultaneously: it decodes the video code streams received from all video sources, synthesizes the video pictures, and then encodes the synthesized picture. This requires a high processing capability of the MCU server; once the performance of the MCU server is insufficient, the video pictures on the conference terminal side are delayed, which affects the conference experience of the participating users.
Disclosure of Invention
Embodiments of the invention provide a video conference method, a conference terminal, a server and a storage medium, mainly to solve the following technical problem: the severe delay of video pictures seen at the conference terminals, and the resulting poor experience of the participating users, caused by insufficient processing performance of the MCU server.
To solve the foregoing technical problem, an embodiment of the present invention provides a video conference method, including:
receiving a composite code stream and independent code streams sent by a server, wherein the composite code stream is formed by the server decoding, picture-synthesizing and re-encoding the video code streams of some of the video sources in the video conference, and an independent code stream is a video code stream that the server receives from one of the other video sources and forwards to the conference terminal;
decoding the composite code stream and the independent code stream respectively;
and displaying the video pictures corresponding to the composite code stream and the independent code stream.
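Only as an illustration (the text above prescribes steps, not an implementation), the terminal-side flow can be pictured as a small receive-decode-display loop. The following Python sketch is not part of the patent; the stream structure and the decoder and renderer stubs are hypothetical names.
```python
# Illustrative sketch of the terminal-side method; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class VideoStream:
    tag: str        # identification attached by the server, e.g. "composite" or "far5"
    payload: bytes  # encoded video data

def decode(stream: VideoStream) -> bytes:
    """Stub standing in for a real video decoder call."""
    return stream.payload

def display(region: str, frame: bytes) -> None:
    """Stub standing in for rendering a decoded picture into a screen region."""
    print(f"displaying {len(frame)} bytes in region '{region}'")

def handle_received_streams(streams: list[VideoStream]) -> None:
    # Decode the composite stream and each independent stream, then display each picture.
    for stream in streams:
        display(stream.tag, decode(stream))
```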
The embodiment of the invention also provides a video conference method, which comprises the following steps:
receiving a video code stream sent by a video source in the video conference;
and decoding, picture-synthesizing and re-encoding the video code streams of some of the video sources to obtain a composite code stream, sending the composite code stream to each conference terminal, and forwarding the video code streams of the remaining video sources to each conference terminal as independent code streams.
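Likewise only as a sketch, and under the assumption of stubbed codec and transport calls, the server-side method amounts to partitioning the incoming streams into a composed subset and a forwarded subset:
```python
# Illustrative sketch of the server-side method; codec and transport calls are stubs.
def decode(payload: bytes) -> bytes: return payload                  # stub decoder
def compose(frames: list) -> bytes: return b"".join(frames)          # stub picture synthesis
def encode(picture: bytes) -> bytes: return picture                  # stub encoder
def send(terminal: str, tag: str, data: bytes) -> None:
    print(f"-> {terminal}: {tag} ({len(data)} bytes)")

def relay(streams: dict, compose_sources: set, terminals: list) -> None:
    """Compose some sources into one re-encoded stream; forward the rest untouched."""
    members = [decode(data) for src, data in streams.items() if src in compose_sources]
    composite = encode(compose(members))
    for terminal in terminals:
        send(terminal, "composite", composite)
        for src, data in streams.items():
            if src not in compose_sources:
                send(terminal, src, data)   # independent code stream, forwarded as received
```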
The embodiment of the invention also provides a conference terminal, which comprises a first processor, a first memory and a first communication bus;
the first communication bus is used for realizing connection communication between the first processor and the first memory;
the first processor is configured to execute one or more programs stored in the first memory to implement the steps of the first video conferencing method described above.
The embodiment of the invention also provides a server, which comprises a second processor, a second memory and a second communication bus;
the second communication bus is used for realizing connection communication between the second processor and the second memory;
the second processor is configured to execute one or more programs stored in the second memory to implement the steps of the second video conferencing method described above.
The embodiment of the present invention further provides a storage medium, in which at least one of a first video conference program and a second video conference program is stored; the first video conference program can be executed by one or more processors to implement the steps of the first video conference method described above, and the second video conference program can be executed by one or more processors to implement the steps of the second video conference method described above.
The invention has the beneficial effects that:
According to the video conference method, conference terminal, server and storage medium provided by the embodiments of the invention, during the video conference, after the server receives the video code streams sent by the video sources, it decodes, picture-synthesizes and re-encodes only the video code streams of some of the video sources to form a composite code stream and sends it to the conference terminals, while it also sends the video code streams of the other video sources to each conference terminal as independent code streams, so that the conference terminals decode and display both the composite code stream and the independent code streams. Because video code streams beyond the server's processing capacity are sent directly to the conference terminals, the processing resources on the conference terminal side are fully utilized, the delay of video pictures on the conference terminal side is reduced, the fluency of the video conference is improved, and the user experience is enhanced.
Additional features and corresponding advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is an interaction flowchart of a video conference method according to an embodiment of the present invention;
fig. 2 is an interaction flowchart of a video conference method according to a second embodiment of the present invention;
fig. 3 is a schematic diagram of a conference terminal displaying video frames of video sources according to a second embodiment of the present invention;
fig. 4 is another schematic diagram of a conference terminal displaying video frames of video sources according to a second embodiment of the present invention;
fig. 5 is a flowchart illustrating a server processing a video bitstream of a partial video source to form a composite bitstream according to a second embodiment of the present invention;
fig. 6 is an interaction flowchart of a video conference method according to a third embodiment of the present invention;
fig. 7 is a schematic diagram of a video conference picture layout of a conference terminal according to a third embodiment of the present invention;
fig. 8 is another schematic diagram of a video conference screen layout of a conference terminal according to a third embodiment of the present invention;
fig. 9 is a schematic hardware structure diagram of a conference terminal according to a fourth embodiment of the present invention;
fig. 10 is a schematic hardware configuration diagram of a server according to a fourth embodiment of the present invention;
fig. 11 is a schematic diagram of a video conference system provided in the fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Embodiment one:
In the related art, insufficient performance on the MCU server side makes the MCU server inefficient when decoding, picture-synthesizing and re-encoding the video code streams from the video sources, so the video pictures displayed by the conference terminals in the video conference are severely delayed, which affects the conference experience of the participating users. To solve this problem, this embodiment provides a video conference method; please refer to the interaction flowchart shown in fig. 1:
S102: the server receives the video code streams sent by the video sources in the video conference.
In this embodiment, the server may be an MCU server, which is essentially a multimedia information switch: it performs multipoint calling and connection to implement functions such as video broadcasting, video source selection, audio mixing and data broadcasting, and completes the switching of signals between the terminals.
It can be understood that after the initiator of the video conference initiates the conference to the server through its own conference terminal, the server notifies the corresponding conference terminals and lets them join the video conference. During the video conference, the media capture device on each conference terminal side, such as a camera, captures the image information on that side to form a video. The conference terminal encodes the video captured by the media capture device to form the video code stream of the local end, and then sends that video code stream to the server.
It should be noted that, in general, the media capture device on the conference terminal side may include not only a camera but also a microphone or the like for capturing the audio information of the participating users, so the video collected by the conference terminal comprises both image information and audio information.
A video conference usually has multiple participating users, so the server receives video code streams sent by multiple conference terminals in the video conference; from the server's perspective, these conference terminals providing video code streams are the video sources. Of course, in some conference scenarios some participating users may turn off their cameras, i.e. provide no local image information during the conference, in which case their conference terminals are not video sources.
S104: the server decodes, picture-synthesizes and re-encodes the video code streams of some of the video sources to obtain a composite code stream, sends the composite code stream to each conference terminal, and forwards the video code streams of the remaining video sources to each conference terminal as independent code streams.
After receiving the video code streams sent by the video sources, the server may decode, picture-synthesize and re-encode only some of them, leaving the video code streams of the other video sources unprocessed. For example, if a video conference includes four video sources, the server may decode, picture-synthesize and re-encode the video code streams of only three of them to form a composite code stream, while the remaining video code stream stays independent. Of course, in other examples the server may process only the video code streams of two of the video sources, so that the remaining two video code streams stay independent.
It should be understood that a video code stream not picture-synthesized by the server contains only the video picture of a single conference terminal side, whereas the composite code stream has undergone picture synthesis and contains the video pictures of at least two conference terminal sides. Relative to the composite code stream, a video code stream that the server has not decoded, picture-synthesized and re-encoded is therefore referred to herein as an "independent code stream".
After processing the received video code streams into a composite code stream, the server can send the composite code stream to each conference terminal so that the corresponding video picture is displayed there. As for the independent code streams, the server also sends them to each conference terminal so that each conference terminal can display the corresponding video pictures. Based on the composite code stream and the independent code streams, the image information of every video source side can be displayed on the conference terminal. It should be noted that there is no strict timing relationship between sending the composite code stream and sending the independent code streams: in some scenarios the server sends the composite code stream first and then the independent code streams, in others the independent code streams are transmitted before the composite code stream, and in some examples they are transmitted to the conference terminal side at the same time. In fact, to reduce the picture delay on the conference terminal side and let the conference terminal process the video code streams in time so that pictures are displayed as early as possible, the server can transmit any video code stream immediately once it is ready to be sent, without waiting for the other video code streams.
For example, in one case, while the server is decoding, picture-synthesizing and re-encoding the received video code streams a1, c1 and d1, it receives the video code stream b1; since b1 needs no additional processing, the server can transmit it directly to each conference terminal side as an independent code stream, and transmit the composite code stream to the conference terminals once its generation is finished. As another example, five parties a2, b2, c2, d2 and e2 are in a video conference, and by configuration the server decodes, picture-synthesizes and re-encodes the video code streams of the three parties b2, c2 and d2, while the video code streams of a2 and e2 each serve as independent code streams. If the first video code stream the server receives is that of a2, the server can send it directly to each conference terminal side as an independent code stream. The server then receives the video code streams of b2, c2, e2 and d2 in that order: after receiving the streams of b2 and c2 it can decode them first, and after receiving the stream of e2 it forwards it to the conference terminals. After receiving the video code stream of d2, the server decodes it, performs picture synthesis with the decoding results of the b2 and c2 streams, re-encodes the synthesized picture to obtain the composite code stream, and sends it to the conference terminal side.
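The timing behavior of this example can be sketched as an arrival-driven handler: independent streams are forwarded the moment they arrive, while composite members are decoded and buffered until all of them are present. This is an illustrative sketch, not the patent's implementation; the member set and the stubbed calls are assumptions.
```python
# Arrival-driven forwarding sketch for the a2..e2 example; all names are illustrative.
COMPOSITE_MEMBERS = {"b2", "c2", "d2"}        # sources the server agreed to compose

def decode(payload: bytes) -> bytes: return payload                   # stub decoder
def compose(frames: dict) -> bytes: return b"".join(frames.values())  # stub synthesis
def encode(picture: bytes) -> bytes: return picture                   # stub encoder
def broadcast(tag: str, data: bytes) -> None: print("send", tag, len(data))

decoded_members: dict = {}

def on_stream_arrival(source: str, payload: bytes) -> None:
    if source not in COMPOSITE_MEMBERS:
        broadcast(source, payload)            # a2, e2: forwarded immediately as independents
        return
    decoded_members[source] = decode(payload) # b2, c2 decoded while waiting for d2
    if COMPOSITE_MEMBERS <= decoded_members.keys():
        # The last member (d2) completes the picture: synthesize, re-encode, broadcast.
        broadcast("composite", encode(compose(decoded_members)))
```
Buffering decoded frames rather than raw payloads mirrors the text's note that b2 and c2 can be decoded before d2 arrives.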
S106: the conference terminal decodes the composite code stream and the independent code streams respectively.
After receiving the video code streams sent by the server, the conference terminal can decode them. It can be understood that a conference terminal in the related art only needs to decode one video code stream, namely the composite code stream containing the video pictures of all video sources, so its decoding method simply matches the encoding method of the server side and is fixed and unique. In this embodiment, the conference terminal has to decode at least two video code streams (one composite code stream and at least one independent code stream), and the independent and composite code streams are encoded by different parties: the composite code stream is encoded by the server, while each independent code stream is encoded by its originating conference terminal. In some examples the conference terminals and the server use the same encoding method, so the conference terminal can apply the same decoding method to the independent code streams and the composite code stream received from the server; that is, it does not need to distinguish between the video code streams and apply different decoding methods.
However, in many cases, the encoding methods of the conference terminal and the server may be different, and even the encoding methods adopted by different conference terminals are not completely the same. In these cases, when the conference terminal decodes the video stream received by itself, different decoding methods are also required for different video streams.
In some examples of this embodiment, the server and each conference terminal may agree in advance on the coding and decoding method for each video code stream, so that when the server transmits a video code stream to a conference terminal it only needs to carry identification information in that stream to indicate which video code stream it is. For example, the server and the conference terminal agree to decode the composite code stream with a first decoding method and the independent code streams with a second decoding method; then, as long as the server attaches the corresponding identification information when sending a video code stream, the conference terminal can determine from that identification information which decoding method the video code stream requires.
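A minimal way to realize such a pre-agreement on the terminal side is a lookup table from identification information to decoding method, sketched below; the tag values and the two decoder stubs are assumptions, not defined by the patent.
```python
# Sketch of identification-based decoder selection on the terminal side (tags assumed).
def first_decoding_method(payload: bytes) -> bytes:   # stub: agreed for the composite stream
    return payload

def second_decoding_method(payload: bytes) -> bytes:  # stub: agreed for independent streams
    return payload

DECODER_BY_TAG = {
    "composite": first_decoding_method,
    "independent": second_decoding_method,
}

def decode_by_tag(tag: str, payload: bytes) -> bytes:
    """Pick the pre-agreed decoding method from the stream's identification information."""
    try:
        return DECODER_BY_TAG[tag](payload)
    except KeyError:
        raise ValueError(f"no agreed decoding method for tag {tag!r}")
```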
S108: the conference terminal displays the video pictures corresponding to the composite code stream and the independent code streams.
After finishing decoding a received video code stream, the conference terminal can display the corresponding video picture on the screen. In some examples of this embodiment, the conference terminal may display the video picture of a video code stream immediately after that stream is decoded, without waiting until both the independent code streams and the composite code stream have been decoded.
In the video conference method provided by this embodiment, the server may decode, picture-synthesize and re-encode the video code streams of only some of the video sources instead of all of them, and the unprocessed video code streams can be sent directly to each conference terminal for the terminal itself to decode and display. This makes full use of the processing resources on the conference terminal side, reduces the processing load of the server, lowers the delay of the video pictures, and improves the quality of the video conference.
Embodiment two:
According to the introduction in the foregoing embodiment, in the embodiment of the invention the server may decode, picture-synthesize and re-encode the video code streams of only some of the video sources to form a composite code stream and send it to each conference terminal, while it sends the remaining video code streams directly to each conference terminal as independent code streams.
It can be understood that how many video code streams the composite code stream contains, which coding and decoding methods the composite and independent code streams use, and even which video sources' code streams make up the composite code stream, can all be preset. For example, in some cases the server and the conference terminals belong to the same manufacturer, and designers can fix these settings in the conference terminal and the server before the equipment leaves the factory; alternatively, programmers can deliver them in an upgrade package pushed to the server and conference terminal sides respectively.
Of course, in some examples of this embodiment, the above content may instead be determined by the server according to the situation of the current video conference. The video conference method is described below with reference to the interaction flowchart shown in fig. 2:
S202: the conference terminal sends its video codec capability parameters to the server.
In this embodiment, after entering the video conference the conference terminal may send its video codec capability parameters to the server. It can be understood that the conference terminal may report the parameters actively, or send them after receiving a request from the server.
The video codec capability parameters represent the video coding and decoding capability of the conference terminal itself. In an example of this embodiment they include encoding parameters and decoding parameters: the encoding parameters include the encoding capability of the conference terminal, and the decoding parameters include at least one of the decoding capability, conference speed, frame rate and format information of the conference terminal.
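One possible shape for these parameters is sketched below; the field names and types are assumptions made purely for illustration.
```python
# One possible shape for the video codec capability parameters; field names are assumed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodecCapabilityParams:
    encoding_capability: str                   # encoding parameters: the terminal's encoders
    decoding_capability: Optional[str] = None  # decoding parameters are "at least one of":
    conference_speed: Optional[int] = None     #   term as used in the text; unit assumed
    frame_rate: Optional[float] = None
    picture_format: Optional[str] = None       #   e.g. a hypothetical "1080p"
```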
S204: the server determines the decoding display policy of the video conference according to the video codec capability parameters of each conference terminal and the video codec capability of the server.
After the server obtains the video codec capability parameters of each conference terminal, it can determine the decoding display policy of the video conference based on those capabilities and on its own video codec capability. Based on the number of conference terminals in the video conference, the way each conference terminal encodes its local video, and so on, the server can determine the processing requirement for turning the video code streams of all video sources into a composite code stream. The server then judges whether its own video codec capability meets this requirement. If it does, the server can decode, picture-synthesize and re-encode the video code streams of all video sources, i.e. process them all into one composite code stream. If, however, the server determines that its video codec capability is lower than the requirement, i.e. its processing capability is not sufficient to decode, picture-synthesize and re-encode the video code streams of all video sources, the server decides that in the subsequent course of the video conference it will compose only some of the video code streams. How many and which video code streams it processes, and which encoding method the composite code stream uses, must be determined by further taking the video codec capabilities of the conference terminals into account: the server must ensure that the encoding methods of all video code streams it sends to each conference terminal, including the composite code stream and each independent code stream, are supported by that conference terminal; otherwise, a conference terminal that cannot decode a received video code stream will fail to display the video picture of at least one conference terminal side.
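The feasibility judgment described here can be pictured as comparing a processing requirement against the server's capacity. In the sketch below, the cost model of one capacity unit per source stream is a simplifying assumption, not something the patent specifies.
```python
# Sketch of the server's feasibility check; the per-stream cost model is an assumption.
def plan_composite_size(n_sources: int, server_capacity: float,
                        per_stream_cost: float = 1.0) -> int:
    """Return m, the number of source streams the server itself will compose."""
    required = n_sources * per_stream_cost   # decode + synthesize + re-encode all n sources
    if server_capacity >= required:
        return n_sources                     # capable: compose everything, as in related art
    # Otherwise compose only as many streams as capacity allows; the rest are forwarded
    # to the terminals as independent code streams.
    return int(server_capacity // per_stream_cost)
```
For instance, plan_composite_size(6, 4.0) returns 4, the four-composed, two-forwarded split seen in the later fig. 4 example.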
In some examples of this embodiment, the decoding display policy determined by the server includes a decoding instruction, which tells the conference terminal how to decode each video code stream it receives, i.e. the decoding methods for the composite code stream and the independent code streams. For example, the server instructs the conference terminal to decode video code streams carrying identification information "1" with a first decoding method, those carrying "2" with a second decoding method, and those carrying "3" with a third decoding method; then, when the server subsequently sends video code streams to the conference terminals, each stream only has to carry its identification information for the conference terminal to determine its decoding method from the decoding display policy.
In some other examples of this embodiment, the decoding display policy further includes a display indication, which informs the conference terminal of the mapping relationship between each video code stream sent by the server and a display area, so that after receiving and decoding a video code stream the conference terminal can determine from the display indication in which area of the screen the corresponding video picture should be displayed.
Of course, it may be understood that in some examples the display indication is not a necessary part of the decoding display policy, because in these examples the conference terminal can set up a corresponding number of display areas according to the number of video code streams it will receive. For example, if after negotiation with the server the conference terminal determines that it will receive k video code streams in this conference, the conference terminal side can set up k display areas, and after receiving and decoding a video code stream it randomly selects one of the not-yet-filled display areas for that stream's video picture. However, since the video picture of the composite code stream contains the video pictures of at least two conference terminal sides, if the composite code stream cannot be guaranteed a sufficiently large display area, the people and other content in its video picture will appear very small and the user will strain to see details. As shown in fig. 3, there are three video sources a3, b3 and c3 in the video conference; the server processes the video code streams of a3 and b3 into a composite code stream, and c3 remains an independent code stream. Under this display scheme, each conference terminal side needs only two display areas, with the composite code stream occupying one and the independent code stream the other, so the video pictures of a3 and b3 must share a single display area and are only half the size of the picture of c3. This not only makes the a3 and b3 pictures hard for users to see clearly, but also does not match users' video conferencing habits.
Therefore, in more examples of this embodiment, the decoding display policy includes the mapping relationship between each video code stream and a display area. In this way the server can guarantee the display effect of the video pictures of the video sources on the conference terminal side, for example that all video sources are displayed in display areas of the same size, or that the video pictures of all video sources are spliced together to fill one whole region, as shown in fig. 4: six conference terminals a4, b4, c4, d4, e4 and f4 participate in the video conference and each has a camera, so there are six video sources. The server processes the video code streams of a4, b4, c4 and d4 into a composite code stream, and the video code streams of the other two video sources serve as independent code streams. The server and each conference terminal agree on the display area corresponding to each video code stream: the first area 401 displays the video picture corresponding to the composite code stream, with the arrangement of its sub-pictures determined by the server, while the second area 402 and the third area 403 display the video pictures corresponding to e4 and f4 respectively. With this spliced display, the pictures of the six video sources are shown together in one region rather than scattered across the screen, and in fig. 4 the video pictures of all the video sources have the same size, which matches users' video-viewing habits.
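For the fig. 4 example, such an agreed mapping could be recorded as a simple table, as in the sketch below; the region identifiers are hypothetical.
```python
# Sketch of the agreed display mapping for the fig. 4 layout; region names are hypothetical.
REGION_BY_TAG = {
    "composite": "region_401",  # sub-pictures of a4, b4, c4, d4, arranged by the server
    "e4": "region_402",
    "f4": "region_403",
}

def region_for(tag: str) -> str:
    """Look up the display area agreed for a given stream tag."""
    return REGION_BY_TAG[tag]
```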
S206: the server sends the decoding display policy to each conference terminal.
After determining the decoding display policy, the server can send it to each conference terminal, so that every conference terminal knows how this video conference will be carried out.
S208: the server decodes and picture-synthesizes the video code streams of m of the n video sources in the video conference, and re-encodes the synthesized picture according to the encoding method corresponding to the decoding method in the decoding display policy, forming the composite code stream.
In this embodiment, assume there are n video sources and that during negotiation with the conference terminals the server has determined that it will process the video code streams of m video sources to form the composite code stream, while the remaining n-m video code streams are sent to each conference terminal as independent code streams, with m smaller than n.
Thus, after receiving the video code streams used to form the composite code stream, the server can decode, picture-synthesize and re-encode them into the composite code stream. In some examples of this embodiment, the server and the conference terminals determine in the negotiation stage which video sources' code streams form the composite code stream, and the server must wait until the code streams of those video sources are received before generating it. In other examples, the decoding display policy does not specify which video sources' code streams constitute the composite code stream, so the server can decide on the fly, according to the actual situation of the video conference, which streams to combine. For example, the server may select the first m video code streams, in the order it receives them from the video sources, to form the composite code stream. Please refer to the flowchart of fig. 5, which illustrates how the server decodes, picture-synthesizes and re-encodes the video code streams of some of the video sources to obtain the composite code stream (a code sketch follows the listed steps):
S502: acquire the video code streams of the first m video sources, in the order in which the video code streams are received from the video sources;
S504: decode the video code streams of the m video sources;
S506: perform picture synthesis on the decoding results corresponding to the m video code streams to obtain a synthesized picture;
S508: encode the synthesized picture according to the encoding method corresponding to the decoding method in the decoding display policy, to obtain the composite code stream.
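Read together, steps S502-S508 amount to a first-m-by-arrival pipeline. The sketch below is illustrative only, with the codec calls stubbed out:
```python
# Sketch of steps S502-S508; decode/compose/encode stand in for real codec calls.
def decode(payload: bytes) -> bytes: return payload                  # stub (used in S504)
def compose(frames: list) -> bytes: return b"".join(frames)          # stub (used in S506)
def encode(picture: bytes) -> bytes: return picture                  # stub (used in S508)

def build_composite(arrivals: list, m: int) -> bytes:
    """arrivals: (source, payload) pairs in the order the server received them."""
    first_m = arrivals[:m]                                # S502: first m sources by arrival
    frames = [decode(payload) for _, payload in first_m]  # S504: decode each stream
    picture = compose(frames)                             # S506: picture synthesis
    return encode(picture)                                # S508: encode per agreed method
```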
S210: the server sends the composite code stream, together with the remaining n-m video code streams received from the video sources as independent code streams, to each conference terminal.
On the one hand, the server sends the composite code stream it generated to each conference terminal; on the other hand, it sends the video code streams it received as independent code streams to each conference terminal. The sending order of these video code streams was described in detail in the foregoing embodiment and is not repeated here.
S212: the conference terminal decodes and displays the video code streams according to the decoding display policy.
After receiving the video code streams sent by the server, the conference terminal can decode the composite code stream and the independent code streams with the decoding methods indicated by the decoding instruction in the decoding display policy, and then fill the video pictures of the video code streams into the corresponding display areas for display, according to the display indication in the policy.
In the video conference method provided by this embodiment, before the video conference formally starts, the server obtains the video codec capability parameters of each conference terminal and then judges, based on its own video codec capability and those of the conference terminals, whether it can process the video code streams of all video sources in the video conference into a composite code stream. If it can, the server may still process the video code streams of all video sources in the subsequent course of the conference, as in the video conference scheme of the related art. If, however, the server determines that its video codec capability is lower than the processing requirement for composing the video code streams of all video sources, it processes only the video code streams of some of the video sources and makes full use of the decoding resources on the conference terminal sides, thereby reducing the delay of video pictures on each conference terminal side and improving the conference experience of the participating users.
Embodiment three:
In order to make the advantages and details of the video conference method provided by the embodiments of the present invention clearer to those skilled in the art, this embodiment describes the method in detail with a concrete example; please refer to fig. 6:
S602: the conference initiator creates a conference on the MCU management platform.
It will be appreciated that the conference initiator is in fact also a conference terminal in the video conference, typically the one operated by the conference host; creating a conference on the MCU management platform is equivalent to opening a network conference room on the platform.
S604: the MCU notifies the conference terminals that need to participate to enter the video conference;
S606: each conference terminal reports its video codec capability parameters to the MCU;
S608: the MCU determines its own optimal decoding capability and the optimal decoding capability of each conference terminal, and notifies the conference terminals;
When the conference starts, the MCU selects a picture layout by default and notifies the conference terminals of it. The control message sent by the MCU to the conference terminals contains the number of sub-pictures in the multi-picture layout, the layout mode, and the content of each sub-picture (such as main video source, auxiliary video source, first main-video decoding, second video decoding, auxiliary video decoding, and the like).
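Purely for illustration, such a control message could be represented by a structure like the following sketch; the field names and example values are assumptions, not taken from the patent.
```python
# Possible structure of the layout control message; field names and values are assumed.
from dataclasses import dataclass

@dataclass
class SubPictureSpec:
    content: str   # e.g. "main video source" or "auxiliary video source"
    decoding: str  # e.g. "first main-video decoding" or "auxiliary video decoding"

@dataclass
class LayoutControlMessage:
    picture_count: int                # number of sub-pictures in the multi-picture layout
    layout_mode: str                  # e.g. a hypothetical "2x2" or "1+5" arrangement
    sub_pictures: list                # list of SubPictureSpec, one per sub-picture
```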
S610: the conference terminal encodes the local video according to the negotiated encoding mode and sends the encoded local video to the MCU;
S612: the MCU processes the received video code streams and then sends them to each conference terminal.
If the layout is a single picture, or a multi-picture layout (e.g. two or three pictures) entirely within the decoding capability of the MCU, the MCU decodes, picture-synthesizes and re-encodes the video code streams after receiving them, and then sends the result to the conference terminals. As shown in fig. 7, there are only three far-end (Far) video code streams and one near-end (Near) video code stream, and the decoding capability of the MCU is fully sufficient, so the MCU can take charge of decoding, picture-synthesizing and re-encoding all the video code streams to form a composite code stream, which it then sends to each conference terminal. In this case the video code stream sent out by the server contains only the composite code stream and no independent code stream.
If the picture layout is a multi-picture layout that exceeds the decoding capability of the MCU, the MCU by default selects some of the video code streams to form the composite code stream, according to the negotiated decoding capability and the order in which it receives the code streams: it decodes these video code streams, performs picture synthesis once decoding is finished, encodes the synthesized picture data to form the composite code stream, and sends the composite code stream to each conference terminal. The remaining video code streams the MCU sends directly to each conference terminal with identification tags attached. As shown in fig. 8, a multi-picture video conference includes six video sources; after negotiation, the MCU processes the video code streams of four of them, and each conference terminal processes the video code streams of the remaining two. The video code streams of Far1, Far2, Far3 and Far4 are decoded by the MCU, picture-synthesized and then encoded to form the composite code stream, which is sent to the conference terminals; after receiving the Far5 and Near code streams, the MCU attaches the corresponding identification tags and sends them to each conference terminal.
S614: the conference terminal decodes and displays the received video code streams.
On the one hand, the conference terminal receives the composite code stream sent by the MCU, decodes it and displays it in the corresponding area; on the other hand, after receiving an independent code stream, the conference terminal decodes it and displays it in the corresponding area according to its identification tag.
In the multi-picture mode, according to the code stream information, the conference terminal fills the video picture corresponding to the composite code stream into the designated area, and fills the other independent code streams into their designated areas of the multi-picture layout according to their corresponding tags.
Embodiment four:
The present embodiment provides a storage medium that stores one or more computer programs readable, compilable and executable by one or more processors. In this embodiment the storage medium may store at least one of a first video conference program and a second video conference program, where the first video conference program can be executed by one or more processors to implement the conference-terminal-side flow of any of the video conference methods described in the foregoing embodiments, and the second video conference program can be executed by one or more processors to implement the server-side flow of any of the video conference methods described in the foregoing embodiments.
The present embodiment further provides a conference terminal, as shown in fig. 9: the conference terminal 90 includes a first processor 91, a first memory 92, and a first communication bus 93 for connecting the first processor 91 and the first memory 92, where the first memory 92 may be the storage medium storing the first video conference program, and the first processor 91 may read the first video conference program, compile and execute the flow of implementing the video conference method described in the foregoing embodiment:
the first processor 91 receives a composite code stream and an independent code stream sent by the server, wherein the composite code stream is formed by decoding, synthesizing and re-encoding the video code stream of a part of video sources in the video conference by the server, and the independent code stream is the video code stream which is forwarded to the conference terminal after being received by the server from other video sources. Subsequently, the first processor 91 decodes the composite code stream and the independent code stream, respectively, and displays a video picture corresponding to the composite code stream and the independent code stream.
In an example of this embodiment, before receiving the composite code stream and the independent code streams sent by the server, the first processor 91 further sends the video codec capability parameters of the conference terminal to the server, and receives the decoding display policy sent by the server. The decoding display policy is determined by the server according to its own video codec capability and the video codec capability parameters of each conference terminal in the video conference, and is used to indicate to the conference terminal the decoding and display manner for the composite code stream and each independent code stream.
In an example of this embodiment, the video codec capability parameter sent to the server includes an encoding parameter and a decoding parameter, and the encoding parameter includes the encoding capability of the conference terminal; the decoding parameters include at least one of decoding capability, conference speed, frame frequency and format information of the conference terminal.
Optionally, the decoding display policy includes a decoding instruction and a display instruction, where the decoding instruction is used to instruct the conference terminal to decode each video code stream; the display indication is used for indicating the mapping relation between each video code stream sent by the server and the display area. When the first processor 91 decodes the composite code stream and the independent code stream, respectively, the composite code stream and the independent code stream may be decoded according to a decoding manner indicated by the decoding instruction. When displaying the video pictures corresponding to the composite code stream and the independent code stream, the first processor 91 fills the video pictures of each video code stream into the corresponding display area according to the display instruction for displaying.
In this embodiment, a server is further provided, as shown in fig. 10: the server 100 includes a second processor 101, a second memory 102, and a second communication bus 103 for connecting the second processor 101 and the second memory 102, where the second memory 102 may be the storage medium storing the second video conference program, and the second processor 101 may read the second video conference program, compile and execute the flow on the server side for implementing the video conference method described in the foregoing embodiment:
the second processor 101 receives the video code stream sent by the video source in the video conference, then decodes, synthesizes and re-encodes the video code stream of a part of the video sources, obtains a synthesized code stream, sends the synthesized code stream to each conference terminal, and forwards the video code streams of the other video sources to each conference terminal as independent code streams.
Optionally, before the second processor 101 receives the video code stream sent by the video source in the video conference, the second processor may also obtain the video coding and decoding capability parameters of each conference terminal in the video conference, and then determine a decoding display policy of the video conference according to the video coding and decoding capability parameters of each conference terminal and the video coding and decoding capability of the server, where the decoding display policy may indicate a decoding display manner of the composite code stream and each independent code stream by each conference terminal. Subsequently, the second processor 101 sends the decoding display policy to each conference terminal.
In an example of this embodiment, before the second processor 101 decodes, picture-synthesizes, and re-encodes the video code streams of some of the video sources, it determines, according to the video codec capability parameters of each conference terminal and the video codec capability of the server, that the server's own codec capability is lower than what would be required to process the video code streams of all video sources in the video conference into a single composite code stream.
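A rough form of that check is sketched below, using pixel throughput per second as the capability metric; the metric itself is an assumption, since the embodiment does not fix a particular measure of codec capability:

```python
from typing import List, Tuple

def compositing_all_exceeds_capability(
        sources: List[Tuple[int, int, int]],  # (width, height, frame rate) per source
        server_pixel_rate_budget: int) -> bool:
    """True when decoding, synthesizing, and re-encoding every source would
    demand more pixel throughput than the server can sustain, which is the
    condition for falling back to partial compositing."""
    required = sum(w * h * fps for (w, h, fps) in sources)
    return required > server_pixel_rate_budget
```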
In one example of this embodiment, the composite code stream is formed from the video code streams of m video sources, where m is greater than or equal to 2. To obtain the composite code stream, the second processor 101 takes the video code streams of the first m video sources in the order in which the video code streams were received, decodes these m video code streams, and performs picture synthesis on the m decoding results to obtain a composite picture. The second processor 101 then encodes the composite picture in the encoding mode corresponding to the decoding mode in the decoding display policy, obtaining the composite code stream.
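The picture-synthesis step could, for example, tile the m decoded pictures into a grid on one canvas, as in the sketch below; the grid arrangement and the toy grayscale frames are assumptions, since the embodiment only requires that the m decoding results be combined into one composite picture:

```python
import math
from typing import List

Frame = List[List[int]]  # a decoded picture as rows of pixels (toy grayscale)

def synthesize(pictures: List[Frame],
               canvas_w: int = 1920, canvas_h: int = 1080) -> Frame:
    """Tile m decoded pictures into one composite picture."""
    m = len(pictures)
    cols = math.ceil(math.sqrt(m))
    rows = math.ceil(m / cols)
    cell_w, cell_h = canvas_w // cols, canvas_h // rows
    canvas: Frame = [[0] * canvas_w for _ in range(canvas_h)]
    for i, pic in enumerate(pictures):
        ox, oy = (i % cols) * cell_w, (i // cols) * cell_h
        for y in range(min(cell_h, len(pic))):
            row = pic[y]
            for x in range(min(cell_w, len(row))):
                canvas[oy + y][ox + x] = row[x]  # copy (not scale) into the cell
    return canvas
```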
This embodiment further provides a video conference system 11 which, referring to fig. 11, includes a server 100 and a plurality of conference terminals 90. The server 100 may be an MCU server, and the conference terminals may be implemented in various forms, including mobile terminals such as mobile phones, tablet computers, notebook computers, palmtop computers, PDAs (Personal Digital Assistants), navigation devices, wearable devices, smart bands, and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
With the conference terminal and server provided by this embodiment, during a video conference, after receiving the video code streams sent by the video sources, the server decodes, synthesizes, and re-encodes only the video code streams of some of the video sources into a composite code stream, which it sends to the conference terminals, while forwarding the video code streams of the remaining video sources to each conference terminal as independent code streams; each conference terminal then decodes and displays both the composite code stream and the independent code streams. Because video code streams beyond the server's processing capacity are sent directly to the conference terminals, the processing resources on the terminal side are fully utilized, the delay of video pictures on the terminal side is reduced, the fluency of the video conference is improved, and the user experience is enhanced.
It will be apparent to those skilled in the art that all or some of the steps of the methods, and the functional modules/units in the systems and devices disclosed above, may be implemented as software (for example, as program code executable by a computing device), firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned above does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed over computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media), and in some cases the steps shown or described may be performed in an order different from that described herein. As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, as is well known to those skilled in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a detailed description of embodiments of the present invention, but the present invention is not limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all such variations shall be considered to fall within the protection scope of the invention.

Claims (11)

1. A video conferencing method, comprising:
receiving a composite code stream and an independent code stream sent by a server, wherein the composite code stream is formed by the server decoding, picture synthesizing, and re-encoding the video code streams of some of the video sources in the video conference, and the independent code stream is a video code stream forwarded to the conference terminal by the server after being received from one of the other video sources;
decoding the composite code stream and the independent code stream, respectively; and
displaying the video pictures corresponding to the composite code stream and the independent code stream.
2. The video conferencing method of claim 1, wherein before the receiving of the composite code stream and the independent code stream sent by the server, the method further comprises:
sending video coding and decoding capability parameters of the conference terminal to the server;
and receiving a decoding display policy sent by the server, wherein the decoding display policy is determined by the server according to the video coding and decoding capability of the server and the video coding and decoding capability parameters of each conference terminal in the video conference, and is used to indicate how the conference terminal decodes and displays the composite code stream and each independent code stream.
3. The video conferencing method of claim 2, wherein the video codec capability parameters include encoding parameters and decoding parameters, the encoding parameters including the encoding capability of the conference terminal, and the decoding parameters including at least one of the decoding capability, conference access speed, frame rate, and format information of the conference terminal.
4. The video conferencing method of claim 2, wherein the decoding display policy includes a decoding indication and a display indication, the decoding indication being used to instruct the conference terminal to decode each video code stream, and the display indication being used to indicate the mapping relationship between each video code stream sent by the server and a display area;
the decoding of the composite code stream and the independent code stream respectively comprises:
decoding the composite code stream and the independent code stream in the decoding mode indicated by the decoding indication;
and the displaying of the video pictures corresponding to the composite code stream and the independent code stream comprises:
filling the video picture of each video code stream into the corresponding display area for display according to the display indication.
5. A video conferencing method, comprising:
receiving a video code stream sent by a video source in the video conference;
and decoding, picture synthesizing, and re-encoding the video code streams of some of the video sources to obtain a composite code stream, sending the composite code stream to each conference terminal, and forwarding the video code streams of the remaining video sources to each conference terminal as independent code streams.
6. The video conferencing method of claim 5, wherein before the receiving of the video code stream sent by the video source in the video conference, the method further comprises:
acquiring video coding and decoding capability parameters of each conference terminal in the video conference;
determining a decoding display policy for the video conference according to the video coding and decoding capability parameters of each conference terminal and the video coding and decoding capability of the server, wherein the decoding display policy indicates how each conference terminal decodes and displays the composite code stream and each independent code stream;
and sending the decoding display policy to each conference terminal.
7. The video conferencing method of claim 6, wherein before the decoding, picture synthesizing, and re-encoding of the video code streams of the partial video sources, the method further comprises:
determining, according to the video coding and decoding capability parameters of each conference terminal and the video coding and decoding capability of the server, that the codec capability of the server is lower than the processing requirement of processing the video code streams of all video sources in the video conference into a composite code stream.
8. The video conferencing method of any of claims 5 to 7, wherein the composite code stream is formed from the video code streams of m video sources, m being greater than or equal to 2, and the decoding, picture synthesizing, and re-encoding of the video code streams of the partial video sources to obtain a composite code stream comprises:
acquiring the video code streams of the first m video sources in the order in which the video code streams were received from the video sources;
decoding the video code streams of the m video sources;
performing picture synthesis on the decoding results corresponding to the m video code streams to obtain a composite picture;
and encoding the composite picture in an encoding mode corresponding to the decoding mode in the decoding display policy to obtain the composite code stream.
9. A conference terminal comprising a first processor, a first memory and a first communication bus;
the first communication bus is used for realizing connection communication between the first processor and the first memory;
the first processor is configured to execute one or more programs stored in the first memory to implement the steps of the video conferencing method as claimed in any of claims 1 to 4.
10. A server comprising a second processor, a second memory, and a second communication bus;
the second communication bus is used for realizing connection communication between the second processor and the second memory;
the second processor is configured to execute one or more programs stored in the second memory to implement the steps of the video conferencing method as claimed in any of claims 5 to 8.
11. A storage medium storing at least one of a first video conferencing program and a second video conferencing program, the first video conferencing program being executable by one or more processors to implement the steps of the video conferencing method of any of claims 1 to 4, and the second video conferencing program being executable by one or more processors to implement the steps of the video conferencing method of any of claims 5 to 8.
CN201911115565.3A 2019-11-14 2019-11-14 Video conference method, conference terminal, server and storage medium Pending CN112804471A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911115565.3A CN112804471A (en) 2019-11-14 2019-11-14 Video conference method, conference terminal, server and storage medium
PCT/CN2020/129049 WO2021093882A1 (en) 2019-11-14 2020-11-16 Video meeting method, meeting terminal, server, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911115565.3A CN112804471A (en) 2019-11-14 2019-11-14 Video conference method, conference terminal, server and storage medium

Publications (1)

Publication Number Publication Date
CN112804471A (en) 2021-05-14

Family

ID=75803917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911115565.3A Pending CN112804471A (en) 2019-11-14 2019-11-14 Video conference method, conference terminal, server and storage medium

Country Status (2)

Country Link
CN (1) CN112804471A (en)
WO (1) WO2021093882A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114679548A (en) * 2022-03-25 2022-06-28 航天国盛科技有限公司 Multi-picture synthesis method, system and device based on AVC framework

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3141778A1 (en) * 2022-11-04 2024-05-10 Streamwide Method for processing data streams from a conference session by a session server

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101257607B (en) * 2008-03-12 2010-06-09 中兴通讯股份有限公司 Multiple-picture processing system and method for video conference
CN101860715A (en) * 2010-05-14 2010-10-13 中兴通讯股份有限公司 Multi-picture synthesis method and system and media processing device
CN102572368A (en) * 2010-12-16 2012-07-11 中兴通讯股份有限公司 Processing method and system of distributed video and multipoint control unit
US8890929B2 (en) * 2011-10-18 2014-11-18 Avaya Inc. Defining active zones in a traditional multi-party video conference and associating metadata with each zone
US20130106988A1 (en) * 2011-10-28 2013-05-02 Joseph Davis Compositing of videoconferencing streams
US20130282820A1 (en) * 2012-04-23 2013-10-24 Onmobile Global Limited Method and System for an Optimized Multimedia Communications System

Also Published As

Publication number Publication date
WO2021093882A1 (en) 2021-05-20

Similar Documents

Publication Publication Date Title
EP3562163B1 (en) Audio-video synthesis method and system
US9351028B2 (en) Wireless 3D streaming server
US9154737B2 (en) User-defined content magnification and multi-point video conference system, method and logic
KR101882596B1 (en) Bitstream generation and processing methods and devices and system
US11089343B2 (en) Capability advertisement, configuration and control for video coding and decoding
WO2015196710A1 (en) Device capability negotiation method and apparatus, and computer storage medium
CN114600468B (en) Combiner system, receiver device, computer-implemented method and computer-readable medium for combining video streams in a composite video stream with metadata
CN112135155B (en) Audio and video connecting and converging method and device, electronic equipment and storage medium
US10397518B1 (en) Combining encoded video streams
JP2002330440A (en) Image transmission method, program for the image transmission method, recording medium for recording the program for the image transmission method, and image transmitter
WO2021168649A1 (en) Multifunctional receiving device and conference system
US11917194B2 (en) Image encoding/decoding method and apparatus based on wrap-around motion compensation, and recording medium storing bitstream
CN114189696A (en) Video playing method and device
WO2021093882A1 (en) Video meeting method, meeting terminal, server, and storage medium
US11943473B2 (en) Video decoding method and apparatus, video encoding method and apparatus, storage medium, and electronic device
JP2020524450A (en) Transmission system for multi-channel video, control method thereof, multi-channel video reproduction method and device thereof
US11457053B2 (en) Method and system for transmitting video
CN115699755A (en) Method and apparatus for encoding/decoding image based on wrap motion compensation, and recording medium storing bitstream
WO2023011408A1 (en) Multi-window video communication method, device and system
CN113038137B (en) Video data output method, system, device and computer readable storage medium
JP2014146936A (en) Electronic apparatus, difference data output apparatus, control method of electronic apparatus, control method of difference data output apparatus, control program of electronic apparatus, and control program of difference data output apparatus
CN105812922A (en) Multimedia file data processing method, system, player and client
US20220303596A1 (en) System and method for dynamic bitrate switching of media streams in a media broadcast production
CN118200625A (en) Multimedia processing system, multimedia processing method and related equipment
CN114157919A (en) Data processing method and system, cloud terminal, server and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination