CN114554277B - Multimedia processing method, device, server and computer readable storage medium


Info

Publication number: CN114554277B
Application number: CN202011334114.1A
Authority: CN (China)
Prior art keywords: audio, video, frame, independently, stream
Legal status: Active
Other versions: CN114554277A
Other languages: Chinese (zh)
Inventor: 李志成
Assignee: Tencent Technology (Shenzhen) Co Ltd
Application filed by Tencent Technology (Shenzhen) Co Ltd; priority to CN202011334114.1A; publication of CN114554277A; application granted; publication of CN114554277B

Classifications

    • H04N21/44227: Monitoring of local network, e.g. connection or bandwidth variations; detecting new devices in the local network
    • H04N21/439: Processing of audio elementary streams
    • H04N21/44012: Processing of video elementary streams, involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • H04N21/4402: Processing of video elementary streams, involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/4781: Games
    • H04N21/8547: Content authoring involving timestamps for synchronizing content

All classifications fall under H04N21/00: selective content distribution, e.g. interactive television or video on demand [VOD].


Abstract

The application provides a multimedia processing method and device, an electronic device and a computer-readable storage medium, relating to the Internet field. The method comprises the following steps: acquiring the audio data and the video data of the multimedia in real time, respectively; independently encapsulating the audio data to generate an audio stream, and independently encapsulating the video data to generate a video stream; and independently sending the audio stream to a terminal through an audio transmission channel, and independently sending the video stream to the terminal through a video transmission channel, so that the terminal independently plays the audio stream and independently renders the video stream, the audio transmission channel and the video transmission channel being independent of each other. The method and device solve the problem of large delay in existing cloud-game and live-broadcast schemes.

Description

Multimedia processing method, device, server and computer readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular to a multimedia processing method and device, a server, and a computer-readable storage medium.
Background
Live broadcast and cloud games place extremely high requirements on the end-to-end acquisition, encoding, transmission and rendering delay of video streams, particularly in activities such as cloud FPS (first-person shooter) games, live-stream red-envelope events, and low-price flash-sale e-commerce promotions.
In both the game live-broadcast video-stream scheme and the cloud-game video-stream scheme, video pictures are captured through a camera or desktop screen capture and converted into video streams, while sound is captured through the sound card and converted into audio streams; both streams are transmitted to a terminal over the network, and the terminal decodes them in hardware or software for rendering.
To keep pictures and sound strictly synchronized during rendering and display, a terminal generally outputs audio frames according to the PTS (Presentation Time Stamp) and sampling frequency of the audio stream, calculates the video frame interval from the FPS (Frames Per Second) of the video stream, and, according to the audio and video container formats, converts the time-base information into a uniform millisecond (ms) time base for synchronized rendering and display.
For example, an AAC audio stream at a 44.1 kHz sampling frequency has a frame interval of 22.7 ms, and a 30 fps video stream has a frame interval of 33.3 ms. When the terminal renders, if the current relative timestamp is about 22.7 ms, the 2nd audio frame is rendered (the 1st frame being at 0 seconds); then, at 33.3 ms, the 2nd video frame is rendered. If the 2nd video frame is absent, the terminal's rendering stalls and, to preserve audio-video synchronization, waits for the 2nd video frame.
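To make the arithmetic concrete, the following is a minimal sketch (in Python; not from the patent, with all constants taken from the example above and all names illustrative) of the interval calculation and the stall condition just described:

```python
AUDIO_INTERVAL_MS = 22.7       # AAC at a 44.1 kHz sampling frequency
VIDEO_INTERVAL_MS = 1000 / 30  # 30 fps -> ~33.3 ms

def render_tick(relative_ts_ms: float, video_frame_ready: bool) -> str:
    """Strict A/V sync: a due-but-missing video frame stalls playback."""
    if relative_ts_ms >= VIDEO_INTERVAL_MS and not video_frame_ready:
        return "stall: wait for the 2nd video frame to keep audio-video sync"
    return "render on schedule"

# At ~22.7 ms the 2nd audio frame is rendered; at ~33.3 ms the 2nd video
# frame is due, and if it is absent the renderer stalls:
print(render_tick(33.3, video_frame_ready=False))
```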
To address this, existing cloud-game and fast live-broadcast schemes add a cross-synchronization queue for the audio and video streams (audio and video are stored in two different queues) during acquisition and encoding; the PTS of each frame in the two queues is used as a sorting key for cross-sorted, synchronized output, which is then passed to the player for decoding and rendering (or playing).
However, this solution has the following drawbacks:
1) Cross-sorted synchronized queue output introduces a 20-50 ms waiting delay (roughly 1-2 frames; a low-delay encoder's rate-control reference-frame queue is generally kept to about 1-2 frames) caused by the sound card and the graphics card (or camera) acquiring and encoding asynchronously. When the encoder's compression reference frames and rate-control reference frames are larger, the waiting delay grows: for example, at an FPS of 30 with a rate-control lookahead of 20 frames, the waiting delay is (1000/30) × 20 ≈ 667 ms. The larger the encoding reference-frame queue, the larger the queue waiting delay;
2) During transmission of the audio and video streams, for users with weak or unstable networks, the large code rate of the video stream can increase the transmission delay of the audio stream.
Disclosure of Invention
The application provides a multimedia processing method and device, an electronic device and a computer-readable storage medium, which can solve the problem of large delay in existing cloud-game and live-broadcast schemes. The technical scheme is as follows:
in one aspect, a method for processing multimedia is provided, the method comprising:
respectively acquiring audio data and video data of the multimedia in real time;
independently packaging the audio data to generate an audio stream, and independently packaging the video data to generate a video stream;
independently transmitting the audio stream to a terminal through an audio transmission channel, and independently transmitting the video stream to the terminal through a video transmission channel, so that the terminal independently performs play processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are independent of each other.
Preferably, the independently encapsulating the audio data to generate an audio stream includes:
independently encoding the audio data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded audio data;
and independently packaging the encoded audio data by adopting an audio container format and an audio transmission protocol format agreed with the terminal to obtain the audio stream.
Preferably, the independently encapsulating the video data to generate a video stream includes:
independently encoding the video data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded video data;
and independently packaging the coded video data by adopting a video container format and a video transmission protocol format agreed with the terminal to obtain the video stream.
Preferably, the method further comprises:
receiving feedback information returned by the terminal in real time;
when the network state of the terminal is abnormal based on the feedback information, determining a target processing strategy from at least two preset processing strategies based on the feedback information, and executing the target processing strategy; the at least two preset processing strategies comprise an audio processing strategy and a video processing strategy.
Preferably, the determining a target processing policy from at least two preset processing policies based on the feedback information, and executing the target processing policy, includes:
When the target processing strategy is an audio processing strategy, performing fault location on an audio acquisition device for acquiring audio data in real time and an application program for packaging the audio data, and generating alarm information;
and displaying the alarm information.
Preferably, the determining a target processing policy from at least two preset processing policies based on the feedback information, and executing the target processing policy, includes:
when the target processing strategy is a video processing strategy, reducing the frame rate of video data acquired in real time to a preset frame rate, and reducing the video resolution and the code rate adopted for generating the video stream to the preset video resolution and the preset code rate respectively.
Preferably, the method further comprises:
and stopping executing the target processing strategy when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently executed.
Preferably, when it is determined that the network state of the terminal is normal based on the feedback information, and the target processing policy is currently being executed, stopping executing the target processing policy includes:
when the network state of the terminal is determined to be normal based on the feedback information and a target processing strategy is currently executed, the frame rate of video data acquired in real time is increased to an initial frame rate, and the video resolution and the code rate adopted for generating the video stream are respectively increased to an initial video resolution and an initial code rate.
In another aspect, a method for processing multimedia is provided, the method comprising:
respectively and independently receiving an audio stream and a video stream sent by a server;
independently decoding the audio stream to obtain at least one audio frame, and independently decoding the video stream to obtain at least one video frame;
determining the latest target audio frame from the at least one audio frame, determining the latest target video frame from the at least one video frame, and generating feedback information based on the target audio frame and the target video frame;
and independently playing the at least one audio frame, independently rendering the at least one video frame, and sending the feedback information to the server.
Preferably, determining the latest target audio frame from the at least one audio frame, determining the latest target video frame from the at least one video frame, and generating feedback information based on the target audio frame and the target video frame includes:
acquiring the display time stamps respectively corresponding to the at least one audio frame and taking the audio frame with the latest display time stamp as the target audio frame, and acquiring the display time stamps respectively corresponding to the at least one video frame and taking the video frame with the latest display time stamp as the target video frame;
and calculating the time difference between the display time stamps of the target audio frame and the target video frame, and generating the feedback information based on the time difference.
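As a concrete reading of this step, the sketch below (Python; the data shape and the sign convention are assumptions, not the patent's definitions) picks the latest PTS on each side and reports their difference as the feedback payload:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    pts_ms: int  # display (presentation) time stamp, in milliseconds

def build_feedback(audio_frames: list, video_frames: list) -> dict:
    """Latest audio/video PTS difference; positive means video lags audio."""
    target_audio = max(audio_frames, key=lambda f: f.pts_ms)
    target_video = max(video_frames, key=lambda f: f.pts_ms)
    return {"av_pts_diff_ms": target_audio.pts_ms - target_video.pts_ms}

print(build_feedback([Frame(68)], [Frame(33)]))  # {'av_pts_diff_ms': 35}
```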
In another aspect, there is provided a multimedia processing apparatus, the apparatus comprising:
the acquisition module is used for respectively acquiring the audio data and the video data of the multimedia in real time;
the first generation module is used for independently packaging the audio data to generate an audio stream, and independently packaging the video data to generate a video stream;
a first sending module, configured to independently send the audio stream to a terminal through an audio transmission channel, and independently send the video stream to the terminal through a video transmission channel, so that the terminal independently performs play processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are independent of each other.
Preferably, the first generating module includes:
the first encoding submodule is used for independently encoding the audio data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded audio data;
And the first packaging submodule is used for independently packaging the encoded audio data by adopting an audio container format and an audio transmission protocol format agreed with the terminal to obtain the audio stream.
Preferably, the first generating module includes:
the second coding submodule is used for independently coding the video data from a time node with a decoding time stamp and a display time stamp as starting points to obtain coded video data;
and the second encapsulation submodule is used for independently encapsulating the coded video data by adopting a video container format and a video transmission protocol format agreed with the terminal to obtain the video stream.
Preferably, the method further comprises:
the first receiving module is used for receiving feedback information returned by the terminal in real time;
the execution module is used for determining a target processing strategy from at least two preset processing strategies based on the feedback information when the network state of the terminal is determined to be abnormal based on the feedback information, and executing the target processing strategy; the at least two preset processing strategies comprise an audio processing strategy and a video processing strategy.
Preferably, the execution module is specifically configured to:
When the target processing strategy is an audio processing strategy, performing fault location on an audio acquisition device for acquiring audio data in real time and an application program for packaging the audio data, and generating alarm information; and displaying the alarm information.
Preferably, the execution module is specifically configured to:
when the target processing strategy is a video processing strategy, reducing the frame rate of video data acquired in real time to a preset frame rate, and reducing the video resolution and the code rate adopted for generating the video stream to the preset video resolution and the preset code rate respectively.
Preferably, the execution module is further configured to:
and stopping executing the target processing strategy when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently executed.
Preferably, the execution module is specifically configured to:
when the network state of the terminal is determined to be normal based on the feedback information and a target processing strategy is currently executed, the frame rate of video data acquired in real time is increased to an initial frame rate, and the video resolution and the code rate adopted for generating the video stream are respectively increased to an initial video resolution and an initial code rate.
In another aspect, there is provided a multimedia processing apparatus, the apparatus comprising:
the second receiving module is used for independently receiving the audio stream and the video stream sent by the server respectively;
the decoding module is used for independently decoding the audio stream to obtain at least one audio frame, and independently decoding the video stream to obtain at least one video frame;
the determining module is used for determining the latest target audio frame from the at least one audio frame and determining the latest target video frame from the at least one video frame;
the second generation module is used for generating feedback information based on the target audio frame and the target video frame;
the playing module is used for independently playing the at least one audio frame and independently rendering the at least one video frame;
and the second sending module is used for sending the feedback information to the server.
Preferably, the determining module is specifically configured to:
acquire the display time stamps respectively corresponding to the at least one audio frame and take the audio frame with the latest display time stamp as the target audio frame, and acquire the display time stamps respectively corresponding to the at least one video frame and take the video frame with the latest display time stamp as the target video frame;
the second generating module is specifically configured to:
calculate the time difference between the display time stamps of the target audio frame and the target video frame, and generate the feedback information based on the time difference.
In another aspect, there is provided a server, the server comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to, by invoking the operation instruction, cause the processor to execute an operation corresponding to the multimedia processing method as shown in the first aspect of the present application.
In another aspect, there is provided an electronic device comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to, by invoking the operation instruction, cause the processor to execute an operation corresponding to the multimedia processing method as shown in the second aspect of the present application.
In another aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for processing multimedia as described in the first aspect of the present application.
In another aspect, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of processing multimedia as shown in the second aspect of the present application.
The beneficial effects brought by the technical scheme provided by the application are as follows:
in the embodiment of the invention, a server respectively acquires audio data and video data of the multimedia in real time, then independently packages the audio data to generate an audio stream, independently packages the video data to generate a video stream, independently transmits the audio stream to a terminal through an audio transmission channel, and independently transmits the video stream to the terminal through a video transmission channel, so that the terminal independently performs playing processing on the audio stream and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are independent of each other. In this way, the audio stream and the video stream are independent and do not affect each other in the processes of collection, encapsulation and transmission, so that the problem of waiting for delay in synchronous sequencing caused by synchronous output of the audio and video stream through the cross sequencing of the audio and video queues in the prior art is solved, and the problem of increasing the transmission delay of the audio stream caused by large code rate of the video stream is also solved; in addition, the terminals are independent in the process of processing the audio stream and the video stream, and are not affected mutually, so that synchronous waiting operation on the audio stream and the video stream is not needed, and the problem that delay is generated in the prior art when the jitter is adopted to uniformly manage the audio and video streams is solved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a video streaming scheme of a prior art cloud game;
FIG. 2 is a schematic diagram of a basic flow of a cloud game, live capturing, encoding, packaging, transmitting, and rendering in the prior art;
fig. 3 is a schematic application environment diagram of a multimedia processing method according to an embodiment of the present application;
fig. 4 is a flow chart of a multimedia processing method according to an embodiment of the present application;
FIG. 5 is a basic flow diagram of the cloud game, live capturing, encoding, packaging, transmitting, and rendering of the present application;
fig. 6 is a flow chart of a multimedia processing method according to another embodiment of the present application;
fig. 7 is a flow chart of a multimedia processing method according to another embodiment of the present application;
fig. 8 is a schematic structural diagram of a multimedia processing apparatus according to another embodiment of the present application;
fig. 9 is a schematic structural diagram of a multimedia processing apparatus according to another embodiment of the present application;
fig. 10 is a schematic structural diagram of a server for processing multimedia according to another embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device for processing multimedia according to another embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Several terms which are referred to in this application are first introduced and explained:
WebRTC: web Real-Time Communication, web instant messaging. Is an API (Application Programming Interface, application program interface) that supports web browsers for real-time voice conversations or video conversations. WebRTC includes audio and video acquisition, codec, network transmission, display, etc. functions, and also supports cross-platform: windows, linux, mac, android. WebRTC is only a low-latency real-time interaction protocol, and APP supports WebRTC as long as integration supports webtc protocol, for example, a web browser supporting WebRTC.
Fast live broadcast: live Event Broadcasting, LEB for short, is an ultra-low delay live broadcast, which is an extension of standard live broadcast in an ultra-low delay play scene, and has lower delay than the traditional live broadcast protocol, thereby providing millisecond-level extreme live broadcast viewing experience for audiences. The method can meet specific scene requirements with higher requirements on delay performance, such as online education, live sports event, online answer questions and the like.
Cloud gaming: in a cloud computing-based game mode, in a cloud game running mode, all games run at a server side, and rendered game pictures are compressed and then transmitted to a user through a network. At the client, the user's game device does not need any high-end processor and graphics card, but only needs basic video decompression capability.
Encoding: transcoding from one file format or audio-video resolution rate to another, such as transcoding an audio-video RTMP (h.264/1080P) stream format to an FLV (h.265/720P) file format, the encoding may also be referred to as "transcoding".
DTS: decoding Time Stamp, decoding a time stamp in the sense of telling the player when to decode the data of this frame.
PTS: presentation Time Stamp, a time stamp is displayed which tells the player when to display the data for this frame.
Audio sampling frequency: the most intuitive effect of the sampling frequency is the frequency range expressive force of the sound, and the higher the sampling frequency, the larger the frequency range that can be expressed. For example, a 44.1KHz sampling frequency may exhibit a frequency range of 0Hz-22050Hz; the frequency range which can be represented by the 48KHz sampling frequency is 0Hz-24000Hz; the 96KHz sampling frequency may exhibit a frequency range of 0Hz to 48000Hz. The average frequency range audible to the human ear is approximately 20Hz-20000Hz.
Code rate of audio and video: the amount of information transferred per second in a data stream; it can also be understood as how many bits of data are used per second to represent the content.
VP8/VP9: a video compression format. VP8/VP9 is an eighth/ninth generation video compression format, can provide higher quality video with less data, can play video with less processing power, and provides ideal solutions for network televisions, interactive network televisions and video conference companies aiming at realizing product and service differentiation.
AV1: the AV1 coding standard inherits from the VP8 and VP9 standards, thanks to advanced coding tools, AV1 can save 30% of the code rate compared to VP9 and HEVC (High Efficiency Video Coding ) at the same coding quality.
silk/opus: SILK is an audio compression format and audio codec. It was developed for Skype to replace the SVOPC codec. Opus is an audio compression format and audio codec based on a fusion of SILK and CELT, serving as the next generation of SILK.
Taking cloud games as an example, in the prior art, cloud games are divided into two types according to the content of the transmitted data stream:
1) Video streaming
The game runs on an edge computing node equipped with a GPU (Graphics Processing Unit). Game images generated by the GPU are encoded into VP8/VP9/H.264/H.265/AV1 video streams and silk/opus/aac audio stream data and transmitted to the terminal over the network; meanwhile, the terminal sends operation instructions such as mouse, keyboard and touch input back to the server.
2) Instruction stream
The game runs on an edge computing node. The graphics API calls issued by the game are intercepted through a virtual GPU or a software graphics library supporting the graphics API, serialized into an instruction stream, and transmitted over the network to a terminal with a GPU; the terminal executes the instruction stream to render the game image, and meanwhile returns operation instructions such as mouse, keyboard and touch input to the server.
At present, the cloud-game schemes publicly released on the market are mainly video-stream schemes; the instruction-stream scheme is still at the pre-research stage in terms of game compatibility, rendering stability and bandwidth/code-rate control, and the technology is not yet mature.
Further, the basic architecture of the prior-art cloud-game video-streaming scheme is shown in FIG. 1. The streaming processing mainly comprises collecting the game's audio and video data and encoding them into audio and video streams. To improve processing efficiency and reduce delay, cloud-game video processing obtains game pictures directly from GPU video memory and passes them to the GPU encoding module for encoding and output, reducing the performance loss of copying between the GPU and the CPU (Central Processing Unit); audio processing obtains game sound data directly from the sound card for encoding and output. Game images generated by the GPU are encoded into VP8/VP9/H.264/H.265/AV1 video streams, sound data collected by the sound card is encoded into silk/opus/aac audio stream data, and both are transmitted to the terminal over the network; meanwhile, the terminal returns operation instructions such as mouse, keyboard and touch input to the server.
Fig. 2 shows the basic flow of cloud gaming, live capturing, encoding, packaging, transmitting, rendering in the prior art.
In the existing cloud-game and fast live-broadcast schemes, a cross-synchronization queue for the audio and video streams (audio and video are stored in two different queues) is added during acquisition and encoding; the PTS of each frame in the two queues is used as a sorting key for cross-sorted, synchronized output, which is then passed to the player for decoding and rendering (or playing).
However, cross-sorted synchronized queue output introduces a 20-50 ms waiting delay (roughly 1-2 frames; a low-delay encoder's rate-control reference-frame queue is generally kept to about 1-2 frames) caused by the sound card and the graphics card (or camera) acquiring and encoding asynchronously. When the encoder's compression reference frames and rate-control reference frames are larger, the waiting delay grows: for example, if the FPS is 30 and the rate-control lookahead rc_lookahead is 20, the waiting delay is (1000/30) × 20 ≈ 667 ms. The larger the encoding reference-frame queue, the larger the queue waiting delay.
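Spelled out as code, the waiting-delay arithmetic in this paragraph is simply the frame interval multiplied by the lookahead depth (a minimal sketch; only the numbers come from the text):

```python
def queue_wait_delay_ms(fps: int, rc_lookahead: int) -> float:
    """Waiting delay added by the encoding reference-frame queue."""
    return (1000 / fps) * rc_lookahead

print(queue_wait_delay_ms(30, 20))  # 666.66... ms, i.e. the ~667 ms above
```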
The latency in the existing scheme is generated by two shaded portions in the above-described flow.
One part is the delay generated when the audio and video queues are cross-sorted for synchronized output during encoding; this part contributes about 20-50 ms.
The other part is the jitter buffer the player maintains for audio-video synchronization during decoding and rendering (or playing), which can add a delay of about 1-2 frames, i.e. about 15 ms. For example, with an AAC audio frame interval of 22.7 ms at a 44.1 kHz sampling rate and a video frame interval of 16.7 ms at 60 fps, the player starts rendering from the time node where the relative PTS is 0: at a relative PTS of 0 ms, the 1st frames of the audio and video streams are rendered; then, at about 16.7 ms, the jitter buffer should receive the decoded video frame with PTS 16.7 ms. If that frame is lost, or the acquisition or encoding fluctuates, the jitter buffer waits for the next decoded video frame to appear, and skips playback of the 16.7 ms frame only after the following frame confirms that it is abnormal.
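This wait-then-skip behaviour can be sketched as follows (Python; frames are indexed at 60 fps, so index i sits at i × 16.7 ms. The structure is an illustrative reading of the text, not a real player's API):

```python
def jitter_buffer_step(expected_idx: int, buffered: set) -> str:
    """Decide what to do when frame `expected_idx` is due for rendering."""
    if expected_idx in buffered:
        return f"render frame {expected_idx}"
    if expected_idx + 1 in buffered:
        # The next frame arrived first, confirming the due frame is abnormal:
        # skip it and move on.
        return f"skip frame {expected_idx}, render frame {expected_idx + 1}"
    return "wait"  # this waiting is the ~1-2 frame (~15 ms) jitter delay

print(jitter_buffer_step(1, {0, 2}))  # frame 1 (PTS 16.7 ms) lost -> skip
```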
On the one hand, cross-sorted synchronized output of the audio and video queues brings a synchronization-sorting waiting delay; on the other hand, during transmission of the audio and video streams, for users with weak or unstable networks, the large code rate of the video stream can increase the transmission delay of the audio stream.
The multimedia processing method, device, electronic equipment and computer readable storage medium provided by the application aim to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the invention provides an application environment of a multimedia processing method, referring to fig. 3, the application environment comprises: a first device 301 and a second device 302. The first device 301 and the second device 302 are connected through a network, the first device 301 is an access device, and the second device 302 is an accessed device. The first device 301 may be a terminal, in which an APP supporting WebRTC may be installed, and the terminal may have the following characteristics:
(1) In a hardware system, the device includes a central processing unit, a memory, an input unit, and an output unit, that is, the device is often a microcomputer device having a communication function. In addition, there may be various input modes such as a keyboard, a mouse, a touch panel, a microphone, a camera, and the like, and the input may be adjusted as necessary. Meanwhile, the equipment often has various output modes, such as a receiver, a display screen and the like, and can be adjusted according to the needs;
(2) On the software architecture, the device must be provided with an operating system, such as Windows Mobile, Symbian, Palm, Android or iOS. Meanwhile, these operating systems are increasingly open, and personalized application programs developed on these open platforms emerge endlessly, such as address books, calendars, notepads, calculators and various games, satisfying the demands of individual users to a great extent;
(3) In terms of communication capability, the device has flexible access modes and high-bandwidth communication performance, and can automatically adjust the chosen communication mode according to the selected service and the environment, facilitating use. The device supports 5G SA (standalone) access, ensuring that the device side supports network slicing and carries not only voice services but also various wireless data services;
(4) In terms of function and use, the device focuses more on humanization, personalization and multi-functionality. With the development of computer technology, the device has moved from a "device-centered" mode to a "people-centered" mode, integrating embedded computing, control technology, artificial-intelligence technology and biometric-authentication technology, fully embodying the people-oriented aim. Thanks to the development of software technology, the device can adjust its settings according to personal needs and become more personalized. Meanwhile, the device integrates abundant software and hardware, and its functions grow ever more powerful.
The second device 302 may be a server, which may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides a cloud computing service.
Communication between the terminal and the server may be accomplished by any communication means, including but not limited to 3GPP (3rd Generation Partnership Project), 4GPP (4th Generation Partnership Project) and 5GPP (5th Generation Partnership Project) mobile communication, LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), computer network communication based on the TCP/IP (Transmission Control Protocol/Internet Protocol) and UDP (User Datagram Protocol) protocols, and short-range wireless transmission based on the Bluetooth and infrared transmission standards.
A method for processing multimedia may be performed in the second device in the application environment, as shown in fig. 4, where the method includes:
Step S401, respectively acquiring audio data and video data of multimedia in real time;
Specifically, multimedia includes but is not limited to live broadcasts and cloud games, so the audio data and video data of the multimedia may be the live sounds and live pictures of a live broadcast, or the game sounds and game pictures of a cloud game.
Further, the audio data of the multimedia can be acquired independently through an audio acquisition device, such as a sound card; the video data of the multimedia can be acquired independently through a video acquisition device, such as a camera (for fast live broadcast) or a graphics card (for cloud games). Of course, in practical application, other forms of audio and video acquisition devices are also applicable to the embodiment of the present invention and may be chosen according to actual requirements, which is not limited by the embodiment of the present invention.
Step S402, independently packaging audio data to generate an audio stream, and independently packaging video data to generate a video stream;
after the audio data and the video data are respectively and independently acquired, the audio data can be independently packaged to generate an audio stream, and the video data can be independently packaged to generate a video stream. That is, the process of generating the audio stream and the process of generating the video stream are independent of each other and do not affect each other.
Wherein, independently encapsulate audio data, produce the audio stream, include:
independently encoding the audio data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded audio data;
and independently packaging the encoded audio data by adopting an audio container format and an audio transmission protocol format agreed with the terminal to obtain an audio stream.
Specifically, the audio data is collected in time order, so each audio frame in the collected audio data is independently encoded starting from the time node at which the DTS and PTS begin, obtaining encoded audio data. The encoded audio data is then independently packaged using the audio container format and audio transmission protocol format pre-agreed with the terminal, thereby obtaining the audio stream.
The starting point in the embodiment of the invention is preferably 0 ms; audio container formats include but are not limited to silk, opus and aac; the audio transmission protocol is preferably the WebRTC protocol. Of course, in practical application, the starting points of the DTS and the PTS, the audio container format, and the specific audio transmission protocol may be adjusted according to actual requirements, which is not limited by the embodiment of the present invention.
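A minimal sketch of this independent audio path follows (Python, assuming injected capture/encode/send callables; none of these names come from the patent). Each frame is stamped with DTS = PTS measured from the 0 ms starting point, encoded, encapsulated and shipped without ever consulting the video path:

```python
import time
from typing import Callable

def audio_pipeline(read_frame: Callable[[], bytes],
                   encode_and_pack: Callable[[bytes, int, int], bytes],
                   send: Callable[[bytes], None]) -> None:
    """Independent audio path: capture -> encode -> encapsulate -> send."""
    t0 = time.monotonic()
    while True:
        pcm = read_frame()                        # raw samples from the sound card
        ts = int((time.monotonic() - t0) * 1000)  # DTS = PTS, starting at 0 ms
        send(encode_and_pack(pcm, ts, ts))        # e.g. opus over a WebRTC channel
```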
Independently encapsulating the video data to generate a video stream, comprising:
independently encoding the video data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded video data;
and independently packaging the coded video data by adopting a video container format and a video transmission protocol format agreed with the terminal to obtain a video stream.
Specifically, the video data is collected in time order, so each video frame in the collected video data is independently encoded starting from the time node at which the DTS and PTS begin, obtaining encoded video data. The encoded video data is then independently packaged using the video container format and video transmission protocol format pre-agreed with the terminal, thereby obtaining the video stream.
The starting point in the embodiment of the invention is preferably 0ms; video container formats include, but are not limited to, VP8, VP9, h.264, h.265, AV1; the video transmission protocol is preferably the WebRTC protocol. Of course, in practical application, the starting points of the DTS and the PTS, the video container format, and the specific protocol of the video transmission protocol may be adjusted according to the actual requirements, which is not limited by the embodiment of the present invention.
It should be noted that, before generating the audio stream and the video stream, the server and the APP in the terminal may exchange signaling used to determine which audio container format, audio transmission protocol, video container format and video transmission protocol both the server and the APP support, so as to avoid the situation where the audio and video streams encoded by the server cannot be decoded by the APP, or cannot be sent to the APP at all.
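The format agreement might look like the following sketch (the capability sets and message shape are invented for illustration; only the idea of intersecting supported formats comes from the text):

```python
SERVER_CAPS = {"audio": {"opus", "aac", "silk"},
               "video": {"h264", "h265", "vp9"},
               "transport": {"webrtc"}}

def negotiate(app_caps: dict) -> dict:
    """Pick, per category, a format both the server and the APP support."""
    agreed = {k: SERVER_CAPS[k] & set(app_caps.get(k, ())) for k in SERVER_CAPS}
    if not all(agreed.values()):
        raise ValueError("no common format: streams could not be decoded/sent")
    return {k: sorted(v)[0] for k, v in agreed.items()}

print(negotiate({"audio": ["opus"], "video": ["h264"], "transport": ["webrtc"]}))
```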
Step S403, the audio stream is independently sent to the terminal through the audio transmission channel, and the video stream is independently sent to the terminal through the video transmission channel, so that the terminal independently performs playing processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are independent of each other.
Specifically, after the audio stream and the video stream are generated, the audio stream may be independently transmitted to the terminal through the audio transmission channel, and the video stream may be independently transmitted to the terminal through the video transmission channel. After the terminal receives the audio stream and the video stream, respectively, the terminal independently performs playing processing (including decoding and playing) on the audio stream, and independently performs rendering processing (including decoding and rendering) on the video stream.
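As a structural sketch of step S403 (Python threads standing in for whatever transport the server actually uses; the Channel class is hypothetical), the two streams are drained by separate workers over separate channels, with no shared queue and no cross-stream waiting:

```python
import queue
import threading

class Channel:
    """Stand-in for one independent transport channel, e.g. one WebRTC track."""
    def __init__(self, name: str):
        self.name = name
    def send(self, packet: bytes) -> None:
        pass  # a real transport implementation would go here

def forward(stream: queue.Queue, channel: Channel) -> None:
    while True:
        channel.send(stream.get())  # each stream drains at its own pace

audio_q: queue.Queue = queue.Queue()
video_q: queue.Queue = queue.Queue()
threading.Thread(target=forward, args=(audio_q, Channel("audio")), daemon=True).start()
threading.Thread(target=forward, args=(video_q, Channel("video")), daemon=True).start()
```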
FIG. 5 shows the basic flow of cloud-game and fast-live-broadcast acquisition, encoding, transmission and rendering in the present application. Compared with the prior art, the audio stream and the video stream in the embodiments of the present invention are independent and do not affect each other during acquisition, encapsulation and transmission: because no audio-video cross-synchronization queue is used, the synchronization-sorting waiting delay caused by cross-sorted synchronized output of the audio and video streams is eliminated, and the increase in audio-stream transmission delay caused by a large video-stream code rate is also avoided. In addition, the terminal processes the audio stream and the video stream independently, without mutual influence, so no synchronous waiting operation on the two streams is needed, eliminating the delay introduced in the prior art by using a jitter buffer to uniformly manage the audio and video streams.
In the embodiment of the invention, a server acquires the audio data and the video data of the multimedia in real time, respectively; it then independently encapsulates the audio data to generate an audio stream and independently encapsulates the video data to generate a video stream, independently sends the audio stream to a terminal through an audio transmission channel, and independently sends the video stream to the terminal through a video transmission channel, so that the terminal independently plays the audio stream and independently renders the video stream; the audio transmission channel and the video transmission channel are independent of each other. In this way, the audio stream and the video stream are independent and do not affect each other during collection, encapsulation and transmission, which not only eliminates the synchronization-sorting waiting delay caused in the prior art by cross-sorted synchronized output of the audio and video queues, but also avoids the increase in audio-stream transmission delay caused by a large video-stream code rate. In addition, the terminal processes the audio stream and the video stream independently, without mutual influence, so no synchronous waiting operation on the two streams is needed, which eliminates the delay introduced in the prior art by using a jitter buffer to uniformly manage the audio and video streams.
In another embodiment, a method for processing multimedia is provided, as shown in fig. 6, and the method includes:
step S601, respectively acquiring audio data and video data of multimedia in real time;
step S602, independently packaging audio data to generate an audio stream, and independently packaging video data to generate a video stream;
step S603, the audio stream is independently sent to the terminal through the audio transmission channel, and the video stream is independently sent to the terminal through the video transmission channel, so that the terminal independently performs playing processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are not related to each other;
Steps S601 to S603 are substantially the same as steps S401 to S403; to avoid repetition, their detailed description is omitted here, and reference may be made to steps S401 to S403 for specific embodiments.
Step S604, feedback information returned by the terminal is received in real time;
Because the audio stream and the video stream are mutually independent and do not affect each other in collection, encoding, encapsulation, transmission, decoding and rendering (or playing), the sound and picture may fall out of sync, especially in scenes with high-frame-rate, high-resolution, high-code-rate video streams or an unstable user network.
To handle this, the APP can generate feedback information during decoding, update it in real time, and return it to the server in real time; based on the feedback information, the server can determine whether the APP's sound and picture are out of sync, and resolve the desynchronization by executing the corresponding exception-handling strategy for each situation.
Step S605, when the network state of the terminal is abnormal based on the feedback information, determining a target processing strategy from at least two preset processing strategies based on the feedback information, and executing the target processing strategy; the at least two preset processing strategies comprise an audio processing strategy and a video processing strategy;
When the feedback information is received and the network state of the terminal is determined to be abnormal based on it, a target processing strategy can be determined from at least two preset processing strategies based on the feedback information, and the target processing strategy executed. The at least two preset processing strategies comprise an audio processing strategy and a video processing strategy. That is, when the terminal's network state is abnormal, the independent transmission of the audio stream or of the video stream becomes abnormal, ultimately causing the APP's audio and video to fall out of sync; it can therefore be determined from the feedback information whether the network abnormality affects the independent transmission of the audio stream or of the video stream.
Wherein the feedback information may comprise a time difference value. When the time difference value belongs to a first preset interval, judging that the target processing strategy is an audio processing strategy; when the time difference value belongs to a second preset interval, judging that the network state of the terminal is normal; and when the time difference value belongs to a third preset interval, judging the target processing strategy as the video processing strategy. The manner in which the time difference is generated will be described in detail hereinafter and will not be described in detail here.
Preferably, the first preset interval is (-∞, -50), the second preset interval is [-50, 50), and the third preset interval is [50, +∞). Of course, in practical application, the boundary values of each interval may be set according to practical requirements, which is not limited in the embodiment of the present invention.
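Read together with the feedback computation described earlier, the interval logic reduces to a small dispatcher (a sketch; the thresholds are the preferred values from the text, while the assumption that the difference is the audio PTS minus the video PTS, in milliseconds, is ours):

```python
def pick_strategy(av_pts_diff_ms: float) -> str:
    if av_pts_diff_ms < -50:
        return "audio processing strategy"  # first interval: audio path lags
    if av_pts_diff_ms < 50:
        return "network state normal"       # second interval: no action needed
    return "video processing strategy"      # third interval: video path lags
```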
In a preferred embodiment of the present invention, determining a target processing policy from at least two preset processing policies based on the feedback information, and executing the target processing policy includes:
when the target processing strategy is an audio processing strategy, performing fault location on an audio acquisition device for acquiring audio data in real time and an application program for packaging the audio data, and generating alarm information;
And displaying the alarm information.
Specifically, when the target processing strategy is determined to be the audio processing strategy, the audio acquisition equipment for acquiring the audio data in real time and the application program for packaging the audio data are subjected to fault positioning, and alarm information is generated and displayed. The audio collection device includes, but is not limited to, a sound card, and the application program for packaging the audio data includes, but is not limited to, silk and opus, which can be adjusted according to actual requirements in practical application.
It should be noted that the audio collection device may include a physical device and may further include a driver corresponding to that physical device. For example, the audio capture device may be a sound card together with its driver.
In this way, a technician can locate the fault in the audio acquisition equipment or the application program based on the alarm information, and then resolve it.
In a preferred embodiment of the present invention, determining a target processing policy from at least two preset processing policies based on the feedback information, and executing the target processing policy includes:
when the target processing strategy is a video processing strategy, reducing the frame rate of video data acquired in real time to a preset frame rate, and reducing the video resolution and the code rate adopted for generating the video stream to the preset video resolution and the preset code rate respectively.
Specifically, when the target processing policy is a video processing policy, the frame rate of video data acquired in real time may be reduced to a preset frame rate, and the video resolution and the code rate used for generating the video stream may be reduced to a preset resolution and a preset code rate, respectively.
When reducing the frame rate to the preset frame rate, several approaches are possible. The frame rate may be reduced by a fixed amount at fixed intervals until it reaches the preset frame rate, for example by 5 FPS every 2 seconds until it reaches 25 FPS. It may be reduced directly from the current initial frame rate to the preset frame rate, for example directly to 25 FPS. Alternatively, frame rates of a plurality of gears may be set and the frame rate reduced gear by gear; for example, with four preset gears of 25 FPS, 35 FPS, 45 FPS and 55 FPS, the frame rate is reduced to 55 FPS the first time, to 45 FPS the second time, and so on, as sketched below. Of course, the frame rate may be reduced in other manners and set according to actual requirements in practical applications, which is not limited by the embodiment of the present invention.
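A minimal sketch of the gear-by-gear reduction, assuming an initial frame rate of 60 FPS and the four example gears above; all names are illustrative:

    FPS_GEARS = [55, 45, 35, 25]  # descending gears; 25 FPS is the floor

    class FrameRateController:
        def __init__(self, initial_fps: int = 60):
            self.current_fps = initial_fps
            self._next_gear = 0  # index of the gear used on the next step down

        def step_down(self) -> int:
            # Reduce the frame rate by one gear; never go below the last gear.
            if self._next_gear < len(FPS_GEARS):
                self.current_fps = FPS_GEARS[self._next_gear]
                self._next_gear += 1
            return self.current_fps

        def restore(self, initial_fps: int = 60) -> int:
            # Restore the initial frame rate once the network state is normal.
            self.current_fps = initial_fps
            self._next_gear = 0
            return self.current_fps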
It should be noted that the preset frame rate must not be lower than 25 FPS; otherwise, the video stream cannot be played smoothly.
Further, in the embodiment of the present invention, resolutions and code rates of a plurality of gears are preset, including four gears: 4K (25 Mbps, 15 Mbps, 10 Mbps), 1080P (10 Mbps, 5 Mbps, 3 Mbps), 720P (3 Mbps, 1.5 Mbps, 1 Mbps) and 480P (1 Mbps, 800 Kbps, 600 Kbps). When the resolution and code rate are reduced, they may likewise be reduced gear by gear in this order at fixed intervals, as sketched below. The lowest gear is 480P/600 Kbps; below it the video stream is not clear during playback. Of course, in addition to these four gears, resolutions and code rates of other gears (not lower than the lowest gear) may be set, and in practical application the setting may be performed according to actual requirements, which is not limited by the embodiment of the present invention.
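The following sketch illustrates one plausible reading of this ladder, stepping through the code rates within a resolution gear before dropping to the next resolution; this ordering and the Python names are assumptions, not part of the specification:

    # Each entry: (resolution, code rates in Kbps, highest first).
    LADDER = [
        ("4K",    [25_000, 15_000, 10_000]),
        ("1080P", [10_000,  5_000,  3_000]),
        ("720P",  [ 3_000,  1_500,  1_000]),
        ("480P",  [ 1_000,    800,    600]),
    ]

    def step_down(res_idx: int, rate_idx: int) -> tuple[int, int]:
        # Return the ladder indices one step lower, clamped at 480P/600 Kbps.
        _, rates = LADDER[res_idx]
        if rate_idx + 1 < len(rates):
            return res_idx, rate_idx + 1   # lower code rate, same resolution
        if res_idx + 1 < len(LADDER):
            return res_idx + 1, 0          # drop to the next resolution gear
        return res_idx, rate_idx           # already at the 480P/600 Kbps floor

    # Example: from 1080P/5 Mbps one step lands at 1080P/3 Mbps,
    # and the next step at 720P/3 Mbps.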
And step S606, when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently executed, stopping executing the target processing strategy.
When it is determined based on the feedback information that the network state of the terminal is normal and the target processing strategy is currently being executed, execution of the target processing strategy can be stopped immediately, and the frame rate, resolution and code rate of the video data acquired in real time can be restored.
Stopping execution of the target processing strategy when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently being executed includes:
when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently being executed, increasing the frame rate of the video data acquired in real time to the initial frame rate, and increasing the video resolution and the code rate adopted for generating the video stream to the initial video resolution and the initial code rate respectively.
Specifically, when the time difference value in the feedback information belongs to the second preset interval, it is determined that the APP in the terminal is decoding normally; if a target processing strategy is currently being executed, the frame rate of the video data acquired in real time is increased to the initial frame rate, and the video resolution and the code rate adopted for generating the video stream are increased to the initial video resolution and the initial code rate respectively. The initial frame rate, initial resolution and initial code rate may be preset by a technician; that is, when the network state of the terminal is normal, these are the values used by default. Of course, in practical application the initial frame rate, initial resolution and initial code rate may be set according to actual requirements, which is not limited by the embodiment of the present invention.
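A hypothetical sketch of this recovery path (the default values and names are illustrative): the server tracks the initial parameters and restores them once the reported time difference value falls back into the second preset interval:

    class VideoParams:
        """Stub holding the encoder parameters that the strategies adjust."""
        def __init__(self, fps=60, resolution="1080P", bitrate_kbps=10_000):
            # initial values preset by a technician for a normal network state
            self.initial = (fps, resolution, bitrate_kbps)
            self.current = self.initial

        def degraded(self) -> bool:
            return self.current != self.initial

        def restore(self) -> None:
            # step S606: raise frame rate, resolution and code rate back
            self.current = self.initial

    def on_feedback(params: VideoParams, time_diff_ms: float) -> None:
        # second preset interval [-50, 50): decoding is normal again
        if -50 <= time_diff_ms < 50 and params.degraded():
            params.restore()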
In the embodiment of the invention, a server respectively acquires audio data and video data of multimedia in real time, independently encapsulates the audio data to generate an audio stream and independently encapsulates the video data to generate a video stream, and independently transmits the audio stream to a terminal through an audio transmission channel and the video stream through a video transmission channel, so that the terminal independently plays the audio stream and independently renders the video stream; the audio transmission channel and the video transmission channel are independent of each other. In this way, the audio stream and the video stream are independent of and do not affect each other during collection, encapsulation and transmission, which avoids both the synchronization-waiting delay incurred in the prior art by cross-sorting the audio and video queues to output the streams synchronously, and the increase in audio transmission delay caused by the large code rate of the video stream; in addition, the terminal processes the audio stream and the video stream independently of each other, so no synchronous waiting operation is needed, avoiding the delay generated in the prior art when a jitter buffer is used to manage the audio and video streams uniformly.
Further, the server can receive feedback information returned by the terminal in real time, determine based on the feedback information whether the audio and picture have fallen out of sync because of an abnormal network state at the terminal, and if so, determine a target processing strategy from at least two preset processing strategies based on the feedback information and execute it, thereby ensuring audio-video synchronization. With the audio stream and the video stream kept independent of each other throughout collection, encapsulation, transmission, decoding and rendering (or playing), the embodiment of the invention can save 50 ms or more of overall delay, an effect that is especially pronounced when the terminal is on a weak or congested network.
A method for processing multimedia may be performed by the first device in the application environment described above; as shown in fig. 7, the method includes:
step S701, respectively and independently receiving an audio stream and a video stream sent by a server;
specifically, the terminal independently receives the audio stream transmitted by the server through an audio transmission channel, and independently receives the video stream transmitted by the server through a video transmission channel. That is, the terminal's process of receiving the audio stream and its process of receiving the video stream are independent of each other and do not affect each other.
Step S702, independently decoding an audio stream to obtain at least one frame of audio frame, and independently decoding a video stream to obtain at least one frame of video frame;
specifically, the APP in the terminal independently decodes the audio stream based on its DTS to obtain at least one audio frame, and independently decodes the video stream based on its DTS to obtain at least one video frame. That is, the process of decoding the audio stream and the process of decoding the video stream are independent of each other and do not affect each other.
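To make this mutual independence concrete, here is a minimal sketch (decode_one stands in for a real codec call; all names are assumptions) in which each stream is decoded on its own thread, so a stall in one stream never blocks the other:

    import threading
    import queue

    audio_frames = queue.Queue()  # decoded audio frames, each carrying its PTS
    video_frames = queue.Queue()  # decoded video frames, each carrying its PTS

    def decode_loop(packets, decode_one, out_queue):
        # Packets arrive ordered by DTS; each decoded frame is queued as it lands.
        for packet in packets:
            out_queue.put(decode_one(packet))

    def start_decoding(audio_packets, video_packets, decode_audio, decode_video):
        # Two independent threads: neither ever waits on the other.
        threading.Thread(target=decode_loop,
                         args=(audio_packets, decode_audio, audio_frames)).start()
        threading.Thread(target=decode_loop,
                         args=(video_packets, decode_video, video_frames)).start()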
Step S703, determining the latest target audio frame from at least one frame of audio frames, determining the latest target video frame from at least one frame of video frames, and generating feedback information based on the target audio frame and the target video frame;
after at least one frame of audio frame and at least one frame of video frame are obtained through decoding, the latest target audio frame is determined based on the PTS of each audio frame, the latest target video frame is determined based on the PTS of each video frame, and feedback information is generated based on the PTS of the target audio frame and the PTS of the target video frame.
In a preferred embodiment of the present invention, determining a latest target audio frame from at least one audio frame, and determining a latest target video frame from at least one video frame, and generating feedback information based on the target audio frame and the target video frame, includes:
Acquiring display time stamps corresponding to at least one frame of audio frames respectively, taking the audio frame corresponding to the latest display time stamp as a target audio frame, acquiring the display time stamp corresponding to at least one frame of video frames respectively, and taking the video frame corresponding to the latest display time stamp as a target video frame;
and calculating a time difference value of the display time stamp of the target audio frame and the target video frame, and generating feedback information based on the time difference value.
Specifically, the PTS corresponding to each audio frame is acquired and the audio frame with the latest PTS is taken as the target audio frame, its PTS being denoted T1; likewise, the PTS corresponding to each video frame is acquired and the video frame with the latest PTS is taken as the target video frame, its PTS being denoted T2. The time difference value is then computed as T1 - T2, and feedback information is generated based on this time difference value.
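As a minimal sketch of this computation (frames are modeled as (pts_ms, payload) pairs, and at least one frame per stream is assumed, per steps S702-S703; the names are illustrative):

    def make_feedback(audio_frames, video_frames) -> dict:
        t1 = max(pts for pts, _ in audio_frames)  # latest audio PTS (T1)
        t2 = max(pts for pts, _ in video_frames)  # latest video PTS (T2)
        return {"time_diff": t1 - t2}             # returned to the server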
Step S704, playing the at least one audio frame independently, rendering the at least one video frame independently, and sending feedback information to the server.
Specifically, the APP in the terminal plays each audio frame independently in real time, renders each video frame independently in real time, and sends the feedback information to the server in real time.
It should be noted that, because the APP in the terminal decodes in real time, the time difference value is also generated and returned to the server in real time, so the server can determine in real time, based on the time difference value, whether decoding in the APP has become abnormal.
In the embodiment of the invention, the terminal independently receives the audio stream and the video stream sent by the server, independently decodes the audio stream to obtain at least one audio frame and the video stream to obtain at least one video frame, determines the latest target audio frame from the audio frames and the latest target video frame from the video frames, generates feedback information based on the target audio frame and the target video frame, independently plays the audio frames, independently renders the video frames, and sends the feedback information to the server. In this way, the terminal generates feedback information in real time while independently decoding the two streams and returns it to the server in real time, so that the server can determine based on the feedback information whether the audio and picture have fallen out of sync because of an abnormal network state at the terminal, and if so, determine a target processing strategy from at least two preset processing strategies based on the feedback information and execute it, thereby ensuring audio-video synchronization. With the audio stream and the video stream kept independent of each other throughout collection, encapsulation, transmission, decoding and rendering (or playing), the embodiment of the invention can save 50 ms or more of overall delay, an effect that is especially pronounced when the terminal is on a weak or congested network.
Fig. 8 is a schematic structural diagram of a multimedia processing apparatus according to another embodiment of the present application, and as shown in fig. 8, the apparatus of this embodiment may include:
an acquisition module 801, configured to acquire audio data and video data of multimedia in real time, respectively;
a first generating module 802, configured to independently encapsulate audio data to generate an audio stream, and independently encapsulate video data to generate a video stream;
a first sending module 803, configured to send an audio stream to a terminal independently through an audio transmission channel, and send a video stream to the terminal independently through a video transmission channel, so that the terminal independently performs a playing process on the audio stream, and independently performs a rendering process on the video stream; the audio transmission channel and the video transmission channel are independent of each other.
In a preferred embodiment of the present invention, the first generating module includes:
the first encoding submodule is used for independently encoding the audio data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded audio data;
and the first packaging submodule is used for independently packaging the encoded audio data by adopting an audio container format and an audio transmission protocol format appointed by the terminal to obtain an audio stream.
In a preferred embodiment of the present invention, the first generating module includes:
the second coding submodule is used for independently coding the video data from a time node with a decoding time stamp and a display time stamp as starting points to obtain coded video data;
and the second encapsulation submodule is used for independently encapsulating the coded video data by adopting a video container format and a video transmission protocol format appointed by the terminal to obtain a video stream.
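One way to picture the shared starting time node used by both encoding submodules is the following sketch (the millisecond granularity and the class name are assumptions): both streams stamp their frames from the same origin, which is what later makes the terminal's comparison of the latest PTS of each stream meaningful:

    import time

    class StreamPackager:
        """Stamps frames with DTS/PTS measured from a shared starting node."""
        def __init__(self, start_node: float):
            self.start_node = start_node

        def timestamp_ms(self, capture_time: float) -> int:
            # Both streams count their DTS/PTS from the same origin.
            return int((capture_time - self.start_node) * 1000)

    start_node = time.time()                     # common time node for both streams
    audio_packager = StreamPackager(start_node)  # feeds the audio channel
    video_packager = StreamPackager(start_node)  # feeds the video channel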
In a preferred embodiment of the present invention, the apparatus further comprises:
the first receiving module is used for receiving feedback information returned by the terminal in real time;
the execution module is used for determining a target processing strategy from at least two preset processing strategies based on the feedback information when the network state of the terminal is abnormal based on the feedback information, and executing the target processing strategy; the at least two preset processing strategies comprise an audio processing strategy and a video processing strategy.
In a preferred embodiment of the present invention, the execution module is specifically configured to:
when the target processing strategy is an audio processing strategy, performing fault location on an audio acquisition device for acquiring audio data in real time and an application program for packaging the audio data, and generating alarm information; and displaying the alarm information.
In a preferred embodiment of the present invention, the execution module is specifically configured to:
when the target processing strategy is a video processing strategy, reducing the frame rate of video data acquired in real time to a preset frame rate, and reducing the video resolution and the code rate adopted for generating the video stream to the preset video resolution and the preset code rate respectively.
In a preferred embodiment of the invention, the execution module is further configured to:
and stopping executing the target processing strategy when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently executed.
In a preferred embodiment of the present invention, the execution module is specifically configured to:
when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently executed, the frame rate of the video data acquired in real time is increased to the initial frame rate, and the video resolution and the code rate adopted for generating the video stream are respectively increased to the initial video resolution and the initial code rate.
The multimedia processing apparatus of the present embodiment may execute the multimedia processing method shown in the first embodiment of the present application, and the implementation principle is similar, and will not be described herein.
In the embodiment of the invention, a server respectively acquires audio data and video data of multimedia in real time, independently encapsulates the audio data to generate an audio stream and independently encapsulates the video data to generate a video stream, and independently transmits the audio stream to a terminal through an audio transmission channel and the video stream through a video transmission channel, so that the terminal independently plays the audio stream and independently renders the video stream; the audio transmission channel and the video transmission channel are independent of each other. In this way, the audio stream and the video stream are independent of and do not affect each other during collection, encapsulation and transmission, which avoids both the synchronization-waiting delay incurred in the prior art by cross-sorting the audio and video queues to output the streams synchronously, and the increase in audio transmission delay caused by the large code rate of the video stream; in addition, the terminal processes the audio stream and the video stream independently of each other, so no synchronous waiting operation is needed, avoiding the delay generated in the prior art when a jitter buffer is used to manage the audio and video streams uniformly.
Further, the server can receive feedback information returned by the terminal in real time, determine based on the feedback information whether the audio and picture have fallen out of sync because of an abnormal network state at the terminal, and if so, determine a target processing strategy from at least two preset processing strategies based on the feedback information and execute it, thereby ensuring audio-video synchronization. With the audio stream and the video stream kept independent of each other throughout collection, encapsulation, transmission, decoding and rendering (or playing), the embodiment of the invention can save 50 ms or more of overall delay, an effect that is especially pronounced when the terminal is on a weak or congested network.
Fig. 9 is a schematic structural diagram of a multimedia processing apparatus according to another embodiment of the present application, and as shown in fig. 9, the apparatus of this embodiment may include:
a second receiving module 901, configured to independently receive an audio stream and a video stream sent by a server, respectively;
a decoding module 902, configured to independently decode the audio stream to obtain at least one frame of audio frame, and independently decode the video stream to obtain at least one frame of video frame;
a determining module 903, configured to determine a latest target audio frame from at least one audio frame, and determine a latest target video frame from at least one video frame;
A second generating module 904, configured to generate feedback information based on the target audio frame and the target video frame;
the playing module 905 is configured to independently play at least one frame of audio frame and independently render at least one frame of video frame;
and a second sending module 906, configured to send the feedback information to the server.
In a preferred embodiment of the present invention, the determining module is specifically configured to:
acquiring display time stamps corresponding to at least one frame of audio frames respectively, taking the audio frame corresponding to the latest display time stamp as a target audio frame, acquiring the display time stamp corresponding to at least one frame of video frames respectively, and taking the video frame corresponding to the latest display time stamp as a target video frame;
the second generating module is specifically configured to:
and calculating a time difference value of the display time stamp of the target audio frame and the target video frame, and generating feedback information based on the time difference value.
The multimedia processing apparatus of the present embodiment may execute the multimedia processing method shown in the first embodiment of the present application, and the implementation principle is similar, and will not be described herein.
In the embodiment of the invention, the terminal independently receives the audio stream and the video stream sent by the server, independently decodes the audio stream to obtain at least one audio frame and the video stream to obtain at least one video frame, determines the latest target audio frame from the audio frames and the latest target video frame from the video frames, generates feedback information based on the target audio frame and the target video frame, independently plays the audio frames, independently renders the video frames, and sends the feedback information to the server. In this way, the terminal generates feedback information in real time while independently decoding the two streams and returns it to the server in real time, so that the server can determine based on the feedback information whether the APP in the terminal has fallen out of audio-video sync due to an abnormal network state, and if so, determine a target processing strategy from at least two preset processing strategies based on the feedback information and execute it, thereby ensuring audio-video synchronization. With the audio stream and the video stream kept independent of each other throughout collection, encapsulation, transmission, decoding and rendering (or playing), the embodiment of the invention can save 50 ms or more of overall delay, an effect that is especially pronounced when the terminal is on a weak or congested network.
In yet another embodiment of the present application, a server is provided, the server including a memory and a processor, the memory storing at least one program for execution by the processor. When executed by the processor, the program performs the following: the server respectively acquires audio data and video data of multimedia in real time, independently encapsulates the audio data to generate an audio stream and independently encapsulates the video data to generate a video stream, and independently transmits the audio stream to a terminal through an audio transmission channel and the video stream through a video transmission channel, so that the terminal independently plays the audio stream and independently renders the video stream; the audio transmission channel and the video transmission channel are independent of each other. In this way, the audio stream and the video stream are independent of and do not affect each other during collection, encapsulation and transmission, which avoids both the synchronization-waiting delay incurred in the prior art by cross-sorting the audio and video queues to output the streams synchronously, and the increase in audio transmission delay caused by the large code rate of the video stream; in addition, the terminal processes the audio stream and the video stream independently of each other, so no synchronous waiting operation is needed, avoiding the delay generated in the prior art when a jitter buffer is used to manage the audio and video streams uniformly.
In an alternative embodiment, a server is provided, as shown in fig. 10, where the server 10000 shown in fig. 10 includes: a processor 10001 and a memory 10003. Wherein the processor 10001 is coupled to the memory 10003, such as via a bus 10002. Optionally, the server 10000 may further comprise a transceiver 10004. It should be noted that, in practical applications, the transceiver 10004 is not limited to one, and the structure of the server 10000 is not limited to the embodiment of the present application.
The processor 10001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 10001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or of a DSP and a microprocessor.
Bus 10002 may include a pathway for transferring information between the aforementioned components. Bus 10002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean there is only one bus or one type of bus.
The memory 10003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 10003 is used for storing application program codes for executing the embodiments of the present application, and the processor 10001 controls the execution. The processor 10001 is configured to execute application code stored in the memory 10003 to implement what is shown in any of the method embodiments described above.
Yet another embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the corresponding content of the foregoing method embodiments. Compared with the prior art, in the embodiment of the invention the server respectively acquires audio data and video data of multimedia in real time, independently encapsulates the audio data to generate an audio stream and independently encapsulates the video data to generate a video stream, and independently transmits the audio stream to a terminal through an audio transmission channel and the video stream through a video transmission channel, so that the terminal independently plays the audio stream and independently renders the video stream; the audio transmission channel and the video transmission channel are independent of each other. In this way, the audio stream and the video stream are independent of and do not affect each other during collection, encapsulation and transmission, which avoids both the synchronization-waiting delay incurred in the prior art by cross-sorting the audio and video queues to output the streams synchronously, and the increase in audio transmission delay caused by the large code rate of the video stream; in addition, the terminal processes the audio stream and the video stream independently of each other, so no synchronous waiting operation is needed, avoiding the delay generated in the prior art when a jitter buffer is used to manage the audio and video streams uniformly.
In yet another embodiment of the present application, an electronic device is provided, the electronic device including a memory and a processor, the memory storing at least one program for execution by the processor. When executed by the processor, the program performs the following: the terminal independently receives the audio stream and the video stream sent by the server, independently decodes the audio stream to obtain at least one audio frame and the video stream to obtain at least one video frame, determines the latest target audio frame from the audio frames and the latest target video frame from the video frames, generates feedback information based on the target audio frame and the target video frame, independently plays the audio frames, independently renders the video frames, and sends the feedback information to the server. In this way, the terminal generates feedback information in real time while independently decoding the two streams and returns it to the server in real time, so that the server can determine based on the feedback information whether the audio and picture have fallen out of sync because of an abnormal network state at the terminal, and if so, determine a target processing strategy from at least two preset processing strategies based on the feedback information and execute it, thereby ensuring audio-video synchronization. With the audio stream and the video stream kept independent of each other throughout collection, encapsulation, transmission, decoding and rendering (or playing), the embodiment of the invention can save 50 ms or more of overall delay, an effect that is especially pronounced when the terminal is on a weak or congested network.
In an alternative embodiment, an electronic device is provided, as shown in fig. 11, the electronic device 11000 shown in fig. 11 includes: a processor 11001 and a memory 11003. In which a processor 11001 is coupled to a memory 11003, such as via a bus 11002. Optionally, the electronic device 11000 may also include a transceiver 11004. Note that, in practical applications, the transceiver 11004 is not limited to one, and the structure of the electronic device 11000 is not limited to the embodiment of the present application.
The processor 11001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 11001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or of a DSP and a microprocessor.
Bus 11002 may include a pathway for transferring information between the aforementioned components. Bus 11002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean there is only one bus or one type of bus.
The memory 11003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 11003 is used for storing the application program code for executing the present application, and its execution is controlled by the processor 11001. The processor 11001 is configured to execute the application program code stored in the memory 11003 to implement what is shown in any of the method embodiments described above.
Electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
Yet another embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the corresponding content of the foregoing method embodiments. Compared with the prior art, in the embodiment of the invention the terminal independently receives the audio stream and the video stream sent by the server, independently decodes the audio stream to obtain at least one audio frame and the video stream to obtain at least one video frame, determines the latest target audio frame from the audio frames and the latest target video frame from the video frames, generates feedback information based on the target audio frame and the target video frame, independently plays the audio frames, independently renders the video frames, and sends the feedback information to the server. In this way, the terminal generates feedback information in real time while independently decoding the two streams and returns it to the server in real time, so that the server can determine based on the feedback information whether the audio and picture have fallen out of sync because of an abnormal network state at the terminal, and if so, determine a target processing strategy from at least two preset processing strategies based on the feedback information and execute it, thereby ensuring audio-video synchronization. With the audio stream and the video stream kept independent of each other throughout collection, encapsulation, transmission, decoding and rendering (or playing), the embodiment of the invention can save 50 ms or more of overall delay, an effect that is especially pronounced when the terminal is on a weak or congested network.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, their execution is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions such that the computer device performs:
Respectively acquiring audio data and video data of the multimedia in real time;
independently packaging the audio data to generate an audio stream, and independently packaging the video data to generate a video stream;
independently transmitting the audio stream to a terminal through an audio transmission channel, and independently transmitting the video stream to the terminal through a video transmission channel, so that the terminal independently performs play processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are independent of each other.
Preferably, the independently encapsulating the audio data to generate an audio stream includes:
independently encoding the audio data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded audio data;
and independently packaging the encoded audio data by adopting an audio container format and an audio transmission protocol format agreed with the terminal to obtain the audio stream.
Another computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions such that the computer device performs:
Respectively and independently receiving an audio stream and a video stream sent by a server;
independently decoding the audio stream to obtain at least one frame of audio frame, and independently decoding the video stream to obtain at least one frame of video frame;
determining the latest target audio frame from the at least one frame of audio frames, determining the latest target video frame from the at least one frame of video frames, and generating feedback information based on the target audio frame and the target video frame;
and independently playing the at least one frame of audio frame, independently rendering the at least one frame of video frame, and sending the feedback information to the server.

Claims (15)

1. A method for processing multimedia, applied to a server, comprising:
the audio data of the multimedia are independently collected in real time through the audio collection equipment, and the video data of the multimedia are independently collected in real time through the video collection equipment;
encoding the audio data, independently packaging the encoded audio data to generate an audio stream, encoding the video data, independently packaging the encoded video data to generate a video stream;
Independently transmitting the audio stream to a terminal through an audio transmission channel, and independently transmitting the video stream to the terminal through a video transmission channel, so that the terminal independently performs play processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are not related to each other; the audio stream and the video stream are not related to each other in the processes of collection, encapsulation and transmission, and the process of playing the audio stream by the terminal and the process of rendering the video stream by the terminal are not related to each other;
and receiving feedback information returned by the terminal in real time, and processing the video stream or the audio stream based on the feedback information so as to realize synchronous playing of the audio and the video.
2. The method of claim 1, wherein the encoding the audio data and independently encapsulating the encoded audio data to generate an audio stream comprises:
independently encoding the audio data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded audio data;
And independently packaging the encoded audio data by adopting an audio container format and an audio transmission protocol format agreed with the terminal to obtain the audio stream.
3. The method of claim 1, wherein encoding the video data and independently encapsulating the encoded video data to generate a video stream comprises:
independently encoding the video data from a time node with a decoding time stamp and a display time stamp as starting points to obtain encoded video data;
and independently packaging the coded video data by adopting a video container format and a video transmission protocol format agreed with the terminal to obtain the video stream.
4. A method of processing multimedia according to any one of claims 1-3, wherein said processing said video stream or said audio stream based on said feedback information comprises:
when the network state of the terminal is abnormal based on the feedback information, determining a target processing strategy from at least two preset processing strategies based on the feedback information, and executing the target processing strategy; the at least two preset processing strategies comprise an audio processing strategy and a video processing strategy.
5. The method according to claim 4, wherein determining a target processing policy from at least two preset processing policies based on the feedback information, and executing the target processing policy, comprises:
when the target processing strategy is an audio processing strategy, performing fault location on an audio acquisition device for acquiring audio data in real time and an application program for packaging the audio data, and generating alarm information;
and displaying the alarm information.
6. The method according to claim 4, wherein determining a target processing policy from at least two preset processing policies based on the feedback information, and executing the target processing policy, comprises:
when the target processing strategy is a video processing strategy, reducing the frame rate of video data acquired in real time to a preset frame rate, and reducing the video resolution and the code rate adopted for generating the video stream to the preset video resolution and the preset code rate respectively.
7. The method for processing multimedia according to claim 4, further comprising:
and stopping executing the target processing strategy when the network state of the terminal is determined to be normal based on the feedback information and the target processing strategy is currently executed.
8. The method according to claim 7, wherein stopping execution of the target processing policy when it is determined that the network state of the terminal is normal based on the feedback information and the target processing policy is currently being executed, comprises:
when the network state of the terminal is determined to be normal based on the feedback information and a target processing strategy is currently executed, the frame rate of video data acquired in real time is increased to an initial frame rate, and the video resolution and the code rate adopted for generating the video stream are respectively increased to an initial video resolution and an initial code rate.
9. A method for processing multimedia, comprising:
respectively and independently receiving an audio stream and a video stream sent by a server; the audio stream is generated by independently collecting multimedia audio data through an audio collecting device by the server, encoding the audio data and independently packaging the encoded audio data; the video stream is generated by independently collecting multimedia video data through video collecting equipment by the server, encoding the video data and independently packaging the encoded video data;
Independently decoding the audio stream to obtain at least one frame of audio frame, and independently decoding the video stream to obtain at least one frame of video frame;
determining the latest target audio frame from the at least one frame of audio frames, determining the latest target video frame from the at least one frame of video frames, and generating feedback information based on the target audio frame and the target video frame;
independently playing the at least one frame of audio frame, independently rendering the at least one frame of video frame, and sending the feedback information to the server; wherein, the process of playing the audio stream and the process of rendering the video stream are not related to each other; the feedback information is used for enabling the server to process the video stream or the audio stream so as to realize audio and video synchronous playing.
10. The method of claim 9, wherein determining the latest target audio frame from the at least one audio frame, and determining the latest target video frame from the at least one video frame, and generating feedback information based on the target audio frame and the target video frame, comprises:
Acquiring display time stamps corresponding to the at least one frame of audio frames respectively, taking the audio frame corresponding to the latest display time stamp as a target audio frame, acquiring the display time stamp corresponding to the at least one frame of video frames respectively, and taking the video frame corresponding to the latest display time stamp as a target video frame;
and calculating a time difference value of the display time stamp of the target audio frame and the target video frame, and generating feedback information based on the time difference value.
11. A multimedia processing apparatus, comprising:
the acquisition module is used for independently acquiring the audio data of the multimedia in real time through the audio acquisition equipment and independently acquiring the video data of the multimedia in real time through the video acquisition equipment;
the first generation module is used for encoding the audio data, independently packaging the encoded audio data to generate an audio stream, encoding the video data, independently packaging the encoded video data to generate a video stream;
a first sending module, configured to independently send the audio stream to a terminal through an audio transmission channel, and independently send the video stream to the terminal through a video transmission channel, so that the terminal independently performs play processing on the audio stream, and independently performs rendering processing on the video stream; the audio transmission channel and the video transmission channel are not related to each other; the audio stream and the video stream are not related to each other in the processes of collection, encapsulation and transmission, and the process of playing the audio stream by the terminal and the process of rendering the video stream by the terminal are not related to each other;
The first receiving module is used for receiving feedback information returned by the terminal in real time, and processing the video stream or the audio stream based on the feedback information so as to realize synchronous playing of the audio and the video.
12. A multimedia processing apparatus, comprising:
the second receiving module is used for independently receiving the audio stream and the video stream sent by the server respectively; the audio stream is generated by independently collecting multimedia audio data through an audio collecting device by the server, encoding the audio data and independently packaging the encoded audio data; the video stream is generated by independently collecting multimedia video data through video collecting equipment by the server, encoding the video data and independently packaging the encoded video data;
the decoding module is used for independently decoding the audio stream to obtain at least one frame of audio frame, and independently decoding the video stream to obtain at least one frame of video frame;
the determining module is used for determining the latest target audio frame from the at least one frame of audio frames and determining the latest target video frame from the at least one frame of video frames;
The second generation module is used for generating feedback information based on the target audio frame and the target video frame;
the playing module is used for independently playing the at least one frame of audio frame and independently rendering the at least one frame of video frame; wherein, the process of playing the audio stream and the process of rendering the video stream are not related to each other;
the second sending module is used for sending the feedback information to the server; the feedback information is used for enabling the server to process the video stream or the audio stream so as to realize audio and video synchronous playing.
13. A server, comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to execute the multimedia processing method according to any one of claims 1 to 8 by invoking the operation instruction.
14. An electronic device, comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
The memory is used for storing operation instructions;
the processor is configured to execute the multimedia processing method according to any one of the preceding claims 9-10 by invoking the operation instruction.
15. A computer readable storage medium for storing computer instructions which, when run on a computer, cause the computer to perform the method of processing multimedia according to any one of the preceding claims 1-8 or 9-10.
CN202011334114.1A 2020-11-24 2020-11-24 Multimedia processing method, device, server and computer readable storage medium Active CN114554277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011334114.1A CN114554277B (en) 2020-11-24 2020-11-24 Multimedia processing method, device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011334114.1A CN114554277B (en) 2020-11-24 2020-11-24 Multimedia processing method, device, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114554277A CN114554277A (en) 2022-05-27
CN114554277B true CN114554277B (en) 2024-02-09

Family

ID=81659821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011334114.1A Active CN114554277B (en) 2020-11-24 2020-11-24 Multimedia processing method, device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114554277B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278382B (en) * 2022-06-29 2024-06-18 北京捷通华声科技股份有限公司 Video clip determining method and device based on audio clip
CN118059469A (en) * 2022-11-22 2024-05-24 华为技术有限公司 Game data transmission and processing method, server and terminal equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102752667A (en) * 2012-07-17 2012-10-24 王加俊 Multi-stream media live broadcast interaction system and live broadcast interaction method
CN103888815A (en) * 2014-03-13 2014-06-25 广州市花都区中山大学国光电子与通信研究院 Method and system for real-time separation treatment and synchronization of audio and video streams
WO2015002586A1 (en) * 2013-07-04 2015-01-08 Telefonaktiebolaget L M Ericsson (Publ) Audio and video synchronization
CN107566889A (en) * 2017-09-15 2018-01-09 深圳国微技术有限公司 Audio stream flow rate error processing method, device, computer installation and computer-readable recording medium
CN108650550A (en) * 2018-07-05 2018-10-12 平安科技(深圳)有限公司 Network transmission quality analysis method, device, computer equipment and storage medium
CN109286601A (en) * 2017-07-20 2019-01-29 腾讯科技(深圳)有限公司 A kind of processing method and terminal, computer storage medium of medium stream information
CN110393921A (en) * 2019-08-08 2019-11-01 腾讯科技(深圳)有限公司 Processing method, device, terminal, server and the storage medium of cloud game
CN111093108A (en) * 2019-12-18 2020-05-01 广州酷狗计算机科技有限公司 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium
CN111918093A (en) * 2020-08-13 2020-11-10 腾讯科技(深圳)有限公司 Live broadcast data processing method and device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104754366A (en) * 2015-03-03 2015-07-01 腾讯科技(深圳)有限公司 Audio and video file live broadcasting method, device and system
US20170006331A1 (en) * 2015-06-30 2017-01-05 Stmicroelectronics International N.V. Synchronized rendering of split multimedia content on network clients
US10177958B2 (en) * 2017-02-07 2019-01-08 Da Sheng Inc. Method for synchronously taking audio and video in order to proceed one-to-multi multimedia stream
US10349122B2 (en) * 2017-12-11 2019-07-09 Sony Corporation Accessibility for the hearing-impaired using keyword to establish audio settings

Also Published As

Publication number Publication date
CN114554277A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN107846633B (en) Live broadcast method and system
CN110476431B (en) Low latency mobile device audiovisual streaming
JP7110234B2 (en) Video playback methods, devices and systems
CN106664458B (en) Method for transmitting video data, source device and storage medium
CN105453580B (en) Method of reseptance, sending method, reception device and sending device
JP2014511621A (en) Method and apparatus for display switching
CN104602133A (en) Multimedia file shearing method and terminal as well as server
EP2732633A1 (en) Wireless 3d streaming server
JP2019135828A (en) Method for transmitting video frames from video stream to display and corresponding device
CN114554277B (en) Multimedia processing method, device, server and computer readable storage medium
CN114546308B (en) Method, device, equipment and storage medium for screen projection of application interface
JP2008500752A (en) Adaptive decoding of video data
US12035020B2 (en) Split rendering of extended reality data over 5G networks
CN112367542A (en) Terminal playing system and method for mirror image screen projection
CN108282685A (en) A kind of method and monitoring system of audio-visual synchronization
CN113938470A (en) Method and device for playing RTSP data source by browser and streaming media server
CN108882010A (en) A kind of method and system that multi-screen plays
CN113301359A (en) Audio and video processing method and device and electronic equipment
Tang et al. Audio and video mixing method to enhance WebRTC
CN113973224B (en) Media information transmission method, computing device and storage medium
WO2022116822A1 (en) Data processing method and apparatus for immersive media, and computer-readable storage medium
CN115209208B (en) Video cyclic playing processing method and device
CN114257771A (en) Video playback method and device for multi-channel audio and video, storage medium and electronic equipment
WO2022241119A1 (en) Split rendering of extended reality data over 5g networks
CN116264619A (en) Resource processing method, device, server, terminal, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40070964

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant