CN111866542B

CN111866542B - Audio signal processing method, multimedia information processing device and electronic equipment

Info

Publication number: CN111866542B
Application number: CN201910364898.3A
Authority: CN
Inventors: 杜正中
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2022-11-04
Anticipated expiration: 2039-04-30
Also published as: CN111866542A

Abstract

The invention provides an audio signal processing method, which comprises the following steps: acquiring an initialization parameter of a target audio signal; determining the initial position and the end position of slicing the target audio signal according to the initialization parameters of the target audio signal; slicing the target audio signal according to the starting position and the ending position to form a slice of the target audio signal; respectively encoding each slice of the target audio signal; and merging the coding results of all the slices of the target audio signal to obtain a code stream corresponding to the target audio signal. The invention also provides a multimedia information processing method, an audio signal processing device and a storage medium. The invention can accurately process the target audio signal to be transcoded to form a lossless code stream.

Description

Audio signal processing method, multimedia information processing device and electronic equipment

Technical Field

The present invention relates to audio technologies, and in particular, to an audio signal processing method, a multimedia information processing apparatus, and an electronic device.

Background

When transcoding uploaded video information, a video server needs to transcode the corresponding audio files to adapt to different network environments or terminal types. Although existing gapless playback solutions can play a plurality of audio files back to back without gaps, they do not support coding formats commonly used for streaming media, such as Advanced Audio Coding (AAC). Moreover, a gapless playback scheme records the lengths of the leading and trailing silence in metadata, which increases the size of the code stream, so the existing schemes are not well suited to adapting to different terminals.

Disclosure of Invention

The embodiment of the invention provides an audio signal processing method, a multimedia information processing device and electronic equipment, which can accurately process a target audio signal needing transcoding to form a lossless code stream.

The technical scheme of the embodiment of the invention is realized as follows:

the embodiment of the invention provides an audio signal processing method, which comprises the following steps:

acquiring initialization parameters of a target audio signal;

determining the starting position and the ending position of slicing the target audio signal according to the initialization parameter of the target audio signal;

slicing the target audio signal according to the starting position and the ending position to form a slice of the target audio signal;

respectively encoding each slice of the target audio signal;

and merging the coding results of all the slices of the target audio signal to obtain a code stream corresponding to the target audio signal.

The embodiment of the invention also provides a multimedia information processing method, which comprises the following steps:

separating a target audio signal and a target video signal from multimedia information;

slicing the target audio signal according to the starting position and the ending position of the slice of the target audio signal to form the slice of the target audio signal;

respectively coding each slice of the target audio signal, and combining coding results to obtain a code stream corresponding to the target audio signal;

and packaging the obtained code stream corresponding to the target audio signal and the code stream of the target video signal into new multimedia information.

An embodiment of the present invention further provides an audio signal processing apparatus, including:

the signal acquisition module is used for acquiring initialization parameters of the target audio signal;

the signal slicing module is used for determining the initial position and the end position of slicing the target audio signal according to the initialization parameter of the target audio signal;

the signal slicing module is configured to perform slicing processing on the target audio signal according to the starting position and the ending position to form a slice of the target audio signal;

the signal coding module is used for respectively coding each slice of the target audio signal;

and the signal merging module is used for merging the coding results of all the slices of the target audio signal to obtain a code stream corresponding to the target audio signal.

In the above scheme,

the signal slicing module is used for determining the number of sampling points of the target audio signal slice according to the product of the initialized slicing length and the coding frame length;

the signal slicing module is used for performing a remainder (modulo) operation on the initialized coding delay with respect to the coding frame length, and determining the remainder as the delay compensation of the target audio signal slice;

the signal slicing module is used for determining a first parameter according to the product of the initialized slice overlapping length and the coding frame length, and determining a second parameter as the remainder of the difference between the coding frame length and the delay compensation, taken modulo the coding frame length;

and the signal slicing module is used for determining the number of slice overlapping samples of the target audio signal slice according to the sum of the first parameter and the second parameter.

In the above scheme,

the signal encoding module is configured to extract any slice from each slice of the target audio signal;

the signal coding module is used for adding a segmented preset compensation amount at the head of any slice to form a new audio signal, so that the sum of the segmented preset compensation amount of the new audio signal and the first compensation carried by the target audio signal is an integral multiple of the length of the audio coding standard frame;

and the signal coding module is used for performing windowing processing on each frame of audio signal in the new audio signal to form coding results of different slices of the target audio signal.

In the above scheme,

and the signal coding module is used for adding a second compensation to the last slice of the target audio signal after the last frame of the slice to form a new audio signal, so that the sum of the preset compensation amount of the section of the new audio signal, the first compensation carried by the target audio signal and the second compensation is an integral multiple of the length of the audio coding standard frame.

In the above scheme,

the signal encoding module is configured to perform multi-thread encoding processing on each slice of the target audio signal according to the number of slices of the target audio signal; or,

the signal encoding module is configured to perform distributed encoding processing on different slices of the target audio signal.
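The multi-thread variant described above can be sketched as follows. This is only an illustration: `encode_slice` is a hypothetical stand-in for a real encoder call, and the choice of one worker per slice is an assumption, since the text only says that the threading follows the number of slices.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_slice(samples):
    """Hypothetical placeholder encoder: tags the samples instead of compressing them."""
    return ("encoded", tuple(samples))


def encode_slices_parallel(slices, max_workers=None):
    """Encode every slice on a thread pool and keep the original slice order."""
    if not slices:
        return []
    # Assumption: default to one worker per slice.
    workers = max_workers or len(slices)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, which the later merge relies on.
        return list(pool.map(encode_slice, slices))
```

Because the results come back in slice order, the merging step can consume them directly without re-sorting.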

In the above scheme,

the signal merging module is used for deleting the overlapped frames among the slices;

and the signal merging module is used for merging the slices from which the overlapped frames are deleted according to the positions of the different slices to form a code stream corresponding to the target audio signal.

In the above scheme,

the signal merging module is used for deleting the overlapped frame of the first slice of the target audio signal, and the deleted frame number is a first frame number which is one half of the slice overlapping length;

the signal merging module is used for deleting the overlapped frame of the last slice of the target audio signal, the deleted frame number is the sum of a second frame number and the first frame number, and the second frame number is the minimum positive integer which is greater than or equal to the ratio of the coding time delay to the coding frame length;

the signal merging module is configured to delete an overlapped frame of a middle slice of the target audio signal, where the deleted frame number is a sum of a third frame number and the second frame number, the third frame number is a slice overlapping length, and the middle slice is a slice between the first slice and the last slice.
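The three frame counts above can be written out as a small sketch. The function name and the string labels for the slice position are illustrative, and the sketch assumes the slice overlapping length M is even so that one half of it is a whole number of frames; the text does not state this.

```python
import math


def deleted_frame_count(position, overlap_frames, delay, frame_length):
    """Number of overlapped frames removed from a slice before merging.

    position: "first", "middle" or "last" (illustrative labels).
    overlap_frames: slice overlapping length M, in frames (assumed even).
    delay: coding delay in samples; frame_length: coding frame length in samples.
    """
    first_number = overlap_frames // 2                # one half of the overlap length
    second_number = math.ceil(delay / frame_length)   # smallest integer >= delay / frame_length
    third_number = overlap_frames                     # the full overlap length
    if position == "first":
        return first_number
    if position == "last":
        return second_number + first_number
    if position == "middle":
        return third_number + second_number
    raise ValueError("position must be 'first', 'middle' or 'last'")
```

With M = 2 frames, a coding delay of 2624 samples and a frame length of 1024 samples, the delay term rounds up to 3 frames, so the first, middle and last slices lose 1, 5 and 4 frames respectively.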

An embodiment of the present invention further provides an electronic device, where the electronic device is capable of processing multimedia information, and the electronic device includes:

information separation means for separating a target audio signal and a target video signal from multimedia information;

the audio signal processing device is used for carrying out slice processing on the target audio signal according to the starting position and the ending position of the slice of the target audio signal to form the slice of the target audio signal;

the audio signal processing device is used for respectively coding each slice of the target audio signal and combining coding results to obtain a code stream corresponding to the target audio signal;

and the code stream merging device is used for packaging the obtained code stream corresponding to the target audio signal and the code stream of the target video signal into new multimedia information.

An embodiment of the present invention further provides an audio signal processing apparatus, where the audio signal processing apparatus includes:

a memory for storing executable instructions;

and the processor is used for implementing the foregoing audio signal processing method when executing the executable instructions stored in the memory.

An embodiment of the present invention further provides an electronic device, where the electronic device includes:

a memory for storing executable instructions;

and the processor is used for implementing the foregoing multimedia information processing method when executing the executable instructions stored in the memory.

The embodiment of the invention also provides a computer-readable storage medium, which stores executable instructions, and when the executable instructions are executed by a processor, the audio signal processing method provided by the embodiment of the invention is realized, or the multimedia information processing method provided by the embodiment of the invention is realized.

The embodiment of the invention has the following beneficial effects:

The target audio signal is accurately sliced according to the calculated starting position and ending position, and each slice of the target audio signal is encoded separately to obtain a code stream corresponding to the target audio signal, so that the target audio signal is accurately processed during transcoding to form a lossless code stream.

Drawings

FIG. 1A is a schematic diagram illustrating an environment for processing an audio signal according to an embodiment of the present invention;

FIG. 1B is a schematic diagram of introducing extra silent sections at the beginning and end of each encoded file according to an embodiment of the present invention;

FIG. 1C is a schematic diagram of a continuous code stream after slice encoding of an AAC-LC encoding algorithm provided in an embodiment of the present invention;

FIG. 2 is a schematic diagram of a usage scenario of an audio signal processing method and a multimedia information processing method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an alternative electronic device 30 according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of an alternative audio signal processing method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an encoding process according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an overlap-add process in overlapped time-frequency transformation according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an audio signal processing method according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of an audio signal processing method according to an embodiment of the present invention;

FIG. 9 is an alternative structural schematic diagram of an electronic device according to an embodiment of the present invention;

FIG. 10 is a flowchart illustrating a multimedia message processing method according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of an audio signal processing method according to an embodiment of the present invention;

FIG. 12 is a schematic diagram illustrating an effect of the audio information processing method according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.

1) Advanced Audio Coding (AAC): a file compression format designed for audio data. Unlike MP3, it uses a new coding algorithm and achieves higher compression efficiency.

2) Moving Picture Experts Group-1 or Moving Picture Experts Group-2 Audio Layer III (MP3, MPEG-1 or MPEG-2 Audio Layer III): with the technology of MPEG Audio Layer 3, music is compressed into a file with a smaller capacity at a compression ratio of 1.

3) Ogg Vorbis, a lossy audio compression format, is a free, open-source software project that is led by xiph. The project produces an audio coding format for lossy audio compression and a software reference encoder/decoder (codec). Vorbis generally uses Ogg as its container format, so it is commonly called Ogg Vorbis.

4) Opus, a lossy audio coding format, was developed by the xiph.

5) Video Transcoding (Video Transcoding) refers to converting a Video code stream that has been compressed and encoded into another Video code stream to adapt to different network bandwidths, different terminal processing capabilities and different user requirements.

6) The client, a carrier in the terminal for implementing a specific function, for example, a mobile client (APP) is a carrier of a specific function in the mobile terminal, for example, a function of performing live online (video push streaming) or a playing function of online video.

First, an audio signal processing method provided by an embodiment of the present invention is explained.

Fig. 1A is a schematic diagram of a usage environment of an audio signal processing method according to an embodiment of the present invention, and to support an exemplary application, a transcoding device implementing an embodiment of the present invention may be a server or an electronic device with a transcoding function.

By means of the transcoding device 20, target video data may be converted into new video data. The target video data comprises a target audio signal and a target video signal, which the transcoding device 20 may transcode separately before synthesizing the new video data. In the process of transcoding the target audio signal, the coding standard of the target audio varies; when the related art adopts lossy audio coding formats such as AAC, MP3, Ogg Vorbis, and Opus, additional silence segments are introduced at the beginning and end of each encoded file, because these formats apply an overlapping time/frequency transform to adjacent frames. If the additional silence segments are not eliminated, obvious pauses occur when a plurality of audio files carrying them are played continuously, which degrades the listener's experience of continuous playback.

Fig. 1B is a schematic diagram of the extra silence segments introduced at the beginning and end of each encoded file according to an embodiment of the present invention. Take the AAC-LC coding format as an example, which is a lossy coding mode based on an overlapping time-frequency transform: the AAC-LC encoding algorithm needs some pre-buffered data before it can output new audio data. Before the first audio data is output, these buffer values are typically 0; this is the coding delay, shown as the first compensation (Priming) part in fig. 1B. Since each frame in the AAC-LC coding format is 1024 samples, if the length of the audio signal plus the coding delay is not an integral multiple of the frame length, the AAC-LC encoding algorithm pads the last frame of audio data with zeros up to a full frame, shown as the second compensation (Remainder) part in fig. 1B. Fig. 1C is a schematic diagram of a continuous code stream after slice encoding with the AAC-LC encoding algorithm provided in an embodiment of the present invention. As shown in fig. 1C, if the code streams formed by slice encoding are directly concatenated during distributed transcoding, discontinuous silence segments are produced; that is, if these extra silence segments are not removed, two silence segments, namely the first compensation (Priming) and the second compensation (Remainder), are generated when a plurality of audio files with extra silence segments are played continuously.
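The size of the second compensation can be worked out directly from this description. The following minimal sketch assumes that the padding is simply whatever is needed to bring the signal length plus the coding delay up to the next frame boundary; the function name is illustrative, and 2624 samples is the Nero-AAC AAC-LC coding delay quoted elsewhere in this description.

```python
def trailing_padding(num_samples, delay, frame_length=1024):
    """Zero samples appended so that num_samples + delay fills whole frames."""
    # Python's % returns a non-negative result for a positive modulus,
    # so this is 0 when the total already ends on a frame boundary.
    return (-(num_samples + delay)) % frame_length
```

For a 10000-sample signal with a 2624-sample delay, the total of 12624 samples is 336 samples past a frame boundary, so 688 zero samples are appended and the encoded stream spans 13 frames.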

To address this drawback, some lossy audio coding formats provided by the related art, such as Ogg Vorbis, specify in the codec standard the lengths of the silence segments introduced at the beginning and end, and eliminate those silence segments when decoding. Other lossy audio coding formats, such as AAC and MP3, require the lengths of the silence segments introduced at the beginning and end to be recorded in metadata encapsulated with the bitstream, so that the silence segments can be removed. In the process of converting target video data into new video data at large scale, the target video can be processed by distributed transcoding to increase the processing speed; that is, each server transcodes a small segment of the target video, and the segments are then synthesized into a complete video.

However, an accurate gapless player must not introduce gaps or overlaps (crossfading) between consecutive tracks, and must not rely on guessing to skip gaps or overlaps. Therefore, although the gapless playback solution provided by the related art can solve the problem of seamless continuous playback of multiple audio files, the lossy audio coding formats such as AAC, MP3, Ogg Vorbis, and Opus provided by the related art do not themselves support gapless playback and cannot meet the requirement of transcoding audio slices.

In view of the foregoing problems in the related art, embodiments of the present invention provide an audio signal processing method, a multimedia information processing method, an apparatus, and an electronic device, which can meet the requirement of gapless playing and can be applied in a distributed transcoding environment. As an example, fig. 2 is a schematic view of a usage scenario of an audio signal processing method and a multimedia information processing method according to an embodiment of the present invention.

As an example, the server 400 implementing the embodiment of the present invention may integrate at least one of an audio signal processing device and a multimedia information processing device, and the terminal 10 is a terminal capable of operating with a video playing function or an audio playing function, and the two are connected through a network 40, wherein the network 40 may be a wide area network or a local area network, or a combination of the two, and the data transmission is implemented using a wireless link.

For example, the server 400 can acquire the initialization parameter of the target audio signal by integrating at least one of the audio signal processing apparatus and the multimedia information processing apparatus; forming a slice of the target audio signal by performing a slice processing on the target audio signal; respectively encoding each slice of the target audio signal; and merging the coding results of all the slices of the target audio signal to obtain a code stream corresponding to the target audio signal. Certainly, according to different use scenes, the obtained code stream corresponding to the target audio signal and the code stream of the target video signal can be encapsulated into new multimedia information to form multimedia information to be requested; alternatively, the formed multimedia information is pushed to the terminal 10 through the network 40.

A video client can run in the terminal 10, and the server 400 can transcode the audio signal according to the type of the terminal 10 (for example, whether the playing of the lossless audio signal is supported); or transcoding the audio signal according to a network environment adapted to the terminal 10.

As another example, the terminal 10 itself may also integrate at least one of an audio signal processing apparatus and a multimedia information processing apparatus, and form a slice of the target audio signal by performing a slice processing on the target audio signal; coding each slice of the target audio signal; and combining the coding results of all the slices of the target audio signal to obtain a code stream corresponding to the target audio signal. And packaging the obtained code stream corresponding to the target audio signal and the code stream of the target video signal into new multimedia information. The terminal 10 may also operate a video client to decode and play new multimedia information.

Referring to fig. 3, fig. 3 is an optional schematic structural diagram of an electronic device 30 according to an embodiment of the present invention. The electronic device 30 is used to implement the audio signal processing method according to the embodiment of the present invention, and may be any of various terminals such as a computer or a mobile phone, or may be the server 400 shown in fig. 2. The description below refers to the structure shown in fig. 3.

The electronic device 30 provided by the embodiment of the present invention may include a processing device 301 (e.g., a central processing unit, a graphics processor, etc.), which may load a preset operating system stored in a Read Only Memory (ROM) 302 or a program stored in a storage device 308 into a Random Access Memory (RAM) 303 for execution, so as to perform various appropriate actions and processes.

As an example of the program stored in the storage device 308, an audio signal processing device 3080 may be included, the audio signal processing device 3080 including the following software modules therein: signal acquisition module 3081, signal slicing module 3082, signal coding module 3083 and signal combining module 3084. When the software modules in the audio signal processing device 3080 are read into the RAM 303 by the processing device 301 and executed, the audio signal processing method provided by the embodiment of the invention will be implemented, and the functions of the software modules in the audio signal processing device 3080 will be described below.

The signal obtaining module 3081 is configured to obtain initialization parameters of the target audio signal.

The signal slicing module 3082 is configured to determine, according to the initialization parameter of the target audio signal, a start position and an end position of slicing the target audio signal. The signal slicing module 3082 is configured to perform slicing processing on the target audio signal according to the start position and the end position to form a slice of the target audio signal. The signal encoding module 3083 is configured to perform encoding processing on each slice of the target audio signal. The signal merging module 3084 is configured to merge the coding results of the slices of the target audio signal to obtain a code stream corresponding to the target audio signal.

The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; and storage devices 308 including, for example, magnetic tape, hard disk, etc.

The communication means 309 may allow the electronic device 30 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.

It should be noted that, although the audio signal processing apparatus provided by the embodiment of the present invention is shown in the form of software in fig. 3, it is not limited to this form. For example, the audio signal processing apparatus may be implemented by combining software and hardware, and may be a processor in the form of a hardware decoding processor that is programmed to execute the audio signal processing method provided by the embodiment of the present invention. The processor in the form of a hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

Referring to fig. 4, fig. 4 is an optional flowchart of the audio signal processing method provided by the embodiment of the present invention. It can be understood that the steps shown in fig. 4 may be executed by various electronic devices 30 running the audio signal processing apparatus 3080; for example, the electronic devices may be terminals such as computers and smart phones, servers, or server clusters. The steps shown in fig. 4 are described below.

Step 401: obtaining initialization parameters of the target audio signal.

Step 402: determining the starting position and the ending position of slicing the target audio signal according to the initialization parameter of the target audio signal.

Step 403: carrying out slicing processing on the target audio signal according to the starting position and the ending position to form a slice of the target audio signal.

Step 404: respectively carrying out coding processing on each slice of the target audio signal.

Step 405: combining the coding results of all the slices of the target audio signal to obtain a code stream corresponding to the target audio signal.

In some embodiments of the present invention, determining the start position and end position for slicing the target audio signal according to the initialization parameters of the target audio signal includes: determining the number of sampling points, the delay compensation, and the number of slice overlapping sampling points of the target audio signal according to the initialization parameters; and determining the corresponding start position and end position of each target audio signal slice according to the determined number of sampling points, delay compensation, and number of slice overlapping sampling points. In some embodiments of the present invention, the parameters initialized first include a coding frame length (frame_length), a coding delay (delay), a slice length (segment length), and a slice overlapping length (segment overlap length). The frame length is determined by the coding format and the coding configuration; for example, the coding frame length of AAC-LC is 1024 sampling points, and the coding frame length of AAC-HE and AAC-HEv2 is 2048 sampling points. The coding delay depends on the coding format, the coding configuration, and the choice of coding tool; for example, when the Nero-AAC tool is used to encode in the AAC-LC coding format, the coding delay is 2624 sampling points. The slice length is N frames and the slice overlapping length is M frames, where N and M are integers that can be set by a user according to the size of the target video.

In some embodiments of the present invention, the determining, according to the initialization parameter of the target audio signal, the number of sampling points, the delay compensation, and the number of sliced overlapping sampling points of the target audio signal includes:

1) Determine the number of sampling points of the target audio signal slice as the product of the initialized slice length and the coding frame length, namely: segment_samples = N * frame_length, where segment_samples represents the number of sampling points of the slice;

2) Take the remainder of the initialized coding delay modulo the coding frame length to determine the delay compensation of the target audio signal slice, namely: delay_remainder = delay % frame_length, where delay_remainder represents the delay compensation of the slice;

3) Determine a first parameter as the product of the initialized slice overlapping length and the coding frame length, determine a second parameter as the remainder, modulo the coding frame length, of the difference between the coding frame length and the delay compensation, and determine the number of slice overlapping samples of the target audio signal slice as the sum of the first parameter and the second parameter, namely: overlap_samples = M * frame_length + (frame_length - delay_remainder) % frame_length, where overlap_samples represents the number of slice overlapping samples of the slice.
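The three quantities can be computed together, as in the following minimal sketch. The function name and the returned tuple layout are illustrative choices; the formulas are the ones given in items 1) to 3).

```python
def slice_parameters(n_frames, m_frames, delay, frame_length):
    """Return (segment_samples, delay_remainder, overlap_samples) for one configuration.

    n_frames: slice length N in frames; m_frames: slice overlapping length M in frames.
    delay: coding delay in samples; frame_length: coding frame length in samples.
    """
    segment_samples = n_frames * frame_length
    delay_remainder = delay % frame_length
    # The second term is 0 when the delay is already a whole number of frames.
    overlap_samples = (m_frames * frame_length
                       + (frame_length - delay_remainder) % frame_length)
    return segment_samples, delay_remainder, overlap_samples
```

For example, with N = 100, M = 2, the AAC-LC frame length of 1024 samples and the Nero-AAC coding delay of 2624 samples, this gives segment_samples = 102400, delay_remainder = 576 and overlap_samples = 2048 + 448 = 2496.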

When determining the start position and the end position of each slice of the target audio signal, the start position of slice i is denoted start_i, the end position is denoted end_i, and the last slice of the target audio signal is denoted T. The start and end positions of the slices of the target audio signal may be:

start_1＝0,

end_1＝segment_samples–1-delay_remainder

when 1 < i < T,

start_i＝end_(i-1)+1–overlap_samples，

end_i＝end_(i-1)+segment_samples，

for the last slice T of the video sequence,

start_T＝end_(T-1)+1–overlap_samples，

end_T＝last sample position.
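The recurrence above can be sketched in Python as follows. This is an illustrative sketch only: the function name `slice_bounds` is hypothetical, and the rule used to decide which slice is the last one (the first slice whose computed end position reaches or passes the last sample) is an assumption, since the text only states that the last slice ends at the last sample position.

```python
def slice_bounds(total_samples, segment_samples, delay_remainder, overlap_samples):
    """Return a list of inclusive (start, end) sample positions, one per slice."""
    last = total_samples - 1
    bounds = []
    # First slice: start_1 = 0, end_1 = segment_samples - 1 - delay_remainder
    start, end = 0, segment_samples - 1 - delay_remainder
    while end < last:  # assumption: stop once a slice would reach the last sample
        bounds.append((start, end))
        # Next slice: start_i = end_(i-1) + 1 - overlap_samples,
        #             end_i   = end_(i-1) + segment_samples
        start = end + 1 - overlap_samples
        end = end + segment_samples
    # Last slice T ends at the last sample position
    bounds.append((start, last))
    return bounds


print(slice_bounds(8_000_000, 3_072_000, 576, 10_688))
# [(0, 3071423), (3060736, 6143423), (6132736, 7999999)]
```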

In some embodiments of the present invention, the separately encoding different slices of the formed target audio signal includes: extracting any one of the slices of the target audio signal; adding a preset compensation amount of a segment at the head of any one slice to form a new audio signal, wherein the sum of the preset compensation amount of the segment of the new audio signal and a first compensation carried by the target audio signal is an integral multiple of the length of the standard audio coding frame; and windowing each frame of audio signal in the new audio signal to form the coding results of different slices of the target audio signal.

For example, when the extracted slice is the last slice of the target audio signal, a second compensation is added after the last frame of the slice to form a new audio signal, so that the sum of the predetermined segment compensation amount of the new audio signal, the first compensation carried by the target audio signal and the second compensation is an integral multiple of the audio coding standard frame length.

As described above, for the slices shown in fig. 1C, if the code streams formed after slice coding are directly concatenated during distributed transcoding, discontinuous silence segments are generated. That is, if the extra silence segments are not removed, two kinds of silence segments, namely the first compensation (Priming) and the second compensation (Remainder), appear when a plurality of audio files carrying them are played continuously. Fig. 5 is a schematic diagram of an encoding process according to an embodiment of the present invention. As shown in fig. 5, a predetermined segment compensation amount (Preadd) is added at the head of the target audio signal, so that the sum of the predetermined segment compensation amount and the first compensation (Priming) carried by the target audio signal is an integer multiple of the audio coding standard frame length; the second compensation carried by the target audio signal, although present, is already 0. After each slice is coded independently, the k extra zero-padding frames at the head of each slice code stream are discarded and the remaining code streams are spliced in sequence. Discarding the extra zero-padding frames at the head means discarding the sum of the predetermined compensation amount added at the head of the target audio signal and the first compensation carried by the target audio signal, so that the remaining code stream contains only original data.
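As a hedged illustration of the head compensation, the sketch below computes the smallest Preadd that makes Preadd plus Priming a whole number of frames, together with the resulting number k of extra head frames. The function name `head_compensation` is hypothetical, and taking the Priming length to be equal to the 2624-sample coding delay of the earlier Nero-AAC example is an assumption made only for this example.

```python
def head_compensation(priming, frame_length):
    """Return (preadd, extra_frames) such that preadd + priming is a
    whole number of frames and preadd is as small as possible."""
    # Smallest non-negative padding that reaches the next frame boundary
    preadd = (-priming) % frame_length
    # Number k of extra frames at the head of the slice code stream
    extra_frames = (preadd + priming) // frame_length
    return preadd, extra_frames


# Assumed values: priming = 2624 samples, frame length = 1024 samples
print(head_compensation(priming=2624, frame_length=1024))  # (448, 3)
```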

In some embodiments of the present invention, since different target audio signals have uncertainty in size, the number of slices of the formed target audio signal is also different, and thus, the slices of the target audio signal may be encoded in multiple threads according to the number of slices of the target audio signal; alternatively, different slices of the target audio signal are distributively encoded to increase the speed of the transcoding process.
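A minimal sketch of multi-threaded slice encoding is given below, using the Python standard library. The function `encode_slice` is a hypothetical placeholder that only returns the slice length so that the example runs without a codec; in practice it would call an actual encoder, which the text does not specify.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_slice(samples):
    # Hypothetical stand-in for a real encoder call; it returns the number
    # of samples in the slice instead of an encoded code stream.
    return len(samples)


def encode_slices_in_parallel(signal, bounds, max_workers=4):
    """Encode each slice, given inclusive (start, end) bounds, in a thread pool."""
    slices = [signal[start:end + 1] for start, end in bounds]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in slice order, which the later merge relies on
        return list(pool.map(encode_slice, slices))


signal = list(range(100))
print(encode_slices_in_parallel(signal, [(0, 39), (30, 79), (70, 99)]))
# [40, 50, 30]
```

Because the results come back in slice order regardless of which thread finishes first, the subsequent merging step can splice them by position.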

After the distributed transcoding is completed, the coding results of the slices of the target audio signal may be merged to obtain a code stream corresponding to the target audio signal. As described above, in an application scenario of gapless playing, an accurate gapless player must not introduce gaps or overlaps (cross-fades) between consecutive tracks, and must not skip gaps or overlaps by guessing. The overlapping frames between the slices therefore need to be deleted, and the slices from which the overlapped frames have been deleted are then merged according to the positions of the different slices to form a code stream corresponding to the target audio signal.

In some embodiments of the present invention, since the number of the overlapped frames corresponding to different positions of the target audio signal slice is different, the corresponding overlapped frames need to be deleted according to the position of the target audio signal slice. The following description is made in connection with different positions of audio slices.

1) When the slice position is the first slice of the target audio signal, deleting overlapped frames from the first slice in the target audio signal, wherein the deleted frame number is a first frame number, and the first frame number is one half of the slice overlapping length;

2) When the slice position is the last slice of the target audio signal, deleting the overlapped frames of the last slice in the target audio signal, wherein the deleted frame number is the sum of a second frame number and the first frame number, and the second frame number is the minimum positive integer which is larger than or equal to the ratio of the coding delay to the coding frame length;

3) When the slice position is between the first slice and the last slice of the target audio signal, deleting overlapped frames of the slices between the first slice and the last slice of the target audio signal, wherein the deleted frame number is the sum of a third frame number and the second frame number, and the third frame number is the slice overlapping length.

The intercepted code streams are then spliced in sequence to obtain the code stream corresponding to the target audio signal. It should be noted that in some embodiments of the present invention, the position of each slice in the target audio signal may be represented by a slice number, for example: slice number 1 denotes the first slice in the target audio signal, slice number N denotes the last slice in the target audio signal, and slice numbers 2 to N-1 denote the slices between the first slice and the last slice in the target audio signal.
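The three cases can be summarized in the following illustrative Python sketch. The function name `frames_to_delete` and the string labels for the slice position are hypothetical; the sketch also assumes that the slice overlapping length M is even, so that one half of it is a whole number of frames.

```python
def frames_to_delete(position, overlap_frames, delay, frame_length):
    """Number of frames deleted from a slice code stream.

    position: "first", "middle" or "last" (hypothetical labels)
    """
    first_count = overlap_frames // 2           # one half of the overlap length M
    second_count = -(-delay // frame_length)    # ceil(delay / frame_length)
    third_count = overlap_frames                # the overlap length M
    if position == "first":
        return first_count
    if position == "last":
        return second_count + first_count
    if position == "middle":
        return third_count + second_count
    raise ValueError("position must be 'first', 'middle' or 'last'")


for position in ("first", "middle", "last"):
    print(position, frames_to_delete(position, overlap_frames=10,
                                     delay=2624, frame_length=1024))
# first 5
# middle 13
# last 8
```

With M = 10, a delay of 2624 samples and a frame length of 1024, a middle slice loses 13 frames in total: 3 delay frames and 5 overlap frames at its head, and 5 overlap frames at its tail.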

Referring to fig. 6, fig. 6 is a schematic diagram of the overlap-add process in the overlapped time-frequency transform according to an embodiment of the present invention. The input audio signal is [6, 9, 12, 15, 18, 21]. When the AAC-LC coding scheme is used: 1) the input signal is padded with silence (zero-filling) at the front and back to obtain [0, 0, 6, 9, 12, 15, 18, 21, 0, 0], which is divided into 4 frames [0, 0, 6, 9], [6, 9, 12, 15], [12, 15, 18, 21] and [18, 21, 0, 0], with a frame length of 4 and an overlap length of 2; 2) each frame is windowed with [0.33, 0.67, 0.67, 0.33], that is, each window coefficient is multiplied by the corresponding sample value of the signal; 3) the encoding and decoding process is assumed not to introduce distortion; 4) after the 4 frames are overlapped and added, the output signal is consistent with the input signal.
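The toy overlap-add example can be reproduced with the following Python sketch. It is a simplified illustration, not an AAC implementation: the window is taken as the exact fractions 1/3 and 2/3 (an assumption standing in for the rounded 0.33 and 0.67), no transform or quantization is performed, and the helper names `windowed_frames` and `overlap_add` are hypothetical.

```python
from fractions import Fraction

# Assumed exact form of the rounded window [0.33, 0.67, 0.67, 0.33]
WINDOW = [Fraction(1, 3), Fraction(2, 3), Fraction(2, 3), Fraction(1, 3)]
FRAME, HOP = 4, 2  # frame length 4, overlap length 2


def windowed_frames(signal):
    """Zero-pad by HOP samples at both ends, split into overlapping frames, apply the window."""
    padded = [0] * HOP + list(signal) + [0] * HOP
    return [[w * x for w, x in zip(WINDOW, padded[i:i + FRAME])]
            for i in range(0, len(padded) - FRAME + 1, HOP)]


def overlap_add(frames):
    """Sum the frames at a hop of HOP samples and strip the zero padding."""
    out = [Fraction(0)] * (HOP * (len(frames) + 1))
    for k, frame in enumerate(frames):
        for j, value in enumerate(frame):
            out[k * HOP + j] += value
    return out[HOP:-HOP]


signal = [6, 9, 12, 15, 18, 21]
frames = windowed_frames(signal)
print([[int(v) for v in f] for f in frames])
# [[0, 0, 4, 3], [2, 6, 8, 5], [4, 10, 12, 7], [6, 14, 0, 0]]
print(overlap_add(frames) == signal)  # True
```

Reconstruction is exact here because the overlapping window halves sum to 1 at every sample position (1/3 + 2/3).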

Referring to fig. 7, fig. 7 is a schematic diagram of an audio signal processing method according to an embodiment of the present invention. When the input audio signal [6, 9, 12, 15, 18, 21] is sliced for distributed transcoding, the input signal is divided into two slices, [6, 9] and [12, 15, 18, 21]. In the process of adding a second compensation after the last frame of the slice to form a new audio signal, if overlapping frames are not deleted, the codec result of the first slice is [0, 0, 4, 3], and the codec result of the second slice is [0, 0, 8, 5], [4, 10, 12, 7] and [6, 14, 0, 0]. Combining the first slice and the second slice gives the code stream [4, 3, 12, 15, 18, 21], which is distorted relative to the input target audio signal.

Referring to fig. 8, fig. 8 is a schematic diagram of an audio signal processing method according to an embodiment of the present invention. When the input audio signal [6, 9, 12, 15, 18, 21] is sliced for distributed transcoding, the input signal is divided into two overlapping slices, [6, 9, 12, 15] and [12, 15, 18, 21]. In the process of adding a second compensation after the last frame of the slice to form a new audio signal, the codec result of the first slice is [0, 0, 4, 3], [2, 6, 8, 5] and [8, 5, 0, 0], and the codec result of the second slice is [8, 5, 0, 0], [4, 10, 12, 7] and [6, 14, 0, 0]. Here [8, 5, 0, 0] in the first slice and [8, 5, 0, 0] in the second slice are redundant code stream frames and can be deleted; the code streams of the first slice and the second slice are then combined into [6, 9, 12, 15, 18, 21], which is consistent with the input audio signal.
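The overlapped slicing of fig. 8 can be illustrated with the following self-contained Python sketch, under the same simplifying assumptions as the toy example of fig. 6 (exact window fractions 1/3 and 2/3, no transform or quantization, hypothetical helper names). Each slice is framed independently, the trailing frame of the first slice and the leading frame of the second slice are dropped as redundant, and the remaining frames are overlap-added. The values computed for the two dropped frames follow from the assumed window and need not match the labels shown in the figure.

```python
from fractions import Fraction

WINDOW = [Fraction(1, 3), Fraction(2, 3), Fraction(2, 3), Fraction(1, 3)]  # assumed
FRAME, HOP = 4, 2


def windowed_frames(samples):
    """Zero-pad by HOP samples at both ends, split into overlapping frames, apply the window."""
    padded = [0] * HOP + list(samples) + [0] * HOP
    return [[w * x for w, x in zip(WINDOW, padded[i:i + FRAME])]
            for i in range(0, len(padded) - FRAME + 1, HOP)]


def overlap_add(frames):
    """Sum the frames at a hop of HOP samples and strip the zero padding."""
    out = [Fraction(0)] * (HOP * (len(frames) + 1))
    for k, frame in enumerate(frames):
        for j, value in enumerate(frame):
            out[k * HOP + j] += value
    return out[HOP:-HOP]


# Two overlapping slices, each encoded on its own
first = windowed_frames([6, 9, 12, 15])
second = windowed_frames([12, 15, 18, 21])

# Drop the redundant trailing frame of the first slice and the redundant
# leading frame of the second slice, then splice the remaining frames.
merged = first[:-1] + second[1:]
print(overlap_add(merged) == [6, 9, 12, 15, 18, 21])  # True
```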

Referring to fig. 9, fig. 9 is an optional schematic structural diagram of an electronic device 50 according to an embodiment of the present invention, where the electronic device 50 is used to implement the multimedia information processing method according to the embodiment of the present invention, and the electronic device 50 may be various terminals such as a computer and a mobile phone, and may also be the server 400 shown in fig. 2. The description will be made with reference to the structure shown in fig. 9.

The electronic device 50 provided by the embodiment of the present invention may include a processing device 101 (e.g., a central processing unit, a graphics processor, etc.), which may load a preset operating system stored in a Read Only Memory (ROM) 102 or a program stored in a storage device 108 into a Random Access Memory (RAM) 103 for execution, so as to perform various appropriate actions and processes.

As an example of the program stored in the storage device 108, a multimedia information processing apparatus 500 may be included, the multimedia information processing apparatus 500 including the following software modules therein: an information separating device 501, an audio signal processing device 502, an audio signal processing device 503 and a code stream merging device 504. When the software modules in the multimedia information processing apparatus 500 are read into the RAM 103 and executed by the processing apparatus 101, the method for processing multimedia information according to the embodiment of the present invention will be implemented, and the functions of the software modules in the multimedia information processing apparatus 500 will be described below.

Information separating means 501 for separating a target audio signal and a target video signal from multimedia information; an audio signal processing device 502, configured to perform slice processing on the target audio signal according to a start position and an end position of a slice of the target audio signal to form a slice of the target audio signal; the audio signal processing device 503 is configured to encode each slice of the target audio signal, and combine encoding results to obtain a code stream corresponding to the target audio signal; a code stream merging device 504, configured to encapsulate the obtained code stream corresponding to the target audio signal and the code stream of the target video signal into new multimedia information.

The processing device 101, the ROM 102, and the RAM 103 are connected to each other through a bus 104. An input/output (I/O) interface 105 is also connected to the bus 104.

Generally, the following devices may be connected to the I/O interface 105: input devices 106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 107 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; and storage devices 108 including, for example, a magnetic tape, a hard disk, etc.

The communication means 109 may allow the electronic device 50 to communicate with other devices wirelessly or by wire to exchange data. While fig. 9 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

It should be noted that, although the multimedia information processing apparatus 500 provided by the embodiment of the present invention and shown in fig. 9 is in the form of software, the embodiment of the present invention is not limited to this. For example, the audio signal processing apparatus provided by the embodiment of the present invention may be implemented by a combination of software and hardware. By way of example, the multimedia information processing apparatus provided by the embodiment of the present invention may be a processor in the form of a hardware decoding processor that is programmed to execute the audio signal processing method provided by the embodiment of the present invention; such a processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, PLDs, CPLDs, FPGAs, or other electronic components.

Referring to fig. 10, fig. 10 is a flowchart illustrating a multimedia information processing method according to an embodiment of the present invention, described with reference to the electronic device 50 shown in fig. 9. It can be understood that the steps shown in fig. 10 can be executed by various electronic devices 50 running the multimedia information processing apparatus 500, for example, a terminal such as a computer or a smart phone, a server, or a server cluster. The following is a description of the steps shown in fig. 10.

Step 1001, separating a target audio signal and a target video signal from multimedia information;

step 1002, performing slicing processing on the target audio signal according to a start position and an end position of a slice of the target audio signal to form a slice of the target audio signal;

step 1003, respectively encoding each slice of the target audio signal, and combining encoding results to obtain a code stream corresponding to the target audio signal;

step 1004, encapsulating the obtained code stream corresponding to the target audio signal and the code stream of the target video signal into new multimedia information. The packaged new multimedia information can be uploaded to a corresponding server to wait for the on-demand of a user to form multimedia information to be on-demand; or directly pushing new multimedia information to the user to form a push stream.

Referring to fig. 11, fig. 11 is a schematic diagram of an audio signal processing method provided by an embodiment of the present invention, in which an input target audio signal of eight million samples [0, 1, 2, …, 7999998, 7999999] is processed by distributed transcoding. The process comprises the following steps:

step 1101: determining the starting position and the ending position for slicing the target audio signal according to the initialization parameters of the target audio signal, wherein the slice length N is 3000 frames, the slice overlapping length M is 10 frames, frame_length = 1024 and delay = 2624.

The initialized parameters are as follows:

delay_remainder＝delay％frame_length＝2624％1024＝576

segment_samples＝N*frame_length＝3000*1024＝3072000

overlap_samples＝M*frame_length+(frame_length-delay_remainder)％frame_length＝10*1024+(1024-576)％1024＝10240+448＝10688

for a first slice of the target audio signal:

start_1＝0

end_1＝segment_samples–1-delay_remainder＝3072000-1-576＝3071423

The first slice covers [0, 3071423], 3071424 samples in total; together with the delay of 2624 samples, this gives (3071424 + 2624)/1024 = 3002 frames of data.

For a second slice of the target audio signal:

start_2＝end_1+1-overlap_samples＝3071423+1-10688＝3060736

end_2＝end_1+segment_samples＝3071423+3072000＝6143423

The second slice covers [3060736, 6143423], 3082688 samples in total; together with the delay of 2624 samples, this gives (3082688 + 2624)/1024 = 3013 frames of data.

For a third slice of the target audio signal:

start_3＝end_2+1-overlap_samples＝6143423+1-10688＝6132736

end_3＝7999999

The third slice covers [6132736, 7999999], 1867264 samples in total; together with the delay of 2624 samples and 960 samples of zero padding at the end, this gives (1867264 + 2624 + 960)/1024 = 1827 frames of data.
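The frame counts of the three slices can be checked with the short Python sketch below. The function name `frames_in_slice` is hypothetical, and rounding up to a whole frame is used here to model the zero padding at the end of the last slice.

```python
def frames_in_slice(start, end, delay, frame_length):
    """Frames needed to hold one slice plus the coding delay, rounded up to a whole frame."""
    samples = end - start + 1  # inclusive bounds
    return -(-(samples + delay) // frame_length)  # ceiling division


slices = [(0, 3071423), (3060736, 6143423), (6132736, 7999999)]
print([frames_in_slice(s, e, delay=2624, frame_length=1024) for s, e in slices])
# [3002, 3013, 1827]
```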

Step 1102: slicing the target audio signal according to the starting position and the ending position to form a slice of the target audio signal;

step 1103: respectively encoding each slice of the target audio signal;

step 1104: and deleting the overlapped frames of the slices and combining to form a code stream corresponding to the target audio signal.

Wherein for a first slice of the target audio signal:

Removing the tail overlap of M/2 = 5 frames of data: 3071423 - 1024 × 5 = 3066303

The signal data in the actual code stream is [0,3066303];

for a second slice of the target audio signal:

Removing the 3 frames of data introduced by the header delay and 5 overlapping frames of data: 3060736 - 2624 + 1024 × 3 + 1024 × 5 = 3066304

Removing the tail overlap of 5 frames of data: 6143423 - 1024 × 5 = 6138303

The signal data in the actual code stream is [3066304,6138303]

For a third slice of the target audio signal:

Removing the 3 frames of data introduced by the header delay and 5 overlapping frames of data: 6132736 - 2624 + 1024 × 3 + 1024 × 5 = 6138304

The signal data in the actual code stream is [6138304,7999999]

Thus, the three slices are:

[0,3071423]，[3060736,6143423]，[6132736,7999999]

After the overlapped frames are cut off, the actual signals contained in the three slice code streams are as follows, and it can be seen that together they contain the continuous and complete input signal:

[0,3066303]，[3066304,6138303]，[6138304,7999999]。
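The trimming arithmetic of this example can be verified with the following Python sketch, which recomputes the range of original samples retained from each slice and checks that the retained ranges are contiguous. The function name `kept_range` and its flag arguments are hypothetical and are used only for illustration.

```python
def kept_range(start, end, is_first, is_last, overlap_frames, delay, frame_length):
    """Inclusive range of original samples left in a slice after trimming."""
    half_overlap = (overlap_frames // 2) * frame_length
    delay_frames = -(-delay // frame_length)  # ceil(delay / frame_length)
    kept_start, kept_end = start, end
    if not is_first:
        # Head: drop the delay frames and half of the overlap frames
        kept_start = start - delay + delay_frames * frame_length + half_overlap
    if not is_last:
        # Tail: drop half of the overlap frames
        kept_end = end - half_overlap
    return kept_start, kept_end


slices = [(0, 3071423), (3060736, 6143423), (6132736, 7999999)]
kept = [kept_range(s, e, i == 0, i == len(slices) - 1, 10, 2624, 1024)
        for i, (s, e) in enumerate(slices)]
print(kept)
# [(0, 3066303), (3066304, 6138303), (6138304, 7999999)]
print(all(a[1] + 1 == b[0] for a, b in zip(kept, kept[1:])))  # True
```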

In some embodiments of the present invention, an Infinite Impulse Response filter (IIR filter) is used in the audio encoder, as is common, and theoretically the value of the current sample point affects the processing of all subsequent sample points. The correlation between frames therefore never ends, and slice coding cannot be completely consistent with coding the signal as a whole. The more overlapping frames there are between adjacent slices, the smaller the impact of the IIR filter.

Referring to fig. 12, fig. 12 is a schematic view illustrating an effect of the audio information processing method according to the embodiment of the present invention. As shown in fig. 12, the duration of the sample of the target audio signal is 120 minutes, and the code rates of the sample are 32 kilobits per second (kbps), 48 kbps, 64 kbps, 96 kbps, 128 kbps, 256 kbps and 320 kbps. When the audio signal processing method described in the embodiment of the present invention is not adopted, the AAC encoding mode does not support distributed transcoding, so transcoding can only be carried out in a single-thread processing mode, and the processing times corresponding to the sample of the target audio signal are 175 seconds, 225 seconds, 250 seconds, 252 seconds, 260 seconds and 280 seconds. After the audio signal processing method described in the embodiment of the present invention is adopted, the processing can be completed through distributed transcoding, and the corresponding processing times are 40 seconds, 45 seconds, 50 seconds, 51 seconds and 52 seconds; the coding time is reduced by more than 80%, and the processing speed of the audio signal is effectively improved.

In summary, the audio signal processing method described in the present invention can achieve the following beneficial effects: 1) Slicing the target audio signal according to the determined starting position and the determined ending position to form a slice of the target audio signal; 2) Performing distributed encoding processing on different slices of the target audio signal to increase the speed of transcoding processing; 3) And accurately processing the mute part of the target audio signal needing distributed transcoding to form a lossless code stream.

The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of audio signal processing, the method comprising:

acquiring an initialization parameter of a target audio signal;

determining the number of sampling points, delay compensation and fragmentation overlapping sample points of the target audio signal according to the initialization parameters of the target audio signal, and determining the starting position and the ending position of slicing the target audio signal according to the determined number of the sampling points, delay compensation and fragmentation overlapping sample points;

respectively encoding each slice of the target audio signal;

2. The method of claim 1, wherein determining the number of sampling points, the delay compensation, and the number of sliced overlapping samples of the target audio signal according to the initialization parameters of the target audio signal comprises:

determining the number of sampling points of the target audio signal slice according to the product of the initialized slice length and the coding frame length;

performing a remainder operation on the initialized coding delay with respect to the coding frame length, and determining the remainder as the delay compensation of the target audio signal slice;

determining a first parameter according to the product of the initialized slice overlapping length and the coding frame length, and determining a second parameter as the remainder of the difference between the coding frame length and the delay compensation with respect to the coding frame length;

and determining the number of slice overlapping samples of the target audio signal slice according to the sum of the first parameter and the second parameter.

3. The method according to claim 1, wherein the encoding the respective slices of the target audio signal comprises:

extracting any slice from the slices of the target audio signal, and adding a preset segment compensation amount at the head of the slice to form a new audio signal, so that the sum of the preset segment compensation amount of the new audio signal and the first compensation carried by the target audio signal is an integral multiple of the audio coding standard frame length;

and windowing each frame of audio signal in the new audio signal to form the coding results of different slices of the target audio signal.

4. The method of claim 3, further comprising:

when the extracted slice is the last slice of the target audio signal, adding a second compensation after the last frame of the slice to form a new audio signal, so that the sum of the preset segment compensation amount of the new audio signal, the first compensation carried by the target audio signal and the second compensation is an integral multiple of the audio coding standard frame length.

5. The method of claim 3, further comprising:

carrying out multi-thread coding processing on each slice of the target audio signal according to the number of the slices of the target audio signal; or,

different slices of the target audio signal are distributively encoded.

6. The method according to any one of claims 1 to 5, wherein the merging the encoding results of the slices of the target audio signal to obtain a code stream corresponding to the target audio signal comprises:

deleting the overlapped frames among the slices;

and merging the slices from which the overlapped frames are deleted according to the positions of different slices to form a code stream corresponding to the target audio signal.

7. The method of claim 6, wherein said removing overlapping frames between said slices comprises:

deleting overlapped frames of a first slice of the target audio signal, wherein the deleted frame number is a first frame number, and the first frame number is one half of the slice overlapping length;

deleting the overlapped frame of the last slice of the target audio signal, wherein the deleted frame number is the sum of a second frame number and the first frame number, and the second frame number is the minimum positive integer which is larger than or equal to the ratio of the coding delay to the coding frame length;

and deleting overlapped frames of a middle slice of the target audio signal, wherein the deleted frame number is the sum of a third frame number and the second frame number, the third frame number is the slice overlapping length, and the middle slice is a slice between the first slice and the last slice.

8. A method for processing multimedia information, the method comprising:

determining the number of sampling points, delay compensation and fragmentation overlapping sample points of a target audio signal according to an initialization parameter of the target audio signal, and determining the initial position and the end position for slicing the target audio signal according to the determined number of the sampling points, delay compensation and fragmentation overlapping sample points;

respectively encoding each slice of the target audio signal, and combining encoding results to obtain a code stream corresponding to the target audio signal;

9. An audio signal processing apparatus, characterized in that the apparatus comprises:

the signal slicing module is used for determining the number of sampling points, time delay compensation and the number of fragment overlapping sample points of the target audio signal according to the initialization parameters of the target audio signal, and determining the initial position and the end position of slicing the target audio signal according to the determined number of the sampling points, the time delay compensation and the number of the fragment overlapping sample points;

10. A multimedia information processing apparatus, characterized by comprising:

information separating means for separating a target audio signal and a target video signal from multimedia information;

the audio signal processing device is used for determining the number of sampling points, the time delay compensation and the number of fragment overlapping samples of the target audio signal according to the initialization parameter of the target audio signal, and determining the initial position and the end position of slicing the target audio signal according to the determined number of the sampling points, the time delay compensation and the number of the fragment overlapping samples; slicing the target audio signal according to the starting position and the ending position to form a slice of the target audio signal;

the audio signal processing device is used for respectively encoding each slice of the target audio signal and combining encoding results to obtain a code stream corresponding to the target audio signal;

11. An electronic device, characterized in that the electronic device comprises:

a memory for storing executable instructions;

a processor for implementing the audio signal processing method of any one of claims 1 to 7 when executing the executable instructions stored by the memory.

12. An electronic device, characterized in that the electronic device comprises:

a memory for storing executable instructions;

a processor for implementing the multimedia information processing method of claim 8 when executing the executable instructions stored in the memory.

13. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the audio signal processing method of any one of claims 1 to 7 or implement the multimedia information processing method of claim 8.