WO2024114373A1 - Scene audio encoding method and electronic device - Google Patents

Scene audio encoding method and electronic device

Info

Publication number
WO2024114373A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
signal
virtual speaker
scene
reconstructed
Application number
PCT/CN2023/131640
Other languages
English (en)
French (fr)
Inventor
高原
刘帅
夏丙寅
王喆
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2024114373A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • the embodiments of the present application relate to the field of audio coding and decoding, and in particular to a scene audio coding method and electronic device.
  • Three-dimensional audio technology is an audio technology that uses computers and signal processing to acquire, process, transmit, render and play back sound events and three-dimensional sound field information in the real world.
  • Three-dimensional audio gives sound a strong sense of space, envelopment and immersion, giving people an extraordinary auditory experience of "being there".
  • Higher Order Ambisonics (HOA) is one representation of three-dimensional audio. It offers higher flexibility when playing back three-dimensional audio, and has therefore received more extensive attention and research.
  • For an HOA signal of order N, the corresponding number of channels is (N+1)². As the order N increases, the information used to record more detailed sound scenes in the HOA signal also increases; however, the amount of data in the HOA signal increases as well, and a large amount of data causes difficulties in transmission and storage, so the HOA signal needs to be encoded and decoded.
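The order-to-channel-count relationship above can be illustrated in a few lines; the helper name below is ours, not the patent's:

```python
# Minimal sketch of the HOA order-to-channel-count relationship described
# above; the function name is illustrative and not from the patent.

def hoa_channel_count(order: int) -> int:
    """Number of channels C of an HOA signal of order N: (N+1)^2."""
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2

# The channel count (and hence the raw data volume) grows quadratically
# with the order, which motivates encoding only K <= C1 channels.
print([hoa_channel_count(n) for n in range(5)])  # [1, 4, 9, 16, 25]
```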
  • the existing technology has low encoding performance for HOA signals.
  • the present application provides a scene audio encoding method and an electronic device.
  • an embodiment of the present application provides a scene audio encoding method, the method comprising: first, obtaining a scene audio signal to be encoded, the scene audio signal comprising audio signals of C1 channels, where C1 is a positive integer; then, based on the scene audio signal, determining attribute information of a target virtual speaker; thereafter, encoding a first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream; wherein the first audio signal is an audio signal of K channels in the scene audio signal, where K is a positive integer less than or equal to C1.
  • The position of the target virtual speaker matches the position of the sound source in the scene audio signal. Based on the attribute information of the target virtual speaker and the first audio signal in the scene audio signal, a virtual speaker signal corresponding to the target virtual speaker can be generated; based on the virtual speaker signal, the scene audio signal can be reconstructed. Therefore, the encoder encodes the first audio signal in the scene audio signal and the attribute information of the target virtual speaker, and sends them to the decoder. The decoder can reconstruct the scene audio signal based on the first reconstructed signal (i.e., the reconstructed signal of the first audio signal in the scene audio signal) and the attribute information of the target virtual speaker obtained by decoding.
  • the audio quality of the scene audio signal reconstructed based on the virtual speaker signal is higher; therefore, when K is equal to C1, at the same bit rate, the audio quality of the scene audio signal reconstructed by the present application is higher.
  • the prior art converts the scene audio signal into a virtual speaker signal and a residual signal and then encodes it, while the encoding end of the present application directly encodes the first audio signal in the scene audio signal without calculating the virtual speaker signal and the residual signal, and the encoding complexity of the encoding end is lower.
  • the scene audio signal involved in the embodiments of the present application may refer to a signal used to describe a sound field; wherein the scene audio signal may include: an HOA signal (wherein the HOA signal may include a three-dimensional HOA signal and a two-dimensional HOA signal (also referred to as a planar HOA signal)) and a three-dimensional audio signal; the three-dimensional audio signal may refer to other audio signals in the scene audio signal except the HOA signal.
  • When N1 is equal to 1, K may be equal to C1; when N1 is greater than 1, K may be less than C1. It should be understood that when N1 is equal to 1, K may also be less than C1.
  • The process of encoding the first audio signal in the scene audio signal and the attribute information of the target virtual speaker may include operations such as downmixing, transformation, quantization, and entropy coding, which are not limited in the present application.
  • The first bitstream may include encoded data of the first audio signal in the scene audio signal and encoded data of the attribute information of the target virtual speaker.
  • a target virtual speaker may be selected from multiple candidate virtual speakers based on the scene audio signal, and then the attribute information of the target virtual speaker may be determined.
  • The candidate virtual speakers are virtual speakers rather than physical speakers.
  • a plurality of candidate virtual speakers may be evenly distributed on a sphere, and the number of target virtual speakers may be one or more.
  • a preset target virtual speaker may be obtained, and then the attribute information of the target virtual speaker may be determined.
  • the scene audio signal is an N1-order Higher Order Ambisonics (HOA) signal
  • the N1-order HOA signal includes a second audio signal and a third audio signal
  • the second audio signal is an HOA signal of the 0th order to the Mth order in the N1-order HOA signal
  • the third audio signal is an audio signal in the N1-order HOA signal except the second audio signal
  • M is an integer less than N1
  • C1 is equal to the square of (N1+1)
  • N1 is a positive integer
  • the first audio signal includes the second audio signal.
  • "The first audio signal includes the second audio signal" may mean that the first audio signal includes only the second audio signal; alternatively, it may mean that the first audio signal includes the second audio signal together with other audio signals.
  • the first audio signal further includes a fourth audio signal; wherein the fourth audio signal is an audio signal of some channels in the third audio signal.
  • For example, the first audio signal may include audio signals of an even number of channels. When the number of channels of the second audio signal is odd, the number of channels of the fourth audio signal may also be odd, so that the total number of channels is even. This facilitates encoding by an encoder that only supports encoding audio signals of an even number of channels.
  • The second audio signal may be referred to as the low-order portion of the scene audio signal, and the third audio signal as the high-order portion. That is, the low-order portion and a portion of the high-order portion of the scene audio signal may be encoded, to ensure that the first audio signal includes audio signals of an even number of channels.
  • the first audio signal may also include an audio signal of an odd number of channels. Then, when the number of channels of the second audio signal is an even number, the number of channels of the fourth audio signal may be an odd number. This can facilitate encoding by an encoder that only supports encoding audio signals of an odd number of channels.
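The channel-selection rule above can be sketched as follows. This is our own illustrative code: it assumes an order-major (ACN-style) channel layout, which the patent does not mandate, and it borrows exactly one high-order channel when the parity must be fixed.

```python
# Hypothetical sketch of choosing the K channels of the "first audio
# signal": all low-order channels (orders 0..M, the second audio signal)
# plus, when an even total is wanted, one channel from the high-order part
# (a one-channel "fourth audio signal"). Channel indices assume an
# order-major (ACN-style) layout; that layout is our assumption.

def select_first_audio_channels(n1: int, m: int, want_even: bool = True):
    """Return (channel indices of the first audio signal, C1)."""
    c1 = (n1 + 1) ** 2                # C1 = (N1 + 1)^2 total channels
    low = list(range((m + 1) ** 2))   # second audio signal: orders 0..M
    extra = []
    if want_even and len(low) % 2 == 1:
        extra = [len(low)]            # borrow one high-order channel
    return low + extra, c1

# N1 = 3, M = 0: the low-order part has 1 (odd) channel, so one
# high-order channel is added, giving an even K = 2 out of C1 = 16.
channels, c1 = select_first_audio_channels(3, 0)
```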
  • the number of channels of the encoded first audio signal is smaller and the corresponding bit rate is lower.
  • the attribute information of the target virtual speaker includes at least one of the following: position information of the target virtual speaker, a position index corresponding to the position information of the target virtual speaker, or a virtual speaker index of the target virtual speaker.
  • The position information of the target virtual speaker can be expressed as (θs3, φs3), where θs3 is the horizontal angle information of the target virtual speaker and φs3 is the pitch angle information of the target virtual speaker.
  • the position index is used to uniquely identify the position of a virtual speaker.
  • the position index may include a horizontal angle index (used to uniquely identify a horizontal angle information) and a pitch angle index (used to uniquely identify a pitch angle information).
  • The position index of a virtual speaker corresponds one-to-one to the position information of that virtual speaker.
  • the virtual speaker index may be used to uniquely identify a virtual speaker; wherein the position information/position index of the virtual speaker corresponds one-to-one to the virtual speaker index.
  • attribute information of a target virtual speaker is determined based on a scene audio signal, including: obtaining multiple groups of virtual speaker coefficients corresponding to multiple candidate virtual speakers, the multiple groups of virtual speaker coefficients corresponding one-to-one to multiple candidate virtual speakers; selecting a target virtual speaker from multiple candidate virtual speakers based on the scene audio signal and the multiple groups of virtual speaker coefficients; and obtaining attribute information of the target virtual speaker.
  • The signal generated by a virtual sound source can be regarded as an ideal plane wave, which can be expanded in the spherical coordinate system. The ideal plane wave can be expanded using spherical harmonics as shown in the following formula (3): B = s · Y(θs, φs), where s is the amplitude of the plane wave and Y(θs, φs) denotes the spherical harmonics evaluated at the direction (θs, φs).
  • Setting (θs, φs) to the position information of a candidate virtual speaker, the coefficients B given by formula (3) form a group of virtual speaker coefficients (i.e., HOA coefficients).
  • the virtual speaker coefficients are also HOA coefficients. It should be noted that, according to formula (3), when the position of the candidate virtual speaker is different from the position of the sound source in the scene audio signal, the virtual speaker coefficients of the candidate virtual speaker and the scene audio signal are different HOA coefficients.
  • a target virtual speaker whose position matches the sound source position in the scene audio signal can be accurately found from multiple candidate virtual speakers.
  • A target virtual speaker is selected from multiple candidate virtual speakers by: computing inner products of the scene audio signal with the multiple groups of virtual speaker coefficients to obtain multiple inner product values, the multiple inner product values corresponding one-to-one to the multiple groups of virtual speaker coefficients; and selecting, based on the multiple inner product values, a target virtual speaker from the multiple candidate virtual speakers.
  • the matching degree of each candidate virtual speaker with the scene audio signal can be accurately determined; and then a target virtual speaker whose position is more matched with the sound source position in the scene audio signal can be selected.
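The coefficient-matching step above can be sketched in a few lines. The sketch below is our own: it restricts itself to first-order (FOA) coefficients in ACN order with SN3D normalization and matches one frame of the scene signal against each candidate by inner product; the patent itself does not fix these conventions.

```python
import math

def first_order_coeffs(azimuth: float, elevation: float):
    """First-order HOA coefficients (ACN order, SN3D normalization) of an
    ideal plane wave from (azimuth, elevation) in radians: W, Y, Z, X.
    Higher orders are omitted to keep the sketch short."""
    ca, ce = math.cos(azimuth), math.cos(elevation)
    sa, se = math.sin(azimuth), math.sin(elevation)
    return [1.0, sa * ce, se, ca * ce]

def select_target_speaker(scene_frame, candidate_positions):
    """Return the index of the candidate virtual speaker whose coefficient
    group has the largest |inner product| with the scene frame, i.e. whose
    position best matches the sound-source position."""
    best_idx, best_score = -1, -math.inf
    for i, (az, el) in enumerate(candidate_positions):
        coeffs = first_order_coeffs(az, el)
        score = abs(sum(c * s for c, s in zip(coeffs, scene_frame)))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# A source straight ahead should pick the front-facing candidate.
candidates = [(0.0, 0.0), (math.pi / 2, 0.0), (math.pi, 0.0)]
frame = first_order_coeffs(0.0, 0.0)   # scene signal from the front
assert select_target_speaker(frame, candidates) == 0
```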
  • the method further includes: obtaining feature information corresponding to the fifth audio signal in the scene audio signal; encoding the feature information to obtain a second bit stream; wherein the fifth audio signal is the third audio signal, or the fifth audio signal is an audio signal in the scene audio signal other than the second audio signal and the fourth audio signal, and the fourth audio signal is an audio signal of some channels in the third audio signal.
  • the feature information can be used to compensate for the audio signals of some channels in the reconstructed scene audio signal during the decoding process at the decoding end, so as to improve the audio quality of the audio signals of some channels in the reconstructed scene audio signal.
  • The data volume of the feature information is relatively small; therefore, compared with the prior art, even if the feature information is encoded, the total bit rate of the present application is smaller. Under the premise of the same bit rate, the audio quality of the reconstructed scene audio signal can thus be further improved.
  • the feature information corresponding to the fifth audio signal in the scene audio signal may be determined based on information such as energy and intensity of the scene audio signal.
  • the feature information includes gain information.
  • The feature information may also include diffusion information, etc., which is not limited in the present application.
  • an embodiment of the present application provides a scene audio decoding method, which includes: first, receiving a first bit stream; and decoding the first bit stream to obtain a first reconstructed signal and attribute information of a target virtual speaker, the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1; then, based on the attribute information and the first reconstructed signal, a virtual speaker signal corresponding to the target virtual speaker is generated; thereafter, reconstruction is performed based on the attribute information and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal includes audio signals of C2 channels, and C2 is a positive integer.
  • the audio quality of the scene audio signal reconstructed based on the virtual speaker signal is higher; therefore, when K is equal to C1, at the same bit rate, the audio quality of the scene audio signal reconstructed by the present application is higher.
  • the number of channels of the audio signal encoded by the present application is less than the number of channels of the audio signal encoded by the prior art, and the data amount of the attribute information of the target virtual speaker is much smaller than the data amount of the audio signal of one channel; therefore, under the premise of the same bit rate, the audio quality of the reconstructed scene audio signal obtained by decoding by the present application is higher.
  • the virtual speaker signal and residual information encoded and transmitted by the prior art are converted from the original audio signal (i.e., the scene audio signal to be encoded), and are not the original audio signal, errors will be introduced; while the present application encodes part of the original audio signal (i.e., the audio signals of K channels in the scene audio signal to be encoded), thus avoiding the introduction of errors, thereby improving the audio quality of the reconstructed scene audio signal obtained by decoding; and it can also avoid fluctuations in the reconstruction quality of the reconstructed scene audio signal obtained by decoding, and has high stability.
  • the present application encodes and transmits the attribute information of the virtual speaker, and the data volume of the attribute information is much smaller than the data volume of the virtual speaker signal; therefore, the number of target virtual speakers selected by the present application is less subject to bandwidth restrictions.
  • The more target virtual speakers are selected, the higher the quality of the scene audio signal reconstructed based on the virtual speaker signals of the target virtual speakers. Therefore, compared with the prior art, at the same bit rate, the present application can select a larger number of target virtual speakers, so that the quality of the reconstructed scene audio signal decoded by the present application is higher.
  • the encoding end and decoding end of the present application do not need to perform residual and superposition operations. Therefore, the comprehensive complexity of the encoding end and decoding end of the present application is lower than the comprehensive complexity of the encoding end and decoding end of the prior art.
  • the encoding end when the encoding end performs lossy compression on the first audio signal in the scene audio signal, there is a difference between the first reconstructed signal decoded by the decoding end and the first audio signal encoded by the encoding end.
  • When the encoding end performs lossless compression on the first audio signal, the first reconstructed signal decoded by the decoding end is the same as the first audio signal encoded by the encoding end.
  • the encoding end performs lossy compression on the attribute information of the target virtual speaker, there is a difference between the attribute information decoded by the decoding end and the attribute information encoded by the encoding end.
  • When the encoding end performs lossless compression on the attribute information, the attribute information decoded by the decoding end is the same as the attribute information encoded by the encoding end. (In this application, the attribute information encoded by the encoding end and the attribute information decoded by the decoding end are not distinguished in terms of name.)
  • the method further includes: generating a second reconstructed scene audio signal based on the first reconstructed signal and the first reconstructed scene audio signal, wherein the second reconstructed scene audio signal includes audio signals of C2 channels.
  • the decoded first reconstructed signal is closer to the encoded first audio signal; thus, a second reconstructed scene audio signal having higher audio quality than the first reconstructed scene audio signal can be obtained.
  • the scene audio signal is an N1-order Higher Order Ambisonics (HOA) signal
  • the N1-order HOA signal includes a second audio signal and a third audio signal
  • the second audio signal is a signal of the 0th order to the Mth order in the N1-order HOA signal
  • the third audio signal is an audio signal other than the second audio signal in the N1-order HOA signal
  • M is an integer less than N1
  • C1 is equal to the square of (N1+1)
  • N1 is a positive integer
  • the first reconstructed scene audio signal is an N2-order HOA signal
  • the N2-order HOA signal includes a sixth audio signal and a seventh audio signal
  • the sixth audio signal is a signal of the 0th order to the Mth order in the N2-order HOA signal
  • the seventh audio signal is an audio signal other than the sixth audio signal in the N2-order HOA signal
  • M is an integer less than N2
  • C2 is equal to the square of (N2+1)
  • N2 is a positive integer
  • Generating a second reconstructed scene audio signal based on the first reconstructed signal and the first reconstructed scene audio signal includes: when the first audio signal includes the second audio signal, generating the second reconstructed scene audio signal based on the second reconstructed signal and the seventh audio signal, the second reconstructed signal being a reconstructed signal of the second audio signal.
  • the first reconstructed signal obtained by decoding is closer to the first audio signal encoded by the encoding end; therefore, based on the first reconstructed signal and the seventh audio signal, the audio quality of the second reconstructed scene audio signal obtained is higher.
  • A second reconstructed scene audio signal is generated based on the first reconstructed signal and the first reconstructed scene audio signal, including: when the first audio signal includes the second audio signal and the fourth audio signal, generating the second reconstructed scene audio signal based on the second reconstructed signal, the fourth reconstructed signal and the eighth audio signal; wherein the fourth audio signal is an audio signal of some channels in the third audio signal, the fourth reconstructed signal is a reconstructed signal of the fourth audio signal, the second reconstructed signal is a reconstructed signal of the second audio signal, and the eighth audio signal is an audio signal of some channels in the seventh audio signal.
  • In this way, more channels of the second reconstructed scene audio signal come directly from the first reconstructed signal, so the second reconstructed scene audio signal is closer to the encoded scene audio signal, and its audio quality is higher.
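The channel-replacement logic described above can be sketched as follows; the function and its arguments are illustrative, with each list element standing for one channel:

```python
# Illustrative sketch: build the second reconstructed scene audio signal by
# overwriting channels of the first reconstructed scene signal with the
# directly decoded reconstructions, which are closer to the original.
# Each element stands for one channel (e.g. a frame of samples).

def second_reconstruction(first_recon_scene, low_order_recon,
                          extra_recon=None, extra_idx=None):
    out = list(first_recon_scene)
    # The second reconstructed signal replaces the sixth audio signal
    # (the low-order channels of the first reconstructed scene signal).
    out[:len(low_order_recon)] = low_order_recon
    # The fourth reconstructed signal, when present, replaces the eighth
    # audio signal (some channels of the high-order part).
    if extra_recon is not None:
        for idx, ch in zip(extra_idx, extra_recon):
            out[idx] = ch
    return out

# N2 = 2 (9 channels), M = 1 (4 low-order channels), one extra channel.
second = second_reconstruction([0.0] * 9, [1.0] * 4, [2.0], [4])
```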
  • Generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information and the first reconstructed signal includes: determining a first virtual speaker coefficient corresponding to the target virtual speaker based on the attribute information; and generating the virtual speaker signal based on the first reconstructed signal and the first virtual speaker coefficient. In this way, the virtual speaker signal can be generated.
  • reconstruction is performed based on the attribute information and the virtual speaker signal to obtain a first reconstructed scene audio signal, including: determining a second virtual speaker coefficient corresponding to the target virtual speaker based on the attribute information; and obtaining the first reconstructed scene audio signal based on the virtual speaker signal and the second virtual speaker coefficient. In this way, reconstruction of the scene audio signal can be achieved.
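In matrix form, the two decoder steps above can be sketched as a least-squares solve followed by a re-expansion. The NumPy code below is our own sketch; the shapes, names, and the use of a pseudoinverse are assumptions, not the patent's prescribed algorithm.

```python
import numpy as np

def make_virtual_speaker_signals(first_recon, coeffs_low):
    """Solve for the virtual speaker signals from the K decoded channels,
    assuming first_recon ~= coeffs_low @ speaker_signals.
    first_recon: (K, T) first reconstructed signal;
    coeffs_low: (K, S) first virtual speaker coefficients, one column per
    target virtual speaker. Returns an (S, T) array."""
    return np.linalg.pinv(coeffs_low) @ first_recon

def reconstruct_scene(speaker_signals, coeffs_full):
    """Re-expand the speaker signals to all C2 channels with the second
    virtual speaker coefficients coeffs_full: (C2, S). Returns (C2, T)."""
    return coeffs_full @ speaker_signals

# One target speaker, K = 2 decoded channels, T = 2 samples.
coeffs_low = np.array([[1.0], [0.5]])
speaker = np.array([[2.0, 4.0]])
first_recon = coeffs_low @ speaker            # what the decoder holds
recovered = make_virtual_speaker_signals(first_recon, coeffs_low)
scene = reconstruct_scene(recovered, np.array([[1.0], [0.5], [0.25]]))
```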
  • the method before generating the second reconstructed scene audio signal based on the second reconstructed signal and the seventh audio signal, the method further includes: receiving a second bit stream; decoding the second bit stream to obtain feature information corresponding to the fifth audio signal in the scene audio signal; wherein the fifth audio signal is the third audio signal; and compensating the seventh audio signal based on the feature information.
  • the method further includes: receiving a second bit stream; decoding the second bit stream to obtain feature information corresponding to the fifth audio signal in the scene audio signal; wherein the fifth audio signal is the third audio signal; and compensating the seventh audio signal based on the feature information.
  • When the encoder performs lossy compression on the feature information, the feature information decoded by the decoder is different from the feature information encoded by the encoder.
  • When the encoder performs lossless compression on the feature information, the feature information decoded by the decoder is the same as the feature information encoded by the encoder. (In this application, the feature information encoded by the encoder and the feature information decoded by the decoder are not distinguished in terms of name.)
  • the method before generating a second reconstructed scene audio signal based on the second reconstructed signal, the fourth reconstructed signal and the eighth audio signal, the method further includes: receiving a second bit stream; decoding the second bit stream to obtain feature information corresponding to a fifth audio signal in the scene audio signal; wherein the fifth audio signal is an audio signal other than the second audio signal and the fourth audio signal in the scene audio signal; and compensating the eighth audio signal based on the feature information.
  • In this way, the audio quality of the eighth audio signal in the first reconstructed scene audio signal can be improved.
  • the seventh audio signal/eighth audio signal in the first reconstructed scene audio signal can be compensated based on the feature information to improve the first reconstructed scene audio signal.
  • the feature information includes gain information.
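A minimal sketch of the gain compensation, assuming the feature information reduces to one decoded gain per high-order channel (the patent also allows other feature types, such as diffusion information):

```python
# Illustrative gain compensation of the high-order part (the seventh or
# eighth audio signal) of the first reconstructed scene audio signal.
# One decoded gain per channel is an assumption of this sketch.

def compensate_high_order(channels, gains):
    """Scale each high-order channel by its decoded gain so its energy
    better matches the original scene audio signal."""
    return [[g * x for x in ch] for ch, g in zip(channels, gains)]

compensated = compensate_high_order([[1.0, 2.0], [3.0]], [2.0, 0.5])
```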
  • the second reconstructed scene audio signal may be an N2-order HOA signal, where N2 is a positive integer.
  • the order N2 of the second reconstructed scene audio signal may be greater than or equal to the order N1 of the scene audio signal; correspondingly, the number of channels C2 of the audio signal included in the second reconstructed scene audio signal may be greater than or equal to the number of channels C1 of the audio signal included in the scene audio signal.
  • the decoding end can reconstruct a reconstructed scene audio signal having the same order as the scene audio signal encoded by the encoding end.
  • the decoding end may reconstruct a reconstructed scene audio signal having an order greater than the order of the scene audio signal encoded by the encoding end.
  • the second aspect and any implementation of the second aspect correspond to the first aspect and any implementation of the first aspect respectively.
  • the technical effects corresponding to the second aspect and any implementation of the second aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
  • an embodiment of the present application provides a method for generating a bitstream, which can generate a bitstream according to the first aspect and any one of the implementation methods of the first aspect.
  • the third aspect and any implementation of the third aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the third aspect and any implementation of the third aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
  • an embodiment of the present application provides a scene audio encoding device, the device comprising:
  • a signal acquisition module used to acquire a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels, where C1 is a positive integer;
  • an attribute information acquisition module, used to determine the attribute information of the target virtual speaker based on the scene audio signal;
  • an encoding module, used to encode the first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream; wherein the first audio signal is the audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
  • the scene audio encoding device of the fourth aspect can execute the steps of the first aspect and any one of the implementation methods of the first aspect, which will not be repeated here.
  • the fourth aspect and any implementation of the fourth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the fourth aspect and any implementation of the fourth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • an embodiment of the present application provides a scene audio decoding device, the device comprising: a bit stream receiving module, configured to receive a first bit stream;
  • a decoding module used to decode the first bit stream to obtain a first reconstructed signal and attribute information of a target virtual speaker, wherein the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;
  • a virtual speaker signal generating module, used for generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information and the first reconstructed signal;
  • the scene audio signal reconstruction module is used to reconstruct based on the attribute information and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal includes audio signals of C2 channels, where C2 is a positive integer.
  • the scene audio decoding device of the fifth aspect can execute the steps in the second aspect and any implementation method of the second aspect, which will not be repeated here.
  • the fifth aspect and any implementation of the fifth aspect correspond to the second aspect and any implementation of the second aspect, respectively.
  • the technical effects corresponding to the fifth aspect and any implementation of the fifth aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.
  • an embodiment of the present application provides an electronic device, comprising: a memory and a processor, wherein the memory is coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device executes the scene audio encoding method in the first aspect or any possible implementation of the first aspect.
  • the sixth aspect and any implementation of the sixth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the sixth aspect and any implementation of the sixth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
  • an embodiment of the present application provides an electronic device, comprising: a memory and a processor, the memory being coupled to the processor; the memory storing program instructions, and when the program instructions are executed by the processor, the electronic device executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.
  • the seventh aspect and any implementation of the seventh aspect correspond to the second aspect and any implementation of the second aspect, respectively.
  • the technical effects corresponding to the seventh aspect and any implementation of the seventh aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.
  • In an eighth aspect, an embodiment of the present application provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is used to receive a signal from a memory of an electronic device and send the signal to the processor, the signal comprising a computer instruction stored in the memory; when the processor executes the computer instruction, the electronic device executes the scene audio encoding method in the first aspect or any possible implementation of the first aspect.
  • the eighth aspect and any implementation of the eighth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the eighth aspect and any implementation of the eighth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
  • In a ninth aspect, an embodiment of the present application provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is used to receive a signal from a memory of an electronic device and send the signal to the processor, the signal comprising a computer instruction stored in the memory; when the processor executes the computer instruction, the electronic device executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.
  • the ninth aspect and any implementation of the ninth aspect correspond to the second aspect and any implementation of the second aspect, respectively.
  • the technical effects corresponding to the ninth aspect and any implementation of the ninth aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.
  • In a tenth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program. When the computer program runs on a computer or a processor, the computer or the processor executes the scene audio encoding method in the first aspect or any possible implementation of the first aspect.
  • the tenth aspect and any implementation of the tenth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the tenth aspect and any implementation of the tenth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • In an eleventh aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program. When the computer program runs on a computer or a processor, the computer or the processor executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.
  • the eleventh aspect and any implementation of the eleventh aspect correspond to the second aspect and any implementation of the second aspect, respectively.
  • the technical effects corresponding to the eleventh aspect and any implementation of the eleventh aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.
  • In a twelfth aspect, an embodiment of the present application provides a computer program product, which includes a software program. When the software program is executed by a computer or a processor, the computer or the processor executes the scene audio encoding method in the first aspect or any possible implementation of the first aspect.
  • the twelfth aspect and any implementation of the twelfth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the twelfth aspect and any implementation of the twelfth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • In a thirteenth aspect, an embodiment of the present application provides a computer program product, which includes a software program. When the software program is executed by a computer or a processor, the computer or the processor executes the scene audio decoding method in the second aspect or any possible implementation of the second aspect.
  • the thirteenth aspect and any implementation of the thirteenth aspect correspond to the second aspect and any implementation of the second aspect, respectively.
  • the technical effects corresponding to the thirteenth aspect and any implementation of the thirteenth aspect can refer to the technical effects corresponding to the above-mentioned second aspect and any implementation of the second aspect, which will not be repeated here.
  • In a fourteenth aspect, an embodiment of the present application provides a device for storing a code stream, the device comprising: a receiver and at least one storage medium; the receiver is used to receive the code stream, and the at least one storage medium is used to store the code stream, wherein the code stream is generated according to the first aspect or any implementation of the first aspect.
  • the fourteenth aspect and any implementation of the fourteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the fourteenth aspect and any implementation of the fourteenth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • In a fifteenth aspect, an embodiment of the present application provides a device for transmitting a code stream, the device comprising: a transmitter and at least one storage medium; the at least one storage medium is used to store the code stream, the code stream being generated according to the first aspect or any implementation of the first aspect; the transmitter is used to obtain the code stream from the storage medium and send the code stream to a terminal-side device through a transmission medium.
  • the fifteenth aspect and any implementation of the fifteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the fifteenth aspect and any implementation of the fifteenth aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
  • In a sixteenth aspect, an embodiment of the present application provides a system for distributing code streams, the system comprising: at least one storage medium for storing at least one code stream, the at least one code stream being generated according to the first aspect or any implementation of the first aspect; and a streaming media device for obtaining a target code stream from the at least one storage medium and sending the target code stream to an end-side device, wherein the streaming media device includes a content server or a content distribution server.
  • the sixteenth aspect and any implementation of the sixteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the sixteenth aspect and any implementation of the sixteenth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • FIG. 1a is a schematic diagram of an exemplary application scenario;
  • FIG. 1b is a schematic diagram of an exemplary application scenario;
  • FIG. 2a is a schematic diagram showing an exemplary encoding process;
  • FIG. 2b is a schematic diagram showing an exemplary distribution of candidate virtual speakers;
  • FIG. 3 is a schematic diagram showing an exemplary decoding process;
  • FIG. 4 is a schematic diagram showing an exemplary encoding process;
  • FIG. 5 is a schematic diagram showing an exemplary decoding process;
  • FIG. 6a is a schematic diagram showing the structure of an encoding end;
  • FIG. 6b is a schematic diagram showing the structure of a decoding end;
  • FIG. 7 is a schematic diagram showing an exemplary encoding process;
  • FIG. 8 is a schematic diagram showing an exemplary decoding process;
  • FIG. 9a is a schematic diagram showing the structure of an encoding end;
  • FIG. 9b is a schematic diagram showing the structure of a decoding end;
  • FIG. 10 is a schematic diagram showing the structure of an exemplary scene audio encoding device;
  • FIG. 11 is a schematic diagram showing the structure of an exemplary scene audio decoding device;
  • FIG. 12 is a schematic diagram showing the structure of an exemplary device.
  • "A and/or B" in this article merely describes an association relationship between associated objects, indicating that three relationships may exist; that is, A and/or B can mean: A exists alone, both A and B exist, or B exists alone.
  • The terms "first" and "second" in the description and claims of the embodiments of the present application are used to distinguish different objects rather than to describe a specific order of the objects. For example, a first target object and a second target object are used to distinguish different target objects rather than to describe a specific order of the target objects.
  • words such as “exemplary” or “for example” are used to indicate examples, illustrations or descriptions. Any embodiment or design described as “exemplary” or “for example” in the embodiments of the present application should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of words such as “exemplary” or “for example” is intended to present related concepts in a specific way.
  • "Multiple" refers to two or more. For example, multiple processing units refer to two or more processing units, and multiple systems refer to two or more systems.
  • Sound is a continuous wave generated by the vibration of an object.
  • the object that vibrates and emits sound waves is called a sound source.
  • When sound waves propagate through a medium (such as air, a solid or a liquid), the auditory organs of humans or animals can perceive the sound.
  • the characteristics of sound waves include pitch, intensity and timbre.
  • Pitch refers to how high or low a sound is.
  • Intensity refers to the strength of a sound. Intensity can also be called loudness or volume, and its unit is the decibel (dB).
  • Timbre is also called sound quality.
  • the frequency of sound waves determines the pitch of the sound. The higher the frequency, the higher the pitch.
  • the number of times an object vibrates in one second is called frequency, and the unit of frequency is hertz (Hz).
  • the frequency of sound that the human ear can recognize is between 20Hz and 20,000Hz.
  • the amplitude of the sound wave determines the intensity of the sound. The greater the amplitude, the greater the sound intensity. The closer to the sound source, the greater the sound intensity.
  • the waveform of the sound wave determines the timbre, and the waveforms of the sound wave include square wave, sawtooth wave, sine wave and pulse wave.
  • sounds can be divided into regular sounds and irregular sounds.
  • Irregular sounds refer to sounds produced by irregular vibrations of the sound source. Irregular sounds are, for example, noises that affect people's work, study, and rest.
  • Regular sounds refer to sounds produced by regular vibrations of the sound source. Regular sounds include speech and music.
  • regular sounds are analog signals that continuously change in the time and frequency domains. This analog signal can be called an audio signal.
  • An audio signal is an information carrier that carries speech, music, and sound effects.
  • the scene audio signal involved in the embodiment of the present application may refer to a signal used to describe a sound field; wherein the scene audio signal may include: an HOA signal (wherein the HOA signal may include a three-dimensional HOA signal and a two-dimensional HOA signal (also referred to as a planar HOA signal)) and a three-dimensional audio signal; the three-dimensional audio signal may refer to other audio signals in the scene audio signal except the HOA signal.
  • the following description takes the HOA signal as an example.
  • The sound pressure p satisfies formula (1): ∇²p + k²p = 0, where ∇² is the Laplace operator.
  • It is assumed that the space system outside the human ear is a sphere, the listener is at the center of the sphere, the sound coming from outside the sphere has a projection on the sphere, and the sound outside the sphere is filtered out. It is further assumed that the sound source is distributed on this sphere, and the sound field generated by the sound source on the sphere is used to fit the sound field generated by the original sound source; that is, three-dimensional audio technology is a method of fitting the sound field.
  • In this case, an ideal plane wave satisfies formula (2): p(r, θ, φ, k) = s · Σ_{m=0}^{∞} (2m+1) jᵐ j_m(kr) Σ_{0≤n≤m, σ=±1} Y_{m,n}^σ(θ, φ) Y_{m,n}^σ(θ_s, φ_s), where r represents the radius of the sphere, θ represents the horizontal angle information (or azimuth information), φ represents the pitch angle information (also called the elevation angle information), k represents the wave number, s represents the amplitude of the ideal plane wave, and m represents the order number of the HOA signal (or called the order of the HOA signal). j_m(kr) is the spherical Bessel function, also called the radial basis function, where the first j represents the imaginary unit; (2m+1) jᵐ j_m(kr) does not vary with angle. Y_{m,n}^σ(θ, φ) represents the spherical harmonics of the direction (θ, φ), and Y_{m,n}^σ(θ_s, φ_s) represents the spherical harmonics of the direction (θ_s, φ_s) of the sound source.
  • The HOA signal satisfies formula (3): B_{m,n}^σ = s · Y_{m,n}^σ(θ_s, φ_s); that is, the sound field can be described by the set of HOA coefficients B_{m,n}^σ.
  • By substituting formula (3) into formula (2), formula (3) can be transformed into formula (4): p(r, θ, φ, k) = Σ_{m=0}^{N} (2m+1) jᵐ j_m(kr) Σ_{0≤n≤m, σ=±1} B_{m,n}^σ Y_{m,n}^σ(θ, φ), where the summation over m is truncated at order N.
  • the sound field refers to the area in the medium where sound waves exist.
  • N in formula (4) is the truncation order, an integer greater than or equal to 1.
  • the scene audio signal is an information carrier that carries the spatial position information of the sound source in the sound field, and describes the sound field of the listener in space.
  • Formula (4) shows that the sound field can be expanded on the sphere according to spherical harmonics, that is, the sound field can be decomposed into the superposition of multiple plane waves. Therefore, the sound field described by the HOA signal can be expressed by the superposition of multiple plane waves, and the sound field can be reconstructed by the HOA coefficients.
  • The HOA signal to be encoded involved in the embodiments of the present application may refer to an N1-order HOA signal, which may be represented by HOA coefficients or Ambisonic (stereo reverberation) coefficients, where N1 is an integer greater than or equal to 1 (wherein, when N1 is equal to 1, the 1st-order HOA signal may be referred to as an FOA (First Order Ambisonic, first-order stereo reverberation) signal).
  • The N1-order HOA signal includes audio signals of (N1+1)² channels.
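As an illustrative aside (not part of the application itself), the relationship between the HOA order and the channel count, and the first-order (FOA) coefficients of an ideal plane wave, can be sketched as follows. The ACN channel ordering (W, Y, Z, X) and SN3D normalization used here are assumptions chosen for illustration; the application does not prescribe a particular convention:

```python
import numpy as np

def hoa_channel_count(order):
    # An N-order HOA signal carries (N + 1)**2 channels:
    # order 1 -> 4 channels (FOA), order 2 -> 9, order 3 -> 16, ...
    return (order + 1) ** 2

def foa_encode_plane_wave(azimuth, elevation, s=1.0):
    # First-order coefficients of an ideal plane wave of amplitude s
    # arriving from (azimuth, elevation), in ACN order (W, Y, Z, X)
    # with SN3D normalization (assumed convention).
    w = s
    y = s * np.sin(azimuth) * np.cos(elevation)
    z = s * np.sin(elevation)
    x = s * np.cos(azimuth) * np.cos(elevation)
    return np.array([w, y, z, x])
```

For example, a wave from straight ahead (azimuth 0, elevation 0) produces full amplitude on the W and X channels and nothing on Y and Z.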
  • Fig. 1a is a schematic diagram of an exemplary application scenario.
  • Fig. 1a shows a coding and decoding scenario of a scene audio signal.
  • the first electronic device may include a first audio acquisition module, a first scene audio encoding module, a first channel encoding module, a first channel decoding module, a first scene audio decoding module and a first audio playback module. It should be understood that the first electronic device may include more or fewer modules than those shown in FIG. 1a, and the present application does not limit this.
  • the second electronic device may include a second audio acquisition module, a second scene audio encoding module, a second channel encoding module, a second channel decoding module, a second scene audio decoding module, and a second audio playback module. It should be understood that the second electronic device may include more or fewer modules than those shown in FIG. 1a, and the present application does not limit this.
  • The process in which the first electronic device encodes and transmits the scene audio signal to the second electronic device, and the second electronic device decodes and plays it back, can be as follows: the first audio acquisition module performs audio acquisition and outputs the scene audio signal to the first scene audio encoding module. The first scene audio encoding module then encodes the scene audio signal and outputs the code stream to the first channel encoding module. After that, the first channel encoding module channel-encodes the code stream and transmits the channel-encoded code stream to the second electronic device through a wireless or wired network communication device. The second channel decoding module of the second electronic device then channel-decodes the received data to obtain the code stream and outputs the code stream to the second scene audio decoding module. The second scene audio decoding module decodes the code stream to obtain a reconstructed scene audio signal, which is output to the second audio playback module, and the second audio playback module performs audio playback.
  • The second audio playback module can perform post-processing on the reconstructed scene audio signal (such as audio rendering (for example, the reconstructed scene audio signal containing audio signals of (N1+1)² channels can be converted into an audio signal with the same number of channels as the number of speakers in the second electronic device), loudness normalization, user interaction, audio format conversion or noise removal, etc.) to convert the reconstructed scene audio signal into an audio signal suitable for playback by the speakers in the second electronic device.
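The audio-rendering step mentioned above (converting a reconstructed scene audio signal of (N1+1)² channels into one feed per physical speaker) can be sketched for the first-order case with a basic mode-matching decoder. The FOA steering convention and the pseudo-inverse decoder below are illustrative assumptions, not the rendering method prescribed by this application:

```python
import numpy as np

def foa_steering(azimuth, elevation):
    # First-order spherical-harmonic steering vector for one direction,
    # ACN order (W, Y, Z, X), SN3D normalization (assumed convention).
    return np.array([1.0,
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation),
                     np.cos(azimuth) * np.cos(elevation)])

def render_foa_to_speakers(foa_frame, speaker_dirs):
    # foa_frame: (4, T) HOA samples; speaker_dirs: [(az, el), ...].
    # Mode matching: choose speaker feeds G such that re-encoding the
    # feeds reproduces the HOA frame, i.e. Y @ G ~= foa_frame.
    Y = np.stack([foa_steering(az, el) for az, el in speaker_dirs], axis=1)  # (4, L)
    return np.linalg.pinv(Y) @ foa_frame  # (L, T) speaker feeds
```

With at least four well-placed speakers the re-encoding condition holds exactly; real renderers add regularization and psychoacoustic weighting on top of this idea.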
  • the process of the second electronic device encoding and transmitting the scene audio signal to the first electronic device, decoding and audio playback by the first electronic device is similar to the above-mentioned process of the first electronic device transmitting the scene audio signal to the second electronic device, and audio playback by the second electronic device, and will not be repeated here.
  • the first electronic device and the second electronic device may include, but are not limited to: personal computers, computer workstations, smart phones, tablet computers, servers, smart cameras, smart cars or other types of cellular phones, media consumption devices, wearable devices, set-top boxes, game consoles, etc.
  • the present application can be specifically applied to VR (Virtual Reality)/AR (Augmented Reality) scenarios.
  • For example, the first electronic device is a server and the second electronic device is a VR/AR device; alternatively, the second electronic device is a server and the first electronic device is a VR/AR device.
  • the first scene audio encoding module and the second scene audio encoding module may be scene audio encoders.
  • the first scene audio decoding module and the second scene audio decoding module may be scene audio decoders.
  • the first electronic device when the first electronic device encodes the scene audio signal and the second electronic device reconstructs the scene audio signal, the first electronic device can be called the encoding end and the second electronic device can be called the decoding end.
  • the second electronic device when the second electronic device encodes the scene audio signal and the first electronic device reconstructs the scene audio signal, the second electronic device can be called the encoding end and the first electronic device can be called the decoding end.
  • Fig. 1b is a schematic diagram of an exemplary application scenario.
  • Fig. 1b shows a transcoding scenario of a scene audio signal.
  • a wireless or core network device may include: a channel decoding module, other audio decoding modules, a scene audio encoding module and a channel encoding module.
  • the wireless or core network device may be used for audio transcoding.
  • the specific application scenario of Figure 1b (1) may be: when the first electronic device is not provided with a scene audio encoding module but only with other audio encoding modules; and the second electronic device is only provided with a scene audio decoding module but no other audio decoding modules, in order to enable the second electronic device to decode and play back the scene audio signal encoded by the first electronic device using other audio encoding modules, wireless or core network equipment may be used for transcoding.
  • the first electronic device uses other audio encoding modules to encode the scene audio signal to obtain a first code stream; and the first code stream is channel-encoded and sent to the wireless or core network device.
  • the channel decoding module of the wireless or core network device can perform channel decoding and output the first code stream obtained by channel decoding to other audio decoding modules.
  • other audio decoding modules decode the first code stream to obtain a scene audio signal and output the scene audio signal to the scene audio encoding module.
  • the scene audio encoding module can encode the scene audio signal to obtain a second code stream and output the second code stream to the channel encoding module, and the channel encoding module performs channel encoding on the second code stream and sends it to the second electronic device.
  • the second electronic device can call the scene audio decoding module, decode the second code stream obtained by channel decoding, and obtain a reconstructed scene audio signal; the reconstructed scene audio signal can be subsequently played back.
  • the wireless or core network device may include: a channel decoding module, a scene audio decoding module, other audio encoding modules and a channel encoding module.
  • the wireless or core network device may be used for audio transcoding.
  • the specific application scenario of Figure 1b (2) may be: when the first electronic device is only provided with a scene audio encoding module and no other audio encoding modules; and the second electronic device is not provided with a scene audio decoding module and only has other audio decoding modules, in order to enable the second electronic device to decode and play back the scene audio signal encoded by the first electronic device using the scene audio encoding module, wireless or core network equipment may be used for transcoding.
  • the first electronic device uses a scene audio encoding module to encode the scene audio signal to obtain a first code stream; and the first code stream is channel-encoded and sent to a wireless or core network device. Then, the channel decoding module of the wireless or core network device can perform channel decoding and output the first code stream obtained by channel decoding to the scene audio decoding module. After that, the scene audio decoding module decodes the first code stream to obtain a scene audio signal and outputs the scene audio signal to other audio encoding modules.
  • other audio encoding modules can encode the scene audio signal to obtain a second code stream and output the second code stream to the channel encoding module, and the channel encoding module performs channel encoding on the second code stream and sends it to the second electronic device.
  • the second electronic device can call other audio decoding modules to decode the second code stream obtained by channel decoding to obtain a reconstructed scene audio signal; the reconstructed scene audio signal can be subsequently played back.
  • FIG. 2a is a schematic diagram showing an exemplary encoding process.
  • The HOA signal may be an N1-order HOA signal, that is, the HOA signal obtained when m in the above formula (3) is truncated at the N1-th item.
  • S202 Determine attribute information of a target virtual speaker based on the scene audio signal.
  • S203 encode the first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bit stream; wherein the first audio signal is the audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
  • It should be noted that the virtual speaker is a virtualized speaker rather than a physically existing real speaker.
  • the scene audio signal can be expressed by the superposition of multiple plane waves, and then the target virtual speaker used to simulate the sound source in the scene audio signal can be determined; in this way, in the subsequent decoding process, the virtual speaker signal corresponding to the target virtual speaker is used to reconstruct the scene audio signal.
  • a plurality of candidate virtual speakers at different positions may be arranged on a spherical surface; then, a target virtual speaker whose position matches the position of a sound source in a scene audio signal may be selected from the plurality of candidate virtual speakers.
  • Fig. 2b is a schematic diagram showing an exemplary distribution of candidate virtual speakers.
  • multiple candidate virtual speakers may be evenly distributed on a spherical surface, and a point on the spherical surface represents a candidate virtual speaker.
  • a target virtual speaker whose position matches the sound source position in the scene audio signal can be selected from the multiple candidate virtual speakers; wherein the number of target virtual speakers can be one or more, and this application does not impose any restrictions on this.
  • the target virtual speaker may be preset.
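The position-matching selection described above can be sketched for the first-order case: project the scene audio frame onto the steering vector of each candidate virtual speaker and keep the best-scoring candidate. The matching score and the FOA convention below are illustrative assumptions; the application does not prescribe a specific matching criterion:

```python
import numpy as np

def foa_steering(azimuth, elevation):
    # First-order steering vector, ACN order (W, Y, Z, X), SN3D (assumed).
    return np.array([1.0,
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation),
                     np.cos(azimuth) * np.cos(elevation)])

def select_target_virtual_speaker(hoa_frame, candidate_dirs):
    # hoa_frame: (4, T) FOA samples; candidate_dirs: [(az, el), ...]
    # positions of the candidate virtual speakers on the sphere.
    # Score each candidate by the magnitude of the frame projected onto
    # its steering vector; the candidate whose position best matches the
    # sound source direction scores highest.
    scores = [np.abs(foa_steering(az, el) @ hoa_frame).sum()
              for az, el in candidate_dirs]
    return int(np.argmax(scores))
```

For a frame dominated by one plane wave, the inner product of two steering vectors grows with the cosine of the angular distance between their directions, so the nearest candidate wins.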
  • the scene audio signal can be reconstructed based on the virtual speaker signal; however, directly transmitting the virtual speaker signal of the target virtual speaker will increase the bit rate; and the virtual speaker signal of the target virtual speaker can be generated based on the attribute information of the target virtual speaker and the scene audio signal of some or all channels; therefore, the attribute information of the target virtual speaker can be obtained, and the audio signals of K channels in the scene audio signal can be obtained as the first audio signal; then the first audio signal and the attribute information of the target virtual speaker are encoded to obtain a first bit stream.
  • the first audio signal and the property information of the target virtual speaker may be down-mixed, transformed, quantized, and entropy encoded to obtain a first bitstream.
  • the first bitstream may include the encoded data of the first audio signal in the scene audio signal and the encoded data of the property information of the target virtual speaker.
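Purely as an illustration of what the first bitstream carries (encoded data of the K-channel first audio signal plus the attribute information of the target virtual speakers), a toy container could look like the following. The real codec applies down-mixing, transform, quantization and entropy coding rather than this naive layout, so every field here is an assumption:

```python
import struct
import numpy as np

def pack_first_bitstream(first_audio, speaker_attrs):
    # first_audio: (K, T) int16 PCM samples of the K selected channels.
    # speaker_attrs: [(azimuth, elevation), ...], one entry per target
    # virtual speaker (a stand-in for its attribute information).
    K, T = first_audio.shape
    out = struct.pack('<HHH', K, T, len(speaker_attrs))
    for az, el in speaker_attrs:
        out += struct.pack('<ff', az, el)          # attribute information
    out += first_audio.astype('<i2').tobytes()     # first audio signal
    return out

def unpack_first_bitstream(buf):
    K, T, S = struct.unpack_from('<HHH', buf, 0)
    offset = 6
    attrs = [struct.unpack_from('<ff', buf, offset + 8 * i) for i in range(S)]
    offset += 8 * S
    audio = np.frombuffer(buf, dtype='<i2', offset=offset).reshape(K, T)
    return audio, attrs
```

Note how small the attribute payload is (8 bytes per speaker here) compared with even one channel of audio, which is the bit-rate argument the application makes.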
  • the scene audio signal reconstructed based on the virtual speaker signal has higher audio quality; therefore, when K is equal to C1, at the same bit rate, the scene audio signal reconstructed by the present application has higher audio quality.
  • the number of channels of the audio signal encoded by the present application is less than the number of channels of the audio signal encoded by the prior art, and the data amount of the attribute information of the target virtual speaker is also much smaller than the data amount of the audio signal of one channel; therefore, under the premise of achieving the same quality, the encoding bit rate of the present application is lower.
  • the prior art converts the scene audio signal into a virtual speaker signal and a residual signal and then encodes it
  • the encoding end of the present application directly encodes the audio signals of some channels in the scene audio signal without calculating the virtual speaker signal and the residual signal, and the encoding complexity of the encoding end is lower.
  • Fig. 3 is a schematic diagram of an exemplary decoding process.
  • Fig. 3 shows a decoding process corresponding to the encoding process of Fig. 2a.
  • S302 Decode the first bit stream to obtain a first reconstructed signal and property information of a target virtual speaker.
  • the encoded data of the first audio signal in the scene audio signal included in the first code stream can be decoded to obtain the first reconstructed signal; that is, the first reconstructed signal is a reconstructed signal of the first audio signal.
  • the encoded data of the attribute information of the target virtual speaker included in the first code stream can be decoded to obtain the attribute information of the target virtual speaker.
  • the encoding end when the encoding end performs lossy compression on the first audio signal in the scene audio signal, there is a difference between the first reconstructed signal decoded by the decoding end and the first audio signal encoded by the encoding end.
  • the first reconstructed signal decoded by the decoding end is the same as the first audio signal encoded by the encoding end.
  • the encoding end performs lossy compression on the attribute information of the target virtual speaker, there is a difference between the attribute information decoded by the decoding end and the attribute information encoded by the encoding end.
  • the attribute information decoded by the decoding end is the same as the attribute information encoded by the encoding end. (In this application, the attribute information encoded by the encoding end and the attribute information decoded by the decoding end are not distinguished in terms of name.)
  • S303 Generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information and the first reconstructed signal.
  • S304 Reconstruct based on the attribute information and the virtual speaker signal to obtain a first reconstructed scene audio signal.
  • the scene audio signal can be reconstructed based on the virtual speaker signal; then, the virtual speaker signal corresponding to the target virtual speaker can be generated based on the attribute information of the target virtual speaker and the first reconstruction signal.
  • one target virtual speaker corresponds to one virtual speaker signal, and the virtual speaker signal is a plane wave. Then, reconstruction is performed based on the attribute information of the target virtual speaker and the virtual speaker signal to generate a first reconstructed scene audio signal.
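For the first-order case, steps S303 and S304 can be sketched as follows: the virtual speaker signals are recovered from the K-channel first reconstructed signal by a least-squares fit against the target speakers' steering vectors restricted to those K channels, and the scene audio signal is then rebuilt as the superposition of the resulting plane waves. The least-squares recovery and the FOA convention are illustrative assumptions, not the specific generation method of the application:

```python
import numpy as np

def foa_steering(azimuth, elevation):
    # First-order steering vector, ACN order (W, Y, Z, X), SN3D (assumed).
    return np.array([1.0,
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation),
                     np.cos(azimuth) * np.cos(elevation)])

def generate_virtual_speaker_signals(first_recon, speaker_dirs, channel_idx):
    # S303: first_recon is the (K, T) reconstruction of the K transmitted
    # channels; channel_idx lists which of the 4 FOA channels they are.
    # Solve Y[channel_idx] @ G ~= first_recon for the speaker signals G.
    Y = np.stack([foa_steering(az, el) for az, el in speaker_dirs], axis=1)  # (4, S)
    G, *_ = np.linalg.lstsq(Y[channel_idx, :], first_recon, rcond=None)
    return G  # (S, T): one plane-wave signal per target virtual speaker

def reconstruct_scene_audio(G, speaker_dirs):
    # S304: superpose the plane waves of all target virtual speakers to
    # rebuild the full 4-channel scene audio signal.
    Y = np.stack([foa_steering(az, el) for az, el in speaker_dirs], axis=1)
    return Y @ G  # (4, T)
```

When the transmitted channels are consistent with a single dominant plane wave, the fit recovers its signal exactly and the reconstruction reproduces all four channels from only K of them.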
  • the reconstructed first reconstructed scene audio signal may also be an HOA signal
  • the HOA signal may be an N2-order HOA signal, where N2 is a positive integer.
  • the order N2 of the first reconstructed scene audio signal may be greater than or equal to the order N1 of the scene audio signal in the embodiment of FIG. 2a ; correspondingly, the number of channels C2 of the audio signal included in the first reconstructed scene audio signal may be greater than or equal to the number of channels C1 of the audio signal included in the scene audio signal in the embodiment of FIG. 2a .
  • the first reconstructed scene audio signal may be directly used as the final decoding result.
  • the audio quality of the scene audio signal reconstructed based on the virtual speaker signal is higher; therefore, when K is equal to C1, at the same bit rate, the audio quality of the scene audio signal reconstructed by the present application is higher.
  • the number of channels of the audio signal encoded by the present application is less than the number of channels of the audio signal encoded by the prior art, and the data amount of the attribute information of the target virtual speaker is much smaller than the data amount of the audio signal of one channel; therefore, under the premise of the same bit rate, the audio quality of the reconstructed scene audio signal obtained by decoding by the present application is higher.
  • the virtual speaker signal and residual information encoded and transmitted by the prior art are converted from the original audio signal (i.e., the scene audio signal to be encoded), and are not the original audio signal, errors will be introduced; while the present application encodes part of the original audio signal (i.e., the audio signals of K channels in the scene audio signal to be encoded), thus avoiding the introduction of errors, thereby improving the audio quality of the reconstructed scene audio signal obtained by decoding; and it can also avoid fluctuations in the reconstruction quality of the reconstructed scene audio signal obtained by decoding, and has high stability.
  • the present application encodes and transmits the attribute information of the virtual speaker, and the data volume of the attribute information is much smaller than the data volume of the virtual speaker signal; therefore, the number of target virtual speakers selected by the present application is less subject to bandwidth restrictions.
  • the more target virtual speakers selected the higher the quality of the reconstructed scene audio signal based on the virtual speaker signal of the target virtual speaker. Therefore, compared with the prior art, under the same bit rate, the present application can select a larger number of target virtual speakers, so that the quality of the reconstructed scene audio signal decoded by the present application is higher.
  • the encoding end and decoding end of the present application do not need to perform residual and superposition operations. Therefore, the comprehensive complexity of the encoding end and decoding end of the present application is lower than the comprehensive complexity of the encoding end and decoding end of the prior art.
  • in the following description, the scene audio signal is an N1-order HOA signal, the first reconstructed scene audio signal is an N2-order HOA signal, N1 and N2 are both greater than 1, and K is less than C1.
  • a second reconstructed scene audio signal can be generated based on the first reconstructed scene audio signal and the first reconstructed signal; then, the second reconstructed scene audio signal is used as the final decoding result.
  • the audio signal corresponding to the channel of the first audio signal in the first reconstructed scene audio signal can be replaced by the first reconstructed signal.
  • the decoded first reconstructed signal is closer to the encoded first audio signal, so the obtained second reconstructed scene audio signal has higher audio quality than the first reconstructed scene audio signal.
  • the following describes the components of the scene audio signal (i.e., the N1-order HOA signal) and of the first reconstructed scene audio signal (i.e., the N2-order HOA signal).
  • an N1-order HOA signal may include a second audio signal and a third audio signal, wherein the second audio signal is the HOA signal obtained when the N1-order HOA signal is truncated to the M-th order (or, the second audio signal is the signal of the 0th order to the Mth order in the N1-order HOA signal; wherein the second audio signal includes audio signals of (M+1)² channels, where M is an integer less than N1), and the third audio signal is the audio signal in the N1-order HOA signal other than the second audio signal.
  • the second audio signal may be referred to as a low-order part of an N1-order HOA signal
  • the third audio signal may be referred to as a high-order part of an N1-order HOA signal.
  • for example, when N1 is 3, an N1-order HOA signal can include 16 channels of audio signals.
  • when the value of n in formula (3) is 0, formula (3) can be expanded to obtain 1 monomial, as shown in the following formula (5); at this time, an audio signal of 1 channel can be obtained.
  • when the value of n in formula (3) is 1, formula (3) can be expanded to obtain 3 monomials, as shown in the following formula (6); at this time, an audio signal of 3 channels can be obtained.
  • when the value of n in formula (3) is 2, formula (3) can be expanded to obtain 5 monomials, as shown in the following formula (7); at this time, an audio signal of 5 channels can be obtained.
  • when the value of n in formula (3) is 3, formula (3) can be expanded to obtain 7 monomials, as shown in the following formula (8); at this time, an audio signal of 7 channels can be obtained.
  • the second audio signal can include an audio signal of one channel, as shown in the above formula (5); the third audio signal can include another 15 channels of audio signals, as shown in the above formulas (6) to (8).
  • the second audio signal can include audio signals of 4 channels, as shown in the above formulas (5) and (6); the third audio signal can include audio signals of another 12 channels, as shown in the above formulas (7) and (8).
  • the second audio signal can include audio signals of 9 channels, as shown in the above formulas (5) to (7); the third audio signal can include audio signals of another 7 channels, as shown in the above formula (8).
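The low-order/high-order split above can be sketched numerically; this assumes the channels are stored in ascending-order (ACN-style) layout, which the text itself does not mandate:

```python
import numpy as np

def split_hoa(hoa: np.ndarray, M: int):
    """Split an (L, C) HOA frame into its 0..M-order part (the 'second audio
    signal', (M+1)^2 channels) and the remaining high-order part (the
    'third audio signal')."""
    low = (M + 1) ** 2
    return hoa[:, :low], hoa[:, low:]

frame = np.zeros((960, 16))            # one frame of a 3rd-order HOA signal
second, third = split_hoa(frame, M=1)
print(second.shape, third.shape)       # (960, 4) (960, 12)
```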
  • the N2-order HOA signal may include a sixth audio signal and a seventh audio signal, the sixth audio signal being the HOA signal obtained when the N2-order HOA signal is truncated to the M-th order (or, the sixth audio signal is the signal of the 0th order to the Mth order in the N2-order HOA signal; wherein the sixth audio signal includes audio signals of (M+1)² channels, where M is an integer less than N2), and the seventh audio signal is the audio signal in the N2-order HOA signal excluding the sixth audio signal.
  • the sixth audio signal may be referred to as a low-order part of an N2-order HOA signal
  • the seventh audio signal may be referred to as a high-order part of an N2-order HOA signal.
  • for example, when N2 is 3, the N2-order HOA signal can include 16 channels of audio signals.
  • when the value of n in formula (3) is 0, formula (3) can be expanded to obtain 1 monomial, as shown in the following formula (9); at this time, a 1-channel audio signal can be obtained.
  • when the value of n in formula (3) is 1, formula (3) can be expanded to obtain 3 monomials, as shown in the following formula (10); at this time, a 3-channel audio signal can be obtained.
  • when the value of n in formula (3) is 2, formula (3) can be expanded to obtain 5 monomials, as shown in the following formula (11); at this time, a 5-channel audio signal can be obtained.
  • when the value of n in formula (3) is 3, formula (3) can be expanded to obtain 7 monomials, as shown in the following formula (12); at this time, a 7-channel audio signal can be obtained.
  • the sixth audio signal can include an audio signal of one channel, as shown in the above formula (9); the seventh audio signal can include another 15 channels of audio signals, as shown in the above formulas (10) to (12).
  • the sixth audio signal can include audio signals of four channels, as shown in the above formulas (9) and (10); the seventh audio signal can include audio signals of another 12 channels, as shown in the above formulas (11) and (12).
  • the sixth audio signal can include audio signals of 9 channels, as shown in the above formulas (9) to (11); the seventh audio signal can include audio signals of another 7 channels, as shown in the above formula (12).
  • the following describes the process of selecting a target virtual speaker during encoding and the process of reconstructing the second reconstructed scene audio signal during decoding.
  • FIG. 4 is a schematic diagram showing an exemplary encoding process.
  • S401 may refer to the description of S201 above, which will not be repeated here.
  • the first configuration information of an encoding module (e.g., a scene audio encoding module) may be obtained; then, based on the first configuration information of the encoding module, the second configuration information of the candidate virtual speakers may be determined; then, based on the second configuration information of the candidate virtual speakers, multiple candidate virtual speakers may be generated.
  • the first configuration information includes but is not limited to: encoding bit rate, user-defined information (for example, the HOA order corresponding to the encoding module (referring to the order of the HOA signal that the encoding module can support encoding), the order of the reconstructed scene audio signal (the order of the reconstructed HOA signal obtained by the desired decoding end), the format of the reconstructed scene audio signal (the format of the reconstructed HOA signal obtained by the desired decoding end), etc.); this application is not limited to this.
  • the second configuration information includes, but is not limited to: the total number of candidate virtual speakers, the HOA order of each candidate virtual speaker, the location information of each candidate virtual speaker, and other information; this application does not impose any restrictions on this.
  • there may be multiple ways to determine the second configuration information of the candidate virtual speakers; for example, if the encoding bit rate is low, a smaller number of candidate virtual speakers may be configured; if the encoding bit rate is high, a larger number of candidate virtual speakers may be configured.
  • the HOA order of the virtual speaker may be configured as the HOA order of the encoding module.
  • the second configuration information of the candidate virtual speakers may also be determined according to user-defined information (for example, the total number of candidate virtual speakers, the HOA order of each candidate virtual speaker, the location information of each candidate virtual speaker, and other information that can be customized by the user).
  • a configuration table may be pre-set, and the configuration table includes the relationship between the number of candidate virtual speakers and the position information of the candidate virtual speakers.
  • the position information of each candidate virtual speaker may be determined by searching the configuration table.
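As an illustration of such a configuration table, the sketch below maps a total candidate count to a list of speaker positions; the counts and (azimuth, elevation) values are invented placeholders, not values from this application:

```python
# Hypothetical configuration table: total number of candidate virtual
# speakers -> (azimuth, elevation) positions in degrees. Placeholder values.
CONFIG_TABLE = {
    2: [(0.0, 0.0), (180.0, 0.0)],
    4: [(0.0, 0.0), (90.0, 0.0), (180.0, 0.0), (270.0, 0.0)],
}

def speaker_positions(total: int):
    """Look up the candidate virtual speaker positions for a given total."""
    return CONFIG_TABLE[total]

print(speaker_positions(4))
```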
  • multiple candidate virtual speakers can be generated based on the second configuration information of the candidate virtual speakers.
  • a corresponding number of candidate virtual speakers can be generated according to the total number of candidate virtual speakers, and the HOA order of each candidate virtual speaker can be set according to the HOA order of each candidate virtual speaker; and the position of each candidate virtual speaker can be set according to the position information of each candidate virtual speaker.
  • the virtual speaker signal generated by the virtual sound source is a plane wave, which can be expanded in a spherical coordinate system.
  • the ideal plane wave can be expanded using spherical harmonics, as shown in formula (3).
  • the HOA order of the candidate virtual speaker is the truncated value of m in formula (3).
  • the virtual speaker coefficients corresponding to each candidate virtual speaker can be determined according to the HOA order of each candidate virtual speaker (wherein each candidate virtual speaker corresponds to a set of virtual speaker coefficients).
  • specifically, the cutoff value of m in formula (3) can be set to the HOA order of the candidate virtual speaker, and the position information of the candidate virtual speaker can be substituted into formula (3), so as to obtain the virtual speaker coefficients corresponding to the candidate virtual speaker.
  • the set of virtual speaker coefficients corresponding to the candidate virtual speakers determined in S402 may include C1 virtual speaker coefficients, and one virtual speaker coefficient corresponds to one channel of the scene audio signal.
  • the second configuration information of the candidate virtual speaker is determined (subsequently referred to as "step A"); according to the second configuration information of the candidate virtual speaker, multiple candidate virtual speakers are generated (subsequently referred to as "step B") and the virtual speaker coefficient corresponding to each candidate virtual speaker is determined (subsequently referred to as "step C"); these three steps can be performed in advance, that is, before obtaining the scene audio signal to be encoded.
  • step A and step B are performed in advance, and step C is performed after the scene audio signal to be encoded is acquired.
  • step A is performed in advance, and steps B and C are performed after the scene audio signal to be encoded is acquired.
  • step A, step B and step C are all performed after obtaining the scene audio signal to be encoded.
  • S403 Selecting a target virtual speaker from a plurality of candidate virtual speakers based on the scene audio signal and the plurality of groups of virtual speaker coefficients.
  • inner products are respectively performed on the scene audio signal and the multiple groups of virtual speaker coefficients to obtain multiple inner product values; the multiple inner product values correspond one-to-one with the multiple groups of virtual speaker coefficients.
  • an inner product of a group of virtual speaker coefficients corresponding to the candidate virtual speaker and the scene audio signal can be performed to obtain a corresponding inner product value.
  • a target virtual speaker can be selected from multiple candidate virtual speakers based on multiple inner product values.
  • the first G (G is a positive integer) candidate virtual speakers with the largest inner product values can be selected as target virtual speakers.
  • the candidate virtual speaker with the largest inner product can be first selected as a target virtual speaker; then, the scene audio signal is projected and superimposed on a linear combination of a set of virtual speaker coefficients corresponding to the candidate virtual speaker with the largest inner product to obtain a projection vector; then, the projection vector is subtracted from the scene audio signal to obtain a difference. Afterwards, the above process is repeated for the difference to implement iterative calculation, and a target virtual speaker is generated each iteration.
  • the inner product value between the scene audio signal of each frame of the scene audio signal and the virtual speaker coefficients corresponding to each candidate virtual speaker can be determined based on a frame of the scene audio signal; in this way, the target virtual speaker corresponding to each frame of the scene audio signal can be determined.
  • a frame of scene audio signal can be split into multiple subframes, and then the inner product value between each subframe and the virtual speaker coefficients corresponding to each candidate virtual speaker can be determined in units of one subframe; in this way, the target virtual speaker corresponding to each subframe can be determined.
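The iterative selection described above (pick the candidate with the largest inner product, subtract its projection, repeat) resembles matching pursuit; below is a simplified sketch, with assumed shapes — the scene signal as an (L, C1) matrix and each candidate's coefficients as a length-C1 row — rather than this application's exact computation:

```python
import numpy as np

def select_target_speakers(scene: np.ndarray, coeffs: np.ndarray, G: int):
    """Greedily select G target virtual speakers.

    scene:  (L, C1) frame (or subframe) of the scene audio signal
    coeffs: (Q, C1) matrix, one row of virtual speaker coefficients per
            candidate virtual speaker
    Returns the list of selected candidate indices, one per iteration.
    """
    residual = scene.copy()
    selected = []
    for _ in range(G):
        # inner product of the residual with each candidate's coefficients
        scores = np.abs(residual @ coeffs.T).sum(axis=0)        # (Q,)
        best = int(np.argmax(scores))
        selected.append(best)
        # subtract the residual's projection onto the chosen coefficients
        v = coeffs[best] / (np.linalg.norm(coeffs[best]) + 1e-12)
        residual = residual - np.outer(residual @ v, v)
    return selected
```

With orthonormal coefficient rows this reduces to picking the largest-energy directions first.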
  • S404 Generate the attribute information of the target virtual speaker.
  • the position information of the target virtual speaker (including the pitch angle information and the horizontal angle information) can be used as the attribute information of the target virtual speaker.
  • the position index corresponding to the position information of the target virtual speaker (including the pitch angle index (which can be used to uniquely identify the pitch angle information) and the horizontal angle index (which can be used to uniquely identify the horizontal angle information)) is used as the attribute information of the target virtual speaker.
  • a virtual speaker index (eg, virtual speaker identifier) of the target virtual speaker may be used as the attribute information of the target virtual speaker, wherein the virtual speaker index corresponds to the position information one-to-one.
  • the virtual speaker coefficient of the target virtual speaker can be used as the attribute information of the target virtual speaker.
  • for example, C2 virtual speaker coefficients of the target virtual speaker can be determined and used as the attribute information of the target virtual speaker; wherein the C2 virtual speaker coefficients of the target virtual speaker correspond one-to-one to the audio signals of the C2 channels included in the first reconstructed scene audio signal.
  • the data volume of the virtual speaker coefficient is much larger than the data volume of the position information, the index of the position information, and the virtual speaker index; according to the bandwidth, it can be decided which information among the position information, the index of the position information, the virtual speaker index, and the virtual speaker coefficient is used as the attribute information of the target virtual speaker.
  • the virtual speaker coefficient can be used as the attribute information of the target virtual speaker; in this way, the decoding end does not need to calculate the virtual speaker coefficient of the target virtual speaker, which can save the computing power of the decoding end.
  • any one of the position information, the index of the position information, and the virtual speaker index can be used as the attribute information of the target virtual speaker; in this way, the bit rate can be saved. It should be understood that it can also be pre-set which information among the position information, the index of the position information, the virtual speaker index, and the virtual speaker coefficient is used as the attribute information of the target virtual speaker; this is not limited in this application.
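The bandwidth-dependent choice described above can be sketched as follows; the 256 kbps threshold and the returned labels are invented placeholders used only to illustrate the trade-off between decoder computing power and bit rate:

```python
def choose_attribute_kind(bandwidth_kbps: float) -> str:
    """Decide which attribute representation of the target virtual speaker
    to encode. The threshold value is an illustrative placeholder."""
    if bandwidth_kbps >= 256:
        # ample bandwidth: send the coefficients themselves, so the decoding
        # end need not recompute them (saves decoder computing power)
        return "virtual_speaker_coefficient"
    # tight bandwidth: send a compact identifier instead (saves bit rate)
    return "virtual_speaker_index"

print(choose_attribute_kind(64), choose_attribute_kind(512))
```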
  • S405 Encode the first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bit stream.
  • the first audio signal is the second audio signal; that is, the first audio signal is the low-order part of the scene audio signal.
  • the first audio signal includes an audio signal of 1 channel; for example, the first audio signal is an audio signal of 1 channel represented by the above formula (5).
  • the first audio signal includes an audio signal of 4 channels; for example, the first audio signal includes an audio signal of 4 channels represented by the above formula (5) and formula (6).
  • the first audio signal includes an audio signal of 9 channels; for example, the first audio signal includes an audio signal of 9 channels represented by the above formula (5), formula (6) and formula (7).
  • the number of channels included in the second audio signal may be an odd number or an even number.
  • the first audio signal may include a second audio signal and a fourth audio signal, wherein the fourth audio signal is an audio signal of some channels in the third audio signal.
  • an audio signal of an odd number of channels may be selected from the third audio signal as the fourth audio signal; that is, the fourth audio signal may include an audio signal of an odd number of channels.
  • the first audio signal may include an audio signal of 1 channel represented by the above formula (5) and an audio signal of 1 channel represented by the first item of the above formula (6), in which case the first audio signal includes an audio signal of 2 channels.
  • the first audio signal may include 9-channel audio signals represented by the above formulas (5) to (7) and 1-channel audio signal represented by the first item of the above formula (8). In this case, the first audio signal includes 10-channel audio signals.
  • an audio signal of an even number of channels can be selected from the third audio signal as the fourth audio signal.
  • the first audio signal can include the audio signals represented by the above formula (5) and formula (6), and by the first two terms of the above formula (7); in this case, the first audio signal includes audio signals of 6 channels.
  • when the second audio signal includes an even number of channels, it is also possible not to select audio signals of some channels from the third audio signal, but to directly use the second audio signal as the first audio signal.
  • the number of channels of the audio signal included in the first audio signal can be determined according to demand and bandwidth, and the present application does not impose any limitation on this.
  • Fig. 5 is a schematic diagram of an exemplary decoding process.
  • Fig. 5 is a decoding process corresponding to the encoding process in Fig. 4 .
  • S502 Decode the first bit stream to obtain a first reconstructed signal and attribute information of a target virtual speaker.
  • S501 to S502 may refer to the description of S301 to S302, which will not be repeated here.
  • S303 may refer to the following S503 to S504:
  • S503 Determine a first virtual speaker coefficient corresponding to the target virtual speaker based on the attribute information.
  • the encoder can write M into the first bitstream; then M can be decoded from the first bitstream (of course, the encoder and the decoder can also pre-agree on M, and this application does not limit this).
  • when the attribute information of the target virtual speaker is position information, the position information of the target virtual speaker can be substituted into the above formula (3), with m in formula (3) set equal to M, so that the first virtual speaker coefficient corresponding to the target virtual speaker can be obtained.
  • the first virtual speaker coefficient includes (M+1)² virtual speaker coefficients, and these (M+1)² virtual speaker coefficients correspond to the (M+1)² channels of the second reconstructed signal; wherein, the second reconstructed signal is a reconstructed signal of the second audio signal.
  • when the attribute information of the target virtual speaker is a position index, the position information of the target virtual speaker can be determined according to the relationship between the position information and the position index; and then the first virtual speaker coefficient is determined in the above manner, which will not be repeated here.
  • when the attribute information of the target virtual speaker is a virtual speaker index, the position information of the target virtual speaker can be determined according to the relationship between the position information and the virtual speaker index; and then the first virtual speaker coefficient is determined in the above manner, which will not be repeated here.
  • a set of virtual speaker coefficients corresponding to the target virtual speaker includes C2 virtual speaker coefficients; at this time, the (M+1)² virtual speaker coefficients corresponding to the (M+1)² channels included in the second reconstructed signal can be selected as the first virtual speaker coefficients.
  • S504 Generate a virtual speaker signal based on the first reconstructed signal and the first virtual speaker coefficient.
  • a virtual speaker signal may be generated based on the second reconstructed signal (in the first reconstructed signal) and the first virtual speaker coefficient.
  • a matrix A of size (Y1 × P) is used to represent the first virtual speaker coefficient of the target virtual speaker, where Y1 (Y1 is a positive integer) is the number of target virtual speakers, and P is the number of channels of the audio signal contained in the second reconstructed signal, i.e., (M+1)².
  • a matrix X of size (L × P) is used to represent the second reconstructed signal; where L is the number of sampling points of the second reconstructed signal.
  • matrix A⁻¹ is the inverse matrix of matrix A.
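Putting the definitions above together, a virtual speaker signal of size (L × Y1) can be obtained from X and the inverse of A. The sketch below uses a Moore-Penrose pseudo-inverse so the same code also covers a non-square A — an assumption on our part, since the text only defines the inverse A⁻¹:

```python
import numpy as np

Y1, P, L = 4, 4, 960
rng = np.random.default_rng(0)
A = rng.standard_normal((Y1, P))   # first virtual speaker coefficients (Y1 x P)
X = rng.standard_normal((L, P))    # second reconstructed signal (L x P)

# virtual speaker signal W (L x Y1); pinv(A) equals inv(A) when A is square
# and invertible
W = X @ np.linalg.pinv(A)
print(W.shape)                     # (960, 4)
```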
  • S304 may refer to the following S505 to S506:
  • S505 Determine a second virtual speaker coefficient corresponding to the target virtual speaker based on the attribute information of the target virtual speaker.
  • m in the above formula (3) can be determined to be equal to N2 according to the order N2 of the desired reconstructed scene audio signal (that is, the order N2 of the first reconstructed scene audio signal or the second reconstructed scene audio signal). Then, when the attribute information of the target virtual speaker is position information, the position information of the target virtual speaker can be substituted into the above formula (3), and m in formula (3) is set to be equal to N2, so as to obtain the second virtual speaker coefficient.
  • the second virtual speaker coefficient includes C2 virtual speaker coefficients, and these C2 virtual speaker coefficients correspond to C2 channels of the first reconstructed scene audio signal.
  • when the attribute information of the target virtual speaker is a position index, the position information of the target virtual speaker can be determined according to the relationship between the position information and the position index; and then the second virtual speaker coefficient is determined in the above manner, which will not be repeated here.
  • when the attribute information of the target virtual speaker is a virtual speaker index, the position information of the target virtual speaker can be determined according to the relationship between the position information and the virtual speaker index; and then the second virtual speaker coefficient is determined in the above manner, which will not be repeated here.
  • when the attribute information of the target virtual speaker is a virtual speaker coefficient, the attribute information of the target virtual speaker may be directly used as the second virtual speaker coefficient.
  • S506 Obtain a first reconstructed scene audio signal based on the virtual speaker signal and the second virtual speaker coefficient.
  • a matrix A of size (Y1 × C2) is used to represent the second virtual speaker coefficient, where Y1 is the number of target virtual speakers and C2 is the number of channels of the first reconstructed scene audio signal.
  • a matrix B of size (L × Y1) is used to represent the virtual speaker signal; where L is the number of sampling points of the first reconstructed scene audio signal.
  • H = BA    (14)
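Formula (14) is a plain matrix product with the shapes just given; a minimal sketch:

```python
import numpy as np

Y1, C2, L = 4, 16, 960
A = np.ones((Y1, C2))   # second virtual speaker coefficients (Y1 x C2)
B = np.ones((L, Y1))    # virtual speaker signal (L x Y1)

H = B @ A               # first reconstructed scene audio signal (L x C2)
print(H.shape)          # (960, 16)
```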
  • S507 Generate a second reconstructed scene audio signal based on the first reconstructed signal and the first reconstructed scene audio signal.
  • since the first reconstructed signal obtained by decoding is closer to the first audio signal encoded by the encoding end, a second reconstructed scene audio signal is generated based on the first reconstructed scene audio signal and the first reconstructed signal, and the second reconstructed scene audio signal is used as the final decoding result; in this way, a reconstructed scene audio signal with higher audio quality can be obtained.
  • the first reconstructed signal is the second reconstructed signal; in this case, a second reconstructed scene audio signal can be generated based on the second reconstructed signal and the seventh audio signal.
  • the second reconstructed scene audio signal can be generated by splicing the second reconstructed signal and the seventh audio signal according to channels.
  • the obtained second reconstructed scene audio signal may include: a reconstructed signal of the audio signal of one channel represented by formula (5) and signals of 15 channels represented by formulas (10) to (12).
  • the first reconstructed signal may include the second reconstructed signal and the fourth reconstructed signal (the fourth reconstructed signal is a reconstructed signal of the fourth audio signal); in this case, a second reconstructed scene audio signal may be generated based on the second reconstructed signal, the fourth reconstructed signal and the eighth audio signal.
  • the eighth audio signal is an audio signal of some channels in the seventh audio signal
  • the eighth audio signal is an audio signal of other channels in the seventh audio signal except the channel corresponding to the fourth audio signal.
  • the second reconstructed scene audio signal may be generated by splicing the second reconstructed signal, the fourth reconstructed signal and the eighth audio signal according to the channels.
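The channel-wise splicing can be sketched as follows; the channel indices used here are illustrative (they assume the decoded channels are the four 0th/1st-order channels plus one 2nd-order channel):

```python
import numpy as np

def splice_channels(rec_scene: np.ndarray, rec_signal: np.ndarray, chans):
    """Replace the listed channels of the first reconstructed scene audio
    signal with the decoded first reconstructed signal, channel by channel."""
    out = rec_scene.copy()
    out[:, chans] = rec_signal
    return out

scene = np.zeros((960, 16))   # first reconstructed scene audio signal
rec = np.ones((960, 5))       # decoded first reconstructed signal (5 channels)
out = splice_channels(scene, rec, [0, 1, 2, 3, 4])
print(out[:, :5].sum(), out[:, 5:].sum())   # 4800.0 0.0
```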
  • for example, the second audio signal includes the signal of one channel represented by the above formula (5), the fourth audio signal is the signal of one channel represented by the first term in the above formula (6), and the first audio signal includes the second audio signal and the fourth audio signal.
  • the eighth audio signal is a signal of two channels represented by the last two terms in the above formula (10), and a signal of 12 channels represented by formulas (11) to (12).
  • the obtained second reconstructed scene audio signal may include: a reconstructed signal of an audio signal of one channel represented by formula (5) and a reconstructed signal of an audio signal of one channel represented by the first term in formula (6), a signal of two channels represented by the last two terms in formula (10), and a signal of 12 channels represented by formulas (11) to (12).
  • the second reconstructed scene audio signal may be an N2-order HOA signal, where N2 is a positive integer.
  • the order N2 of the second reconstructed scene audio signal may be greater than or equal to the order N1 of the scene audio signal; correspondingly, the number of channels C2 of the audio signal included in the second reconstructed scene audio signal may be greater than or equal to the number of channels C1 of the audio signal included in the scene audio signal.
  • the decoding end can reconstruct a reconstructed scene audio signal having the same order as the scene audio signal encoded by the encoding end.
  • the decoding end may reconstruct a reconstructed scene audio signal having an order greater than the order of the scene audio signal encoded by the encoding end.
  • FIG. 6 a is a schematic diagram showing the structure of an encoding end.
  • the encoding end may include a configuration unit, a virtual speaker generation unit, a target speaker generation unit, and a core encoder. It should be understood that FIG. 6a is only an example of the present application, and the encoding end of the present application may include more or fewer modules than those shown in FIG. 6a; this is not limited here.
  • the configuration unit may be configured to determine second configuration information of the candidate virtual speaker according to first configuration information of the encoding module.
  • the virtual speaker generation unit may be configured to generate a plurality of candidate virtual speakers according to the second configuration information of the candidate virtual speakers and determine a virtual speaker coefficient corresponding to each candidate virtual speaker.
  • the target speaker generation unit may be configured to select a target virtual speaker from a plurality of candidate virtual speakers based on a scene audio signal and a plurality of groups of virtual speaker coefficients, and to determine attribute information of the target virtual speaker.
  • the core encoder may be used to encode the first audio signal in the scene audio signal and the attribute information of the target virtual speaker.
  • the scene audio encoding module in FIG. 1a and FIG. 1b may include the configuration unit, the virtual speaker generation unit, the target speaker generation unit, and the core encoder of FIG. 6a; or, may only include the core encoder.
  • FIG. 6 b is a schematic diagram showing the structure of a decoding end.
  • the decoding end may include a core decoder, a virtual speaker coefficient generation unit, a virtual speaker signal generation unit, a first reconstruction unit, and a second reconstruction unit. It should be understood that Figure 6b is only an example of the present application, and the decoding end of the present application may include more or fewer modules than those shown in Figure 6b, which will not be repeated here.
  • the core decoder may be used to decode the first bitstream to obtain the first reconstructed signal and property information of the target virtual speaker.
  • the virtual speaker coefficient generating unit may be configured to determine a first virtual speaker coefficient and a second virtual speaker coefficient based on property information of a target virtual speaker.
  • the virtual speaker signal generating unit may be configured to generate a virtual speaker signal based on the first reconstructed signal and the first virtual speaker coefficient.
  • the first reconstruction unit may be configured to obtain a first reconstructed scene audio signal based on the virtual speaker signal and the second virtual speaker coefficient.
  • the second reconstruction unit may be configured to generate a second reconstructed scene audio signal based on the first reconstructed signal and the first reconstructed scene audio signal.
  • the scene audio decoding module in FIG. 1a and FIG. 1b may include the core decoder, the virtual speaker coefficient generation unit, the virtual speaker signal generation unit, the first reconstruction unit and the second reconstruction unit of FIG. 6b; or, may only include the core decoder.
  • feature information corresponding to the fifth audio signal in the scene audio signal can also be extracted, and encoded and sent to the decoder; after receiving the bit stream, the decoding end can compensate for the seventh audio signal/eighth audio signal in the first reconstructed scene audio signal based on the feature information, thereby improving the audio quality of the seventh audio signal/eighth audio signal in the first reconstructed scene audio signal/the second reconstructed scene audio signal.
  • FIG. 7 is a schematic diagram of an exemplary encoding process.
  • S703 Select a target virtual speaker from multiple candidate virtual speakers based on the scene audio signal and multiple groups of virtual speaker coefficients.
  • S705 Encode the first audio signal and the attribute information of the target virtual speaker in the scene audio signal to obtain a first bit stream.
  • S701 to S705 may refer to the description of S401 to S405 above, which will not be repeated here.
  • S706 Acquire feature information corresponding to the fifth audio signal in the scene audio signal.
  • the fifth audio signal is the third audio signal.
  • for example, if the second audio signal is the audio signal of one channel represented by the above formula (5), the fifth audio signal (i.e., the third audio signal) may be the audio signals of 15 channels represented by the above formulas (6) to (9).
  • the fifth audio signal may be an audio signal in the scene audio signal except the second audio signal and the fourth audio signal.
  • the first audio signal includes a second audio signal and a fourth audio signal
  • the second audio signal is an audio signal of one channel represented by the above formula (5)
  • the fourth audio signal is an audio signal of one channel represented by the first item in the above formula (6)
  • the fifth audio signal may include audio signals of two channels represented by the last two items in the above formula (6), and audio signals of 12 channels represented by formulas (7) to (9).
  • the scene audio signal may be analyzed to determine information such as the strength and energy of the scene audio signal; then, based on the information such as the strength and energy of the scene audio signal, feature information corresponding to the fifth audio signal in the scene audio signal may be extracted.
  • the characteristic information corresponding to the scene audio signal includes but is not limited to: gain information and diffusion information.
  • i is the channel number of the channel included in the fifth audio signal in the scene audio signal
  • E(i) is the energy of the i-th channel
  • E(1) is the energy of the audio signal of the C1 channel in the scene audio signal.
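The gain computation described in the bullets above can be sketched as follows. This is a minimal illustration assuming Gain(i) = E(i)/E(1), where E(i) is the energy of the i-th channel; the function name, the frame layout, and the choice of the first channel as the reference are assumptions for this sketch, not taken from the embodiment.

```python
import numpy as np

def extract_gain_info(scene, num_core_channels):
    """Encoder-side feature extraction (sketch).

    scene: (C1, T) array, one row per channel of the scene audio signal.
    num_core_channels: K, the channels encoded directly (the first audio signal).
    Returns one gain value per remaining channel, assumed here to be the
    channel energy E(i) relative to a reference energy E(1).
    """
    energy = np.sum(scene ** 2, axis=1)  # E(i) for every channel
    e_ref = energy[0]                    # E(1): assumed reference channel
    # Gain(i) = E(i) / E(1) for the channels not in the first audio signal
    return energy[num_core_channels:] / e_ref
```

For a 3rd-order HOA frame (C1 = 16) with K = 9, this would return 7 gain values, one per feature-compensated channel.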
  • S707 Encode the characteristic information to obtain a second code stream.
  • feature information corresponding to the first audio signal in the scene audio signal may be encoded to obtain a second bitstream.
  • the second bitstream may be sent to a decoding end, so that the decoding end may compensate the seventh audio signal/eighth audio signal in the first reconstructed scene audio signal based on the feature information corresponding to the fifth audio signal in the scene audio signal, so as to improve the audio quality of the first reconstructed scene audio signal.
  • Fig. 8 is a schematic diagram of an exemplary decoding process.
  • Fig. 8 is a decoding process corresponding to the encoding process of Fig. 7 .
  • S802 Decode the first bit stream to obtain a first reconstructed signal and property information of a target virtual speaker.
  • S803 Decode the second bit stream to obtain the feature information corresponding to the fifth audio signal in the scene audio signal.
  • if the encoding end performs lossless compression on the feature information, the feature information decoded by the decoding end is the same as the feature information encoded by the encoding end; if the encoding end performs lossy compression on the feature information, the two may differ. (In this application, the feature information encoded by the encoding end and the feature information decoded by the decoding end are not distinguished in terms of name.)
  • S804 Determine a first virtual speaker coefficient based on the attribute information.
  • S805 Generate a virtual speaker signal based on the first reconstructed signal and the first virtual speaker coefficient.
  • S806 Determine a second virtual speaker coefficient based on the attribute information.
  • S807 Obtain a first reconstructed scene audio signal based on the virtual speaker signal and the second virtual speaker coefficient.
  • S801 to S807 may refer to the description of S501 to S506 above, which will not be repeated here.
  • S808 Compensate the seventh audio signal in the first reconstructed scene audio signal based on the feature information.
  • the seventh audio signal in the first reconstructed scene audio signal may be compensated based on the feature information corresponding to the fifth audio signal in the scene audio signal to improve the quality of the seventh audio signal in the first reconstructed scene audio signal.
  • i is the channel number of the channel contained in the seventh audio signal in the first reconstructed scene audio signal
  • E(i) is the energy of the i-th channel
  • E(1) is the energy of the C2 channel audio signal in the first reconstructed scene audio signal
  • Gain(i) is the gain information corresponding to the audio signal of the i-th channel in the fifth audio signal in the scene audio signal.
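The compensation of S808 can then be sketched as the inverse operation: each compensated channel is rescaled so that its energy ratio against the reference channel matches the transmitted Gain(i). The exact compensation rule is not spelled out here, so this scaling is an assumption for illustration only.

```python
import numpy as np

def compensate(reconstructed, gains, num_core_channels, eps=1e-12):
    """Decoder-side compensation (sketch of S808).

    reconstructed: (C2, T) first reconstructed scene audio signal.
    gains: transmitted Gain(i) values for the channels to compensate.
    Each compensated channel i is rescaled so that E(i)/E(1) equals
    Gain(i) afterwards (assumed rule; eps avoids division by zero).
    """
    out = reconstructed.copy()
    e_ref = np.sum(out[0] ** 2)          # E(1) measured at the decoder
    for k, g in enumerate(gains):
        i = num_core_channels + k
        e_i = np.sum(out[i] ** 2)        # current energy E(i)
        out[i] *= np.sqrt(g * e_ref / (e_i + eps))
    return out
```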
  • S809 Generate a second reconstructed scene audio signal based on the second reconstructed signal and the seventh audio signal.
  • the seventh audio signal in S809 is the seventh audio signal compensated based on the characteristic information; S809 may refer to the above description and will not be described in detail herein.
  • the eighth audio signal in the first reconstructed scene audio signal may likewise be compensated based on the feature information; a second reconstructed scene audio signal may then be generated based on the second reconstructed signal, the fourth reconstructed signal, and the compensated eighth audio signal; reference may be made to the description of S808 to S809, which will not be repeated here.
  • S808 can be executed, that is, the first reconstructed scene audio signal can be compensated and the compensated first reconstructed scene audio signal can be used as the final reconstructed scene audio signal. In this way, the audio quality of the final reconstructed scene audio signal can also be improved.
  • Fig. 9a is a schematic diagram of the structure of an encoding end shown as an example, wherein Fig. 9a is the structure of the encoding end shown on the basis of Fig. 6a.
  • the encoding end may include a configuration unit, a virtual speaker generation unit, a target speaker generation unit, a core encoder and a feature extraction unit. It should be understood that Figure 9a is only an example of the present application, and the encoding end of the present application may include more or fewer modules than those shown in Figure 9a, which will not be repeated here.
  • the configuration unit, the virtual speaker generation unit, and the target speaker generation unit in FIG. 9 a may refer to the description of FIG. 6 a and are not described in detail here.
  • the feature extraction unit may be configured to obtain feature information corresponding to the fifth audio signal in the scene audio signal.
  • the core encoder can be used to encode the attribute information of the first audio signal and the target virtual speaker in the scene audio signal to obtain a first code stream; and encode the feature information corresponding to the fifth audio signal in the scene audio signal to obtain a second code stream.
  • the scene audio encoding module in FIG. 1a and FIG. 1b may include the configuration unit, the virtual speaker generation unit, the target speaker generation unit, the core encoder and the feature extraction unit of FIG. 9a; or, may only include the core encoder.
  • FIG. 9 b is a schematic diagram showing the structure of a decoding end.
  • the decoding end may include a core decoder, a virtual speaker coefficient generation unit, a virtual speaker signal generation unit, a first reconstruction unit, a compensation unit, and a second reconstruction unit. It should be understood that Figure 9b is only an example of the present application, and the decoding end of the present application may include more or fewer modules than those shown in Figure 9b, which will not be repeated here.
  • the virtual speaker coefficient generating unit, the virtual speaker signal generating unit and the first reconstruction unit in FIG. 9 b may refer to the description in FIG. 6 b , which will not be repeated here.
  • the core decoder can be used to decode the first bitstream to obtain the first reconstructed signal and the attribute information of the target virtual speaker; it can also be used to decode the second bitstream to obtain the feature information corresponding to the fifth audio signal in the scene audio signal.
  • the compensation module may be configured to compensate the seventh audio signal/the eighth audio signal based on the feature information corresponding to the fifth audio signal.
  • the second reconstruction module can be used to generate a second reconstructed scene audio signal based on the second reconstructed signal and the compensated seventh audio signal; or, to generate a second reconstructed scene audio signal based on the second reconstructed signal, the fourth reconstructed signal and the compensated eighth audio signal.
  • the scene audio decoding module in the above-mentioned Figures 1a and 1b may include the core decoder, virtual speaker coefficient generation unit, virtual speaker signal generation unit, first reconstruction unit, compensation unit and second reconstruction unit of Figure 9b; or, only include the core decoder.
  • the scene audio signal to be encoded is a third-order HOA signal, including 16 channels.
  • K = 9; then the audio signals of 9 channels in the scene audio signal and the attribute information of the 4 target virtual speakers can be encoded to obtain the first code stream; and the feature information corresponding to the audio signals of the other 7 channels of the scene audio signal can be encoded to obtain the second code stream.
  • the encoding end sends the first code stream and the second code stream to the decoding end.
  • the decoding end decodes the first code stream to obtain the attribute information of the 4 target virtual speakers and the audio signals of 9 channels in the scene audio signal; and decodes the second code stream to obtain the feature information corresponding to the audio signals of the other 7 channels in the scene audio signal.
  • 4 virtual speaker signals can be generated according to the attribute information of the 4 target virtual speakers and the audio signals of 9 channels in the scene audio signal.
  • the first reconstructed scene audio signal, i.e., the third-order HOA signal, is generated using the 4 virtual speaker signals and the attribute information of the 4 target virtual speakers.
  • the second reconstructed scene audio signal is a 3rd order HOA signal, including 16 channels.
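The channel bookkeeping of this example can be verified with a short sketch. The synthesis matrix below is a random stand-in for the second virtual speaker coefficients; only the shapes (16 HOA channels, 4 virtual speaker signals, 9 directly coded channels plus 7 feature-compensated channels) are taken from the example.

```python
import numpy as np

N = 3                      # HOA order in the example
C = (N + 1) ** 2           # 16 channels for a 3rd-order HOA signal
K = 9                      # channels encoded directly (first audio signal)
num_speakers = 4           # target virtual speakers
T = 480                    # samples per frame (illustrative value)

assert C == 16 and C - K == 7   # 7 channels carried as feature information

# Stand-in for the second virtual speaker coefficients and speaker signals:
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((C, num_speakers))
speaker_signals = rng.standard_normal((num_speakers, T))

# First reconstructed scene audio signal: a 3rd-order HOA signal.
first_reconstruction = coeffs @ speaker_signals
print(first_reconstruction.shape)   # (16, 480)
```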
  • FIG10 is a schematic diagram of the structure of an exemplary scene audio encoding device.
  • the scene audio encoding device in FIG10 can be used to execute the encoding method of the aforementioned embodiment. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects of the corresponding method provided above, which will not be repeated here.
  • the scene audio encoding device may include:
  • the signal acquisition module 1001 is used to acquire a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels, where C1 is a positive integer;
  • the attribute information acquisition module 1002 is used to determine the attribute information of the target virtual speaker based on the scene audio signal;
  • the encoding module 1003 is used to encode the first audio signal and the attribute information of the target virtual speaker in the scene audio signal to obtain a first bit stream; wherein the first audio signal is the audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
  • the first audio signal includes the second audio signal.
  • the first audio signal further includes a fourth audio signal; wherein the fourth audio signal is an audio signal of some channels in the third audio signal.
  • the attribute information of the target virtual speaker includes at least one of the following: position information of the target virtual speaker, a position index corresponding to the position information of the target virtual speaker, or a virtual speaker index of the target virtual speaker.
  • the attribute information acquisition module 1002 is specifically used to obtain multiple groups of virtual speaker coefficients corresponding to multiple candidate virtual speakers, and the multiple groups of virtual speaker coefficients correspond one-to-one to the multiple candidate virtual speakers; based on the scene audio signal and the multiple groups of virtual speaker coefficients, select a target virtual speaker from the multiple candidate virtual speakers; and obtain the attribute information of the target virtual speaker.
  • the attribute information acquisition module 1002 is specifically used to perform inner products on the scene audio signal and multiple groups of virtual speaker coefficients respectively to obtain multiple inner product values; the multiple inner product values correspond one-to-one to the multiple groups of virtual speaker coefficients; based on the multiple inner product values, a target virtual speaker is selected from multiple candidate virtual speakers.
  • the scene audio encoding device also includes: a feature information acquisition module, used to obtain feature information corresponding to the fifth audio signal in the scene audio signal; wherein the fifth audio signal is the third audio signal, or the fifth audio signal is an audio signal in the scene audio signal except the second audio signal and the fourth audio signal; the encoding module 1003 is also used to encode the feature information to obtain a second bit stream.
  • the feature information includes gain information.
  • FIG11 is a schematic diagram of the structure of an exemplary scene audio decoding device.
  • the scene audio decoding device in FIG11 can be used to execute the decoding method of the aforementioned embodiment. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects of the corresponding method provided above, which will not be repeated here.
  • the scene audio decoding device may include:
  • the code stream receiving module 1101 is used to receive a first code stream
  • a decoding module 1102 is used to decode the first bit stream to obtain a first reconstructed signal and attribute information of a target virtual speaker, wherein the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;
  • the scene audio signal reconstruction module 1104 is used to reconstruct based on the attribute information and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal includes audio signals of C2 channels, where C2 is a positive integer.
  • the scene audio decoding device also includes: a signal generating module 1105, which is used to generate a second reconstructed scene audio signal based on the first reconstructed signal and the first reconstructed scene audio signal, wherein the second reconstructed scene audio signal includes audio signals of C2 channels, where C2 is a positive integer.
  • the signal generating module 1105 is specifically configured to generate a second reconstructed scene audio signal based on the second reconstructed signal and the seventh audio signal when the first audio signal includes the second audio signal; wherein the second reconstructed signal is a reconstructed signal of the second audio signal.
  • the signal generating module 1105 is specifically used to generate a second reconstructed scene audio signal based on the second reconstruction signal, the fourth reconstruction signal and the eighth audio signal when the first audio signal includes the second audio signal and the fourth audio signal; wherein the fourth audio signal is a partial audio signal in the third audio signal, the fourth reconstruction signal is a reconstruction signal of the fourth audio signal, the second reconstruction signal is a reconstruction signal of the second audio signal, and the eighth audio signal is a partial audio signal in the seventh audio signal.
  • the virtual speaker signal generating module 1103 is specifically configured to determine a first virtual speaker coefficient corresponding to the target virtual speaker based on the property information of the target virtual speaker; and generate a virtual speaker signal based on the first reconstructed signal and the first virtual speaker coefficient.
  • the scene audio signal reconstruction module 1104 is specifically used to determine a second virtual speaker coefficient corresponding to the target virtual speaker based on the attribute information of the target virtual speaker; and obtain a first reconstructed scene audio signal based on the virtual speaker signal and the second virtual speaker coefficient.
  • the code stream receiving module 1101 is also used to receive a second code stream; the decoding module 1102 is also used to decode the second code stream to obtain feature information corresponding to a fifth audio signal in the scene audio signal, wherein the fifth audio signal is the third audio signal; the scene audio decoding device also includes: a compensation module, used to compensate the seventh audio signal based on the feature information.
  • the code stream receiving module 1101 is also used to receive a second code stream; the decoding module 1102 is also used to decode the second code stream to obtain feature information corresponding to a fifth audio signal in the scene audio signal, wherein the fifth audio signal is an audio signal in the scene audio signal except the second audio signal and the fourth audio signal; the scene audio decoding device also includes: a compensation module, which is used to compensate the eighth audio signal based on the feature information.
  • the feature information includes gain information.
  • FIG12 shows a schematic block diagram of a device 1200 according to an embodiment of the present application.
  • the device 1200 may include: a processor 1201 and a transceiver/transceiver pin 1202 , and optionally, a memory 1203 .
  • bus 1204 includes a power bus, a control bus, and a status signal bus in addition to a data bus.
  • various buses are referred to as bus 1204 in the figure.
  • the memory 1203 may be used to store instructions in the aforementioned method embodiment.
  • the processor 1201 may be used to execute instructions in the memory 1203, and control the receiving pin to receive a signal, and control the sending pin to send a signal.
  • the apparatus 1200 may be the electronic device or a chip of the electronic device in the above method embodiment.
  • This embodiment also provides a chip, which includes one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device executes the method in the above embodiment.
  • the interface circuit may refer to the transceiver 1202 in FIG. 12 .
  • This embodiment also provides a computer-readable storage medium, in which computer instructions are stored.
  • when the computer instructions are executed on an electronic device, the electronic device executes the above-mentioned related method steps to implement the scene audio encoding and decoding method in the above-mentioned embodiment.
  • This embodiment further provides a computer program product.
  • when the computer program product is run on a computer, the computer is enabled to execute the above-mentioned related steps to implement the scene audio encoding and decoding method in the above-mentioned embodiment.
  • This embodiment also provides a device for storing a code stream, which includes: a receiver and at least one storage medium, the receiver is used to receive the code stream; the at least one storage medium is used to store the code stream; the code stream is generated according to the scene audio encoding method in the above embodiment.
  • An embodiment of the present application provides a device for transmitting a code stream, the device comprising: a transmitter and at least one storage medium, the at least one storage medium is used to store the code stream, the code stream is generated according to the scene audio encoding method in the above embodiment; the transmitter is used to obtain the code stream from the storage medium and send the code stream to the terminal side device through the transmission medium.
  • An embodiment of the present application provides a system for distributing code streams, the system comprising: at least one storage medium for storing at least one code stream, the at least one code stream being generated according to the scene audio encoding method in the above embodiment, a streaming media device for obtaining a target code stream from the at least one storage medium and sending the target code stream to a terminal side device, wherein the streaming media device comprises a content server or a content distribution server.
  • an embodiment of the present application also provides a device, which can specifically be a chip, component or module, and the device may include a connected processor and memory; wherein the memory is used to store computer-executable instructions, and when the device is running, the processor can execute the computer-executable instructions stored in the memory so that the chip executes the scene audio encoding and decoding methods in the above-mentioned method embodiments.
  • the electronic device, computer-readable storage medium, computer program product or chip provided in this embodiment are all used to execute the corresponding methods provided above. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding methods provided above and will not be repeated here.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic, for example, the division of modules or units is only a logical function division, and there may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another device, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed in multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the present embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • the technical solution of the embodiment of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to enable a device (which can be a single-chip microcomputer, chip, etc.) or a processor (processor) to execute all or part of the steps of the methods of each embodiment of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), disk or optical disk and other media that can store program code.
  • the steps of the method or algorithm described in conjunction with the disclosed content of the embodiments of the present application can be implemented in hardware or by executing software instructions by a processor.
  • the software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), erasable programmable read-only memory (Erasable Programmable ROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), registers, hard disks, mobile hard disks, read-only compact disks (CD-ROMs) or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to a processor so that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be a component of the processor.
  • the processor and the storage medium can be located in an ASIC.
  • Computer-readable media include computer-readable storage media and communication media, wherein the communication media include any media that facilitates the transmission of a computer program from one place to another.
  • the storage medium can be any available medium that a general or special-purpose computer can access.


Abstract

Embodiments of the present application provide a scene audio encoding method and an electronic device. The encoding method includes: first, acquiring a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels; then, determining attribute information of a target virtual speaker based on the scene audio signal; and thereafter, encoding a first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream, where the first audio signal is the audio signals of K channels in the scene audio signal and K is a positive integer less than or equal to C1. Compared with the prior art, the present application achieves a lower encoding bitrate at the same quality. In addition, compared with the prior art, the present application does not need to compute virtual speaker signals and residual signals, so the encoding complexity at the encoding end is lower.

Description

Scene audio encoding method and electronic device

This application claims priority to Chinese patent application No. 202211537851.0, entitled "Scene audio encoding method and electronic device", filed with the China National Intellectual Property Administration on December 2, 2022, the entire contents of which are incorporated herein by reference.
Technical Field

Embodiments of the present application relate to the field of audio encoding and decoding, and in particular to a scene audio encoding method and an electronic device.
Background

Three-dimensional audio technology uses computers, signal processing and other means to acquire, process, transmit, render and play back real-world sound events and three-dimensional sound field information. Three-dimensional audio gives sound a strong sense of space, envelopment and immersion, providing an extraordinary "being there" listening experience. Among such technologies, HOA (Higher Order Ambisonics) is independent of the speaker layout during the recording, encoding and playback stages, and HOA-format data can be rotated during playback; it therefore offers higher flexibility for three-dimensional audio playback and has accordingly received wider attention and research.
For an N-order HOA signal, the corresponding number of channels is (N+1)². As the HOA order increases, the information in the HOA signal for recording more detailed sound scenes also increases; however, the data volume of the HOA signal grows accordingly, and the large amount of data causes difficulties in transmission and storage, so the HOA signal needs to be encoded and decoded. However, the prior art has low encoding performance for HOA signals.
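The channel-count growth described above, channels = (N+1)², can be checked with a quick computation:

```python
# Number of channels of an N-order HOA signal is (N + 1) ** 2.
for order in range(1, 7):
    channels = (order + 1) ** 2
    print(order, channels)   # 1 4, 2 9, 3 16, 4 25, 5 36, 6 49
```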
Summary

The present application provides a scene audio encoding method and an electronic device.

In a first aspect, an embodiment of the present application provides a scene audio encoding method, including: first, acquiring a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels and C1 is a positive integer; then, determining attribute information of a target virtual speaker based on the scene audio signal; and thereafter, encoding a first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream, where the first audio signal is the audio signals of K channels in the scene audio signal and K is a positive integer less than or equal to C1.
It should be noted that the position of the target virtual speaker matches the position of the sound source in the scene audio signal. Based on the attribute information of the target virtual speaker and the first audio signal in the scene audio signal, the virtual speaker signal corresponding to the target virtual speaker can be generated; based on the virtual speaker signal, the scene audio signal can be reconstructed. Therefore, the encoding end encodes the first audio signal in the scene audio signal and the attribute information of the target virtual speaker and sends them to the decoding end, and the decoding end can reconstruct the scene audio signal based on the first reconstructed signal obtained by decoding (i.e., the reconstructed signal of the first audio signal in the scene audio signal) and the attribute information of the target virtual speaker.

Compared with other methods of reconstructing scene audio signals in the prior art, the audio quality of a scene audio signal reconstructed based on virtual speaker signals is higher; therefore, when K equals C1, at the same bitrate, the scene audio signal reconstructed by the present application has higher audio quality.

When K is less than C1, compared with the prior art, the present application encodes audio signals of fewer channels, and the data volume of the attribute information of the target virtual speaker is far smaller than the data volume of the audio signal of one channel; therefore, at the same quality, the encoding bitrate of the present application is lower.

In addition, the prior art converts the scene audio signal into a virtual speaker signal and a residual signal before encoding, whereas the encoding end of the present application directly encodes the first audio signal in the scene audio signal without computing the virtual speaker signal and the residual signal, so the encoding complexity of the encoding end is lower.
Exemplarily, the scene audio signal in the embodiments of the present application may refer to a signal used to describe a sound field. The scene audio signal may include an HOA signal (where the HOA signal may include a three-dimensional HOA signal and a two-dimensional HOA signal, also called a planar HOA signal) and a three-dimensional audio signal; the three-dimensional audio signal may refer to an audio signal in the scene audio signal other than the HOA signal.

In one possible manner, when N1 equals 1, K may equal C1; when N1 is greater than 1, K may be less than C1. It should be understood that when N1 equals 1, K may also be less than C1.

Exemplarily, the process of encoding the first audio signal in the scene audio signal and the attribute information of the target virtual speaker may include operations such as downmixing, transformation, quantization and entropy coding, which is not limited in the present application.

Exemplarily, the first bitstream may include encoded data of the first audio signal in the scene audio signal and encoded data of the attribute information of the target virtual speaker.
一种可能的方式中,可以基于场景音频信号,从多个候选虚拟扬声器中,选取目标虚拟扬声器,再确定目标虚拟扬声器的属性信息。示例性的,虚拟扬声器(包括候选虚拟扬声器和目标虚拟扬声器)是虚拟的扬声器,不是真实存在的扬声器。
示例性的,多个候选虚拟扬声器可以均匀的分布在球面上,目标虚拟扬声器的数量可以为一个或多个。
一种可能的方式中,可以获取预先设定的目标虚拟扬声器,再确定目标虚拟扬声器的属性信息。
应该理解的是,本申请不限制确定目标虚拟扬声器的方式。
根据第一方面,场景音频信号为N1阶高阶立体混响HOA信号,N1阶HOA信号包括第二音频信号和第三音频信号,第二音频信号为N1阶HOA信号中第0阶至第M阶的HOA信号,第三音频信号为N1阶HOA信号中除第二音频信号之外的音频信号,M为小于N1的整数,C1等于(N1+1)的平方,N1为正整数;第一音频信号包括第二音频信号。
示例性的,第一音频信号包括第二音频信号,可以理解为第一音频信号仅包括第二音频信号。
示例性的,第一音频信号包括第二音频信号,可以理解为第一音频信号包括第二音频信号和其他音频信号。
根据第一方面,或者以上第一方面的任意一种实现方式,第一音频信号还包括第四音频信号;其中,第四音频信号为第三音频信号中部分通道的音频信号。
其中,第一音频信号可以包括偶数个通道的音频信号,则当第二音频信号的通道数为奇数时,第四音频信号的通道数也可以为奇数;这样,能够便于仅支持编码偶数个通道的音频信号的编码器编码。
示例性的,第二音频信号可以称为场景音频信号的低阶部分,第三音频信号可以称为场景音频信号的高阶部分。也就是说,可以编码场景音频信号的低阶部分与场景音频信号的高阶部分中的一部分;以保证第一音频信号包括偶数个通道的音频信号。
应该理解的是,第一音频信号也可以包括奇数个通道的音频信号,则当第二音频信号的通道数为偶数时,第四音频信号的通道数可以为奇数;这样,能够便于仅支持编码奇数个通道的音频信号的编码器编码。
应该理解的是,相对于第一音频信号包括第二音频信号和第四音频信号而言,第一音频信号仅包括第二音频信号时,编码的第一音频信号的通道数更少,对应的码率更低。
根据第一方面,或者以上第一方面的任意一种实现方式,目标虚拟扬声器的属性信息包括以下至少一种:目标虚拟扬声器的位置信息,目标虚拟扬声器的位置信息对应的位置索引,或,目标虚拟扬声器的虚拟扬声器索引。
示例性的,在球坐标系下,目标虚拟扬声器的位置信息可以表示为(θs3, φs3);其中,θs3为目标虚拟扬声器的水平角信息,φs3为目标虚拟扬声器的俯仰角信息。
示例性的,位置索引用于唯一标识一个虚拟扬声器的位置。其中,位置索引可以包括水平角索引(用于唯一标识一个水平角信息)和俯仰角索引(用于唯一标识一个俯仰角信息)。其中,虚拟扬声器的位置索引与虚拟扬声器的位置信息一一对应。
示例性的,虚拟扬声器索引可以用于唯一标识一个虚拟扬声器;其中,虚拟扬声器的位置信息/位置索引,与虚拟扬声器索引一一对应。
根据第一方面,或者以上第一方面的任意一种实现方式,基于场景音频信号,确定目标虚拟扬声器的属性信息,包括:获取多个候选虚拟扬声器对应的多组虚拟扬声器系数,多组虚拟扬声器系数与多个候选虚拟扬声器一一对应;基于场景音频信号和多组虚拟扬声器系数,从多个候选虚拟扬声器中选取目标虚拟扬声器;获取目标虚拟扬声器的属性信息。
其中,每个候选虚拟扬声器作为一个虚拟声源时,该虚拟声源产生的虚拟扬声器信号是平面波,可以将其在球坐标系下展开。对于振幅为s,方向为(θs, φs)的理想平面波,使用球谐函数展开后的形式可以如下述公式(3)所示。将公式(3)中的(θs, φs)设置为候选虚拟扬声器的位置信息(θi, φi),此时公式(3)中的各项系数即为一组虚拟扬声器系数(即HOA系数)。也就是说,虚拟扬声器系数也是HOA系数。需要说明的是,根据公式(3)可知,候选虚拟扬声器的位置与场景音频信号中声源的位置不同时,候选虚拟扬声器的虚拟扬声器系数与场景音频信号是不同的HOA系数。
这样,基于场景音频信号和多组虚拟扬声器系数,能够从多个候选虚拟扬声器中,准确的查找出位置与场景音频信号中声源位置匹配的目标虚拟扬声器。
根据第一方面,或者以上第一方面的任意一种实现方式,基于场景音频信号和多组虚拟扬声器系数,从多个候选虚拟扬声器中选取目标虚拟扬声器,包括:将场景音频信号与多组虚拟扬声器系数分别进行内积,以得到多个内积值;多个内积值与多组虚拟扬声器系数一一对应;基于多个内积值,从多个候选虚拟扬声器中选取目标虚拟扬声器。这样,通过内积,能够准确的确定各候选虚拟扬声器与场景音频信号的匹配度;进而能够选取出位置与场景音频信号中声源位置更加匹配的目标虚拟扬声器。
根据第一方面,或者以上第一方面的任意一种实现方式,该方法还包括:获取场景音频信号中第五音频信号所对应的特征信息;编码特征信息,以得到第二码流;其中,第五音频信号为第三音频信号,或者,第五音频信号为场景音频信号中除第二音频信号和第四音频信号之外的音频信号,第四音频信号为第三音频信号中部分通道的音频信号。其中,特征信息可以用于解码端解码过程中,对重建得到的场景音频信号中部分通道的音频信号进行补偿,来提高重建得到的场景音频信号中部分通道的音频信号的音频质量。
其中,特征信息的数据量较小,因此相对于现有技术而言,即使编码特征信息,本申请的总码率也更小,因此在同等码率的前提下,能够进一步提高重建的场景音频信号的音频质量。
示例性的,可以基于场景音频信号的能量、强度等信息,来确定场景音频信号中第五音频信号所对应的特征信息。
根据第一方面,或者以上第一方面的任意一种实现方式,特征信息包括增益信息。
示例性的,特征信息还可以包括扩散信息等,本申请对此不作限制。
第二方面,本申请实施例提供一种场景音频解码方法,该场景音频解码方法包括:首先,接收第一码流;以及解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息,第一重建信号是场景音频信号中第一音频信号的重建信号,场景音频信号包括C1个通道的音频信号,第一音频信号为场景音频信号中K个通道的音频信号,C1为正整数,K为小于或等于C1的正整数;接着,基于属性信息和第一重建信号,生成目标虚拟扬声器对应的虚拟扬声器信号;之后,基于属性信息和虚拟扬声器信号进行重建,以得到第一重建场景音频信号;第一重建场景音频信号包括C2个通道的音频信号,C2为正整数。
相对于现有技术中其他重建场景音频信号的方法而言,基于虚拟扬声器信号重建出的场景音频信号的音频质量更高;因此当K等于C1时,在同等码率下,本申请的重建出的场景音频信号的音频质量更高。
当K小于C1时,在对场景音频信号编码的过程中,本申请编码的音频信号的通道数,小于现有技术编码的音频信号的通道数,且目标虚拟扬声器的属性信息的数据量,远小于一个通道的音频信号的数据量;因此在同等码率的前提下,本申请解码得到重建场景音频信号的音频质量更高。
其次,由于现有技术编码传输的虚拟扬声器信号和残差信息是通过原始音频信号(即待编码的场景音频信号)转换而来的,并不是原始音频信号,会引入误差;而本申请编码了部分原始音频信号(即待编码的场景音频信号中的K个通道的音频信号),避免了误差的引入,进而能够提高解码得到重建场景音频信号的音频质量;且还能够避免解码得到重建场景音频信号的重建质量的波动,稳定性高。
此外,由于现有技术编码以及传输的是虚拟扬声器信号,而虚拟扬声器信号的数据量较大,因此现有技术选取的目标虚拟扬声器的数量受到带宽限制较大。本申请编码以及传输的是虚拟扬声器的属性信息,属性信息的数据量远小于虚拟扬声器信号的数据量;因此本申请选取的目标虚拟扬声器的数量受到带宽限制较小。而选取的目标虚拟扬声器的数量越多,基于目标虚拟扬声器的虚拟扬声器信号,重建出的场景音频信号的质量也就越高。因此,相对于现有技术而言,在同等码率的情况下,本申请可以选取数量更多的目标虚拟扬声器,这样,本申请解码得到重建场景音频信号的质量也就更高。
此外,综合编码端和解码端,相对于现有技术的编码端和解码端而言,本申请的编码端和解码端无需进行残差和叠加操作,因此本申请编码端和解码端的综合复杂度,低于现有技术编码端和解码端的综合复杂度。
应该理解的是,当编码端对场景音频信号中第一音频信号进行的是有损压缩时,解码端解码得到的第一重建信号和编码端编码的第一音频信号存在差异。当编码端对第一音频信号进行的是无损压缩时,解码端解码得到的第一重建信号和编码端编码的第一音频信号相同。
应该理解的是,当编码端对目标虚拟扬声器的属性信息进行的是有损压缩时,解码端解码得到的属性信息和编码端编码的属性信息存在差异。当编码端对虚拟扬声器的属性信息进行的是无损压缩时,解码端解码得到的属性信息和编码端编码的属性信息相同。(其中,本申请对编码端编码的属性信息和解码端解码得到的属性信息,未从名称上进行区分。)
根据第二方面,该方法还包括:基于第一重建信号和第一重建场景音频信号,生成第二重建场景音频信号,第二重建场景音频信号包括C2个通道的音频信号。相对于第一重建场景音频信号中通道与第一音频信号的通道对应的音频信号而言,解码出的第一重建信号,更接近编码的第一音频信号;这样,能够得到音频质量比第一重建场景音频信号更高的第二重建场景音频信号。
根据第二方面,或者以上第二方面的任意一种实现方式,
场景音频信号为N1阶高阶立体混响HOA信号,N1阶HOA信号包括第二音频信号和第三音频信号,第二音频信号为N1阶HOA信号中第0阶至第M阶的信号,第三音频信号为N1阶HOA信号中除第二音频信号之外的音频信号,M为小于N1的整数,C1等于(N1+1)的平方,N1为正整数;
第一重建场景音频信号为N2阶HOA信号,N2阶HOA信号包括第六音频信号和第七音频信号,第六音频信号为N2阶HOA信号中第0阶至第M阶的信号,第七音频信号为N2阶HOA信号中除第六音频信号之外的音频信号,M为小于N2的整数,C2等于(N2+1)的平方,N2为正整数;
基于第一重建信号和第一重建场景音频信号,生成第二重建场景音频信号,包括:当第一音频信号包括第二音频信号时,基于第二重建信号和第七音频信号,生成第二重建场景音频信号,第二重建信号为第二音频信号的重建信号。
相对于第一重建场景音频信号中通道与第一音频信号通道对应的音频信号而言,解码得到第一重建信号,更接近编码端所编码的第一音频信号;因此基于第一重建信号和第七音频信号,所得到的第二重建场景音频信号的音频质量更高。
根据第二方面,或者以上第二方面的任意一种实现方式,基于第一重建信号和第一重建场景音频信号,生成第二重建场景音频信号,包括:当第一音频信号包括第二音频信号和第四音频信号时,基于第二重建信号、第四重建信号和第八音频信号,生成第二重建场景音频信号;其中,第四音频信号为第三音频信号中的部分音频信号,第四重建信号为第四音频信号的重建信号,第二重建信号为第二音频信号的重建信号,第八音频信号为第七音频信号中的部分音频信号。
这样,相对于上述基于第二重建信号和第七音频信号所生成的第二重建场景音频信号而言,该种方式所得到的第二重建场景信号中第一重建信号的通道数更多,因此得到的第二重建场景音频信号更接近编码的场景音频信号,进而得到第二重建场景音频信号的音频质量更高。
根据第二方面,或者以上第二方面的任意一种实现方式,基于属性信息和第一重建信号,生成目标虚拟扬声器对应的虚拟扬声器信号,包括:基于属性信息,确定目标虚拟扬声器对应的第一虚拟扬声器系数;基于第一重建信号和第一虚拟扬声器系数,生成虚拟扬声器信号。这样,能够实现生成虚拟扬声器信号。
根据第二方面,或者以上第二方面的任意一种实现方式,基于属性信息和虚拟扬声器信号进行重建,以得到第一重建场景音频信号,包括:基于属性信息,确定目标虚拟扬声器对应的第二虚拟扬声器系数;基于虚拟扬声器信号和第二虚拟扬声器系数,以得到第一重建场景音频信号。这样,能够实现场景音频信号的重建。
根据第二方面,或者以上第二方面的任意一种实现方式,在基于第二重建信号和第七音频信号,生成第二重建场景音频信号之前,该方法还包括:接收第二码流;解码第二码流,以得到场景音频信号中第五音频信号所对应的特征信息;其中,第五音频信号为第三音频信号;基于特征信息,对第七音频信号进行补偿。这样,通过对重建得到的第一重建场景音频信号中第七音频信号进行补偿,能够提高重建得到的第一重建场景音频信号中第七音频信号的音频质量。
应该理解的是,当编码端对特征信息进行的是有损压缩时,解码端解码得到的特征信息,和编码端编码的特征信息存在差异。当编码端对特征信息进行的是无损压缩时,解码端解码得到的特征信息,和编码端编码的特征信息相同。(其中,本申请对编码端编码的特征信息和解码端解码得到的特征信息,未从名称上进行区分。)
根据第二方面,或者以上第二方面的任意一种实现方式,在基于第二重建信号、第四重建信号和第八音频信号,生成第二重建场景音频信号之前,该方法还包括:接收第二码流;解码第二码流,以得到场景音频信号中第五音频信号所对应的特征信息;其中,第五音频信号为场景音频信号中除第二音频信号和第四音频信号之外的音频信号;基于特征信息,对第八音频信号进行补偿。这样,通过对重建得到的第一重建场景音频信号中第八音频信号进行补偿,能够提高重建得到的第一重建场景音频信号中第八音频信号的音频质量。
应该理解的是,无论是否执行基于第一重建信号和第一重建场景音频信号,生成第二重建场景音频信号操作,在得到第一重建场景音频信号之后,均可以基于特征信息,对第一重建场景音频信号中第七音频信号/第八音频信号进行补偿,来提高第一重建场景音频信号。
根据第二方面,或者以上第二方面的任意一种实现方式,特征信息包括增益信息。
示例性的,第二重建场景音频信号可以是N2阶HOA信号,N2为正整数。示例性的,N2阶HOA信号可以包括C2个通道的音频信号,C2=(N2+1)²。
示例性的,第二重建场景音频信号的阶数N2,可以大于或等于场景音频信号的阶数N1;对应的,第二重建场景音频信号包括的音频信号的通道数C2,可以大于或等于场景音频信号包括的音频信号的通道数C1。
示例性的,当第二重建场景音频信号的阶数N2,等于场景音频信号的阶数N1时,解码端可以重建出阶数与编码端编码的场景音频信号的阶数相同的重建场景音频信号。
示例性的,当第二重建场景音频信号的阶数N2,大于场景音频信号的阶数N1时,解码端可以重建出阶数大于编码端编码的场景音频信号的阶数的重建场景音频信号。
第二方面以及第二方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第二方面以及第二方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第三方面,本申请实施例提供一种码流生成方法,该方法可以根据如第一方面及第一方面的任意一种实现方式生成码流。
第三方面以及第三方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第三方面以及第三方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第四方面,本申请实施例提供一种场景音频编码装置,该装置包括:
信号获取模块,用于获取待编码的场景音频信号,场景音频信号包括C1个通道的音频信号,C1为正整数;
属性信息获取模块,用于基于场景音频信号,确定目标虚拟扬声器的属性信息;
编码模块,用于编码场景音频信号中第一音频信号和目标虚拟扬声器的属性信息,以得到第一码流;其中,第一音频信号为场景音频信号中K个通道的音频信号,K为小于或等于C1的正整数。
第四方面的场景音频编码装置,可以执行第一方面以及第一方面的任意一种实现方式中的步骤,在此不再赘述。
第四方面以及第四方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第四方面以及第四方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第五方面,本申请实施例提供一种场景音频解码装置,该装置包括:码流接收模块,用于接收第一码流;
解码模块,用于解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息,第一重建信号是场景音频信号中第一音频信号的重建信号,场景音频信号包括C1个通道的音频信号,第一音频信号为场景音频信号中K个通道的音频信号,C1为正整数,K为小于或等于C1的正整数;
虚拟扬声器信号生成模块,用于基于属性信息和第一重建信号,生成目标虚拟扬声器对应的虚拟扬声器信号;
场景音频信号重建模块,用于基于属性信息和虚拟扬声器信号进行重建,以得到第一重建场景音频信号;第一重建场景音频信号包括C2个通道的音频信号,C2为正整数。
第五方面的场景音频解码装置,可以执行第二方面以及第二方面的任意一种实现方式中的步骤,在此不再赘述。
第五方面以及第五方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第五方面以及第五方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第六方面,本申请实施例提供一种电子设备,包括:存储器和处理器,存储器与处理器耦合;存储器存储有程序指令,当程序指令由处理器执行时,使得电子设备执行第一方面或第一方面的任意可能的实现方式中的场景音频编码方法。
第六方面以及第六方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第六方面以及第六方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第七方面,本申请实施例提供一种电子设备,包括:存储器和处理器,存储器与处理器耦合;存储器存储有程序指令,当程序指令由处理器执行时,使得电子设备执行第二方面或第二方面的任意可能的实现方式中的场景音频解码方法。
第七方面以及第七方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第七方面以及第七方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第八方面,本申请实施例提供一种芯片,包括一个或多个接口电路和一个或多个处理器;接口电路用于从电子设备的存储器接收信号,并向处理器发送信号,信号包括存储器中存储的计算机指令;当处理器执行计算机指令时,使得电子设备执行第一方面或第一方面的任意可能的实现方式中的场景音频编码方法。
第八方面以及第八方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第八方面以及第八方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第九方面,本申请实施例提供一种芯片,包括一个或多个接口电路和一个或多个处理器;接口电路用于从电子设备的存储器接收信号,并向处理器发送信号,信号包括存储器中存储的计算机指令;当处理器执行计算机指令时,使得电子设备执行第二方面或第二方面的任意可能的实现方式中的场景音频解码方法。
第九方面以及第九方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第九方面以及第九方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,当计算机程序运行在计算机或处理器上时,使得计算机或处理器执行第一方面或第一方面的任意可能的实现方式中的场景音频编码方法。
第十方面以及第十方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十方面以及第十方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十一方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,当计算机程序运行在计算机或处理器上时,使得计算机或处理器执行第二方面或第二方面的任意可能的实现方式中的场景音频解码方法。
第十一方面以及第十一方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第十一方面以及第十一方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十二方面,本申请实施例提供一种计算机程序产品,计算机程序产品包括软件程序,当软件程序被计算机或处理器执行时,使得计算机或处理器执行第一方面或第一方面的任意可能的实现方式中的场景音频编码方法。
第十二方面以及第十二方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十二方面以及第十二方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十三方面,本申请实施例提供一种计算机程序产品,计算机程序产品包括软件程序,当软件程序被计算机或处理器执行时,使得计算机或处理器执行第二方面或第二方面的任意可能的实现方式中的场景音频解码方法。
第十三方面以及第十三方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第十三方面以及第十三方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十四方面,本申请实施例提供一种存储码流的装置,该装置包括:接收器和至少一个存储介质,接收器用于接收码流;至少一个存储介质用于存储码流;码流是根据第一方面以及第一方面的任意一种实现方式生成的。
第十四方面以及第十四方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十四方面以及第十四方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十五方面,本申请实施例提供一种传输码流的装置,该装置包括:发送器和至少一个存储介质,至少一个存储介质用于存储码流,码流是根据第一方面以及第一方面的任意一种实现方式生成的;发送器用于从存储介质中获取码流并将码流通过传输介质发送给端侧设备。
第十五方面以及第十五方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十五方面以及第十五方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十六方面,本申请实施例提供一种分发码流的***,该***包括:至少一个存储介质,用于存储至少一个码流,至少一个码流是根据第一方面以及第一方面的任意一种实现方式生成的,流媒体设备,用于从至少一个存储介质中获取目标码流,并将目标码流发送给端侧设备,其中,流媒体设备包括内容服务器或内容分发服务器。
第十六方面以及第十六方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十六方面以及第十六方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
附图说明
图1a为示例性示出的应用场景示意图;
图1b为示例性示出的应用场景示意图;
图2a为示例性示出的编码过程示意图;
图2b为示例性示出的候选虚拟扬声器分布示意图;
图3为示例性示出的解码过程示意图;
图4为示例性示出的编码过程示意图;
图5为示例性示出的解码过程示意图;
图6a为示例性示出的编码端的结构示意图;
图6b为示例性示出的解码端的结构示意图;
图7为示例性示出的编码过程示意图;
图8为示例性示出的解码过程示意图;
图9a为示例性示出的编码端的结构示意图;
图9b为示例性示出的解码端的结构示意图;
图10为示例性示出的场景音频编码装置的结构示意图;
图11为示例性示出的场景音频解码装置的结构示意图;
图12为示例性示出的装置的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。
本申请实施例的说明书和权利要求书中的术语“第一”和“第二”等是用于区别不同的对象,而不是用于描述对象的特定顺序。例如,第一目标对象和第二目标对象等是用于区别不同的目标对象,而不是用于描述目标对象的特定顺序。
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
在本申请实施例的描述中,除非另有说明,“多个”的含义是指两个或两个以上。例如,多个处理单元是指两个或两个以上的处理单元;多个***是指两个或两个以上的***。
为了下述各实施例的描述清楚简洁,首先给出相关技术的简要介绍。
声音(sound)是由物体振动产生的一种连续的波。产生振动而发出声波的物体称为声源。声波通过介质(如:空气、固体或液体)传播的过程中,人或动物的听觉器官能感知到声音。
声波的特征包括音调、音强和音色。音调表示声音的高低。音强表示声音的大小。音强也可以称为响度或音量。音强的单位是分贝(decibel,dB)。音色又称为音品。
声波的频率决定了音调的高低。频率越高音调越高。物体在一秒钟之内振动的次数称为频率,频率单位是赫兹(hertz,Hz)。人耳能识别的声音的频率在20Hz~20000Hz之间。
声波的幅度决定了音强的强弱。幅度越大音强越大。距离声源越近,音强越大。
声波的波形决定了音色。声波的波形包括方波、锯齿波、正弦波和脉冲波等。
根据声波的特征,声音可以分为规则声音和无规则声音。无规则声音是指声源无规则地振动发出的声音。无规则声音例如是影响人们工作、学习和休息等的噪声。规则声音是指声源规则地振动发出的声音。 规则声音包括语音和乐音。声音用电表示时,规则声音是一种在时频域上连续变化的模拟信号。该模拟信号可以称为音频信号。音频信号是一种携带语音、音乐和音效的信息载体。
由于人的听觉具有辨别空间中声源的位置分布的能力,则听音者听到空间中的声音时,除了能感受到声音的音调、音强和音色外,还能感受到声音的方位。
随着人们对听觉***体验的关注和品质要求与日俱增,为了增强声音的纵深感、临场感和空间感,则三维音频技术应运而生。从而听音者不仅感受到来自前、后、左和右的声源发出的声音,而且感受到自己所处空间被这些声源产生的空间声场(简称“声场”(sound field))所包围的感觉,以及声音向四周扩散的感觉,营造出一种使听音者置身于影院或音乐厅等场所的“身临其境”的音响效果。
本申请实施例涉及的场景音频信号,可以是指用于描述声场的信号;其中,场景音频信号可以包括:HOA信号(其中,HOA信号可以包括三维HOA信号和二维HOA信号(也可以称为平面HOA信号))和三维音频信号;三维音频信号可以是指场景音频信号中除HOA信号之外的其他音频信号。以下以HOA信号为例进行说明。
众所周知,声波在理想介质中传播,波数为k=w/c,角频率为w=2πf,其中,f为声波频率,c为声速。声压p满足公式(1):
∇²p+k²p=0      (1)
其中,∇²为拉普拉斯算子。
假设人耳以外的空间系统是一个球形,听音者处于球的中心,从球外传来的声音在球面上有一个投影,过滤掉球面以外的声音,假设声源分布在这个球面上,用球面上的声源产生的声场来拟合原始声源产生的声场,即三维音频技术就是一个拟合声场的方法。具体地,在球坐标系下求解公式(1),在无源球形区域内,公式(1)的解为如下公式(2)。
其中,r表示球半径,θ表示水平角信息(或者称为方位角信息),φ表示俯仰角信息(或称为仰角信息),k表示波数,s表示理想平面波的幅度,m表示HOA信号的阶数序号。jm(kr)表示球贝塞尔函数,球贝塞尔函数又称为径向基函数,其中,第一个j表示虚数单位,球贝塞尔函数项不随角度变化。Y(θ,φ)表示θ,φ方向的球谐函数,Y(θs,φs)表示声源方向的球谐函数。HOA信号满足公式(3)。
将公式(3)代入公式(2),公式(2)可以变形为公式(4)。
其中,将m截断到第N项,即m=N,以作为对声场的近似描述;此时,球谐展开式中的各项系数可以称为HOA系数(可以用于表示N阶HOA信号)。声场是指介质中有声波存在的区域。N为大于或等于1的整数。
场景音频信号是一种携带声场中声源的空间位置信息的信息载体,描述了空间中听音者的声场。公式(4)表明声场可以在球面上按球谐函数展开,即声场可以分解为多个平面波的叠加。因此,可以将HOA信号描述的声场使用多个平面波的叠加来表达,并通过HOA系数重建声场。
本申请的实施例涉及的待编码的HOA信号可以是指N1阶HOA信号,可以采用HOA系数或Ambisonic(立体声混响)系数表示,N1为大于或等于1的整数(其中,当N1等于1时,1阶HOA信号可以称为FOA(First Order Ambisonic,一阶立体混响)信号)。其中,N1阶HOA信号包括(N1+1)²个通道的音频信号。
图1a为示例性示出的应用场景示意图。在图1a示出的是场景音频信号的编解码场景。
参照图1a,示例性的,第一电子设备可以包括第一音频采集模块、第一场景音频编码模块、第一信道编码模块、第一信道解码模块、第一场景音频解码模块和第一音频回放模块。应该理解的是,第一电子设备可以包括比图1a所示的更多或更少的模块,本申请对此不作限制。
参照图1a,示例性的,第二电子设备可以包括第二音频采集模块、第二场景音频编码模块、第二信道编码模块、第二信道解码模块、第二场景音频解码模块和第二音频回放模块。应该理解的是,第二电子设备可以包括比图1a所示的更多或更少的模块,本申请对此不作限制。
示例性的,第一电子设备编码并传输场景音频信号至第二电子设备,由第二电子设备解码以及音频回放的过程可以如下:第一音频采集模块可以进行音频采集,输出场景音频信号至第一场景音频编码模块。接着,第一场景音频编码模块可以对场景音频信号进行编码,输出码流至第一信道编码模块。之后,第一信道编码模块可以对码流进行信道编码,并将信道编码后的码流通过无线或有线网络通信设备传输到第二电子设备。然后,第二电子设备的第二信道解码模块可以对接收到的数据进行信道解码,以得到码流并将码流输出至第二场景音频解码模块。接着,第二场景音频解码模块可以对该码流进行解码,以得到重建场景音频信号;然后将该重建场景音频信号输出至第二音频回放模块,由第二音频回放模块进行音频回放。
需要说明的是,第二音频回放模块可以对重建场景音频信号进行后处理(如音频渲染(例如,可以将包含(N1+1)2个通道音频信号的重建场景音频信号,转换为与第二电子设备中扬声器数量相同通道数的音频信号)、响度归一化、用户交互、音频格式转换或去噪声等),以将重建场景音频信号转换为适应于第二电子设备中扬声器播放的音频信号。
应该理解的是,第二电子设备编码并传输场景音频信号至第一电子设备,由第一电子设备解码以及音频回放的过程,与上述第一电子设备传输场景音频信号至第二电子设备,由第二电子设备进行音频回放的过程类似,在此不再赘述。
示例性的,第一电子设备和第二电子设备均可以包括但不限于:个人计算机、计算机工作站、智能手机、平板电脑、服务器、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
示例性的,本申请具体可以应用于VR(Virtual Reality,虚拟现实)/AR(Augmented Reality,增强现实)场景。一种可能的方式中,第一电子设备为服务器,第二电子设备为VR/AR设备。一种可能的方式中,第二电子设备为服务器,第一电子设备为VR/AR设备。
示例性的,第一场景音频编码模块和第二场景音频编码模块,可以是场景音频编码器。第一场景音频解码模块和第二场景音频解码模块,可以是场景音频解码器。
示例性的,当由第一电子设备编码场景音频信号,第二电子设备重建场景音频信号时,第一电子设备可以称为编码端,第二电子设备可以称为解码端。当由第二电子设备编码场景音频信号,第一电子设备重建场景音频信号时,第二电子设备可以称为编码端,第一电子设备可以称为解码端。
图1b为示例性示出的应用场景示意图。在图1b示出的是场景音频信号的转码场景。
参照图1b(1),示例性的,无线或核心网设备可以包括:信道解码模块、其他音频解码模块、场景音频编码模块和信道编码模块。其中,无线或核心网设备可以用于音频转码。
示例性的,图1b(1)的具体应用场景可以是:在第一电子设备未设有场景音频编码模块,仅设有其他音频编码模块;而第二电子设备仅设有场景音频解码模块,未设有其他音频解码模块的情况下,为了实现第二电子设备能够解码并回放第一电子设备采用其他音频编码模块编码场景音频信号,可以使用无线或核心网设备进行转码。
具体的,第一电子设备采用其他音频编码模块对场景音频信号进行编码,得到第一码流;并将第一码流进行信道编码后发送给无线或核心网设备。接着,无线或核心网设备的信道解码模块可以进行信道解码,将信道解码出的第一码流输出至其他音频解码模块。之后,其他音频解码模块对第一码流进行解码,得到场景音频信号并将场景音频信号输出至场景音频编码模块。然后,场景音频编码模块可以对场景音频信号进行编码,以得到第二码流并将第二码流输出至信道编码模块,由信道编码模块对第二码流进行信道编码后,发送至第二电子设备。这样,第二电子设备可以调用场景音频解码模块,对信道解码得到第二码流进行解码,得到重建场景音频信号;后续即可对重建场景音频信号进行音频回放。
参照图1b(2),示例性的,无线或核心网设备可以包括:信道解码模块、场景音频解码模块、其他音频编码模块和信道编码模块。其中,无线或核心网设备可以用于音频转码。
示例性的,图1b(2)的具体应用场景可以是:在第一电子设备仅设有场景音频编码模块,未设有其他音频编码模块;而第二电子设备未设有场景音频解码模块,仅设有其他音频解码模块的情况下,为了实现第二电子设备能够解码并回放第一电子设备采用场景音频编码模块编码场景音频信号,可以使用无线或核心网设备进行转码。
具体的,第一电子设备采用场景音频编码模块对场景音频信号进行编码,得到第一码流;并将第一码流进行信道编码后发送给无线或核心网设备。接着,无线或核心网设备的信道解码模块可以进行信道解码,将信道解码出的第一码流输出至场景音频解码模块。之后,场景音频解码模块对第一码流进行解码,得到场景音频信号并将场景音频信号输出至其他音频编码模块。然后,其他音频编码模块可以对场景音频信号进行编码,以得到第二码流并将第二码流输出至信道编码模块,由信道编码模块对第二码流进行信道编码后,发送至第二电子设备。这样,第二电子设备可以调用其他音频解码模块,对信道解码得到第二码流进行解码,得到重建场景音频信号;后续即可对重建场景音频信号进行音频回放。
以下对场景音频信号的编解码过程进行说明。
图2a为示例性示出的编码过程示意图。
S201,获取待编码的场景音频信号,场景音频信号包括C1个通道的音频信号,C1为正整数。
示例性的,当场景音频信号为HOA信号时,该HOA信号可以为N1阶HOA信号,也就是将上述公式(3)中的m截断到第N1项,即m=N1。
示例性的,N1阶HOA信号可以包括C1个通道的音频信号,C1=(N1+1)²。例如,N1=3时,N1阶HOA信号包括16个通道的音频信号;N1=4时,N1阶HOA信号包括25个通道的音频信号。
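通道数C与HOA阶数N之间的关系C=(N+1)²,可以用如下示意性的Python代码表示(函数名为本示例自拟):

```python
def hoa_channel_count(order: int) -> int:
    """返回N阶HOA信号包含的音频通道数(N+1)²。"""
    if order < 0:
        raise ValueError("HOA阶数必须为非负整数")
    return (order + 1) ** 2

# 例如:3阶HOA信号包含16个通道,4阶HOA信号包含25个通道。
```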
S202,基于场景音频信号,确定目标虚拟扬声器的属性信息。
S203,编码场景音频信号中第一音频信号和目标虚拟扬声器的属性信息,以得到第一码流;其中,第一音频信号为场景音频信号中K个通道的音频信号,K为小于或等于C1的正整数。
示例性的,虚拟扬声器是虚拟的扬声器,不是真实存在的扬声器。
示例性的,基于上述可知,场景音频信号可以使用多个平面波的叠加来表达,进而可以确定用于来模拟场景音频信号中声源的目标虚拟扬声器;这样,后续在解码过程中,采用目标虚拟扬声器对应的虚拟扬声器信号,来重建该场景音频信号。
一种可能的方式中,可以在球面上设置位置不同的多个候选虚拟扬声器;接着,可以从这多个候选虚拟扬声器中,选取位置与场景音频信号中声源位置相匹配的目标虚拟扬声器。
图2b为示例性示出的候选虚拟扬声器分布示意图。在图2b中,多个候选虚拟扬声器可以均匀的分布在球面上,球面上一个点,代表一个候选虚拟扬声器。
需要说明的是,本申请对候选虚拟扬声器的数量以及分布不作限制,可以按照需求设置,具体在后续进行说明。
示例性的,可以基于场景音频信号,从这多个候选虚拟扬声器中,选取位置与场景音频信号中声源位置匹配的目标虚拟扬声器;其中,目标虚拟扬声器的数量可以是一个,也可以是多个,本申请对此不作限制。
一种可能的方式中,可以预先设定目标虚拟扬声器。
应该理解的是,本申请不限制确定目标虚拟扬声器的方式。
示例性的,一种可能的方式中,在解码过程中,可以根据虚拟扬声器信号来重建场景音频信号;但是直接传输目标虚拟扬声器的虚拟扬声器信号,会增加码率;而目标虚拟扬声器的虚拟扬声器信号可以基于目标虚拟扬声器的属性信息和部分或全部通道的场景音频信号来生成;因此可以获取目标虚拟扬声器的属性信息,以及获取场景音频信号中的K个通道的音频信号,作为第一音频信号;然后对第一音频信号和目标虚拟扬声器的属性信息进行编码,以得到第一码流。
示例性的,可以对第一音频信号和目标虚拟扬声器的属性信息进行下混、变换、量化以及熵编码等操作,以得到第一码流。也就是说,该第一码流中可以包括场景音频信号中第一音频信号的编码数据,以及目标虚拟扬声器的属性信息的编码数据。
相对于现有技术中其他重建场景音频信号的方法而言,基于虚拟扬声器信号重建出的场景音频信号的音频质量更高;因此当K等于C1时,在同等码率下,本申请重建出的场景音频信号的音频质量更高。
当K小于C1时,在对场景音频信号编码的过程中,本申请编码的音频信号的通道数,小于现有技术编码的音频信号的通道数,且目标虚拟扬声器的属性信息的数据量,也远小于一个通道的音频信号的数据量;因此在达到同等质量的前提下,本申请编码码率更低。
此外,现有技术是将场景音频信号转换为虚拟扬声器信号和残差信号后再编码,而本申请编码端直接编码场景音频信号中部分通道的音频信号,无需计算虚拟扬声器信号和残差信号,编码端的编码复杂度更低。
图3为示例性示出的解码过程示意图。图3为与图2的编码过程所对应的解码过程。
S301,接收第一码流。
S302,解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息。
示例性的,可以对第一码流包含的场景音频信号中第一音频信号的编码数据进行解码,可以得到第一重建信号;也就是说,第一重建信号是第一音频信号的重建信号。以及可以对第一码流包含的目标虚拟扬声器的属性信息的编码数据进行解码,可以得到目标虚拟扬声器的属性信息。
应该理解的是,当编码端对场景音频信号中第一音频信号进行的是有损压缩时,解码端解码得到的第一重建信号和编码端编码的第一音频信号存在差异。当编码端对第一音频信号进行的是无损压缩时,解码端解码得到的第一重建信号和编码端编码的第一音频信号相同。
应该理解的是,当编码端对目标虚拟扬声器的属性信息进行的是有损压缩时,解码端解码得到的属性信息和编码端编码的属性信息存在差异。当编码端对虚拟扬声器的属性信息进行的是无损压缩时,解码端解码得到的属性信息和编码端编码的属性信息相同。(其中,本申请对编码端编码的属性信息和解码端解码得到的属性信息,未从名称上进行区分。)
S303,基于属性信息和第一重建信号,生成目标虚拟扬声器对应的虚拟扬声器信号。
S304,基于属性信息和虚拟扬声器信号进行重建,以得到第一重建场景音频信号。
示例性的,基于上述描述可知,可以基于虚拟扬声器信号,来重建场景音频信号;进而可以先基于目标虚拟扬声器的属性信息和第一重建信号,生成目标虚拟扬声器对应虚拟扬声器信号。其中,一个目标虚拟扬声器对应一路虚拟扬声器信号,虚拟扬声器信号是平面波。接着,再基于目标虚拟扬声器的属性信息和虚拟扬声器信号进行重建,生成第一重建场景音频信号。
示例性的,当场景音频信号为HOA信号时,重建得到的第一重建场景音频信号也可以为HOA信号,该HOA信号可以是N2阶HOA信号,N2为正整数。示例性的,N2阶HOA信号可以包括C2个通道的音频信号,C2=(N2+1)²。
示例性的,第一重建场景音频信号的阶数N2,可以大于或等于图2a实施例中场景音频信号的阶数N1;对应的,第一重建场景音频信号包括的音频信号的通道数C2,可以大于或等于图2a实施例中场景音频信号包括的音频信号的通道数C1。
一种可能的方式中,可以直接将第一重建场景音频信号,作为最终的解码结果。
相对于现有技术中其他重建场景音频信号的方法而言,基于虚拟扬声器信号重建出的场景音频信号的音频质量更高;因此当K等于C1时,在同等码率下,本申请的重建出的场景音频信号的音频质量更高。
当K小于C1时,在对场景音频信号编码的过程中,本申请编码的音频信号的通道数,小于现有技术编码的音频信号的通道数,且目标虚拟扬声器的属性信息的数据量,远小于一个通道的音频信号的数据量;因此在同等码率的前提下,本申请解码得到重建场景音频信号的音频质量更高。
其次,由于现有技术编码传输的虚拟扬声器信号和残差信息是通过原始音频信号(即待编码的场景音频信号)转换而来的,并不是原始音频信号,会引入误差;而本申请编码了部分原始音频信号(即待编码的场景音频信号中K个通道的音频信号),避免了误差的引入,进而能够提高解码得到重建场景音频信号的音频质量;且还能够避免解码得到重建场景音频信号的重建质量的波动,稳定性高。
此外,由于现有技术编码以及传输的是虚拟扬声器信号,而虚拟扬声器信号的数据量较大,因此现有技术选取的目标虚拟扬声器的数量受到带宽限制较大。本申请编码以及传输的是虚拟扬声器的属性信息,属性信息的数据量远小于虚拟扬声器信号的数据量;因此本申请选取的目标虚拟扬声器的数量受到带宽限制较小。而选取的目标虚拟扬声器的数量越多,基于目标虚拟扬声器的虚拟扬声器信号,重建出的场景音频信号的质量也就越高。因此,相对于现有技术而言,在同等码率的情况下,本申请可以选取数量更多的目标虚拟扬声器,这样,本申请解码得到重建场景音频信号的质量也就更高。
此外,综合编码端和解码端,相对于现有技术的编码端和解码端而言,本申请的编码端和解码端无需进行残差和叠加操作,因此本申请编码端和解码端的综合复杂度,低于现有技术编码端和解码端的综合复杂度。
以下场景音频信号为N1阶HOA信号,第一重建场景音频信号为N2阶HOA信号,以N1和N2均大于1,K小于C1为例进行说明。
一种可能的方式中,可以基于第一重建场景音频信号和第一重建信号,生成第二重建场景音频信号;然后,将第二重建场景音频信号,作为最终的解码结果。其中,可以将第一重建场景音频信号中通道与第一音频信号的通道对应的音频信号,采用第一重建信号替换。相对于第一重建场景音频信号中通道与第一音频信号的通道对应的音频信号而言,解码得到的第一重建信号,更接近编码的第一音频信号,因此得到的第二重建场景音频信号比第一重建场景音频信号的音频质量更高。
为了便于后续描述生成第二重建场景音频信号的过程,先对场景音频信号(即N1阶HOA信号)和第一重建场景音频信号(即N2阶HOA信号)的组成成分进行说明。
示例性的,N1阶HOA信号可以包括第二音频信号和第三音频信号,第二音频信号为N1阶HOA信号截断到M阶时的HOA信号(或者说,第二音频信号为N1阶HOA信号中第0阶至第M阶的信号;其中,第二音频信号包括(M+1)²个通道的音频信号,M为小于N1的整数),第三音频信号为N1阶HOA信号中除第二音频信号之外的音频信号。
一种可能的方式中,第二音频信号可以称为N1阶HOA信号的低阶部分,第三音频信号可以称为N1阶HOA信号的高阶部分。
举个例子:假设,N1=3,则N1阶HOA信号可以包括16个通道的音频信号。
示例性的,参照上述公式(3)可知,在N1等于3(也就是上述公式(3)中的m等于3)的情况下,将公式(3)展开可以得到16个单项式;其中,每一单项式可以用于表示N1阶HOA信号中一个通道的音频信号。
其中,当公式(3)中n取值为0时,将公式(3)展开可以得到1个单项式,如下公式(5)所示;此时可以得到1个通道的音频信号。当公式(3)中n取值为1时,将公式(3)展开可以得到3个单项式,如下公式(6)所示;此时可以得到3个通道的音频信号。当公式(3)中n取值为2时,将公式(3)展开可以得到5个单项式,如下公式(7)所示;此时可以得到5个通道的音频信号。当公式(3)中n取值为3时,将公式(3)展开可以得到7个单项式,如下公式(8)所示;此时可以得到7个通道的音频信号。



其中,(θs, φs)为场景音频信号中声源的位置信息。
示例性的,若M=0,即公式(3)中m等于0,此时n的取值可以为0;对公式(3)展开可以得到1个单项式。这种情况下,第二音频信号可以包括1个通道的音频信号,如上述公式(5)所示;第三音频信号可以包括另外的15个通道的音频信号,如上述公式(6)~公式(8)所示。
示例性的,若M=1时,即公式(3)中m等于1,此时n的取值可以为0和1;对公式(3)展开可以得到4个单项式。这种情况下,第二音频信号可以包括4个通道的音频信号,如上述公式(5)和公式(6)所示;第三音频信号可以包括另外的12个通道的音频信号,如上述公式(7)和公式(8)所示。
示例性的,若M=2时,即公式(3)中m等于2,此时n的取值可以为0、1和2;对公式(3)展开可以得到9个单项式。这种情况下,第二音频信号可以包括9个通道的音频信号,如上述公式(5)~公式(7)所示;第三音频信号可以包括另外的7个通道的音频信号,如上述公式(8)所示。
示例性的,N2阶HOA信号可以包括第六音频信号和第七音频信号,第六音频信号为N2阶HOA信号截断到M阶时的HOA信号(或者说,第六音频信号为N2阶HOA信号中第0阶至第M阶的信号;其中,第六音频信号包括(M+1)²个通道的音频信号,M为小于N2的整数),第七音频信号为N2阶HOA信号中除第六音频信号之外的音频信号。
一种可能的方式中,第六音频信号可以称为N2阶HOA信号的低阶部分,第七音频信号可以称为N2阶HOA信号的高阶部分。
举个例子,假设,N2=3,则N2阶HOA信号可以包括16个通道的音频信号。
示例性的,参照上述公式(3)可知,在N2等于3(也就是上述公式(3)中的m等于3)的情况下,将公式(3)展开可以得到16个单项式;其中,每一单项式可以用于表示N2阶HOA信号中一个通道的音频信号。
其中,当公式(3)中n取值为0时,将公式(3)展开可以得到1个单项式,如下公式(9)所示;此时可以得到1个通道的音频信号。当公式(3)中n取值为1时,将公式(3)展开可以得到3个单项式,如下公式(10)所示;此时可以得到3个通道的音频信号。当公式(3)中n取值为2时,将公式(3)展开可以得到5个单项式,如下公式(11)所示;此时可以得到5个通道的音频信号。当公式(3)中n取值为3时,将公式(3)展开可以得到7个单项式,如下公式(12)所示;此时可以得到7个通道的音频信号。



其中,(θs, φs)为第一重建场景音频信号中声源的位置信息。
示例性的,若M=0,即公式(3)中m等于0,此时n的取值可以为0;对公式(3)展开可以得到1个单项式。这种情况下,第六音频信号可以包括1个通道的音频信号,如上述公式(9)所示;第七音频信号可以包括另外的15个通道的音频信号,如上述公式(10)~公式(12)所示。
示例性的,若M=1时,即公式(3)中m等于1,此时n的取值可以为0和1;对公式(3)展开可以得到4个单项式。这种情况下,第六音频信号可以包括4个通道的音频信号,如上述公式(9)和公式(10)所示;第七音频信号可以包括另外的12个通道的音频信号,如上述公式(11)和公式(12)所示。
示例性的,若M=2时,即公式(3)中m等于2,此时n的取值可以为0、1和2;对公式(3)展开可以得到9个单项式。这种情况下,第六音频信号可以包括9个通道的音频信号,如上述公式(9)~公式(11)所示;第七音频信号可以包括另外的7个通道的音频信号,如上述公式(12)所示。
以下对编码过程中选取目标虚拟扬声器的过程,以及解码过程中重建第二重建场景音频信号的过程进行说明。
图4为示例性示出的编码过程示意图。
S401,获取待编码的场景音频信号,场景音频信号包括C1个通道的音频信号,C1为正整数。
示例性的,S401可以参照上述S201的描述,在此不再赘述。
S402,获取多个候选虚拟扬声器对应的多组虚拟扬声器系数,多组虚拟扬声器系数与多个候选虚拟扬声器一一对应。
示例性的,可以获取编码模块(例如场景音频编码模块)的第一配置信息;然后根据编码模块的第一配置信息,确定候选虚拟扬声器的第二配置信息;接着,根据候选虚拟扬声器的第二配置信息,生成多个候选虚拟扬声器。
示例性的,第一配置信息包括且不限于:编码比特率,用户自定义信息(例如,编码模块对应的HOA阶数(是指编码模块可支持编码的HOA信号的阶数),重建场景音频信号的阶数(期望的解码端解码得到的重建HOA信号的阶数)、重建场景音频信号的格式(期望的解码端解码得到的重建HOA信号的格式)等等);本申请对此不作限制。
示例性的,第二配置信息包括但不限于:候选虚拟扬声器的总数量、各候选虚拟扬声器的HOA阶数、各候选虚拟扬声器的位置信息等信息;本申请对此不作限制。
示例性的,根据编码模块的第一配置信息,确定候选虚拟扬声器的第二配置信息的方式可以包括多种;例如,若编码比特率较低,则可以配置较少数量的候选虚拟扬声器;若编码比特率较高,则可以配置多个数量的候选虚拟扬声器。又如,可以将虚拟扬声器的HOA阶数,配置为编码模块的HOA阶数。不限定的是,本申请实施例中,除了可以根据编码模块的第一配置信息,确定候选虚拟扬声器的第二配置信息之外,还可以根据用户自定义信息(例如,用户可以自定义的候选虚拟扬声器的总数量、各候选虚拟扬声器的HOA阶数、各候选虚拟扬声器的位置信息等信息),确定候选虚拟扬声器的第二配置信息。
示例性的,可以预先设置一配置表,该配置表中包含候选虚拟扬声器的数量与候选虚拟扬声器的位置信息之间的关系。这样,在确定候选虚拟扬声器的总数量之后,可以通过查找该配置表,确定各候选虚拟扬声器的位置信息。
示例性的,在确定候选虚拟扬声器的第二配置信息后,可以基于候选虚拟扬声器的第二配置信息,生成多个候选虚拟扬声器。示例性的,可以根据候选虚拟扬声器的总数量,生成对应数量的候选虚拟扬声器,并且根据各候选虚拟扬声器的HOA阶数,设置各候选虚拟扬声器的HOA阶数;以及根据各候选虚拟扬声器的位置信息,设置各候选虚拟扬声器的位置。
示例性的,每个候选虚拟扬声器作为一个虚拟声源时,该虚拟声源产生的虚拟扬声器信号是平面波,可以将其在球坐标系下展开。对于振幅为s,方向为(θs, φs)的理想平面波,使用球谐函数展开后的形式可以如公式(3)所示。其中,候选虚拟扬声器的HOA阶数,也就是公式(3)中m的截断值。
接着,可以根据各候选虚拟扬声器的HOA阶数,确定各候选虚拟扬声器对应的虚拟扬声器系数(其中,每个候选虚拟扬声器对应一组虚拟扬声器系数)。示例性的,针对一个候选虚拟扬声器,可以参照公式(3),将公式(3)中的m的截断值设置为候选虚拟扬声器的HOA阶数,以及将公式(3)中的(θs, φs)设置为候选虚拟扬声器的位置信息(θi, φi);此时公式(3)中的各项系数即为一组虚拟扬声器系数(其中,虚拟扬声器系数也是HOA系数。需要说明的是,根据公式(3)可知,候选虚拟扬声器的位置与场景音频信号中声源的位置不同时,候选虚拟扬声器的虚拟扬声器系数与场景音频信号是不同的HOA系数)。这样,可以确定各个候选虚拟扬声器对应的一组虚拟扬声器系数。
其中,S402确定的候选虚拟扬声器对应的一组虚拟扬声器系数可以包括C1个虚拟扬声器系数,一个虚拟扬声器系数与场景音频信号的一个通道对应。
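以下为一段示意性的Python代码(函数名foa_speaker_coeffs为本示例自拟),演示在一阶(FOA,即C1=4)这一最简情形下,根据候选虚拟扬声器的位置信息(水平角、俯仰角)计算其一组虚拟扬声器系数的一种可能实现;其中采用了常见的一阶实球谐函数形式(SN3D归一化、ACN通道顺序:W、Y、Z、X),具体归一化方式与通道顺序依编解码器的约定而定,此处仅为假设:

```python
import math

def foa_speaker_coeffs(azimuth: float, elevation: float) -> list:
    """按一阶实球谐函数(假设SN3D归一化、ACN通道顺序)计算
    一个候选虚拟扬声器对应的一组4个虚拟扬声器系数。
    azimuth为水平角信息,elevation为俯仰角信息,单位均为弧度。"""
    w = 1.0                                             # 0阶:全向分量W
    y = math.sin(azimuth) * math.cos(elevation)         # 1阶:Y分量
    z = math.sin(elevation)                             # 1阶:Z分量
    x = math.cos(azimuth) * math.cos(elevation)         # 1阶:X分量
    return [w, y, z, x]
```

例如,位于正前方(水平角0、俯仰角0)的候选虚拟扬声器,其系数组为[1, 0, 0, 1]。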
一种可能的方式中,根据编码模块的第一配置信息,确定候选虚拟扬声器的第二配置信息(后续用“步骤A”代替);根据候选虚拟扬声器的第二配置信息,生成多个候选虚拟扬声器(后续用“步骤B”代替)以及确定各候选虚拟扬声器对应的虚拟扬声器系数(后续用“步骤C”代替);这三个步骤可以是预先执行的,即在获取待编码的场景音频信号之前执行。
一种可能的方式中,步骤A和步骤B是预先执行的,步骤C是在获取待编码的场景音频信号之后执行的。
一种可能的方式中,步骤A是预先执行的,步骤B和步骤C是在获取待编码的场景音频信号之后执行的。
一种可能的方式中,步骤A、步骤B和步骤C均是在获取待编码的场景音频信号之后执行的。
S403,基于场景音频信号和多组虚拟扬声器系数,从多个候选虚拟扬声器中选取目标虚拟扬声器。
示例性的,将场景音频信号与多组虚拟扬声器系数分别进行内积,以得到多个内积值;多个内积值与多组虚拟扬声器系数一一对应。示例性的,针对多个候选虚拟扬声器中的每一个候选虚拟扬声器,可以将该候选虚拟扬声器对应的一组虚拟扬声器系数与场景音频信号进行内积,可以得到对应的内积值。
接着,可以基于多个内积值,从多个候选虚拟扬声器中选取目标虚拟扬声器。一种可能的方式中,可以选取内积值最大的前G(G为正整数)个候选虚拟扬声器,作为目标虚拟扬声器。一种可能的方式中,可以先选取内积最大的候选虚拟扬声器,作为一个目标虚拟扬声器;接着,将场景音频信号投影叠加至内积最大的候选虚拟扬声器对应的一组虚拟扬声器系数的线性组合上,得到投影向量;然后,将投影向量从场景音频信号中减去,以得到差值。之后,对差值重复上述过程实现迭代计算,每迭代一次产生一个目标虚拟扬声器。
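上述基于内积从多个候选虚拟扬声器中选取目标虚拟扬声器的过程,可以用如下示意性的Python代码概括(函数名与内积的具体度量方式均为本示例的假设,此处以各采样时刻按系数加权求和后的能量作为内积值,选取内积值最大的前G个候选虚拟扬声器):

```python
def select_target_speakers(scene, coeff_sets, g):
    """scene为C×L的二维列表(C个通道、每通道L个采样点);
    coeff_sets为每个候选虚拟扬声器对应的一组C个虚拟扬声器系数;
    返回内积值最大的前g个候选虚拟扬声器的索引。"""
    scores = []
    for idx, coeffs in enumerate(coeff_sets):
        num_samples = len(scene[0])
        proj_energy = 0.0
        for t in range(num_samples):
            # 对每个采样时刻,将各通道按该候选扬声器的系数加权求和
            proj = sum(c * ch[t] for c, ch in zip(coeffs, scene))
            proj_energy += proj * proj
        scores.append((proj_energy, idx))
    scores.sort(reverse=True)                # 按内积值从大到小排序
    return [idx for _, idx in scores[:g]]
```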
一种可能的方式中,可以一帧场景音频信号为单位,确定每帧场景音频信号的场景音频信号与各候选虚拟扬声器对应的虚拟扬声器系数之间的内积值;这样,可以确定每帧场景音频信号对应的目标虚拟扬声器。
一种可能的方式中,可以将一帧场景音频信号拆分为多个子帧,然后以一个子帧为单位,确定每个子帧分别与各候选虚拟扬声器对应的虚拟扬声器系数之间的内积值;这样,可以确定每个子帧对应的目标虚拟扬声器。
S404,获取目标虚拟扬声器的属性信息。
一种可能的方式中,基于目标虚拟扬声器的位置信息,生成目标虚拟扬声器的属性信息。其中,一种可能的方式中,可以将目标虚拟扬声器的位置信息(包括俯仰角信息和水平角信息),作为目标虚拟扬声器的属性信息。一种可能的方式中,将目标虚拟扬声器的位置信息对应的位置索引(包括俯仰角索引(可以用于唯一标识俯仰角信息)和水平角索引(可以用于唯一标识水平角信息)),作为目标虚拟扬声器的属性信息。
一种可能的方式中,可以将目标虚拟扬声器的虚拟扬声器索引(例如,虚拟扬声器标识),作为目标虚拟扬声器的属性信息。其中,虚拟扬声器索引与位置信息一一对应。
一种可能的方式中,可以将目标虚拟扬声器的虚拟扬声器系数,作为目标虚拟扬声器的属性信息。示例性的,可以确定目标虚拟扬声器的C2个虚拟扬声器系数,将目标虚拟扬声器的C2个虚拟扬声器系数,作为目标虚拟扬声器的属性信息;其中,目标虚拟扬声器的C2个虚拟扬声器系数与第一重建场景音频信号包括的C2个通道的音频信号一一对应。
需要说明的是,虚拟扬声器系数的数据量,远大于位置信息、位置信息的索引和虚拟扬声器索引的数据量;可以根据带宽,决策采用位置信息、位置信息的索引、虚拟扬声器索引和虚拟扬声器系数中的哪种信息,作为目标虚拟扬声器的属性信息。例如,当带宽较大时,可以将虚拟扬声器系数,作为目标虚拟扬声器的属性信息;这样,无需解码端计算目标虚拟扬声器的虚拟扬声器系数,可以节省解码端的算力。当带宽较小时,可以将位置信息、位置信息的索引、虚拟扬声器索引中的任一种,作为目标虚拟扬声器的属性信息;这样,可以节省码率。应该理解的是,也可以预先设置采用位置信息、位置信息的索引、虚拟扬声器索引和虚拟扬声器系数中的哪种信息,作为目标虚拟扬声器的属性信息;本申请对此不作限制。
S405,编码场景音频信号中第一音频信号和目标虚拟扬声器的属性信息,以得到第一码流。
一种可能方式中,第一音频信号为第二音频信号;也就是说,第一音频信号是场景音频信号中的低阶部分。假设N1=3,当M=0时,第一音频信号包括1个通道的音频信号;例如,第一音频信号为上述公式(5)表示的1个通道的音频信号。当M=1时,第一音频信号包括4个通道的音频信号;例如,第一音频信号包括上述公式(5)和公式(6)表示的4个通道的音频信号。当M=2时,第一音频信号包括9个通道的音频信号;例如,第一音频信号包括上述公式(5)、公式(6)和公式(7)表示的9个通道的音频信号。
示例性的,第二音频信号包括的通道数可能为奇数,也可能为偶数。例如,基于上述示例,假设N1=3,当M=0和M=2时,第二音频信号包括的通道数为奇数;当M=1时,第二音频信号包括的通道数为偶数。由于部分编码器仅支持编码偶数个通道的音频信号,进而,一种可能方式中,第一音频信号可以包括第二音频信号和第四音频信号,其中,第四音频信号为第三音频信号中部分通道的音频信号。示例性的,当第二音频信号包括奇数个通道时,可以从第三音频信号中选取奇数个通道的音频信号,作为第四音频信号;即第四音频信号可以包括奇数个通道的音频信号。例如,当M=0时,第一音频信号可以包括上述公式(5)表示的1个通道的音频信号和上述公式(6)的第一项表示的1个通道的音频信号,此时,第一音频信号包括2个通道的音频信号。例如,当M=2时,第一音频信号可以包括上述公式(5)~公式(7)表示的9个通道的音频信号,以及上述公式(8)的第一项表示的1个通道的音频信号,此时,第一音频信号包括10个通道的音频信号。
当第二音频信号包括偶数个通道时,可以从第三音频信号中选取偶数个通道的音频信号,作为第四音频信号。例如,当M=1时,第一音频信号可以包括上述公式(5)和上述公式(6),以及上述公式(7)前两项,此时,第一音频信号包括6个通道的音频信号。
应该理解的是,当第二音频信号包括偶数个通道时,也可以不从第三音频信号中选取部分通道的音频信号,而是直接将第二音频信号作为第一音频信号。
应该理解的是,第一音频信号所包括的音频信号的通道数,可以按照需求以及带宽确定,本申请对此不作限制。
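上述先取低阶部分(第二音频信号)、再视需要从高阶部分补选通道(第四音频信号)以凑成偶数个通道的选取逻辑,可以用如下示意性的Python代码概括(假设采用ACN通道顺序,且补选策略为取高阶部分的首个通道,均仅为示例):

```python
def pick_first_audio_channels(n1: int, m: int, force_even: bool = True) -> list:
    """返回第一音频信号所含通道的索引列表。
    先取第0阶至第M阶的(M+1)²个低阶通道(第二音频信号);
    若要求偶数个通道且当前为奇数,则再补选高阶部分的一个通道(第四音频信号)。"""
    low = list(range((m + 1) ** 2))          # 低阶部分:第二音频信号
    total = (n1 + 1) ** 2                    # 场景音频信号的总通道数C1
    if force_even and len(low) % 2 == 1 and len(low) < total:
        low.append(len(low))                 # 从高阶部分补选一个通道
    return low
```

例如,N1=3、M=1时选出4个通道,无需补选;M=0或M=2时通道数为奇数,补选后分别得到2个和10个通道。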
图5为示例性示出的解码过程示意图。图5为与图4编码过程中对应的解码过程。
S501,接收第一码流。
S502,解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息。
示例性的,S501~S502,可以参照S301~S302的描述,在此不再赘述。
示例性的,上述S303可以参照S503~S504的描述:
S503,基于属性信息,确定目标虚拟扬声器对应的第一虚拟扬声器系数。
示例性的,编码端可以将M写入第一码流中;进而可以从第一码流中解码出M(当然,编码端和解码端也可以预先约定M,本申请对此不作限制)。示例性的,当目标虚拟扬声器的属性信息为位置信息时,可以将目标虚拟扬声器的位置信息代入上述公式(3),并令公式(3)中m等于M,即可得到目标虚拟扬声器对应的第一虚拟扬声器系数。其中,第一虚拟扬声器系数包括(M+1)²个虚拟扬声器系数,这(M+1)²个虚拟扬声器系数,对应第二重建信号的(M+1)²个通道;其中,第二重建信号为第二音频信号的重建信号。
示例性的,当目标虚拟扬声器的属性信息为位置信息的位置索引时,可以根据位置信息与位置索引之间的关系,确定目标虚拟扬声器的位置信息;然后按照上述方式,确定第一虚拟扬声器系数,在此不再赘述。
示例性的,当目标虚拟扬声器的属性信息为虚拟扬声器索引时,可以根据位置信息与虚拟扬声器索引之间的关系,确定目标虚拟扬声器的位置信息;然后按照上述方式,确定第一虚拟扬声器系数,在此不再赘述。
示例性的,当目标虚拟扬声器的属性信息为虚拟扬声器系数时,基于上述描述可知,目标虚拟扬声器对应的一组虚拟扬声器系数包括C2个虚拟扬声器系数;此时,可以选取与第二重建信号包括的(M+1)²个通道对应的(M+1)²个虚拟扬声器系数,作为第一虚拟扬声器系数。
S504,基于第一重建信号和第一虚拟扬声器系数,生成虚拟扬声器信号。
示例性的,可以基于第一重建信号中的第二重建信号和第一虚拟扬声器系数,生成虚拟扬声器信号。
示例性的,假设采用尺寸为(Y1×P)的矩阵A,表示目标虚拟扬声器的第一虚拟扬声器系数,其中,Y1(Y1为正整数)为目标虚拟扬声器的数量,P为第二重建信号包含的音频信号的通道数(M+1)²。以及采用尺寸为(L×P)的矩阵X,表示第二重建信号;其中,L为第二重建信号的采样点数。采用最小二乘方法求得理论的最优解w,w表示虚拟扬声器信号,如公式(13)所示。
w=A⁻¹X      (13)
其中,矩阵A⁻¹为矩阵A的逆矩阵。
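公式(13)在单个目标虚拟扬声器(Y1=1)这一简化情形下,最小二乘解可退化为逐采样点的投影运算:w[t]=⟨coeffs, recon[:,t]⟩/⟨coeffs, coeffs⟩。以下Python代码为该简化情形的示意实现(函数名为本示例自拟):

```python
def speaker_signal_single(coeffs, recon):
    """单个目标虚拟扬声器情形下虚拟扬声器信号的最小二乘示意解。
    coeffs: 该扬声器的P个第一虚拟扬声器系数;
    recon: P×L的第二重建信号(每通道一行);返回L个采样点的虚拟扬声器信号。"""
    denom = sum(c * c for c in coeffs)       # 系数向量的自内积
    num_samples = len(recon[0])
    return [sum(c * ch[t] for c, ch in zip(coeffs, recon)) / denom
            for t in range(num_samples)]
```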
示例性的,上述S304可以参照如下S505~S506:
S505,基于目标虚拟扬声器的属性信息,确定目标虚拟扬声器对应的第二虚拟扬声器系数。
示例性的,可以根据期望的重建场景音频信号的阶数N2(也就是第一重建场景音频信号或第二重建场景音频信号的阶数N2),确定上述公式(3)中m等于N2。接着,当目标虚拟扬声器的属性信息为位置信息时,可以将目标虚拟扬声器的位置信息代入上述公式(3),并令公式(3)中m等于N2,即可得到第二虚拟扬声器系数。其中,第二虚拟扬声器系数包括C2个虚拟扬声器系数,这C2个虚拟扬声器系数,对应第一重建场景音频信号的C2个通道。
示例性的,当目标虚拟扬声器的属性信息为位置信息的位置索引时,可以根据位置信息与位置索引之间的关系,确定目标虚拟扬声器的位置信息;然后按照上述方式,确定第二虚拟扬声器系数,在此不再赘述。
示例性的,当目标虚拟扬声器的属性信息为虚拟扬声器索引时,可以根据位置信息与虚拟扬声器索引之间的关系,确定目标虚拟扬声器的位置信息;然后按照上述方式,确定第二虚拟扬声器系数,在此不再赘述。
示例性的,当目标虚拟扬声器的属性信息为虚拟扬声器系数时,可以直接将目标虚拟扬声器的属性信息,作为第二虚拟扬声器系数。
S506,基于虚拟扬声器信号和第二虚拟扬声器系数,以得到第一重建场景音频信号。
示例性的,假设采用尺寸为(Y1×C2)的矩阵A表示第二虚拟扬声器系数,其中,Y1为目标虚拟扬声器的数量,C2为第一重建场景音频信号的通道数。以及采用尺寸为(L×Y1)的矩阵B表示虚拟扬声器信号;其中,L为第一重建场景音频信号的采样点数。则第一重建场景音频信号可以采用H表示,如公式(14)所示。
H=BA      (14)
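公式(14)的矩阵乘法H=BA,可以用如下示意性的Python代码表示(为保持自包含,此处用纯Python实现矩阵乘法,函数名为本示例自拟):

```python
def reconstruct_scene(speaker_signals, coeff_matrix):
    """公式(14)的示意实现:H = B·A。
    speaker_signals: L×Y1的虚拟扬声器信号矩阵B(每个采样时刻一行);
    coeff_matrix: Y1×C2的第二虚拟扬声器系数矩阵A;
    返回L×C2的第一重建场景音频信号H。"""
    num_out = len(coeff_matrix[0])
    result = []
    for row in speaker_signals:              # 逐采样时刻计算各输出通道
        out = [sum(row[y] * coeff_matrix[y][c] for y in range(len(row)))
               for c in range(num_out)]
        result.append(out)
    return result
```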
S507,基于第一重建信号和第一重建场景音频信号,生成第二重建场景音频信号。
示例性的,相对于第一重建场景音频信号中通道与第一音频信号的通道对应的音频信号而言,解码得到的第一重建信号,更接近编码端所编码的第一音频信号;进而基于第一重建场景音频信号和第一重建信号,生成第二重建场景音频信号;然后,将第二重建场景音频信号,作为最终的解码结果;能够得到音频质量更高的重建场景音频信号。
一种可能的方式中,当第一音频信号包括第二音频信号时(也就是,第一音频信号为第二音频信号,或者,第一音频信号包括第二音频信号和第四音频信号时),第一重建信号为第二重建信号;此时,可以基于第二重建信号和第七音频信号,生成第二重建场景音频信号。示例性的,可以按照通道拼接第二重建信号和第七音频信号,来生成第二重建场景音频信号。
例如,假设第二音频信号为上述公式(5)表示的1个通道的信号,且第一音频信号为第二音频信号,第七音频信号为上述公式(10)~公式(12)表示的15个通道的信号;则得到的第二重建场景音频信号可以包括:公式(5)表示的1个通道的音频信号的重建信号和公式(10)~公式(12)表示的15个通道的信号。
例如,假设第二音频信号包括上述公式(5)表示的1个通道的信号,第四音频信号为上述公式(6)中第一项表示的1个通道的信号,第一音频信号包括第二音频信号和第四音频信号,第七音频信号为上述公式(10)~公式(12)表示的15个通道的信号;则得到的第二重建场景音频信号可以包括:公式(5)表示的1个通道的音频信号的重建信号,以及公式(10)~公式(12)表示的15个通道的信号。
一种可能的方式中,当第一音频信号包括第二音频信号和第四音频信号时,第一重建信号可以包括第二重建信号和第四重建信号(第四重建信号为第四音频信号的重建信号);此时,可以基于第二重建信号、第四重建信号和第八音频信号,生成第二重建场景音频信号。其中,第八音频信号为第七音频信号中部分通道的音频信号,且第八音频信号为第七音频信号中除与第四音频信号对应通道之外的其他通道的音频信号。示例性的,可以按照通道拼接第二重建信号、第四重建信号和第八音频信号,来生成第二重建场景音频信号。
例如,假设第二音频信号包括上述公式(5)表示的1个通道的信号,第四音频信号为上述公式(6)中第一项表示的1个通道的信号,第一音频信号包括第二音频信号和第四音频信号;则第八音频信号为上述公式(10)中的后两项表示的2个通道的信号,以及公式(11)~公式(12)表示的12个通道的信号。则得到的第二重建场景音频信号可以包括:公式(5)表示的1个通道的音频信号的重建信号和公式(6)中第一项表示的1个通道的音频信号的重建信号,公式(10)中的后两项表示的2个通道的信号,以及公式(11)~公式(12)表示的12个通道的信号。
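上述按通道拼接、用解码得到的第一重建信号替换第一重建场景音频信号中对应通道以得到第二重建场景音频信号的过程,可以用如下示意性的Python代码概括(通道索引的对应关系为本示例的假设):

```python
def splice_second_reconstruction(first_recon, decoded, decoded_channels):
    """按通道拼接生成第二重建场景音频信号的示意实现。
    first_recon: C2×L的第一重建场景音频信号;
    decoded: K×L的第一重建信号(解码得到);
    decoded_channels: 这K个通道在C2个通道中的索引。"""
    second = [list(ch) for ch in first_recon]    # 复制,避免改写输入
    for sig, idx in zip(decoded, decoded_channels):
        second[idx] = list(sig)                  # 用解码通道替换对应通道
    return second
```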
示例性的,第二重建场景音频信号可以是N2阶HOA信号,N2为正整数。示例性的,第二重建场景音频信号可以包括C2个通道的音频信号,C2=(N2+1)2
示例性的,第二重建场景音频信号的阶数N2,可以大于或等于场景音频信号的阶数N1;对应的,第二重建场景音频信号包括的音频信号的通道数C2,可以大于或等于场景音频信号包括的音频信号的通道数C1。
示例性的,当第二重建场景音频信号的阶数N2,等于场景音频信号的阶数N1时,解码端可以重建出阶数与编码端编码的场景音频信号的阶数相同的重建场景音频信号。
示例性的,当第二重建场景音频信号的阶数N2,大于场景音频信号的阶数N1时,解码端可以重建出阶数大于编码端编码的场景音频信号的阶数的重建场景音频信号。
图6a为示例性示出的编码端的结构示意图。
参照图6a,示例性的,编码端可以包括配置单元、虚拟扬声器生成单元、目标扬声器生成单元、核心编码器。应该理解的是,图6a仅是本申请的一个示例,本申请的编码端可以包括比图6a示出的更多或更少的模块,在此不再赘述。
示例性的,配置单元,可以用于根据编码模块的第一配置信息,确定候选虚拟扬声器的第二配置信息。
示例性的,虚拟扬声器生成单元,可以用于根据候选虚拟扬声器的第二配置信息,生成多个候选虚拟扬声器以及确定各候选虚拟扬声器对应的虚拟扬声器系数。
示例性的,目标扬声器生成单元,可以用于根据基于场景音频信号和多组虚拟扬声器系数,从多个候选虚拟扬声器中选取目标虚拟扬声器以及确定目标虚拟扬声器的属性信息。
示例性的,核心编码器,可以用于对场景音频信号中第一音频信号和目标虚拟扬声器的属性信息进行编码。
示例性的,上述图1a和图1b中的场景音频编码模块可以包括图6a的配置单元、虚拟扬声器生成单元、目标扬声器生成单元、核心编码器;或者,仅包括核心编码器。
图6b为示例性示出的解码端的结构示意图。
参照图6b,示例性的,解码端可以包括核心解码器、虚拟扬声器系数生成单元、虚拟扬声器信号生成单元、第一重建单元和第二重建单元。应该理解的是,图6b仅是本申请的一个示例,本申请的解码端可以包括比图6b示出的更多或更少的模块,在此不再赘述。
示例性的,核心解码器,可以用于解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息。
示例性的,虚拟扬声器系数生成单元,可以用于基于目标虚拟扬声器的属性信息,确定第一虚拟扬声器系数和第二虚拟扬声器系数。
示例性的,虚拟扬声器信号生成单元,可以用于基于第一重建信号和第一虚拟扬声器系数,生成虚拟扬声器信号。
示例性的,第一重建单元,可以用于基于虚拟扬声器信号和第二虚拟扬声器系数,以得到第一重建场景音频信号。
示例性的,第二重建单元,可以用于基于第一重建信号和第一重建场景音频信号,生成第二重建场景音频信号。
示例性的,上述图1a和图1b中的场景音频解码模块可以包括图6b的核心解码器、虚拟扬声器系数生成单元、虚拟扬声器信号生成单元、第一重建单元和第二重建单元;或者,仅包括核心解码器。
一种可能的方式中,在编码过程中,还可以提取场景音频信号中第五音频信号(第五音频信号为第三音频信号,或者,第五音频信号为场景音频信号中除第二音频信号和第四音频信号之外的音频信号)所对应的特征信息,并编码后发送给解码端;解码端在接收到码流后,可以基于该特征信息对第一重建场景音频信号中第七音频信号/第八音频信号进行补偿,可以提高第一重建场景音频信号/第二重建场景音频信号中第七音频信号/第八音频信号的音频质量。
图7为示例性的编码过程示意图。
S701,获取待编码的场景音频信号,场景音频信号包括C1个通道的音频信号,C1为正整数。
S702,获取多个候选虚拟扬声器对应的多组虚拟扬声器系数,多组虚拟扬声器系数与多个候选虚拟扬声器一一对应。
S703,基于场景音频信号和多组虚拟扬声器系数,从多个候选虚拟扬声器中选取目标虚拟扬声器。
S704,获取目标虚拟扬声器的属性信息。
S705,编码场景音频信号中第一音频信号和目标虚拟扬声器的属性信息,以得到第一码流。
示例性的,S701~S705,可以参照上述S401~S405的描述,在此不再赘述。
S706,获取场景音频信号中第五音频信号所对应的特征信息。
一种可能的方式中,当第一音频信号为第二音频信号,或者第一音频信号包括第二音频信号和第四音频信号时,第五音频信号为第三音频信号。
例如,假设N1=3,M=0。若第一音频信号为第二音频信号,第二音频信号为上述公式(5)表示的1个通道的音频信号,则第五音频信号可以为上述公式(6)~公式(8)表示的15个通道的音频信号。若第一音频信号包括第二音频信号和第四音频信号,第二音频信号为上述公式(5)表示的1个通道的音频信号,第四音频信号为上述公式(6)中第一项表示的1个通道的音频信号,则第五音频信号可以为上述公式(6)~公式(8)表示的15个通道的音频信号。
一种可能方式中,当第一音频信号包括第二音频信号和第四音频信号时,第五音频信号可以为场景音频信号中除第二音频信号和第四音频信号之外的音频信号。
例如,假设N1=3,M=0。若第一音频信号包括第二音频信号和第四音频信号,第二音频信号为上述公式(5)表示的1个通道的音频信号,第四音频信号为上述公式(6)中第一项表示的1个通道的音频信号,则第五音频信号可以包括上述公式(6)中后2项表示的2个通道的音频信号,以及公式(7)~公式(8)表示的12个通道的音频信号。
示例性的,可以对场景音频信号进行分析,确定场景音频信号的强度和能量等信息;然后基于场景音频信号的强度及能量等信息,提取出场景音频信号中第五音频信号所对应的特征信息。
其中,场景音频信号所对应的特征信息包括但不限于:增益信息和扩散信息。
示例性的,可以参照如下公式(15),计算场景音频信号中第五音频信号对应的增益信息Gain(i):
Gain(i)=E(i)/E(1)      (15)
其中,i为场景音频信号中第五音频信号包含的通道的通道号,E(i)为第i个通道的能量,E(1)为场景音频信号中C1个通道的音频信号的能量。
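公式(15)的增益计算可以用如下示意性的Python代码表示(变量命名为本示例自拟,其中total_energy对应文中的E(1),即场景音频信号中C1个通道的音频信号的能量):

```python
def channel_gains(scene, target_channels):
    """公式(15)的示意实现:Gain(i)=E(i)/E(1)。
    scene: C1×L的场景音频信号;target_channels: 第五音频信号所含通道的索引;
    返回各目标通道的增益信息字典。"""
    total_energy = sum(s * s for ch in scene for s in ch)  # 全部通道的总能量E(1)
    return {i: sum(s * s for s in scene[i]) / total_energy
            for i in target_channels}
```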
S707,编码特征信息,以得到第二码流。
示例性的,可以编码场景音频信号中第五音频信号所对应的特征信息,以得到第二码流。后续,可以将第二码流发送给解码端,这样,解码端可以基于场景音频信号中第五音频信号所对应的特征信息,对第一重建场景音频信号中第七音频信号/第八音频信号进行补偿,以提高第一重建场景音频信号的音频质量。
图8为示例性示出的解码过程的示意图。图8为与图7的编码过程对应的解码过程。
S801,接收第一码流和第二码流。
S802,解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息。
S803,解码第二码流,以得到场景音频信号中第五音频信号所对应的特征信息。
应该理解的是,当编码端对特征信息进行的是有损压缩时,解码端解码得到的特征信息和编码端编码的特征信息存在差异。当编码端对特征信息进行的是无损压缩时,解码端解码得到的特征信息和编码端编码的特征信息相同。(其中,本申请对编码端编码的特征信息和解码端解码得到的特征信息,未从名称上进行区分。)
S804,基于属性信息,确定第一虚拟扬声器系数。
S805,基于第一重建信号和第一虚拟扬声器系数,生成虚拟扬声器信号。
S806,基于属性信息,确定第二虚拟扬声器系数。
S807,基于虚拟扬声器信号和第二虚拟扬声器系数,以得到第一重建场景音频信号。
示例性的,S801~S807,可以参照上述S501~S506的描述,在此不再赘述。
S808,基于特征信息,对第一重建场景音频信号中的第七音频信号进行补偿。
示例性的,可以基于场景音频信号中的第五音频信号所对应的特征信息,对第一重建场景音频信号中的第七音频信号进行补偿,以提升第一重建场景音频信号中第七音频信号的质量。
示例性的,当特征信息为增益信息时,可以参照如下公式(16)进行补偿:
E(i)=Gain(i)*E(1)     (16)
其中,i为第一重建场景音频信号中第七音频信号包含的通道的通道号,E(i)为第i个通道的能量,E(1)为第一重建场景音频信号中C2个通道音频信号的能量,Gain(i)为场景音频信号中第五音频信号中第i个通道的音频信号所对应的增益信息。
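公式(16)的补偿操作可以理解为将对应通道的能量缩放到Gain(i)*E(1);以下Python代码为该能量缩放的一种示意实现(函数名为本示例自拟,通道当前能量为0时原样返回,这一处理方式也仅为假设):

```python
import math

def compensate_channel(channel, gain, total_energy):
    """公式(16)的示意实现:将通道能量缩放到 gain*total_energy。
    channel: 待补偿通道的采样序列;gain: 该通道的增益信息Gain(i);
    total_energy: 对应文中的E(1)。"""
    cur = sum(s * s for s in channel)
    if cur == 0.0:
        return list(channel)                 # 能量为0时无法缩放,原样返回
    scale = math.sqrt(gain * total_energy / cur)
    return [s * scale for s in channel]
```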
S809,基于第二重建信号和第七音频信号,生成第二重建场景音频信号。
示例性的,S809中的第七音频信号为基于特征信息补偿后的第七音频信号;S809可以参照上文的描述,在此不再赘述。
应该理解的是,基于特征信息,对第一重建场景音频信号中的第八音频信号进行补偿;以及基于第二重建信号、第四重建信号和第一重建场景音频信号中第八音频信号(基于特征信息补偿后的第八音频信号),生成第二重建场景音频信号;可以参照S808~S809的描述,在此不再赘述。
应该理解的是,即使不执行S809,也可以执行S808,也就是说,可以对第一重建场景音频信号进行补偿,将补偿后的第一重建场景音频信号作为最终的重建场景音频信号;这样,也可以提高最终的重建场景音频信号的音频质量。
图9a为示例性示出的编码端的结构示意图。其中,图9a为在图6a的基础上示出的编码端的结构。
参照图9a,示例性的,编码端可以包括配置单元、虚拟扬声器生成单元、目标扬声器生成单元、核心编码器和特征提取单元。应该理解的是,图9a仅是本申请的一个示例,本申请的编码端可以包括比图9a示出的更多或更少的模块,在此不再赘述。
示例性的,图9a中的配置单元、虚拟扬声器生成单元、目标扬声器生成单元,可以参照图6a的描述,在此不再赘述。
示例性的,特征提取单元,可以用于获取场景音频信号中第五音频信号所对应的特征信息。
示例性的,核心编码器,可以用于对场景音频信号中第一音频信号和目标虚拟扬声器的属性信息进行编码,以得到第一码流;以及对场景音频信号中第五音频信号所对应的特征信息进行编码,以得到第二码流。
示例性的,上述图1a和图1b中的场景音频编码模块可以包括图9a的配置单元、虚拟扬声器生成单元、目标扬声器生成单元、核心编码器和特征提取单元;或者,仅包括核心编码器。
图9b为示例性示出的解码端的结构示意图。
参照图9b,示例性的,解码端可以包括核心解码器、虚拟扬声器系数生成单元、虚拟扬声器信号生成单元、第一重建单元、补偿单元和第二重建单元。应该理解的是,图9b仅是本申请的一个示例,本申请的解码端可以包括比图9b示出的更多或更少的模块,在此不再赘述。
示例性的,图9b中的虚拟扬声器系数生成单元、虚拟扬声器信号生成单元和第一重建单元,可以参照图6b中的描述,在此不再赘述。
示例性的,核心解码器,可以用于解码第一码流,以得到第一重建信号和目标虚拟扬声器的属性信息;还可以用于解码第二码流,以得到场景音频信号中第五音频信号所对应的特征信息。
示例性的,补偿单元,可以用于基于第五音频信号所对应的特征信息,对第七音频信号/第八音频信号进行补偿。
示例性的,第二重建单元,可以用于基于第二重建信号和补偿后的第七音频信号,生成第二重建场景音频信号;或者,用于基于第二重建信号、第四重建信号和补偿后的第八音频信号,生成第二重建场景音频信号。
示例性的,上述图1a和图1b中的场景音频解码模块可以包括图9b的核心解码器、虚拟扬声器系数生成单元、虚拟扬声器信号生成单元、第一重建单元、补偿单元和第二重建单元;或者,仅包括核心解码器。
An example is given to illustrate the above encoding and decoding process. For example, the scene audio signal to be encoded is a 3rd-order HOA signal including 16 channels. Suppose the encoder selects 4 target virtual speakers and K = 9. Then the audio signals of 9 channels of the scene audio signal and the attribute information of the 4 target virtual speakers may be encoded to obtain the first bitstream, and the feature information corresponding to the audio signals of the other 7 channels of the scene audio signal may be encoded to obtain the second bitstream. The encoder sends the first bitstream and the second bitstream to the decoder. The decoder decodes the first bitstream to obtain the attribute information of the 4 target virtual speakers and the audio signals of the 9 channels of the scene audio signal, and decodes the second bitstream to obtain the feature information corresponding to the audio signals of the other 7 channels of the scene audio signal. Next, 4 virtual speaker signals may be generated from the attribute information of the 4 target virtual speakers and the audio signals of the 9 channels of the scene audio signal. Then, the first reconstructed scene audio signal, i.e., a 3rd-order HOA signal, is generated from the 4 virtual speaker signals and the attribute information of the 4 target virtual speakers. The decoded feature information is then applied to the corresponding 7 channels of audio signals in the first reconstructed scene audio signal; finally, the decoded 9 channels of audio signals of the scene audio signal and the compensated 7 channels of audio signals of the first reconstructed scene audio signal are concatenated by channel to obtain the second reconstructed scene audio signal. The second reconstructed scene audio signal is a 3rd-order HOA signal including 16 channels.
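The channel bookkeeping in this example (a 16-channel 3rd-order HOA signal with K = 9) can be sketched as follows. This is a toy illustration assuming the 9 transported channels are simply the first 9 channels in order, which is one possible choice rather than the channel-selection rule of this application; all names are hypothetical.

```python
N1 = 3
C1 = (N1 + 1) ** 2  # 16 channels for a 3rd-order HOA signal
K = 9               # channels carried in the first bitstream

def split_channels(hoa):
    """Split a C1-channel HOA frame into transported and remaining channels.

    Assumption: the first K channels are the ones encoded directly; the
    remaining C1 - K channels are described only by feature information.
    """
    assert len(hoa) == C1
    return hoa[:K], hoa[K:]  # 9 encoded channels, 7 feature-described channels

def merge_channels(decoded, compensated):
    """Concatenate by channel to form the second reconstructed scene signal."""
    return decoded + compensated
```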
Tests show that, at a rate of 768 kbps, the encoding performance of this application is better than that of the prior art, achieving transparent sound quality with no directional deviation.
Fig. 10 is a schematic structural diagram of an exemplary scene audio encoding apparatus. The scene audio encoding apparatus in Fig. 10 may be used to perform the encoding method of the foregoing embodiments; therefore, for the beneficial effects it can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which are not repeated here. The scene audio encoding apparatus may include:
a signal obtaining module 1001, configured to obtain a scene audio signal to be encoded, the scene audio signal including audio signals of C1 channels, C1 being a positive integer;
an attribute information obtaining module 1002, configured to determine attribute information of a target virtual speaker based on the scene audio signal;
an encoding module 1003, configured to encode a first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream, where the first audio signal is audio signals of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
Exemplarily, the first audio signal includes the second audio signal.
Exemplarily, the first audio signal further includes a fourth audio signal, where the fourth audio signal is audio signals of some channels of the third audio signal.
Exemplarily, the attribute information of the target virtual speaker includes at least one of: position information of the target virtual speaker, a position index corresponding to the position information of the target virtual speaker, or a virtual speaker index of the target virtual speaker.
Exemplarily, the attribute information obtaining module 1002 is specifically configured to obtain multiple sets of virtual speaker coefficients corresponding to multiple candidate virtual speakers, the multiple sets of virtual speaker coefficients being in one-to-one correspondence with the multiple candidate virtual speakers; select the target virtual speaker from the multiple candidate virtual speakers based on the scene audio signal and the multiple sets of virtual speaker coefficients; and obtain the attribute information of the target virtual speaker.
Exemplarily, the attribute information obtaining module 1002 is specifically configured to perform inner products of the scene audio signal with the multiple sets of virtual speaker coefficients respectively to obtain multiple inner product values, the multiple inner product values being in one-to-one correspondence with the multiple sets of virtual speaker coefficients; and select the target virtual speaker from the multiple candidate virtual speakers based on the multiple inner product values.
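The inner-product selection performed by this module can be sketched as follows. This is a simplified illustration under two stated assumptions: each inner product is taken as the sum over channels and samples of the scene signal weighted by the candidate's coefficients, and the candidates with the largest absolute inner-product values are chosen as targets. The function and variable names are hypothetical.

```python
def select_target_speakers(scene, candidate_coefs, num_targets):
    """Pick target virtual speakers by inner product with the scene signal.

    scene:           C1 x T scene audio signal (list of channels)
    candidate_coefs: one coefficient list of length C1 per candidate speaker
    num_targets:     number of target virtual speakers to select
    Returns the indices of the selected candidate speakers.
    """
    scores = []
    for idx, coefs in enumerate(candidate_coefs):
        # Inner product of the scene signal with this candidate's coefficients.
        value = sum(c * sample
                    for c, channel in zip(coefs, scene)
                    for sample in channel)
        scores.append((abs(value), idx))
    scores.sort(reverse=True)  # assumed criterion: largest magnitude wins
    return sorted(idx for _, idx in scores[:num_targets])
```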
Exemplarily, the scene audio encoding apparatus further includes: a feature information obtaining module, configured to obtain feature information corresponding to a fifth audio signal in the scene audio signal, where the fifth audio signal is the third audio signal, or the fifth audio signal is audio signals in the scene audio signal other than the second audio signal and the fourth audio signal; the encoding module 1003 is further configured to encode the feature information to obtain a second bitstream.
Exemplarily, the feature information includes gain information.
Fig. 11 is a schematic structural diagram of an exemplary scene audio decoding apparatus. The scene audio decoding apparatus in Fig. 11 may be used to perform the decoding method of the foregoing embodiments; therefore, for the beneficial effects it can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which are not repeated here. The scene audio decoding apparatus may include:
a bitstream receiving module 1101, configured to receive a first bitstream;
a decoding module 1102, configured to decode the first bitstream to obtain a first reconstructed signal and attribute information of a target virtual speaker, where the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;
a virtual speaker signal generation module 1103, configured to generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information and the first reconstructed signal;
a scene audio signal reconstruction module 1104, configured to perform reconstruction based on the attribute information and the virtual speaker signal to obtain a first reconstructed scene audio signal, where the first reconstructed scene audio signal includes audio signals of C2 channels, and C2 is a positive integer.
Exemplarily, the scene audio decoding apparatus further includes: a signal generation module 1105, configured to generate a second reconstructed scene audio signal based on the first reconstructed signal and the first reconstructed scene audio signal, where the second reconstructed scene audio signal includes audio signals of C2 channels, and C2 is a positive integer.
Exemplarily, the signal generation module 1105 is specifically configured to: when the first audio signal includes a second audio signal, generate the second reconstructed scene audio signal based on a second reconstructed signal and a seventh audio signal, where the second reconstructed signal is a reconstructed signal of the second audio signal.
Exemplarily, the signal generation module 1105 is specifically configured to: when the first audio signal includes a second audio signal and a fourth audio signal, generate the second reconstructed scene audio signal based on a second reconstructed signal, a fourth reconstructed signal and an eighth audio signal, where the fourth audio signal is part of the audio signals of a third audio signal, the fourth reconstructed signal is a reconstructed signal of the fourth audio signal, the second reconstructed signal is a reconstructed signal of the second audio signal, and the eighth audio signal is part of the audio signals of the seventh audio signal.
Exemplarily, the virtual speaker signal generation module 1103 is specifically configured to determine first virtual speaker coefficients corresponding to the target virtual speaker based on the attribute information of the target virtual speaker, and generate the virtual speaker signal based on the first reconstructed signal and the first virtual speaker coefficients.
Exemplarily, the scene audio signal reconstruction module 1104 is specifically configured to determine second virtual speaker coefficients corresponding to the target virtual speaker based on the attribute information of the target virtual speaker, and perform reconstruction based on the virtual speaker signal and the second virtual speaker coefficients to obtain the first reconstructed scene audio signal.
Exemplarily, the bitstream receiving module 1101 is further configured to receive a second bitstream; the decoding module 1102 is further configured to decode the second bitstream to obtain feature information corresponding to a fifth audio signal in the scene audio signal, where the fifth audio signal is the third audio signal; the scene audio decoding apparatus further includes: a compensation module, configured to compensate the seventh audio signal based on the feature information.
Exemplarily, the bitstream receiving module 1101 is further configured to receive a second bitstream; the decoding module 1102 is further configured to decode the second bitstream to obtain feature information corresponding to a fifth audio signal in the scene audio signal, where the fifth audio signal is audio signals in the scene audio signal other than the second audio signal and the fourth audio signal; the scene audio decoding apparatus further includes: a compensation module, configured to compensate the eighth audio signal based on the feature information.
Exemplarily, the feature information includes gain information.
In an example, Fig. 12 shows a schematic block diagram of an apparatus 1200 according to an embodiment of this application. The apparatus 1200 may include a processor 1201 and a transceiver/transceiver pins 1202, and optionally a memory 1203.
The components of the apparatus 1200 are coupled together through a bus 1204, where the bus 1204 includes, in addition to a data bus, a power bus, a control bus and a status signal bus. However, for clarity of description, the various buses are all referred to as the bus 1204 in the figure.
Optionally, the memory 1203 may be used to store the instructions in the foregoing method embodiments. The processor 1201 may be configured to execute the instructions in the memory 1203, control the receive pin to receive signals, and control the transmit pin to send signals.
The apparatus 1200 may be the electronic device, or a chip of the electronic device, in the foregoing method embodiments.
All relevant content of the steps involved in the foregoing method embodiments may be incorporated into the functional descriptions of the corresponding functional modules, which is not repeated here.
This embodiment further provides a chip, which includes one or more interface circuits and one or more processors; the interface circuits are configured to receive signals from a memory of an electronic device and send the signals to the processors, the signals including computer instructions stored in the memory; when the processors execute the computer instructions, the electronic device is caused to perform the methods in the foregoing embodiments. The interface circuits may be the transceiver 1202 in Fig. 12.
This embodiment further provides a computer-readable storage medium storing computer instructions which, when run on an electronic device, cause the electronic device to perform the foregoing related method steps to implement the scene audio encoding and decoding methods in the foregoing embodiments.
This embodiment further provides a computer program product which, when run on a computer, causes the computer to perform the foregoing related steps to implement the scene audio encoding and decoding methods in the foregoing embodiments.
This embodiment further provides an apparatus for storing a bitstream, the apparatus including a receiver and at least one storage medium; the receiver is configured to receive a bitstream; the at least one storage medium is configured to store the bitstream; the bitstream is generated according to the scene audio encoding method in the foregoing embodiments.
An embodiment of this application provides an apparatus for transmitting a bitstream, the apparatus including a transmitter and at least one storage medium; the at least one storage medium is configured to store a bitstream generated according to the scene audio encoding method in the foregoing embodiments; the transmitter is configured to obtain the bitstream from the storage medium and send the bitstream to a terminal-side device through a transmission medium.
An embodiment of this application provides a system for distributing a bitstream, the system including: at least one storage medium, configured to store at least one bitstream generated according to the scene audio encoding method in the foregoing embodiments; and a streaming media device, configured to obtain a target bitstream from the at least one storage medium and send the target bitstream to a terminal-side device, where the streaming media device includes a content server or a content distribution server.
In addition, an embodiment of this application further provides an apparatus, which may specifically be a chip, a component or a module; the apparatus may include a processor and a memory connected to each other, where the memory is configured to store computer-executable instructions; when the apparatus runs, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the scene audio encoding and decoding methods in the foregoing method embodiments.
The electronic device, computer-readable storage medium, computer program product or chip provided in this embodiment is each used to perform the corresponding method provided above; therefore, for the beneficial effects it can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which are not repeated here.
From the description of the above implementations, those skilled in the art can understand that, for convenience and brevity of description, only the division of the above functional modules is used as an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of modules or units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.
Units described as separate components may or may not be physically separate, and components shown as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Any content of the various embodiments of this application, and any content of the same embodiment, may be freely combined; any combination of the above content falls within the scope of this application.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations described above; the specific implementations described above are merely illustrative rather than restrictive. Inspired by this application, those of ordinary skill in the art may make many other forms without departing from the spirit of this application and the scope protected by the claims, all of which fall within the protection of this application.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of this application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer-readable storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Claims (24)

  1. A scene audio encoding method, characterized in that the method comprises:
    obtaining a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer;
    determining attribute information of a target virtual speaker based on the scene audio signal;
    encoding a first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream, wherein the first audio signal is audio signals of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
  2. The method according to claim 1, characterized in that
    the scene audio signal is an N1-th order higher order ambisonics (HOA) signal, the N1-th order HOA signal comprises a second audio signal and a third audio signal, the second audio signal is signals of order 0 to order M in the N1-th order HOA signal, the third audio signal is audio signals in the N1-th order HOA signal other than the second audio signal, M is an integer less than N1, C1 is equal to the square of (N1+1), and N1 is a positive integer;
    the first audio signal comprises the second audio signal.
  3. The method according to claim 2, characterized in that the first audio signal further comprises a fourth audio signal, and the fourth audio signal is audio signals of some channels of the third audio signal.
  4. The method according to any one of claims 1 to 3, characterized in that
    the attribute information of the target virtual speaker comprises at least one of: position information of the target virtual speaker, a position index corresponding to the position information of the target virtual speaker, or a virtual speaker index of the target virtual speaker.
  5. The method according to any one of claims 1 to 4, characterized in that determining the attribute information of the target virtual speaker based on the scene audio signal comprises:
    obtaining multiple sets of virtual speaker coefficients corresponding to multiple candidate virtual speakers, wherein the multiple sets of virtual speaker coefficients are in one-to-one correspondence with the multiple candidate virtual speakers;
    selecting the target virtual speaker from the multiple candidate virtual speakers based on the scene audio signal and the multiple sets of virtual speaker coefficients;
    obtaining the attribute information of the target virtual speaker.
  6. The method according to claim 5, characterized in that selecting the target virtual speaker from the multiple candidate virtual speakers based on the scene audio signal and the multiple sets of virtual speaker coefficients comprises:
    performing inner products of the scene audio signal with the multiple sets of virtual speaker coefficients respectively to obtain multiple inner product values, wherein the multiple inner product values are in one-to-one correspondence with the multiple sets of virtual speaker coefficients;
    selecting the target virtual speaker from the multiple candidate virtual speakers based on the multiple inner product values.
  7. The method according to any one of claims 1 to 6, characterized in that
    the scene audio signal is an N1-th order HOA signal, the N1-th order HOA signal comprises a second audio signal and a third audio signal, the second audio signal is signals of order 0 to order M in the N1-th order HOA signal, the third audio signal is audio signals in the N1-th order HOA signal other than the second audio signal, M is an integer less than N1, C1 is equal to the square of (N1+1), and N1 is a positive integer;
    the method further comprises:
    obtaining feature information corresponding to a fifth audio signal in the scene audio signal;
    encoding the feature information to obtain a second bitstream;
    wherein the fifth audio signal is the third audio signal, or the fifth audio signal is audio signals in the scene audio signal other than the second audio signal and a fourth audio signal, and the fourth audio signal is audio signals of some channels of the third audio signal.
  8. The method according to claim 7, characterized in that
    the feature information comprises gain information.
  9. A bitstream generation method, characterized in that a bitstream is generated according to the encoding method of any one of claims 1 to 8.
  10. A scene audio encoding apparatus, characterized in that the apparatus comprises:
    a signal obtaining module, configured to obtain a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer;
    an attribute information obtaining module, configured to determine attribute information of a target virtual speaker based on the scene audio signal;
    an encoding module, configured to encode a first audio signal in the scene audio signal and the attribute information of the target virtual speaker to obtain a first bitstream, wherein the first audio signal is audio signals of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
  11. The apparatus according to claim 10, characterized in that
    the scene audio signal is an N1-th order higher order ambisonics (HOA) signal, the N1-th order HOA signal comprises a second audio signal and a third audio signal, the second audio signal is signals of order 0 to order M in the N1-th order HOA signal, the third audio signal is audio signals in the N1-th order HOA signal other than the second audio signal, M is an integer less than N1, C1 is equal to the square of (N1+1), and N1 is a positive integer;
    the first audio signal comprises the second audio signal.
  12. The apparatus according to claim 11, characterized in that the first audio signal further comprises a fourth audio signal, and the fourth audio signal is audio signals of some channels of the third audio signal.
  13. The apparatus according to any one of claims 10 to 12, characterized in that
    the attribute information of the target virtual speaker comprises at least one of: position information of the target virtual speaker, a position index corresponding to the position information of the target virtual speaker, or a virtual speaker index of the target virtual speaker.
  14. The apparatus according to any one of claims 10 to 13, characterized in that
    the attribute information obtaining module is specifically configured to obtain multiple sets of virtual speaker coefficients corresponding to multiple candidate virtual speakers, the multiple sets of virtual speaker coefficients being in one-to-one correspondence with the multiple candidate virtual speakers; select the target virtual speaker from the multiple candidate virtual speakers based on the scene audio signal and the multiple sets of virtual speaker coefficients; and obtain the attribute information of the target virtual speaker.
  15. The apparatus according to claim 14, characterized in that
    the attribute information obtaining module is specifically configured to perform inner products of the scene audio signal with the multiple sets of virtual speaker coefficients respectively to obtain multiple inner product values, the multiple inner product values being in one-to-one correspondence with the multiple sets of virtual speaker coefficients; and select the target virtual speaker from the multiple candidate virtual speakers based on the multiple inner product values.
  16. The apparatus according to any one of claims 10 to 15, characterized in that the scene audio signal is an N1-th order HOA signal, the N1-th order HOA signal comprises a second audio signal and a third audio signal, the second audio signal is signals of order 0 to order M in the N1-th order HOA signal, the third audio signal is audio signals in the N1-th order HOA signal other than the second audio signal, M is an integer less than N1, C1 is equal to the square of (N1+1), and N1 is a positive integer;
    the apparatus further comprises:
    a feature information obtaining module, configured to obtain feature information corresponding to a fifth audio signal in the scene audio signal, wherein the fifth audio signal is the third audio signal, or the fifth audio signal is audio signals in the scene audio signal other than the second audio signal and a fourth audio signal, and the fourth audio signal is audio signals of some channels of the third audio signal;
    the encoding module is further configured to encode the feature information to obtain a second bitstream.
  17. The apparatus according to claim 16, characterized in that
    the feature information comprises gain information.
  18. An electronic device, characterized by comprising:
    a memory and a processor, the memory being coupled to the processor;
    the memory stores program instructions which, when executed by the processor, cause the electronic device to perform the scene audio encoding method according to any one of claims 1 to 8.
  19. A chip, characterized by comprising one or more interface circuits and one or more processors; the interface circuits are configured to receive signals from a memory of an electronic device and send the signals to the processors, the signals comprising computer instructions stored in the memory; when the processors execute the computer instructions, the electronic device is caused to perform the scene audio encoding method according to any one of claims 1 to 8.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run on a computer or a processor, causes the computer or the processor to perform the scene audio encoding method according to any one of claims 1 to 8.
  21. A computer program product, characterized in that the computer program product contains a software program which, when executed by a computer or a processor, causes the steps of the method according to any one of claims 1 to 8 to be performed.
  22. An apparatus for storing a bitstream, characterized in that the apparatus comprises a receiver and at least one storage medium,
    the receiver being configured to receive a bitstream;
    the at least one storage medium being configured to store the bitstream;
    the bitstream being generated according to the scene audio encoding method of any one of claims 1 to 8.
  23. An apparatus for transmitting a bitstream, characterized in that the apparatus comprises a transmitter and at least one storage medium,
    the at least one storage medium being configured to store a bitstream generated according to the scene audio encoding method of any one of claims 1 to 8;
    the transmitter being configured to obtain the bitstream from the storage medium and send the bitstream to a terminal-side device through a transmission medium.
  24. A system for distributing a bitstream, characterized in that the system comprises:
    at least one storage medium, configured to store at least one bitstream generated according to the scene audio encoding method of any one of claims 1 to 8; and
    a streaming media device, configured to obtain a target bitstream from the at least one storage medium and send the target bitstream to a terminal-side device, wherein the streaming media device comprises a content server or a content distribution server.
PCT/CN2023/131640 2022-12-02 2023-11-14 Scene audio encoding method and electronic device WO2024114373A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211537851.0A CN118136027A (zh) 2022-12-02 2022-12-02 Scene audio encoding method and electronic device
CN202211537851.0 2022-12-02

Publications (1)

Publication Number Publication Date
WO2024114373A1 true WO2024114373A1 (zh) 2024-06-06

Family

ID=91238605

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/131640 WO2024114373A1 (zh) 2022-12-02 2023-11-14 Scene audio encoding method and electronic device

Country Status (2)

Country Link
CN (1) CN118136027A (zh)
WO (1) WO2024114373A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120137253A (ko) * 2011-06-09 2012-12-20 Samsung Electronics Co., Ltd. Method and apparatus for encoding and decoding a three-dimensional audio signal
CN113314129A (zh) * 2021-04-30 2021-08-27 Peking University Environment-adaptive spatial decoding method for sound field reproduction
CN114582356A (zh) * 2020-11-30 2022-06-03 Huawei Technologies Co., Ltd. Audio encoding and decoding method and apparatus
TW202247148A (zh) * 2021-05-17 2022-12-01 Huawei Technologies Co., Ltd. Three-dimensional audio signal encoding method, apparatus and encoder


Also Published As

Publication number Publication date
CN118136027A (zh) 2024-06-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23896525

Country of ref document: EP

Kind code of ref document: A1