WO2023216119A1 - Audio signal encoding method and apparatus, electronic device and storage medium - Google Patents


Info

Publication number
WO2023216119A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
target
scene
input format
Prior art date
Application number
PCT/CN2022/092082
Other languages
French (fr)
Chinese (zh)
Inventor
高硕
Original Assignee
北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Priority to CN202280001342.8A (published as CN117813652A)
Priority to PCT/CN2022/092082 (published as WO2023216119A1)
Publication of WO2023216119A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

  • the present disclosure relates to the field of communication technology, and in particular to an audio signal encoding method and apparatus, an electronic device, and a storage medium.
  • users negotiate an audio format when establishing voice communication.
  • the negotiated audio format is then always used to transmit the local user's audio signal to the remote user.
  • however, the audio scene of the user's voice communication may change, and an audio signal in this fixed audio format may not be able to provide the remote user with the real audio scene information of the local user in the changed audio scene, resulting in a poor user experience. This is a problem that needs to be solved urgently.
  • Embodiments of the present disclosure provide an audio signal encoding method, device, electronic device and storage medium, which enables a remote user to obtain audio scene information of the audio scene where the local user is located, thereby improving user experience.
  • embodiments of the present disclosure provide an audio signal encoding method, which includes: acquiring an audio signal; determining an audio scene type corresponding to the audio signal; determining a target input format audio signal according to the audio scene type and the audio signal; and encoding the target input format audio signal to generate a target encoding code stream.
  • the audio signal is obtained; the audio scene type corresponding to the audio signal is determined; the target input format audio signal is determined based on the audio scene type and audio signal; the target input format audio signal is encoded to generate the target encoding code stream.
  • determining the audio scene type corresponding to the audio signal includes: obtaining audio characteristic parameters of the audio signal; inputting the audio characteristic parameters into an audio scene type analysis model, and determining the audio signal The corresponding audio scene type.
  • determining the target input format audio signal according to the audio scene type and the audio signal includes: determining the target audio signal input format according to the audio scene type and/or the audio signal; The target input format audio signal is determined based on the target audio signal input format and the audio signal.
  • Determining the target audio signal input format according to the audio scene type and the audio signal includes:
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one background audio signal, determining the target audio signal input format to be a scene-based signal input format.
  • determining the target input format audio signal according to the target audio signal input format and the audio signal includes:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is an object-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is a scene-based audio signal among the audio signals.
  • encoding the target input format audio signal to generate a target encoding code stream includes:
  • the target input format audio signal is encoded according to a target encoding core to obtain encoding parameters, and code stream multiplexing is performed according to the encoding parameters to generate the target encoding code stream.
  • determining the target signal encoding core according to the target input format audio signal includes:
  • in response to determining that the target input format audio signal is an object-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, and determining that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core;
  • in response to determining that the target input format audio signal is a scene-based audio signal, determining that the target encoding core is a scene audio data encoding core;
  • in response to determining that the target input format audio signal is a channel-based audio signal, determining that the target encoding core is a channel audio data encoding core;
  • in response to determining that the target input format audio signal is a metadata-assisted spatial audio (MASA) signal, determining that the target encoding core for the audio data in the MASA audio signal is a scene audio data encoding core, and determining that the target encoding core for the spatial auxiliary metadata in the MASA audio signal is a spatial auxiliary metadata parameter encoding core;
  • in response to determining that the target input format audio signal is a mixed format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed format audio signal, wherein the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • determining the target encoding core according to the audio scene type and the target input format audio signal includes:
  • in response to determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data is an object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
  • in response to determining that the target input format audio signal is a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core.
  • inventions of the present disclosure provide an audio signal encoding device.
  • the audio signal encoding device includes: a signal acquisition unit configured to acquire an audio signal; a type determination unit configured to determine the audio scene type corresponding to the audio signal; a target signal determination unit configured to determine a target input format audio signal according to the audio scene type and the audio signal; and an encoding processing unit configured to encode the target input format audio signal to generate a target encoding code stream.
  • the type determination unit includes:
  • a parameter acquisition module configured to acquire audio characteristic parameters of the audio signal
  • a model processing module configured to input the audio feature parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  • the target signal determination unit includes: a target format determination module configured to determine the target audio signal input format according to the audio scene type and/or the audio signal; and a target signal determination module configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
  • the target format determination module is specifically configured as:
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one background audio signal, determining the target audio signal input format to be a scene-based signal input format.
  • the target signal determination module is specifically configured as:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is an object-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is a scene-based audio signal among the audio signals.
  • the encoding processing unit includes: an encoding core determination module configured to determine a target encoding core according to the target input format audio signal, or to determine the target encoding core according to the audio scene type and the target input format audio signal;
  • a parameter acquisition module configured to encode the target input format audio signal according to the target encoding core to obtain encoding parameters;
  • a code stream generation module configured to perform code stream multiplexing according to the encoding parameters to generate the target encoding code stream.
  • the encoding core determination module is specifically configured as:
  • in response to determining that the target input format audio signal is an object-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, and determining that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core;
  • in response to determining that the target input format audio signal is a scene-based audio signal, determining that the target encoding core is a scene audio data encoding core;
  • in response to determining that the target input format audio signal is a channel-based audio signal, determining that the target encoding core is a channel audio data encoding core;
  • in response to determining that the target input format audio signal is a MASA-based audio signal, determining that the target encoding core for the audio data in the MASA audio signal is a scene audio data encoding core, and determining that the target encoding core for the spatial auxiliary metadata in the MASA audio signal is a spatial auxiliary metadata parameter encoding core;
  • in response to determining that the target input format audio signal is a mixed format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed format audio signal, wherein the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • the encoding core determination module is specifically configured as:
  • in response to determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data is an object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
  • in response to determining that the target input format audio signal is a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core.
  • embodiments of the present disclosure provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method described in the first aspect.
  • embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first aspect.
  • embodiments of the present disclosure provide a computer program product, including computer instructions, characterized in that, when executed by a processor, the computer instructions implement the method described in the first aspect.
  • Figure 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure
  • Figure 2 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 3 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 4 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 5 is a flow chart of yet another audio signal encoding method provided by an embodiment of the present disclosure.
  • Figure 6 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
  • Figure 7 is a structural diagram of a type determination unit in the audio signal encoding device provided by an embodiment of the present disclosure
  • Figure 8 is a structural diagram of a target signal determination unit in the audio signal encoding device provided by an embodiment of the present disclosure
  • Figure 9 is a structural diagram of a coding processing unit in the audio signal coding device provided by an embodiment of the present disclosure.
  • FIG. 10 is a structural diagram of an electronic device according to an embodiment of the present disclosure.
  • "at least one" in the present disclosure can also be described as one or more, and "a plurality" can be two, three, four, or more; the present disclosure does not limit this.
  • technical features may be distinguished by "first", "second", "third", "A", "B", "C", "D", etc.
  • the technical features described with "first", "second", "third", "A", "B", "C", and "D" are in no particular order or sequence.
  • each table in this disclosure can be configured or predefined.
  • the values of the information in each table are only examples and can be configured as other values, which is not limited by this disclosure.
  • it is not necessarily required to configure all the correspondences shown in each table.
  • the corresponding relationships shown in some rows may not be configured.
  • appropriate adaptations, such as splitting or merging, can be made based on the above tables.
  • the names of the parameters shown in the titles of the above tables may also be other names understandable by the communication device, and the values or expressions of the parameters may also be other values or expressions understandable by the communication device.
  • other data structures can also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, hash tables, etc.
  • the first generation of mobile communication technology is the first generation of wireless cellular technology and is an analog mobile communication network.
  • the 3G mobile communication system was proposed by the ITU (International Telecommunication Union) for international mobile communications in 2000.
  • 4G is a further improvement on 3G technology.
  • both data and voice use an all-IP approach to provide real-time HD+Voice services for voice and audio.
  • the EVS codec used can accommodate high-quality compression of both voice and audio.
  • the voice and audio communication services described above have expanded from narrowband signals to ultra-wideband and even full-band services, but they are still mono services. People's demand for high-quality audio continues to increase; compared with mono audio, stereo audio offers a sense of orientation and distribution for each sound source and improves clarity.
  • three signal formats, including channel-based multi-channel audio signals, object-based audio signals, and scene-based audio signals, can provide three-dimensional audio services.
  • the IVAS (Immersive Voice and Audio Services) codec being standardized by 3GPP SA4 can support the encoding and decoding requirements of the above three signal formats.
  • Terminal devices that can support 3D audio services include mobile phones, computers, tablets, conference system equipment, augmented reality AR/virtual reality VR equipment, cars, etc.
  • the audio scene in which the local user is located may be constantly changing.
  • the local user communicates in real time with a remote user in a quiet and empty outdoor place.
  • the local user's terminal device chooses the mono signal format as the input signal format, which can well convey the audio scene of the local user to the remote user.
  • a bird may fly over during a certain period of time, and the bird's call is an important audio element of the current audio scene.
  • the bird's call cannot be well transmitted to the remote user.
  • the local user's terminal device can analyze the audio scene in which the local user is located in real time, and use the obtained audio scene type to guide the audio signal generator to output the optimal audio format signal, thereby ensuring that the selected audio format signal can better represent the audio scene of the local user, so that the remote user can obtain the audio scene information of the audio scene where the local user is located, improving the user experience.
  • Figure 1 is a flow chart of an audio signal encoding method provided by an embodiment of the present disclosure.
  • the method may include but is not limited to the following steps:
  • when the local user establishes voice communication with any remote user, the local user can establish voice communication with the terminal device of the remote user through the local user's terminal device, and the terminal device of the local user can acquire the audio signal in real time from the sound information of the environment where the local user is located.
  • the sound information of the environment where the local user is located includes the sound information emitted by the local user, the sound information of surrounding things, etc.
  • sound information of surrounding things includes, for example: the sound of moving vehicles, bird calls, wind, the voices of other users around the local user, etc.
  • the terminal device is an entity on the user side used to receive or transmit signals, such as a mobile phone, computer, tablet, watch, walkie-talkie, conference system equipment, augmented reality (AR)/virtual reality (VR) equipment, a car, etc.
  • Terminal equipment can also be called user equipment (user equipment, UE), mobile station (mobile station, MS), mobile terminal equipment (mobile terminal, MT), etc.
  • the terminal device can be a car with communication functions, a smart car, a mobile phone, a wearable device, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal device in industrial control, a wireless terminal device in self-driving, a wireless terminal device in remote medical surgery, a wireless terminal device in a smart grid, a wireless terminal device in transportation safety, a wireless terminal device in a smart city, a wireless terminal device in a smart home, etc.
  • the embodiments of the present disclosure do not limit the specific technology and specific equipment form used by the terminal equipment.
  • the terminal device of the local user can obtain the sound information of the environment where the user is located through a recording device, such as a microphone, provided in the terminal device or used in conjunction with it, and further generate the audio signal.
  • the obtained audio signal can be analyzed to obtain the audio scene type corresponding to the audio signal.
  • the audio signal may include sound information emitted by the local user and/or sound information of surrounding things.
  • the audio scene type corresponding to the audio signal can be determined according to the content included in the audio signal.
  • audio scene types include, for example: offices, theaters, cars, train stations, large shopping malls, etc.
  • for each audio scene type, an input format audio signal of one audio signal input format can be selected, or input format audio signals of multiple audio signal input formats can be selected.
  • the audio scene types can also be divided in other ways, or in combination with other ways; for example, the audio scene types may distinguish whether the audio signal includes only at least one main audio signal, includes at least one main audio signal and a background audio signal, or includes only at least one background audio signal.
  • the audio scene type can be set in advance as needed, and the embodiment of the present disclosure does not impose specific restrictions on this.
  • one or more audio signal input formats can be selected for each audio scene type.
  • the selected audio signal input format can also be determined according to the way the audio signal is obtained, for example, according to the number of channels and the spatial layout relationship of the audio signal.
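The capture-based format choice described above can be sketched as follows; the format names and the `infer_input_format` helper are illustrative assumptions, not defined by this disclosure.

```python
def infer_input_format(num_channels: int, layout: str) -> str:
    """Guess a capture-side input format from channel count and mic layout.

    This is a hypothetical rule of thumb: the disclosure only says the
    format 'can be determined according to the number of channels and
    spatial layout relationship', without fixing a mapping.
    """
    if num_channels == 1:
        return "mono"
    if num_channels == 2:
        return "stereo"
    if layout == "ambisonic" and num_channels == 4:
        return "FOA"            # first-order ambisonics (scene-based)
    if layout == "ambisonic" and num_channels > 4:
        return "HOA"            # higher-order ambisonics (scene-based)
    return "multichannel"       # channel-based layout, e.g. 5.1
```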
  • S3: determine the target input format audio signal according to the audio scene type and the audio signal.
  • when the audio signal is acquired and the audio scene type corresponding to the audio signal is determined, the target input format audio signal can be further determined based on the audio scene type and the audio signal; when the audio signal includes one or more input format audio signals, the target input format audio signal is determined to be one or more of the input format audio signals in the audio signal according to the audio scene type and the audio signal.
  • the audio signal may include one or more input format audio signals.
  • the corresponding relationship between the audio scene type and the input format audio signal can be set in advance; when the audio scene type is determined, the target input format audio signal is determined based on the audio scene type and the corresponding relationship.
  • the target input format audio signal in the audio signal can also be determined according to the audio scene type, the corresponding relationship, and the input format audio signal included in the audio signal.
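The preset correspondence described above, intersected with the input format audio signals actually present in the captured audio, can be illustrated with a minimal lookup; all names here are hypothetical stand-ins for the disclosure's correspondence table.

```python
# Hypothetical scene-type -> preferred input format correspondence,
# standing in for the preset relationship mentioned in the text.
CORRESPONDENCE = {
    "main_only": ["object"],
    "main_and_background": ["object", "scene"],
    "background_only": ["scene"],
}

def target_input_formats(scene_type: str, available: list) -> list:
    """Return the target input formats for a scene type, limited to the
    input format audio signals the captured audio actually contains."""
    preferred = CORRESPONDENCE[scene_type]
    return [fmt for fmt in preferred if fmt in available]
```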
  • when the target input format audio signal is determined, in order to communicate with the remote user, the target input format audio signal is sent to the terminal device of the remote user.
  • therefore, the audio signal in the target input format needs to be encoded to generate the target encoding code stream, which is then sent to the terminal device of the remote user.
  • after receiving the target encoding code stream, the terminal device of the remote user decodes the target encoding code stream to obtain the local user's voice information.
  • an audio signal (audio scene signal) is obtained, the audio scene of the audio signal is analyzed, the audio scene type is determined, and the audio scene type and the audio signal are input to the audio signal format generator to obtain the target input format audio signal (audio format signal).
  • the target encoding core used is determined according to the target input format audio signal and/or the audio scene type. For example, when it is determined that the target input format audio signal is an object-based audio signal, the target encoding core used is determined to be the object audio data encoding core; when it is determined that the target input format audio signal is a scene-based audio signal, the target encoding core used is determined to be the scene audio data encoding core; when it is determined that the target input format audio signal is an object-based audio signal and a scene-based audio signal, the target encoding cores used are determined to be the object audio data encoding core and the scene audio data encoding core; and so on.
  • the target input format audio signal (audio format signal) is encoded using the determined target encoding core, and the target encoding code stream is generated through code stream multiplexing and sent to the terminal device of the remote user.
  • after receiving the target encoding code stream, the terminal device of the remote user decodes the target encoding code stream to obtain the local user's voice information.
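The per-format encoding-core choice described in this embodiment can be sketched as a dispatch table. The core identifiers mirror the description; the function names and string labels are assumptions for illustration, not the disclosure's actual encoder interfaces.

```python
def select_encoding_cores(fmt: str) -> list:
    """Return the target encoding core(s) for one input-format audio signal."""
    table = {
        # Object-based signals split into audio data and metadata cores.
        "object": ["object_audio_data_core", "object_metadata_param_core"],
        "scene": ["scene_audio_data_core"],
        "channel": ["channel_audio_data_core"],
        # MASA audio data reuses the scene core; its spatial auxiliary
        # metadata gets its own parameter core.
        "masa": ["scene_audio_data_core", "spatial_aux_metadata_param_core"],
    }
    return table[fmt]

def select_cores_for_mixed(formats: list) -> list:
    """A mixed-format signal uses the core(s) selected for each component."""
    cores = []
    for fmt in formats:
        cores.extend(select_encoding_cores(fmt))
    return cores
```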
  • the audio signal is obtained; the audio scene type corresponding to the audio signal is determined; the target input format audio signal is determined based on the audio scene type and the audio signal; the target input format audio signal is encoded to generate the target encoding code stream. This ensures that the selected audio format signal can better represent the local user's audio scene, allowing the remote user to obtain audio scene information of the local user's audio scene, thereby improving the user experience.
  • Figure 3 is a flow chart of another audio signal encoding method provided by an embodiment of the present disclosure.
  • the method may include but is not limited to the following steps:
  • the obtained audio signal can be analyzed to obtain the audio scene type corresponding to the audio signal.
  • determining the audio scene type corresponding to the audio signal includes: obtaining audio characteristic parameters of the audio signal; inputting the audio characteristic parameters into the audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  • the audio characteristic parameters of the audio signal are obtained, such as: linear prediction cepstral coefficients, zero-crossing rate, mel-frequency cepstral coefficients, etc.
  • audio scene type analysis models include, for example: an HMM (Hidden Markov Model), a Gaussian mixture model, or other mathematical models.
  • the audio scene type analysis model can determine the audio scene type of the audio signal according to the audio characteristic parameters of the audio signal, wherein the audio scene type analysis model can be obtained through pre-training or through pre-setting; methods in the related art can be used, and the embodiments of the present disclosure do not specifically limit this.
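As a toy illustration of one of the audio characteristic parameters named above, the zero-crossing rate of a frame can be computed as below. The threshold rule standing in for the analysis model is purely hypothetical; the disclosure contemplates a pre-trained HMM or Gaussian mixture model, not a fixed threshold.

```python
def zero_crossing_rate(frame) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classify_scene(features: dict) -> str:
    """Toy stand-in for the audio scene type analysis model: treat a high
    zero-crossing rate as a noisy, background-dominated scene."""
    return "background_only" if features["zcr"] > 0.5 else "main_only"
```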
  • the target audio signal input format is determined according to the audio scene type, including:
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type indicating that the audio scene includes at least one background audio signal, determining the target audio signal input format to be a scene-based signal input format.
  • when the audio scene type indicates that the audio scene includes at least one main audio signal, the target audio signal input format may be determined to be an object-based signal input format.
  • when the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal, the target audio signal input format may be determined to be an object-based signal input format and a scene-based signal input format.
  • when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal, the audio signal can be determined to be a mixed format audio signal, including input format audio signals of at least two audio signal input formats.
  • the audio scene type represents that the audio scene includes at least one main audio signal and a background audio signal
  • the audio signal is determined to be a mixed format audio signal
  • the target audio signal input format is an object-based signal input format, and the input format is at least two of a mono-based format signal, a stereo-based format signal, a MASA-based format signal, a channel-based format signal, an FOA-based format signal, and an HOA-based format signal.
  • when the audio scene type indicates that the audio scene includes at least one background audio signal, the target audio signal input format may be determined to be a scene-based signal input format.
  • determining the target input format audio signal according to the target audio signal input format and the audio signal includes:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is an object-based audio signal among the audio signals;
  • determining the target input format audio signal is an object-based audio signal and a scene-based audio signal among the audio signals
  • the target input format audio signal is determined to be a scene-based audio signal among the audio signals.
  • when it is determined that the target audio signal input format is an object-based signal input format, the target input format audio signal is determined to be an object-based audio signal among the audio signals.
  • when it is determined that the target audio signal input format is an object-based signal input format and a scene-based signal input format, the target input format audio signal may be determined to be an object-based audio signal and a scene-based audio signal among the audio signals.
  • when it is determined that the target audio signal input format is a scene-based signal input format, the target input format audio signal may be determined to be a scene-based audio signal among the audio signals.
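Selecting the target input format audio signal(s) from the components of the captured audio, per the cases above, can be sketched as follows; the component dictionary and function name are hypothetical.

```python
def pick_target_signals(components: dict, target_formats: set) -> dict:
    """components maps a format name (e.g. 'object', 'scene', 'channel')
    to that component of the captured audio; keep only the targeted ones."""
    return {fmt: sig for fmt, sig in components.items() if fmt in target_formats}
```

For instance, when the target input format is object-based plus scene-based, only those two components of the captured audio are passed on to encoding.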
  • S30: determine the target encoding core according to the target input format audio signal, or determine the target encoding core according to the audio scene type and the target input format audio signal.
  • the target signal encoding core is determined according to the target input format audio signal, including:
  • in response to determining that the target input format audio signal is an object-based audio signal, the target encoding core for the object audio data in the object-based audio signal is determined to be the object audio data encoding core, and the target encoding core for the metadata in the object-based audio signal is determined to be the object metadata parameter encoding core;
  • determining the target encoding core to be the scene audio data encoding core
  • determining the target encoding core to be the channel audio data encoding core
  • the target encoding core based on the audio data in the MASA audio signal is the scene audio data encoding core, and the target based on the spatial auxiliary metadata in the MASA audio signal is determined.
  • the encoding core is the spatial auxiliary metadata parameter encoding core;
  • the target encoding core is determined to be the encoding core selected for the format audio signal in the mixed format audio signal, where the mixed format signal includes an object-based audio signal, a scene-based audio signal, and a sound-based audio signal. at least two of the channel audio signal and the MASA-based audio signal.
  • When it is determined that the target input format audio signal is an object-based audio signal, the target encoding core for the object audio data in the object-based audio signal may be determined to be the object audio data encoding core, and the target encoding core for the metadata in the object-based audio signal may be determined to be the object metadata parameter encoding core.
  • When it is determined that the target input format audio signal is a scene-based audio signal, the target encoding core may be determined to be the scene audio data encoding core.
  • When it is determined that the target input format audio signal is a channel-based audio signal, the target encoding core may be determined to be the channel audio data encoding core.
  • When it is determined that the target input format audio signal is a metadata-assisted spatial audio (MASA) signal, which includes audio data and spatial auxiliary metadata, the target encoding core for the audio data in the MASA audio signal may be determined to be the scene audio data encoding core, and the target encoding core for the spatial auxiliary metadata in the MASA audio signal may be determined to be the spatial auxiliary metadata parameter encoding core.
  • When it is determined that the target input format audio signal is a mixed format audio signal, the target encoding core may be determined to be the encoding core selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
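The per-format core assignments above amount to a lookup table, with the mixed-format case applying that table per constituent signal. The core and format names in the sketch below are illustrative stand-ins (the text names the cores but defines no concrete identifiers):

```python
def select_encoding_cores(signal_format):
    """Return the target encoding core(s) for one input-format signal.

    Labels are assumptions for illustration; "masa" stands for
    metadata-assisted spatial audio.
    """
    table = {
        "object_based": {"audio_data": "object_audio_data_core",
                         "metadata": "object_metadata_parameter_core"},
        "scene_based": {"audio_data": "scene_audio_data_core"},
        "channel_based": {"audio_data": "channel_audio_data_core"},
        # MASA: audio data goes through the scene core; its spatial
        # auxiliary metadata gets a dedicated parameter core.
        "masa": {"audio_data": "scene_audio_data_core",
                 "spatial_metadata": "spatial_auxiliary_metadata_parameter_core"},
    }
    return table[signal_format]


def select_cores_for_mixed(component_formats):
    """Mixed format: pick a core set for each constituent format signal."""
    return {fmt: select_encoding_cores(fmt) for fmt in component_formats}
```

This table form makes the rule from the text explicit: only the object-based and MASA cases split one input signal across two encoding cores.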
  • the target encoding core is determined according to the audio scene type and the target input format audio signal, including:
  • in response to the audio scene type representing that the audio scene includes at least one main audio signal and/or a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core;
  • in response to the audio scene type representing that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core;
  • in response to the audio scene type representing that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
  • When the audio scene type represents that the audio scene includes at least one main audio signal and/or a background audio signal, and the target input format audio signal is an object-based audio signal and a scene-based audio signal, it can be determined that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
  • When the audio scene type represents that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal is an object-based audio signal and a scene-based audio signal, it can likewise be determined that the target encoding core for the object audio data is the object audio data encoding core, that the target encoding core for the metadata is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
  • When the audio scene type represents that the audio scene includes only at least one background audio signal, and the target input format audio signal is a scene-based audio signal, it can be determined that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
  • S40: Encode the target input format audio signal according to the target encoding core to obtain encoding parameters.
  • S50: Perform code stream multiplexing according to the encoding parameters to generate the target encoded code stream.
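Steps S40 and S50 can be sketched end to end as "encode each component signal with its core, then multiplex the encoding parameters into one stream". The JSON payload and 4-byte length header below are placeholder choices for illustration, not the codec's actual bitstream syntax:

```python
import json
import struct


def encode_and_multiplex(target_signals):
    """Toy S40/S50 pipeline: produce per-component encoding parameters,
    then multiplex them into a single length-prefixed byte stream.

    A real codec would emit quantized parametric data per encoding core
    rather than a JSON payload.
    """
    encoding_params = {
        fmt: {"core": f"{fmt}_core", "payload": samples}
        for fmt, samples in target_signals.items()
    }
    body = json.dumps(encoding_params, sort_keys=True).encode("utf-8")
    # Minimal multiplexing: big-endian 4-byte length prefix + body.
    return struct.pack(">I", len(body)) + body


def demultiplex(stream):
    """Inverse of the sketch above, for round-trip checking."""
    (length,) = struct.unpack(">I", stream[:4])
    return json.loads(stream[4:4 + length].decode("utf-8"))
```

The round trip makes the point of S50 concrete: the decoder must be able to recover, per component, which encoding core produced which parameters.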
  • In this way, the audio signal is acquired, the audio scene type corresponding to the audio signal is determined, and the target input format audio signal is determined according to the audio scene type and/or the audio signal. When the target input format audio signal is an object-based audio signal, the target encoding core for the object audio data in the object-based audio signal is determined to be the object audio data encoding core, and the target encoding core for the metadata in the object-based audio signal is determined to be the object metadata parameter encoding core. The code stream is then multiplexed to obtain the target encoded code stream. This ensures that the selected audio format signal can better represent the audio scene of the local user, so that the remote user can obtain the audio scene information of the audio scene where the local user is located, improving the user experience.
  • Similarly, the audio signal is acquired, the audio scene type corresponding to the audio signal is determined, and the target input format audio signal is determined according to the audio scene type and/or the audio signal. When the target input format audio signal is a MASA-based audio signal, the target encoding core for the audio data in the MASA audio signal is determined to be the scene audio data encoding core, and the target encoding core for the spatial auxiliary metadata in the MASA audio signal is determined to be the spatial auxiliary metadata parameter encoding core, where the spatial auxiliary metadata in the MASA audio signal is optional. The code stream is then multiplexed to obtain the target encoded code stream, ensuring that the selected audio format signal can better characterize the audio scene of the local user, so that the remote user can obtain the audio scene information of the audio scene where the local user is located, improving the user experience.
  • Figure 6 is a structural diagram of an audio signal encoding device provided by an embodiment of the present disclosure.
  • the audio signal encoding device 1 includes: a signal acquisition unit 11 , a type determination unit 12 , a target signal determination unit 13 and an encoding processing unit 14 .
  • the signal acquisition unit 11 is configured to acquire audio signals.
  • the type determining unit 12 is configured to determine the audio scene type corresponding to the audio signal.
  • the target signal determining unit 13 is configured to determine the target input format audio signal according to the audio scene type and the audio signal.
  • the encoding processing unit 14 is configured to encode the target input format audio signal and generate a target encoded code stream.
  • the type determination unit 12 includes: a parameter acquisition module 121 and a model processing module 122.
  • the parameter acquisition module 121 is configured to acquire audio characteristic parameters of the audio signal.
  • the model processing module 122 is configured to input the audio feature parameters into the audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
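The parameter acquisition module 121 and model processing module 122 together can be approximated as feature extraction followed by classification. The toy features (RMS energy, zero-crossing rate) and the threshold rule below are stand-ins for the unspecified audio scene type analysis model, which in practice would be a trained classifier:

```python
import math


def audio_feature_parameters(frame):
    """Toy feature extraction: RMS energy and zero-crossing rate of one
    frame of samples (stand-in for the module's real feature set)."""
    n = len(frame)
    rms = math.sqrt(sum(x * x for x in frame) / n)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / max(n - 1, 1)
    return {"rms": rms, "zcr": zcr}


def analyze_scene_type(features, rms_threshold=0.05):
    """Threshold stand-in for the scene type analysis model: an
    energetic frame is taken to contain a main (foreground) signal, a
    quiet one only background audio. Labels are illustrative."""
    return "main_audio_present" if features["rms"] > rms_threshold else "background_only"
```

The resulting scene label is what the target format determination module 131 consumes downstream.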
  • the target signal determination unit 13 includes: a target format determination module 131 and a target signal determination module 132 .
  • the target format determining module 131 is configured to determine the target audio signal input format according to the audio scene type and/or the audio signal.
  • the target signal determining module 132 is configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
  • the target format determination module 131 is specifically configured as:
  • in response to the audio scene type representing that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;
  • in response to the audio scene type representing that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
  • in response to the audio scene type representing that the audio scene includes only at least one background audio signal, determining the target audio signal input format to be a scene-based signal input format.
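The three branches configured in the target format determination module 131 reduce to a simple mapping. The boolean scene descriptors and format labels below are assumed names for illustration only:

```python
def determine_input_format(has_main_audio, has_background_audio):
    """Map the analyzed audio scene type to the target input format(s)."""
    if has_main_audio and has_background_audio:
        # Dominant sources over ambience: carry objects plus a scene bed.
        return ["object_based", "scene_based"]
    if has_main_audio:
        # Only dominant sources: an object-based format suffices.
        return ["object_based"]
    # Only background/ambient audio remains: scene-based format.
    return ["scene_based"]
```

This is the point where the encoder adapts to a changed audio scene instead of keeping the format negotiated at call setup.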
  • the target signal determination module 132 is specifically configured as:
  • in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is an object-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal among the audio signals;
  • in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is a scene-based audio signal among the audio signals.
  • the encoding processing unit 14 includes: an encoding core determination module 141, a parameter acquisition module 142, and a code stream generation module 143.
  • the encoding core determination module 141 is configured to determine the target encoding core according to the target input format audio signal, or to determine the target encoding core according to the audio scene type and the target input format audio signal.
  • the parameter acquisition module 142 is configured to encode the target input format audio signal according to the target encoding core to obtain encoding parameters.
  • the code stream generation module 143 is configured to perform code stream multiplexing according to encoding parameters and generate a target code stream.
  • the encoding core determination module 141 is specifically configured as:
  • in response to the target input format audio signal being an object-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, and that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core;
  • in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core is the scene audio data encoding core;
  • in response to the target input format audio signal being a channel-based audio signal, determining that the target encoding core is the channel audio data encoding core;
  • in response to the target input format audio signal being a metadata-assisted spatial audio (MASA) signal, determining that the target encoding core for the audio data in the MASA audio signal is the scene audio data encoding core, and that the target encoding core for the spatial auxiliary metadata in the MASA audio signal is the spatial auxiliary metadata parameter encoding core;
  • in response to the target input format audio signal being a mixed format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
  • the encoding core determination module 141 is specifically configured as:
  • in response to the audio scene type representing that the audio scene includes at least one main audio signal and/or a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and that the target encoding core for the scene-based audio signal is the scene audio data encoding core;
  • in response to the audio scene type representing that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
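A minimal skeleton of the device of Figure 6 might wire the four units as callables. All unit implementations here are caller-supplied stubs, since the concrete algorithms live in the method embodiments above:

```python
class AudioSignalEncodingDevice:
    """Skeleton mirroring Figure 6: signal acquisition unit 11, type
    determination unit 12, target signal determination unit 13, and
    encoding processing unit 14, each supplied as a callable."""

    def __init__(self, acquire, determine_type, determine_target, encode):
        self.acquire = acquire                     # unit 11
        self.determine_type = determine_type       # unit 12
        self.determine_target = determine_target   # unit 13
        self.encode = encode                       # unit 14

    def run(self):
        audio = self.acquire()
        scene_type = self.determine_type(audio)
        target = self.determine_target(scene_type, audio)
        return self.encode(target)
```

Wiring the units with stub callables yields a runnable end-to-end pass through the acquire → classify → select → encode pipeline.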
  • the audio signal encoding device provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments, and its beneficial effects are the same as those of the audio signal encoding method described above, which will not be described again here.
  • FIG. 10 is a structural diagram of an electronic device 100 for performing an audio signal encoding method according to an exemplary embodiment.
  • the electronic device 100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
  • the electronic device 100 may include one or more of the following components: a processing component 101, a memory 102, a power supply component 103, a multimedia component 104, an audio component 105, an input/output (I/O) interface 106, a sensor component 107, and a communication component 108.
  • the processing component 101 generally controls the overall operations of the electronic device 100, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 101 may include one or more processors 1011 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 101 may include one or more modules that facilitate interaction between processing component 101 and other components. For example, processing component 101 may include a multimedia module to facilitate interaction between multimedia component 104 and processing component 101 .
  • Memory 102 is configured to store various types of data to support operations at electronic device 100 . Examples of such data include instructions for any application or method operating on the electronic device 100, contact data, phonebook data, messages, pictures, videos, etc.
  • the memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as SRAM (Static Random-Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • Power supply component 103 provides power to various components of electronic device 100 .
  • Power supply components 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 100 .
  • Multimedia component 104 includes a touch-sensitive display screen that provides an output interface between the electronic device 100 and the user.
  • the touch display screen may include LCD (Liquid Crystal Display) and TP (Touch Panel).
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • multimedia component 104 includes a front-facing camera and/or a rear-facing camera. When the electronic device 100 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data.
  • Each front-facing camera and rear-facing camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 105 is configured to output and/or input audio signals.
  • the audio component 105 includes a MIC (Microphone), and when the electronic device 100 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signals may be further stored in memory 102 or sent via communications component 108 .
  • audio component 105 also includes a speaker for outputting audio signals.
  • the I/O interface 106 provides an interface between the processing component 101 and peripheral interface modules.
  • the peripheral interface module may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.
  • Sensor component 107 includes one or more sensors for providing various aspects of status assessment for electronic device 100 .
  • the sensor component 107 can detect the on/off state of the electronic device 100 and the relative positioning of components, such as the display and keypad of the electronic device 100. The sensor component 107 can also detect a change in the position of the electronic device 100 or a component of the electronic device 100, the presence or absence of user contact with the electronic device 100, the orientation or acceleration/deceleration of the electronic device 100, and a change in the temperature of the electronic device 100.
  • Sensor assembly 107 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor component 107 may also include a light sensor, such as a CMOS (Complementary Metal-Oxide-Semiconductor) or CCD (Charge-Coupled Device) image sensor, for use in imaging applications.
  • the sensor component 107 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 108 is configured to facilitate wired or wireless communication between electronic device 100 and other devices.
  • the electronic device 100 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 108 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 108 also includes an NFC (Near Field Communication) module to facilitate short-range communication.
  • the NFC module can be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra-Wide Band) technology, BT (Bluetooth) technology, and other technologies.
  • In an exemplary embodiment, the electronic device 100 may be implemented by one or more ASICs (Application-Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field-Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components to perform the above audio signal encoding method.
  • the electronic device 100 provided by the embodiments of the present disclosure can perform the audio signal encoding method as described in some of the above embodiments, and its beneficial effects are the same as those of the audio signal encoding method described above, which will not be described again here.
  • In an exemplary embodiment, the present disclosure also proposes a storage medium having instructions stored thereon.
  • When the instructions in the storage medium are executed by a processor of an electronic device, the electronic device can perform the audio signal encoding method as described above.
  • the storage medium can be a ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage device, etc.
  • the present disclosure also provides a computer program product.
  • When the computer program is executed by a processor of an electronic device, the electronic device can perform the audio signal encoding method as described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

An audio signal encoding method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring an audio signal (S1); determining an audio scene type corresponding to the audio signal (S2); determining an audio signal of a target input format according to the audio scene type and the audio signal (S3); and encoding the audio signal of the target input format to generate a target encoded code stream (S4). This ensures that the selected audio format signal better characterizes the audio scene of a local user, so that a remote user can obtain the audio scene information of the audio scene where the local user is located, improving the user experience.

Description

Audio signal encoding method, device, electronic equipment and storage medium

Technical field

The present disclosure relates to the field of communication technology, and in particular to an audio signal encoding method, device, electronic equipment and storage medium.

Background art

In related technologies, users negotiate an audio format when establishing voice communication, and the negotiated audio format is used throughout the voice communication to transmit the local user's audio signal in that format to the remote user. However, the audio scene of the user's voice communication may change, and the audio signal in the negotiated format may then fail to provide the remote user with the real audio scene information of the local user in the changed audio scene, resulting in a poor user experience. This is a problem that urgently needs to be solved.
Summary of the invention

Embodiments of the present disclosure provide an audio signal encoding method, device, electronic device and storage medium, which enable a remote user to obtain the audio scene information of the audio scene where the local user is located, improving the user experience.

In a first aspect, embodiments of the present disclosure provide an audio signal encoding method, which includes: acquiring an audio signal; determining an audio scene type corresponding to the audio signal; determining a target input format audio signal according to the audio scene type and the audio signal; and encoding the target input format audio signal to generate a target encoded code stream.

In this technical solution, the audio signal is acquired; the audio scene type corresponding to the audio signal is determined; the target input format audio signal is determined according to the audio scene type and the audio signal; and the target input format audio signal is encoded to generate the target encoded code stream. This ensures that the selected audio format signal can better represent the local user's audio scene, allowing the remote user to obtain the audio scene information of the audio scene where the local user is located, thereby improving the user experience.
In some embodiments, determining the audio scene type corresponding to the audio signal includes: obtaining audio characteristic parameters of the audio signal; and inputting the audio characteristic parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.

In some embodiments, determining the target input format audio signal according to the audio scene type and the audio signal includes: determining the target audio signal input format according to the audio scene type and/or the audio signal; and determining the target input format audio signal according to the target audio signal input format and the audio signal.
Determining the target audio signal input format according to the audio scene type and the audio signal includes:

in response to the audio scene type representing that the audio scene includes at least one main audio signal, determining the target audio signal input format to be an object-based signal input format;

in response to the audio scene type representing that the audio scene includes at least one main audio signal and a background audio signal, determining the target audio signal input format to be an object-based signal input format and a scene-based signal input format;

in response to the audio scene type representing that the audio scene includes only at least one background audio signal, determining the target audio signal input format to be a scene-based signal input format.
In some embodiments, determining the target input format audio signal according to the target audio signal input format and the audio signal includes:

in response to determining that the target audio signal input format is an object-based signal input format, determining that the target input format audio signal is an object-based audio signal among the audio signals;

in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal among the audio signals;

in response to determining that the target audio signal input format is a scene-based signal input format, determining that the target input format audio signal is a scene-based audio signal among the audio signals.
In some embodiments, encoding the target input format audio signal to generate the target encoded code stream includes:

determining a target encoding core according to the target input format audio signal, or determining a target encoding core according to the audio scene type and the target input format audio signal;

encoding the target input format audio signal according to the target encoding core to obtain encoding parameters;

performing code stream multiplexing according to the encoding parameters to generate the target encoded code stream.
In some embodiments, determining the target encoding core according to the target input format audio signal includes:

in response to the target input format audio signal being an object-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, and that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core;

in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core is a scene audio data encoding core;

in response to the target input format audio signal being a channel-based audio signal, determining that the target encoding core is a channel audio data encoding core;

in response to the target input format audio signal being a metadata-assisted spatial audio (MASA) signal, determining that the target encoding core for the audio data in the MASA audio signal is a scene audio data encoding core, and that the target encoding core for the spatial auxiliary metadata in the MASA audio signal is a spatial auxiliary metadata parameter encoding core;

in response to the target input format audio signal being a mixed format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
In some embodiments, determining the target encoding core according to the audio scene type and the target input format audio signal includes:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and/or a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, determining that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core, and determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, determining that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core, and determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core.
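The scene-type-driven branches above can be condensed into a small helper. This is a hedged sketch of the claimed logic only; the core names are the same illustrative placeholders used throughout, not real API identifiers:

```python
# Illustrative sketch of choosing encoding cores from the audio scene
# composition, per the branches above. Core names are assumptions.

def cores_for_scene(has_main, has_background):
    """Return the encoding cores for the claimed scene compositions."""
    if has_main:
        # Main signal (with or without background): object audio data and
        # object metadata cores, plus the scene core for the scene component.
        return ["object_audio_data_core",
                "object_metadata_param_core",
                "scene_audio_data_core"]
    if has_background:
        # Background only: a single scene audio data core suffices.
        return ["scene_audio_data_core"]
    raise ValueError("scene must contain a main and/or a background signal")
```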
In a second aspect, embodiments of the present disclosure provide an audio signal encoding apparatus, including: a signal acquisition unit configured to acquire an audio signal; a type determination unit configured to determine an audio scene type corresponding to the audio signal; a target signal determination unit configured to determine a target input format audio signal according to the audio scene type and the audio signal; and an encoding processing unit configured to encode the target input format audio signal to generate a target encoded bitstream.
The type determination unit includes:
a parameter acquisition module configured to acquire audio feature parameters of the audio signal;
a model processing module configured to input the audio feature parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
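The two modules above form a feature-extraction-plus-classifier flow. The sketch below uses toy energy-based features and a threshold rule as a stand-in for whatever feature parameters and trained analysis model an actual implementation uses; every feature, threshold, and label here is an assumption for illustration only:

```python
# Minimal sketch of the type-determination flow: extract audio feature
# parameters, then feed them to a scene-type analysis model. The features
# and the threshold "model" are placeholders, not the disclosed design.

def extract_features(frame):
    """Toy feature vector: mean absolute level and a crude peak/mean ratio."""
    mean_abs = sum(abs(s) for s in frame) / len(frame)
    peak = max(abs(s) for s in frame)
    return {"level": mean_abs, "crest": peak / mean_abs if mean_abs else 0.0}

def scene_type_model(features):
    """Stand-in for the audio scene type analysis model."""
    if features["crest"] > 3.0:
        return "main_plus_background"   # a dominant foreground source
    if features["level"] > 0.01:
        return "background_only"        # diffuse ambience
    return "quiet"

# A frame with one strong transient classifies as having a main signal.
frame = [0.0, 0.01, 0.9, 0.01, 0.0, -0.01, -0.9, -0.01]
scene = scene_type_model(extract_features(frame))
```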
In some embodiments, the target signal determination unit includes: a target format determination module configured to determine a target audio signal input format according to the audio scene type and/or the audio signal; and a target signal determination module configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
In some embodiments, the target format determination module is specifically configured to:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determine the target audio signal input format to be an object-based signal input format;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determine the target audio signal input format to be an object-based signal input format and a scene-based signal input format;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, determine the target audio signal input format to be a scene-based signal input format.
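The three branches above amount to a mapping from scene composition to the set of target input formats. A minimal sketch, with format labels invented for illustration:

```python
# Sketch of the scene-composition -> input-format mapping described above.
# The format label strings are illustrative, not standardized identifiers.

def target_input_formats(has_main, has_background):
    if has_main and has_background:
        return ["object_based", "scene_based"]
    if has_main:
        return ["object_based"]
    if has_background:
        return ["scene_based"]
    raise ValueError("audio scene contains no signal")
```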
In some embodiments, the target signal determination module is specifically configured to:
in response to determining that the target audio signal input format is an object-based signal input format, determine the target input format audio signal to be the object-based audio signal among the audio signals;
in response to determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format, determine the target input format audio signal to be the object-based audio signal and the scene-based audio signal among the audio signals;
in response to determining that the target audio signal input format is a scene-based signal input format, determine the target input format audio signal to be the scene-based audio signal among the audio signals.
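Selecting the target input format audio signal then reduces to filtering the captured signals by the chosen format(s). A sketch under the assumption that the captured signals can be indexed by format label:

```python
# Sketch: keep only the captured signals whose format matches the
# determined target input format(s). The dict representation is assumed.

def pick_target_signals(signals, target_formats):
    """`signals` maps a format label to its signal payload (placeholder)."""
    return {fmt: signals[fmt] for fmt in target_formats if fmt in signals}

def is_mixed_format(target_formats):
    """Two or more selected formats constitute a mixed-format signal."""
    return len(target_formats) >= 2
```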
In some embodiments, the encoding processing unit includes: an encoding core determination module configured to determine a target encoding core according to the target input format audio signal, or according to the audio scene type and the target input format audio signal; a parameter acquisition module configured to encode the target input format audio signal according to the target encoding core to obtain encoding parameters; and a bitstream generation module configured to perform bitstream multiplexing according to the encoding parameters to generate the target encoded bitstream.
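The three modules above chain together as encode-then-multiplex. The sketch below invents a trivial byte layout (a 1-byte core id plus a 2-byte big-endian length prefix per part) purely to make the multiplexing step concrete; neither the layout nor the core ids come from this disclosure:

```python
# Illustrative encode-and-multiplex chain. The core-id table and the
# 1-byte-id + 2-byte-length framing are invented for this sketch.
import struct

CORE_IDS = {"object_audio_data_core": 0, "object_metadata_param_core": 1,
            "scene_audio_data_core": 2, "channel_audio_data_core": 3}

def encode_with_core(core, payload):
    """Stand-in encoder: tag the (already 'encoded') payload with its core id."""
    return CORE_IDS[core], payload

def multiplex(encoded_parts):
    """Concatenate (core_id, payload) parts into a single bitstream."""
    stream = b""
    for core_id, payload in encoded_parts:
        stream += struct.pack(">BH", core_id, len(payload)) + payload
    return stream

parts = [encode_with_core("object_audio_data_core", b"\x01\x02"),
         encode_with_core("scene_audio_data_core", b"\x03")]
bitstream = multiplex(parts)
```

The length prefix lets a decoder walk the stream part by part and hand each payload to the matching decoding core.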
In some embodiments, the encoding core determination module is specifically configured to:
in response to the target input format audio signal being an object-based audio signal, determine that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, and determine that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core;
in response to the target input format audio signal being a scene-based audio signal, determine that the target encoding core is a scene audio data encoding core;
in response to the target input format audio signal being a channel-based audio signal, determine that the target encoding core is a channel audio data encoding core;
in response to the target input format audio signal being a metadata-assisted spatial audio (MASA) signal, determine that the target encoding core for the audio data in the MASA-based audio signal is a scene audio data encoding core, and determine that the target encoding core for the spatial auxiliary metadata in the MASA-based audio signal is a spatial auxiliary metadata parameter encoding core;
in response to the target input format audio signal being a mixed-format audio signal, determine that the target encoding core is the encoding core selected for each format audio signal in the mixed-format audio signal, wherein the mixed-format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA-based audio signal.
In some embodiments, the encoding core determination module is specifically configured to:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determine that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, determine that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core, and determine that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determine that the target encoding core for the object audio data in the object-based audio signal is an object audio data encoding core, determine that the target encoding core for the metadata in the object-based audio signal is an object metadata parameter encoding core, and determine that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determine that the target encoding core for the scene-based audio signal is a scene audio data encoding core.
In a third aspect, embodiments of the present disclosure provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method described in the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program product, including computer instructions, wherein the computer instructions, when executed by a processor, implement the method described in the first aspect.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Description of the Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the background art, the drawings needed in the embodiments or the background art are described below.
Figure 1 is a flowchart of an audio signal encoding method provided by an embodiment of the present disclosure;
Figure 2 is a flowchart of another audio signal encoding method provided by an embodiment of the present disclosure;
Figure 3 is a flowchart of yet another audio signal encoding method provided by an embodiment of the present disclosure;
Figure 4 is a flowchart of yet another audio signal encoding method provided by an embodiment of the present disclosure;
Figure 5 is a flowchart of yet another audio signal encoding method provided by an embodiment of the present disclosure;
Figure 6 is a structural diagram of an audio signal encoding apparatus provided by an embodiment of the present disclosure;
Figure 7 is a structural diagram of a type determination unit in the audio signal encoding apparatus provided by an embodiment of the present disclosure;
Figure 8 is a structural diagram of a target signal determination unit in the audio signal encoding apparatus provided by an embodiment of the present disclosure;
Figure 9 is a structural diagram of an encoding processing unit in the audio signal encoding apparatus provided by an embodiment of the present disclosure;
Figure 10 is a structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below in conjunction with the accompanying drawings.
Unless the context requires otherwise, throughout the specification and claims the term "comprise" is to be interpreted in an open, inclusive sense, that is, "including, but not limited to". In the description of this specification, terms such as "some embodiments" are intended to indicate that a particular feature, structure, material, or characteristic associated with the embodiment or example is included in at least one embodiment or example of the present disclosure. Schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. The terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more of that feature. It should be understood that the data so used are interchangeable where appropriate, so that the embodiments of the disclosure described herein can be practiced in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, "at least one" may also be described as one or more, and "multiple" may be two, three, four, or more, which the present disclosure does not limit. In the embodiments of the present disclosure, technical features of the same kind may be distinguished by "first", "second", "third", "A", "B", "C", "D", and so on; the technical features described by these labels have no order of precedence or magnitude.
The correspondences shown in the tables of the present disclosure may be configured or predefined. The values of the information in the tables are merely examples and may be configured as other values, which the present disclosure does not limit. When configuring the correspondence between information and parameters, it is not necessarily required to configure all the correspondences shown in the tables. For example, the correspondences shown in some rows of a table in the present disclosure may not be configured. For another example, appropriate adjustments such as splitting and merging may be made based on the tables. The names of the parameters shown in the titles of the tables may also be other names understandable by the communication apparatus, and the values or representations of the parameters may also be other values or representations understandable by the communication apparatus. The tables may also be implemented with other data structures, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, or hash tables.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
The first generation of mobile communication technology (1G) was the first generation of wireless cellular technology, based on analog mobile communication networks. The upgrade from 1G to 2G moved handsets from analog to digital communication. The 3G mobile communication system was proposed by the International Telecommunication Union (ITU) for international mobile communications in 2000. 4G is a further improvement on 3G: both data and voice are carried over all-IP networks, providing real-time HD+Voice services for speech and audio, and the EVS codec it adopts achieves high-quality compression of both speech and audio.
The voice and audio communication services described above have expanded from narrowband to super-wideband and even full-band services, but they are still mono services. Demand for high-quality audio keeps growing; compared with mono audio, stereo audio provides a sense of orientation and distribution for each sound source and improves clarity.
With increases in transmission bandwidth, upgrades to the signal acquisition devices of terminal equipment, improvements in signal processor performance, and upgrades to terminal playback devices, three signal formats (channel-based multi-channel audio signals, object-based audio signals, and scene-based audio signals) can provide three-dimensional audio services. The immersive voice and audio services (IVAS) codec being standardized by 3GPP SA4 (Third Generation Partnership Project) can support the encoding and decoding requirements of these three signal formats. Terminal devices that can support three-dimensional audio services include mobile phones, computers, tablets, conference system equipment, augmented reality (AR)/virtual reality (VR) devices, cars, and so on.
In the embodiments of the present disclosure, the audio scene in which the local user is located may change constantly. For example, suppose the local user is communicating in real time with a remote user in a quiet, open outdoor place. If the local user's terminal device selects the mono signal format as the input signal format, the local user's audio scene can be conveyed well to the remote user. During a certain period a bird may fly past; the bird's call is an important audio element of the current audio scene that cannot be ignored, and if the previous mono audio signal format is still used, the bird's call cannot be conveyed well to the remote user. In this case, a mixed signal format superimposing the mono audio signal format (the local user's voice) and an object format signal (the bird's call, including the bird's call audio signal and the bird's flight trajectory) is the optimal signal format for the current audio scene. Later, when the local user walks to a noisy open-air square, a scene-based signal format should be selected, such as an FOA/HOA format signal, or a spatial audio signal format such as a metadata-assisted spatial audio (MASA) format signal. When the local user goes to a venue where a symphony is playing and professional channel recording equipment is available there, a channel signal format such as a 5.1/5.1.4/7.1/7.1.4 format signal can be selected. In the embodiments of the present disclosure, the local user's terminal device can analyze the audio scene in which the local user is located in real time and use the obtained audio scene type to guide the audio signal generator to output the optimal audio format signal. This ensures that the selected audio format signal better characterizes the local user's audio scene, so that the remote user can fully obtain the audio scene information of the scene in which the local user is located, improving the user experience.
Please refer to Figure 1, which is a flowchart of an audio signal encoding method provided by an embodiment of the present disclosure.
As shown in Figure 1, the method may include, but is not limited to, the following steps:
S1: Acquire an audio signal.
It can be understood that when the local user establishes voice communication with any remote user, the local user can establish voice communication with the remote user's terminal device through his or her own terminal device, where the local user's terminal device can acquire the sound information of the environment in which the local user is located in real time to obtain an audio signal.
The sound information of the environment in which the local user is located includes the sound information emitted by the local user, the sound information of surrounding things, and so on. Sound information of surrounding things includes, for example, the sound of moving vehicles, bird calls, wind, voices of other users near the local user, and so on.
It should be noted that a terminal device is an entity on the user side for receiving or transmitting signals, such as a mobile phone, computer, tablet, watch, walkie-talkie, conference system device, augmented reality (AR)/virtual reality (VR) device, car, and so on. A terminal device may also be called user equipment (UE), a mobile station (MS), a mobile terminal (MT), and so on. The terminal device may be a car with communication functions, a smart car, a mobile phone, a wearable device, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal device in industrial control, a wireless terminal device in self-driving, a wireless terminal device in remote medical surgery, a wireless terminal device in a smart grid, a wireless terminal device in transportation safety, a wireless terminal device in a smart city, a wireless terminal device in a smart home, and so on. The embodiments of the present disclosure do not limit the specific technology and specific device form adopted by the terminal device.
In the embodiments of the present disclosure, the local user's terminal device may acquire the audio signal through a recording device, such as a microphone, provided in or cooperating with the terminal device, to capture the sound information of the environment in which the local user is located and further generate the audio signal.
S2: Determine the audio scene type corresponding to the audio signal.
In the embodiments of the present disclosure, on the basis of acquiring the audio signal, the acquired audio signal may be analyzed to obtain the audio scene type corresponding to the audio signal.
It can be understood that the audio signal may include sound information emitted by the local user and/or sound information of surrounding things. In the embodiments of the present disclosure, the audio scene type corresponding to the audio signal may be determined according to the content included in the audio signal.
In a possible implementation, audio scene types include, for example, offices, theaters, car interiors, train stations, large shopping malls, and so on. In the embodiments of the present disclosure, for each audio scene type, an input format audio signal of one audio signal input format may be selected, or input format audio signals of multiple audio signal input formats may be selected.
For example, as shown in Table 1 below:
Figure PCTCN2022092082-appb-000001
Table 1
It can be understood that the above example is only illustrative. Audio scene types may also be divided in other ways, or in combination with other ways. For example, audio scene types may be divided according to whether the audio signal includes only at least one main audio signal, includes at least one main audio signal and a background audio signal, or includes only at least one background audio signal. Audio scene types may be set in advance as needed, and the embodiments of the present disclosure do not impose specific restrictions on this.
In the embodiments of the present disclosure, for each audio scene type, input format audio signals of one or more audio signal input formats may be selected. The selected audio signal input format may also be determined according to the manner in which the audio signal is acquired, for example, according to the number of channels of the audio signal and the spatial layout relationship.
S3: Determine the target input format audio signal according to the audio scene type and the audio signal.
In the embodiments of the present disclosure, when the audio signal is acquired and the audio scene type corresponding to the audio signal is determined, the target input format audio signal may further be determined according to the audio scene type and the audio signal. When the audio signal includes audio signals of one or more input formats, the target input format audio signal is determined, according to the audio scene type and the audio signal, to be the audio signal(s) of one or more of those input formats.
It can be understood that the audio signal may include audio signals of one or more input formats. In the embodiments of the present disclosure, a correspondence between audio scene types and input format audio signals may be preset, and when the audio scene type is determined, the target input format audio signal is determined according to the audio scene type and the correspondence.
In the embodiments of the present disclosure, the target input format audio signal among the audio signals may also be determined according to the audio scene type, the correspondence, and the input format audio signals included in the audio signal.
S4: Encode the target input format audio signal to generate the target encoded bitstream.
In the embodiments of the present disclosure, when the target input format audio signal is determined, in order to communicate with the remote user, the target input format audio signal is to be sent to the remote user's terminal device. First, the target input format audio signal needs to be encoded to generate the target encoded bitstream, which is sent to the remote user's terminal device. After receiving the target encoded bitstream, the remote user's terminal device decodes it to obtain the local user's sound information.
为了方便理解,本公开实施例提供一示例性实施例:To facilitate understanding, the embodiment of the present disclosure provides an exemplary embodiment:
如图2所示,本公开实施例中,获取音频信号(音频场景信号),对音频信号的音频场景进行分析,确定音频场景类型,根据音频场景类型和音频信号,输入至音频信号格式生成器,获取目标输入格式音频信号(音频格式信号)。As shown in Figure 2, in the embodiment of the present disclosure, an audio signal (audio scene signal) is obtained, the audio scene of the audio signal is analyzed, the audio scene type is determined, and the audio signal format generator is input according to the audio scene type and the audio signal. , obtain the target input format audio signal (audio format signal).
其中,根据目标输入格式音频信号和/或音频场景类型,确定使用的目标编码核,示例性地,在确定目标输入格式音频信号为基于对象音频信号的情况下,确定使用的目标编码核为对象音频数据编码核;在确定目标输入格式音频信号为基于场景音频信号的情况下,确定使用的目标编码核为场景音频数据编码核;在确定目标输入格式音频信号为基于对象音频信号和基于场景音频信号的情况下,确定使用的目标编码核为对象音频数据编码核和场景音频数据编码核,等等。Wherein, the target encoding core used is determined according to the target input format audio signal and/or the audio scene type. For example, when the target input format audio signal is determined to be an object-based audio signal, the target encoding core used is determined to be the object. Audio data encoding core; when it is determined that the target input format audio signal is a scene-based audio signal, the target encoding core used is determined to be a scene audio data encoding core; when it is determined that the target input format audio signal is an object-based audio signal and a scene-based audio signal In the case of signals, it is determined that the target coding core used is the object audio data coding core, the scene audio data coding core, and so on.
本公开实施例中,采用确定的目标编码核对目标输入格式音频信号(音频格式信号)进行编码,并经过码流复用,生成目标编码码流,并发送至远端用户的终端设备。远端用户的终端设备接收到目标编码码流后,对目标编码码流进行解编码,以获取本端用户的声音信息。In the embodiment of the present disclosure, the target input format audio signal (audio format signal) is encoded using a determined target encoding check, and the target encoding code stream is generated through code stream multiplexing and sent to the terminal device of the remote user. After receiving the target encoding code stream, the remote user's terminal device decodes the target encoding code stream to obtain the local user's voice information.
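The Figure-2 flow (select a core per target format, encode each format signal with its core, multiplex into one bitstream) can be sketched end to end. The core names, the pass-through "encoder", and the tagged framing are illustrative stand-ins, not the actual codec or bitstream syntax of the disclosure.

```python
def encode_and_multiplex(format_signals):
    """Sketch of the Figure-2 pipeline.
    format_signals: dict mapping format name -> raw sample values."""
    # Hypothetical format-to-core assignment.
    core_for_format = {"object": "object_core", "scene": "scene_core"}
    substreams = []
    for fmt, samples in format_signals.items():
        core = core_for_format[fmt]
        payload = bytes(samples)  # placeholder for real per-core encoding
        substreams.append((core, payload))
    # Bitstream multiplexing: concatenate tagged sub-streams.
    return b"".join(core.encode() + b":" + p + b";" for core, p in substreams)
```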
By implementing the embodiments of the present disclosure, an audio signal is acquired; the audio scene type corresponding to the audio signal is determined; the target input format audio signal is determined according to the audio scene type and the audio signal; and the target input format audio signal is encoded to generate a target encoded bitstream. This ensures that the selected audio format signal better represents the local user's audio scene, so that the remote user can readily obtain the audio scene information of the scene in which the local user is located, improving the user experience.
Please refer to Figure 3, which is a flowchart of another audio signal encoding method provided by an embodiment of the present disclosure.
As shown in Figure 3, the method may include, but is not limited to, the following steps.
S10: Acquire an audio signal.
For the description of S10, refer to the corresponding description in the above embodiments; the same content is not repeated here.
S20: Determine the audio scene type corresponding to the audio signal.
In the embodiments of the present disclosure, the acquired audio signal can be analyzed to obtain the audio scene type corresponding to the audio signal.
In some embodiments, determining the audio scene type corresponding to the audio signal includes: obtaining audio feature parameters of the audio signal; and inputting the audio feature parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
In the embodiments of the present disclosure, audio feature parameters of the audio signal are obtained, for example linear prediction cepstral coefficients, the zero-crossing rate, or mel-frequency cepstral coefficients. The audio scene type analysis model may be, for example, an HMM (Hidden Markov Model), a Gaussian mixture model, or another mathematical model. Once the audio feature parameters of the audio signal are obtained, they are input into the audio scene type analysis model, which determines the audio scene type corresponding to the audio signal.
It can be understood that, in the embodiments of the present disclosure, the audio scene type analysis model determines the audio scene type of the audio signal from its audio feature parameters. The model may be obtained by pre-training or by pre-configuration, using methods from the related art; the embodiments of the present disclosure place no specific limitation on this.
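The feature-then-model step can be sketched with one concrete feature, the zero-crossing rate, and a toy stand-in for the analysis model. The threshold classifier below is purely illustrative; in practice an HMM or Gaussian mixture model trained on such features would produce the scene type, as the text notes.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ:
    one of the audio feature parameters mentioned above."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(frame) - 1, 1)

def classify_scene(frame, zcr_threshold=0.5):
    """Toy 'analysis model': a threshold on ZCR standing in for the
    HMM/GMM. The labels and threshold are illustrative assumptions."""
    zcr = zero_crossing_rate(frame)
    return "background_only" if zcr > zcr_threshold else "main_only"
```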
For the description of audio scene types, refer to the corresponding description in the above embodiments, which is not repeated here.
In some embodiments, determining the target audio signal input format according to the audio scene type includes:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining that the target audio signal input format is the object-based signal input format;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining that the target audio signal input format is the object-based signal input format and the scene-based signal input format;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, determining that the target audio signal input format is the scene-based signal input format.
In the embodiments of the present disclosure, when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal, the target audio signal input format can be determined to be the object-based signal input format.
In the embodiments of the present disclosure, when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal, the target audio signal input format can be determined to be the object-based signal input format and the scene-based signal input format.
In this case, the audio signal can be determined to be a mixed format audio signal, containing input-format audio signals of at least two audio scene types.
Furthermore, when the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal and the audio signal is determined to be a mixed format audio signal, the target audio signal input format can also be determined to be at least two of the object-based signal input format, the mono format, the stereo format, the MASA format, the channel format, the FOA format, and the HOA format.
In the embodiments of the present disclosure, when it is determined that the audio scene type indicates that the audio scene includes only at least one background audio signal, the target audio signal input format can be determined to be the scene-based signal input format.
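The three response branches above reduce to a small decision function. Representing the scene type by two flags (main present, background present) is an illustrative assumption made only to keep the sketch compact.

```python
def target_input_format(has_main, has_background):
    """Map a scene type, expressed as two hypothetical flags, to the
    target input format(s) per the three branches in the text."""
    if has_main and has_background:
        return ("object", "scene")   # object-based + scene-based formats
    if has_main:
        return ("object",)           # object-based signal input format
    if has_background:
        return ("scene",)            # scene-based signal input format
    return ()
```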
In some embodiments, determining the target input format audio signal according to the target audio signal input format and the audio signal includes:
in response to determining that the target audio signal input format is the object-based signal input format, determining that the target input format audio signal is the object-based audio signal in the audio signal;
in response to determining that the target audio signal input format is the object-based signal input format and the scene-based signal input format, determining that the target input format audio signal is the object-based audio signal and the scene-based audio signal in the audio signal;
in response to determining that the target audio signal input format is the scene-based signal input format, determining that the target input format audio signal is the scene-based audio signal in the audio signal.
In the embodiments of the present disclosure, when the target audio signal input format is determined to be the object-based signal input format, the target input format audio signal is determined to be the object-based audio signal in the audio signal.
In the embodiments of the present disclosure, when the target audio signal input format is determined to be the object-based signal input format and the scene-based signal input format, the target input format audio signal can be determined to be the object-based audio signal and the scene-based audio signal in the audio signal.
In the embodiments of the present disclosure, when the target audio signal input format is determined to be the scene-based signal input format, the target input format audio signal can be determined to be the scene-based audio signal in the audio signal.
S30: Determine the target encoding core according to the target input format audio signal, or determine the target encoding core according to the audio scene type and the target input format audio signal.
In some embodiments, determining the target encoding core according to the target input format audio signal includes:
in response to the target input format audio signal being an object-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, and that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core;
in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core is the scene audio data encoding core;
in response to the target input format audio signal being a channel-based audio signal, determining that the target encoding core is the channel audio data encoding core;
in response to the target input format audio signal being a metadata-assisted spatial audio (MASA) signal, determining that the target encoding core for the audio data in the MASA audio signal is the scene audio data encoding core, and that the target encoding core for the spatial auxiliary metadata in the MASA audio signal is the spatial auxiliary metadata parameter encoding core;
in response to the target input format audio signal being a mixed format audio signal, determining that the target encoding cores are the encoding cores selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA audio signal.
In the embodiments of the present disclosure, when the target input format audio signal is determined to be an object-based audio signal, which includes object audio data and metadata, the target encoding core for the object audio data can be determined to be the object audio data encoding core, and the target encoding core for the metadata can be determined to be the object metadata parameter encoding core.
In the embodiments of the present disclosure, when the target input format audio signal is determined to be a scene-based audio signal, the target encoding core can be determined to be the scene audio data encoding core.
In the embodiments of the present disclosure, when the target input format audio signal is determined to be a channel-based audio signal, the target encoding core can be determined to be the channel audio data encoding core.
In the embodiments of the present disclosure, when the target input format audio signal is determined to be a metadata-assisted spatial audio (MASA) signal, which includes audio data and spatial auxiliary metadata, the target encoding core for the audio data can be determined to be the scene audio data encoding core, and the target encoding core for the spatial auxiliary metadata can be determined to be the spatial auxiliary metadata parameter encoding core.
In the embodiments of the present disclosure, when the target input format audio signal is determined to be a mixed format audio signal, the target encoding cores can be determined to be the encoding cores selected for each format audio signal in the mixed format audio signal, where the mixed format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and a MASA audio signal.
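The per-format core selection above, including the mixed-format case, can be sketched as a table plus a union. Note that object and MASA formats each map to two cores (data and metadata), matching the branches in the text; the core names themselves are illustrative labels.

```python
# Hypothetical format-to-core table following the branches above.
FORMAT_CORES = {
    "object": ["object_audio_data_core", "object_metadata_core"],
    "scene": ["scene_audio_data_core"],
    "channel": ["channel_audio_data_core"],
    "masa": ["scene_audio_data_core", "spatial_metadata_core"],
}

def cores_for(target_formats):
    """For a single format, return its cores; for a mixed format signal,
    return the deduplicated union of the cores of every constituent
    format, preserving first-seen order."""
    cores = []
    for fmt in target_formats:
        for core in FORMAT_CORES[fmt]:
            if core not in cores:
                cores.append(core)
    return cores
```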
In some embodiments, determining the target encoding core according to the audio scene type and the target input format audio signal includes:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and the target encoding core for the scene-based audio signal is the scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and the target encoding core for the scene-based audio signal is the scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
In the embodiments of the present disclosure, when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal, and the target input format audio signal is an object-based audio signal and a scene-based audio signal, the target encoding core for the object audio data in the object-based audio signal can be determined to be the object audio data encoding core, the target encoding core for the metadata to be the object metadata parameter encoding core, and the target encoding core for the scene-based audio signal to be the scene audio data encoding core.
In the embodiments of the present disclosure, when it is determined that the audio scene type indicates that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal is an object-based audio signal and a scene-based audio signal, the same determination is made: the object audio data encoding core for the object audio data, the object metadata parameter encoding core for the metadata, and the scene audio data encoding core for the scene-based audio signal.
In the embodiments of the present disclosure, when it is determined that the audio scene type indicates that the audio scene includes only at least one background audio signal, and the target input format audio signal is a scene-based audio signal, the target encoding core for the scene-based audio signal can be determined to be the scene audio data encoding core.
S40: Encode the target input format audio signal with the target encoding core to obtain encoding parameters.
S50: Perform bitstream multiplexing on the encoding parameters to generate the target encoded bitstream.
In the embodiments of the present disclosure, for the descriptions of S40 and S50, refer to the corresponding descriptions in the above embodiments, which are not repeated here.
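S40 and S50 can be sketched as two small steps: each core produces encoding parameters, and bitstream multiplexing packs them into one target bitstream. The length-prefixed framing and the pass-through "encoding" are illustrative choices standing in for the disclosure's actual codec and bitstream syntax.

```python
import struct

def encode_with_cores(signal_bytes, cores):
    """S40 sketch: each target core contributes a (core, parameters)
    pair; here the 'parameters' are just the input bytes."""
    return [(core, signal_bytes) for core in cores]

def multiplex(params_list):
    """S50 sketch: pack each sub-stream as
    1-byte tag length | 2-byte payload length | tag | payload."""
    stream = b""
    for core, params in params_list:
        tag = core.encode()
        stream += struct.pack(">BH", len(tag), len(params)) + tag + params
    return stream
```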
In one possible implementation, as shown in Figure 4, in the embodiments of the present disclosure, an audio signal is acquired, the audio scene type corresponding to the audio signal is determined, and, when the target input format audio signal is determined to be an object-based audio signal according to the audio scene type and/or the audio signal, the target encoding core for the object audio data in the object-based audio signal is determined to be the object audio data encoding core, and the target encoding core for the metadata in the object-based audio signal is determined to be the object metadata parameter encoding core. Bitstream multiplexing is then performed to obtain the target encoded bitstream. This ensures that the selected audio format signal better represents the local user's audio scene, so that the remote user can readily obtain the audio scene information of the scene in which the local user is located, improving the user experience.
In another possible implementation, as shown in Figure 5, in the embodiments of the present disclosure, an audio signal is acquired, the audio scene type corresponding to the audio signal is determined, and, when the target input format audio signal is determined to be a MASA audio signal according to the audio scene type and/or the audio signal, the target encoding core for the audio data in the MASA audio signal is determined to be the scene audio data encoding core, and the target encoding core for the spatial auxiliary metadata in the MASA audio signal can be determined to be the spatial auxiliary metadata parameter encoding core, where the spatial auxiliary metadata in the MASA audio signal is optional. Bitstream multiplexing is then performed to obtain the target encoded bitstream. This likewise ensures that the selected audio format signal better represents the local user's audio scene, so that the remote user can readily obtain the audio scene information of the scene in which the local user is located, improving the user experience.
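The Figure-5 MASA path, in which the audio data always goes through the scene audio data encoding core while the spatial auxiliary metadata is optional, can be sketched as follows. The dictionary keys and the pass-through encoding are illustrative assumptions.

```python
def encode_masa(audio_data, spatial_metadata=None):
    """MASA sketch: audio data -> scene audio data core (always);
    spatial auxiliary metadata -> metadata parameter core (optional)."""
    streams = {"scene_audio_data_core": bytes(audio_data)}
    if spatial_metadata is not None:
        streams["spatial_metadata_core"] = bytes(spatial_metadata)
    return streams
```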
Figure 6 is a structural diagram of an audio signal encoding apparatus provided by an embodiment of the present disclosure.
As shown in Figure 6, the audio signal encoding apparatus 1 includes: a signal acquisition unit 11, a type determination unit 12, a target signal determination unit 13, and an encoding processing unit 14.
The signal acquisition unit 11 is configured to acquire an audio signal.
The type determination unit 12 is configured to determine the audio scene type corresponding to the audio signal.
The target signal determination unit 13 is configured to determine the target input format audio signal according to the audio scene type and the audio signal.
The encoding processing unit 14 is configured to encode the target input format audio signal and generate a target encoded bitstream.
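The wiring of the four units of Figure 6 can be sketched as one pipeline class; the unit bodies are illustrative stubs passed in as callables, reflecting only the order the text gives (acquire, classify, select, encode).

```python
class AudioSignalEncoder:
    """Sketch of apparatus 1: units 12-14 as injected callables."""

    def __init__(self, classify, select, encode):
        self.classify = classify   # type determination unit 12
        self.select = select       # target signal determination unit 13
        self.encode = encode       # encoding processing unit 14

    def process(self, audio_signal):
        # audio_signal plays the role of signal acquisition unit 11 output
        scene_type = self.classify(audio_signal)
        target = self.select(scene_type, audio_signal)
        return self.encode(target)
```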
As shown in Figure 7, in some embodiments, the type determination unit 12 includes: a parameter acquisition module 121 and a model processing module 122.
The parameter acquisition module 121 is configured to obtain audio feature parameters of the audio signal.
The model processing module 122 is configured to input the audio feature parameters into the audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
As shown in Figure 8, in some embodiments, the target signal determination unit 13 includes: a target format determination module 131 and a target signal determination module 132.
The target format determination module 131 is configured to determine the target audio signal input format according to the audio scene type and/or the audio signal.
The target signal determination module 132 is configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
In some embodiments, the target format determination module 131 is specifically configured to:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determine that the target audio signal input format is the object-based signal input format;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determine that the target audio signal input format is the object-based signal input format and the scene-based signal input format;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, determine that the target audio signal input format is the scene-based signal input format.
In some embodiments, the target signal determination module 132 is specifically configured to:
in response to determining that the target audio signal input format is the object-based signal input format, determine that the target input format audio signal is the object-based audio signal in the audio signal;
in response to determining that the target audio signal input format is the object-based signal input format and the scene-based signal input format, determine that the target input format audio signal is the object-based audio signal and the scene-based audio signal in the audio signal;
in response to determining that the target audio signal input format is the scene-based signal input format, determine that the target input format audio signal is the scene-based audio signal in the audio signal.
As shown in Figure 9, in some embodiments, the encoding processing unit 14 includes: an encoding core determination module 141, a parameter acquisition module 142, and a bitstream generation module 143.
The encoding core determination module 141 is configured to determine the target encoding core according to the target input format audio signal, or according to the audio scene type and the target input format audio signal.
The parameter acquisition module 142 is configured to encode the target input format audio signal with the target encoding core to obtain encoding parameters.
The bitstream generation module 143 is configured to perform bitstream multiplexing on the encoding parameters to generate the target encoded bitstream.
在一些实施例中,编码核确定模块141,具体被配置为:In some embodiments, the encoding core determination module 141 is specifically configured as:
响应于目标输入格式音频信号为基于对象音频信号,确定基于对象音频信号中对象音频数据的目标编码核为对象音频数据编码核,确定基于对象音频信号中元数据的目标编码核为对象元数据参数编码核;In response to the target input format audio signal being an object-based audio signal, the target encoding core based on the object audio data in the object audio signal is determined to be the object audio data encoding core, and the target encoding core based on the metadata in the object audio signal is determined to be the object metadata parameter. coding core;
响应于目标输入格式音频信号为基于场景音频信号,确定目标编码核为场景音频数据编码核;In response to the target input format audio signal being a scene-based audio signal, determining the target encoding core to be the scene audio data encoding core;
响应于目标输入格式音频信号为基于声道音频信号,确定目标编码核为声道音频数据编码核;In response to the target input format audio signal being a channel-based audio signal, determining the target encoding core to be the channel audio data encoding core;
响应于目标输入格式音频信号为基于辅助元数据的空间音频MASA音频信号,确定基于MASA音频信号中音频数据的目标编码核为场景音频数据编码核,确定基于MASA音频信号中空间辅助元数据的目标编码核为空间辅助元数据参数编码核;In response to the target input format audio signal being a spatial audio MASA audio signal based on auxiliary metadata, it is determined that the target encoding core based on the audio data in the MASA audio signal is the scene audio data encoding core, and the target based on the spatial auxiliary metadata in the MASA audio signal is determined. The encoding core is the spatial auxiliary metadata parameter encoding core;
响应于目标输入格式音频信号为混合格式音频信号,确定目标编码核为混合格式音频信号中的格式音频信号选用的编码核,其中,混合格式信号包括基于对象音频信号、基于场景音频信号、基于声道音频信号和基于MASA音频信号中的至少两个。In response to the target input format audio signal being a mixed format audio signal, the target encoding core is determined to be the encoding core selected for the format audio signal in the mixed format audio signal, where the mixed format signal includes an object-based audio signal, a scene-based audio signal, and a sound-based audio signal. at least two of the channel audio signal and the MASA-based audio signal.
In some embodiments, the encoding core determination module 141 is specifically configured to:
in response to the audio scene type indicating that the audio scene includes at least one main audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determine that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, determine that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and determine that the target encoding core for the scene-based audio signal is the scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determine that the target encoding core for the object audio data in the object-based audio signal is the object audio data encoding core, determine that the target encoding core for the metadata in the object-based audio signal is the object metadata parameter encoding core, and determine that the target encoding core for the scene-based audio signal is the scene audio data encoding core;
in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determine that the target encoding core for the scene-based audio signal is the scene audio data encoding core.
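The selection rules applied by the encoding core determination module 141 can be sketched as a simple mapping from the audio scene type and the set of target input formats to encoding cores. The following is a non-normative illustration only; the scene-type constants and core names are hypothetical stand-ins, not identifiers from this disclosure:

```python
# Illustrative sketch of the encoding-core selection rules described above.
# All constant and core names are hypothetical, not from the patent itself.

MAIN_ONLY = "main_only"                  # scene includes at least one main signal
MAIN_AND_BACKGROUND = "main_background"  # main signal(s) plus a background signal
BACKGROUND_ONLY = "background_only"      # only background signal(s)

def select_encoding_cores(scene_type, formats):
    """Return a {signal component: encoding core} mapping for the target signals."""
    cores = {}
    if scene_type in (MAIN_ONLY, MAIN_AND_BACKGROUND) and "object" in formats:
        # Object-based signal: audio data and metadata use separate cores.
        cores["object_audio_data"] = "object_audio_data_core"
        cores["object_metadata"] = "object_metadata_parameter_core"
    if "scene" in formats:
        # Scene-based signal always maps to the scene audio data core.
        cores["scene_audio_data"] = "scene_audio_data_core"
    return cores

print(select_encoding_cores(MAIN_AND_BACKGROUND, {"object", "scene"}))
```

For a background-only scene, only the scene audio data core is selected; for scenes with a main signal and object plus scene input, three cores are selected, as in the embodiments above.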
With regard to the apparatuses in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the corresponding method, and will not be elaborated here.
The audio signal encoding apparatus provided by the embodiments of the present disclosure can perform the audio signal encoding method described in some of the above embodiments; its beneficial effects are the same as those of the audio signal encoding method described above and are not repeated here.
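The method that this apparatus performs — acquiring the audio signal, determining the audio scene type, determining the target input format, and encoding — can be sketched in outline as follows. This is an illustrative sketch only; every function body and name is a hypothetical stand-in, not the patented implementation:

```python
# Illustrative end-to-end sketch of the four-step encoding method.
# All names and bodies are hypothetical placeholders.

def classify_scene(audio):
    # Stand-in for the audio scene type analysis model of the embodiments.
    return "main_background" if audio.get("has_main") else "background_only"

def choose_input_format(scene_type):
    # Scene type -> target input format(s), per the described selection rules.
    return {
        "main_only": ["object"],
        "main_background": ["object", "scene"],
        "background_only": ["scene"],
    }[scene_type]

def encode(audio, formats):
    # Stand-in encoder: one "bitstream segment" per selected format.
    return [f"{fmt}:encoded" for fmt in formats]

def encode_audio_signal(audio):
    scene_type = classify_scene(audio)         # step 2 (step 1 is acquisition)
    formats = choose_input_format(scene_type)  # step 3
    return encode(audio, formats)              # step 4

print(encode_audio_signal({"has_main": True}))
```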
FIG. 10 is a block diagram of an electronic device 100 for performing an audio signal encoding method according to an exemplary embodiment.
For example, the electronic device 100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
As shown in FIG. 10, the electronic device 100 may include one or more of the following components: a processing component 101, a memory 102, a power supply component 103, a multimedia component 104, an audio component 105, an input/output (I/O) interface 106, a sensor component 107, and a communication component 108.
The processing component 101 generally controls the overall operation of the electronic device 100, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation. The processing component 101 may include one or more processors 1011 to execute instructions to complete all or part of the steps of the method described above. In addition, the processing component 101 may include one or more modules that facilitate interaction between the processing component 101 and other components. For example, the processing component 101 may include a multimedia module to facilitate interaction between the multimedia component 104 and the processing component 101.
The memory 102 is configured to store various types of data to support operation of the electronic device 100. Examples of such data include instructions for any application or method operating on the electronic device 100, contact data, phonebook data, messages, pictures, videos, and the like. The memory 102 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as SRAM (Static Random-Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 103 provides power to the various components of the electronic device 100. The power supply component 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 100.
The multimedia component 104 includes a touch display screen that provides an output interface between the electronic device 100 and the user. In some embodiments, the touch display screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 104 includes a front camera and/or a rear camera. When the electronic device 100 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 105 is configured to output and/or input audio signals. For example, the audio component 105 includes a MIC (microphone); when the electronic device 100 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 102 or sent via the communication component 108. In some embodiments, the audio component 105 also includes a speaker for outputting audio signals.
The I/O interface 106 provides an interface between the processing component 101 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 107 includes one or more sensors for providing state assessments of various aspects of the electronic device 100. For example, the sensor component 107 may detect the on/off state of the electronic device 100 and the relative positioning of components, such as the display and keypad of the electronic device 100; the sensor component 107 may also detect a change in position of the electronic device 100 or of a component of the electronic device 100, the presence or absence of user contact with the electronic device 100, the orientation or acceleration/deceleration of the electronic device 100, and a change in the temperature of the electronic device 100. The sensor component 107 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 107 may also include a light sensor, such as a CMOS (Complementary Metal-Oxide-Semiconductor) or CCD (Charge-Coupled Device) image sensor, for use in imaging applications. In some embodiments, the sensor component 107 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 108 is configured to facilitate wired or wireless communication between the electronic device 100 and other devices. The electronic device 100 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 108 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 108 also includes an NFC (Near Field Communication) module to facilitate short-range communication. For example, the NFC module may be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra-Wide Band) technology, BT (Bluetooth) technology, and other technologies.
In an exemplary embodiment, the electronic device 100 may be implemented by one or more ASICs (Application-Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field-Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above audio signal encoding method. It should be noted that, for the implementation process and technical principles of the electronic device of this embodiment, reference may be made to the foregoing explanation of the audio signal encoding method of the embodiments of the present disclosure, which is not repeated here.
The electronic device 100 provided by the embodiments of the present disclosure can perform the audio signal encoding method described in some of the above embodiments; its beneficial effects are the same as those of the audio signal encoding method described above and are not repeated here.
To implement the above embodiments, the present disclosure also provides a storage medium.
When the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the audio signal encoding method described above. For example, the storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
To implement the above embodiments, the present disclosure also provides a computer program product; when the computer program is executed by a processor of an electronic device, the electronic device is enabled to perform the audio signal encoding method described above.
Other embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The above are only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed herein, and such changes or substitutions shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (19)

  1. An audio signal encoding method, comprising:
    acquiring an audio signal;
    determining an audio scene type corresponding to the audio signal;
    determining a target input format audio signal according to the audio scene type and the audio signal; and
    encoding the target input format audio signal to generate a target encoded bitstream.
  2. The method according to claim 1, wherein the determining the audio scene type corresponding to the audio signal comprises:
    acquiring audio feature parameters of the audio signal; and
    inputting the audio feature parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  3. The method according to claim 1 or 2, wherein the determining the target input format audio signal according to the audio scene type and the audio signal comprises:
    determining a target audio signal input format according to the audio scene type and/or the audio signal; and
    determining the target input format audio signal according to the target audio signal input format and the audio signal.
  4. The method according to claim 3, wherein the determining the target audio signal input format according to the audio scene type comprises:
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determining that the target audio signal input format is an object-based signal input format;
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determining that the target audio signal input format is an object-based signal input format and a scene-based signal input format; and
    in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, determining that the target audio signal input format is a scene-based signal input format.
  5. The method according to claim 4, wherein the determining the target input format audio signal according to the target audio signal input format and the audio signal comprises:
    in response to determining that the target audio signal input format is the object-based signal input format, determining that the target input format audio signal is an object-based audio signal in the audio signal;
    in response to determining that the target audio signal input format is the object-based signal input format and the scene-based signal input format, determining that the target input format audio signal is an object-based audio signal and a scene-based audio signal in the audio signal; and
    in response to determining that the target audio signal input format is the scene-based signal input format, determining that the target input format audio signal is a scene-based audio signal in the audio signal.
  6. The method according to any one of claims 1 to 5, wherein the encoding the target input format audio signal to generate the target encoded bitstream comprises:
    determining a target encoding core according to the target input format audio signal, or determining a target encoding core according to the audio scene type and the target input format audio signal;
    encoding the target input format audio signal according to the target encoding core to obtain encoding parameters; and
    performing bitstream multiplexing according to the encoding parameters to generate the target encoded bitstream.
  7. The method according to claim 6, wherein the determining the target encoding core according to the target input format audio signal comprises:
    in response to the target input format audio signal being an object-based audio signal, determining that the target encoding core for object audio data in the object-based audio signal is an object audio data encoding core, and determining that the target encoding core for metadata in the object-based audio signal is an object metadata parameter encoding core;
    in response to the target input format audio signal being a scene-based audio signal, determining that the target encoding core is a scene audio data encoding core;
    in response to the target input format audio signal being a channel-based audio signal, determining that the target encoding core is a channel audio data encoding core;
    in response to the target input format audio signal being a metadata-assisted spatial audio (MASA) audio signal, determining that the target encoding core for audio data in the MASA-based audio signal is a scene audio data encoding core, and determining that the target encoding core for spatial auxiliary metadata in the MASA-based audio signal is a spatial auxiliary metadata parameter encoding core; and
    in response to the target input format audio signal being a mixed-format audio signal, determining that the target encoding core is the encoding core selected for each format audio signal in the mixed-format audio signal, wherein the mixed-format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and an MASA-based audio signal.
  8. The method according to claim 6, wherein the determining the target encoding core according to the audio scene type and the target input format audio signal comprises:
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for object audio data in the object-based audio signal is an object audio data encoding core, determining that the target encoding core for metadata in the object-based audio signal is an object metadata parameter encoding core, and determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determining that the target encoding core for object audio data in the object-based audio signal is an object audio data encoding core, determining that the target encoding core for metadata in the object-based audio signal is an object metadata parameter encoding core, and determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core; and
    in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determining that the target encoding core for the scene-based audio signal is a scene audio data encoding core.
  9. An audio signal encoding apparatus, comprising:
    a signal acquisition unit, configured to acquire an audio signal;
    a type determination unit, configured to determine an audio scene type corresponding to the audio signal;
    a target signal determination unit, configured to determine a target input format audio signal according to the audio scene type and the audio signal; and
    an encoding processing unit, configured to encode the target input format audio signal to generate a target encoded bitstream.
  10. The apparatus according to claim 9, wherein the type determination unit comprises:
    a parameter acquisition module, configured to acquire audio feature parameters of the audio signal; and
    a model processing module, configured to input the audio feature parameters into an audio scene type analysis model to determine the audio scene type corresponding to the audio signal.
  11. The apparatus according to claim 9 or 10, wherein the target signal determination unit comprises:
    a target format determination module, configured to determine a target audio signal input format according to the audio scene type and/or the audio signal; and
    a target signal determination module, configured to determine the target input format audio signal according to the target audio signal input format and the audio signal.
  12. The apparatus according to claim 11, wherein the target format determination module is specifically configured to:
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal, determine that the target audio signal input format is an object-based signal input format;
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, determine that the target audio signal input format is an object-based signal input format and a scene-based signal input format; and
    in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, determine that the target audio signal input format is a scene-based signal input format.
  13. The apparatus according to claim 12, wherein the target signal determination module is specifically configured to:
    in response to determining that the target audio signal input format is the object-based signal input format, determine that the target input format audio signal is an object-based audio signal in the audio signal;
    in response to determining that the target audio signal input format is the object-based signal input format and the scene-based signal input format, determine that the target input format audio signal is an object-based audio signal and a scene-based audio signal in the audio signal; and
    in response to determining that the target audio signal input format is the scene-based signal input format, determine that the target input format audio signal is a scene-based audio signal in the audio signal.
  14. The apparatus according to any one of claims 9 to 13, wherein the encoding processing unit comprises:
    an encoding core determination module, configured to determine a target encoding core according to the target input format audio signal, or determine a target encoding core according to the audio scene type and the target input format audio signal;
    a parameter acquisition module, configured to encode the target input format audio signal according to the target encoding core to obtain encoding parameters; and
    a bitstream generation module, configured to perform bitstream multiplexing according to the encoding parameters to generate the target encoded bitstream.
  15. The apparatus according to claim 14, wherein the encoding core determination module is specifically configured to:
    in response to the target input format audio signal being an object-based audio signal, determine that the target encoding core for object audio data in the object-based audio signal is an object audio data encoding core, and determine that the target encoding core for metadata in the object-based audio signal is an object metadata parameter encoding core;
    in response to the target input format audio signal being a scene-based audio signal, determine that the target encoding core is a scene audio data encoding core;
    in response to the target input format audio signal being a channel-based audio signal, determine that the target encoding core is a channel audio data encoding core;
    in response to the target input format audio signal being a metadata-assisted spatial audio (MASA) audio signal, determine that the target encoding core for audio data in the MASA-based audio signal is a scene audio data encoding core, and determine that the target encoding core for spatial auxiliary metadata in the MASA-based audio signal is a spatial auxiliary metadata parameter encoding core; and
    in response to the target input format audio signal being a mixed-format audio signal, determine that the target encoding core is the encoding core selected for each format audio signal in the mixed-format audio signal, wherein the mixed-format signal includes at least two of an object-based audio signal, a scene-based audio signal, a channel-based audio signal, and an MASA-based audio signal.
  16. The apparatus according to claim 14, wherein the encoding core determination module is specifically configured to:
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determine that the target encoding core for object audio data in the object-based audio signal is an object audio data encoding core, determine that the target encoding core for metadata in the object-based audio signal is an object metadata parameter encoding core, and determine that the target encoding core for the scene-based audio signal is a scene audio data encoding core;
    in response to the audio scene type indicating that the audio scene includes at least one main audio signal and a background audio signal, and the target input format audio signal being an object-based audio signal and a scene-based audio signal, determine that the target encoding core for object audio data in the object-based audio signal is an object audio data encoding core, determine that the target encoding core for metadata in the object-based audio signal is an object metadata parameter encoding core, and determine that the target encoding core for the scene-based audio signal is a scene audio data encoding core; and
    in response to the audio scene type indicating that the audio scene includes only at least one background audio signal, and the target input format audio signal being a scene-based audio signal, determine that the target encoding core for the scene-based audio signal is a scene audio data encoding core.
  17. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
  18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1 to 8.
  19. A computer program product, comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the method of any one of claims 1 to 8.
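The core-selection logic of the claims above can be sketched as a simple dispatch on the detected scene type and the available target input formats. This is only an illustrative sketch: the scene-type names, function name, and core identifiers are invented for the example and do not come from the application itself.

```python
from enum import Enum, auto

class SceneType(Enum):
    MAIN_ONLY = auto()            # at least one main audio signal
    MAIN_AND_BACKGROUND = auto()  # main signal(s) plus a background signal
    BACKGROUND_ONLY = auto()      # only background signal(s)

def select_target_encoding_cores(scene_type: SceneType,
                                 has_object_signal: bool,
                                 has_scene_signal: bool) -> set:
    """Map the audio scene type and the target input formats to the set
    of target encoding cores, following the claim logic above."""
    cores = set()
    if scene_type in (SceneType.MAIN_ONLY, SceneType.MAIN_AND_BACKGROUND):
        if has_object_signal:
            # an object-based signal is split into its audio data and its
            # metadata, each encoded by a dedicated core
            cores.add("object_audio_data_core")
            cores.add("object_metadata_parameter_core")
        if has_scene_signal:
            cores.add("scene_audio_data_core")
    elif scene_type is SceneType.BACKGROUND_ONLY and has_scene_signal:
        # a purely background scene needs only the scene audio data core
        cores.add("scene_audio_data_core")
    return cores
```

For example, a scene with both main and background signals carried as object-based plus scene-based audio would select all three cores, while a background-only scene carried as scene-based audio would select only the scene audio data core.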
PCT/CN2022/092082, filed 2022-05-10 (priority date 2022-05-10): Audio signal encoding method and apparatus, electronic device and storage medium — published as WO2023216119A1 (en)

Priority Applications (2)

Application Number  Priority Date  Filing Date  Title
CN202280001342.8A  2022-05-10  2022-05-10  Audio signal encoding method, device, electronic equipment and storage medium
PCT/CN2022/092082  2022-05-10  2022-05-10  Audio signal encoding method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number  Priority Date  Filing Date  Title
PCT/CN2022/092082  2022-05-10  2022-05-10  Audio signal encoding method and apparatus, electronic device and storage medium

Publications (1)

Publication Number  Publication Date
WO2023216119A1  2023-11-16

Family ID: 88729306

Family Applications (1)

Application Number  Title  Priority Date  Filing Date
PCT/CN2022/092082  Audio signal encoding method and apparatus, electronic device and storage medium  2022-05-10  2022-05-10

Country Status (2)

CN (1): CN117813652A
WO (1): WO2023216119A1


Citations (7)

* Cited by examiner, † Cited by third party
Publication number  Priority date  Publication date  Assignee  Title
CN101393741A *  2007-09-19  2009-03-25  ZTE Corporation  Audio signal classification apparatus and method used in wideband audio encoder and decoder
US20110246207A1 *  2010-04-02  2011-10-06  Korea Electronics Technology Institute  Apparatus for playing and producing realistic object audio
US20160225377A1 *  2013-10-17  2016-08-04  Socionext Inc.  Audio encoding device and audio decoding device
CN112767956A *  2021-04-09  2021-05-07  Tencent Technology (Shenzhen) Co., Ltd.  Audio encoding method, apparatus, computer device and medium
CN113948099A *  2021-10-18  2022-01-18  Beijing Kingsoft Cloud Network Technology Co., Ltd.  Audio encoding method, audio decoding method, audio encoding device, audio decoding device and electronic equipment
CN114125639A *  2021-12-06  2022-03-01  Vivo Mobile Communication Co., Ltd.  Audio signal processing method and device and electronic equipment
CN114299967A *  2020-09-22  2022-04-08  Huawei Technologies Co., Ltd.  Audio coding and decoding method and device


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number  Priority date  Publication date  Assignee  Title
CN117373465A *  2023-12-08  2024-01-09  Fortemedia Technology (Nanjing) Co., Ltd.  Voice frequency signal switching system
CN117373465B *  2023-12-08  2024-04-09  Fortemedia Technology (Nanjing) Co., Ltd.  Voice frequency signal switching system

Also Published As

Publication number Publication date
CN117813652A (en) 2024-04-02


Legal Events

Code  Title  Description
WWE  WIPO information: entry into national phase  Ref document number: 202280001342.8; Country of ref document: CN
121  EP: the EPO has been informed by WIPO that EP was designated in this application  Ref document number: 22941082; Country of ref document: EP; Kind code of ref document: A1