WO2023142363A1 - Display device and audio processing method - Google Patents

Display device and audio processing method

Info

Publication number
WO2023142363A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
gain
channel
sound
target
Prior art date
Application number
PCT/CN2022/101859
Other languages
English (en)
French (fr)
Inventor
王海盈
李奎宝
徐志强
邢文峰
孙永瑞
Original Assignee
海信视像科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202210102847.5A (CN114615534A)
Priority claimed from CN202210102840.3A (CN114466241A)
Priority claimed from CN202210102852.6A (CN114598917B)
Priority claimed from CN202210102896.9A (CN114466242A)
Application filed by 海信视像科技股份有限公司
Publication of WO2023142363A1

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams

Definitions

  • the present application relates to display device technology, in particular, to a display device and an audio processing method.
  • Some embodiments of the present application provide a display device, including: a controller configured to: perform sound separation on the acquired first audio data to obtain first target audio data and first background audio data; perform gain processing on the first target audio data according to a first gain to obtain second target audio data; perform gain processing on the first background audio data according to a second gain to obtain second background audio data, wherein the first gain and the second gain are determined according to the sound control mode corresponding to the display device; and merge the second target audio data and the second background audio data and perform sound effect enhancement processing to obtain second audio data; and an audio output interface configured to output the second audio data.
  • Some embodiments of the present application provide a display device, including: a controller configured to: respectively perform sound separation and sound effect enhancement processing on the acquired first audio data to obtain first target audio data and second audio data; perform gain processing on the first target audio data according to a first gain to obtain second target audio data; perform gain processing on the second audio data according to a second gain to obtain third audio data, wherein the first gain and the second gain are determined according to the sound control mode corresponding to the display device; perform delay processing on the second target audio data or the third audio data so that the second target audio data and the third audio data are synchronized; and merge the second target audio data with the third audio data to obtain fourth audio data; and an audio output interface configured to output the fourth audio data.
  • Some embodiments of the present application provide a display device, including: a controller and a plurality of audio output interfaces; the controller is configured to: respectively perform vocal separation on the first channel audio data and the second channel audio data to obtain first vocal audio data of the first channel and first background audio data of the first channel, and first vocal audio data of the second channel and first background audio data of the second channel; merge the first vocal audio data of the first channel and the first vocal audio data of the second channel to obtain target vocal audio data; obtain the image data corresponding in time to the first channel audio data and the second channel audio data and perform lip movement detection on the image data; and, if lip movement coordinates within the screen of the display device are detected, use the lip movement coordinates and the coordinates of the plurality of audio output interfaces to determine the weights with which the target vocal audio data is output at the plurality of audio output interfaces.
  • Some embodiments of the present application provide a display device, including: a controller configured to: acquire song audio data and perform vocal separation on the song audio data to obtain original singer vocal audio data and accompaniment audio data; the controller is further configured to: determine an original-singer gain according to the energy of the original vocal audio data within each time period and the energy of the singing vocal audio data collected within that time period; perform gain processing on the original vocal audio data within the time period according to the original-singer gain to obtain target vocal audio data; and merge the accompaniment audio data, the target vocal audio data and the singing vocal audio data within the time period and perform sound effect enhancement processing to obtain target audio data; and an audio output interface configured to output the target audio data.
  • FIG. 1 is a schematic diagram of an operation scene between a display device and a control device according to some embodiments of the present application;
  • FIG. 2 is a block diagram of a hardware configuration of a display device 200 according to some embodiments of the present application;
  • FIG. 3 is a block diagram of a hardware configuration of a control device 100 according to some embodiments of the present application;
  • FIG. 4 is a schematic diagram of software configuration in the display device 200 according to some embodiments of the present application;
  • FIG. 5 is a schematic diagram of displaying an icon control interface of an application program in the display device 200 according to some embodiments of the present application;
  • FIG. 6A is a schematic diagram of a system architecture of an audio processing method in some embodiments of the present application;
  • FIG. 6B is a schematic diagram of an audio processing method in some embodiments of the present application;
  • FIG. 7 is a schematic diagram of sound separation;
  • FIG. 8 is a schematic diagram of an audio processing method in some embodiments of the present application;
  • FIG. 9a is a schematic diagram of the distribution angles of speakers in a standard recording studio or home audio system;
  • FIG. 9b is a schematic diagram of the angle of TV speakers;
  • FIG. 9c is a schematic diagram of changing the energy distribution relationship of TV speakers;
  • FIG. 10 is a schematic diagram of the function f(x) in some embodiments of the present application;
  • FIG. 11A is a schematic diagram of a system architecture of an audio processing method in some embodiments of the present application;
  • FIG. 11B is a schematic diagram of an audio processing method in some embodiments of the present application;
  • FIG. 12 is a schematic diagram of an audio processing method in some embodiments of the present application;
  • FIG. 13A is a schematic diagram of a system architecture of an audio processing method in some embodiments of the present application;
  • FIG. 13B is a schematic diagram of an audio processing method in some embodiments of the present application;
  • FIG. 14 is a schematic diagram of loudspeaker distribution;
  • FIG. 15A is a schematic diagram of a system architecture of an audio processing method in some embodiments of the present application;
  • FIG. 15B is a schematic diagram of an audio processing method in some embodiments of the present application;
  • FIG. 16 is a schematic diagram of performing time-domain transformation on original vocal audio data in some embodiments of the present application;
  • FIG. 17 is a schematic diagram of performing frequency-domain transformation on original vocal audio data in some embodiments of the present application;
  • FIG. 18 is a flowchart of an audio processing method in some embodiments of the present application;
  • FIG. 19 is a flowchart of an audio processing method in some embodiments of the present application;
  • FIG. 20 is a flowchart of an audio processing method in some embodiments of the present application;
  • FIG. 21 is a flowchart of an audio processing method in some embodiments of the present application.
  • FIG. 1 is a schematic diagram of an operation scene between a display device and a control device according to some embodiments of the present application.
  • a user can operate a display device 200 through a mobile terminal 300 and a control device 100 .
  • the control device 100 may be a remote controller, and the communication between the remote controller and the display device 200 includes infrared protocol communication, Bluetooth protocol communication, and other wireless or wired methods for controlling the display device 200.
  • software applications may be installed on both the mobile terminal 300 and the display device 200, so that connection and communication are realized through a network communication protocol, achieving one-to-one control operation and data communication.
  • FIG. 2 is a configuration block diagram of the control device 100 according to some embodiments.
  • the control device 100 includes a controller 110 , a communication interface 130 , a user input/output interface 140 , a memory, and a power supply.
  • FIG. 3 is a block diagram of a hardware configuration of a display device 200 according to some embodiments.
  • the display device 200 includes at least one of a tuner and demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, an external memory, a power supply, and a user interface 280.
  • the controller includes a central processing unit, a video processor, an audio processor, a graphics processor, RAM, ROM, the first interface to the nth interface for input/output.
  • the display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
  • FIG. 4 is a schematic diagram of software configuration in a display device 200 according to some embodiments of the present application.
  • the system is divided into four layers, which are, from top to bottom: the Applications layer (abbreviated as "application layer"), the Application Framework layer, the Android runtime and system library layer (referred to as the "system runtime layer"), and the kernel layer.
  • the kernel layer contains at least one of the following drivers: audio driver, display driver, Bluetooth driver, camera driver, WiFi driver, USB driver, HDMI driver, sensor drivers (such as fingerprint sensor, temperature sensor, pressure sensor, etc.), power supply driver, and so on.
  • FIG. 5 is a schematic diagram showing an icon control interface of an application in the display device 200 according to some embodiments of the present application.
  • the application layer includes at least one application that can display a corresponding icon control on the display, such as: Live TV App Icon Control, Video On Demand App Icon Control, Media Center App Icon Control, App Center Icon Control, Gaming App Icon Control, and more.
  • the Android system mainly includes the application layer, middleware and core layer, and the implementation logic can be in the middleware.
  • the middleware includes: an audio decoder, a sound separation module, a gain control module, a merging module, a sound effect enhancement module, and an audio output interface.
  • the audio decoder is used to perform audio decoding processing on a signal source input through a broadcast signal, network, USB or HDMI, etc., to obtain audio data.
  • the sound separation module is used for sound separation of the decoded audio data, for example, human voice audio and background audio can be separated by a human voice separation method.
  • the gain control module can obtain the user's sound control mode for the display device, and perform different gain processing on the human voice audio and the background audio, so as to enhance the human voice audio or the background audio.
  • the merging module is used to combine the gain-processed human voice audio and background audio to obtain combined audio data.
  • the sound effect enhancement module is used to perform sound effect enhancement processing on the combined audio data to obtain target audio data.
  • the audio output interface is used to output target audio data.
  • the above implementation logic can be implemented not only in the middleware but also in the core layer. Alternatively, it can be split between the middleware and the core layer; for example, the audio decoder and the sound separation module can be implemented in the middleware, and the modules after the sound separation module can be implemented in the core layer.
  • FIG. 6B is a schematic diagram of an audio processing method in some embodiments of the present application.
  • the first audio data can be obtained.
  • the sound separation module can realize the sound separation of the first audio data through AI (artificial intelligence) technology and a pre-trained neural network model, and obtain the first target audio data and the first background audio data.
  • the human voice can be separated by the human voice separation model, and the human voice is the first target audio data
  • the car sound can be separated by the pre-trained car sound separation model, in which case the car sound is the first target audio data and the first background audio data is the audio data other than the first target audio data.
  • the gain control module can obtain the first gain and the second gain according to the sound control mode, and the values of the first gain and the second gain are not equal. Perform gain processing on the first target audio data according to the first gain to obtain second target audio data, and perform gain processing on the first background audio data according to the second gain to obtain second background audio data. After combining the second target audio data and the second background audio data and performing sound effect enhancement processing, the second audio data is obtained and output.
  • the first target audio data or the first background audio data is enhanced by performing non-proportional gain processing on the first target audio data and the first background audio data, thereby improving the effect of sound effect enhancement.
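  • As an illustrative aid only, the following minimal Python/NumPy sketch shows how non-proportional gains G1 and G2, expressed in dB, could be applied to the separated target and background audio before merging and sound effect enhancement; the placeholder functions separate_target() and enhance(), and the example gain values, are assumptions and do not reproduce the disclosure's implementation.

```python
import numpy as np

def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)

def separate_target(first_audio: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for the pre-trained sound-separation model.

    Returns (first_target_audio, first_background_audio). A real system would
    run a neural network here; this placeholder returns the input as 'target'
    and silence as 'background' so the sketch runs end to end.
    """
    return first_audio, np.zeros_like(first_audio)

def enhance(audio: np.ndarray) -> np.ndarray:
    """Placeholder for sound effect enhancement (AGC/DRC/EQ/virtual surround)."""
    return audio

def process(first_audio: np.ndarray, g1_db: float, g2_db: float) -> np.ndarray:
    """Apply non-proportional gains to the separated streams, merge, enhance."""
    target, background = separate_target(first_audio)
    second_target = target * db_to_linear(g1_db)          # first gain -> second target audio
    second_background = background * db_to_linear(g2_db)  # second gain -> second background audio
    merged = second_target + second_background            # merge the two streams
    return enhance(merged)                                 # second audio data

# Example: enhance the target by attenuating the background by 6 dB.
pcm = np.random.uniform(-0.5, 0.5, 48000).astype(np.float32)  # 1 s of dummy audio
out = process(pcm, g1_db=0.0, g2_db=-6.0)
```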
  • the display device 200 includes: a controller 250 configured to: perform sound separation on the acquired first audio data to obtain first target audio data and first background audio data.
  • the first audio data refers to audio data containing at least two mixed sounds.
  • the first audio data may include human voice and background music; the human voice is separated through the pre-trained human voice separation model, and the sounds other than the human voice are the background sound.
  • the first target audio data is human voice
  • the first background audio data is background sound.
  • the audio output interface 270 is configured to: output second audio data.
  • Fig. 7 is a schematic diagram of sound separation.
  • sound signal 1 is the sound of a musical instrument
  • sound signal 2 is the sound of a person singing.
  • the mixed sound signal is a sound signal that mixes the sound of musical instruments and the sound of people singing during recording and audio and video production.
  • Traditional sound effect algorithms based on fixed logic operations cannot separate two sounds from a mixed sound signal, but with the help of AI technology, sound separation can be achieved, and audio 1 similar to musical instruments and audio 2 similar to human voices can be obtained.
  • the first audio data includes various mixed sounds such as human voice, car sound, gun sound, and background music; the human voice can be separated by the human voice separation model, the car sound can be separated by the pre-trained car sound separation model, and the gun sound can be separated by the pre-trained gun sound separation model.
  • in the first audio data, the sounds other than the separated human voice, car sound, and gun sound are used as the background sound.
  • the first target audio data may include human voices, car sounds, and gun sounds
  • the first background audio data is the background sound.
  • the user can select a voice control mode according to his/her preference, and according to the voice control mode, the first gain and the second gain can be determined.
  • the controller 250 is configured to: perform gain processing on the first target audio data according to the first gain to obtain second target audio data; perform gain processing on the first background audio data according to the second gain to obtain second background audio data. That is, gain processing of different magnitudes is performed on the first target audio data and the first background audio data, so as to enhance the first target audio data or the first background audio data.
  • the second target audio data and the second background audio data are combined, and sound effect enhancement processing is performed to obtain second audio data.
  • the signal after combining the second target audio data and the second background audio data is highly similar to the signal before sound separation.
  • the sound effect enhancement algorithm includes but is not limited to AGC (automatic gain control), DRC (dynamic range compression), EQ (equalizer), virtual surround, etc.
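  • For orientation only, a minimal sketch of one such enhancement step, dynamic range compression, is shown below; the threshold, ratio, and smoothing constants are illustrative assumptions rather than values taken from this disclosure.

```python
import numpy as np

def simple_drc(x: np.ndarray, threshold_db: float = -20.0, ratio: float = 4.0,
               smoothing: float = 0.01) -> np.ndarray:
    """Very simple downward compressor over float samples in [-1, 1].

    Levels above `threshold_db` are reduced according to `ratio`; the
    parameters are illustrative assumptions, not values from this disclosure.
    """
    eps = 1e-12
    env = 0.0
    out = np.empty_like(x)
    for i, sample in enumerate(x):
        # One-pole envelope follower on the absolute sample value.
        env = (1.0 - smoothing) * env + smoothing * abs(sample)
        level_db = 20.0 * np.log10(env + eps)
        over_db = max(0.0, level_db - threshold_db)
        # Keep only 1/ratio of the overshoot above the threshold.
        gain_db = -over_db * (1.0 - 1.0 / ratio)
        out[i] = sample * (10.0 ** (gain_db / 20.0))
    return out
```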
  • the controller 250 is configured to: determine the type of sound effect enhancement mode corresponding to the first audio data according to the sound control mode corresponding to the display device, the type of sound effect enhancement mode referring to the type of sound that the user wants to enhance; and determine, according to the sound control mode corresponding to the display device, the first gain and the second gain corresponding to the type of sound effect enhancement mode. Depending on the type of sound effect enhancement mode, the corresponding first gain and second gain will also be different.
  • the type of the sound effect enhancement mode corresponding to the first audio data can be determined first.
  • the type of sound effect enhancement mode indicates the type of sound that the user wants to enhance.
  • when the types of sound effect enhancement modes are different, the methods for determining the first gain and the second gain may also be different. Therefore, after the type of the sound effect enhancement mode is determined, the first gain and the second gain corresponding to that type may be determined according to the sound control mode.
  • the types of the sound enhancement mode may include a sound enhancement mode and a background enhancement mode, the sound enhancement mode indicates that the user wants to enhance the first target audio data, and the background enhancement mode indicates that the user wants to enhance the first background audio data.
  • the controller 250 is configured to: if the type of the sound effect enhancement mode corresponding to the first audio data is the sound enhancement mode, that is, the first target audio data is to be enhanced, the first gain is greater than the second gain; if the type of the sound effect enhancement mode corresponding to the first audio data is the background enhancement mode, that is, the first background audio data is to be enhanced, the first gain is smaller than the second gain.
  • if the user wants to enhance the first target audio data, the first target audio data can be enhanced without changing the first background audio data, that is, G1 can be a value greater than 0 dB and G2 equal to 0 dB. If the user wants to enhance the first background audio data, the first target audio data may be left unchanged and the first background audio data enhanced, that is, G1 is equal to 0 dB and G2 is a value greater than 0 dB.
  • in some embodiments, the range of G1 and G2 may be limited to values no greater than 0 dB. If the type of the sound effect enhancement mode corresponding to the first audio data is the sound enhancement mode, the first gain is set to 0 dB, and the second gain is determined according to the sound control mode, wherein the second gain is less than 0 dB. In this way, the purpose of enhancing the first target audio data is achieved by weakening the first background audio data without changing the first target audio data.
  • if the type of the sound effect enhancement mode corresponding to the first audio data is the background enhancement mode, the first gain is determined according to the sound control mode and the second gain is set to 0 dB, wherein the first gain is less than 0 dB. In this way, the purpose of enhancing the first background audio data is achieved by weakening the first target audio data without changing the first background audio data.
  • the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes.
  • the user can adjust the clarity of the human voice according to his needs and preferences, and select a target voice clarity control mode from multiple preset voice clarity control modes, and each preset voice clarity control mode has a corresponding value.
  • the multiple preset sound clarity control modes are divided into multiple different levels, and each level corresponds to a different numerical value.
  • the user can also select a target sound effect mode from various preset sound effect modes (such as standard mode, music mode, movie mode, etc.), and each preset sound effect mode has a corresponding value.
  • the preset sound clarity control mode indicates the sound clarity of the display device, and may include multiple different levels. If the value corresponding to the preset sound clarity control mode is M1, the user can adjust the sound clarity through the menu.
  • the menu adjustment value can be normalized to a value within [0, 1], that is, M1 is a value greater than or equal to 0 and less than or equal to 1. Assume that 0.5 indicates the factory default value of the display device, a value greater than 0.5 indicates higher voice clarity, and a value less than 0.5 indicates lower voice clarity.
  • the preset sound effect mode indicates the sound effect mode of the display device, and may include standard sound effects, music sound effects, movie sound effects, news sound effects, and the like. If the value corresponding to the preset sound effect mode is M2, M2 can also be a normalized value; for example, the value of M2 in standard mode is 0.5, in music mode 0.6, in movie mode 0.7, and in news mode 0.8.
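  • As a small illustration, the normalized values could be kept in a simple lookup; the per-mode M2 values below follow the examples above, while the 0-100 menu range used to derive M1 is an assumption.

```python
# M2 values per preset sound effect mode (from the examples above).
SOUND_EFFECT_MODE_M2 = {
    "standard": 0.5,
    "music": 0.6,
    "movie": 0.7,
    "news": 0.8,
}

def clarity_to_m1(menu_value: int, menu_max: int = 100) -> float:
    """Normalize a menu clarity setting to M1 in [0, 1].

    The 0-100 menu range is an assumption; 0.5 corresponds to the factory
    default, and values above 0.5 mean higher voice clarity.
    """
    return max(0, min(menu_value, menu_max)) / menu_max
```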
  • the sound control mode corresponding to the display device includes: a target sound clarity control mode and/or a target sound effect mode; wherein the target sound clarity control mode is one of the multiple preset sound clarity control modes, and the target sound effect mode is one of the multiple preset sound effect modes.
  • the controller 250 is configured to: determine the type of the sound effect enhancement mode corresponding to the first audio data according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode. That is, a numerical value can be obtained according to the first numerical value and/or the second numerical value, and the type of the sound effect enhancement mode can be determined according to the numerical value. Further, according to the first value and/or the second value, the first gain and the second gain corresponding to the type of the sound effect enhancement mode are determined.
  • a third value can be obtained according to the first value and the second value, and the type of the sound effect enhancement mode is determined based on the third value. Assuming a normalized scenario, when the third value is equal to 1, it means that neither the first target audio data nor the first background audio data is enhanced; when the third value is greater than 1, it means that the first target audio data is enhanced; when the third value is less than 1, it means that the first background audio data is enhanced.
  • the third value T can be expressed as the following formula: T = (2 × M1) × (2 × M2).
  • the first numerical value corresponding to the target sound clarity control mode is 0.5
  • the second numerical value corresponding to the target sound effect mode is also 0.5.
  • T is equal to 1.
  • Both the first gain G1 and the second gain G2 can be 0 dB, that is, no gain processing is performed on the first target audio data and the first background audio data.
  • the first numerical value corresponding to the target sound clarity control mode is 0.7
  • the second numerical value corresponding to the target sound effect mode is 0.8.
  • the value of T is greater than 1, that is, the first target audio data is enhanced.
  • both G1 and G2 are values not greater than 0 dB, so G1 can be set to 0 dB, and G2 can be set to a value less than 0 dB.
  • G2 can be expressed as the following formula:
  • where × represents a multiplication operation, log represents a logarithmic operation, and / represents a division operation.
  • the method of determining G2 is not limited to this; for example, a simple variation of formula (2) may be used.
  • G2 can be set to 0 dB, and G1 can be set to a value smaller than 0 dB.
  • G1 can be expressed as the following formula:
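  • The formulas for G1 and G2 are not reproduced in this text; as a rough, assumed stand-in, the sketch below combines the third value T (computed as described above) with an attenuation of 20·log10 applied to whichever component is not being enhanced. Only the expression for T is taken from this description; the dB mapping is an assumption.

```python
import math

def select_gains(m1: float, m2: float) -> tuple[float, float]:
    """Return (G1, G2) in dB for the target and the background audio.

    T = (2 * M1) * (2 * M2) follows the expression given in this description;
    the mapping of T to an attenuation in dB is an illustrative assumption.
    """
    t = (2.0 * m1) * (2.0 * m2)            # third value T
    if abs(t - 1.0) < 1e-9:
        return 0.0, 0.0                    # T == 1: no gain processing
    if t > 1.0:
        # Enhance the target: keep G1 at 0 dB and attenuate the background.
        return 0.0, 20.0 * math.log10(1.0 / t)
    # Enhance the background: keep G2 at 0 dB and attenuate the target.
    return 20.0 * math.log10(t), 0.0

# Example: clarity 0.7 and a mode value of 0.8 give T = 2.24, so the
# background would be lowered by roughly 7 dB while the target is unchanged.
g1, g2 = select_gains(0.7, 0.8)
```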
  • FIG. 8 is a schematic diagram of an audio processing method in some embodiments of the present application.
  • the audio data of the left and right channels are independently processed by human voice separation, gain processing, and sound effect enhancement processing, and then sent to the corresponding speakers.
  • Fig. 9a is a schematic diagram of distribution angles of standard recording studios or home audio speakers. It can be seen that the angle between the left and right sound channels is 60°.
  • when the sound source is created, generally the sound does not exist in only one channel; both channels carry the sound at the same time.
  • if the creator wants to express that the sound is on the left, the sound in the left channel will be louder than in the right channel.
  • if the creator wants to express that the sound is on the right, the sound in the right channel will be louder than in the left channel.
  • FIG. 9b is a schematic diagram of the angle of a TV speaker.
  • the virtual sound image of all sound elements is narrowed, which differs from the creator's intention, since the content was created for speakers placed at 60°.
  • when the angle between the two speakers is reduced to 8°-14°, if the ratio of the left and right channels is kept the same as the original ratio, the audience will perceive a blurred sound image, and it is difficult to hear the sense of direction of the sound.
  • FIG. 9c is a schematic diagram of changing the energy distribution relationship of TV speakers. It can be seen that after changing the energy distribution relationship, the car is closer to the left speaker in the audience's subjective sense of hearing.
  • the energy of the left and right channels of the background music used to set the atmosphere in a film or TV drama is basically the same, or the signals are completely identical, while the typical sounds used to express a sense of orientation are allocated to different channels; such typical sounds include but are not limited to human voices, gun sounds, cars, airplanes, etc. If the energy of the left and right channels were still calculated in the above manner and the energy ratio of the two channels simply changed, the center of the background music, whose sound image is centered, would also be shifted, so this approach is not advisable.
  • the first audio data includes at least one kind of third target audio data belonging to a preset sound type (such as a sound type that expresses a sense of orientation), and the third target audio data includes but is not limited to human voices, gun sounds, car sounds, airplane sounds, etc.
  • controller 250 is further configured to: separate at least one kind of third target audio data and third background audio data from the first audio data.
  • the first audio data refers to audio data containing at least two kinds of mixed sounds, and different neural network models can be trained to separate human voices, gun sounds, car sounds, etc. from the first audio data
  • the third target audio data is one type of audio data, and the first audio data may include one or more kinds of third target audio data; the audio data in the first audio data other than the third target audio data is the third background audio data. For example, when the first audio data includes human voice and car sound, the first audio data includes two kinds of third target audio data, namely the human voice and the car sound, and the sounds other than the human voice and the car sound are the background sound.
  • the following process can be performed.
  • the third target audio data includes audio data of at least two different channels (for example, a first channel and a second channel).
  • the first and second audio channels may be left and right channels, respectively.
  • the third target audio data includes audio data of two channels, that is, initial target audio data of the first channel and initial target audio data of the second channel.
  • the initial target audio data of the first channel and the initial target audio data of the second channel may be left channel audio data and right channel audio data respectively.
  • the initial background audio data of the first channel and the initial background audio data of the second channel described below may be the initial background audio data of the left channel and the initial background audio data of the right channel respectively.
  • the energies of the initial target audio data of the first channel and of the initial target audio data of the second channel in the third target audio data are generally different; therefore, for a single kind of third target audio data, the first energy value of the initial target audio data of the first channel and the second energy value of the initial target audio data of the second channel can be obtained, and the third gain corresponding to the initial target audio data of the first channel and the fourth gain corresponding to the initial target audio data of the second channel are determined according to the first energy value and the second energy value.
  • gain processing is performed on the initial target audio data of the first channel according to the third gain to obtain the first-gain audio data of the first channel (that is, the first channel audio data after gain processing), and gain processing is performed on the initial target audio data of the second channel according to the fourth gain to obtain the first-gain audio data of the second channel (that is, the second channel audio data after gain processing); wherein the third gain and the fourth gain are determined according to the first energy value and the second energy value.
  • performing gain processing on the initial target audio data of the first channel according to the third gain and performing gain processing on the initial target audio data of the second channel according to the fourth gain can further improve the sense of orientation of the third target audio data. Meanwhile, the center of the third background audio data may not be changed.
  • if the first energy value is greater than the second energy value, the third gain can be greater than the fourth gain; for example, the third gain can be set to a value greater than 0 dB and the fourth gain set to 0 dB, that is, no gain processing is performed on the initial target audio data of the second channel. If the first energy value is equal to the second energy value, the two energy values are equal, and the third gain is equal to the fourth gain, or no processing may be performed.
  • if the first energy value is less than the second energy value, the third gain can be smaller than the fourth gain; for example, the third gain is set to 0 dB, that is, no gain processing is performed on the initial target audio data of the first channel, and the fourth gain is set to a value greater than 0 dB.
  • in some embodiments, the third gain can be set to 0 dB and the fourth gain determined according to the first energy value and the second energy value, wherein the fourth gain is less than 0 dB. Gain processing is performed on the initial target audio data of the first channel according to the third gain to obtain the first-gain audio data of the first channel, and on the initial target audio data of the second channel according to the fourth gain to obtain the first-gain audio data of the second channel.
  • in other embodiments, the third gain can be determined according to the first energy value and the second energy value, with the third gain less than 0 dB and the fourth gain set to 0 dB. Gain processing is performed on the initial target audio data of the first channel according to the third gain to obtain the first-gain audio data of the first channel, and on the initial target audio data of the second channel according to the fourth gain to obtain the first-gain audio data of the second channel.
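  • A minimal sketch of this per-channel step is shown below; the energy measure (mean square over a block) and the mapping from the energy ratio to an attenuation in dB are assumptions, used only to illustrate attenuating the lower-energy channel while leaving the higher-energy channel at 0 dB.

```python
import numpy as np

def channel_energies(left: np.ndarray, right: np.ndarray) -> tuple[float, float]:
    """First/second energy values: mean-square energy of each channel block."""
    return float(np.mean(left ** 2)), float(np.mean(right ** 2))

def orientation_gains(p_left: float, p_right: float,
                      max_cut_db: float = -12.0) -> tuple[float, float]:
    """Return (third_gain, fourth_gain) in dB for the left/right target audio.

    The louder channel stays at 0 dB and the quieter channel is attenuated by
    the energy difference, capped at `max_cut_db`; the mapping is an assumption.
    """
    eps = 1e-12
    diff_db = 10.0 * np.log10((p_left + eps) / (p_right + eps))  # > 0: left louder
    if diff_db > 0:
        return 0.0, max(max_cut_db, -float(diff_db))   # attenuate the right channel
    if diff_db < 0:
        return max(max_cut_db, float(diff_db)), 0.0    # attenuate the left channel
    return 0.0, 0.0                                    # equal energy: leave both as-is

# The resulting gains are applied only to the per-channel target audio; the
# background audio of both channels is left untouched so its center does not move.
```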
  • the first-gain audio data of the first channel and the initial background audio data of the first channel of the third background audio data are merged, and sound effect enhancement processing is performed, to obtain the first enhanced audio data of the first channel; the first-gain audio data of the second channel and the initial background audio data of the second channel of the third background audio data are merged, and sound effect enhancement processing is performed, to obtain the first enhanced audio data of the second channel.
  • the first channel initial target audio data and the second channel initial target audio data can be analyzed.
  • different gain processing is performed on the initial target audio data of the first channel and the initial target audio data of the second channel, so that the audio data of the channel with higher energy is made stronger, which better reinforces the sense of direction of the sound and improves the effect of sound effect enhancement.
  • the processing process is similar to this, and will not be repeated here.
  • the audio output interface 270 includes: a first output interface and a second output interface; the first output interface is configured to output the first enhanced audio data of the first channel; the second output interface is configured to output the first enhanced audio data of the second channel.
  • gain processing may be performed on the third target audio data and the third background audio data by considering the sound control mode, the first energy value and the second energy value at the same time.
  • the controller 250 is further configured to: determine the fifth gain and the sixth gain corresponding to the single third target audio data according to the sound control mode, the first energy value and the second energy value corresponding to the display device.
  • the fifth gain and the sixth gain are gains corresponding to the initial target audio data of the first channel and the initial target audio data of the second channel of the third target audio data, respectively.
  • the fifth gain and the sixth gain may be different.
  • according to the sound control mode corresponding to the display device, the seventh gain is determined; the seventh gain refers to the gain corresponding to the third background audio data. Since the center of the third background audio data is not changed, the seventh gain is used to perform gain processing on both the initial background audio data of the first channel and the initial background audio data of the second channel, that is, the same gain processing is applied to the initial background audio data of the first channel and the initial background audio data of the second channel.
  • gain processing is performed on the initial target audio data of the first channel according to the fifth gain to obtain the second-gain audio data of the first channel (that is, the first channel audio data after gain processing), and on the initial target audio data of the second channel according to the sixth gain to obtain the second-gain audio data of the second channel. According to the seventh gain, the initial background audio data of the first channel and the initial background audio data of the second channel are subjected to gain processing to obtain the first-channel gain background audio data (i.e., the background audio data of the first channel after gain processing) and the second-channel gain background audio data (i.e., the background audio data of the second channel after gain processing).
  • both the second-gain audio data of the first channel and the aforementioned first-gain audio data of the first channel are the first channel audio data obtained after gain processing the initial target audio data of the first channel; only the gain values used differ. Likewise, both the second-gain audio data of the second channel and the aforementioned first-gain audio data of the second channel are the second channel audio data obtained after gain processing the initial target audio data of the second channel; only the gain values used differ.
  • the audio output interface 270 includes: a first output interface and a second output interface; the first output interface is configured to: output the second enhanced audio data of the first channel; the second output interface is configured to: output the second audio data of the second channel Enhance audio data.
  • the controller 250 is configured to: determine the type of the sound effect enhancement mode corresponding to the first audio data according to the sound control mode corresponding to the display device; and determine the energy magnitude relationship between the left and right channels according to the first energy value of the initial target audio data of the first channel and the second energy value of the initial target audio data of the second channel.
  • according to the sound control mode corresponding to the display device, the first energy value, and the second energy value, the fifth gain and the sixth gain corresponding to the type of sound effect enhancement mode and the energy magnitude relationship between the left and right channels are determined; according to the sound control mode corresponding to the display device, the seventh gain corresponding to the type of sound effect enhancement mode and the energy magnitude relationship between the left and right channels is determined.
  • the types of sound effect enhancement modes are different, and the gain processing methods for the third target audio data and the third background audio data are different.
  • the relationship between the energy of the left and right channels is different, and the gain processing methods for the initial target audio data of the first channel and the initial target audio data of the second channel are also different.
  • the type of sound effect enhancement mode is used to determine whether to enhance the third target audio data or the third background audio data, and the energy magnitude relationship between the left and right channels is used to determine whether to enhance the original target audio data of the first channel or the original target audio data of the second channel. Therefore, different types of sound effect enhancement modes and the energy magnitude relationship between the left and right channels correspond to different fifth gains, sixth gains and seventh gains.
  • if the type of the sound effect enhancement mode is the sound enhancement mode, both the fifth gain and the sixth gain are greater than the seventh gain, and if the first energy is greater than the second energy, the fifth gain is greater than the sixth gain.
  • the fifth gain may be equal to the sixth gain if the first energy is equal to the second energy. If the first energy is less than the second energy, the fifth gain is less than the sixth gain.
  • if the type of the sound effect enhancement mode is the background enhancement mode, both the fifth gain and the sixth gain are smaller than the seventh gain, and if the first energy is greater than the second energy, the fifth gain is greater than the sixth gain.
  • the fifth gain may be equal to the sixth gain if the first energy is equal to the second energy. If the first energy is less than the second energy, the fifth gain is less than the sixth gain.
  • the third value T may be greater than 1; assuming that the first energy value is PL and the second energy value is PR, if PL is greater than PR, the fifth gain may be equal to 0 dB, and both the sixth gain and the seventh gain are less than 0 dB.
  • the fifth gain G1L is 0 dB.
  • the sixth gain can be expressed as the following formula:
  • the seventh gain can be expressed as the following formula:
  • the sixth gain is equal to 0 dB, and both the fifth gain and the seventh gain are less than 0 dB.
  • the fifth gain can be expressed as the following formula:
  • the sixth gain G1R is 0 dB.
  • the seventh gain can be expressed as the following formula:
  • the fifth gain can be expressed as the following formula:
  • the sixth gain can be expressed as the following formula:
  • the seventh gain G2 is 0 dB.
  • the fifth gain G1L can be expressed as the following formula:
  • the sixth gain G1R can be expressed as the following formula:
  • the seventh gain G2 is 0 dB.
  • FIG. 10 is a schematic diagram of the function f(x) in some embodiments of the present application. It can be seen that the variation trend of f(x) with x satisfies the above relationship. It should be noted that the variation trend of f(x) with x is not limited thereto, for example, it can be exponential, parabolic or a combination of multiple forms, as long as the above relationship is satisfied.
  • the manner of determining the fifth gain, the sixth gain and the seventh gain is not limited thereto, for example, a simple modification of the above formula may be used.
  • the fifth gain, the sixth gain and the seventh gain may also be greater than or equal to 0 dB.
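  • The sketch below only reproduces the case structure described above (which gain stays at 0 dB, and which gains must exceed which); the placeholder function f() and the way the attenuations are combined are illustrative assumptions, not the formulas of this disclosure.

```python
import math

def f(x: float, slope_db: float = 6.0) -> float:
    """Monotone non-decreasing function of x >= 1, in dB (placeholder assumption)."""
    return slope_db * math.log2(max(x, 1.0))

def stereo_gains(t: float, p_left: float, p_right: float) -> tuple[float, float, float]:
    """Return (G1L, G1R, G2) in dB for the left/right target audio and the background.

    For T > 1 (sound enhancement) the background receives the largest cut and the
    louder target channel stays at 0 dB; for T < 1 (background enhancement) the
    background stays at 0 dB and both target channels are cut, the quieter one more.
    """
    eps = 1e-12
    ratio = (p_left + eps) / (p_right + eps)            # > 1 means the left channel is louder
    left_cut = f(1.0 / ratio) if ratio < 1.0 else 0.0   # cut only the quieter target channel
    right_cut = f(ratio) if ratio > 1.0 else 0.0
    if t > 1.0:
        background_cut = f(t) + max(left_cut, right_cut)
        return -left_cut, -right_cut, -background_cut
    if t < 1.0:
        base_cut = f(1.0 / t)
        return -(base_cut + left_cut), -(base_cut + right_cut), 0.0
    return 0.0, 0.0, 0.0                                 # T == 1: no gain processing
```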
  • the controller 250 is configured to: combine the second gain audio data of the first channel and the background audio data of the first channel gain, and perform sound effect enhancement processing, obtain and output the second enhanced audio data of the first channel;
  • the second-gain audio data of the second channel and the second-channel gain background audio data are combined, and sound effect enhancement processing is performed, to obtain and output the second enhanced audio data of the second channel.
  • the present application can also consider the control mode and the energy magnitude relationship between the initial target audio data of the first channel and the initial target audio data of the second channel to determine the initial target audio data of the first channel and the initial target audio data of the second channel corresponding gain values, so that the effect of sound effect enhancement can be further improved.
  • the sound separation algorithm usually uses artificial intelligence technology; the sound is first processed by the artificial intelligence model and then by sound effect enhancement, which can cause audio and video to fall out of sync. In order to solve this problem, the present application provides a display device.
  • the display device can run the Android system, and the implementation in the Android system can be shown in Figure 11A.
  • the Android system mainly includes an application layer, a middleware, and a core layer, and the implementation logic can be in the middleware.
  • the middleware includes: an audio decoder, a sound separation module, a sound effect enhancement module, a gain control module, a delay module, and an audio output interface.
  • the middleware may also include a human voice separation module, a sound distribution module, and an image decoder, wherein the sound distribution module is used to perform lip movement detection on the image decoded and output by the image decoder, so as to determine the weight of the vocal audio and the weight of the background audio output at each audio output interface, as shown in FIG. 13A.
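  • A hedged sketch of one possible weighting rule is shown below; the inverse-distance weighting is an assumption, since the disclosure only states that the lip movement coordinates and the speaker coordinates are used to determine the output.

```python
import math

def voice_weights(lip_xy: tuple[float, float],
                  speaker_xy: list[tuple[float, float]]) -> list[float]:
    """Weight of the human voice audio for each audio output interface.

    Inverse-distance weighting between the detected lip-movement coordinates
    and each speaker's coordinates; the weighting rule itself is an assumption.
    """
    eps = 1e-6
    inv = [1.0 / (math.dist(lip_xy, s) + eps) for s in speaker_xy]
    total = sum(inv)
    return [w / total for w in inv]   # weights sum to 1

# Example: lips detected on the left half of the screen -> the left speaker
# gets most of the voice, while the background audio keeps its normal mix.
weights = voice_weights((0.25, 0.5), [(0.0, 1.0), (1.0, 1.0)])
```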
  • the middleware may also include an original vocal volume control module, which determines, according to the singing audio data and the separated original vocal audio, how much of the original vocal audio is merged into the accompaniment audio, that is, the target vocal audio, as shown in FIG. 15A.
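  • A hedged sketch of this original-vocal volume control is given below: within each time period, the energies of the separated original vocal and of the captured singing vocal are compared, and the original vocal is faded down while the user is singing; the specific fade rule and floor value are assumptions.

```python
import numpy as np

def original_vocal_gain(original_vocal: np.ndarray, sung_vocal: np.ndarray,
                        floor: float = 0.1) -> float:
    """Linear gain for the original singer's vocal within one time period.

    If the user is singing loudly relative to the original vocal, the original
    vocal is attenuated (down to `floor`); if the user is silent, it is kept.
    The exact mapping is an illustrative assumption.
    """
    eps = 1e-12
    p_orig = float(np.mean(original_vocal ** 2))
    p_sing = float(np.mean(sung_vocal ** 2))
    presence = p_sing / (p_sing + p_orig + eps)    # ~0: not singing, ~1: singing
    return max(floor, 1.0 - presence)

# target_vocal = original_vocal * gain; it is then merged with the accompaniment
# and the captured singing vocal, and sound effect enhancement is applied.
```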
  • the audio decoder is used to perform audio decoding processing on a signal source input through a broadcast signal, network, USB or HDMI, etc., to obtain audio data.
  • the sound separation module is used for sound separation of the decoded audio data, for example, human voice audio can be separated by a human voice separation method.
  • the sound effect enhancement module is used to perform sound effect enhancement processing on the decoded audio data
  • the gain control module can obtain the user's sound control mode for the display device, and perform different gain processing on the separated audio and the sound effect enhanced audio respectively. Since the durations consumed by sound separation and sound effect enhancement are usually different, the delay module can perform delay processing on the two audio data after gain processing.
  • the merging module is used for merging the two audios after gain processing to obtain the combined audio data.
  • the audio output interface is used to output the combined audio data.
  • FIG. 11B is a schematic diagram of an audio processing method in some embodiments of the present application.
  • the first audio data can be obtained.
  • the sound separation module can realize the sound separation of the first audio data through the AI technology and the pre-trained neural network model to obtain the first target audio data.
  • the first target audio data may be a human voice, a car sound, or the like.
  • the second audio data may be obtained after sound effect enhancement processing is performed on the first audio data.
  • the gain control module can obtain the first gain and the second gain according to the sound control mode, and the values of the first gain and the second gain are not equal.
  • some embodiments of the present application also provide a display device 200 including:
  • the controller 250 may also be configured to: respectively perform sound separation and sound effect enhancement processing on the acquired first audio data to obtain first target audio data and second audio data.
  • the controller can control the related data processing of the above-mentioned application layer, middleware and core layer.
  • the first audio data refers to audio data including at least two kinds of mixed sounds.
  • the first audio data may include human voice and background music.
  • the first target audio data usually refers to the audio data that the user wants to enhance, which may be human voice or other sounds, etc., for example, it is applicable to scenes such as watching movies and TV dramas and listening to music.
  • the human voice can be separated through the pre-trained human voice separation model.
  • the first target audio data is the human voice.
  • the first audio data includes multiple mixed sounds such as human voice, car sound, gun sound and background music, and the car sound can be separated by a pre-trained car sound separation model.
  • the first target audio data is the car sound. In the above sound separation process, only one kind of sound (the first target audio data) may be separated, which reduces the time the separation process takes compared with isolating multiple sounds.
  • the present application can also perform sound effect enhancement processing on the first audio data.
  • the sound effect enhancement processing and the sound separation processing can be performed in parallel instead of serially, which can further shorten the time consumed by the entire audio processing flow and improve audio and video synchronization.
  • the sound effect enhancement algorithm includes but is not limited to automatic gain control, dynamic range compression, equalizer, virtual surround, and so on.
  • the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes; each preset sound clarity control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding value.
  • Users can adjust the sound control mode of the display device according to their needs and preferences.
  • the sound control mode corresponding to the display device includes: a target sound clarity control mode and/or a target sound effect mode, wherein the target sound clarity control mode is one of the multiple preset sound clarity control modes, and the target sound effect mode is one of the multiple preset sound effect modes. Therefore, the first gain and the second gain are determined according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode, wherein the first gain may be greater than the second gain.
  • the first target audio data usually refers to the audio data that the user wants to enhance. Therefore, in the case where the types of the sound effect enhancement mode include the sound enhancement mode and the background enhancement mode, it is applicable to the scene of the sound enhancement mode.
  • a third value is obtained according to the first value and the second value, and when the third value is greater than 1, the first target audio data is enhanced.
  • the third value T can be expressed as: T = (2 × M1) × (2 × M2). It can be understood that if the standard-mode (default) values of M1 and M2 are different, the expression of the third value T can also be different.
  • the first gain and the second gain may be less than or equal to 0 dB in order to ensure that the audio signal does not appear broken due to a positive gain.
  • the first gain can be set to 0 dB; according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode, the second gain is determined so that the second gain is less than 0 dB.
  • the methods for determining the first gain and the second gain may refer to the descriptions in the foregoing embodiments, and details are not repeated here.
  • the sound separation processing and the sound effect enhancement processing of the first audio data can be performed in parallel, and the times they take are usually different; therefore, if the second target audio data and the third audio data were combined directly, the sound signals would not overlap, resulting in an echo.
  • the second target audio data or the third audio data can be delayed, so that the second target audio data and the third audio data are synchronized; the second target audio data and the third audio data are combined to obtain Fourth audio data. In this way, it is possible to avoid problems such as inability to overlap sound signals and causing echoes.
  • the audio output interface 270 is configured to: output fourth audio data.
  • the controller 250 is configured to: obtain the first duration consumed during sound separation and the second duration consumed during sound effect enhancement processing; according to the first duration and the second duration, the second target audio data or the third audio data for delay processing. That is, the time spent on sound separation and sound effect enhancement can be directly counted. If the time spent on sound separation is short, the second target audio data can be delayed; if the time spent on sound effect enhancement is short, the The third audio data is delayed, and finally the second target audio data and the third audio data are synchronized.
  • both the first duration and the second duration can be taken as one or several sets of fixed values obtained from measurements.
  • the chip of the display device usually has no dedicated unit for the sound separation algorithm; instead, an APU (Accelerated Processing Unit) or GPU (Graphics Processing Unit) is shared with the image AI algorithms, so the computation time of sound separation is often not a fixed value but fluctuates; actual measurements show the fluctuation is within ±20 ms.
  • the controller 250 is configured to: determine the time difference between the first target audio data and the second audio data according to the correlation between the first target audio data and the second audio data; according to the time difference, Delay processing is performed on the second target audio data or the third audio data.
  • the correlation between the first target audio data and the second audio data may also be analyzed. According to the correlation, the time difference between the first target audio data and the second audio data is determined, and then delay processing is performed.
  • the correlation between the first target audio data and the second audio data may be compared through a time domain window function.
  • the controller 250 is configured to: acquire a first audio segment of the first target audio data within a time period t, where the first audio segment may be any audio segment of duration t; and acquire the second audio segment of the second audio data within the same time period t (i.e., at the same time as the first audio segment), together with a plurality of third audio segments before the second audio segment and a plurality of fourth audio segments after it; the durations of the third audio segments and the fourth audio segments are both equal to the duration t.
  • a segment is intercepted from the first target audio data, denoted as w; the same window is used to intercept multiple segments of the second audio data in the same time period, denoted as w(x); the convolution of w with the data in each w(x) is computed to obtain the correlation data between w and w(x).
  • the time difference between w(x) with the highest correlation and w is determined as the time difference between the first target audio data and the second audio data.
  • other ways of determining the time difference between the two audio data are also possible.
  • the window width is closely related to the precision of the delay calculation: if the window width is t, the calculation precision is also t. However, the smaller t is, the larger the corresponding amount of computation.
  • when the amount of computation is relatively large, it can be reduced by roughly half by skipping every other point in the calculation.
  • the corresponding precision can be selected according to the computing power of the processor.
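  • A compact sketch of this alignment step follows: a window of length t from the separated (first target) audio is correlated against windows of the enhanced audio at different offsets, and the offset with the highest correlation is taken as the time difference; the search range, step size, and the use of normalized correlation instead of a raw convolution are assumptions.

```python
import numpy as np

def estimate_delay(separated: np.ndarray, enhanced: np.ndarray,
                   start: int, window: int, max_shift: int, step: int = 1) -> int:
    """Return the sample offset that best aligns `separated` with `enhanced`.

    separated[start:start+window] plays the role of the window w; windows of
    `enhanced` shifted by -max_shift..+max_shift play the role of w(x).
    A `step` larger than 1 trades precision for computation (e.g. skipping
    every other offset roughly halves the work).
    """
    w = separated[start:start + window]
    best_shift, best_score = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1, step):
        lo = start + shift
        if lo < 0 or lo + window > len(enhanced):
            continue  # candidate window falls outside the buffer
        seg = enhanced[lo:lo + window]
        denom = np.linalg.norm(w) * np.linalg.norm(seg) + 1e-12
        score = float(np.dot(w, seg) / denom)  # normalized correlation
        if score > best_score:
            best_score, best_shift = score, shift
    # A positive best_shift means the enhanced branch lags; the separated
    # branch would then be delayed by best_shift samples to synchronize them.
    return best_shift
```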
  • in the method shown in the system architecture of FIG. 8, the sound of the left and right channels is separated independently, the two kinds of audio data obtained after separation are gain-processed with the first gain and the second gain respectively, then combined, subjected to sound effect enhancement processing, and sent to the corresponding speakers.
  • although this architecture is simple, the audio data of both the left and right channels must pass through the sound separation algorithm, and the sound separation usually runs on the same physical computing processor, so the computation is superimposed in time and places high demands on the AI processing capability of the chip. It can be seen that reducing the amount of sound separation computation determines whether the method can be applied to more display devices.
  • FIG. 12 is a schematic diagram of an audio processing method in some embodiments of the present application.
  • the audio data of the left channel and the audio data of the right channel output by the audio decoder are not only subjected to sound effect enhancement processing and gain processing respectively, but are also combined into one signal for sound separation, and gain processing is performed on the first target audio data output by the separation. Delay processing is then performed on the sound signals of the two links, and the sound signal of the sound separation link is finally superimposed onto the left and right channels of the sound effect enhancement link respectively. In this way, the amount of computation for sound separation is reduced by half, making the implementation more practical.
• the first audio data includes first channel original audio data and second channel original audio data. That is, the first audio data may include audio data of two channels; for example, the original audio data of the first channel and of the second channel may be the left channel audio data and the right channel audio data contained in the first audio data.
  • the controller 250 is configured to: perform sound effect enhancement processing on the initial audio data of the first channel and the initial audio data of the second channel, respectively, to obtain the enhanced audio data of the first channel (that is, the audio data of the first channel after sound effect enhancement). data) and second-channel sound-enhanced audio data (that is, second-channel audio data after sound effect enhancement).
  • sound separation can be directly performed on the first audio data (that is, the audio data after combining the original audio data of the first channel and the original audio data of the second channel) to obtain the first target audio data, so that the amount of computation for sound separation is cut in half.
• gain processing can be performed on the first target audio data according to the first gain to obtain the second target audio data; according to the second gain, the first channel sound effect enhanced audio data and the second channel sound effect enhanced audio data are respectively subjected to gain processing to obtain the first channel target audio data and the second channel target audio data. Delay processing is performed on the second target audio data or the first channel target audio data so that the second target audio data and the first channel target audio data are synchronized; and delay processing is performed on the second target audio data or the second channel target audio data so that the second target audio data and the second channel target audio data are synchronized.
  • the duration consumed by sound separation and the duration consumed by sound effect enhancement processing are usually different, therefore, delay processing may be performed first and then merged.
• it is also possible to count the first duration consumed by sound separation, the second duration consumed by performing sound effect enhancement processing on the original audio data of the first channel, and the third duration consumed by performing sound effect enhancement processing on the original audio data of the second channel. According to the first duration and the second duration, the second target audio data or the first channel target audio data is delayed; according to the first duration and the third duration, the second target audio data or the second channel target audio data is delayed.
• similarly, the correlation between the first target audio data and the second channel sound effect enhanced audio data may be determined, and delay processing is performed on the second target audio data or the second channel target audio data according to the correlation.
• the second duration consumed by performing sound effect enhancement processing on the original audio data of the first channel and the third duration consumed by performing sound effect enhancement processing on the original audio data of the second channel are usually equal, or the difference is small enough to be ignored. Therefore, in order to reduce the amount of computation, it is also possible to count only the time consumed by one of the sound effect enhancement processes; alternatively, it is enough to determine the correlation between the first target audio data and the sound effect enhanced audio data of the first channel (or of the second channel).
  • the second target audio data is respectively combined with the first channel target audio data and the second channel target audio data to obtain the combined audio data of the first channel and the combined audio data of the second channel;
  • the audio output interface 270 includes: a first output interface and a second output interface; the first output interface is configured to output the combined audio data of the first channel; the second output interface is configured to output the combined audio data of the second channel.
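• A schematic Python sketch of this combined-channel flow is given below; separate_target and enhance are placeholders for the AI sound separation and sound effect enhancement algorithms (they are not real APIs), and the fixed delay stands in for whichever link needs to be aligned.

```python
import numpy as np

def separate_target(mono):
    # placeholder for the AI sound-separation model; returns its input so the
    # sketch runs end to end
    return mono

def enhance(channel):
    # placeholder for per-channel sound effect enhancement
    return channel

def process_frame(left, right, first_gain, second_gain, delay_samples):
    """Separate once on the downmix, enhance each channel, then superimpose
    the (gain-processed, delayed) target onto both output channels."""
    mono = 0.5 * (left + right)                     # one separation pass instead of two
    target = first_gain * separate_target(mono)
    enh_l = second_gain * enhance(left)
    enh_r = second_gain * enhance(right)
    target = np.concatenate([np.zeros(delay_samples), target])[:len(enh_l)]
    return enh_l + target, enh_r + target
```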
  • sound separation can be realized by artificial intelligence technology.
  • the first audio data includes the original audio data of the first channel and the original audio data of the second channel
  • the original audio data of the two channels are respectively subjected to sound separation and sound effect enhancement processing, and sound separation will consume a large amount of calculation, therefore, the processing capability of the chip in the display device is required to be relatively high.
• the original audio data of the first channel and the original audio data of the second channel can be combined, that is, sound separation is performed directly on the first audio data, and after gain processing is performed on the separated first target audio data, the second target audio data is obtained.
  • the present application also provides a display device for enhancing the stereo effect of sound.
  • the implementation in the Android system can be shown in Figure 13A.
  • the Android system mainly includes the application layer, middleware and core layer, and the implementation logic can be in the middleware.
• the middleware may include modules such as an audio decoder, a human voice separation module, a gain control module, an image decoder, a sound distribution module, a merging module, a sound effect enhancement module and an audio output interface; for the introduction of the audio decoder, the sound effect enhancement module and the audio output interface, reference may be made to the description above.
• the human voice separation module is used to separate the human voice from the decoded left channel audio data and right channel audio data, respectively, to obtain the left channel human voice audio data and the left channel background audio data, as well as the right channel human voice audio data and the right channel background audio data.
  • the sound allocation module is used to detect the lip movement on the image decoded and output by the image decoder, so as to determine the weight of human voice audio and the weight of background audio output by each audio output interface.
  • FIG. 13B is a schematic diagram of an audio processing method in some embodiments of the present application.
  • the audio decoder can decode and output the left channel audio data and the right channel audio data, and can separate the left channel audio data and the right channel audio data respectively to obtain the left channel human voice audio data and the left channel background Audio data, as well as right channel human voice audio data and right channel background audio data.
  • the human voice separation of the left channel audio data and the human voice separation of the right channel audio data can be realized through AI technology and a pre-trained neural network model.
  • the left-channel human voice audio data and the right-channel human voice audio data are combined to obtain target human voice audio data.
• the image decoder can decode the image at the moment of the left channel audio data and the right channel audio data, lip movement detection can be performed on the image, and the weight of the target human voice audio data output by each audio output interface can be determined according to the lip movement detection result.
  • the weights for outputting the left-channel background audio data and the right-channel background audio data by the audio output interface may be determined according to the coordinates of the audio output interface.
• according to these weights for outputting the target human voice audio data and the left/right channel background audio data, the human voice audio and the background audio are merged for each audio output interface; finally, sound effect enhancement processing is performed on the merged audio and it is output.
• in some embodiments, the separated left channel human voice audio data and right channel human voice audio data may be combined first; the weight of the human voice corresponding to each audio output interface (that is, the weight with which the target human voice audio is output) is then adjusted according to the lip movement detection result, and the weight of the background audio output by each audio output interface is adjusted according to the position of the audio output interface, so that the three-dimensional effect of the sound is enhanced and the viewing experience of the user is improved.
  • the display device 200 includes: a controller 250 and multiple audio output interfaces 270;
  • the controller 250 is configured to: separately perform vocal separation on the acquired audio data of the first channel and the audio data of the second channel to obtain the audio data of the first human voice of the first channel and the first background of the first channel audio data, as well as the first vocal audio data of the second channel and the first background audio data of the second channel.
  • the first channel audio data and the second channel audio data are audio data of two different channels acquired at the same time, the first channel audio data and the second channel audio data can make the sound more stereoscopic .
  • the first channel audio data and the second channel audio data may be left channel audio data and right channel audio data respectively.
  • the first human voice audio data of the first channel and the first background audio data of the first channel can be obtained through human voice separation (for example, artificial intelligence technology).
  • the first-channel first-voice audio data refers to the human voice in the first-channel audio data, and the number of the first-channel first-voice audio data can be multiple, that is, the voices of multiple people can be extracted .
  • the audio data except the first human voice audio data of the first channel is the first background audio data of the first channel.
  • vocal separation can be performed on the audio data of the second channel to obtain the first human voice audio data of the second channel and the first background audio data of the second channel.
• the separated first human voice audio data of the first channel and first human voice audio data of the second channel are not directly allocated to the first channel and the second channel and merged with the background audio; instead, the first human voice audio data of the first channel and the first human voice audio data of the second channel are directly combined to obtain the target human voice audio data. Furthermore, according to the speaking position of the person in the image, the output of the target human voice audio data on each audio output interface is allocated.
• if the vocal audio of multiple characters is included, then for each character, the first vocal audio data of the first channel and the first vocal audio data of the second channel corresponding to that character are combined to obtain the target vocal audio data of that character.
  • the allocation method of the target human voice audio data of each character is similar, and the target human voice audio data of a character is taken as an example for description here.
• the controller 250 is configured to: acquire the image data at the moment of the first channel audio data and the second channel audio data, and perform lip movement detection on the image data; if lip movement coordinates on the screen of the display device are detected, determine the vocal weight corresponding to a single audio output interface according to the lip movement coordinates and the coordinates of that audio output interface.
• in addition to the audio data being decoded by the audio decoder, the corresponding image data can also be decoded by the image decoder.
  • the image data corresponding to the audio can be acquired at the same time.
  • the image data at the moment of the first channel audio data and the second channel audio data may be acquired.
  • the image data usually has a corresponding person image. Therefore, the lip movement detection can be performed on the image data to obtain the lip movement coordinates, that is, the position coordinates of the lips of the person.
  • the lip movement coordinates can be detected if there is a moving lip.
  • the lip movement coordinates indicate the position where the person in the image speaks on the screen, and the coordinates of the multiple audio output interfaces indicate the positions where the audio is output. It can be understood that when the lip movement coordinates are closer to the audio output interface, the vocal weight corresponding to the audio output interface is also greater. The greater the weight of the human voice, the greater the energy of the human voice audio output from the audio output interface.
• the controller 250 is configured to: for each audio output interface, determine, according to the coordinates of the audio output interface, the area corresponding to the audio output interface in the screen; if the lip movement coordinates are located within the area corresponding to the audio output interface, determine the vocal weight corresponding to the audio output interface as a first value; if the lip movement coordinates are outside the area corresponding to the audio output interface, determine the vocal weight corresponding to the audio output interface as a second value, where the second value is smaller than the first value.
  • corresponding regions may be divided for each audio output interface on the screen in advance according to the coordinates of each audio output interface. It can be understood that when the lip movement coordinates are closer to the area corresponding to the audio output interface, the weight of the human voice corresponding to the audio output interface is also greater.
• the lip movement coordinates can be the position coordinates (x, y) of the actual pixel. If the row resolution of the video is L and the column resolution is C, the lip movement coordinates can be normalized by dividing each coordinate by the corresponding resolution, so that the normalized coordinates (x', y') both fall within [0, 1].
• if x' is less than 0.5, it means that the lip movement coordinates are in the left area; if x' is greater than 0.5, it means that the lip movement coordinates are in the right area.
• in this case, the vocal weight corresponding to the speaker at the lower left of the screen and the vocal weight corresponding to the speaker at the lower right of the screen can be set to 1 and 0 respectively; that is, the speaker at the lower left of the screen outputs the target vocal audio data while the speaker at the lower right of the screen does not output the target vocal audio data.
  • the voice weight corresponding to the speaker at the bottom left of the screen and the voice weight corresponding to the speaker at the bottom right of the screen can also be set to 0.8 and 0.2 respectively, which can be determined by referring to the specific position of the lip movement coordinates in the left area.
  • FIG. 14 is a schematic diagram of speaker distribution. It can be seen that the display device includes four speakers, which are respectively located at the lower left, lower right, upper left and upper right of the screen.
  • the areas corresponding to the four speakers on the screen are shown in FIG. 14 , which are the lower left area, the lower right area, the upper left area, and the upper right area of the screen.
  • the lip movement coordinates are located in the upper left area, and the vocal weights corresponding to the four speakers in the lower left, lower right, upper left, and upper right can be 0, 0, 1, and 0, respectively.
  • the vocal weights corresponding to the lower left, lower right, upper left, and upper right speakers can also be 0.2, 0, 0.8, and 0, so that the final effect is located at the upper left of the screen in terms of subjective hearing.
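• As a rough illustration of mapping lip-movement coordinates to per-speaker vocal weights for the four-speaker layout of FIG. 14, the following sketch assumes coordinates already normalized to [0, 1] with the origin at the top-left of the screen; the hard 0/1 weighting is only one option, and softer weights such as 0.8/0.2 could be returned instead.

```python
def vocal_weights(xn, yn):
    """Return the vocal weight for each of the four speakers given normalized
    lip coordinates (xn, yn), origin at the top-left of the screen."""
    left, top = xn < 0.5, yn < 0.5
    return {
        "upper_left":  1.0 if (left and top) else 0.0,
        "upper_right": 1.0 if (not left and top) else 0.0,
        "lower_left":  1.0 if (left and not top) else 0.0,
        "lower_right": 1.0 if (not left and not top) else 0.0,
    }

# vocal_weights(0.3, 0.2) puts the vocal on the upper-left speaker, matching
# the example above.
```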
  • the screen includes: a middle area and a non-middle area.
  • the controller 250 is configured to: if the lip movement coordinates are located in the non-middle area, determine the vocal weights corresponding to the plurality of audio output interfaces according to the lip movement coordinates and the coordinates of the plurality of audio output interfaces. That is, the vocal weights corresponding to the multiple audio output interfaces may be determined according to the above method.
  • the lip movement coordinates are located in the middle area, according to the coordinates of the multiple audio output interfaces and the attribute information of the multiple audio output interfaces, determine the vocal weights corresponding to the multiple audio output interfaces, wherein the attribute information includes volume and/or orientation . That is, when the lip movement coordinates are located in the middle area of the screen, the vocal weights corresponding to each audio output interface can be flexibly configured according to the volume, orientation, and positional relationship of the audio output interface, so that the final effect is based on the subjective listening experience. Ideally located in the center of the screen.
  • the speakers below the screen are oriented downward, and the speakers above the screen are oriented upward.
• the louder the volume of a speaker, the lower the corresponding vocal gain of that speaker; the lower the volume of a speaker, the larger the corresponding vocal gain of that speaker.
  • the subjective sense of hearing can be centered on the screen.
• if the volume of the four speakers is the same, the corresponding vocal gains of the four speakers can be the same.
  • the controller 250 is configured to: according to the coordinates of the audio output interface, determine that the audio output interface corresponds to the first background audio data of the first channel and/or the first background audio data of the second channel.
• as for the background audio data, since it has nothing to do with the human voice, it can be determined directly according to the coordinates of the audio output interface whether the audio output interface outputs the first background audio data of the first channel, the first background audio data of the second channel, or both the first background audio data of the first channel and the first background audio data of the second channel.
• the screen includes: a left area and a right area. If the coordinates of the audio output interface correspond to the left area, it is determined that the audio output interface corresponds to the first background audio data of the first channel; if the coordinates of the audio output interface correspond to the right area, it is determined that the audio output interface corresponds to the first background audio data of the second channel. If both the lower left and the lower right of the screen contain a speaker, corresponding to the left area and the right area respectively, the speaker at the lower left of the screen can output the first background audio data of the first channel, and the speaker at the lower right of the screen can output the first background audio data of the second channel.
  • the screen includes: a left area, a middle area, and a right area; the controller 250 is configured to: If the coordinates of the audio output interface correspond to the left area, determine that the audio output interface corresponds to the first channel first background audio Data; if the coordinates of the audio output interface correspond to the right area, determine that the audio output interface corresponds to the first background audio data of the second channel; if the coordinates of the audio output interface correspond to the middle area, determine that the audio output interface corresponds to the first channel first background audio data and second channel first background audio data.
• if the lower left, lower middle and lower right of the screen each contain a speaker, corresponding to the left area, middle area and right area respectively, then the speaker at the lower left of the screen can output the first background audio data of the first channel, the speaker at the lower middle of the screen can simultaneously output the first background audio data of the first channel and the first background audio data of the second channel, and the speaker at the lower right of the screen can output the first background audio data of the second channel.
• the controller 250 is configured to: merge the product of the target vocal audio data and the vocal weight corresponding to the audio output interface with the first background audio data of the first channel and/or the first background audio data of the second channel corresponding to the audio output interface, and perform sound effect enhancement processing to obtain the audio data corresponding to the audio output interface.
  • the human voice audio and the background audio can be combined, and the sound effect enhancement processing can be performed to obtain the audio data corresponding to the audio output interface.
  • the single audio output interface 270 is configured to: output audio data corresponding to the audio output interface.
  • different gain processing can be performed on the human voice audio and the background audio, so as to highlight and enhance the human voice audio or the background audio .
  • the controller 250 is further configured to: respectively perform gain processing on the first human voice audio data of the first channel and the first human voice audio data of the second channel according to the first gain to obtain the second human voice audio data of the first channel and the second vocal audio data of the second channel; according to the second gain, the first background audio data of the first channel and the first background audio data of the second channel are respectively subjected to gain processing to obtain the second background audio of the first channel data and the second background audio data of the second channel; wherein, the first gain and the second gain are determined according to the sound control mode corresponding to the display device.
• both the first human voice audio data of the first channel and the first human voice audio data of the second channel belong to the human voice audio and may correspond to the same first gain; the first background audio data of the first channel and the first background audio data of the second channel both belong to the background audio and may correspond to the same second gain.
• the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes; each preset sound clarity control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value.
• the sound control mode includes: a target sound clarity control mode and/or a target sound effect mode, wherein the target sound clarity control mode is one of a variety of preset sound clarity control modes, and the target sound effect mode is one of a variety of preset sound effect modes; the controller 250 is configured to: determine the first gain and the second gain according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode.
  • the user can control the sound control mode of the display device according to his/her own preference, and further, the controller 250 can determine how to perform the audio data of the first human voice in the first channel and the first human voice in the second channel according to the sound control mode. Gain processing is performed on the audio data, and how to perform gain processing on the first background audio data of the first channel and the first background audio data of the second channel.
• the method for determining the first gain and the second gain is the same as the method for determining the first gain and the second gain in the foregoing embodiments; for details, please refer to the description in the foregoing embodiments, and details are not repeated here.
• the controller 250 is configured to: combine the second human voice audio data of the first channel and the second human voice audio data of the second channel to obtain the target human voice audio data; for each audio output interface, determine, according to the coordinates of the audio output interface, that the audio output interface corresponds to the second background audio data of the first channel and/or the second background audio data of the second channel; and merge the product of the target vocal audio data and the vocal weight corresponding to the audio output interface with the second background audio data of the first channel and/or the second background audio data of the second channel corresponding to the audio output interface, and perform sound effect enhancement processing to obtain the audio data corresponding to the audio output interface.
• if the image data does not contain a person, or if the image data contains a person but the lips of the person are not displayed (for example, only the profile or the back of the person is shown), or if the character's lips are displayed but are not moving, then the lip movement coordinates cannot be detected.
• the controller 250 is also configured to: if no lip movement coordinates are detected, determine, for each audio output interface, the vocal weight corresponding to that audio output interface directly according to the energy ratio between the energy of the first human voice audio data of the first channel and the energy of the first human voice audio data of the second channel, together with the coordinates of the audio output interface.
• for example, if the energy of the left channel human voice audio data is greater than that of the right channel, the vocal weight corresponding to the speaker at the lower left of the screen can be greater than the vocal weight corresponding to the speaker at the lower right of the screen. If the ratio of the energy of the left channel human voice audio data to the energy of the right channel human voice audio data is 0.6:0.4, then the vocal weight corresponding to the speaker at the lower left of the screen can be 0.6 and the vocal weight corresponding to the speaker at the lower right of the screen can be 0.4. Alternatively, in order to further enhance the sense of direction of the sound, the vocal weight corresponding to the speaker at the lower left may be 0.7 and the vocal weight corresponding to the speaker at the lower right may be 0.3, as in the sketch below.
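• A minimal sketch of this energy-ratio fallback, assuming the per-channel vocal energies have already been measured; the emphasis exponent is only an illustrative way of obtaining the 0.7/0.3 style exaggeration and is not specified by the application.

```python
def fallback_vocal_weights(e_left, e_right, emphasis=1.0):
    """Split the target vocal between the lower-left and lower-right speakers
    by channel energy when no lip movement is detected; emphasis > 1
    exaggerates the stronger side (emphasis ~ 2 turns 0.6:0.4 into ~0.7:0.3)."""
    el, er = e_left ** emphasis, e_right ** emphasis
    w_left = el / (el + er + 1e-12)
    return w_left, 1.0 - w_left
```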
  • the karaoke function of the TV is usually completed in the singing APP.
  • the singing APP has rich functions and better user experience, but the media resources of the singing APP are relatively limited.
• for example, the original singer A of a song is a male singer and the cover singer B is a female singer. If a female user C wants to sing this song, the singing APP may only include the accompaniment video of the original singer A and have no accompaniment video of singer B, making it impossible to find a suitable accompaniment.
• the display device in some embodiments of the present application removes the human voice in the song being played through human voice separation technology, so that the user can find a favorite song without relying on the singing APP, for example by playing a familiar song through an online music player, or by playing paid audio and video content on the TV. The vocal elimination function is then turned on to remove the original vocals from the audio, so that singing is no longer restricted by media resources. At the same time, according to the energy of the singing vocals collected by the microphone, all or part of the original vocals can be added to the accompaniment, to avoid the singing experience being affected by a singer's low singing level.
  • the realization of the display device in the Android system can be shown in Figure 15A.
  • the Android system mainly includes the application layer, middleware and core layer, and the implementation logic can be in the middleware.
• the middleware includes: an audio decoder, a human voice separation module, an audio input interface, a volume control module for the original singing, a merging module, a sound effect enhancement module, a gain control module, a delay module and an audio output interface. The audio decoder, human voice separation module, merging module, sound effect enhancement module and audio output interface are the same as those described above; the volume control module for the original singing determines, based on the singing audio and the separated original vocal audio, the magnitude of the original vocal audio that is merged into the accompaniment audio, that is, the target vocal audio.
  • FIG. 15B is a schematic diagram of an audio processing method in some embodiments of the present application.
  • the audio decoder decodes and obtains the audio data of the song, it separates the human voice to obtain the audio data of the original singer's voice and the audio data of the accompaniment.
  • the microphone can collect the singing vocal audio data input by the user, and the target vocal audio data can be determined according to the original vocal audio data and the singing vocal audio data, that is, the size of the original vocal audio data merged into the accompaniment audio data .
  • the singing vocal audio data, the target vocal audio data and the accompaniment audio data are combined, and the audio effect is enhanced before being output.
  • Some embodiments of the present application also provide a display device 200, including:
  • the controller 250 is configured to: acquire song audio data, perform vocal separation on the song audio data, and obtain original singer vocal audio data and accompaniment audio data.
  • the song audio data can be any song, including songs included in the singing APP, and songs not included in the singing APP.
• vocal separation is performed on the song audio data; for example, the original vocal audio data and the accompaniment audio data can be separated by artificial intelligence technology. It can be seen that, for any song, the corresponding accompaniment audio data can be separated.
  • the controller 250 is further configured to: determine the original singing gain according to the energy of the original vocal audio data in each time period and the energy of the singing vocal audio data collected in the time period; according to the original singing gain, Gain processing is performed on the original vocal audio data within the time period to obtain target vocal audio data.
  • the user can sing a song through an audio input interface (such as a microphone).
• through the audio input interface (such as a microphone), the audio data of the singing voice can be collected, but the user may have problems such as being out of tune or having poor pitch when singing.
  • the vocal separation is calculated in real time on the main chip of the display device, and there may be problems that the vocal separation is not clean or individual noises are introduced during the separation.
• therefore, the original vocal audio obtained from the vocal separation can be fully or partially merged into the accompaniment to enhance the atmosphere of the singing scene; when it is detected that the user is singing, the original vocal audio can be reduced or muted through the volume control of the original vocal audio, so that the user's singing voice is mainly played.
  • the audio data can be processed according to a preset time period. That is, the audio data of each time period are sequentially processed in chronological order.
  • the time period may be 0.8 second, 1 second and so on.
  • the original vocal gain can be obtained according to the energy of the original vocal audio data and the energy of the singing vocal audio data, and the original vocal audio data can be gain-processed through the original vocal gain to obtain the target vocal audio data, that is, audio data merged into the accompaniment audio data.
  • the original singing gain is less than or equal to a preset gain threshold.
  • the preset gain threshold may be 0.1dB, 0dB, -0.1dB and so on.
• when the preset gain threshold is equal to 0 dB, the original singing gain is less than or equal to 0 dB. When the original singing gain is equal to 0 dB, it means that all of the original vocal audio data is merged into the accompaniment audio data; when the original singing gain is less than 0 dB, it means that the original vocal audio data is partially merged into the accompaniment audio data.
• if the preset gain threshold is less than 0 dB, the original singing gain is also less than 0 dB, which means that the original vocal audio data is partially merged into the accompaniment audio data.
• if the preset gain threshold is greater than 0 dB, it means that the original vocal audio data can be merged into the accompaniment audio data after being enhanced.
• the controller 250 is configured to: if the energy of the singing vocal audio data is less than a preset energy threshold (the preset energy threshold being a relatively small energy value), it can be considered that the user is not singing at this time, and the original singing gain is set to the preset gain threshold; for example, setting the original singing gain to 0 dB means directly using the original vocal audio data as the target vocal audio data. If the energy of the singing vocal audio data is greater than or equal to the preset energy threshold, it can be considered that the user has started to sing, and the original singing gain is determined according to the energy ratio between the energy of the singing vocal audio data and the energy of the original vocal audio data, so that the original singing gain is less than the preset gain threshold; that is, the energy of the original vocal audio data is reduced before it is used as the target vocal audio data.
• a correspondence between the original singing gain and the energy ratio between the energy of the singing vocal audio data and the energy of the original vocal audio data can be established in advance; for example, when the energy ratio is within a certain energy ratio range, the original singing gain may correspond to the same value, as in the sketch below.
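• A hedged Python sketch of the per-period original-singing gain decision described above; the energy threshold, the 0 dB cap and the ratio-to-attenuation mapping are illustrative assumptions rather than values given in the application.

```python
import numpy as np

def original_gain_db(sing_frame, original_frame,
                     energy_threshold=1e-4, gain_cap_db=0.0):
    """Return the gain (in dB, capped at gain_cap_db) applied to the original
    vocal for this time period."""
    e_sing = float(np.mean(sing_frame ** 2))
    e_orig = float(np.mean(original_frame ** 2)) + 1e-12
    if e_sing < energy_threshold:
        return gain_cap_db                       # user not singing: keep full original vocal
    ratio = e_sing / e_orig                      # louder singing -> stronger attenuation
    return min(gain_cap_db, -6.0 * min(ratio, 4.0))

def apply_gain(audio, gain_db):
    return audio * (10.0 ** (gain_db / 20.0))    # target vocal audio data
```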
  • the controller 250 is configured to: combine the accompaniment audio data, the target vocal audio data and the singing vocal audio data within the time period and perform sound effect enhancement processing to obtain the target audio data.
• that is, in addition to the accompaniment audio data and the singing vocal audio data, the target vocal audio data is also combined.
  • the target vocal audio data refers to all or part of the original vocal audio data. Therefore, the final output target audio data is richer and more effective.
  • the audio output interface 270 is configured to: output target audio data.
  • the accompaniment audio data can be obtained by separating the human voice, so that the user is not limited by media resources when singing.
• the controller 250 is further configured to: obtain the original singing gain corresponding to the previous time period; if the original singing gain corresponding to the current time period is the same as the original singing gain corresponding to the previous time period, it means that the energy ratio between the energy of the singing vocal audio data and the energy of the original vocal audio data in the previous time period differs little from the energy ratio corresponding to the current time period (for example, the two fall within the same energy ratio range), indicating that the user's singing is relatively stable and the user is familiar with the song being sung; in this case, the time period can be extended to reduce the processing frequency of the above process, provided that the extended time period remains less than the first time threshold (for example, 2 seconds).
  • the processing frequency of the above process is reduced, instead of frequently merging the target vocal audio data obtained based on the original vocal audio data into the accompaniment audio data during singing intervals.
  • the time period cannot be extended indefinitely, so as to avoid affecting the final singing effect due to too long time period.
• if the original singing gain corresponding to the current time period is different from the original singing gain corresponding to the previous time period, it means that the volume of the user's singing has changed relative to the original singer, and the user may be unable to sing or may be singing inaccurately. In this case, the time period is shortened, that is, the target vocal audio data is produced more quickly and merged into the accompaniment audio data, provided that the shortened time period remains greater than the second time threshold (for example, 0.25 seconds), where the first time threshold is greater than the second time threshold.
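• The adaptive time period logic can be sketched as follows; the doubling/halving step is an assumption, and only the 2-second and 0.25-second bounds come from the example above.

```python
def next_period(period_s, gain_now_db, gain_prev_db,
                t_max=2.0, t_min=0.25):
    """Lengthen the analysis period while the original-singing gain is stable,
    shorten it when the gain changes, within [t_min, t_max]."""
    if gain_now_db == gain_prev_db:
        return min(period_s * 2.0, t_max)        # stable singing: process less often
    return max(period_s / 2.0, t_min)            # unstable singing: react faster
```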
  • the above audio processing process can improve the effect of the accompaniment when singing.
• in a professional singing APP, in addition to the music library obtained by subtracting the left and right channel audio data, there are also many professional accompaniment music libraries.
• such an accompaniment library is not obtained by subtracting the left and right channel audio data to eliminate the original singer's audio; instead, the accompaniment audio data is recorded in a separate audio track when the music is recorded.
• for many songs, in addition to the accompaniment, there are also some professional backing vocals.
• with human voice separation, all human voices can be identified and eliminated. Although this approximates the effect of a dedicated accompaniment track, the harmony of the backing vocalists is also eliminated, so the remaining accompaniment lacks a sense of atmosphere.
  • human voice separation is to separate the signal that belongs to the human voice from the original audio signal.
• the sound of the human voice and of the instruments overlaps in the frequency domain, so when the human voice is separated, the instrumental sounds that overlap with the human voice are also stripped out together.
  • the separated original vocal audio data can be transformed to obtain the accompaniment audio data, and then the accompaniment audio data is merged into the accompaniment at a certain ratio to make up for the hollowness of the accompaniment.
  • the ratio is related to the energy of the singing vocal audio data. Specifically, when the energy of the singing vocal audio data increases, the ratio also increases, and when the singing voice decreases, the ratio also decreases.
• the controller 250 is further configured to: generate the first accompaniment audio data according to the original vocal audio data in each time period.
• if the energy of the singing vocal audio data is less than the preset energy threshold, it means that the user is not singing, or the singing voice is extremely low, and all of the original vocal audio data can be merged into the accompaniment audio data; in this case, the first accompaniment audio data may not be generated. Therefore, in some embodiments, the first accompaniment audio data is generated according to the original vocal audio data in each time period only when the energy of the singing vocal audio data is greater than or equal to the preset energy threshold.
  • time-domain transformation may be performed on the original vocal audio data to generate the first accompaniment audio data.
• the controller 250 is configured to: obtain a plurality of different delays and a gain corresponding to each delay; for each delay, perform delay processing on the original vocal audio data in each time period according to the delay to obtain first delayed audio data; perform gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data; and merge the plurality of second delayed audio data to obtain the first accompaniment audio data.
  • FIG. 16 is a schematic diagram of performing time-domain transformation on original vocal audio data in some embodiments of the present application.
• for example, T1 is 10 ms, T2 is 20 ms, T3 is 30 ms, and so on; gain 1 is 0 dB, gain 2 is -6 dB, gain 3 is -10 dB, and so on.
  • the original vocal audio data in each time period may be delayed according to the delay to obtain the first delayed audio data.
  • gain processing is performed on the delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data.
  • the original vocal audio data can be delayed according to 10ms to obtain the first delayed audio data, and the first delayed audio data can be gain-processed according to 0dB to obtain the second delayed audio data.
  • T2, T3... are processed in the same manner, and the corresponding second delayed audio data can be obtained.
  • the multiple second delayed audio data are combined to obtain the first accompaniment audio data.
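• A minimal numpy sketch of this time-domain transformation, using the example T1/T2/T3 delays and gain 1/2/3 values of FIG. 16; the sample-rate handling and the final truncation are illustrative.

```python
import numpy as np

def time_domain_accompaniment(vocal, sample_rate,
                              delays_ms=(10, 20, 30),
                              gains_db=(0.0, -6.0, -10.0)):
    """Sum several delayed, attenuated copies of the original vocal to form
    the first accompaniment audio data."""
    out = np.zeros(len(vocal) + int(sample_rate * max(delays_ms) / 1000.0))
    for d_ms, g_db in zip(delays_ms, gains_db):
        d = int(sample_rate * d_ms / 1000.0)
        out[d:d + len(vocal)] += vocal * (10.0 ** (g_db / 20.0))
    return out[:len(vocal)]
```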
  • frequency domain transformation may also be performed on the original vocal audio data to generate the first accompaniment audio data.
  • the controller 250 is configured to: determine the sound zone to which the audio data of the original vocalist belongs; perform pitch-up processing or pitch-down processing on the audio data of the original vocalist according to the sound zone, to obtain the first accompaniment audio data.
• in this way, an accompaniment can be formed, and the accompaniment and the original singing are not in the same pitch.
• this is similar to professional recordings, in which there is a professional backing team whose voices are not in the same part as the original singer; for example, they may be 3 degrees higher or lower than the original singer.
  • FIG. 17 is a schematic diagram of performing frequency-domain transformation on original vocal audio data in some embodiments of the present application.
  • the fundamental frequency analysis is to perform FFT (Fast Fourier Transform) on the human voice to find the first peak, and the peak frequency is the fundamental frequency.
• the singer's current pitch can thus be known; for example, the frequency of middle C ("do") is 261.6 Hz. Based on the calculated pitch of the current voice, the frequency corresponding to the raised or lowered pitch can be calculated.
• the algorithm principle of raising or lowering by 3 degrees can be described with reference to the piano keyboard. If the current original vocal audio data belongs to middle C, that is, C4, raising it by 3 degrees gives the white key E4, with a total of 4 semitones in between; that is, the pitch of the current voice is changed and the frequency is multiplied by 2^(4/12) (about 1.26). If the pitch of the original vocal audio data is B3, raising it by 3 degrees gives D4, a total of 3 semitones; that is, the frequency is multiplied by 2^(3/12) (about 1.19).
• in some embodiments, pitch-up processing or pitch-down processing may be performed on the original vocal audio data according to the singing habits of ordinary singers. For non-professional singers, the bass is often not low enough and the treble not high enough; therefore, in order to compensate for this, the following processing may be adopted.
• the controller 250 is configured to: if the sound zone is a low-pitched zone, perform pitch-down processing on the original vocal audio data to obtain the first accompaniment audio data; if the sound zone is a high-pitched zone, perform pitch-up processing on the original vocal audio data to obtain the first accompaniment audio data; if the sound zone is a mid-range zone, perform pitch-up and pitch-down processing on the original vocal audio data to obtain first vocal audio data and second vocal audio data respectively, and use the first vocal audio data and the second vocal audio data together as the first accompaniment audio data.
• that is, when the original vocal audio data is lower than a certain bass pitch, the pitch-down operation is started, and when the original vocal audio data is higher than a certain high pitch, the pitch-up operation is started.
• for example, when the pitch-up operation is started, the gain controlling the pitch-down operation is set to the minimum (that is, mute) and the gain controlling the pitch-up operation is set to 0 dB; that is, the generated first accompaniment audio data includes only the pitch-raised audio data.
• in the mid-range case, the gains of the pitch-up operation and the pitch-down operation can both be -6 dB; that is, the generated first accompaniment audio data includes both the pitch-raised audio data and the pitch-lowered audio data.
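• A small sketch of the fundamental-frequency estimate and the equal-temperament shift ratio implied by the semitone arithmetic above; the FFT windowing and the 60–1000 Hz search range are assumptions.

```python
import numpy as np

def fundamental_hz(frame, sample_rate):
    """Rough fundamental estimate: the strongest FFT peak of a voiced frame
    within a typical singing range."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    voiced = (freqs > 60.0) & (freqs < 1000.0)
    return freqs[voiced][np.argmax(spectrum[voiced])]

def shift_ratio(semitones):
    """Equal-temperament frequency ratio, e.g. +4 semitones (C4 -> E4) = 2**(4/12)."""
    return 2.0 ** (semitones / 12.0)

# Raising a 261.6 Hz (C4) vocal by a major third targets
# 261.6 * shift_ratio(4) ~= 329.6 Hz (E4).
```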
• if the generated accompaniment is merged in at too high a level, the original accompaniment style and timbre may be affected. The purpose of the accompaniment is to enrich and beautify the singing voice when the singing voice exists; therefore, the energy of the accompaniment audio data finally incorporated into the accompaniment audio data can be smaller than the energy of the singing vocal audio data, for example, 12 dB smaller.
• the controller 250 is configured to: determine the accompaniment gain according to the energy of the singing vocal audio data collected within the time period, the accompaniment gain being positively correlated with the energy of the singing vocal audio data; and perform gain processing on the first accompaniment audio data using the accompaniment gain to obtain the second accompaniment audio data, wherein the energy of the second accompaniment audio data is smaller than the energy of the singing vocal audio data.
• the greater the energy of the singing vocal audio data, the greater the energy of the accompaniment audio data that is finally merged into the accompaniment audio data; that is, the two energies are positively correlated.
• the calculation method of the accompaniment gain is not limited to this, and the above relationship can be simply modified to calculate the accompaniment gain.
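• Since the original formula is not reproduced here, the accompaniment gain can be sketched as follows: the synthesized accompaniment tracks the singing energy but stays a fixed margin (12 dB in the example above) below it; this is only one possible realization.

```python
import numpy as np

def accompaniment_gain_db(sing_frame, accompaniment_frame, margin_db=12.0):
    """Return the dB gain that keeps the synthesized accompaniment margin_db
    below the singing vocal for this time period (so the gain grows and
    shrinks with the singing energy)."""
    e_sing = 10.0 * np.log10(float(np.mean(sing_frame ** 2)) + 1e-12)
    e_acc = 10.0 * np.log10(float(np.mean(accompaniment_frame ** 2)) + 1e-12)
    return (e_sing - margin_db) - e_acc
```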
  • the controller 250 is configured to: combine the accompaniment audio data, the second accompaniment audio data, the target vocal audio data and the singing vocal audio data within a time period and perform sound effect enhancement processing to obtain the target audio data.
• the present application further provides an audio processing method. It can be understood that, in actual implementation, the steps involved in FIG. 18 to FIG. 21 may include more or fewer steps, and the order of these steps may also be different, as long as the audio processing method provided in the embodiments of the present application can be realized.
  • FIG. 18 is a flowchart of an audio processing method in some embodiments of the present application, which may include the following steps:
  • Step S1810 performing sound separation on the acquired first audio data to obtain first target audio data and first background audio data.
  • Step S1820 performing gain processing on the first target audio data according to the first gain to obtain second target audio data, and performing gain processing on the first background audio data according to the second gain to obtain second background audio data.
  • the first gain and the second gain are determined according to the sound control mode corresponding to the display device.
  • Step S1830 combining the second target audio data and the second background audio data, and performing sound effect enhancement processing, to obtain and output the second audio data.
• after the first target audio data and the first background audio data are separated from the first audio data, the first target audio data can be subjected to gain processing according to the first gain to obtain the second target audio data, and the first background audio data can be subjected to gain processing according to the second gain to obtain the second background audio data. The second target audio data and the second background audio data are then combined and subjected to sound effect enhancement processing to obtain and output the second audio data. Since the first gain and the second gain are determined according to the sound control mode corresponding to the display device, the first target audio data and the first background audio data can be combined after non-proportional gain processing based on the user's current viewing needs, and the first target audio data or the first background audio data can be enhanced according to the viewing requirements of the user, thereby improving the effect of sound effect enhancement.
  • the above audio processing method further includes: according to the sound control mode, determining the type of the sound effect enhancement mode corresponding to the first audio data; according to the sound control mode, determining the first gain and the second gain corresponding to the type of the sound effect enhancement mode Two gains.
• the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes; each preset sound clarity control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value.
• the sound control mode includes: a target sound clarity control mode and/or a target sound effect mode, wherein the target sound clarity control mode is one of a variety of preset sound clarity control modes and the target sound effect mode is one of a variety of preset sound effect modes. Determining the type of the sound effect enhancement mode corresponding to the first audio data according to the sound control mode includes: determining the type of the sound effect enhancement mode corresponding to the first audio data according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode. Determining the first gain and the second gain corresponding to the type of the sound effect enhancement mode according to the sound control mode includes: determining the first gain and the second gain corresponding to the type of the sound effect enhancement mode according to the first value and/or the second value.
• determining the first gain and the second gain corresponding to the type of the sound effect enhancement mode includes: if the type of the sound effect enhancement mode corresponding to the first audio data is the sound enhancement mode, the first gain is greater than the second gain; if the type of the sound effect enhancement mode corresponding to the first audio data is the background enhancement mode, the first gain is smaller than the second gain.
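• A toy illustration of this ordering constraint; only the inequality between the first and second gains comes from the application, while the mode names and the ±3 dB values are assumptions.

```python
def mode_gains_db(mode, step_db=3.0):
    """Return (first_gain_db, second_gain_db) for a sound effect enhancement
    mode type; values are illustrative."""
    if mode == "voice_enhancement":
        return +step_db, -step_db     # first gain > second gain
    if mode == "background_enhancement":
        return -step_db, +step_db     # first gain < second gain
    return 0.0, 0.0                   # neutral / unrecognized mode
```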
• in some embodiments, the first audio data includes at least one third target audio data belonging to a preset sound type. The above audio processing method also includes: separating at least one third target audio data and third background audio data from the first audio data; obtaining the first energy value of the first channel initial target audio data of a single third target audio data and the second energy value of the second channel initial target audio data; performing gain processing on the first channel initial target audio data according to the third gain to obtain the first channel first gain audio data; performing gain processing on the second channel initial target audio data according to the fourth gain to obtain the second channel first gain audio data, wherein the third gain and the fourth gain are determined according to the first energy value and the second energy value; merging the first channel first gain audio data with the first channel initial background audio data of the third background audio data and performing sound effect enhancement processing to obtain and output the first channel first enhanced audio data; and merging the second channel first gain audio data with the second channel initial background audio data of the third background audio data and performing sound effect enhancement processing to obtain and output the second channel first enhanced audio data.
• the above audio processing method further includes: determining, according to the sound control mode, the first energy value and the second energy value, the fifth gain and the sixth gain corresponding to the single third target audio data; determining the seventh gain according to the sound control mode; performing gain processing on the first channel initial target audio data according to the fifth gain to obtain the first channel second gain audio data; performing gain processing on the second channel initial target audio data according to the sixth gain to obtain the second channel second gain audio data; performing gain processing on the first channel initial background audio data and the second channel initial background audio data respectively according to the seventh gain to obtain the first channel gain background audio data and the second channel gain background audio data; merging the first channel second gain audio data with the first channel gain background audio data and performing sound effect enhancement processing to obtain and output the first channel second enhanced audio data; and merging the second channel second gain audio data with the second channel gain background audio data and performing sound effect enhancement processing to obtain and output the second channel second enhanced audio data.
• in some embodiments, determining the fifth gain and the sixth gain corresponding to the single third target audio data according to the sound control mode, the first energy value and the second energy value includes: determining the type of the sound effect enhancement mode corresponding to the first audio data according to the sound control mode; determining the energy magnitude relationship between the left and right channels according to the first energy value of the first channel initial target audio data and the second energy value of the second channel initial target audio data; and determining the fifth gain and the sixth gain corresponding to the type of the sound effect enhancement mode and to the energy magnitude relationship between the left and right channels. Determining the seventh gain according to the sound control mode includes: determining, according to the sound control mode, the seventh gain corresponding to the type of the sound effect enhancement mode and to the energy magnitude relationship between the left and right channels.
  • FIG. 19 is a flowchart of an audio processing method in some embodiments of the present application, which may include the following steps:
  • Step S1910 respectively performing sound separation and sound effect enhancement processing on the acquired first audio data to obtain first target audio data and second audio data.
• Step S1920 performing gain processing on the first target audio data according to the first gain to obtain second target audio data, and performing gain processing on the second audio data according to the second gain to obtain third audio data, wherein the first gain and the second gain are determined according to the sound control mode corresponding to the display device.
  • Step S1930 performing delay processing on the second target audio data or the third audio data, so as to synchronize the second target audio data and the third audio data.
  • Step S1940 combining the second target audio data and the third audio data to obtain and output fourth audio data.
  • the time consumed by the sound separation algorithm can be reduced by half.
  • sound separation and sound effect enhancement can be processed in parallel instead of serial processing, which can further shorten the time consumed by the entire audio processing process, thereby improving the effect of audio and video synchronization.
• delay processing is performed on the second target audio data or the third audio data; for example, delay processing can be performed in whichever of the sound effect enhancement link and the sound separation link consumes less computing time, so that the second target audio data and the third audio data are synchronized before being merged. This avoids echo problems and improves the audio and video synchronization effect without reducing the effect of the sound effect enhancement.
  • performing delay processing on the second target audio data or the third audio data includes: obtaining the first duration consumed during sound separation and the second duration consumed during sound effect enhancement processing; according to the first duration and the second duration, delay processing is performed on the second target audio data or the third audio data.
  • performing delay processing on the second target audio data or the third audio data includes: determining the first target audio data and the second target audio data according to the correlation between the first target audio data and the second audio data. The time difference between the audio data; according to the time difference, delay processing is performed on the second target audio data or the third audio data.
  • determining the time difference between the first target audio data and the second audio data includes: acquiring a first audio segment of the first target audio data within a time period t; acquiring a second audio segment of the second audio data within the time period t, a plurality of third audio segments before the second audio segment, and a plurality of fourth audio segments after the second audio segment, wherein the durations of the third audio segments and the fourth audio segments are each equal to the duration of the time period t; determining the correlations between the first audio segment and, respectively, the second audio segment, the third audio segments and the fourth audio segments, and determining the audio segment with the highest correlation; and determining the time difference between the audio segment with the highest correlation and the first audio segment as the time difference between the first target audio data and the second audio data.
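One possible realization of the correlation search described above is sketched below in Python/NumPy; the window length, the search range and the assumption that both signals are sample-aligned one-dimensional arrays are illustrative only.

    import numpy as np

    def estimate_offset(separated, enhanced, start, win, search):
        """Offset (in samples) of the best-matching window of `enhanced`
        relative to the window of `separated` that starts at `start`.
        Assumes search <= start and start + win + search <= len(enhanced)."""
        ref = separated[start:start + win]
        best_off, best_corr = 0, -np.inf
        for off in range(-search, search + 1):
            seg = enhanced[start + off:start + off + win]
            corr = float(np.dot(ref, seg))            # correlation of the two windows
            if corr > best_corr:
                best_corr, best_off = corr, off
        return best_off

A positive offset would indicate that the enhancement path lags, so the separation-path output would be delayed by that many samples before the two paths are merged.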
  • the first audio data includes initial audio data of the first channel and initial audio data of the second channel; performing sound effect enhancement processing on the first audio data to obtain the second audio data includes: performing sound effect enhancement processing on the first channel initial audio data and the second channel initial audio data respectively, to obtain first channel sound effect enhanced audio data and second channel sound effect enhanced audio data; performing gain processing on the second audio data according to the second gain to obtain the third audio data includes: performing gain processing on the first channel sound effect enhanced audio data and the second channel sound effect enhanced audio data respectively according to the second gain, to obtain first channel target audio data and second channel target audio data; performing delay processing on the second target audio data or the third audio data so that the second target audio data and the third audio data are synchronized includes: performing delay processing on the second target audio data or the first channel target audio data, so that the second target audio data and the first channel target audio data are synchronized; and performing delay processing on the second target audio data or the second channel target audio data, so that the second target audio data and the second channel target audio data are synchronized; combining the second target audio data and the third audio data to obtain the fourth audio data includes: combining the second target audio data with the first channel target audio data and with the second channel target audio data respectively, to obtain first channel combined audio data and second channel combined audio data.
  • the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes; each preset sound clarity control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value;
  • the sound control mode includes: the target sound clarity control mode and/or the target sound effect mode; wherein, the target sound clarity control mode is one of a variety of preset sound clarity control modes, and the target sound effect mode is a variety of One of the preset sound effect modes;
  • the method further includes: determining the first gain and the second gain according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode, wherein the first gain is greater than the second gain.
  • determining the first gain and the second gain according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode includes: setting the first gain to 0dB; The first numerical value corresponding to the target sound clarity control mode and/or the second numerical value corresponding to the target sound effect mode determine the second gain so that the second gain is less than 0 dB.
  • FIG. 20 is another flowchart of an audio processing method in some embodiments of the present application, which is applied to a display device and may include the following steps:
  • Step S2010 performing vocal separation on the acquired first channel audio data and second channel audio data respectively, to obtain the first human voice audio data of the first channel and the first background audio data of the first channel, as well as the first human voice audio data of the second channel and the first background audio data of the second channel.
  • Step S2020 combining the first human voice audio data of the first channel and the first human voice audio data of the second channel to obtain target human voice audio data.
  • Step S2030 acquiring the image data at the moment of the first channel audio data and the second channel audio data, and performing lip movement detection on the image data; if lip movement coordinates on the screen of the display device are detected, determining the vocal weights respectively corresponding to the multiple audio output interfaces according to the lip movement coordinates and the coordinates of the multiple audio output interfaces of the display device.
  • Step S2040 for each audio output interface, according to the coordinates of the audio output interface, determine that the audio output interface corresponds to the first background audio data of the first channel and/or the first background audio data of the second channel.
  • Step S2050 combining the product of the target vocal audio data and the vocal weight corresponding to the audio output interface, and the first background audio data of the first channel and/or the first background audio data of the second channel corresponding to the audio output interface, And perform sound effect enhancement processing to obtain audio data corresponding to the audio output interface, and output the audio data through the audio output interface.
  • in a stereo scene, after vocal separation is performed on the first channel audio data and the second channel audio data respectively, the separated first human voice audio data of the first channel may first be combined with the first human voice audio data of the second channel to obtain the target human voice audio data, and the target human voice audio data is used as the human voice audio to be output.
  • then, according to the position at which the person speaks in the image, the vocal weight corresponding to each audio output interface, that is, the weight of the output human voice audio, is adjusted; and according to the position of each audio output interface, the weight of the background audio output by that audio output interface is adjusted, so that the three-dimensional effect of the sound is enhanced and the viewing experience of the user is improved.
  • the above audio processing method further includes: performing gain processing on the first human voice audio data of the first channel and the first human voice audio data of the second channel respectively according to the first gain, to obtain the second human voice audio data of the first channel and the second human voice audio data of the second channel; performing gain processing on the first background audio data of the first channel and the first background audio data of the second channel respectively according to the second gain, to obtain the second background audio data of the first channel and the second background audio data of the second channel; wherein the first gain and the second gain are determined according to the sound control mode corresponding to the display device; combining the first human voice audio data of the first channel and the first human voice audio data of the second channel to obtain the target human voice audio data includes: combining the second human voice audio data of the first channel and the second human voice audio data of the second channel to obtain the target human voice audio data; for each audio output interface, determining, according to the coordinates of the audio output interface, that the audio output interface corresponds to the first channel first background audio data and/or the second channel first background audio data includes: determining, according to the coordinates of the audio output interface, that the audio output interface corresponds to the first channel second background audio data and/or the second channel second background audio data; and combining the product of the target human voice audio data and the vocal weight corresponding to the audio output interface with the first channel second background audio data and/or the second channel second background audio data corresponding to the audio output interface, and performing sound effect enhancement processing, to obtain the audio data corresponding to the audio output interface.
  • the above sound effect processing method further includes: if no lip movement coordinates are detected, for each audio output interface, determining the vocal weight corresponding to the audio output interface according to the ratio of the energy of the first human voice audio data of the first channel to the energy of the first human voice audio data of the second channel, and according to the coordinates of the audio output interface.
  • the screen includes: a left area, a middle area and a right area; according to the coordinates of the audio output interface, it is determined that the audio output interface corresponds to the first background audio data of the first channel and/or the first background audio of the second channel Data, including: if the coordinates of the audio output interface correspond to the left area, determine that the audio output interface corresponds to the first channel first background audio data; if the coordinates of the audio output interface correspond to the right area, determine that the audio output interface corresponds to the second channel First background audio data; if the coordinates of the audio output interface correspond to the middle area, determine that the audio output interface corresponds to the first background audio data of the first channel and the first background audio data of the second channel.
  • the screen includes: a middle area and a non-middle area; according to the coordinates of the lip movement and the coordinates of the multiple audio output interfaces of the display device, the vocal weights corresponding to the multiple audio output interfaces are determined, including: if the lip movement If the coordinates are not in the middle area, the vocal weights corresponding to the multiple audio output interfaces are determined according to the lip movement coordinates and the coordinates of the multiple audio output interfaces; if the lip movement coordinates are located in the middle area, according to the coordinates of the multiple audio output interfaces and the multiple attribute information of each audio output interface, and determine the vocal weights corresponding to the multiple audio output interfaces, wherein the attribute information includes volume and/or orientation.
  • for each audio output interface, the area corresponding to the audio output interface in the screen is determined according to the coordinates of the audio output interface; if the lip movement coordinates are located in the area corresponding to the audio output interface, the vocal weight corresponding to the audio output interface is determined to be a first value; if the lip movement coordinates are located outside the area corresponding to the audio output interface, the vocal weight corresponding to the audio output interface is determined to be a second value, the second value being smaller than the first value.
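A minimal Python sketch of the region-based vocal-weight assignment is given below; the screen partitioning, the two weight values and the helper name are illustrative assumptions.

    def vocal_weights(lip_xy, speaker_regions, first_value=1.0, second_value=0.0):
        """lip_xy          : normalized lip movement coordinates (x', y') in [0, 1]
           speaker_regions : dict mapping speaker name -> region as (x0, x1, y0, y1)
           Returns a dict of per-speaker vocal weights."""
        x, y = lip_xy
        weights = {}
        for name, (x0, x1, y0, y1) in speaker_regions.items():
            inside = x0 <= x < x1 and y0 <= y < y1
            weights[name] = first_value if inside else second_value
        return weights

    # Example: two bottom speakers, lips detected in the left half of the screen.
    weights = vocal_weights((0.3, 0.9),
                            {"bottom_left":  (0.0, 0.5, 0.0, 1.0),
                             "bottom_right": (0.5, 1.0, 0.0, 1.0)})
    # -> {"bottom_left": 1.0, "bottom_right": 0.0}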
  • the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes; each preset sound clarity control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value;
  • the sound control mode includes: the target sound clarity control mode and/or the target sound effect mode; wherein, the target sound clarity control mode is one of a variety of preset sound clarity control modes, and the target sound effect mode is a variety of One of the preset sound effect modes; the above audio processing method further includes: determining the first gain and the second gain according to the first value corresponding to the target sound clarity control mode and/or the second value corresponding to the target sound effect mode.
  • Some embodiments of the present application also provide an audio processing method, which can realize singing without being limited by media resources through vocal separation. At the same time, according to the energy of the singing vocals collected by the microphone, all or part of the original vocals can be added to the accompaniment to avoid affecting the singing experience due to the singer's low singing level.
  • FIG. 21 is another flowchart of an audio processing method in some embodiments of the present application, which is applied to a display device and may include the following steps:
  • Step S2110 acquiring the audio data of the song, performing vocal separation on the audio data of the song, and obtaining the audio data of the original vocalist and the audio data of the accompaniment.
  • Step S2120 determining the original singer gain according to the energy of the original singer's vocal audio data in each time period and the energy of the singing vocal audio data collected in the time period, and performing gain processing on the original singer's vocal audio data in the time period according to the original singer gain, to obtain the target vocal audio data.
  • Step S2130 combining the accompaniment audio data, the target vocal audio data and the singing vocal audio data in each time period, and performing sound effect enhancement processing, to obtain and output the target audio data.
  • the original vocal audio data and the accompaniment audio data can be obtained by separating the human voice.
  • the original vocal gain is determined according to the energy of the real-time collected singing vocal audio data and the energy of the original vocal audio data, and gain processing is performed on the original vocal audio data according to the original vocal gain to obtain target vocal audio data.
  • since the original singer gain is determined according to the energy of the singing vocal audio data and the energy of the original vocal audio data, the target vocal audio data is merged into the accompaniment audio data according to the user's singing situation; that is, all of the original vocal audio data may be merged into the accompaniment audio data, or only part of the original vocal audio data may be merged into the accompaniment audio data, so as to improve the accompaniment effect when the user sings and to improve the user experience.
  • the original singing gain is less than or equal to a preset gain threshold.
  • determining the original singer gain according to the energy of the original vocal audio data in each time period and the energy of the singing vocal audio data collected in the time period includes: if the energy of the singing vocal audio data is less than a preset energy threshold, setting the original singer gain to a preset gain threshold; if the energy of the singing vocal audio data is greater than or equal to the preset energy threshold, determining the original singer gain according to the energy ratio between the energy of the singing vocal audio data and the energy of the original vocal audio data, so that the original singer gain is less than the preset gain threshold.
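The sketch below illustrates, in Python, one way the original singer gain could be derived for a single time period; the energy threshold, the gain cap and the logarithmic attenuation curve are assumptions, since the text above only requires the gain to stay at or below the preset gain threshold once singing is detected.

    import math

    def original_singer_gain(sing_energy, orig_energy,
                             energy_threshold=1e-6, gain_threshold_db=0.0):
        """Gain in dB applied to the separated original vocal for one period."""
        if sing_energy < energy_threshold:         # the user is not (audibly) singing
            return gain_threshold_db               # keep the original vocal as it is
        ratio = sing_energy / max(orig_energy, energy_threshold)
        # Louder singing relative to the original vocal -> stronger attenuation.
        return min(gain_threshold_db, -6.0 * math.log2(1.0 + ratio))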
  • the above sound effect processing method further includes: obtaining the original singer gain corresponding to the previous time period; if the original singer gain corresponding to the current time period is the same as the original singer gain corresponding to the previous time period, extending the time period, provided that the extended time period remains less than a first time threshold; if the original singer gain corresponding to the current time period is different from the original singer gain corresponding to the previous time period, shortening the time period, provided that the shortened time period remains greater than a second time threshold, wherein the first time threshold is greater than the second time threshold.
  • the above sound effect processing method further includes: generating first backing-vocal audio data according to the original vocal audio data in each time period; determining a backing-vocal gain according to the energy of the singing vocal audio data collected in the time period, wherein the backing-vocal gain is positively correlated with the energy of the singing vocal audio data collected in the time period; performing gain processing on the first backing-vocal audio data with the backing-vocal gain to obtain second backing-vocal audio data, wherein the energy of the second backing-vocal audio data is less than the energy of the singing vocal audio data; combining the accompaniment audio data, the target vocal audio data and the singing vocal audio data within the time period and performing sound effect enhancement processing to obtain the target audio data specifically includes: combining the accompaniment audio data, the second backing-vocal audio data, the target vocal audio data and the singing vocal audio data within the time period, and performing sound effect enhancement processing, to obtain the target audio data.
  • generating the first backing-vocal audio data according to the original vocal audio data in each time period includes: obtaining a plurality of different delays and a gain corresponding to each delay; for each delay, delaying the original vocal audio data in the time period according to the delay to obtain first delayed audio data, and performing gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data; and combining the plurality of second delayed audio data to obtain the first backing-vocal audio data.
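A minimal Python/NumPy sketch of the delay-and-sum generation of the first backing-vocal audio data follows; the tap values (10/20/30 ms at 0/-6/-10 dB) reuse the illustrative figures given later in the description, and the function name and the array representation are assumptions.

    import numpy as np

    def backing_from_delays(original_vocal, sample_rate,
                            taps=((0.010, 0.0), (0.020, -6.0), (0.030, -10.0))):
        """original_vocal: one period of the original vocal as a NumPy array.
           taps: (delay in seconds, gain in dB) pairs."""
        max_offset = int(max(delay for delay, _ in taps) * sample_rate)
        out = np.zeros(len(original_vocal) + max_offset)
        for delay_s, gain_db in taps:
            offset = int(delay_s * sample_rate)
            out[offset:offset + len(original_vocal)] += (
                original_vocal * (10.0 ** (gain_db / 20.0)))
        return out

Summing several delayed and progressively attenuated copies approximates the room-like reverberation effect mentioned in the description.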
  • generating the first backing-vocal audio data according to the original vocal audio data in each time period includes: determining the register (pitch range) to which the original vocal audio data belongs; and performing pitch-up processing or pitch-down processing on the original vocal audio data according to the register, to obtain the first backing-vocal audio data.
  • performing pitch-up processing or pitch-down processing on the original vocal audio data according to the register includes: if the register is a low register, performing pitch-down processing on the original vocal audio data to obtain the first backing-vocal audio data; if the register is a high register, performing pitch-up processing on the original vocal audio data to obtain the first backing-vocal audio data; if the register is a middle register, performing pitch-up processing and pitch-down processing on the original vocal audio data to obtain first human voice audio data and second human voice audio data respectively, and using the first human voice audio data and the second human voice audio data as the first backing-vocal audio data.
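The register decision and the pitch-shift ratios can be sketched as follows in Python; the C4/C5 boundaries and the four-semitone interval follow the worked example given later in the description, while the actual waveform resampling is assumed to be handled by an external pitch shifter.

    def shift_plan(fundamental_hz, low_hz=261.6, high_hz=523.3, semitones=4):
        """Return the list of frequency ratios to apply to the original vocal."""
        ratio_up = 2.0 ** (semitones / 12.0)         # raise by `semitones` semitones
        ratio_down = 2.0 ** (-semitones / 12.0)      # lower by `semitones` semitones
        if fundamental_hz < low_hz:                  # low register: pitch down only
            return [ratio_down]
        if fundamental_hz > high_hz:                 # high register: pitch up only
            return [ratio_up]
        return [ratio_up, ratio_down]                # middle register: generate both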
  • Some embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the above audio processing method is realized.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

本申请一些实施方式公开一种显示设备及音频处理方法,应用于音频处理技术领域,显示设备包括:控制器,被配置为:对获取到的第一音频数据进行声音分离,得到第一目标音频数据和第一背景音频数据(S1810);按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据,按照第二增益对第一背景音频数据进行增益处理,得到第二背景音频数据(S1820);其中,第一增益和第二增益根据显示设备对应的声音控制模式确定;将第二目标音频数据和第二背景音频数据进行合并,并进行音效增强处理,得到第二音频数据(S1830);音频输出接口,被配置为:输出第二音频数据。

Description

显示设备及音频处理方法
本申请要求于2022年1月27日提交的、申请号为202210102896.9;于2022年1月27日提交的、申请号为202210102852.6;于2022年1月27日提交的、申请号为202210102847.5;于2022年1月27日提交的、申请号为202210102840.3的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及显示设备技术,具体而言,涉及一种显示设备及音频处理方法。
背景技术
语言信息在视频中占有绝大多数的信息量,用户观看视频的最基本音质需求是清晰度,尤其是人声清晰度。当视频中多种声音混在一起时,如汽车、飞机、音乐等,声音清晰度就会存在问题。
发明内容
本申请一些实施例提供了一种显示设备,包括:控制器,被配置为:对获取到的第一音频数据进行声音分离,得到第一目标音频数据和第一背景音频数据;按照第一增益对所述第一目标音频数据进行增益处理,得到第二目标音频数据;按照第二增益对所述第一背景音频数据进行增益处理,得到第二背景音频数据;其中,所述第一增益和所述第二增益根据所述显示设备对应的声音控制模式确定;将所述第二目标音频数据和所述第二背景音频数据进行合并,并进行音效增强处理,得到第二音频数据;音频输出接口,被配置为:输出所述第二音频数据。
本申请一些实施例提供了一种显示设备,包括:控制器,被配置为:对获取到的第一音频数据分别进行声音分离和音效增强处理,得到第一目标音频数据和第二音频数据;按照第一增益对所述第一目标音频数据进行增益处理,得到第二目标音频数据;按照第二增益对所述第二音频数据进行增益处理,得到第三音频数据,其中,所述第一增益和所述第二增益根据所述显示设备对应的声音控制模式确定;对所述第二目标音频数据或所述第三音频数据进行延时处理,以使所述第二目标音频数据和所述第三音频数据同步;将所述第二目标音频数据和所述第三音频数据合并,得到第四音频数据;音频输出接口,被配置为:输出所述第四音频数据。
本申请一些实施例提供了一种显示设备,包括:控制器和多个音频输出接口;所述控制器,被配置为:对获取到的第一声道音频数据和第二声道音频数据分别进行人声分离,得到第一声道第一人声音频数据和第一声道第一背景音频数据,以及第二声道第一人声音频数据和第二声道第一背景音频数据;将所述第一声道第一人声音频数据和所述第二声道第一人声音频数据进行合并,得到目标人声音频数据;获取所述第一声道音频数据和第二声道音频数据所在时刻的图像数据,对所述图像数据进行唇动检测,如果检测到所述显示设备屏幕中的唇动坐标,根据所述唇动坐标和所述多个音 频输出接口的坐标,确定所述多个音频输出接口分别对应的人声权重;针对每个音频输出接口,根据所述音频输出接口的坐标,确定所述音频输出接口对应第一声道第一背景音频数据和/或第二声道第一背景音频数据;将所述目标人声音频数据和所述音频输出接口对应的人声权重的乘积,以及所述音频输出接口对应的第一声道第一背景音频数据和/或第二声道第一背景音频数据合并,并进行音效增强处理,得到所述音频输出接口对应的音频数据;针对每个音频输出接口,被配置为:输出所述音频输出接口对应的音频数据。
本申请一些实施例提供了一种显示设备,包括:控制器,被配置为:获取歌曲音频数据,对所述歌曲音频数据进行人声分离,得到原唱人声音频数据和伴奏音频数据;所述控制器,还被配置为:根据每个时间周期内的原唱人声音频数据的能量和在所述时间周期内采集到的演唱人声音频数据的能量,确定原唱增益;根据所述原唱增益,对所述时间周期内的原唱人声音频数据进行增益处理,得到目标人声音频数据;将所述时间周期内的伴奏音频数据、目标人声音频数据和演唱人声音频数据进行合并,并进行音效增强处理,得到目标音频数据;音频输出接口,被配置为:输出所述目标音频数据。
附图说明
图1为根据本申请一些实施例的显示设备与控制装置之间操作场景的示意图;
图2为根据本申请一些实施例的显示设备200的硬件配置框图;
图3为根据本申请一些实施例的控制设备100的硬件配置框图;
图4为根据本申请一些实施例的显示设备200中软件配置示意图;
图5为根据本申请一些实施例的显示设备200中应用程序的图标控件界面显示示意图;
图6A为本申请一些实施例中音频处理方法的一种***架构的示意图;
图6B为本申请一些实施例中音频处理方法的一种示意图;
图7为声音分离的一种示意图;
图8为本申请一些实施例中音频处理方法的一种示意图;
图9a为标准录音棚或者家庭音响音箱分布角度的一种示意图;
图9b为电视机扬声器的角度的一种示意图;
图9c为改变电视机扬声器的能量分配关系的一种示意图;
图10为本申请一些实施例中函数f(x)的一种示意图;
图11A为本申请一些实施例中音频处理方法的一种***架构的示意图;
图11B为本申请一些实施例中音频处理方法的一种示意图;
图12为本申请一些实施例中音频处理方法的一种示意图;
图13A为本申请一些实施例中音频处理方法的一种***架构的示意图;
图13B为本申请一些实施例中音频处理方法的一种示意图;
图14为扬声器分布的一种示意图;
图15A为本申请一些实施例中音频处理方法的一种***架构的示意图;
图15B为本申请一些实施例中音频处理方法的一种示意图;
图16为本申请一些实施例中对原唱人声音频数据进行时域变换的一种示意图;
图17为本申请一些实施例中对原唱人声音频数据进行频域变换的一种示意图;
图18为本申请一些实施例中音频处理方法的一种流程图;
图19为本申请一些实施例中音频处理方法的一种流程图;
图20为本申请一些实施例中音频处理方法的一种流程图;
图21为本申请一些实施例中音频处理方法的一种流程图。
具体实施方式
为使本申请的目的、实施方式和优点更加清楚,下面将结合本申请示例性实施例中的附图,对本申请示例性实施方式进行清楚、完整地描述,显然,所描述的示例性实施例仅是本申请一部分实施例,而不是全部的实施例。
图1为根据本申请一些实施例的显示设备与控制装置之间操作场景的示意图,如图1所示,用户可通过移动终端300和控制装置100操作显示设备200。控制装置100可以是遥控器,遥控器和显示设备的通信包括红外协议通信、蓝牙协议通信,无线或其他有线方式来控制显示设备200。
在一些实施例中,移动终端300可与显示设备200安装软件应用,通过网络通信协议实现连接通信,实现一对一控制操作的和数据通信的目的。
图2为根据一些实施例的控制装置100的配置框图。如图2所示,控制装置100包括控制器110、通信接口130、用户输入/输出接口140、存储器、供电电源。
图3为根据一些实施例的显示设备200的硬件配置框图。如图3所示显示设备200包括调谐解调器210、通信器220、检测器230、外部装置接口240、控制器250、显示器260、音频输出接口270、外部存储器、供电电源、用户接口280中的至少一种。控制器包括中央处理器,视频处理器,音频处理器,图形处理器,RAM,ROM,用于输入/输出的第一接口至第n接口。显示器260可为液晶显示器、OLED显示器、触控显示器以及投影显示器中的至少一种,还可以为一种投影装置和投影屏幕。
图4为根据本申请一些实施例的显示设备200中软件配置示意图,如图4所示,将***分为四层,从上至下分别为应用程序(Applications)层(简称“应用层”),应用程序框架(Application Framework)层(简称“框架层”),安卓运行时(Android runtime)和***库层(简称“***运行库层”),以及内核层。内核层至少包含以下驱动中的至少一种:音频驱动、显示驱动、蓝牙驱动、摄像头驱动、WIFI驱动、USB驱动、HDMI驱动、传感器驱动(如指纹传感器,温度传感器,压力传感器等)、以及电源驱动等。
图5为根据本申请一些实施例的显示设备200中应用程序的图标控件界面显示示意图,如图5中所示,应用程序层包含至少一个应用程序可以在显示器中显示对应的图标控件,如:直播电视应用程序图标控件、视频点播应用程序图标控件、媒体中心应用程序图标控件、应用程序中心图标控件、游戏应用图标控件等。
本申请一些实施例在安卓***中的实现如图6A所示,安卓***中主要包括应用层、中间件以及核心层,实现逻辑可以在中间件,中间件包括:音频解码器、声音分离模块、增益控制模块、音效增强模块和音频输出接口。音频解码器用于对通过广播信号、网络、USB或HDMI等输入的信号源进行音频解码处理,得到音频数据。声音分离模块用于对解码后的音频数据进行声音分离,例如可以通过人声分离方法,分离出人声音频和背景音频。增益控制模块可以获取用户针对显示设备的声音控制模式,分别对人声音频和背景音频进 行不同的增益处理,以增强人声音频或背景音频。合并模块用于对增益处理后的人声音频和背景音频进行合并,得到合并音频数据,音效增强模块用于对合并音频数据进行音效增强处理,得到目标音频数据。音频输出接口用于输出目标音频数据。
需要说明的是,上述实现逻辑除了可以在中间件实现,也可以在核心层实现。或者,还可以在中间件和核心层实现,例如,音频解码器和声音分离模块可以在中间件实现,声音分离模块之后的模块可以在核心层实现。
与上述图6A相对应,图6B为本申请一些实施例中音频处理方法的一种示意图。音频解码器对获取的声音信号进行解码之后,可以得到第一音频数据。声音分离模块可以通过AI(人工智能)技术,通过预先训练的神经网络模型实现对第一音频数据的声音分离,得到第一目标音频数据和第一背景音频数据。例如,可以通过人声分离模型分离出人声,人声即第一目标音频数据,通过预先训练完成的汽车声分离模型分离出汽车声,汽车声即为第一目标音频数据,第一背景音频数据即为除第一目标音频数据之外的音频数据。增益控制模块根据声音控制模式可以得到第一增益和第二增益,第一增益和第二增益的值不相等。根据第一增益对第一目标音频数据进行增益处理,可以得到第二目标音频数据,根据第二增益对第一背景音频数据进行增益处理,得到第二背景音频数据。将第二目标音频数据和第二背景音频数据进行合并,并进行音效增强处理之后,得到并输出第二音频数据。本申请通过对第一目标音频数据和第一背景音频数据进行非等比例的增益处理,来增强第一目标音频数据或第一背景音频数据,从而可以提高音效增强的效果。
本申请一些实施例中,显示设备200包括:控制器250,被配置为:对获取到的第一音频数据进行声音分离,得到第一目标音频数据和第一背景音频数据。第一音频数据指包含至少两种混合声音的音频数据,例如,第一音频数据中可以包括人声和背景音乐,通过预先训练完成的人声分离模型,分离出人声,除人声之外的其他声音即为背景声。此时,第一目标音频数据即为人声,第一背景音频数据即为背景声。音频输出接口270,被配置为:输出第二音频数据。
参见图7,图7为声音分离的一种示意图。通常生活中的声音、影视剧作品中的声音,是由各种声音混在一起的,比如图7中声音信号1是乐器的声音,声音信号2是人唱歌的声音。混合声音信号是录音、音视频制作时将乐器的声音和人唱歌的声音混在一起的声音信号。传统的基于固定逻辑运算的音效算法,是无法在混合声音信号中分离出两种声音的,而借助AI技术可以实现声音的分离,得到与乐器相近的音频1和与人声相近的音频2。
或者,第一音频数据中包括人声、汽车声、枪炮声和背景音乐等多种混合声音,可以通过人声分离模型分离出人声,通过预先训练完成的汽车声分离模型分离出汽车声,通过预先训练完成的枪炮声分离模型分离出枪炮声。将第一音频数据中,除分离出的人声、汽车声和枪炮声之外的其他声音作为背景声。此时,第一目标音频数据可以包括人声、汽车声和枪炮声,第一背景音频数据即为背景声。
在一些实施例中,用户可以根据自己的喜好选择声音控制模式,根据该声音控制模式,可以确定第一增益和第二增益。控制器250,被配置为:按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据;按照第二增益对第一背景音频数据进行增益处理,得到第二背景音频数据。也就是,对第一目标音频数据和第一背景音频数据进行不同大小的增益处理,以增强第一目标音频数据或第一背景音频数据。之后,将第二目标音频数据和第二背景音频数据进行合并,并进行音效增强处理,得到第二音频数据。
可以理解的是,如果第一增益和第二增益均为0dB,那么将第二目标音频数据和第二背景音频数据进行合并后的信号与声音分离之前的信号是高度相似的。通过音效增强算法对合并后的信号进行音效增强处理,得到第二音频数据。其中,音效增强算法包括但不限于AGC(自动增益控制)、DRC(Dynamic range compression,动态范围规划)、EQ(均衡器)、虚拟环绕等。
在一些实施例中,控制器250,被配置为:根据显示设备对应的声音控制模式,确定第一音频数据对应的音效增强模式的类型;音效增强模式的类型指用户想要增强的声音的类型,根据显示设备对应的声音控制模式,确定与音效增强模式的类型对应的第一增益和第二增益。音效增强模式的类型不同,对应的第一增益和第二增益也会不同。
在一些实施例中,根据该声音控制模式,可以先确定第一音频数据对应的音效增强模式的类型,音效增强模式的类型表示用户想要增强的声音的类型,音效增强模式的类型不同,第一增益和第二增益的确定方法也可以不同。因此,可以在音效增强模式的类型后,根据该声音控制模式,确定与音效增强模式的类型对应的第一增益和第二增益。例如,音效增强模式的类型可以包括声音增强模式和背景增强模式,声音增强模式表示用户想增强第一目标音频数据,背景增强模式表示用户想增强第一背景音频数据。
在一些实施例中,控制器250,被配置为:如果第一音频数据对应的音效增强模式的类型为声音增强模式,即增强第一目标音频数据,第一增益大于第二增益。如果第一音频数据对应的音效增强模式的类型为背景增强模式,即增强第一背景音频数据,第一增益小于第二增益。
假设第一增益是G1,第二增益是G2,如果用户想增强第一目标音频数据,可以增强第一目标音频数据,不改变第一背景音频数据,即G1可以为大于0 dB的数值,G2等于0 dB。如果用户想增强第一背景音频数据,可以不改变第一目标音频数据,增强第一背景音频数据,即G1等于0 dB,G2为大于0 dB的数值。
在一些实施例中,为了确保不出现正增益而导致音频信号出现破音,G1和G2的范围可以为[-911BB,0dB]。如果第一音频数据对应的音效增强模式的类型为声音增强模式,将第一增益设置为0 dB,根据声音控制模式,确定第二增益,其中,第二增益小于0 dB。这样,在不改变第一目标音频数据的情况下,通过减弱第一背景音频数据,而达到增强第一目标音频数据的目的。如果第一音频数据对应的音效增强模式的类型为背景增强模式,根据声音控制模式确定第一增益,将第二增益设置为0 dB,其中,第一增益小于0 dB。这样,在不改变第一背景音频数据的情况下,通过减弱第一目标音频数据,而达到增强第一背景音频数据的目的。
在一些实施例中,显示设备对应多种预设声音清晰度控制模式和/或多种预设音效模式。用户可以根据自己的需要和喜好调整人声清晰的程度,从多个预设声音清晰度控制模式中选取目标声音清晰度控制模式,每种预设声音清晰度控制模式具有对应的数值。例如多个预设声音清晰度控制模式分为多个不同的等级,每个等级对应不同的数值。用户也可以从多种预设音效模式(例如标准模式、音乐模式、电影模式等)中选取目标音效模式,每种预设音效模式具有对应的数值。
其中,预设声音清晰度控制模式表示显示设备的声音清晰程度,可以包括多个不同的等级。如果预设声音清晰度控制模式对应的数值为M1,用户可以通过菜单调整声音的清晰程度,为了简化计算,菜单调整数值可以被归一化为[0,1]内的数值,即M1为大于等于 0,且小于等于1的数值。假设0.5表示显示设备出厂时的默认数值,大于0.5表示声音的清晰程度更高,小于0.5表示声音的清晰程度更低。
预设音效模式表示显示设备所处的声音效果模式,可以包括标准音效、音乐音效、电影音效和新闻音效等。如果预设音效模式对应的数值为M2,M2也可以为归一化数值,假设标准模式下M2的值为0.5,音乐模式下M2的值为0.6,电影模式下M2的值为0.7,新闻模式下M2的值为0.8。
显示设备对应的声音控制模式包括:目标声音清晰度控制模式和/或目标音效模式;其中,目标声音清晰度控制模式为多种预设声音清晰度控制模式中的一种,目标音效模式为多种预设音效模式中的一种。控制器250,被配置为:根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第一音频数据对应的音效增强模式的类型。即可以根据第一数值和/或第二数值得到一个数值,根据该数值可以判断音效增强模式的类型。进一步地,根据第一数值和/或第二数值,确定与音效增强模式的类型对应的第一增益和第二增益。
在一些实施例中,根据第一数值和第二数值可以得到第三数值,并基于第三数值确定音效增强模式的类型。假设归一化场景下,第三数值可以为1时,表示不增强第一目标音频数据和第一背景音频数据。第三数值大于1时,表示增强第一目标音频数据,第三数值小于1时,表示增强第一背景音频数据。在一些实施例中,第三数值T可以表示为以下公式:
T=(2×M1)×(2×M2)      (1)
可以理解的是,标准模式下M1和M2的值不同,第三数值T的表达式也可以不同。
举例而言,在用户对显示设备的声音控制模式未进行调整的情况下,目标声音清晰度控制模式对应的第一数值为0.5,目标音效模式对应的第二数值也为0.5,此时T等于1,第一增益G1和第二增益G2均可以为0 dB,也就是不对第一目标音频数据和第一背景音频数据进行增益处理。
如果用户对显示设备的声音控制模式进行了调整,假设目标声音清晰度控制模式对应的第一数值为0.7,目标音效模式对应的第二数值为0.8。此时T的值大于1,即增强第一目标音频数据。如前所述,G1和G2均为不大于0 dB的数,因此可以将G1设置为0,G2设置为小于0的数值,在一些实施例中,G2可以表示为以下公式:
Figure PCTCN2022101859-appb-000001
其中,×代表乘法运算,log代表对数运算,-代表除法运算;当然,G2的确定方式不限于此,例如,可以对该公式(2)进行简单变形等。
反之,如果用户对显示设备的声音控制模式进行调整后,T的值小于1,表示增强第一背景音频数据。此时,可以将G2设置为0,G1设置为小于0的数值。在一些实施例中,G1可以表示为以下公式:
Figure PCTCN2022101859-appb-000002
当然,G1的确定方式不限于此,例如,可以对该公式(3)进行简单变形等。
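以下给出一个示意性的Python片段，用于说明根据归一化数值M1、M2计算第三数值T并据此确定G1、G2的一种可能方式；其中的衰减映射为假设，因为公式(2)、(3)在原文中以图像形式给出。

    import math

    def gains_from_menu(m1, m2):
        """m1：目标声音清晰度控制模式对应的归一化数值
           m2：目标音效模式对应的归一化数值
           返回 (G1, G2)，单位 dB，均不大于 0 dB"""
        t = (2 * m1) * (2 * m2)                     # 第三数值T，见公式(1)
        if t > 1:                                    # 增强第一目标音频数据
            return 0.0, -20 * math.log10(t)          # 衰减映射为假设
        if t < 1:                                    # 增强第一背景音频数据
            return -20 * math.log10(1.0 / t), 0.0
        return 0.0, 0.0                              # T等于1时不做增益处理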
参见图8,图8为本申请一些实施例中音频处理方法的一种示意图。在立体声显示设 备中,音频解码器在进行解码之后,左右声道的音频数据被独立地进行人声分离、增益处理及音效增强处理之后,送入相应的扬声器。
由于显示设备的扬声器多数位于显示设备底部并朝下发声,并且因为两个扬声器之间距离较近(一般在0.3~0.8米左右),而人的观看距离一般都是2.0~2.5米左右,角度只有8°~14°。人的方位分辨极限约为5°,也就是说,显示设备两个扬声器的距离比较接近人的方位分辨极限。而一般立体声音源创造时(标准录音棚),左右两个声道的角度为60°。参见图9a,图9a为标准录音棚或者家庭音响音箱分布角度的一种示意图。可以看出,左右两个声道的角度为60°。音源在创作时,一般声音不会只在一个声道里面存在,而是两个声道同时都有声音,创作者想要表现声音在左边时,会让左边的声音比右边的大,相反地,想要表现声音在右侧时,会让右侧的声音比左侧大。
然而,这样的创作是基于60°的角度而制作的,参见图9b,图9b为电视机扬声器的角度的一种示意图。在该角度下,所有声音元素的虚拟声像都被缩小了,与创作者基于60°扬声器创作的意图不同。当两个扬声器的角度降低为8°~14°时,如果左右声道配比还是按照原来配比,观众得到是声音映像会变得很模糊,很难听出来声音的方位感。
为了提升方位感,在扬声器等物理条件不变的情况下,可以通过改变声音在左右扬声器中的信号配比。比如片源中某个声音在左右声道的能量分布关系为7:3,通过改变能量分配关系到8:2或者9:1,能够增强声场的位置感。参见图9c,图9c为改变电视机扬声器的能量分配关系的一种示意图。可以看出,在改变能量分配关系之后,在观众的主观听感下汽车更靠近左扬声器。
通常情况下,影视剧中用于烘托气氛的背景音乐在左右声道中的能量是基本相同或者是信号是完全相同的,只是用于表现方位感的典型声音会被分配到不同的声道中,用于表现方位感,典型声音包括但不限于人声、枪炮声、汽车声、飞机声等。如果仍然按照上述方法测算左右声道的能量,然后简单改变两个声道的能量配比,将导致声像居中的背景音乐的中心也被改变,因此该方法是不可取的。
在一些实施例中,第一音频数据中包括至少一种属于预设声音类型(例如表现方位感的声音类型)的第三目标音频数据,第三目标音频数据包括但不限于人声、枪炮声、汽车声、飞机声等。
为了解决上述问题,控制器250,还被配置为:从第一音频数据中分离出至少一种第三目标音频数据和第三背景音频数据。
如前所述,第一音频数据指包含至少两种混合声音的音频数据,可以通过训练完成的、不同的神经网络模型从第一音频数据中分离出人声、枪炮声、汽车声等,第三目标音频数据是一种类型的音频数据,第一音频数据中可以包括一种或多种第三目标音频数据,第一音频数据中除第三目标音频数据之外的音频数据即为第三背景音频数据。例如,第一音频数据中包括人声和汽车声时,第一音频数据中包括两种第三目标音频数据,分别为人声和汽车声,除人声和汽车声之外的声音即为背景声。针对每种第三目标音频数据,均可以执行下述过程。
由于第三目标音频数据用于表现方位感,第三目标音频数据包括至少两个不同声道(例如第一声道和第二声道)的音频数据。在一些实施例中,第一声道和第二声道分别可以为左声道和右声道。例如,第三目标音频数据中包括两个声道的音频数据,即第一声道初始目标音频数据和第二声道初始目标音频数据。第一声道初始目标音频数据和第二声道 初始目标音频数据分别可以为左声道音频数据和右声道音频数据。再比如,下述的第一声道初始背景音频数据和第二声道初始背景音频数据分别可以为左声道初始背景音频数据和右声道初始背景音频数据。
可以理解的是,第三目标音频数据中的第一声道初始目标音频数据和第二声道初始目标音频数据的能量是不同的,因此,可以获取单个第三目标音频数据的第一声道初始目标音频数据的第一能量值和第二声道初始目标音频数据的第二能量值,根据第一能量值和第二能量值确定第一声道初始目标音频数据对应的第三增益,和第二声道初始目标音频数据对应的第四增益。
按照第三增益对第一声道初始目标音频数据进行增益处理,得到第一声道第一增益音频数据,即增益处理后的第一声道音频数据;按照第四增益对第二声道初始目标音频数据进行增益处理,得到第二声道第一增益音频数据,即增益处理后的第二声道音频数据;其中,第三增益和第四增益根据第一能量值和第二能量值确定。这样,分别按照第三增益对第一声道初始目标音频数据进行增益处理和按照第四增益对第二声道初始目标音频数据进行增益处理,可以进一步提高第三目标音频数据的方位感。同时,可以不改变第三背景音频数据的中心。
例如,如果第一声道初始目标音频数据的第一能量值大于第二声道初始目标音频数据的第二能量值,第三增益可以大于第四增益,例如可以将第三增益设置为大于0 dB的数值,第四增益设置为0 dB,即对第二声道初始目标音频数据不做增益处理。如果第一能量值等于第二能量值,表示两者能量相等,第三增益等于第四增益,或者可以不作处理。如果第一能量值小于第二能量值,第三增益可以小于第四增益,例如将第三增益设置0 dB,即对第一声道初始目标音频数据不做增益处理,第四增益设置为大于0 dB的数值。
在一些实施例中,为了确保不出现正增益而导致音频信号出现破音,如果第一能量值大于第二能量值,可以将第三增益设置为0 dB,根据第一能量值和第二能量值确定第四增益,其中,第四增益小于0 dB。按照第三增益对第一声道初始目标音频数据进行增益处理,得到第一声道第一增益音频数据;按照第四增益对第二声道初始目标音频数据进行增益处理,得到第二声道第一增益音频数据。
如果第一能量值小于第二能量值,可以根据第一能量值和第二能量值确定第三增益,第三增益小于0 dB,将第四增益设置为0 dB。按照第三增益对第一声道初始目标音频数据进行增益处理,得到第一声道第一增益音频数据;按照第四增益对第二声道初始目标音频数据进行增益处理,得到第二声道第一增益音频数据。
最后,将第一声道第一增益音频数据和第三背景音频数据的第一声道初始背景音频数据进行合并,并进行音效增强处理,得到第一声道第一增强音频数据;将第二声道第一增益音频数据和第三背景音频数据的第二声道初始背景音频数据进行合并,并进行音效增强处理,得到第二声道第一增强音频数据。
通过获取第三目标音频数据的第一声道初始目标音频数据的第一能量值和第二声道初始目标音频数据的第二能量值,可以分析第一声道初始目标音频数据和第二声道初始目标音频数据的能量大小关系,根据该能量大小关系,对第一声道初始目标音频数据和第二声道初始目标音频数据进行不同的增益处理,从而使能量高的声道的音频数据更强,以更好的提升声音的方位感,提升音效增强的效果。
需要说明的是,在第三目标音频数据中包含更多个声道的音频数据的情况下,处理过 程与此类似,在此不再赘述。
音频输出接口270包括:第一输出接口和第二输出接口;第一输出接口被配置为:输出第一声道第一增强音频数据;第二输出接口被配置为:输出第二声道第一增强音频数据。
在一些实施例中,还可以同时考虑声音控制模式、第一能量值和第二能量值,来对第三目标音频数据和第三背景音频数据进行增益处理。控制器250,还被配置为:根据显示设备对应的声音控制模式、第一能量值和第二能量值,确定单个第三目标音频数据对应的第五增益和第六增益。第五增益和第六增益分别是第三目标音频数据的第一声道初始目标音频数据和第二声道初始目标音频数据对应的增益。第五增益和第六增益可以不同。
根据显示设备对应的声音控制模式,确定第七增益;其中,第七增益指第三背景音频数据对应的增益,由于不改变第三背景音频数据的中心,因此,第七增益用于对第一声道初始背景音频数据和第二声道初始背景音频数据进行增益处理,即,对第一声道初始背景音频数据和第二声道初始背景音频数据进行相同的增益处理。
之后,按照第五增益对第一声道初始目标音频数据进行增益处理,得到第一声道第二增益音频数据,即增益处理后的第一声道音频数据。按照第六增益对第二声道初始目标音频数据进行增益处理,得到第二声道第二增益音频数据,即增益处理后的第二声道音频数据;按照第七增益分别对第一声道初始背景音频数据和第二声道初始背景音频数据进行增益处理,得到第一声道增益背景音频数据(即增益处理后第一声道的背景音频数据)和第二声道增益背景音频数据(即增益处理后第二声道的背景音频数据)。
需要说明的是,第一声道第二增益音频数据和前述的第一声道第一增益音频数据均是对第一声道初始目标音频数据进行增益处理后的第一声道音频数据,区别在于在增益处理时所对应的增益值不同。同样地,第二声道第二增益音频数据和前述的第二声道第一增益音频数据均是对第二声道初始目标音频数据进行增益处理后的第二声道音频数据,区别在于在增益处理时所对应的增益值不同。
音频输出接口270包括:第一输出接口和第二输出接口;第一输出接口被配置为:输出第一声道第二增强音频数据;第二输出接口被配置为:输出第二声道第二增强音频数据。
在一些实施例中,控制器250,被配置为:根据显示设备对应的声音控制模式,确定第一音频数据对应的音效增强模式的类型;根据第一声道初始目标音频数据的第一能量值和第二声道初始目标音频数据的第二能量值,确定左右声道能量大小关系。根据显示设备对应的声音控制模式、第一能量值和第二能量值,确定与音效增强模式的类型以及左右声道能量大小关系对应的第五增益和第六增益;根据显示设备对应的声音控制模式,确定与音效增强模式的类型以及左右声道能量大小关系对应的第七增益。
音效增强模式的类型不同,对第三目标音频数据和第三背景音频数据的增益处理方式不同。左右声道能量大小关系不同,对第一声道初始目标音频数据和第二声道初始目标音频数据的增益处理方式也不同。音效增强模式的类型用于确定增强第三目标音频数据还是第三背景音频数据,左右声道能量大小关系用于确定增强第一声道初始目标音频数据还是第二声道初始目标音频数据。因此,不同的音效增强模式的类型以及左右声道能量大小关系,对应不同的第五增益、第六增益和第七增益。
例如,如果音效增强模式的类型为声音增强模式,第五增益和第六增益均大于第七增益,如果第一能量大于第二能量,第五增益大于第六增益。如果第一能量等于第二能量,第五增益可以等于第六增益。如果第一能量小于第二能量,第五增益小于第六增益。
如果音效增强模式的类型为背景增强模式,第五增益和第六增益均小于第七增益,如果第一能量大于第二能量,第五增益大于第六增益。如果第一能量等于第二能量,第五增益可以等于第六增益。如果第一能量小于第二能量,第五增益小于第六增益。
在一些实施例中,在声音增强模式下,第三数值T可以大于1,假设第一能量值为P L,第二能量值为P R,如果P L大于P R,此时,第五增益可以等于0 dB,第六增益和第七增益均小于0 dB。例如,第五增益G 1L=0 dB,第六增益可以表示为以下公式:
Figure PCTCN2022101859-appb-000003
其中,第七增益可以表示为以下公式:
Figure PCTCN2022101859-appb-000004
如果第三数值T大于1,P L小于等于P R,此时,第六增益等于0 dB,第五增益和第七增益均小于0 dB。例如,第五增益可以表示为以下公式:
Figure PCTCN2022101859-appb-000005
第六增益G 1R=0 dB,第七增益可以表示为以下公式:
Figure PCTCN2022101859-appb-000006
如果第三数值T小于等于1,P L大于P R,此时,第五增益和第六增益均小于0,第七增益等于0 dB。例如,第五增益可以表示为以下公式:
G 1L=20×log T     (8)
第六增益可以表示为以下公式:
Figure PCTCN2022101859-appb-000007
第七增益G 2=0 dB。
如果第三数值T小于等于1,P L小于等于P R,此时,第五增益和第六增益均小于0 dB, 第七增益等于0 dB。例如,第五增益G 1L可以表示为以下公式:
Figure PCTCN2022101859-appb-000008
第六增益G 1R可以表示为以下公式:
G 1R=20×log T   (11)
第七增益G 2=0 dB。
其中,x在(0.5,1)之间时,f(x)>x,x在(0,0.5)之间时,f(x)<x,在x等于0.5时,f(x)=0.5;f(x)对应公式(4)~公式(11)。参见图10,图10为本申请一些实施例中函数f(x)的一种示意图,可以看出,f(x)随x的变化趋势满足上述关系。需要说明的是,f(x)随x的变化趋势不限于此,例如可以是指数型、抛物线型或多种形式的组合等,只要满足上述关系即可。
需要说明的是,第五增益、第六增益和第七增益的确定方式不限于此,例如,可以对上述公式的简单变形等。并且,第五增益、第六增益和第七增益也可以大于等于0dB。
控制器250被配置为:将第一声道第二增益音频数据和第一声道增益背景音频数据进行合并,并进行音效增强处理,得到并输出第一声道第二增强音频数据;将第二声道第二增益音频数据和第二声道增益背景音频数据进行合并,并进行音效增强处理,得到并输出第二声道第二增强音频数据。
本申请还可以同时考虑控制模式和第一声道初始目标音频数据和第二声道初始目标音频数据的能量大小关系,来确定第一声道初始目标音频数据和第二声道初始目标音频数据分别对应的增益值,从而可以进一步提升音效增强的效果。
如前所述,声音分离算法通常使用人工智能技术,声音经过人工智能处理后,再经过音效增强处理,有可能导致声音处理所需要的时长比较长,从而在扬声器输出的时间晚于图像,即出现音画不同步的问题。为了解决该问题,本申请提供了一种显示设备。
该显示设备可以运行安卓***,在安卓***中的实现可以如图11A所示,安卓***中主要包括应用层、中间件以及核心层,实现逻辑可以在中间件,在一些实施方式中,中间件包括:音频解码器、声音分离模块、音效增强模块、增益控制模块、延时模块和音频输出接口。在另外一些实施方式中,中间件还可以包括人声分离模块、声音分配模块和图像解码器,其中声音分配模块用于对图像解码器解码输出的图像进行唇动检测,以确定各个音频输出接口输出人声音频的权重和背景音频的权重,如图13A所示。在又一些实施方式中,中间件还可包括原唱音量控制模块,原唱音量控制模块根据演唱音频据和分离出的原唱人声音频,确定合并至伴奏音频的原唱音频的大小,即目标人声音频,如图15A所示。音频解码器用于对通过广播信号、网络、USB或HDMI等输入的信号源进行音频解码处理,得到音频数据。声音分离模块用于对解码后的音频数据进行声音分离,例如可以通过人声分离方法,分离出人声音频。音效增强模块用于对解码后的音频数据进行音效增强处理,增益控制模块可以获取用户针对显示设备的声音控制模式,分别对分离出的音频和音效增 强后的音频进行不同的增益处理。由于声音分离和音效增强所消耗的时长通常会不同,因此,延时模块可以对增益处理后的两个音频数据进行延时处理。合并模块用于对增益处理后的两个音频进行合并,得到合并音频数据。音频输出接口用于输出合并后的音频数据。
与上述图11A相对应,图11B为本申请一些实施例中音频处理方法的一种示意图。音频解码器对获取的声音信号进行解码之后,可以得到第一音频数据。声音分离模块可以通过AI技术,通过预先训练的神经网络模型实现对第一音频数据的声音分离,得到第一目标音频数据。第一目标音频数据可以是人声、汽车声等。同时,可以对第一音频数据进行音效增强处理之后,得到第二音频数据。增益控制模块根据声音控制模式可以得到第一增益和第二增益,第一增益和第二增益的值不相等。根据第一增益对第一目标音频数据进行增益处理,可以得到第二目标音频数据,根据第二增益对第二音频数据进行增益处理,得到第三音频数据。根据声音分离模块所消耗的时长和音效增强模块所消耗的时长,确定对第二目标音频数据进行延迟处理,或者,对第三目标音频数据进行延时处理。之后,将第二目标音频数据和第三音频数据进行合并。
可以看出,通过声音分离可以只分离出一种声音,即第一目标音频数据,而不用分离出背景声,从而减少声音分离所消耗的时长。并且,将声音分离和音效增强进行并行处理,而不是串行处理,可以进一步缩短整个音频处理流程所消耗的时长,从而提升音画同步的效果。
基于此,本申请一些实施例还提供了一种显示设备200包括:
控制器250,还可以被配置为:对获取到的第一音频数据分别进行声音分离和音效增强处理,得到第一目标音频数据和第二音频数据。控制器可以对上述的应用层、中间件以及核心层的相关数据处理进行控制。
第一音频数据指包含至少两种混合声音的音频数据,例如,第一音频数据中可以包括人声和背景音乐等。第一目标音频数据通常指用户想增强的音频数据,可以是人声或其他声音等,例如适用于在观看影视剧、听音乐等场景。通过预先训练完成的人声分离模型,可以分离出人声,此时,第一目标音频数据即为人声。或者,第一音频数据中包括人声、汽车声、枪炮声和背景音乐等多种混合声音,可以通过预先训练完成的汽车声分离模型分离出汽车声,此时,第一目标音频数据即为汽车声。上述声音分离过程,可以只分离出一种声音(第一目标音频数据)即可。与分离出多种声音相比,可以减少分离过程所消耗的时长。
本申请还可以对第一音频数据进行音效增强处理,为了降低音频处理的总时长,音效增强的处理过程和声音分离的处理过程可以并行处理,而不是串行处理,可以进一步缩短整个音频处理流程所消耗的时长,从而提升音画同步的效果。其中,音效增强算法包括但不限于自动增益控制、动态范围规划、均衡器、虚拟环绕等。
按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据;按照第二增益对第二音频数据进行增益处理,得到第三音频数据,其中,第一增益和第二增益根据显示设备对应的声音控制模式确定。通过不同的增益分别对第一目标音频数据和第二音频数据进行增益处理,以提高音效增强的整体效果。
在一些实施例中,显示设备对应多种预设声音清晰度控制模式和/或多种预设音效模式;每种预设声音清晰度控制模式具有对应的数值,每种预设音效模式具有对应的数值。用户可以根据自己的需要和喜好对显示设备的声音控制模式进行调整。显示设备获取到用户设 置的声音控制模式后,显示设备对应的声音控制模式包括:目标声音清晰度控制模式和/或目标音效模式;其中,目标声音清晰度控制模式为多种预设声音清晰度控制模式中的一种,目标音效模式为多种预设音效模式中的一种。因此,根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第一增益和第二增益,其中,第一增益可以大于第二增益。
如前所述,第一目标音频数据通常指用户想增强的音频数据。因此,在音效增强模式的类型包括声音增强模式和背景增强模式的情况下,可以适用于声音增强模式的场景。假设归一化场景下,根据第一数值和第二数值得到第三数值,第三数值大于1时,增强第一目标音频数据。在一些实施例中,第三数值T可以表示为:(2×M1)×(2×M2),可以理解的是,标准模式下M1和M2的值不同,第三数值T的表达式也可以不同。
在一些实施例中,为了确保不出现正增益而导致音频信号出现破音,第一增益和第二增益可以小于等于0dB。例如,可以将第一增益设置为0 dB;根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第二增益,使第二增益小于0dB。需要说明的是,第一增益和第二增益的确定方法可参见前述实施例中的描述即可,在此不再赘述。
由于对第一音频数据进行声音分离的过程和音效增强处理的过程可以并行处理,而对第一音频数据进行声音分离所消耗的时长和音效增强处理所消耗的时长通常会不同,因此,如果直接将第二目标音频数据和第三音频数据合并,会出现声音信号无法重叠,而导致回音的问题。
为了解决该问题,可以对第二目标音频数据或第三音频数据进行延时处理,以使第二目标音频数据和第三音频数据同步;将第二目标音频数据和第三音频数据合并,得到第四音频数据。这样,可以避免声音信号无法重叠,造成回音等问题。
音频输出接口270,被配置为:输出第四音频数据。
在一些实施例中,控制器250,被配置为:获取声音分离时所消耗的第一时长以及音效增强处理时所消耗的第二时长;根据第一时长和第二时长,对第二目标音频数据或第三音频数据进行延时处理。也就是,可以直接统计声音分离和音效增强处理所消耗的时长,如果声音分离所消耗时长较短,可以对第二目标音频数据进行延迟处理;如果音效增强处理所消耗的时长较短,可以对第三音频数据进行延迟处理,最终使第二目标音频数据和第三音频数据同步。
当运行声音分离和音效增强的运算单元是专用的或者***资源充足时,第一时长与第二时长均可以依据测量计算出一组或几组固定的数值。然而,在实际场景下,声音分离算法在显示设备的芯片上通常不是专用的,而是与图像的AI算法同用APU(Accelerated Processing Unit,加速处理器)或GPU(graphics processing unit,图形处理器),使得声音分离的运算时间经常不是一个固定的数值,而是存在一定的波动性,通过实际测算波动性在±20ms之间。针对图6A所示的***架构,该波动虽然会影响音画同步,但是通常人对音画延时可以容忍的范围是±30ms。因此,该波动是可以被接受的。然而在图11A所示的***架构中,存在同一个声音在两个链路中处理,然后进行合并的处理方式。同一个声音误差超过±5ms后会带来明显的音质问题,因此,需要精准地对齐。
由于图11A所示的***架构中,存在同一声音在两个链路中处理的情况,因此,第一目标音频数据和第二音频数据之间具有一定的相关性。在一些实施例中,控制器250,被 配置为:根据第一目标音频数据和第二音频数据之间的相关性,确定第一目标音频数据和第二音频数据之间的时间差;根据时间差,对第二目标音频数据或第三音频数据进行延时处理。
在某些情况下,如果声音分离和音效增强处理所消耗的时长无法直接统计,或者统计的不准确,也可以通过分析第一目标音频数据和第二音频数据之间的相关性。根据该相关性,确定第一目标音频数据和第二音频数据之间的时间差,进而进行延时处理。
在一些实施例中,可以通过时域窗函数对第一目标音频数据和第二音频数据之间的相关性进行比对。控制器250,被配置为:获取第一目标音频数据在时间段t内的第一音频段,该第一音频段可以是任意时长为t的音频段;获取第二音频数据在所述时间段t内(即与第一音频段所处的时间相同)的第二音频段,以及第二音频段之前的多个第三音频段、第二音频段之后的多个第四音频段;其中,第三音频段和第四音频段对应的时长均与所述时间段t的时长相等。
确定第一音频段分别和第二音频段、第三音频段和第四音频段的相关性,确定相关性最高的音频段;将相关性最高的音频段和第一音频段的时间差确定为第一目标音频数据和第二音频数据之间的时间差。
也就是,从第一目标音频数据截取一段,记为w,同时,对相同时间段内第二音频数据采用相同的窗截取多段,记为w(x),并将逐个计算w与w(x)内所有数据的卷积数值,得到w与w(x)相关性数据。将相关性最高的w(x)与w的时间差确定为第一目标音频数据和第二音频数据之间的时间差。
或者,也可以从第二音频数据中截取一段,同时,对相同时间段内第一目标音频数据采用相同的窗截取多段,按照上述同样的方式进行相关性计算,确定第一目标音频数据和第二音频数据之间的时间差。
需要说明的是,窗口宽度与延时计算精度关系较大,窗口宽度是t,计算精度也是t。但是,t越小,对应的运算量也会越大。另外,t以内的数据如果采用逐点计算运算量也比较大,可以采用隔点计算的方式使运算量减少一半,具体可以根据处理器的计算能力选择相应的精度。
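以下为一个示意性的Python/NumPy片段，演示上文所述的隔点计算相关性与窗口偏移搜索；窗口宽度、搜索范围与函数名均为示例性假设。

    import numpy as np

    def corr_decimated(w, wx):
        """隔点（每隔一个采样点）计算两段等长音频的相关性，运算量约减半"""
        return float(np.dot(w[::2], wx[::2]))

    def best_offset(first_target, second_audio, start, win, search):
        """在±search个采样点范围内寻找相关性最高的窗口偏移；
        假设 search <= start 且 start + win + search <= len(second_audio)。"""
        w = first_target[start:start + win]
        offsets = list(range(-search, search + 1))
        scores = [corr_decimated(w, second_audio[start + o:start + o + win])
                  for o in offsets]
        return offsets[int(np.argmax(scores))]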
在普通的立体声电视机中左右声道的声音被独立地进行声音分离,并通过图8***架构所示的方法,并通过第一增益和第二增益分别对分离后得到的两种音频数据进行增益处理后进行合并,并进行音效增强处理后送入相应的扬声器。该架构虽然简单,但是左右声道的音频数据都需要经过声音分离算法的运算,而声音分离算法通常使用同一个物理运算处理器,时间上是叠加的,因此对于芯片的AI处理能力要求较高。可见,如何减少声音分离的运用量决定了能否在更多的显示设备中应用。
参见图12,图12为本申请一些实施例中音频处理方法的一种示意图。如图12所示,音频解码器输出的左声道音频数据和右声道音频数据,除了分别被进行音效增强处理,以及增益处理外,还被合并为一个信号后进行声音分离,并对分离出的第一目标音频数据进行增益处理。再对两个链路的声音信号进行延时处理,声音分离链路中的声音信号最终分别叠加至音效增强链路中的左、右声道中。这样,声音分离的运算量可以降低一半,使得所述实施方式的可实施性更高。
在一些实施例中,第一音频数据包括第一声道初始音频数据和第二声道初始音频数据。即第一音频数据可以包括两个声道的音频数据,例如,第一声道初始音频数据和第二声道 初始音频数据可以是第一音频数据中包含的左声道音频数据和右声道音频数据。
控制器250,被配置为:对第一声道初始音频数据和第二声道初始音频数据分别进行音效增强处理,得到第一声道音效增强音频数据(即音效增强后的第一声道音频数据)和第二声道音效增强音频数据(即音效增强后的第二声道音频数据)。
需要说明的是,针对声音分离的过程,可以直接对第一音频数据(即第一声道初始音频数据和第二声道初始音频数据合并后的音频数据)进行声音分离,得到第一目标音频数据,以使声音分离的运算量减少一半。
可以按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据;按照第二增益分别对第一声道音效增强音频数据和第二声道音效增强音频数据进行增益处理,得到第一声道目标音频数据和第二声道目标音频数据。
对第二目标音频数据或第一声道目标音频数据进行延时处理,以使第二目标音频数据和第一声道目标音频数据同步;以及对第二目标音频数据或第二声道目标音频数据进行延时处理,以使第二目标音频数据和第二声道目标音频数据同步。
类似的,声音分离所消耗的时长和音效增强处理所消耗的时长通常会不同,因此,可以先进行延时处理后再进行合并。本申请一些实施例中,也可以统计声音分离所消耗的第一时长、对第一声道初始音频数据进行音效增强处理所消耗的第二时长,以及对第二声道初始音频数据进行音效增强处理所消耗的第三时长。根据第一时长和第二时长,对第二目标音频数据或第一声道目标音频数据进行延时处理;根据第一时长和第三时长,对第二目标音频数据或第二声道目标音频数据进行延时处理。
或者,也可以确定第一目标音频数据和第一声道音效增强音频数据之间的相关性,根据该相关性对第二目标音频数据或第一声道目标音频数据进行延时处理;确定第一目标音频数据和第二声道音效增强音频数据之间的相关性,根据该相关性对第二目标音频数据或第二声道目标音频数据进行延时处理。
可以理解的是,对第一声道初始音频数据进行音效增强处理所消耗的第二时长,和对第二声道初始音频数据进行音效增强处理所消耗的第三时长,两者之间通常相等,或者差距较小,可以忽略不计。因此,为了降低运算量,也可以只统计其中一个音效增强处理过程所消耗的时长。或者,确定第一目标音频数据和第一声道音效增强音频数据(第二声道音效增强音频数据)之间的相关性即可。
之后,将第二目标音频数据分别和第一声道目标音频数据和第二声道目标音频数据进行合并,得到第一声道合并音频数据和第二声道合并音频数据;
音频输出接口270包括:第一输出接口和第二输出接口;第一输出接口被配置为:输出第一声道合并音频数据;第二输出接口被配置为:输出第二声道合并音频数据。
如前所述,声音分离可以通过人工智能技术实现,在第一音频数据包括第一声道初始音频数据和第二声道初始音频数据的情况下,如果对第一声道初始音频数据和第二声道初始音频数据均分别进行声音分离和音效增强处理,声音分离将消耗较大的运算量,因此,对显示设备中芯片的处理能力要求较高。为了解决该问题,可以将第一声道初始音频数据和第二声道初始音频数据合并,也就是直接对第一音频数据进行声音分离,对分离得到的第一目标音频数据进行增益处理后,得到第二目标音频数据。将第二目标音频数据分别和第一声道目标音频数据和第二声道目标音频数据进行合并。这样,可以使声音分离的运算量减少一半,从而使得芯片的处理能力不是很高的情况下也可以实现本实施方式,提高本 实施方式的适用性。
随着芯片AI运算能力的提升,机器学习被广泛应用于图像、声音领域,甚至出现了很多场景上的结合。本申请还提供了一种提升声音立体效果的显示设备。在一些实施方式中,在安卓***中的实现可以如图13A所示,安卓***中主要包括应用层、中间件以及核心层,实现逻辑可以在中间件,中间件可以包括:音频解码器、人声分离模块、增益控制模块、图像解码器、声音分配模块、合并模块、音效增强模块和音频输出接口等模块,其中,关于音频解码器、音效增强模块、音频输出接口的介绍,与图11A中的相同,不同的在于人声分离模块用于对解码后的左声道音频数据和右声道音频数据分别进行人声分离,得到左声道人声音频数据和左声道背景音频数据,以及右声道人声音频数据和右声道背景音频数据。声音分配模块用于对图像解码器解码输出的图像进行唇动检测,以确定各个音频输出接口输出人声音频的权重和背景音频的权重。
与上述图13A相对应,图13B为本申请一些实施例中音频处理方法的一种示意图。音频解码器可以解码输出左声道音频数据和右声道音频数据,可以分别对左声道音频数据和右声道音频数据进行人声分离,得到左声道人声音频数据和左声道背景音频数据,以及右声道人声音频数据和右声道背景音频数据。例如,可以通过AI技术,通过预先训练的神经网络模型实现对左声道音频数据的人声分离,以及右声道音频数据的人声分离。将左声道人声音频数据和右声道人声音频数据进行合并,得到目标人声音频数据。
同时,图像解码器可以解码得到左声道音频数据和右声道音频数据所在时刻的图像,并对该图像进行唇动检测,根据唇动检测结果,确定目标人声音频数据在各个音频输出接口的权重。并且,可以根据音频输出接口的坐标,确定音频输出接口输出左声道背景音频数据和右声道背景音频数据的权重。之后,根据目标人声音频数据在各个音频输出接口的权重,音频输出接口输出左声道背景音频数据和右声道背景音频数据的权重,将人声音频和背景音频进行合并。最后,再对合并后的音频进行音效增强处理并输出。
可以看出,针对立体声显示设备,在分别对左声道音频数据和右声道音频数据分别进行人声分离后,可以先对分离出的左声道人声音频数据和右声道人声音频数据合并。然后根据人物在图像中说话的位置,调整各个音频输出接口对应的人声权重,即输出人声音频对应的权重,以及根据音频输出接口的位置,调整各个音频输出接口输出背景音频的权重,从而使声音的立体感增强,提升用户的观看体验。
本申请一些实施例中,显示设备200,包括:控制器250和多个音频输出接口270;
控制器250,被配置为:对获取到的第一声道音频数据和第二声道音频数据分别进行人声分离,得到第一声道第一人声音频数据和第一声道第一背景音频数据,以及第二声道第一人声音频数据和第二声道第一背景音频数据。
其中,第一声道音频数据和第二声道音频数据是同一时刻获取到的两个不同声道的音频数据,第一声道音频数据和第二声道音频数据可以使声音更具有立体感。例如,第一声道音频数据和第二声道音频数据分别可以为左声道音频数据和右声道音频数据。
针对第一声道音频数据,可以通过人声分离(例如人工智能技术)得到第一声道第一人声音频数据和第一声道第一背景音频数据。第一声道第一人声音频数据是指第一声道音频数据中的人声,第一声道第一人声音频数据的数量可以是多个,也就是,可以提取多个人的人声。除去第一声道第一人声音频数据之外的音频数据即为第一声道第一背景音频数据。同样地,可以对第二声道音频数据进行人声分离,得到第二声道第一人声音频数据和 第二声道第一背景音频数据。
将第一声道第一人声音频数据和第二声道第一人声音频数据进行合并,得到目标人声音频数据。
本申请一些实施例中,针对分离出的第一声道第一人声音频数据和第二声道第一人声音频数据,并没有直接被分配到第一声道和第二声道与背景音频合并,而是先直接将第一声道第一人声音频数据和第二声道第一人声音频数据进行合并,得到目标人声音频数据。进而,根据人物在图像中说话的位置,对目标人声音频数据在各个音频输出接口的输出情况进行分配。
需要说明的是,如果包含多个人物的人声音频,针对每个人物,将该人物对应的第一声道第一人声音频数据和第二声道第一人声音频数据进行合并,得到该人物的目标人声音频数据。每个人物的目标人声音频数据的分配方法类似,在此以一个人物的目标人声音频数据为例进行说明。
控制器250,被配置为:获取第一声道音频数据和第二声道音频数据所在时刻的图像数据,对图像数据进行唇动检测,如果检测到显示设备屏幕中的唇动坐标,根据唇动坐标和单个音频输出接口的坐标,确定该音频输出接口对应的人声权重。
在显示设备中,除了音频解码器解码得到音频数据外,图像解码器也可以解码得到对应的图像数据。在音画同步的情况下,可以同时获取音频对应的图像数据。在此,可以获取第一声道音频数据和第二声道音频数据所在时刻的图像数据。
通过人声分离提取到人声音频的情况下,图像数据中通常具有对应的人物图像。因此,可以对图像数据进行唇动检测,得到唇动坐标,即人物唇部的位置坐标。例如,可以通过人工智能技术,检测图像数据中是否存在嘴唇信息,以及是否存在唇动。如果存在发生动作的嘴唇,则可以检测到唇动坐标。
唇动坐标指示图像中人物在屏幕中说话的位置,而多个音频输出接口的坐标表示输出音频的位置。可以理解的是,当唇动坐标距离音频输出接口越近,该音频输出接口对应的人声权重也越大。人声权重越大,音频输出接口输出人声音频的能量也越大。
在一些实施例中,控制器250,被配置为:针对每个音频输出接口,根据音频输出接口的坐标,确定音频输出接口在屏幕中对应的区域;如果唇动坐标位于音频输出接口对应的区域内,确定音频输出接口对应的人声权重为第一数值;如果唇动坐标位于音频输出接口对应的区域外,确定音频输出接口对应的人声权重为第二数值,第二数值小于第一数值。
本申请一些实施例中,可以预先根据各个音频输出接口的坐标,在屏幕中为各个音频输出接口划分对应的区域。可以理解的是,当唇动坐标距离音频输出接口对应的区域越近,该音频输出接口对应的人声权重也越大。
例如,将屏幕划分为左区域和右区域,屏幕左下方和右下方均包含一个扬声器。唇动坐标可以是实际像素点的位置坐标(x,y),如果播放视频的行分辨率是L,列分辨率是C。那么,可以归一化得出唇动坐标为以下公式:
x’=x÷C,y’=y÷L      (12)
如果x’小于0.5,则说明唇动坐标在左区域,如果x’大于0.5,则说明唇动坐标在右区域。
如果唇动坐标在屏幕的左区域,那么,可以将屏幕左下方的扬声器对应的人声权重和屏幕右下方的扬声器对应的人声权重分别设置为1和0,也就是,通过屏幕左下方的扬声 器输出目标人声音频数据,屏幕右下方的扬声器不输出目标人声音频数据。或者,也可以将屏幕左下方的扬声器对应的人声权重和屏幕右下方的扬声器对应的人声权重分别设置为0.8和0.2等,可以具体参考唇动坐标在左区域的具***置确定。唇动坐标越靠近左区域的左侧,屏幕左下方的扬声器对应的人声权重和屏幕右下方的扬声器对应的人声权重的差值越大;唇动坐标越靠近左区域的右侧,也就是越靠近屏幕的中间,屏幕左下方的扬声器对应的人声权重和屏幕右下方的扬声器对应的人声权重的差值越小。
参见图14,图14为扬声器分布的一种示意图,可以看出,显示设备包含四个扬声器,分别在屏幕的左下方、右下方、左上方和右上方。四个扬声器在屏幕中对应的区域如图14所示,分别为屏幕的左下区域、右下区域、左上区域和右上区域。唇动坐标位于左上区域,左下方、右下方、左上方和右上方四个扬声器对应的人声权重分别可以为:0、0、1和0。或者,左下方、右下方、左上方和右上方四个扬声器对应的人声权重也可以为0.2、0、0.8和0等,使最终效果以主观听感位于屏幕左上方。
在一些实施例中,屏幕包括:中间区域和非中间区域。控制器250,被配置为:如果唇动坐标位于非中间区域,根据唇动坐标和多个音频输出接口的坐标,确定多个音频输出接口分别对应的人声权重。即,可以按照上述方法,确定多个音频输出接口分别对应的人声权重。
如果唇动坐标位于中间区域,根据多个音频输出接口的坐标和多个音频输出接口的属性信息,确定多个音频输出接口分别对应的人声权重,其中,属性信息包括音量大小和/或朝向。即,当唇动坐标位于屏幕的中间区域时,可以灵活地根据音频输出接口的音量大小、朝向及位置关系等,对各个音频输出接口对应的人声权重进行配置,使最终效果以主观听感位于屏幕中心为宜。
例如,针对图14所示的扬声器,屏幕下方的扬声器的朝向向下,屏幕上方的扬声器的朝向向上。在该朝向的基础上,扬声器的音量越大,该扬声器对应的人声增益越小,扬声器的音量越小,该扬声器对应的人声增益越大。这样,可以使主观听感位于屏幕中间。或者,如果四个扬声器的音量相同,四个扬声器对应的人声增益可以相同。
如果多个扬声器在屏幕周围的分布不均匀,各个扬声器的朝向也不是朝向正下方或者正上方,可以根据具体参考多个扬声器的位置关系、朝向和音量大小,确定人声权重,使主观听感位于屏幕中间即可。可以理解的是,各个扬声器对应的人声权重可以包含多种不同的情况。
控制器250,被配置为:根据音频输出接口的坐标,确定音频输出接口对应第一声道第一背景音频数据和/或第二声道第一背景音频数据。
对于背景音频数据,由于与人声无关,可以直接根据音频输出接口的坐标,确定该音频输出接口输出第一声道第一背景音频数据,还是第二声道第一背景音频数据,还是第一声道第一背景音频数据和第二声道第一背景音频数据。
在一些实施例中,屏幕包括:左区域和右区域,如果音频输出接口的坐标对应于左区域,确定音频输出接口对应第一声道初始背景音频数据;如果音频输出接口的坐标对应于右区域,确定音频输出接口对应第二声道初始背景音频数据。如果屏幕左下方和右下方均包含一个扬声器,分别对应于左区域和右区域,屏幕左下方的扬声器可以输出第一声道初始背景音频数据,屏幕右下方的扬声器可以输出第二声道初始背景音频数据。
在一些实施例中,屏幕包括:左区域、中间区域和右区域;控制器250,被配置为: 如果音频输出接口的坐标对应于左区域,确定音频输出接口对应第一声道第一背景音频数据;如果音频输出接口的坐标对应于右区域,确定音频输出接口对应第二声道第一背景音频数据;如果音频输出接口的坐标对应于中间区域,确定音频输出接口对应第一声道第一背景音频数据和第二声道第一背景音频数据。
例如,屏幕左下方、中下方和右下方均包含一个扬声器,分别对应于左区域、中间区域和右区域,屏幕左下方的扬声器可以输出第一声道第一背景音频数据,屏幕中下方的扬声器可以同时输出第一声道第一背景音频数据和第二声道第一背景音频数据,屏幕右下方的扬声器可以输出第二声道第一背景音频数据。
控制器250,被配置为:将目标人声音频数据和音频输出接口对应的人声权重的乘积,以及音频输出接口对应的第一声道第一背景音频数据和/或第二声道第一背景音频数据合并,并进行音效增强处理,得到音频输出接口对应的音频数据。
在确定每个音频输出接口对应的人声音频(即目标人声音频数据和音频输出接口对应的人声权重的乘积)和背景音频(即第一声道第一背景音频数据和/或第二声道第一背景音频数据)后,可以将人声音频和背景音频进行合并,并进行音效增强处理,得到音频输出接口对应的音频数据。
单个音频输出接口270,被配置为:输出所述音频输出接口对应的音频数据。
在一些实施例中,在对左声道音频数据和右声道音频数据分别进行人声分离后,还可以对人声音频和背景音频进行不同的增益处理,以突出增强人声音频或背景音频。
控制器250还被配置为:按照第一增益分别对第一声道第一人声音频数据和第二声道第一人声音频数据进行增益处理,得到第一声道第二人声音频数据和第二声道第二人声音频数据;按照第二增益分别对第一声道第一背景音频数据和第二声道第一背景音频数据进行增益处理,得到第一声道第二背景音频数据和第二声道第二背景音频数据;其中,第一增益和第二增益根据显示设备对应的声音控制模式确定。
需要说明的是,第一声道第一人声音频数据和第二声道第一人声音频数据均属于人声音频,可以对应相同的第一增益,第一声道第一背景音频数据和第二声道第一背景音频数据均属于背景音频,可以对应相同的第二增益。
在一些实施例中,显示设备对应多种预设声音清晰度控制模式和/或多种预设音效模式;每种预设声音清晰度控制模式具有对应的数值,每种预设音效模式具有对应的数值;声音控制模式包括:目标声音清晰度控制模式和/或目标音效模式;其中,目标声音清晰度控制模式为多种预设声音清晰度控制模式中的一种,目标音效模式为多种预设音效模式中的一种;控制器250,被配置为:根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第一增益和第二增益。
可见,用户可以根据自身的喜好来控制显示设备的声音控制模式,进而,控制器250可以根据该声音控制模式,确定如何对第一声道第一人声音频数据和第二声道第一人声音频数据进行增益处理,以及如何对第一声道第一背景音频数据和第二声道第一背景音频数据进行增益处理。
需要说明的是,第一增益和第二增益的确定方法,与前述实施例中第一增益和第二增益的确定方法相同,具体可参见前述实施例中的描述即可,在此不再赘述。
控制器250,被配置为:将第一声道第二人声音频数据和第二声道第二人声音频数据进行合并,得到目标人声音频数据;以及针对每个音频输出接口,根据音频输出接口的坐 标,确定音频输出接口对应第一声道第二背景音频数据和/或第二声道第二背景音频数据;将目标人声音频数据和音频输出接口对应的人声权重的乘积,以及音频输出接口对应的第一声道第二背景音频数据和/或第二声道第二背景音频数据合并,并进行音效增强处理,得到音频输出接口对应的音频数据。
在一些实施例中,图像数据中不包含人物,或者即使图像数据中包含人物,但是并没有显示出人物的嘴唇,例如只显示人物的侧脸、人物的背影等。或者,即使显示人物的嘴唇,但是人物的嘴唇是没有动作的,此时将无法检测到唇动坐标。控制器250,还被配置为:如果未检测到唇动坐标,针对每个音频输出接口,可以直接根据第一声道第一人声音频数据的能量和第二声道第一人声音频数据的能量的比值,以及音频输出接口的坐标,确定音频输出接口分别对应的人声权重。
例如,如果屏幕左下方和右下方各包含一个扬声器,且左声道人声音频数据的能量和右声道人声音频数据的能量的比值大于1,位于屏幕左下方的扬声器对应的人声权重可以大于位于屏幕右下方的扬声器对应的人声权重。如果左声道人声音频数据的能量和右声道人声音频数据的能量的比值为0.6:0.4,那么屏幕左下方的扬声器对应的人声权重可以为0.6,屏幕右下方的扬声器对应的人声权重可以为0.4。或者,为了更增强声音的方位感,屏幕左下方的扬声器对应的人声权重可以为0.7,屏幕右下方的扬声器对应的人声权重可以为0.3。
目前,电视机的卡拉OK功能,通常是在唱歌APP中完成。唱歌APP具有丰富的功能和较佳的用户体验,但是唱歌APP的媒体资源比较受限。例如,一首歌曲的原唱歌手A是一位男歌手,而翻唱歌手B是一位女歌手。当一位女性用户C想唱这首歌时,唱歌APP中可能只录入了原唱歌手A的伴奏视频,但是没有歌手B的伴奏视频,导致无法找到合适的伴奏。或者,虽然采取两个声道相减的方式来消除立体声歌曲中的人声。但是,该方法有时会损失歌曲中的低音,得到的伴奏声音比较微弱,没有唱歌伴奏感,用户体验较差。
因此,本申请一些实施例的显示设备通过人声分离技术,去除正在播放歌曲中的人声,使得用户可以在不依赖唱歌APP的情况下,找到自己喜欢的歌曲,如通过在线音乐播放器播放自己熟悉的歌曲,或通过电视播放自己付费购买的音视频内容。然后打开消除人声功能,可以去掉音频中的原唱人声,进而实现不受媒体资源限制地唱歌。同时,可以根据麦克风采集到的演唱人声的能量,将原唱人声全部或部分添加至伴奏中,避免因演唱者唱歌水平不高而影响唱歌体验。
该显示设备在安卓***中的实现可以如图15A所示,安卓***中主要包括应用层、中间件以及核心层,实现逻辑可以在中间件,中间件包括:音频解码器、人声分离模块、音频输入接口、原唱音量控制模块、合并模块、音效增强模块、增益控制模块、延时模块和音频输出接口。其中,音频解码器、人声分离模块、合并模块、音效增强模块、音频输出接口与图13A所示的模块相同,不同在于音频输入接口用于接收用户输入的演唱音频,原唱音量控制模块根据演唱音频据和分离出的原唱人声音频,确定合并至伴奏音频的原唱音频的大小,即目标人声音频。
与上述图15A相对应,图15B为本申请一些实施例中音频处理方法的一种示意图。音频解码器解码得到歌曲音频数据后,通过人声分离,得到原唱人声音频数据和伴奏音频数据。同时,麦克风可以采集用户输入的演唱人声音频数据,根据原唱人声音频数据和演唱 人声音频数据可以确定目标人声音频数据,即合并至伴奏音频数据中原唱人声音频数据的大小。将演唱人声音频数据、目标人声音频数据和伴奏音频数据合并,并进行音效增强处理之后再进行输出。
本申请一些实施例还提供了一种显示设备200,包括:
控制器250,被配置为:获取歌曲音频数据,对歌曲音频数据进行人声分离,得到原唱人声音频数据和伴奏音频数据。
歌曲音频数据可以是任意的歌曲,包括唱歌APP中收录的歌曲,以及唱歌APP中没有收录的歌曲。通过对歌曲音频数据进行人声分离,例如,通过人工智能技术可以分离出原唱人声音频数据和伴奏音频数据。可见,针对任何歌曲,均可以分离出对应的伴奏音频数据。
控制器250,还被配置为:根据每个时间周期内的原唱人声音频数据的能量和在时间周期内采集到的演唱人声音频数据的能量,确定原唱增益;根据原唱增益,对时间周期内的原唱人声音频数据进行增益处理,得到目标人声音频数据。
在唱歌过程中,用户可以通过音频输入接口(例如麦克风)演唱歌曲,此时,可以采集到演唱人声音频数据,而用户在唱歌时可能存在跑调、音准不够好等问题。另外,人声分离是在显示设备的主芯片实时运算的,可能存在人声分离不干净或分离时引入个别杂音的问题。为了解决该问题,可以在用户没有唱歌或唱歌间隙时,人声分离出的原唱人声音频被全部或部分合并至伴奏中,以烘托出唱歌现场的气氛,而当检测到用户在唱歌时,可以通过原唱人声音频的音量控制减小或者静音原唱人声音频,以播放用户唱歌的声音为主。
由于每个歌曲对应较长的也一个时间段,因此,在处理音频数据时,可以按预先设置的时间周期对该音频数据进行处理。也就是,按照时间顺序依次处理各个时间周期的音频数据。其中,时间周期可以是0.8秒、1秒等。
针对每个时间周期,可以根据原唱人声音频数据的能量和演唱人声音频数据的能量,得到原唱增益,通过原唱增益对原唱人声音频数据进行增益处理,得到目标人声音频数据,即合并至伴奏音频数据中的音频数据。
在一些实施例中,原唱增益小于等于预设增益阈值。例如,预设增益阈值可以是0.1dB、0dB、-0.1dB等。在预设增益阈值等于0dB的情况下,原唱增益小于等于0 dB。在原唱增益等于0 dB的情况下,表示原唱人声音频数据全部合并至伴奏音频数据中;在原唱增益小于0 dB的情况下,表示原唱人声音频数据部分合并至伴奏音频数据中。在预设增益阈值小于0 dB的情况下,原唱增益也小于0 dB,表示原唱人声音频数据部分合并至伴奏音频数据中。在预设增益阈值大于0 dB的情况下,表示原唱人声音频数据可以在增强处理后合并至伴奏音频数据中。
在一些实施例中,控制器250,被配置为:如果演唱人声音频数据的能量小于预设能量阈值,该预设能量阈值是一个较小的能量值,此时可以认为用户没有唱歌,可以将原唱增益设置为预设增益阈值,例如将原唱增益设置为0dB,即直接将原唱人声音频数据作为目标人声音频数据。如果演唱人声音频数据的能量大于等于预设能量阈值,此时可以认为用户已经开始唱歌了,根据演唱人声音频数据的能量和原唱人声音频数据的能量之间的能量比,确定原唱增益,使原唱增益小于预设增益阈值,即可以降低原唱人声音频数据的能量后,作为目标人声音频数据。
在一些实施例中,为了保证合并至伴奏音频数据中的声音相对稳定,而不是随着演唱 人声音频数据的音量大小忽大忽小的变化,可以预先建立演唱人声音频数据的能量和原唱人声音频数据的能量之间的能量比和原唱增益的对应关系,例如,能量比在某个能量比范围内时,原唱增益可以对应同一个值。例如,如果能量比小于等于0.25,表示演唱人声音频数据的能量较小,w=0dB,可以将原唱人声音频数据全部合并至伴奏音频数据中;如果0.25<能量比<0.75,表示演唱人声音频数据的能量适中,w=-6dB,可以将原唱人声音频数据部分合并至伴奏音频数据中;如果能量比大于等于0.75,表示演唱人声音频数据的能量较大,可以全部关闭原唱人声音频数据,只播放演唱人声音频数据。
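以下为一个示意性的Python片段，按上文给出的能量比分段（0.25、0.75）确定原唱增益；函数名与以负无穷表示静音的方式为示例性假设。

    def original_gain_db(ratio):
        """ratio：演唱人声音频数据能量 / 原唱人声音频数据能量"""
        if ratio <= 0.25:
            return 0.0            # 演唱声较小：原唱人声全部并入伴奏
        if ratio < 0.75:
            return -6.0           # 演唱声适中：原唱人声部分并入伴奏
        return float("-inf")      # 演唱声较大：关闭原唱人声（静音）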
控制器250,被配置为:将该时间周期内的伴奏音频数据、目标人声音频数据和演唱人声音频数据进行合并以及音效增强处理,得到目标音频数据。本申请在将伴奏音频数据和演唱人声音频数据合并的基础上,还合并有目标人声音频数据。目标人声音频数据指原唱人声音频数据的全部,或者原唱人声音频数据的部分,因此,最终输出的目标音频数据更丰富,效果更好。
音频输出接口270,被配置为:输出目标音频数据。
本申请一些实施例中,对于任何歌曲,均可以通过人声分离,得到伴奏音频数据,使用户在唱歌时,可以不受媒体资源的限制。并且,可以根据用户的唱歌水平,确定是否在伴奏音频数据中添加原唱人声音频数据,或部分地添加原唱人声音频数据,从而提升用户的唱歌体验。
在一些实施例中,控制器250,还被配置为:获取前一个时间周期对应的原唱增益,如果当前时间周期对应的原唱增益和前一个时间周期对应的原唱增益相同,表示前一个时间周期对应的演唱人声音频数据的能量和原唱人声音频数据的能量之间的能量比,与当前时间周期对应的能量比差距较小,例如位于同一能量比范围,表示用户唱歌比较稳定,用户对演唱的歌曲很熟悉,可以延长时间周期,以降低上述过程的处理频率,直至延长后的时间周期小于第一时间阈值(例如,可以是2秒等)。也就是,降低上述过程的处理频率,而不是在唱歌间隙频繁地将基于原唱人声音频数据得到的目标人声音频数据合并至伴奏音频数据中。当然,时间周期也不能无限地延长,避免时间周期过长而影响最终的演唱效果。
如果当前时间周期对应的原唱增益和前一个时间周期对应的原唱增益不同,表示用户唱歌时发生了音量变化,与原唱不合拍,用户可能出现了不会唱、唱不准等情况,此时缩短时间周期,即迅速地调出目标音频数据,将目标音频数据合并至伴奏音频数据中,直至缩短后的时间周期大于第二时间阈值(例如,可以是0.25秒等),其中,第一时间阈值大于第二时间阈值。
上述音频处理过程,与简单地将左右声道音频数据相减的消除原唱人声音频数据的方法相比,可以提升唱歌时伴奏的效果。但是,在专业的唱歌APP中,除了左右声道音频数据相减的曲库以外,还有很多专业的伴奏曲库。该伴奏曲库并不是通过左右声道音频数据相减的方法消除原唱人声音频数据得到的,而是在录制音乐时,把伴奏音频数据录制在一个单独音轨中。对于很多歌曲,除了伴奏还有一些专业伴唱人员的和声。而本申请一些实施例中,可以识别一切人声并进行消除,虽然可近似得到单独音乐伴奏音轨的效果,但是因为伴唱人员的和声也被消除了,导致被留下的伴奏缺少氛围感。另外,人声分离是在原始的音频信号中,把属于人声特征的信号剥离出来,然而人声和乐器的声音会在频域上有所重合,在分离人声时会导致与人声重合的乐器声音也被一起剥离出来。
为了解决该问题,可以将分离出的原唱人声音频数据进行变换,得到伴唱音频数据,再将伴唱音频数据以一定比例合并至伴奏中,用于弥补伴奏空洞感的问题。该比例与演唱人声音频数据的能量相关联,具体来讲,当演唱人声音频数据的能量变大时,该比例也变大,而当演唱声音变小的时候,该比例也变小。
在一些实施例中,为了避免在人声分离时,消除专业伴唱人员的和声的问题,控制器250,还被配置为:根据每个时间周期内的原唱人声音频数据,生成第一伴唱音频数据。
如前所述,如果演唱人声音频数据的能量小于预设能量阈值,表示用户没有唱歌,或者唱歌的声音极小,可以将原唱人声音频数据全部合并至伴奏音频数据中。此时,可以不用生成第一伴唱音频数据。因此,在一些实施例中,在演唱人声音频数据的能量大于等于预设能量阈值时,再根据每个时间周期内的原唱人声音频数据,生成第一伴唱音频数据。
在一些实施例中,可以对原唱人声音频数据进行时域变换,生成第一伴唱音频数据。控制器250,被配置为:获取多个不同的延时以及每个延时对应的增益;针对每个延时,根据延时对每个时间周期内的原唱人声音频数据进行延时处理,得到第一延时音频数据;根据延时对应的增益对延时音频数据进行增益处理,得到第二延时音频数据;将多个第二延时音频数据进行合并,得到第一伴唱音频数据。
参见图16,图16为本申请一些实施例中对原唱人声音频数据进行时域变换的一种示意图。
获取多个不同的延时以及每个延时对应的增益。多个不同的延时以及每个延时对应的增益可以是预先设置的。多个不同的延时可以等间隔,延时越长,增益越小,因此,多个不同的延时对应的增益逐渐减小。例如,T1为10ms、T2为20ms、T3为30ms……,增益1为0dB、增益2为-6dB、增益3为-10dB……
针对每个延时,可以根据该延时对每个时间周期内的原唱人声音频数据进行延时处理,得到第一延时音频数据。并根据该延时对应的增益对延时音频数据进行增益处理,得到第二延时音频数据。例如,针对T1,可以根据10ms对原唱人声音频数据进行延时处理,得到第一延时音频数据,并根据0dB对第一延时音频数据进行增益处理,得到第二延时音频数据。针对T2、T3……按照相同的方式进行处理,均可以得到对应的第二延时音频数据。
之后,将多个第二延时音频数据进行合并,得到第一伴唱音频数据。
这样,经过不同的延时后,再经过不同的增益叠加在一起,可以形成类似在室内或者体育场的混响效果。即原唱的声音听起来像是多人在一起唱歌的感觉,使原唱人声变成了具有合唱感的音乐。
在一些实施例中,还可以对原唱人声音频数据进行频域变换,生成第一伴唱音频数据。控制器250,被配置为:确定原唱人声音频数据所属的音区;根据音区对原唱人声音频数据进行升调处理或降调处理,得到第一伴唱音频数据。这样,可以形成伴唱,且伴唱与原唱不在一个声调上。例如,针对专业的演出,都有专业的伴唱团队,他们演唱的声音与原唱不在一个声部上,比如可能会比原唱高3度或低3度。
参见图17,图17为本申请一些实施例中对原唱人声音频数据进行频域变换的一种示意图。通过基频分析,可以确定原唱人声音频数据所属的音区。其中,基频分析是将人声做FFT(快速傅立叶变换),找到第一个峰值,该峰值频率即为基频。根据基频可以得知演唱者的音调,例如,中央C即“do”的频率为261.6Hz。根据计算出来的当前声音的声调,可以计算升调几度或者降调几度对应的频率。
需要说明的是，不同音区升调或降调是存在一定差距的，可以区别运算。例如，针对钢琴键谱图，在此可以根据钢琴键盘详细说明升3度或降3度的算法原理。如果当前原唱人声音频数据所属的音区是中音C，即C4，升3度即白键盘E4，中间一共4个半音，即当前声音变调升频 2^(4/12)（约1.26）倍。而如果当前原唱人声音频数据的音调是B3，升3度为D4，一共3个半音，即升频 2^(3/12)（约1.19）倍。
本申请一些实施例中,还可以根据一般演唱者的演唱习惯,对原唱人声音频数据进行升调处理或降调处理。具体而言,对于非专业歌手,通常会存在低音不够低、高音不够高的问题。因此,在一些实施例中,为了解决非专业歌手在唱歌时,低音不够低、高音不够高的问题。控制器250,被配置为:如果音区为低音区,对原唱人声音频数据进行降调处理,得到第一伴唱音频数据;如果音区为高音区,对原唱人声音频数据进行升调处理,得到第一伴唱音频数据;如果音区为中音区,对原唱人声音频数据进行升调处理和降调处理,分别得到第一人声音频数据和第二人声音频数据;将第一人声音频数据和第二人声音频数据作为第一伴唱音频数据。
具体的,当原唱人声音频数据低于某个低音调时,启动降调运算,而当原唱人声音频数据高于某个高音调时,启动升调运算。例如,当高于C5时启动升调运算,也就是控制降调运算的增益为最小,即静音,而控制升调运算的增益为0dB,即生成的第一伴唱音频数包含升调运算后的音频数据。相反地,当低于C4时,启动降调运算,控制降调运算的增益为0dB,而控制升调运算的增益为最小,即静音,即生成的第一伴唱音频数包含降调运算后的音频数据。而当处于C4和C5中间时,可以让升调运算和降调运算的增益均为-6dB,即生成的第一伴唱音频数同时包含声调运算后的音频数据和降调运算后的音频数据。
需要说明的是,如果按照原唱人声音频数据的能量大小,将第一伴唱音频数据合并至伴奏音频数据,可能会影响原本的伴奏曲风和音色。伴唱的目的在演唱声存在时,用于丰富和美化演唱声。因此,最终合并至伴奏音频数据中的伴唱音频数据的能量可以小于演唱人声音频数据的能量。例如,比演唱人声音频数据小12dB等。
因此,在生成第一伴唱音频数后,控制器250,被配置为:根据在时间周期内采集到的演唱人声音频数据的能量,确定伴唱增益;其中,伴唱增益和时间周期内采集到的演唱人声音频数据的能量成正相关;通过伴唱增益对第一伴唱音频数据进行增益处理,得到第二伴唱音频数据;其中,第二伴唱音频数据的能量小于演唱人声音频数据的能量。
可以理解的是,演唱人声音频数据的能量越大,最终合并至伴奏音频数据的伴唱音频数据的能量也可以越大,因此,伴唱增益和该时间周期内采集到的演唱人声音频数据的能量成正相关。假设演唱人声音频数据的能量为E,伴唱增益m可以根据以下公式计算得到:m=E-12。这样,通过伴唱增益得到的第二伴唱音频数据的能量小于演唱人声音频数据的能量。当然,伴唱增益的计算方法不限于此,可以对上述公式进行简单变形来计算伴唱增益。
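以下为一个示意性的Python片段，按上文公式 m=E-12 对第一伴唱音频数据施加伴唱增益；其中假设第一伴唱音频数据已归一化至0dB参考电平，且以NumPy数组表示。

    import numpy as np

    def apply_backing_gain(first_backing, sing_energy_db):
        """first_backing：第一伴唱音频数据（已归一化的NumPy数组）
           sing_energy_db：该时间周期内演唱人声音频数据的能量E（dB）
           返回第二伴唱音频数据，其能量约比演唱人声低12dB"""
        m = sing_energy_db - 12.0                   # 伴唱增益，与演唱能量正相关
        return np.asarray(first_backing) * (10.0 ** (m / 20.0))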
控制器250,被配置为:将时间周期内的伴奏音频数据、第二伴唱音频数据、目标人声音频数据和演唱人声音频数据进行合并以及音效增强处理,得到目标音频数据。
这样,在伴奏音频数据、演唱人声音频数据和目标人声音频数据的基础上,进一步添加第二伴唱音频数据,可以避免在人声分离过程中,将歌曲中的伴唱音频数据也剥离导致伴奏效果差的问题,从而可以提高伴奏的整体效果,最终提升用户的唱歌体验。
相应于上述显示设备实施例,本申请还提供了一种音频处理方法。可以理解的是,图18~图21中所涉及的步骤在实际实现时可以包括更多的步骤,或者更少的步骤,并且这些步骤之间的顺序也可以不同,以能够实现本发明实施例中提供的音频处理方法为准。
参见图18,图18为本申请一些实施例中音频处理方法的一种流程图,可以包括以下步骤:
步骤S1810,对获取到的第一音频数据进行声音分离,得到第一目标音频数据和第一背景音频数据。
步骤S1820,按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据,按照第二增益对第一背景音频数据进行增益处理,得到第二背景音频数据。其中,第一增益和第二增益根据显示设备对应的声音控制模式确定。
步骤S1830,将第二目标音频数据和第二背景音频数据进行合并,并进行音效增强处理,得到并输出第二音频数据。
在上述音频处理方法中,从第一音频数据中分离出第一目标音频数据和第一背景音频数据后,可以按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据;按照第二增益对第一背景音频数据进行增益处理,得到第二背景音频数据。将第二目标音频数据和第二背景音频数据进行合并,并进行音效增强处理,得到并输出第二音频数据。由于第一增益和第二增益根据显示设备对应的声音控制模式确定,因此可以结合用户当前的观看需求,通过对第一目标音频数据和第一背景音频数据进行非等比例的增益处理后再合并,可以根据用户的观看需求来增强第一目标音频数据或者第一背景音频数据,从而可以提升音效增强的效果。
在一些实施例中,上述音频处理方法还包括:根据声音控制模式,确定第一音频数据对应的音效增强模式的类型;根据声音控制模式,确定与音效增强模式的类型对应的第一增益和第二增益。
在一些实施例中,显示设备对应多种预设声音清晰度控制模式和/或多种预设音效模式;每种预设声音清晰度控制模式具有对应的数值,每种预设音效模式具有对应的数值;声音控制模式包括:目标声音清晰度控制模式和/或目标音效模式;其中,目标声音清晰度控制模式为多种预设声音清晰度控制模式中的一种,目标音效模式为多种预设音效模式中的一种;根据声音控制模式,确定第一音频数据对应的音效增强模式的类型,包括:根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第一音频数据对应的音效增强模式的类型;根据声音控制模式,确定与音效增强模式的类型对应的第一增益和第二增益,包括:根据第一数值和/或第二数值,确定与音效增强模式的类型对应的第一增益和第二增益。
在一些实施例中,根据声音控制模式,确定与音效增强模式的类型对应的第一增益和第二增益,包括:如果第一音频数据对应的音效增强模式的类型为声音增强模式,第一增益大于第二增益;如果第一音频数据对应的音效增强模式的类型为背景增强模式,第一增益小于第二增益。
在一些实施例中,第一音频数据中包括至少一种属于预设声音类型的第三目标音频数据;
上述音频处理方法还包括:从第一音频数据中分离出至少一种第三目标音频数据和第三背景音频数据;获取单个第三目标音频数据的第一声道初始目标音频数据的第一能量值 和第二声道初始目标音频数据的第二能量值;按照第三增益对第一声道初始目标音频数据进行增益处理,得到第一声道第一增益音频数据;按照第四增益对第二声道初始目标音频数据进行增益处理,得到第二声道第一增益音频数据;其中,第三增益和第四增益根据第一能量值和第二能量值确定;将第一声道第一增益音频数据和第三背景音频数据的第一声道初始背景音频数据进行合并,并进行音效增强处理,得到并输出第一声道第一增强音频数据;将第二声道第一增益音频数据和第三背景音频数据的第二声道初始背景音频数据进行合并,并进行音效增强处理,得到并输出第二声道第一增强音频数据。
在一些实施例中,上述音频处理方法还包括:根据声音控制模式、第一能量值和第二能量值,确定单个第三目标音频数据对应的第五增益和第六增益;根据声音控制模式,确定第七增益;按照第五增益对第一声道初始目标音频数据进行增益处理,得到第一声道第二增益音频数据;按照第六增益对第二声道初始目标音频数据进行增益处理,得到第二声道第二增益音频数据;按照第七增益分别对第一声道初始背景音频数据和第二声道初始背景音频数据进行增益处理,得到第一声道增益背景音频数据和第二声道增益背景音频数据;将第一声道第二增益音频数据和第一声道增益背景音频数据进行合并,并进行音效增强处理,得到并输出第一声道第二增强音频数据;将第二声道第二增益音频数据和第二声道增益背景音频数据进行合并,并进行音效增强处理,得到并输出第二声道第二增强音频数据。
在一些实施例中,根据声音控制模式、第一能量值和第二能量值,确定单个第三目标音频数据对应的第五增益和第六增益,包括:根据声音控制模式,确定第一音频数据对应的音效增强模式的类型;根据第一声道初始目标音频数据的第一能量值和第二声道初始目标音频数据的第二能量值,确定左右声道能量大小关系;根据声音控制模式、第一能量值和第二能量值,确定与音效增强模式的类型以及左右声道能量大小关系对应的第五增益和第六增益;根据声音控制模式,确定第七增益,包括:根据声音控制模式,确定与音效增强模式的类型以及左右声道能量大小关系对应的第七增益。
参见图19,图19为本申请一些实施例中音频处理方法的一种流程图,可以包括以下步骤:
步骤S1910,对获取到的第一音频数据分别进行声音分离和音效增强处理,得到第一目标音频数据和第二音频数据。
步骤S1920,按照第一增益对第一目标音频数据进行增益处理,得到第二目标音频数据,按照第二增益对第二音频数据进行增益处理,得到第三音频数据,其中,第一增益和第二增益根据显示设备对应的声音控制模式确定。
步骤S1930,对第二目标音频数据或第三音频数据进行延时处理,以使第二目标音频数据和第三音频数据同步。
步骤S1940,将第二目标音频数据和第三音频数据合并,得到并输出第四音频数据。
本申请一些实施例的音频处理方法,由于声音分离算法只做目标声音的分离,不做背景声音的分离,因此,声音分离算法所消耗的时长可以减少一半。并且,声音分离和音效增强可以并行处理,而不是串行处理,可以进一步缩短整个音频处理流程所消耗的时长,从而提升音画同步的效果。另外,对第二目标音频数据或第三音频数据进行延时处理,例如,可以在音效增强链路和声音分离链路中运算时间少的链路中进行延时处理,使第二目标音频数据和第三音频数据同步后再合并,以避免回音问题,从而在提升音画同步效果的同时,不降低音效增强的效果。
在一些实施例中,对第二目标音频数据或第三音频数据进行延时处理,包括:获取声音分离时所消耗的第一时长以及音效增强处理时所消耗的第二时长;根据第一时长和第二时长,对第二目标音频数据或第三音频数据进行延时处理。在一些实施例中,对第二目标音频数据或第三音频数据进行延时处理,包括:根据第一目标音频数据和第二音频数据之间的相关性,确定第一目标音频数据和第二音频数据之间的时间差;根据时间差,对第二目标音频数据或第三音频数据进行延时处理。
在一些实施例中,根据第一目标音频数据和第二音频数据之间的相关性,确定第一目标音频数据和第二音频数据之间的时间差,包括:获取第一目标音频数据在时间段t内的第一音频段;获取第二音频数据在时间段t内的第二音频段,以及第二音频段之前的多个第三音频段、第二音频段之后的多个第四音频段;其中,第三音频段和第四音频段对应的时长均与时间段t的时长相等;确定第一音频段分别和第二音频段、第三音频段和第四音频段的相关性,确定相关性最高的音频段;将相关性最高的音频段和第一音频段的时间差确定为第一目标音频数据和第二音频数据之间的时间差。
在一些实施例中,第一音频数据包括第一声道初始音频数据和第二声道初始音频数据;对第一音频数据进行音效增强处理,得到第二音频数据,包括:对第一声道初始音频数据和第二声道初始音频数据分别进行音效增强处理,得到第一声道音效增强音频数据和第二声道音效增强音频数据;按照第二增益对第二音频数据进行增益处理,得到第三音频数据,包括:按照第二增益分别对第一声道音效增强音频数据和第二声道音效增强音频数据进行增益处理,得到第一声道目标音频数据和第二声道目标音频数据;对第二目标音频数据或第三音频数据进行延时处理,以使第二目标音频数据和第三音频数据同步,包括:对第二目标音频数据或第一声道目标音频数据进行延时处理,以使第二目标音频数据和第一声道目标音频数据同步;以及对第二目标音频数据或第二声道目标音频数据进行延时处理,以使第二目标音频数据和第二声道目标音频数据同步;将第二目标音频数据和第三音频数据合并,得到第四音频数据,包括:将第二目标音频数据分别和第一声道目标音频数据和第二声道目标音频数据进行合并,得到第一声道合并音频数据和第二声道合并音频数据。
在一些实施例中,显示设备对应多种预设声音清晰度控制模式和/或多种预设音效模式;每种预设声音清晰度控制模式具有对应的数值,每种预设音效模式具有对应的数值;声音控制模式包括:目标声音清晰度控制模式和/或目标音效模式;其中,目标声音清晰度控制模式为多种预设声音清晰度控制模式中的一种,目标音效模式为多种预设音效模式中的一种;
在一些实施例中,所述方法还包括:根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第一增益和第二增益,其中,第一增益大于第二增益。
在一些实施例中,根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第一增益和第二增益,包括:将第一增益设置为0dB;根据目标声音清晰度控制模式对应的第一数值和/或目标音效模式对应的第二数值,确定第二增益,使第二增益小于0 dB。
参见图20,图20为本申请一些实施例中音频处理方法的又一种流程图,应用于显示设备,可以包括以下步骤:
步骤S2010,对获取到的第一声道音频数据和第二声道音频数据分别进行人声分离, 得到第一声道第一人声音频数据和第一声道第一背景音频数据,以及第二声道第一人声音频数据和第二声道第一背景音频数据。
步骤S2020,将第一声道第一人声音频数据和第二声道第一人声音频数据进行合并,得到目标人声音频数据。
步骤S2030,获取第一声道音频数据和第二声道音频数据所在时刻的图像数据,对图像数据进行唇动检测,如果检测到显示设备屏幕中的唇动坐标,根据唇动坐标和显示设备的多个音频输出接口的坐标,确定多个音频输出接口分别对应的人声权重。
步骤S2040,针对每个音频输出接口,根据音频输出接口的坐标,确定音频输出接口对应第一声道第一背景音频数据和/或第二声道第一背景音频数据。
步骤S2050,将目标人声音频数据和音频输出接口对应的人声权重的乘积,以及音频输出接口对应的第一声道第一背景音频数据和/或第二声道第一背景音频数据合并,并进行音效增强处理,得到音频输出接口对应的音频数据,并通过音频输出接口输出音频数据。
本申请一些实施例的音频处理方法,在立体声场景下,在分别对第一声道音频数据和第二声道音频数据分别进行人声分离后,可以先对分离出的第一声道第一人声音频数据和第二声道第一人声音频数据合并,得到目标人声音频数据,将目标人声音频数据作为待输出的人声音频。然后根据人物在图像中说话的位置,调整各个音频输出接口对应的人声权重,即输出人声音频对应的权重,以及根据音频输出接口的位置,调整各个音频输出接口输出背景音频的权重,从而使声音的立体感增强,提升用户的观看体验。
In some embodiments, the above audio processing method further includes: performing gain processing on the first-channel first vocal audio data and the second-channel first vocal audio data respectively according to a first gain, to obtain first-channel second vocal audio data and second-channel second vocal audio data; and performing gain processing on the first-channel first background audio data and the second-channel first background audio data respectively according to a second gain, to obtain first-channel second background audio data and second-channel second background audio data, where the first gain and the second gain are determined according to the sound control mode corresponding to the display device. Combining the first-channel first vocal audio data and the second-channel first vocal audio data to obtain the target vocal audio data includes: combining the first-channel second vocal audio data and the second-channel second vocal audio data to obtain the target vocal audio data. For each audio output interface, determining, according to the coordinates of the audio output interface, the first-channel first background audio data and/or the second-channel first background audio data corresponding to the audio output interface includes: for each audio output interface, determining, according to the coordinates of the audio output interface, the first-channel second background audio data and/or the second-channel second background audio data corresponding to the audio output interface. Combining the product of the target vocal audio data and the vocal weight corresponding to the audio output interface with the first-channel first background audio data and/or the second-channel first background audio data corresponding to the audio output interface and performing sound effect enhancement processing to obtain the audio data corresponding to the audio output interface includes: combining the product of the target vocal audio data and the vocal weight corresponding to the audio output interface with the first-channel second background audio data and/or the second-channel second background audio data corresponding to the audio output interface, and performing sound effect enhancement processing, to obtain the audio data corresponding to the audio output interface.
In some embodiments, the above audio processing method further includes: if no lip movement coordinates are detected, for each audio output interface, determining the vocal weight corresponding to the audio output interface according to the ratio of the energy of the first-channel first vocal audio data to the energy of the second-channel first vocal audio data, and according to the coordinates of the audio output interface.
In some embodiments, the screen includes a left region, a middle region, and a right region. Determining, according to the coordinates of the audio output interface, the first-channel first background audio data and/or the second-channel first background audio data corresponding to the audio output interface includes: if the coordinates of the audio output interface correspond to the left region, determining that the audio output interface corresponds to the first-channel first background audio data; if the coordinates of the audio output interface correspond to the right region, determining that the audio output interface corresponds to the second-channel first background audio data; and if the coordinates of the audio output interface correspond to the middle region, determining that the audio output interface corresponds to both the first-channel first background audio data and the second-channel first background audio data.
In some embodiments, the screen includes a middle region and non-middle regions. Determining the vocal weight corresponding to each of the multiple audio output interfaces according to the lip movement coordinates and the coordinates of the multiple audio output interfaces of the display device includes: if the lip movement coordinates are located in a non-middle region, determining the vocal weight corresponding to each of the multiple audio output interfaces according to the lip movement coordinates and the coordinates of the multiple audio output interfaces; and if the lip movement coordinates are located in the middle region, determining the vocal weight corresponding to each of the multiple audio output interfaces according to the coordinates of the multiple audio output interfaces and attribute information of the multiple audio output interfaces, where the attribute information includes volume level and/or orientation.
In some embodiments, for each audio output interface, the screen region corresponding to the audio output interface is determined according to the coordinates of the audio output interface; if the lip movement coordinates lie within the region corresponding to the audio output interface, the vocal weight corresponding to the audio output interface is determined to be a first value; if the lip movement coordinates lie outside the region corresponding to the audio output interface, the vocal weight corresponding to the audio output interface is determined to be a second value, the second value being smaller than the first value.
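As an illustration only, the small sketch below implements this region rule. The 0.8/0.2 weights and the rectangular region representation are assumptions; the embodiment only requires the first value to be larger than the second value.

```python
def region_weight(lip_xy, speaker_region,
                  in_region_weight: float = 0.8,
                  out_region_weight: float = 0.2) -> float:
    """Return the vocal weight for one speaker. speaker_region is the screen
    rectangle (x0, y0, x1, y1) associated with that audio output interface."""
    x0, y0, x1, y1 = speaker_region
    inside = x0 <= lip_xy[0] <= x1 and y0 <= lip_xy[1] <= y1
    return in_region_weight if inside else out_region_weight
```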
In some embodiments, the display device corresponds to multiple preset sound clarity control modes and/or multiple preset sound effect modes; each preset sound clarity control mode has a corresponding value, and each preset sound effect mode has a corresponding value. The sound control mode includes a target sound clarity control mode and/or a target sound effect mode, where the target sound clarity control mode is one of the multiple preset sound clarity control modes, and the target sound effect mode is one of the multiple preset sound effect modes. The above audio processing method further includes: determining the first gain and the second gain according to a first value corresponding to the target sound clarity control mode and/or a second value corresponding to the target sound effect mode.
Some embodiments of the present application further provide an audio processing method that, through vocal separation, enables singing without being limited by the available media resources. Meanwhile, all or part of the original vocal can be added to the accompaniment according to the energy of the singing vocal captured by the microphone, so that a singer's limited skill does not spoil the singing experience.
Referring to Fig. 21, Fig. 21 is another flowchart of an audio processing method in some embodiments of the present application, applied to a display device, which may include the following steps:
Step S2110: acquiring song audio data, and performing vocal separation on the song audio data to obtain original vocal audio data and accompaniment audio data.
Step S2120: determining an original vocal gain according to the energy of the original vocal audio data within each time period and the energy of the singing vocal audio data captured within that time period, and performing gain processing on the original vocal audio data within the time period according to the original vocal gain, to obtain target vocal audio data.
Step S2130: combining the accompaniment audio data, the target vocal audio data, and the singing vocal audio data within each time period and performing sound effect enhancement processing, to obtain and output target audio data.
In the audio processing method of some embodiments of the present application, for song audio data, original vocal audio data and accompaniment audio data can be obtained through vocal separation. In this way, any song can be sung with this method, even songs that are not included in a karaoke app. Moreover, the original vocal gain is determined according to the energy of the singing vocal audio data captured in real time and the energy of the original vocal audio data, and gain processing is performed on the original vocal audio data according to this gain to obtain the target vocal audio data. Because the original vocal gain is determined from the energies of the singing vocal audio data and the original vocal audio data, merging the target vocal audio data into the accompaniment audio data means merging the original vocal audio data into the accompaniment according to how the user is singing, for example merging all of the original vocal audio data into the accompaniment, or merging only part of it. This improves the accompaniment while the user is singing and improves the user experience.
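For illustration only, a minimal per-period gain rule is sketched below. The energy threshold and the ratio formula are assumptions; the embodiments only require that a silent singer yields the full preset gain and that a louder singer yields a gain strictly below that preset threshold, derived from the energy ratio.

```python
import numpy as np

def original_vocal_gain(sung: np.ndarray, original: np.ndarray,
                        energy_threshold: float = 1e-4,
                        max_gain: float = 1.0) -> float:
    """Per-period gain for the separated original vocal (step S2120)."""
    e_sung = float(np.mean(sung ** 2))
    e_orig = float(np.mean(original ** 2)) + 1e-12
    if e_sung < energy_threshold:
        return max_gain                       # singer silent -> keep the original vocal
    ratio = e_sung / e_orig
    return min(max_gain / (1.0 + ratio), max_gain * 0.99)  # strictly below the cap
```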
In some embodiments, the original vocal gain is less than or equal to a preset gain threshold.
In some embodiments, determining the original vocal gain according to the energy of the original vocal audio data within each time period and the energy of the singing vocal audio data captured within the time period includes: if the energy of the singing vocal audio data is less than a preset energy threshold, setting the original vocal gain to the preset gain threshold; and if the energy of the singing vocal audio data is greater than or equal to the preset energy threshold, determining the original vocal gain according to the energy ratio between the singing vocal audio data and the original vocal audio data, such that the original vocal gain is less than the preset gain threshold.
In some embodiments, the above audio processing method further includes: acquiring the original vocal gain corresponding to the previous time period; if the original vocal gain corresponding to the current time period is the same as that of the previous time period, lengthening the time period, provided the lengthened time period remains less than a first time threshold; and if the original vocal gain corresponding to the current time period differs from that of the previous time period, shortening the time period, provided the shortened time period remains greater than a second time threshold, where the first time threshold is greater than the second time threshold.
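A small illustrative sketch of this adaptive analysis window follows; the growth/shrink factor and the bound values are assumptions, while the bounds themselves correspond to the first and second time thresholds of the embodiment.

```python
def adapt_period(period_s: float, gain_now: float, gain_prev: float,
                 max_period_s: float = 1.0,   # first time threshold
                 min_period_s: float = 0.1,   # second time threshold
                 step: float = 1.5) -> float:
    """Lengthen the period when the gain is stable, shorten it when it changes."""
    if gain_now == gain_prev:
        return min(period_s * step, max_period_s)
    return max(period_s / step, min_period_s)
```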
In some embodiments, the above audio processing method further includes: generating first backing vocal audio data according to the original vocal audio data within each time period; determining a backing vocal gain according to the energy of the singing vocal audio data captured within the time period, where the backing vocal gain is positively correlated with the energy of the singing vocal audio data captured within the time period; and performing gain processing on the first backing vocal audio data with the backing vocal gain to obtain second backing vocal audio data, where the energy of the second backing vocal audio data is less than the energy of the singing vocal audio data. Combining the accompaniment audio data, the target vocal audio data, and the singing vocal audio data within the time period and performing sound effect enhancement processing to obtain the target audio data specifically includes: combining the accompaniment audio data, the second backing vocal audio data, the target vocal audio data, and the singing vocal audio data within the time period, and performing sound effect enhancement processing, to obtain the target audio data.
In some embodiments, generating the first backing vocal audio data according to the original vocal audio data within each time period includes: acquiring multiple different delays and the gain corresponding to each delay; for each delay, performing delay processing on the original vocal audio data within the time period according to the delay to obtain first delayed audio data, and performing gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data; and combining the multiple pieces of second delayed audio data, to obtain the first backing vocal audio data.
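For illustration only, the sketch below sums several delayed, attenuated copies of the original vocal, which yields a simple echo/chorus-like backing layer. The specific delays and gains are illustrative; the embodiment only requires one gain per delay.

```python
import numpy as np

def make_backing_vocal(original_vocal: np.ndarray, sr: int,
                       delays_s=(0.06, 0.12, 0.18),
                       gains=(0.5, 0.35, 0.2)) -> np.ndarray:
    """Build the first backing vocal audio data from delayed, gained copies."""
    out = np.zeros(len(original_vocal) + int(max(delays_s) * sr))
    for delay_s, gain in zip(delays_s, gains):
        offset = int(delay_s * sr)
        out[offset:offset + len(original_vocal)] += gain * original_vocal
    return out[: len(original_vocal)]     # keep the time-period length unchanged
```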
In some embodiments, generating the first backing vocal audio data according to the original vocal audio data within each time period includes: determining the register to which the original vocal audio data belongs; and performing pitch-up processing or pitch-down processing on the original vocal audio data according to the register, to obtain the first backing vocal audio data.
In some embodiments, performing pitch-up processing or pitch-down processing on the original vocal audio data according to the register includes:
if the register is a low register, performing pitch-down processing on the original vocal audio data to obtain the first backing vocal audio data; if the register is a high register, performing pitch-up processing on the original vocal audio data to obtain the first backing vocal audio data; and if the register is a middle register, performing both pitch-up processing and pitch-down processing on the original vocal audio data to obtain first vocal audio data and second vocal audio data respectively, and using the first vocal audio data and the second vocal audio data as the first backing vocal audio data.
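Purely as an illustration, the sketch below realizes the register rule with librosa's pitch shifter. The use of librosa and the +/- 4 semitone amounts are implementation choices and not taken from the embodiments; for the middle register the two harmony lines are simply mixed into one backing track here.

```python
import librosa
import numpy as np

def make_harmony(original_vocal: np.ndarray, sr: int, register: str) -> np.ndarray:
    """Build the first backing vocal audio data by pitch shifting per register."""
    if register == "low":
        return librosa.effects.pitch_shift(original_vocal, sr=sr, n_steps=-4)
    if register == "high":
        return librosa.effects.pitch_shift(original_vocal, sr=sr, n_steps=+4)
    # middle register: one upward and one downward harmony line, mixed together
    up = librosa.effects.pitch_shift(original_vocal, sr=sr, n_steps=+4)
    down = librosa.effects.pitch_shift(original_vocal, sr=sr, n_steps=-4)
    return up + down
```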
Some embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above audio processing method is implemented.
For convenience of explanation, the above description has been given with reference to specific embodiments. However, the discussion above is not intended to be exhaustive or to limit the implementations to the specific forms disclosed. Many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to better explain the principles and practical applications, so that those skilled in the art can make better use of the embodiments and of the various modified embodiments suited to the particular use contemplated.

Claims (12)

  1. A display device, comprising:
    a controller configured to: perform sound separation on acquired first audio data to obtain first target audio data and first background audio data;
    perform gain processing on the first target audio data according to a first gain to obtain second target audio data;
    perform gain processing on the first background audio data according to a second gain to obtain second background audio data, wherein the first gain and the second gain are determined according to a sound control mode corresponding to the display device;
    combine the second target audio data and the second background audio data and perform sound effect enhancement processing to obtain second audio data; and
    an audio output interface configured to: output the second audio data.
  2. The display device according to claim 1, wherein the controller is configured to: determine, according to the sound control mode, a type of sound effect enhancement mode corresponding to the first audio data; and
    determine, according to the sound control mode, the first gain and the second gain corresponding to the type of the sound effect enhancement mode.
  3. The display device according to claim 2, wherein the controller is configured such that: if the type of the sound effect enhancement mode corresponding to the first audio data is a sound enhancement mode, the first gain is greater than the second gain; and
    if the type of the sound effect enhancement mode corresponding to the first audio data is a background enhancement mode, the first gain is less than the second gain.
  4. The display device according to claim 1, wherein the first audio data includes at least one type of third target audio data belonging to a preset sound type;
    the controller is further configured to: separate the at least one type of third target audio data and third background audio data from the first audio data;
    obtain a first energy value of first-channel initial target audio data and a second energy value of second-channel initial target audio data of a single piece of the third target audio data;
    perform gain processing on the first-channel initial target audio data according to a third gain to obtain first-channel first gain audio data, and perform gain processing on the second-channel initial target audio data according to a fourth gain to obtain second-channel first gain audio data, wherein the third gain and the fourth gain are determined according to the first energy value and the second energy value;
    combine the first-channel first gain audio data with first-channel initial background audio data of the third background audio data and perform sound effect enhancement processing to obtain first-channel first enhanced audio data;
    combine the second-channel first gain audio data with second-channel initial background audio data of the third background audio data and perform sound effect enhancement processing to obtain second-channel first enhanced audio data;
    the audio output interface includes: a first output interface and a second output interface;
    the first output interface is configured to: output the first-channel first enhanced audio data; and
    the second output interface is configured to: output the second-channel first enhanced audio data.
  5. The display device according to claim 4, wherein the controller is further configured to: determine, according to the sound control mode, the first energy value, and the second energy value, a fifth gain and a sixth gain corresponding to a single piece of the third target audio data;
    determine a seventh gain according to the sound control mode;
    perform gain processing on the first-channel initial target audio data according to the fifth gain to obtain first-channel second gain audio data, and perform gain processing on the second-channel initial target audio data according to the sixth gain to obtain second-channel second gain audio data;
    perform gain processing on the first-channel initial background audio data and the second-channel initial background audio data respectively according to the seventh gain, to obtain first-channel gain background audio data and second-channel gain background audio data;
    combine the first-channel second gain audio data with the first-channel gain background audio data and perform sound effect enhancement processing to obtain first-channel second enhanced audio data;
    combine the second-channel second gain audio data with the second-channel gain background audio data and perform sound effect enhancement processing to obtain second-channel second enhanced audio data;
    the audio output interface includes: a first output interface and a second output interface;
    the first output interface is configured to: output the first-channel second enhanced audio data; and
    the second output interface is configured to: output the second-channel second enhanced audio data.
  6. The display device according to claim 5, wherein the controller is configured to: determine, according to the sound control mode, a type of sound effect enhancement mode corresponding to the first audio data;
    determine a magnitude relationship between the left-channel and right-channel energies according to the first energy value of the first-channel initial target audio data and the second energy value of the second-channel initial target audio data;
    determine, according to the sound control mode, the first energy value, and the second energy value, the fifth gain and the sixth gain corresponding to the type of the sound effect enhancement mode and to the left/right channel energy relationship; and
    determine, according to the sound control mode, the seventh gain corresponding to the type of the sound effect enhancement mode and to the left/right channel energy relationship.
  7. An audio processing method, applied to a display device, the method comprising:
    performing sound separation on acquired first audio data to obtain first target audio data and first background audio data;
    performing gain processing on the first target audio data according to a first gain to obtain second target audio data;
    performing gain processing on the first background audio data according to a second gain to obtain second background audio data, wherein the first gain and the second gain are determined according to a sound control mode corresponding to the display device; and
    combining the second target audio data and the second background audio data and performing sound effect enhancement processing, to obtain and output second audio data.
  8. The method according to claim 7, further comprising:
    determining, according to the sound control mode, a type of sound effect enhancement mode corresponding to the first audio data; and
    determining, according to the sound control mode, the first gain and the second gain corresponding to the type of the sound effect enhancement mode.
  9. The method according to claim 8, wherein if the type of the sound effect enhancement mode corresponding to the first audio data is a sound enhancement mode, the first gain is greater than the second gain; and
    if the type of the sound effect enhancement mode corresponding to the first audio data is a background enhancement mode, the first gain is less than the second gain.
  10. The method according to claim 7, wherein the first audio data includes at least one type of third target audio data belonging to a preset sound type;
    the method further comprising:
    separating the at least one type of third target audio data and third background audio data from the first audio data;
    obtaining a first energy value of first-channel initial target audio data and a second energy value of second-channel initial target audio data of a single piece of the third target audio data;
    performing gain processing on the first-channel initial target audio data according to a third gain to obtain first-channel first gain audio data, and performing gain processing on the second-channel initial target audio data according to a fourth gain to obtain second-channel first gain audio data, wherein the third gain and the fourth gain are determined according to the first energy value and the second energy value;
    combining the first-channel first gain audio data with first-channel initial background audio data of the third background audio data and performing sound effect enhancement processing, to obtain and output first-channel first enhanced audio data; and
    combining the second-channel first gain audio data with second-channel initial background audio data of the third background audio data and performing sound effect enhancement processing, to obtain and output second-channel first enhanced audio data.
  11. The audio processing method according to claim 7, further comprising:
    determining, according to the sound control mode, the first energy value, and the second energy value, a fifth gain and a sixth gain corresponding to a single piece of the third target audio data;
    determining a seventh gain according to the sound control mode;
    performing gain processing on the first-channel initial target audio data according to the fifth gain to obtain first-channel second gain audio data, and performing gain processing on the second-channel initial target audio data according to the sixth gain to obtain second-channel second gain audio data;
    performing gain processing on the first-channel initial background audio data and the second-channel initial background audio data respectively according to the seventh gain, to obtain first-channel gain background audio data and second-channel gain background audio data;
    combining the first-channel second gain audio data with the first-channel gain background audio data and performing sound effect enhancement processing, to obtain and output first-channel second enhanced audio data; and
    combining the second-channel second gain audio data with the second-channel gain background audio data and performing sound effect enhancement processing, to obtain and output second-channel second enhanced audio data.
  12. The audio processing method according to claim 11, wherein determining, according to the sound control mode, the first energy value, and the second energy value, the fifth gain and the sixth gain corresponding to a single piece of the third target audio data comprises:
    determining, according to the sound control mode, a type of sound effect enhancement mode corresponding to the first audio data;
    determining a magnitude relationship between the left-channel and right-channel energies according to the first energy value of the first-channel initial target audio data and the second energy value of the second-channel initial target audio data; and
    determining, according to the sound control mode, the first energy value, and the second energy value, the fifth gain and the sixth gain corresponding to the type of the sound effect enhancement mode and to the left/right channel energy relationship;
    wherein determining the seventh gain according to the sound control mode comprises:
    determining, according to the sound control mode, the seventh gain corresponding to the type of the sound effect enhancement mode and to the left/right channel energy relationship.
PCT/CN2022/101859 2022-01-27 2022-06-28 Display device and audio processing method WO2023142363A1 (zh)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN202210102847.5A CN114615534A (zh) Display device and audio processing method
CN202210102840.3A CN114466241A (zh) Display device and audio processing method
CN202210102896.9 2022-01-27
CN202210102847.5 2022-01-27
CN202210102840.3 2022-01-27
CN202210102852.6A CN114598917B (zh) Display device and audio processing method
CN202210102896.9A CN114466242A (zh) Display device and audio processing method
CN202210102852.6 2022-01-27

Publications (1)

Publication Number Publication Date
WO2023142363A1 true WO2023142363A1 (zh) 2023-08-03

Family

ID=87470293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101859 WO2023142363A1 (zh) Display device and audio processing method

Country Status (1)

Country Link
WO (1) WO2023142363A1 (zh)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002261553A * 2001-03-02 2002-09-13 Ricoh Co Ltd Automatic voice gain control device, automatic voice gain control method, storage medium storing a computer program with an automatic voice gain control algorithm, and computer program with an automatic voice gain control algorithm
CN1980054A * 2005-11-30 2007-06-13 鸿富锦精密工业(深圳)有限公司 Audio processing device and volume management method thereof
CN104200810A * 2014-08-29 2014-12-10 无锡中星微电子有限公司 Automatic gain control device and method
WO2015097829A1 * 2013-12-26 2015-07-02 株式会社東芝 Method, electronic device, and program
CN108347688A * 2017-01-25 2018-07-31 晨星半导体股份有限公司 Audio-visual processing method and audio-visual processing device for providing a stereo effect from mono audio data
CN108449502A * 2018-03-12 2018-08-24 广东欧珀移动通信有限公司 Voice call data processing method and device, storage medium, and mobile terminal
CN109360577A * 2018-10-16 2019-02-19 广州酷狗计算机科技有限公司 Method, device, and storage medium for processing audio
CN114466242A * 2022-01-27 2022-05-10 海信视像科技股份有限公司 Display device and audio processing method
CN114466241A * 2022-01-27 2022-05-10 海信视像科技股份有限公司 Display device and audio processing method
CN114598917A * 2022-01-27 2022-06-07 海信视像科技股份有限公司 Display device and audio processing method
CN114615534A * 2022-01-27 2022-06-10 海信视像科技股份有限公司 Display device and audio processing method

Similar Documents

Publication Publication Date Title
US11749243B2 (en) Network-based processing and distribution of multimedia content of a live musical performance
KR101958664B1 (ko) Apparatus and method for providing various audio environments in a multimedia content playback system
EP3108672B1 (en) Content-aware audio modes
US20140105411A1 (en) Methods and systems for karaoke on a mobile device
CN114615534A (zh) Display device and audio processing method
CN118175376A (zh) Display device and audio processing method
JP2010136173A (ja) Volume correction device, volume correction method, volume correction program, and electronic apparatus
JP5577787B2 (ja) Signal processing device
WO2018017878A1 (en) Network-based processing and distribution of multimedia content of a live musical performance
CN114598917B (zh) Display device and audio processing method
US20130108079A1 (en) Audio signal processing device, method, program, and recording medium
CN114466241A (zh) Display device and audio processing method
WO2018066383A1 (ja) Information processing device and method, and program
WO2023142363A1 (zh) Display device and audio processing method
JP2006254187A (ja) Sound field determination method and sound field determination device
Riionheimo et al. Movie sound, Part 1: Perceptual differences of six listening environments
JP5316560B2 (ja) Volume correction device, volume correction method, and volume correction program
JP2012093519A (ja) Karaoke system
JPWO2007004397A1 (ja) Acoustic signal processing device, acoustic signal processing method, acoustic signal processing program, and computer-readable recording medium
Geluso Mixing and Mastering
Woszczyk et al. Creating mixtures: The application of auditory scene analysis (ASA) to audio recording
JP2015099266A (ja) Signal processing device, signal processing method, and program
CN116847272A (zh) Audio processing method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923172

Country of ref document: EP

Kind code of ref document: A1