CN109643555B

CN109643555B - Automatic correction of loudness level in an audio signal containing a speech signal

Info

Publication number: CN109643555B
Application number: CN201680086918.XA
Authority: CN
Inventors: T.明奇; A.亨斯根斯
Original assignee: Harman Becker Automotive Systems GmbH
Current assignee: Harman Becker Automotive Systems GmbH
Priority date: 2016-07-04
Filing date: 2016-07-04
Publication date: 2024-01-30
Anticipated expiration: 2036-07-04
Also published as: EP3479378B1; CN109643555A; JP2019525213A; KR20190025816A; KR102622459B1; WO2018006927A1; JP6902049B2; US20190362735A1; US10861481B2; EP3479378A1

Abstract

The invention relates to a method for adjusting the gain of an N-channel audio input signal to generate an N-channel audio output signal. The N-channel audio input signal comprises a speech input channel (21) and further audio input channels (20); if a speech signal component is present in the N-channel audio input signal, it is present in the speech input channel. The perceived loudness of the N-channel audio input signal is dynamically determined, and a determination is made as to whether a speech signal component is present in the speech input channel (21). If this is the case, the gain of the speech input channel is adjusted differently from the gain of the other audio input channels.

Description

Automatic correction of loudness level in an audio signal containing a speech signal

Technical Field

The present application relates to a method for adjusting the gain of an N-channel audio input signal comprising at least two different audio tracks having different signal level ranges and comprising possible speech signal components. Furthermore, a corresponding system is provided for this.

Background

Many different sources of audio signals containing music and/or speech are known in the art. The music signal may be stored on a CD, DVD or any other storage medium. In particular, with the development of new compression schemes such as MPEG, audio signals of different styles and performers are stored on storage media and can be combined into a playlist to be played to a user. Particularly in a vehicle environment, the sound perceived by the passengers contains the audio signal itself as well as road and tire noise, aerodynamic noise and engine noise. Audio signals from different audio sources often have different signal levels and dynamic compression levels. Different audio tracks of an audio output signal therefore often have different signal level ranges, which the user perceives as different loudness levels. Especially in a vehicle environment, the received audio signal should be perceptible to the user, meaning that it must exceed the noise present in the vehicle. At the same time, the total audio signal level should not exceed a level beyond which hearing damage or pain is caused to the user.

When playing movies with multi-channel audio in a vehicular environment, the center channel plays speech and dialog. However, the perceived loudness of speech material is often insufficient, such that the user cannot correctly perceive the dialog.

Disclosure of Invention

Thus, there is a need to allow dynamic automatic correction of the loudness level in an audio signal while maintaining good perception of speech signals present in the audio signal, especially in noisy environments.

This need is met by the features of the independent claims. Preferred embodiments of the invention are described in the dependent claims.

According to a first aspect, there is provided a method for adjusting the gain of an N-channel audio input signal to generate an N-channel audio output signal, wherein the N-channel audio input signal comprises a speech input channel, and if a speech signal component is present in the N-channel audio input signal, the speech signal component is present in the speech input channel. The N-channel audio input signal also includes other audio input channels. According to one step of the method, the perceived loudness of the N-channel audio input signal is dynamically determined. Further, it is determined whether a speech signal component is present in the speech input channel. If a speech signal component is present in the speech input channel, the gain of the other audio input channels is dynamically adjusted in a first gain control unit with a first gain parameter based on the determined perceived loudness of the N-channel audio input signal, such that at least two consecutive tracks of the other audio output channels output from the first gain control unit are limited to a predefined signal level range or a predefined loudness range. Based on the determined loudness of the N-channel audio input signal, the gain of the speech input channel is dynamically adjusted in a second gain control unit with a second gain parameter, such that at least two consecutive tracks of the speech output channel output from the second gain control unit are limited to a predefined signal level range or loudness range. Here, the second gain parameter differs from the first gain parameter.

Furthermore, a corresponding system is provided, which is configured to adjust the gain of the N-channel audio input signal. The system includes a loudness determination unit configured to determine a perceived loudness of the N-channel audio input signal. Furthermore, a speech detection unit is provided, which is configured to determine whether a speech signal component is present in the speech input channel. A first gain control unit is provided and configured to control the gain of the other audio input channels, and a second gain control unit is provided and configured to control the gain of the speech input channel. If a speech signal component is present in the speech input channel, the first gain control unit dynamically adjusts the gain of the other audio input channels with a first gain parameter based on the determined perceived loudness of the N-channel audio input signal, such that at least two consecutive tracks of the other audio output channels output from the first gain control unit are limited to a predefined signal level range or a predefined loudness range. The second gain control unit dynamically adjusts the gain of the speech input channel with a second gain parameter based on the determined loudness of the N-channel audio input signal, such that at least two consecutive tracks of the speech output channel output from the second gain control unit are limited to a predefined signal level range or loudness range. The first gain control unit and the second gain control unit determine the first gain parameter and the second gain parameter such that the two gain parameters differ.

The gain of the speech input channel may be increased by a higher amount than the gain of the other audio input channels in order to improve the intelligibility of the speech components. For example, the first gain parameter and the second gain parameter may be determined such that a ratio of a signal level of the speech input signal to a signal level of the speech output signal is smaller than a ratio of a signal level of the other audio input channel to a signal level of the other audio output channel. In other words, this means that a higher gain is applied to the speech input channel than to the other audio input channels.

A further example is that the first gain parameter and the second gain parameter are determined such that the signal level of the speech input signal is increased by a higher amount by the second gain parameter than the signal level of the other audio input channels increased by the first gain parameter.

If the signal level of the N-channel audio input signal is reduced in order to keep the signal level in a predefined signal level range, the first gain parameter and the second gain parameter may be determined such that the signal level of the speech input signal is reduced by the second gain parameter by a smaller amount than the signal levels of the other audio input channels are reduced by the first gain parameter.

It will be understood that the features mentioned above or yet to be explained below can be used not only in the respective combination indicated, but also in other combinations or independently of one another, without departing from the scope of the present application. Features of the above-mentioned aspects and embodiments may be combined with each other in other embodiments, unless explicitly mentioned otherwise.

Drawings

The foregoing and additional features and effects of the present application will become apparent from the following detailed description when read in conjunction with the accompanying drawings, in which like reference numerals refer to like elements.

Fig. 1 schematically shows a system for adjusting the gain of an N-channel audio input signal.

Fig. 2 shows a more detailed view of an audio analysis unit for determining the loudness of an audio input signal and detecting speech signal components in the speech input channels of an N-channel audio input signal.

Fig. 3 shows an example of an estimated loudness of an audio input signal with no gain adjustment, including different time constants to smooth the loudness, i.e. a fast reaction to an increased loudness and a delayed reaction at a reduced loudness level.

Fig. 4 shows the dynamic level adjustment of the audio input signal of fig. 3, as should be done for automatic loudness adjustment, ideally corrected when the entire signal content is known.

Fig. 5 schematically shows how a speech signal component is detected in a speech detection unit for use in the audio analysis unit of fig. 2.

Fig. 6 schematically shows the introduction of time constants into an audio signal, representing the gain change from one block of an N-channel audio input signal to another.

Fig. 7 shows the signal level of an N-channel audio input signal before and after automatic loudness adjustment, wherein the signal level is reduced so as to stay within a defined signal level range.

Fig. 8 shows another example of signal levels of an N-channel audio input signal before and after automatic loudness adjustment, where the signal levels increase.

Fig. 9 shows a schematic representation of a system in which speech signal components are adjusted differently from other signal components.

Detailed Description

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. It will be appreciated that the following description of the embodiments is not to be taken in a limiting sense. The scope of the invention is not intended to be limited by the embodiments described below or by the drawings, which are to be understood as illustrative only.

The figures are to be regarded as representations and the elements illustrated in the figures are not necessarily drawn to scale. Rather, the various elements are represented so that their function and general use will become apparent to those skilled in the art. Any connection or coupling between functional blocks, devices, components, or physical functional units shown in the figures or described herein may be implemented by indirect connection or coupling. The coupling between the components may be established by a wired or wireless connection. Furthermore, the functional blocks may be implemented in hardware, software, firmware, or a combination thereof.

In fig. 1, a system is shown with which the loudness of an N-channel audio input signal can be adjusted. The N-channel audio input signal may be a 5.1 or 7.1 audio signal and may be stored on a CD or DVD or any other storage unit, such as a hard disk. The N-channel audio input signal comprises a speech input channel 21, which carries the speech signal components if any are present in the input signal. In a 5.1 or 7.1 audio signal, the speech input channel may be the center channel. In addition, the N-channel audio input signal includes other audio input channels 20.

The system shown comprises an audio signal analysis unit 30 in which, inter alia, a psychoacoustic localization model of human hearing and signal statistics are used to determine the loudness of the N-channel audio input signal.

In the signal analysis unit 30, the loudness is determined based on a psychoacoustic model of human hearing and based on signal statistics. As described in further detail below, the psycho-acoustic model is used to estimate loudness for locating sound, and to determine whether noise is present in the audio input signal (as a dominant factor), e.g., during pauses or between two tracks. Signal statistics are the second basis for determining or estimating loudness and for determining whether there is a pause with noise in the audio signal. By way of example, the signal strength of the entertainment audio signal may be determined. The loudness adjustment is determined by dynamically determining an adaptive time constant based on a psychoacoustic model (alone or in combination with a statistical signal model), as will be described in further detail below.

In fig. 2, a more detailed view of the audio signal analysis unit 30 is shown.

In the audio signal analysis unit 30, the N-channel audio input signal may be subjected to down-mixing in the down-mixing unit 36. In this example, down-mixing means deciding in the down-mixing unit whether the different channels of the N-channel audio input signal are analyzed separately in the signal analysis unit 30 or whether certain groups of audio signals are formed. For example, the front signal channels of a 5.1 surround signal, or the front signal channels together with the center channel, may be grouped into one group, while the rear or surround channels are grouped into another group. Thus, the down-mixing unit determines which input channels of the audio input signal are processed together or whether all channels are processed separately. Further, the speech input channel 21 is fed to a speech detection unit 37, in which it is detected whether or not a speech component is present in the speech input channel. If speech signal components (such as dialog) are present in the N-channel audio input signal, these are present in the speech input channel. The other audio input channels 20 do not include speech signal components. The speech detection unit is explained in further detail later with reference to fig. 5.

Further, the audio signal analysis unit comprises a loudness determination unit 31 that estimates the loudness of the received audio input signal. The loudness determination unit 31 may determine loudness using methods known in the art and as described, inter alia, in ITU-R BS.1770-1. For further details on the determination of localization and loudness of an N-channel audio input signal, reference is also made to Wolfgang Hess et al., "Acoustical Evaluation of Virtual Rooms by Means of Binaural Activity Patterns", Audio Engineering Society Convention Paper 5864, 115th Convention (October 2003); to W. Lindemann, "Extension of a Binaural Cross-Correlation Model by Contralateral Inhibition. I. Simulation of Lateralization for Stationary Signals", Journal of the Acoustical Society of America, volume 80 (6), December 1986, pages 1608 to 1622; and to ITU-R BS.1770-1. However, it should be mentioned that any other method known in the art for determining the loudness of an audio signal may be used.

Furthermore, the loudness determination unit 31 may use a binaural model of human hearing for determining the loudness perceived when hearing the input signals 20 and 21 and for determining whether and where the audio input signals may be localized by the user. This binaural model simulates the spatial perception of the audio input signal and allows determining whether the audio input signal mainly contains noise or some other signal such as music or speech. The localization of the audio input signal is described in more detail in EP 1 522 868 A1, in the document of W. Lindemann and in Audio Engineering Society Convention Paper 5864, both mentioned above. This localization technique allows distinguishing between noise and other sound signals and helps to avoid that noise is output with increased gain in case only noise is detected in the audio input signal. It also allows the adaptive time constant generated by the time constant generating unit 32 to be reset when a pause is detected. The loudness determination unit 31 uses a psychoacoustic model of human hearing to estimate the loudness of the audio input signal. The detection of pauses between two consecutive tracks is schematically shown by a pause detection unit 33.

Furthermore, the loudness determination unit 31 may additionally use statistical signal processing in order to estimate the loudness of the audio input signal or to detect signal pauses. In a statistical analysis of an audio input signal, the actual signal levels of different samples of the audio input signal are determined. For example, if the signal levels of several consecutive samples of the input signal follow a Gaussian distribution, it can be inferred that the processed samples contain noise and no other audio signal.
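By way of illustration only, the statistical check described above might be sketched as follows in Python. The application does not say how correspondence to a Gaussian distribution is tested; using the excess kurtosis as the test statistic, the tolerance value and all function names are assumptions made for this sketch.

```python
import math
import random


def excess_kurtosis(samples):
    """Return the excess kurtosis of a block (0 for an ideal Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0.0:
        # Constant block (e.g. digital silence): reported as 0 here,
        # which is a simplifying assumption of this sketch.
        return 0.0
    fourth = sum((x - mean) ** 4 for x in samples) / n
    return fourth / (var * var) - 3.0


def looks_like_noise(samples, tolerance=0.5):
    """Hypothetical test: treat a block as noise if it is near-Gaussian."""
    return abs(excess_kurtosis(samples)) < tolerance


# Synthetic example blocks: Gaussian noise and a pure 440 Hz tone at 8 kHz.
rng = random.Random(0)
noise_block = [rng.gauss(0.0, 0.1) for _ in range(8000)]
tone_block = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
```

A sine tone has an excess kurtosis of about -1.5 and is therefore not classified as noise here, whereas the Gaussian block is. A practical detector would likely need a more robust test than this single statistic.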

The result of the loudness estimation is then used by the audio signal analysis unit for calculating the time constants that are introduced into the audio input signals 20 and 21. In fig. 2, the calculation of the time constant is symbolized by the time constant generator 32. The time constant facilitates adjustment of the gain as described in detail in connection with fig. 6.

The audio signal analysis unit 30 further comprises a gain determination unit 35 that adjusts the gains of the speech input channel 21 and the other audio input channels 20. The loudness determination unit 31 provides the loudness of a certain part of the audio input signal (e.g. a block comprising several samples) by outputting a dB loudness equivalent (dBLEQ). The gain determination unit 35 has a predefined target signal level that should be met when the audio signal is output (e.g. -12 dB as shown in the lower parts of figs. 7 and 8), or any other signal level threshold. In the gain determination unit 35, the determined loudness is subtracted from the average signal level to be obtained in order to calculate the gain. For example, if the determined loudness corresponds to -5 dB and the target is -12 dB full scale, the gain must be reduced accordingly so as to obtain an average signal level of about -12 dB. The gain determination unit determines a first gain parameter for the other audio input channels 20 and a second gain parameter for the speech input channel 21. The gain determination unit calculates a time constant for adjusting the gain, as will be explained in connection with fig. 6.
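The subtraction described above can be written out as a short worked example. This is a minimal sketch: the function names are hypothetical, and the conversion to a linear factor assumes the usual amplitude convention of 20 dB per decade, which the application does not state.

```python
def gain_db(measured_loudness_db, target_level_db=-12.0):
    """Gain in dB: target level minus the determined loudness."""
    return target_level_db - measured_loudness_db


def db_to_linear(gain):
    """Convert a gain in dB into a linear amplitude factor (assumed convention)."""
    return 10.0 ** (gain / 20.0)


# Example from the text: loudness of -5 dB, target of -12 dB full scale.
example_gain = gain_db(-5.0)  # -7.0, i.e. the gain is reduced by 7 dB
example_factor = db_to_linear(example_gain)
```

A block measured at -20 dB would, under the same rule, receive a gain of +8 dB.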

The gain determination unit is configured such that it adjusts the gains of the speech input channels and other audio input channels in such a way that the user can better perceive the dialog present in the speech input channels.

For example, when the overall signal level increases, the signal level of the speech input signal increased by the second gain parameter may be increased by a higher amount than the signal levels of the other audio input channels increased by the first gain parameter. In other words, this means: the first gain parameter and the second gain parameter are determined such that a ratio of a signal level of the speech input signal to a signal level of the speech output signal is smaller than a ratio of a signal level of the other audio input channel to a signal level of the other audio output channel.

However, when the total signal level of the audio signal should be reduced in order to keep the signal within a certain range, the first gain parameter and the second gain parameter may be determined such that the signal level of the speech input signal is reduced by a smaller amount by the second gain parameter than the signal levels of the other audio input channels reduced by the first gain parameter.
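The two cases above (overall increase and overall reduction) can be illustrated with a small sketch. The application only requires that the two gain parameters differ when speech is detected; modelling the difference as a fixed offset in dB, the offset value of 3 dB and the function name are assumptions for illustration.

```python
def channel_gains_db(base_gain_db, speech_detected, speech_offset_db=3.0):
    """Return (first_gain_db, second_gain_db).

    first_gain_db applies to the other audio input channels,
    second_gain_db to the speech input channel. The fixed offset
    is a hypothetical way of making the two parameters differ.
    """
    first_gain_db = base_gain_db
    if speech_detected:
        second_gain_db = base_gain_db + speech_offset_db
    else:
        second_gain_db = base_gain_db  # no speech: both adapted the same way
    return first_gain_db, second_gain_db


raised = channel_gains_db(8.0, True)     # (8.0, 11.0): speech raised more
lowered = channel_gains_db(-7.0, True)   # (-7.0, -4.0): speech lowered less
```

In both cases the speech channel ends up with the higher gain, so the ratio of input level to output level is smaller for the speech channel than for the other channels.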

In a vehicle environment, an occupant of the vehicle perceives different environmental noises depending on the vehicle used. The vehicle sound signal includes a noise component and an audio signal component. The noise signal component may be due to road and tire noise, aerodynamic noise, or engine noise. The noise may have a value between 60 and 85 dB SPL (sound pressure level). Since the auditory pain threshold is about 120 dB SPL, the range available for the audio signal component is about 20-40 dB SPL.

Referring back to fig. 1, the signal output 38 of the audio signal analysis unit for the speech input channel and the signal output 39 for the other audio input channels are input into the signal control unit 40. The signal output 38, which describes the gain adjustment in the form of a time constant, is fed to a gain control unit 44, while the signal output 39 is fed to a gain control unit 43. The other audio input channels 20 are input to the first delay element 41. The delay element introduces into the input signal 20 the delay that is required for determining the gain in the signal analysis unit and for detecting possible speech signal components. The delay element helps to ensure that the signal processed by the signal analysis unit 30 is actually controlled by the time constant that was determined for that same portion of the audio signal. In the same way, the speech input signal 21 is fed to a second delay unit 42, in which a corresponding delay is introduced into the speech input signal. In the embodiment shown, two different delay units 41 and 42 are provided; however, a single delay unit may be used, since the delays introduced into the signals 20 and 21 are preferably identical.
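The role of the delay elements 41 and 42 can be pictured with a simple sample delay line. This is only a sketch: the class name and the delay expressed in samples are assumptions, and the application does not specify how the delay is implemented.

```python
from collections import deque


class DelayLine:
    """Delay a sample stream by a fixed number of samples."""

    def __init__(self, delay_samples):
        # Pre-fill with silence so the first outputs are zeros.
        self._buffer = deque([0.0] * delay_samples)

    def process(self, sample):
        """Push one input sample and return the delayed output sample."""
        self._buffer.append(sample)
        return self._buffer.popleft()


# The same delay would be applied to signals 20 and 21 so that each
# block meets the gain that was computed for it.
delay = DelayLine(2)
delayed = [delay.process(x) for x in [1.0, 2.0, 3.0, 4.0]]
# delayed is [0.0, 0.0, 1.0, 2.0]
```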

Furthermore, the signal control unit 40 comprises a gain control unit 43 for the other audio input channels and a gain control unit 44 for the speech input channel 21. The gain control units 43, 44 help to determine how much of the gain determined by the gain determination unit 35 actually affects the signal output level of the other audio output channels 45 output from the gain control unit 43 or the signal output level of the speech output channel 46 output from the gain control unit 44. To this end, a user interface (not shown) may be provided in which the user may indicate what percentage of the gain correction determined by the audio signal analysis unit 30 is used for output. If 100% of the gain correction should be applied to the output (as present in the combined output signal 60), the value as determined by the gain determination unit 35 is taken over. However, the user may also not want any gain adjustment (e.g., in the case where he or she wants to maintain the loudness evolution in a piece of music). In this example, the user may set the gain adjustment in the gain control unit 43 to 0%, meaning that the correction as determined in unit 30 is not used for output. In the gain control unit 43, the amount of gain correction may be determined, for example, by setting a factor between 0% and 100%. If a factor of 0% is set, the gain is determined without the influence of the time constants.
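As a rough sketch of the percentage factor described above, the determined gain could be scaled before it is applied. Scaling the gain in the dB domain is an assumption of this example, as is the function name; the application only states that a factor between 0% and 100% is set.

```python
def applied_gain_db(determined_gain_db, factor_percent):
    """Scale the gain from the gain determination unit by a 0..100 % factor."""
    if not 0.0 <= factor_percent <= 100.0:
        raise ValueError("factor_percent must be between 0 and 100")
    return determined_gain_db * factor_percent / 100.0


full = applied_gain_db(-7.0, 100.0)  # correction taken over unchanged
half = applied_gain_db(-7.0, 50.0)   # half of the correction is applied
none = applied_gain_db(-7.0, 0.0)    # correction not used for output
```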

In addition to or instead of the user interface, a noise estimator 50 may be provided, which estimates the ambient noise in the vehicle cabin. As mentioned above, the vehicle speed strongly influences the noise in the vehicle cabin. If the vehicle is traveling at a very low speed or is stationary, it may be considered that no gain adjustment as determined by the gain determination unit is required. If the output signal 60 should not be affected at all by the gain determination unit (meaning that the correction as determined in unit 30 is not used for output), the gain control unit may set to 0% the factor by which the output signal is affected by the calculation performed in unit 30. The noise estimator 50 may receive the vehicle speed and may access a table 51 in which a relationship between the vehicle speed and noise is provided. This table may be a predefined table set by the vehicle manufacturer. In general, the driver should not be able to adjust the values given in table 51. However, the values given in the table may be changed, for example, by a software tool with which the sound settings can be adjusted. When the vehicle speed is higher, the ambient noise may reach 80 dB(A). In this example, if the threshold of 105 dB(A) should not be exceeded, only 25 dB(A) remains. In the case of 80 dB(A) of ambient noise, the loudness of the audio output signal may be dynamically determined by the gain determination unit as described above. The gain determination unit may determine a factor between 0% and 100% based on the ambient noise, this percentage describing how much of the loudness adjustment should be applied, as described above. In the illustrated embodiment, vehicle speed is the only variable that determines ambient noise. However, other inputs, such as the ambient noise measured by a microphone (not shown), may be used alone or in combination with vehicle speed.
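A minimal sketch of how the noise estimator 50 and table 51 might work is given below. All table values are hypothetical placeholders (the application leaves them to the vehicle manufacturer); only the 105 dB(A) threshold and the 80 dB(A) example are taken from the text. Linear interpolation between table entries and the linear mapping from noise to a percentage are likewise assumptions.

```python
import bisect

# Hypothetical table 51: vehicle speed in km/h -> cabin noise in dB(A).
SPEED_KMH = [0, 50, 100, 150]
NOISE_DBA = [60.0, 68.0, 75.0, 80.0]
MAX_LEVEL_DBA = 105.0  # threshold named in the text


def estimate_noise_dba(speed_kmh):
    """Look up the cabin noise, interpolating linearly and clamping at the ends."""
    if speed_kmh <= SPEED_KMH[0]:
        return NOISE_DBA[0]
    if speed_kmh >= SPEED_KMH[-1]:
        return NOISE_DBA[-1]
    i = bisect.bisect_right(SPEED_KMH, speed_kmh)
    x0, x1 = SPEED_KMH[i - 1], SPEED_KMH[i]
    y0, y1 = NOISE_DBA[i - 1], NOISE_DBA[i]
    return y0 + (y1 - y0) * (speed_kmh - x0) / (x1 - x0)


def headroom_db(speed_kmh):
    """Remaining range between the estimated noise and the maximum level."""
    return MAX_LEVEL_DBA - estimate_noise_dba(speed_kmh)


def correction_factor_percent(speed_kmh, quiet_dba=60.0, loud_dba=80.0):
    """Hypothetical mapping of noise to the 0..100 % correction factor."""
    noise = estimate_noise_dba(speed_kmh)
    fraction = (noise - quiet_dba) / (loud_dba - quiet_dba)
    return 100.0 * min(1.0, max(0.0, fraction))
```

With these placeholder values, a speed of 150 km/h yields 80 dB(A) of noise and 25 dB of headroom, matching the example in the text, while a stationary vehicle yields a factor of 0%.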

In the upper part of fig. 3, the signal level of the audio input signal is shown in dB full scale (dBFS), i.e. decibels relative to full scale, meaning that 0 dBFS is assigned to the largest possible signal level in the digital domain. As can be seen from the upper part of fig. 3, the signal level varies considerably, and thus the loudness level of the signal as perceived by the user also varies considerably. In the lower part of fig. 3, the corresponding loudness is estimated from the signal input level. One possibility for loudness estimation is described in Recommendation ITU-R BS.1770-1 ("Algorithms to measure audio programme loudness and true-peak audio level"). In this application, loudness may be estimated by a binaural localization model. If the sound signal shown in fig. 3 is played to the user in a vehicle, some portions of the audio signal may be perceived as unpleasantly loud, while other portions may be too quiet to be properly perceived by the user. In fig. 4, the level of the signal of fig. 3 is shown after ideal adjustment. For example, in order for the user to perceive the signal well, the signal samples in range 201 should be adjusted to a lower signal level, while the signals in range 202 should be adjusted to a higher signal level. Similarly, signals in range 203 should be output at strongly reduced signal levels.
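As a small worked example of the dB full scale convention, the peak level of a block of samples normalised to the range -1 to 1 can be computed as follows. The function name and the choice of the peak (rather than an averaged level) are assumptions made for illustration; this is not the loudness measure of the application.

```python
import math


def block_peak_dbfs(samples):
    """Peak level of a block in dBFS, for samples normalised to [-1, 1]."""
    peak = max(abs(x) for x in samples)
    if peak == 0.0:
        return float("-inf")  # digital silence
    return 20.0 * math.log10(peak)


full_scale = block_peak_dbfs([1.0, -0.5, 0.25])   # 0.0 dBFS
half_scale = block_peak_dbfs([0.5, -0.25, 0.1])   # about -6.02 dBFS
```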

In the lower part of fig. 4, the estimated loudness corresponding to the ideally adjusted level in the upper part is shown. When comparing the lower part of fig. 3 with the lower part of fig. 4, it can be seen that the loudness shown in fig. 4 is more pleasant for the listener than the loudness shown in fig. 3: a smooth, relatively constant loudness is achieved.

Fig. 5 shows a more detailed view of a part of the speech detection unit. The speech detection unit has to decide whether the speech input signal comprises a speech component. To this end, the speech input signal may be separated in the separation unit 370 into frames of a defined length (for example, two seconds), and for each frame, features are calculated and extracted in the feature extraction unit 371. Thus, the speech input signal is divided into frames that are input into buffers, and feature extraction is performed on the content of each buffer. Based on the extracted features, classification is performed in element 372. For example, the mean and the standard deviation may be calculated. Finally, in element 373, clustering is performed. The clustering unit 373 attempts to find a class label for each frame by determining cluster centers in the feature space and then assigning each feature vector to the nearest center. As an example, a K-means algorithm may be used.
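The clustering step in element 373 could be sketched with a plain K-means loop as below. This is an illustrative sketch only: the initialisation from evenly spaced input vectors, the fixed iteration count and the example feature values are assumptions, and which cluster corresponds to speech would have to be decided separately.

```python
import math


def kmeans(vectors, k=2, iterations=20):
    """Plain K-means. Returns (centers, labels) for a list of feature vectors."""
    if len(vectors) < k:
        raise ValueError("need at least k feature vectors")
    # Assumed initialisation: evenly spaced picks from the input.
    step = len(vectors) // k
    centers = [list(vectors[i * step]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each feature vector goes to the nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical per-frame feature vectors forming two well-separated groups.
frame_features = [
    (0.10, 0.80), (0.12, 0.75), (0.11, 0.82),
    (0.40, 0.20), (0.42, 0.25), (0.38, 0.22),
]
centers, labels = kmeans(frame_features, k=2)
# labels is [0, 0, 0, 1, 1, 1]
```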

The extracted features in element 371 may include features such as: total spectral power, zero-crossing rate, or mel-frequency cepstral coefficient (MFCC).
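The framing and feature extraction in units 370 and 371 might look roughly like the following sketch. It covers only the zero-crossing rate and a time-domain mean power as a simple stand-in for the total spectral power; MFCCs are omitted. The function names and the non-overlapping framing are assumptions.

```python
def split_into_frames(samples, frame_length):
    """Split a signal into consecutive, non-overlapping frames.

    An incomplete frame at the end is dropped.
    """
    last_start = len(samples) - frame_length
    return [samples[i:i + frame_length]
            for i in range(0, last_start + 1, frame_length)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def mean_power(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def extract_features(frame):
    """Feature vector for one frame: [zero-crossing rate, mean power]."""
    return [zero_crossing_rate(frame), mean_power(frame)]


signal = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5, 0.1]
features = [extract_features(f) for f in split_into_frames(signal, 4)]
# features is [[1.0, 1.0], [0.0, 0.25]]
```

For a two-second frame as mentioned in the text, frame_length would be twice the sampling rate.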

It should be appreciated that any other method known in the art may be used to detect a speech signal component in a speech input signal. The speech detection should be specifically configured to distinguish between spoken speech and words occurring in a song. Only spoken-language speech components should be detected, so that these components can be handled by the gain determination in a different way compared to other non-speech components in the N-channel audio input signal.

The output of the speech detection unit may be a probability between 0% and 100%. If the probability is above a certain level, the speech detection unit may assume that speech is present in the speech input channel and may thus inform the gain determination unit such that the latter (gain determination unit) may control the speech input channel in a different way than the other audio input channels. If the speech detection unit assumes that no speech is present in the speech input channel, both the speech input channel and the other audio input channels may be adapted in the same way.

In fig. 6, different samples 61 to 63 of one of the speech output channels 46, separated by different time constants 71 to 73, are shown. The time constants 71 to 73 indicate how the loudness should be adjusted from one sample to the next. The time constant may be a rise time constant or a fall time constant. The rise time constant indicates how the signal gain increases from one sample to the next, while the fall time constant indicates how the gain decreases from one sample to the next. The time constants 71 to 73 are determined in such a way that the rise time constant can be adjusted much faster than the fall time constant. For example, if a signal pause is determined between two tracks or within one track, the audio signal level should not be increased in order to avoid amplification of noise. When a new track starts, a high signal level may occur directly after a very low signal level. Therefore, the rise time constant of the loudness estimation must be adjusted in order to avoid that the signal level at the beginning of a new track is greatly increased. The fall time constant, which applies when the audio signal level decreases, allows only a slower change of the signal level than in the case of an increase. In addition, the time constant is an adaptive time constant, meaning that the longer the track, the more slowly the time constant reacts. This may apply to both the rise time constant and the fall time constant. The smoothed loudness estimate also corresponds to the way humans perceive loudness, since spikes and dips are smoothed out by the human auditory system. The fact that the time constant changes more slowly as the running time of the audio track increases helps to maintain the dynamics of the audio signal. However, even when a long running time of the music signal is reached, the shorter reaction time to increased loudness still ensures an appropriate reaction to a rapid signal increase. Furthermore, the time constants are such that components of the speech output channel that include speech are adjusted in a different way compared to components of the other audio output channels. Furthermore, the upper part of fig. 6 shows different samples of the other audio output channels 45, which samples are separated by different time constants 91 to 93.
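The behaviour of the rise and fall time constants described above can be approximated by a one-pole smoother with two time constants. This is a sketch under stated assumptions: the exponential smoothing form, the default values, the linear slow-down with elapsed track time and the class name are all hypothetical; the application only states that the reaction to rising loudness is faster than to falling loudness, that the time constants react more slowly as the track runs longer, and that they may be reset on a pause.

```python
import math


class LoudnessSmoother:
    """Smooth block loudness values with separate rise and fall time constants."""

    def __init__(self, block_seconds=0.1, rise_s=0.2, fall_s=2.0,
                 growth_per_s=0.05):
        self.block_seconds = block_seconds
        self.rise_s = rise_s              # fast reaction to rising loudness
        self.fall_s = fall_s              # slow reaction to falling loudness
        self.growth_per_s = growth_per_s  # assumed slow-down per second of track
        self.elapsed = 0.0
        self.value = None

    def reset(self):
        """Restart the adaptation, e.g. when a pause or a new track is detected."""
        self.elapsed = 0.0
        self.value = None

    def update(self, loudness_db):
        """Feed the loudness of one block and return the smoothed loudness."""
        if self.value is None:
            self.value = loudness_db
        else:
            base = self.rise_s if loudness_db > self.value else self.fall_s
            # Adaptive part: the longer the track, the slower the reaction.
            tau = base * (1.0 + self.growth_per_s * self.elapsed)
            alpha = 1.0 - math.exp(-self.block_seconds / tau)
            self.value += alpha * (loudness_db - self.value)
        self.elapsed += self.block_seconds
        return self.value
```

Feeding a jump from -20 dB to -10 dB moves the smoothed value much further in one block than a drop from -10 dB to -20 dB does, and the same jump late in a long track moves it less than at the start.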

In the lower part of fig. 6, the gain increases and gain decreases over time for the output signals 45 and 46 are shown. A first gain 75 is determined for the first block 64 of music samples. For the following signal block 65 an increased gain is determined, followed by a signal block 66 with a slightly reduced gain, so that a gain reduction as symbolized by 76 is applied. The gain of each block, i.e., the target gain of each block, is determined based on the loudness adjustment using a time constant. The target gain for block n is then implemented as a linear ramp starting from the target gain of the previous block n-1. In the example shown in the lower section, gain increases and decreases are also shown for a speech output channel containing different samples 84 to 86 with corresponding gains 95 and 96. It is assumed that speech is detected after the end of block 64. Furthermore, it is assumed that the speech signal component should be increased compared to other components in order to improve the intelligibility of the speech component. When comparing gain 75 with gain 95, it can be seen that the speech output channel 46 has been increased more strongly than the other audio output channels 45.
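The linear ramp from the target gain of block n-1 to the target gain of block n can be sketched as follows. Ramping in the dB domain and reaching the new target at the last sample of the block are assumptions of this example; the function name is hypothetical.

```python
def ramp_gains_db(previous_target_db, target_db, block_length):
    """Per-sample gains for one block, ramping linearly to the block's target gain."""
    step = (target_db - previous_target_db) / block_length
    return [previous_target_db + step * (n + 1) for n in range(block_length)]


rising = ramp_gains_db(0.0, 4.0, 4)   # [1.0, 2.0, 3.0, 4.0]
falling = ramp_gains_db(4.0, 2.0, 2)  # [3.0, 2.0]
```

Each block thus starts near the gain at which the previous block ended, which avoids audible gain steps at block boundaries.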

The time constant may be reset if it is determined that there is a pause within one track or between two tracks. The pause detection and track detection implemented in the signal analysis unit 30 of fig. 2 are symbolized by a pause detection unit 33 and a track detection unit 34. In the embodiment of fig. 2, the loudness determination unit 31, the time constant generation unit 32, the pause detection unit 33, the track detection unit 34, the gain determination unit 35, the downmix unit 36 and the speech detection unit 37 are shown as separate units. However, it will be clear to a person skilled in the art that these units may be combined into fewer units or even into a single unit. Furthermore, the signal analysis unit may be implemented in hardware, in software, or in a combination of hardware and software.
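
Resetting an adaptive time constant on a detected pause can be sketched as follows. This is a minimal sketch under stated assumptions: the pause criterion (a run of blocks below a level threshold), the function name `time_constants_with_reset` and all parameter names and values are hypothetical and are not taken from the application.

```python
def time_constants_with_reset(levels_db, block_s=0.1, base_s=0.5,
                              slowdown=0.05, pause_db=-60.0, pause_blocks=3):
    """Return the time constant (seconds) in effect for each block.

    The time constant grows with the running time since the last reset.
    Assumption: a pause is declared once pause_blocks consecutive blocks
    lie below pause_db; the running time is then reset to zero.
    """
    elapsed_s = 0.0
    quiet_run = 0
    result = []
    for level in levels_db:
        quiet_run = quiet_run + 1 if level < pause_db else 0
        if quiet_run >= pause_blocks:
            elapsed_s = 0.0  # pause detected: reset the adaptive time constant
        else:
            elapsed_s += block_s
        result.append(base_s * (1.0 + slowdown * elapsed_s))
    return result
```

After the reset, the time constant returns to its base value, so the loudness estimate can react quickly to the start of the next track.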

In fig. 7, a first example of automatic loudness adjustment is shown. In the upper part of fig. 7, the audio input signal is shown prior to loudness estimation. As can be seen from the two channels of the audio input signal, the input signal covers different input level ranges. The maximum input level may be 0 dB full scale. In the lower part of fig. 7, the audio output signal 19 is shown after loudness estimation and gain adjustment. As can be seen from the lower part of fig. 7, the average signal level is set to -12 dB full scale. At the same time, the dynamic structure of the audio signal is preserved.

Another example is shown in fig. 8, where the input signal has a maximum input level of -20 dB full scale. In the lower part of fig. 8, the audio output signal 19 is shown after loudness estimation and gain adjustment. Again the dynamic structure is preserved and the average signal level is again -12 dB full scale. If the input signals shown in the upper parts of figs. 7 and 8 were output to the user, the user would have to adjust the volume frequently, both to avoid unpleasantly high signal levels and to turn up the volume for parts of the audio signal where the signal level is too low.
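
The gain needed to bring a signal to an average level of -12 dB full scale can be illustrated with a short calculation. This is a minimal sketch: it assumes that the RMS level stands in for the perceived loudness (the loudness estimation of the application is more elaborate), and the helper names `rms_dbfs` and `gain_to_target_db` are hypothetical.

```python
import math


def rms_dbfs(samples):
    """RMS level of samples in dB full scale (full scale = amplitude 1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def gain_to_target_db(samples, target_dbfs=-12.0):
    """Gain in dB that moves the measured average level to target_dbfs."""
    return target_dbfs - rms_dbfs(samples)
```

For example, a block whose RMS level is -20 dB full scale would need a gain of about +8 dB to reach -12 dB full scale, while a block at -6 dB full scale would need about -6 dB.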

Fig. 9 shows a schematic architectural view of a system 400. The system 400 is configured to implement all of the steps discussed above in connection with the other figures. The system 400 comprises an interface 410 having an input unit and an output unit (not shown in detail). The interface is provided for outputting the combined output signal 60 shown in fig. 1. The interface is further configured to receive the different input signals 20, 21 discussed above in connection with fig. 1.

Furthermore, a processing unit 420 is provided, which is responsible for the operation of the system 400. The processing unit 420, which includes one or more processors (e.g., digital signal processors (DSPs)), may execute instructions stored in the memory 430, where the memory may include read-only memory, random access memory, mass storage, and the like. Furthermore, the memory may comprise suitable program code to be executed by the processing unit 420 in order to implement the above-described functions of the system, wherein the speech signal components are adjusted in a different manner compared to the other audio input channels of the N-channel audio input signal, as discussed above in connection with figs. 1 to 8.

With the present application, the user no longer needs frequent volume adjustments, since the system estimates loudness and automatically and dynamically aligns the gain prior to output. Furthermore, the gains of the different components are adjusted so that the speech components present in the N-channel signal can be better understood.
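
The different treatment of the speech channel and the other channels can be sketched as follows. This is an illustrative example only: the function name `channel_gains_db`, the fixed offset `speech_boost_db` and its value are hypothetical, and the application does not state that the second gain parameter is derived from the first by a constant offset.

```python
def channel_gains_db(loudness_db, speech_present,
                     target_db=-12.0, speech_boost_db=3.0):
    """Return (gain for other channels, gain for speech channel) in dB.

    Assumption: both gains start from the gain that moves the estimated
    loudness to target_db; when speech is detected, the speech channel
    receives an additional hypothetical offset speech_boost_db.
    """
    base_gain_db = target_db - loudness_db
    if speech_present:
        return base_gain_db, base_gain_db + speech_boost_db
    # No speech detected: all channels are adjusted with the same gain.
    return base_gain_db, base_gain_db
```

With a positive offset, the speech channel is raised by a larger amount when the gain increases and lowered by a smaller amount when the gain decreases.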

Claims

1. A method for adjusting the gain of an N-channel audio input signal to generate an N-channel audio output signal, wherein the N-channel audio input signal comprises a speech input channel (21) in which speech signal components are present if present in the N-channel audio input signal, and the N-channel audio input signal comprises other audio input channels (20), the method comprising:

dynamically determining the perceived loudness of the N-channel audio input signal,

determining whether a speech signal component is present in the speech input channel (21),

wherein if a speech signal component is present in said speech input channel (21),

dynamically adjusting the gain of the other audio input channels (20) with a first gain parameter (39) in a first gain control unit (43) based on the determined perceived loudness of the N-channel audio input signal such that at least two consecutive tracks of other audio output channels (45) output from the first gain control unit (43) are limited to a predefined signal level range or a predefined loudness range,

dynamically adjusting the gain of the speech input channel (21) in a second gain control unit (44) with a second gain parameter (38) based on the determined loudness of the N-channel audio input signal, such that at least two consecutive tracks of speech output channels (52) output from the second gain control unit (44) are limited to the predefined signal level range or loudness range, wherein the second gain parameter (38) is different from the first gain parameter (39),

estimating ambient noise in a space to which the N-channel audio input signal is output, wherein gains of the other audio input channels and the speech input channels are adjusted in consideration of the estimated ambient noise, wherein the N-channel audio input signal is output to an interior of a vehicle, wherein estimating the ambient noise comprises determining a vehicle speed and determining the ambient noise based on the determined vehicle speed,

wherein determining whether a speech signal component is present in the speech input channel (21) comprises the steps of:

dividing the speech input channel into audio frames,

performing feature extraction on a per-frame basis, and

clustering the extracted features in a feature space.

2. The method of claim 1, wherein the first gain parameter (39) and the second gain parameter (38) are determined such that a ratio of a signal level of the speech input channel (21) to a signal level of the speech output channel (52) is smaller than a ratio of a signal level of the other audio input channel (20) to a signal level of the other audio output channel (45).

3. The method of claim 1 or 2, wherein the first gain parameter (39) and the second gain parameter (38) are determined such that the signal level of the speech input channel is increased by the second gain parameter (38) by a larger amount than the signal level of the other audio input channel is increased by the first gain parameter (39).

4. The method of claim 1 or 2, wherein the first gain parameter (39) and the second gain parameter (38) are determined such that the signal level of the speech input channel (21) is reduced by the second gain parameter by a smaller amount than the signal level of the other audio input channel is reduced by the first gain parameter (39).

5. The method of claim 1 or 2, wherein the perceived loudness of the N-channel audio input signal is determined overall for all N channels.

6. The method of claim 1 or 2, wherein the perceived loudness of the N-channel audio input signals of individual groups is determined separately.

7. The method according to claim 1 or 2, wherein the other audio input channels (20) and the speech input channels (21) are adjusted with the same gain if no speech signal component is present in the speech input channels.

8. A system configured to adjust a gain of an N-channel audio input signal to generate an N-channel audio output signal, wherein the N-channel audio input signal comprises a speech input channel (21) in which speech signal components are present if present in the N-channel audio input signal, and the N-channel audio input signal comprises other audio input channels (20), the system comprising:

a loudness determination unit (31) configured to determine a perceived loudness of the N-channel audio input signal,

a speech detection unit (37) configured to determine whether a speech signal component is present in the speech input channel (21),

a first gain control unit (43) configured to control the gain of the other audio input channel (20),

a second gain control unit (44) configured to control the gain of the speech input channel (21),

wherein if the speech detection unit detects the presence of a speech signal component in the speech input channel,

the first gain control unit (43) dynamically adjusts the gain of the other audio input channels (20) with a first gain parameter (39) based on the determined perceived loudness of the N-channel audio input signal such that at least two consecutive audio tracks of other audio output channels (45) output from the first gain control unit (43) are limited to a predefined signal level range or a predefined loudness range,

the second gain control unit (44) dynamically adjusts the gain of the speech input channel (21) with a second gain parameter (38) based on the determined loudness of the N-channel audio input signal, such that at least two consecutive audio tracks of a speech output channel (52) output from the second gain control unit (44) are limited to the predefined signal level range or loudness range, wherein the second gain parameter is different from the first gain parameter,

a noise estimator (50) configured to estimate ambient noise in a space to which the N-channel audio input signal is output, wherein the first gain control unit (43) and the second gain control unit (44) are configured to adjust the gains of the other audio input channels and the speech input channels in view of the estimated ambient noise, wherein the N-channel audio input signal is output to the interior of the vehicle, and wherein the noise estimator is configured to determine a vehicle speed and to determine the ambient noise based on the determined vehicle speed;

wherein the speech detection unit (37) is configured to determine whether a speech signal component is present in the speech input channel based on steps comprising:

dividing the speech input channel into audio frames,

performing feature extraction on a per-frame basis, and

clustering the extracted features in a feature space.

9. The system of claim 8, wherein the first gain control unit (43) and the second gain control unit (44) determine the first gain parameter and the second gain parameter such that a ratio of a signal level of the speech input channel (21) to a signal level of the speech output channel (52) is smaller than a ratio of a signal level of the other audio input channel (20) to a signal level of the other audio output channel (45).

10. The system of claim 8 or 9, wherein the first gain control unit (43) and the second gain control unit (44) determine the first gain parameter and the second gain parameter such that the signal level of the speech input channel is increased by the second gain parameter by a larger amount than the signal levels of the other audio input channels are increased by the first gain parameter.

11. The system of claim 8 or 9, wherein the first gain control unit (43) and the second gain control unit (44) determine the first gain parameter and the second gain parameter such that the signal level of the speech input channel (21) is reduced by the second gain parameter by a smaller amount than the signal level of the other audio input channel is reduced by the first gain parameter.

12. The system according to claim 8 or 9, wherein the loudness determination unit (31) is configured to determine the perceived loudness of the N-channel audio input signal together as a combined loudness level for all N channels.

13. The system according to claim 8 or 9, wherein the loudness determination unit (31) is configured to determine the perceived loudness of the N-channel audio input signals of individual groups individually.

14. A system (400) configured to adjust a gain of an N-channel audio input signal to generate an N-channel audio output signal, wherein the N-channel audio input signal comprises a speech input channel, a speech signal component is present in the speech input channel if the speech signal component is present in the N-channel audio input signal, and the N-channel audio input signal comprises other audio input channels, the system comprising:

at least one processor (420),

a memory (430) containing instructions executable by the at least one processor, wherein the system is operative to implement the method according to any one of claims 1 to 7.