CN115396776A - Earphone control method and device, earphone and computer readable storage medium - Google Patents

Earphone control method and device, earphone and computer readable storage medium

Info

Publication number
CN115396776A
CN115396776A (application CN202211027381.3A)
Authority
CN
China
Prior art keywords
frequency domain
earphone
domain signal
mode
ear canal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211027381.3A
Other languages
Chinese (zh)
Inventor
张锐
李罡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202211027381.3A
Publication of CN115396776A
Legal status: Pending

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04R — LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 — Details of transducers, loudspeakers or microphones
    • H04R 1/20 — Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/22 — Arrangements for obtaining desired frequency characteristic only
    • H04R 1/10 — Earpieces; Attachments therefor; Earphones; Monophonic headphones

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Headphones And Earphones (AREA)

Abstract

The present disclosure relates to an earphone control method and apparatus, an earphone, and a computer-readable storage medium. The control method comprises: acquiring an environment frequency domain signal, which is a frequency domain representation of the sound signal in the environment around the earphone; acquiring an ear canal frequency domain signal, which is a frequency domain representation of the sound signal in the ear canal of the user wearing the earphone; obtaining a spectrum amplitude difference from the two signals, where the difference represents the gap between the amplitude of the environment frequency domain signal and that of the ear canal frequency domain signal; and, if voice activity of the user wearing the earphone is determined from the spectrum amplitude difference and a preset voice detection strategy, controlling the mode of the earphone to switch to a transparent mode. Determining whether the wearer has voice activity from the spectrum amplitude difference of the environment and ear canal frequency domain signals not only makes earphone control more intelligent but also effectively reduces the earphone's power consumption.

Description

Earphone control method and device, earphone and computer readable storage medium
Technical Field
The present disclosure relates to the field of earphone technologies, and in particular, to an earphone control method and apparatus, an earphone, and a computer-readable storage medium.
Background
In recent years, against the background of the rapid worldwide adoption of new-generation consumer electronics such as smartphones and tablet computers, earphone products, wireless earphones in particular, have grown explosively. Noise reduction earphones isolate external noise and improve perceived sound quality, and are increasingly popular.
However, a noise reduction earphone suppresses not only environmental noise but also people's speech, which can interfere with communication between the wearer and others. In the prior art, a wearer who needs to talk with someone must either take the noise reduction earphone off or manually switch its noise reduction mode off, which is very inconvenient. How to control the earphone better is therefore an urgent technical problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method and an apparatus for controlling an earphone, and a computer-readable storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a method for controlling a headset, the method including:
acquiring an environment frequency domain signal, wherein the environment frequency domain signal is a frequency domain expression mode of a sound signal in the environment around the earphone;
acquiring ear canal frequency domain signals, wherein the ear canal frequency domain signals are frequency domain expression modes of sound signals in an ear canal of a user wearing the earphone;
obtaining a frequency spectrum amplitude difference according to the environment frequency domain signal and the auditory canal frequency domain signal, wherein the frequency spectrum amplitude difference represents a difference value between the amplitude of the environment frequency domain signal and the amplitude of the auditory canal frequency domain signal;
and if the voice activity of the user wearing the earphone is determined according to the frequency spectrum amplitude difference and a preset voice detection strategy, controlling the mode of the earphone to be switched to a transparent mode.
Optionally, before controlling the mode of the earphone to switch to the pass-through mode, the method further includes:
acquiring power consumption modes of the earphone, wherein the power consumption modes comprise a low power consumption mode and a high performance mode;
if the power consumption mode is the high-performance mode, determining whether voice activity exists in a user wearing the earphone according to the frequency spectrum amplitude difference and a preset first voice detection strategy;
and if the power consumption mode is the low power consumption mode, determining whether voice activity exists in the user wearing the earphone according to the frequency spectrum amplitude difference and a preset second voice detection strategy.
Optionally, obtaining the power consumption mode of the headset comprises:
obtaining a power consumption mode preset by a user; or
acquiring performance parameters of the earphone and determining the corresponding power consumption mode from them, wherein the performance parameters of the earphone comprise at least one of the remaining battery capacity and the CPU load.
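A sketch of this selection logic, assuming concrete thresholds that the patent does not give (the function name, the 20% battery floor, and the 0.8 CPU ceiling are all invented for illustration):

```python
def select_power_mode(user_preset=None, battery_pct=None, cpu_load=None,
                      battery_floor=20, cpu_ceiling=0.8):
    """Pick the voice-detection power consumption mode. A user preset wins;
    otherwise the mode is derived from the remaining battery capacity and
    the CPU load, the two performance parameters the patent names."""
    if user_preset in ("low_power", "high_performance"):
        return user_preset
    if (battery_pct is not None and battery_pct < battery_floor) or \
       (cpu_load is not None and cpu_load > cpu_ceiling):
        return "low_power"
    return "high_performance"

print(select_power_mode(battery_pct=15))                # low_power
print(select_power_mode(battery_pct=80, cpu_load=0.3))  # high_performance
```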
Optionally, determining whether voice activity exists in the user wearing the headset according to the spectrum amplitude difference and a preset first voice detection strategy includes:
inputting the spectrum amplitude difference into a preset voice detection model to obtain a voice detection result, wherein the voice detection model is obtained by machine-learning training of an initial model on a training set; the training set comprises a positive sample data set and a negative sample data set; each positive sample comprises first audio data, which contains the user's speech recorded while the user is wearing the earphone, and a first label indicating that the corresponding first audio data is target audio; each negative sample comprises second audio data, which does not contain the user's speech while the user is wearing the earphone, and a second label indicating that the corresponding second audio data is not target audio;
it is determined whether voice activity is present for the user wearing the headset based on the voice detection result.
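As an illustration only, the trained voice detection model of this first strategy could be as simple as a logistic classifier over the per-bin spectrum amplitude difference; the patent does not specify the model architecture, and the weights below are placeholders rather than trained values:

```python
import numpy as np

def detect_voice(fc_frame, w, b):
    """Stand-in for the trained voice detection model: the spectrum
    amplitude difference of one frame is the feature vector, and w, b
    would come from training on the positive/negative sample sets
    described above (values here are invented placeholders)."""
    score = 1.0 / (1.0 + np.exp(-(fc_frame @ w + b)))  # sigmoid score
    return score > 0.5  # True -> voice activity detected

w = np.array([0.8, 0.6, 0.2])  # hypothetical learned weights
b = -1.0                       # hypothetical learned bias
print(detect_voice(np.array([2.0, 1.5, 0.5]), w, b))  # True
print(detect_voice(np.array([0.1, 0.0, 0.0]), w, b))  # False
```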
Optionally, the spectrum amplitude difference comprises a plurality of sub-amplitude differences, and determining whether voice activity exists in the user wearing the headset according to the spectrum amplitude difference and a preset second voice detection strategy includes:
acquiring a perceptual weighted value, wherein the perceptual weighted value is the sum of the products of each sub-amplitude difference and its corresponding perceptual weighting coefficient, the perceptual weighting coefficients being the weights assigned to different frequency bands;
determining whether voice activity is present for the user wearing the headset based on the perceptual weighting values.
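A minimal sketch of this second, low-power strategy. The weighting coefficients and the decision threshold below are invented for illustration; the patent gives neither concrete values nor the direction of the comparison:

```python
import numpy as np

def has_voice_activity(fc_frame, weights, threshold=1.0):
    """Low-power strategy: weight each sub-band amplitude difference by a
    perceptual coefficient, sum the products, and compare the perceptual
    weighted value against a preset threshold (an assumed value)."""
    assert fc_frame.shape == weights.shape
    perceptual_value = float(np.sum(fc_frame * weights))
    return perceptual_value > threshold

# hypothetical coefficients emphasising speech-dominant low/mid bands
weights = np.array([0.5, 1.0, 1.0, 0.3])
quiet = np.array([0.1, 0.1, 0.0, 0.0])     # little FF/FB divergence
speaking = np.array([1.2, 0.9, 0.6, 0.1])  # strong divergence
print(has_voice_activity(quiet, weights))     # False
print(has_voice_activity(speaking, weights))  # True
```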
Optionally, if it is determined that there is voice activity in the user wearing the headset according to the spectrum amplitude difference and a preset voice detection policy, controlling the mode of the headset to switch to a transparent mode, including:
if the voice activity of the user wearing the earphone is determined according to the frequency spectrum amplitude difference and a preset voice detection strategy, acquiring the type of the voice activity;
if the type of the voice activity is a conversation type, controlling the mode of the earphone to be switched to a transparent mode;
if the voice activity is of a non-conversational type, the mode of the headset is kept unchanged.
Optionally, obtaining a spectrum amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal includes:
acquiring a speaker reference signal, wherein the speaker reference signal is an original sound signal of an audio signal to be played by a speaker of an earphone, and the speaker reference signal comprises a plurality of audio frames;
determining the energy sum of each audio frame contained in the loudspeaker reference signal;
and if the energy sum of the audio frames is less than the preset energy, obtaining the spectrum amplitude difference according to the environment frequency domain signal and the auditory canal frequency domain signal.
Optionally, the method further comprises:
if the energy sum of the audio frames is greater than or equal to the preset energy, performing echo cancellation on the environment frequency domain signal and the ear canal frequency domain signal based on the frequency domain amplitude corresponding to the speaker reference signal, to obtain an echo-cancelled environment frequency domain signal and an echo-cancelled ear canal frequency domain signal;
and obtaining the spectrum amplitude difference according to the echo-cancelled environment frequency domain signal and the echo-cancelled ear canal frequency domain signal.
Optionally, obtaining an ambient frequency domain signal comprises:
acquiring an environment time domain signal, wherein the environment time domain signal refers to a time domain expression mode of a sound signal in the environment around the earphone;
performing time-frequency conversion on the environment time domain signal to obtain an environment frequency domain signal;
acquiring ear canal frequency domain signals, comprising:
acquiring an ear canal time domain signal, wherein the ear canal time domain signal refers to a time domain expression mode of a sound signal in an ear canal of a user wearing an earphone;
and carrying out time-frequency conversion on the ear canal time domain signal to obtain an ear canal frequency domain signal.
According to a second aspect of embodiments of the present disclosure, there is provided a control apparatus of a headphone including a feedforward microphone and a feedback microphone, the apparatus including:
a first obtaining module configured to obtain an environment frequency domain signal, where the environment frequency domain signal is a frequency domain expression of a sound signal in an environment around the earphone;
a second obtaining module configured to obtain an ear canal frequency domain signal, which is a frequency domain expression of a sound signal in an ear canal of a user wearing the earphone;
a determining module configured to obtain a spectrum amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal, wherein the spectrum amplitude difference represents a difference value between an amplitude of the environment frequency domain signal and an amplitude of the ear canal frequency domain signal;
and the switching module is configured to control the mode of the earphone to be switched to a transparent mode if the voice activity of the user wearing the earphone is determined according to the spectrum amplitude difference and a preset voice detection strategy.
According to a third aspect of embodiments of the present disclosure, there is provided a headset comprising:
a processor;
a memory for storing processor-executable instructions;
a feed-forward microphone and a feedback microphone;
wherein the processor is configured to:
acquiring an environment frequency domain signal, wherein the environment frequency domain signal is a frequency domain expression mode of a sound signal in the environment around the earphone;
acquiring ear canal frequency domain signals, wherein the ear canal frequency domain signals are frequency domain expression modes of sound signals in an ear canal of a user wearing the earphone;
obtaining a frequency spectrum amplitude difference according to the environment frequency domain signal and the auditory canal frequency domain signal, wherein the frequency spectrum amplitude difference represents a difference value between the amplitude of the environment frequency domain signal and the amplitude of the auditory canal frequency domain signal;
and if the voice activity of the user wearing the earphone is determined according to the frequency spectrum amplitude difference and a preset voice detection strategy, controlling the mode of the earphone to be switched to a transparent mode.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the control method of a headset provided by the first aspect of the present disclosure.
The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects: once the environment frequency domain signal and the ear canal frequency domain signal are obtained, a spectrum amplitude difference can be derived from them; on that basis, if voice activity of the user wearing the earphone is determined from the spectrum amplitude difference and a preset voice detection strategy, the mode of the earphone is controlled to switch to a transparent mode. Determining whether the wearer has voice activity from the spectrum amplitude difference and a preset voice detection strategy not only makes earphone control more flexible and reduces its power consumption, but also requires no user intervention, thereby improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a control method of a headset according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a structure of a headset in a control method of the headset according to an exemplary embodiment.
Fig. 3 is an exemplary diagram illustrating switching of an earphone mode to a pass-through mode in a control method of an earphone according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating a control method of a headset according to another exemplary embodiment.
Fig. 5 is a block diagram illustrating a control device of a headset according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a headset according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In recent years the TWS (True Wireless Stereo) earphone market has boomed and the industry has kept expanding, with major earphone manufacturers launching a variety of feature-rich products. Active noise reduction and a transparent function are now essentially standard on TWS earphones, and as electronic products grow more intelligent, many ship with a smart no-removal function, i.e. the TWS earphone switches to a transparent mode automatically in specific scenarios.
The smart no-removal function, also called the adaptive transparency function, means that when the earphone detects that its wearer is speaking, it automatically switches from the noise reduction mode to the transparent mode, so the wearer can communicate with others without removing the earphone.
Most smart no-removal schemes currently on the market are implemented with additional hardware such as bone vibration sensors. They therefore need extra hardware assistance, which increases the hardware cost of the earphone, and they are prone to false detections in vocal cord vibration scenarios such as coughing.
In view of the above problems, the present application provides an earphone control method and apparatus, an earphone, and a computer-readable storage medium that determine whether the user wearing the earphone has voice activity from the spectrum amplitude difference obtained from the environment frequency domain signal and the ear canal frequency domain signal, which effectively reduces false detections in vocal cord vibration scenarios such as coughing and effectively improves the user experience.
Fig. 1 is a flowchart illustrating a control method of an earphone according to an exemplary embodiment. As shown in fig. 1, the method is used in an earphone and includes the following steps.
In step S11, an ambient frequency domain signal is acquired.
In this embodiment, the earphone may be an active noise reduction earphone whose default operating mode is the noise reduction mode. The environment frequency domain signal, a frequency domain representation of the sound signal in the environment around the earphone, may be acquired when it is detected that the user is wearing the earphone. Optionally, in the embodiment of the present application, the environment frequency domain signal may instead be acquired when the noise reduction mode of the earphone is detected to be on.
Alternatively, sound signals in the environment surrounding the earphone may be picked up by the earphone's feedforward microphone. As shown in fig. 2, the earphone in the embodiment of the present application may include a feedforward microphone (FF MIC) 101, disposed on the outside of the earphone, which can receive both the voice signal of the wearer and the voice signals of other people nearby. Optionally, the environment frequency domain signal of the current sampling period may be acquired.
As another optional manner, when the environment frequency domain signal is obtained, the environment time domain signal may be obtained and subjected to time-frequency conversion to obtain the environment frequency domain signal, where the environment time domain signal refers to a time domain expression manner of a sound signal in an environment around the earphone.
Alternatively, after acquiring the sound signal in the surrounding environment of the earphone, the embodiment of the present application may apply framed overlapping windowing to it to obtain a frame-level signal. The sampling frequency of the sound signal should be greater than a preset frequency, which may be 8000 Hz; for example, the sampling frequency may be 16000 Hz.
Alternatively, when applying framed overlapping windowing to the sound signal in the environment around the earphone, the frame length may be any of 16 ms, 32 ms, or 64 ms; in the embodiment of the present application it is 32 ms. The frame shift may be 25%, 50%, or 75% of the frame length, with 50% preferred. The window may be a Hanning window, a Hamming window, or similar, with a Hanning window preferred, and the window length equals the frame length.
As an optional mode, when performing framed overlapping windowing on the sound signal in the environment around the earphone, the present application may first divide that signal into frames to obtain a multi-frame acoustic signal, and then convert each frame into an environment time domain signal, obtaining a multi-frame environment time domain signal. The environment time domain signal can be calculated as follows:

ff_f(m, n) = ff((m − 1) · inc + n) · w(n), 0 ≤ n ≤ L − 1

where m is the frame index, n the sample index within a frame, ff_f(m) the m-th frame of the environment time domain signal, and ff_f(m, n) its n-th time domain sample. inc is the frame shift in samples, w(n) the window function, and L the frame length in samples, given by L = sampling rate (fs) × frame length (ft) / 1000. On this basis, a fast Fourier transform of the environment time domain signal yields the frame-level feedforward microphone frequency domain amplitude signal FF(m), i.e. the environment frequency domain signal.
The environment frequency domain signal FF(m) can be calculated as follows:

FF(m, k) = | Σ_{n=0}^{L−1} ff_f(m, n) · e^(−j2πkn/L) |, 0 ≤ k ≤ L/2

where m is the frame index, k the frequency bin index, FF(m) the m-th frame of the feedforward microphone frequency domain amplitude signal (the environment frequency domain signal), and FF(m, k) the k-th frequency domain sample of the m-th frame. Because of the conjugate symmetry of the discrete Fourier transform, only the first L/2 + 1 frequency domain points need to be obtained.
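The framing, windowing, and FFT steps above can be sketched in NumPy. This is an illustration only: the function names and the test tone are invented here, while the 16 kHz sampling rate, 32 ms frames, 50% frame shift, and Hanning window are the values the embodiment prefers. Note the code indexes frames from 0 where the formula indexes from 1:

```python
import numpy as np

def frame_signal(ff, fs=16000, frame_ms=32, overlap=0.5):
    """Split a time-domain signal into overlapping Hann-windowed frames,
    a sketch of ff_f(m, n) = ff((m-1)*inc + n) * w(n)."""
    L = fs * frame_ms // 1000  # frame length in samples: fs * ft / 1000
    inc = int(L * overlap)     # frame shift in samples
    w = np.hanning(L)          # window length equals frame length
    n_frames = 1 + (len(ff) - L) // inc
    return np.stack([ff[m * inc : m * inc + L] * w for m in range(n_frames)])

def frame_spectrum(frames):
    """Per-frame magnitude spectrum; rfft keeps exactly the first
    L//2 + 1 bins that conjugate symmetry makes sufficient."""
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000
t = np.arange(fs) / fs
ff = np.sin(2 * np.pi * 440 * t)  # 1 s of a 440 Hz tone as a stand-in
FF = frame_spectrum(frame_signal(ff))
print(FF.shape)  # (61, 257): 61 frames of 512 samples -> 257 bins each
```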
In step S12, an ear canal frequency domain signal is acquired.
As an optional mode, when it is detected that the user is wearing the earphone, the ear canal frequency domain signal, a frequency domain representation of the sound signal in the ear canal of the user wearing the earphone, may also be acquired. Optionally, the embodiment of the present application may instead acquire the ear canal frequency domain signal when the noise reduction mode of the earphone is detected to be on.
As an alternative, the sound signal in the ear canal of the user wearing the earphone may be picked up by the earphone's feedback microphone. As shown in fig. 2, the earphone in the embodiment of the present application may include a feedback microphone (FB MIC) 102, disposed on the inside of the earphone, which receives the wearer's speech signals transmitted by the Eustachian tube into the ear canal. Optionally, the ear canal frequency domain signal of the current sampling period may be acquired.
As another optional mode, when acquiring the ear canal frequency domain signal, the present application may acquire an ear canal time domain signal, and perform time-frequency conversion on the ear canal time domain signal to obtain the ear canal frequency domain signal, where the ear canal time domain signal refers to a time domain expression mode of a sound signal in an ear canal of a user wearing an earphone.
As an optional manner, after acquiring a sound signal in an ear canal of a user wearing an earphone, the embodiment of the present application may perform frame-based overlapping windowing on the sound signal in the ear canal of the user wearing the earphone to obtain a frame-level signal. The sampling frequency of the sound signal in the ear canal of the user wearing the earphone is similar to the sampling frequency of the sound signal in the environment around the earphone, and therefore the description is omitted here.
In addition, the process of performing the framing overlapping windowing on the sound signal in the ear canal of the user wearing the earphone is similar to the process of performing the framing overlapping windowing on the sound signal in the environment around the earphone, and the process may also be performed by first performing the framing processing on the sound signal in the ear canal of the user wearing the earphone to obtain the multi-frame acoustic signal. On the basis, the acoustic signal is converted into an ear canal time domain signal to obtain a multi-frame ear canal time domain signal. The calculation formula of the ear canal time domain signal can be as follows.
fb_f(m, n) = fb((m − 1) · inc + n) · w(n), 0 ≤ n ≤ L − 1

where fb_f(m) denotes the m-th frame of the ear canal time domain signal and fb_f(m, n) its n-th time domain sample.
On this basis, the embodiment of the application can perform a fast Fourier transform on the ear canal time domain signal to obtain the frame-level feedback microphone frequency domain amplitude signal FB(m), i.e. the ear canal frequency domain signal, calculated as follows:

FB(m, k) = | Σ_{n=0}^{L−1} fb_f(m, n) · e^(−j2πkn/L) |, 0 ≤ k ≤ L/2

where FB(m) denotes the m-th frame of the feedback microphone frequency domain amplitude signal (the ear canal frequency domain signal) and FB(m, k) the k-th frequency domain sample of the m-th frame.
It should be noted that, in addition to the feedforward microphone and the feedback microphone, the earphone may also include a speaker 103 as shown in fig. 2, which may be used for outputting audio. When the environment frequency domain signal and the ear canal frequency domain signal are obtained, the speaker reference signal may be obtained as well; the three signals may correspond to the same time.
In step S13, a spectrum amplitude difference is obtained according to the environment frequency domain signal and the ear canal frequency domain signal.
As an optional manner, after the environment frequency domain signal and the ear canal frequency domain signal are obtained, the embodiment of the present application may compute the spectrum amplitude difference from them. The spectrum amplitude difference represents the difference between the amplitude of the environment frequency domain signal and the amplitude of the ear canal frequency domain signal, and can be calculated as follows:

FC(m, k) = FF(m, k) − FB(m, k), 0 ≤ k ≤ L/2

where FC(m) denotes the spectrum amplitude difference of the m-th frame and FC(m, k) its k-th frequency domain sample (frequency bin).
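With the per-frame magnitude spectra FF and FB held as arrays of shape (frames, bins), the difference is elementwise; a minimal sketch (the helper name and toy values are ours):

```python
import numpy as np

def spectral_amplitude_difference(FF, FB):
    """FC(m, k) = FF(m, k) - FB(m, k): per-bin difference between the
    feedforward (environment) and feedback (ear canal) magnitude
    spectra. Both inputs have shape (n_frames, n_bins)."""
    assert FF.shape == FB.shape
    return FF - FB

# toy spectra: environment louder than the ear canal in every bin
FF = np.array([[3.0, 2.0, 1.0]])
FB = np.array([[1.0, 0.5, 0.25]])
FC = spectral_amplitude_difference(FF, FB)
print(FC)  # [[2.   1.5  0.75]]
```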
In step S14, if it is determined that there is voice activity in the user wearing the headset according to the spectrum amplitude difference and a preset voice detection policy, the mode of the headset is controlled to switch to the transparent mode.
As an optional manner, after obtaining the spectrum amplitude difference from the environment frequency domain signal and the ear canal frequency domain signal, the embodiment of the present application may determine whether the user wearing the earphone has voice activity according to the spectrum amplitude difference and a preset voice detection strategy. If so, the mode of the earphone can be controlled to switch to a transparent mode. In the embodiment of the present application, the transparent mode means that the earphone picks up the ambient sound, filters it, and outputs it so that it superposes with the sound leaking naturally into the ear, and the human ear hears the complete ambient sound.
In other words, in the case that it is determined that the user wearing the headset has voice activity, the present application may switch the mode of the headset from the active noise reduction mode to the pass-through mode; the switching of the headset mode may be as shown in fig. 3. As can be seen from fig. 3, when voice activity of the user wearing the headset is detected, the headset switches its mode from the active noise reduction mode to the pass-through mode.
In addition, if it is determined that there is no voice activity for the user wearing the headset according to the spectrum amplitude difference and the preset voice detection policy, the embodiment of the present application may keep the mode of the headset unchanged in the active noise reduction mode, that is, not perform the mode switching operation.
Optionally, if it is determined that the user wearing the headset does not have voice activity according to the spectrum amplitude difference and a preset voice detection policy, the embodiment of the present application may also determine whether the user wearing the headset is in a specified scene, and if it is determined that the user wearing the headset is in the specified scene, the embodiment of the present application may also switch the mode of the headset from the noise reduction mode to the pass-through mode.
As a specific implementation manner, when determining whether a user wearing the headset is in a specified scene, the application may determine whether the headset acquires an audio signal of a specified type, and if it is determined that the headset acquires an audio signal of a specified type, the embodiment of the application may determine that the user is in the specified scene. Wherein the specified type of audio signal may be a whistle signal or a call signal.
In addition, if it is determined that there is no voice activity of the user wearing the headset and it is determined that the user wearing the headset is not in the specified scene, the embodiment of the present application may keep the mode of the headset unchanged, that is, keep the mode of the headset in the noise reduction mode.
As an optional manner, when determining whether a user wearing the headset has voice activity according to the spectrum amplitude difference and a preset voice detection policy, the embodiment of the present application may determine whether the user wearing the headset has voice activity by using different voice detection policies. Specifically, how to determine whether voice activity exists in a user wearing the headset by using different voice detection strategies is described in detail in the following embodiments, and will not be described herein again.
When the environment frequency domain signal and the ear canal frequency domain signal are obtained, the spectrum amplitude difference can be obtained from them. On this basis, if it is determined according to the spectrum amplitude difference and a preset voice detection strategy that the user wearing the earphone has voice activity, the mode of the earphone is controlled to switch to the pass-through mode. By using the spectrum amplitude difference and the preset voice detection strategy to determine whether the user wearing the earphone has voice activity, the present application not only improves the flexibility of earphone control but also reduces its power consumption; control of the earphone is achieved without user intervention, which in turn improves the user experience.
Fig. 4 is a flowchart illustrating a control method of a headset according to an exemplary embodiment. The control method is used in a headset and, as shown in fig. 4, includes the following steps.
In step S21, an ambient frequency domain signal is acquired.
In step S22, an ear canal frequency domain signal is acquired.
Steps S21 to S22 have been described in detail in the above embodiments and are not repeated here.
In step S23, a spectrum amplitude difference is obtained according to the environment frequency domain signal and the ear canal frequency domain signal.
As an optional manner, when the frequency spectrum amplitude difference is obtained according to the environment frequency domain signal and the ear canal frequency domain signal, the embodiment of the present application may also obtain a speaker reference signal, where the speaker reference signal may be an original sound signal of an audio signal to be played by a speaker of the earphone, and the speaker reference signal includes a plurality of audio frames.
On this basis, the embodiment of the present application may obtain the energy sum of each audio frame included in the speaker reference signal, and then determine whether the energy sum of the audio frame is less than the preset energy, and if it is determined that the energy sum of the audio frame is less than the preset energy, the embodiment of the present application may obtain the spectrum amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal.
Specifically, when the energy sum of each audio frame included in the speaker reference signal is obtained, the embodiment of the present application may also perform frame-wise overlapping windowing on the speaker reference signal to obtain a multi-frame acoustic signal. On the basis, the acoustic signal is converted into a time domain signal to obtain a multi-frame reference time domain signal, and the calculation formula of the reference time domain signal can be as follows.
spk_f(m,n)=spk((m-1)*inc+n)*w(n),0≤n≤(L-1);
Where m denotes the frame number index, n denotes the data point index of the audio data, spk_f(m) denotes the mth frame of the reference time domain signal, and spk_f(m, n) denotes the nth time-domain sample point of the mth frame reference signal. inc is the frame shift in sampling points, w(n) is the window function, and L is the frame length in sampling points, calculated as: sampling rate (fs) × frame length (ft) / 1000.
As an example, the sampling frequency in the embodiment of the present application may be 16000 Hz and the frame length ft may be 32 ms, i.e., the frame length L may be 16000 × 32/1000 = 512 sampling points. In addition, the frame shift may be 50% of the frame length, i.e., the frame shift inc may be 256 sampling points, and the window function w(n) may be a Hanning window.
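The framing step can be sketched as follows: a minimal illustration of the overlap-windowing formula with fs = 16000 Hz, 32 ms frames, 50% frame shift, and a Hanning window (function names are hypothetical):

```python
import numpy as np

fs, ft = 16000, 32                 # sampling rate (Hz) and frame length (ms)
L = fs * ft // 1000                # frame length in samples: 512
inc = L // 2                       # 50% frame shift: 256 samples

def frame_signal(spk, frame_len=L, hop=inc):
    """Split a 1-D signal into overlapping Hanning-windowed frames:
    spk_f(m, n) = spk((m-1)*inc + n) * w(n)."""
    w = np.hanning(frame_len)
    n_frames = 1 + (len(spk) - frame_len) // hop
    return np.stack([spk[m * hop : m * hop + frame_len] * w
                     for m in range(n_frames)])

frames = frame_signal(np.random.randn(fs))   # one second of audio
```

One second of audio at these settings yields 61 frames of 512 samples each.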
On the basis, the embodiment of the application can calculate the energy sum of each audio frame contained in the speaker reference signal, and the calculation formula of the energy sum is as follows.
spk_e(m)=∑_{n=0}^{L-1} spk_f(m,n)²;
Where spk_e(m) is the energy sum of the mth audio frame of the speaker reference signal, m denotes the frame number index, n denotes the data point index of the audio data, and L is the frame length in sampling points.
As an optional manner, after the energy sum of each audio frame contained in the speaker reference signal is obtained, the embodiment of the present application may determine whether the energy sum of each audio frame is less than a preset energy, so as to determine whether the earphone is playing content. When the energy sum of the audio frames is determined to be less than the preset energy, it is determined that the earphone is not playing content, and the spectrum amplitude difference can be calculated directly at this time.
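A minimal sketch of this energy gate, assuming a hypothetical preset energy threshold (the patent does not fix a concrete value):

```python
import numpy as np

PRESET_ENERGY = 1e-4   # hypothetical silence threshold; tune per device

def frame_energy(frames):
    """Energy sum per frame: spk_e(m) = sum_n spk_f(m, n)^2."""
    return np.sum(frames ** 2, axis=1)

def speaker_is_silent(frames, preset=PRESET_ENERGY):
    """True when every frame's energy is below the preset energy,
    i.e. the earphone is not currently playing content."""
    return bool(np.all(frame_energy(frames) < preset))

silent = speaker_is_silent(np.zeros((4, 512)))
```

When `speaker_is_silent` returns True, the spectrum amplitude difference is computed directly; otherwise echo cancellation is applied first, as described below.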
Optionally, if it is determined that the energy sum of the audio frames is greater than or equal to the preset energy, it is determined that the earphone is playing content. At this time, echo cancellation may be performed on the environment frequency domain signal and the ear canal frequency domain signal based on the frequency domain amplitude corresponding to the speaker reference signal, so as to obtain the echo-cancelled environment frequency domain signal and ear canal frequency domain signal. On this basis, the spectrum amplitude difference is obtained from the echo-cancelled environment frequency domain signal and ear canal frequency domain signal.
Before echo cancellation is performed on the environment frequency domain signal and the ear canal frequency domain signal based on the frequency domain amplitude value corresponding to the speaker reference signal, the embodiment of the application can perform fast Fourier transform on the speaker reference signal to obtain a frame-level spk frequency domain amplitude value signal, and thus the frequency domain amplitude value corresponding to the speaker reference signal is obtained. The specific calculation formula of the spk frequency domain amplitude signal at the frame level is shown below.
SPK(m,k)=|FFT(spk_f(m))(k)|;
Where SPK(m) denotes the speaker reference frequency domain amplitude signal of the mth frame, and SPK(m, k) denotes the kth frequency-domain sample point (frequency bin) of the mth frame speaker reference frequency domain amplitude signal.
On this basis, the embodiment of the present application may perform echo cancellation on the environment frequency domain signal and the ear canal frequency domain signal based on the frequency domain amplitude corresponding to the speaker reference signal to obtain the environment frequency domain signal and the ear canal frequency domain signal after echo cancellation, and the equation for performing echo cancellation on the environment frequency domain signal and the ear canal frequency domain signal may be as follows.
FF′(m,k)=AecFunction(FF(m,k),SPK(m,k));
FB′(m,k)=AecFunction(FB(m,k),SPK(m,k));
Where FF′(m) represents the echo-cancelled environment frequency domain signal, FB′(m) represents the echo-cancelled ear canal frequency domain signal, and AecFunction represents the echo cancellation process.
Therefore, the spectrum amplitude difference can be obtained from the echo-cancelled environment frequency domain signal and ear canal frequency domain signal; its calculation formula at this time is shown below.
FC(m,k)=FF′(m,k)-FB′(m,k);
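The patent does not specify AecFunction. As a hedged stand-in, a simple magnitude spectral subtraction of the speaker reference illustrates the shape of the computation; all names and the subtraction scheme itself are assumptions for illustration only:

```python
import numpy as np

def aec_function(mic_mag, spk_mag, alpha=1.0):
    """Hypothetical stand-in for the unspecified AecFunction:
    magnitude spectral subtraction of the speaker reference, floored at 0."""
    return np.maximum(mic_mag - alpha * spk_mag, 0.0)

def spectral_diff_with_aec(ff_mag, fb_mag, spk_mag):
    """FC(m, k) = FF'(m, k) - FB'(m, k) after echo cancellation."""
    ff_clean = aec_function(ff_mag, spk_mag)   # FF'(m, k)
    fb_clean = aec_function(fb_mag, spk_mag)   # FB'(m, k)
    return ff_clean - fb_clean

fc = spectral_diff_with_aec(np.full(257, 2.0), np.full(257, 1.0),
                            np.full(257, 0.5))
```

A production earphone would use an adaptive filter for AEC; the point here is only the data flow: both microphone spectra are cleaned against the same speaker reference before the difference is taken.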
In summary, the spectrum amplitude difference is obtained in different ways depending on whether the earphone is playing content, which can improve the accuracy of earphone control.
In step S24, if it is determined that there is voice activity in the user wearing the headset according to the spectrum amplitude difference and a preset voice detection policy, a power consumption mode of the headset is obtained.
In the embodiment of the application, the power consumption modes of the earphone can comprise a low power consumption mode and a high performance mode. When the power consumption mode of the earphone is obtained, the power consumption mode preset by a user can be obtained according to the embodiment of the application.
Optionally, when the power consumption mode of the headset is acquired, the embodiment of the application may also acquire the performance parameter of the headset, and determine the corresponding power consumption mode according to the performance parameter of the headset. The earphone performance parameter may include at least one of a remaining battery capacity, a Central Processing Unit (CPU) load, and the like.
As a specific implementation manner, when determining the corresponding power consumption mode according to the performance parameter of the headset, the embodiment of the application may determine whether the remaining power of the headset is greater than a preset power. When the remaining power of the headset is determined to be greater than the preset power, the power consumption mode of the headset is determined to be the high performance mode; when the remaining power is determined to be less than or equal to the preset power, the power consumption mode of the headset is determined to be the low power consumption mode.
As another specific implementation manner, when determining the corresponding power consumption mode according to the performance parameter of the headset, the embodiment of the application may determine whether the CPU load of the headset is less than a preset load. When the CPU load of the headset is determined to be less than the preset load, the power consumption mode of the headset is determined to be the high performance mode; when the CPU load is determined to be greater than or equal to the preset load, the power consumption mode of the headset is determined to be the low power consumption mode.
As another specific implementation manner, when the power consumption mode preset by the user is acquired, if the high performance detection indication information input by the user is received, the power consumption mode of the headset is determined to be the high performance mode. And if the low power consumption detection indication information input by the user is received, determining that the power consumption mode of the earphone is the low power consumption mode. The high-performance detection indication information and the low-power detection indication information may be information input by a user through operating the headset, or indication information sent to the headset through the terminal device by the user through operating the terminal device.
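A sketch of the power-consumption-mode selection logic described above; the preset power and load thresholds below are hypothetical values chosen only for illustration:

```python
LOW_POWER, HIGH_PERF = "low_power", "high_performance"

# Hypothetical thresholds; the patent does not fix concrete values.
PRESET_BATTERY = 0.2    # 20% remaining charge
PRESET_CPU_LOAD = 0.7   # 70% CPU load

def power_mode(battery=None, cpu_load=None, user_choice=None):
    """Pick the power consumption mode: a user preset wins; otherwise
    derive it from the remaining battery charge or the CPU load."""
    if user_choice is not None:
        return user_choice
    if battery is not None:
        return HIGH_PERF if battery > PRESET_BATTERY else LOW_POWER
    if cpu_load is not None:
        return HIGH_PERF if cpu_load < PRESET_CPU_LOAD else LOW_POWER
    return LOW_POWER   # conservative default when nothing is known

mode = power_mode(battery=0.8)
```

With 80% battery remaining, the headset runs the high performance detection; a heavily loaded CPU or an explicit user choice forces the low power path instead.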
As an optional manner, after the power consumption mode of the headset is obtained, if it is determined that the power consumption mode of the headset is the high performance mode, in this embodiment of the application, whether voice activity exists in the user wearing the headset may be determined according to the difference in the spectrum amplitude and a preset first voice detection policy, that is, step S25 is performed. In addition, if it is determined that the power consumption mode of the headset is the low power consumption mode, in this embodiment of the application, whether voice activity exists in the user wearing the headset may be determined according to the difference in the spectrum amplitude and a preset second voice detection policy, that is, step S26 is performed.
In step S25, if the power consumption mode is the high performance mode, it is determined whether there is voice activity for the user wearing the headset according to the spectrum amplitude difference and a preset first voice detection policy.
As an optional manner, when it is determined that the power consumption mode of the headset is the high performance mode, the embodiment of the application may determine whether voice activity exists in a user wearing the headset according to the difference in the spectrum amplitude and a preset first voice detection policy. Specifically, the embodiment of the application can input the frequency spectrum amplitude difference to a preset voice detection model to obtain a voice detection result.
The speech detection model may be obtained by performing machine learning training on the initial model using a training set, where the training set may include a positive sample data set and a negative sample data set. In addition, the positive sample data set may include a plurality of positive sample data, and each positive sample data may include first audio data containing a user's speaking sound in a state where the user is wearing the headset and a first tag corresponding to the first audio data for indicating that the corresponding first audio data is the target audio.
Alternatively, the negative sample data set may include a plurality of negative sample data, each of which may include second audio data and a second tag corresponding to the second audio data, wherein the second audio data does not contain the user's speaking sound in a state where the user is wearing the headphone, and the second tag is used to indicate that the corresponding second audio data is not the target audio.
On the basis, the embodiment of the application can determine whether voice activity exists in the user wearing the headset based on the voice detection result. In the embodiment of the present application, the speech detection Model may include a Gaussian Mixture Model (GMM) or a Bayesian Gaussian Mixture Model (BGMM).
As an optional way, before the spectral amplitude difference is input into the speech detection model, the speech detection model may be trained, that is, machine learning training is performed on the initial model using a training set to obtain the speech detection model. Specifically, the positive and negative sample data sets may be created first, that is, recorded using the feedforward microphone, the feedback microphone, and the speaker. The audio sampling frequency for recording the positive and negative sample data sets may be 16000 Hz.
As is known from the above description, the positive sample data set may be audio of the wearer speaking while wearing the earphone; it may include recordings of no fewer than 10 people, with equal proportions of male and female speakers. Optionally, the speech text may be conversational phrases for a scene. In addition, the negative sample data set may be audio recorded while the earphone is worn that does not contain the wearer's speech, such as coughing, other people speaking, and environmental sound.
On this basis, the present application can extract features and construct a training set and a test set: the recorded sample data sets are used to calculate frame-level spectrum amplitude difference features, the features are standardized, and the processed data is used as the input of the initial model, which is trained on the training set to obtain the speech detection model. In addition, when the positive and negative sample data sets are constructed, a corresponding label can be generated for each frame: the positive sample label may be 1 and the negative sample label may be 0, and the corresponding frame-level labels serve as another type of training input when training the speech detection model.
As an example, when the voice detection model is a Gaussian mixture model, the number of Gaussian components in the model may be 20, 30, or 40; in the embodiment of the present application it is preferably 30. The maximum number of iterations of the EM (Expectation-Maximization) algorithm used to fit the Gaussian mixture model may be 150.
In the embodiment of the application, obtaining the voice detection result with the voice detection model is in essence a two-class recognition task: the positive and negative sample data used to train the voice detection model are each assumed to be independent and identically distributed, the two classes of data are used to train one GMM each, and two corresponding GMMs are finally obtained.
As an optional manner, in the case that the power consumption mode of the headset is determined to be the high performance mode, the embodiment of the application may input the spectrum amplitude difference into the voice detection model to obtain a voice detection result. The voice detection result may include log-likelihood values: after receiving the spectrum amplitude difference, the voice detection model computes two log-likelihood values, corresponding to the positive-sample model and the negative-sample model respectively. When the likelihood value corresponding to the positive sample is larger than that corresponding to the negative sample, the user wearing the earphone has voice activity, and at this time the mode of the earphone can be controlled to switch to the pass-through mode.
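A sketch of the two-GMM decision rule, using toy single-component, diagonal-covariance models in place of the trained 30-component GMMs; all parameters below are illustrative assumptions:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a feature vector x under a diagonal-covariance GMM
    given component weights, means and variances."""
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    m = np.max(log_comp)                       # log-sum-exp for stability
    return m + np.log(np.sum(np.exp(log_comp - m)))

def detect_voice(fc_frame, speech_gmm, nonspeech_gmm):
    """Voice activity when the positive-sample (speech) GMM's
    log-likelihood exceeds the negative-sample GMM's."""
    return gmm_loglik(fc_frame, *speech_gmm) > gmm_loglik(fc_frame, *nonspeech_gmm)

# Toy 1-D, single-component "GMMs" standing in for the trained models.
speech_gmm = (np.array([1.0]), np.array([[1.0]]), np.array([[0.5]]))
nonspeech_gmm = (np.array([1.0]), np.array([[-1.0]]), np.array([[0.5]]))
is_voice = detect_voice(np.array([0.9]), speech_gmm, nonspeech_gmm)
```

A feature near the speech model's mean scores higher under the speech GMM, so the frame is classified as voice activity.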
In step S26, if the power consumption mode is the low power consumption mode, it is determined whether there is voice activity for the user wearing the headset according to the spectrum amplitude difference and a preset second voice detection strategy.
As another alternative, if it is determined that the power consumption mode of the headset is the low power consumption mode, in this embodiment of the application, whether voice activity exists in the user wearing the headset may be determined according to the difference in the spectrum amplitude and a preset second voice detection policy. Wherein the spectral magnitude difference may comprise a plurality of sub-magnitude differences.
Specifically, in the embodiment of the present application, the perceptual weighting value may be obtained, where the perceptual weighting value may be a sum of products of each sub-amplitude difference and a corresponding perceptual weighting coefficient, and the perceptual weighting coefficient is a weighting value corresponding to different frequency bands. On this basis, it is determined whether voice activity is present for the user wearing the headset based on a perceptual weighting value, which may be calculated as follows.
fc_am(m)=∑_k fc_w(k)*FC(m,k);
Where fc_am(m) represents the perceptual weighting value and fc_w(k) represents the perceptual weighting coefficient of the kth frequency bin used to calculate it. The perceptual weighting coefficients may differ between earphones and are set based on the actual situation.
In the embodiment of the present application, the speech detection result of the headphone user per frame may be u _ vad (m), and if the value of u _ vad (m) is 1, it indicates that there is speech activity of the headphone user in this frame. If u _ vad (m) has a value of 0, this indicates that there is no headset user speech activity for this frame.
Alternatively, when the perceptual weighting value fc _ am (m) is greater than the preset threshold, the value of u _ vad (m) may be set to 1; when the perceptual weighting value fc _ am (m) is less than or equal to the preset threshold value, the value of u _ vad (m) may be set to 0. The preset threshold may be a threshold for determining voice activity of each frame of the headset user in the low power consumption mode.
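A sketch of the low-power per-frame decision; the threshold and the flat weighting vector below are hypothetical placeholders, since both are device-specific:

```python
import numpy as np

PRESET_THRESHOLD = 3.0   # hypothetical; the patent leaves it device-specific

def u_vad(fc_frame, weights, threshold=PRESET_THRESHOLD):
    """Low-power per-frame decision: fc_am(m) is the weighted sum of the
    sub-amplitude differences; u_vad(m) = 1 when it exceeds the threshold."""
    fc_am = float(np.dot(weights, fc_frame))   # perceptual weighting value
    return 1 if fc_am > threshold else 0

w = np.ones(257) / 257   # flat perceptual weights, for illustration only
flag = u_vad(np.full(257, 4.0), w)
```

This is only a dot product and a comparison per frame, which is what keeps the algorithmic cost low in the low power consumption mode.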
In the embodiment of the application, when determining whether the user wearing the earphone has voice activity in either the low power consumption mode or the high performance mode, a specified number of consecutive frames can be counted to obtain the voice activity detection result. If more than a specified percentage of those per-frame results indicate earphone voice activity, it is determined that the user wearing the earphone has voice activity, and the mode of the earphone is controlled to switch to the pass-through mode, that is, step S27 is entered. As one example, the specified number may be 64 frames of data, which may correspond to 1 second, and the specified percentage may be 80%; that is, when more than 80% of the results indicate earphone voice activity, it is determined that the user wearing the earphone has voice activity.
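The windowed-majority decision can be sketched as follows (the 64-frame window and 80% ratio follow the example above; the function name is hypothetical):

```python
def aggregate_vad(frame_flags, min_ratio=0.8):
    """Final decision over a window of per-frame results (e.g. 64 frames,
    about one second): voice activity when more than min_ratio are 1."""
    return sum(frame_flags) > min_ratio * len(frame_flags)

active = aggregate_vad([1] * 56 + [0] * 8)   # 56/64 = 87.5% > 80%
```

Requiring a supermajority of frames suppresses spurious single-frame triggers such as a brief cough.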
By means of the spectral characteristics of the feedforward microphone and the feedback microphone, the embodiment of the application spares the user from having to take the earphone off, and the user can choose between the low power consumption mode and the high performance mode. In the low power consumption mode, the computational requirements of the algorithm are low, which effectively reduces the impact on the earphone's battery life; in the high performance mode, the statistical learning approach effectively reduces false detections in vocal-cord-vibration scenarios such as coughing. Both can effectively improve the user experience.
In step S27, if it is determined that there is voice activity in the user wearing the headset according to the spectrum amplitude difference and a preset voice detection policy, the mode of the headset is controlled to switch to the transparent mode.
It is known from the above description that, in the case that it is determined that there is voice activity in the user wearing the headset based on the voice detection result, the mode of the headset may be controlled to switch to the pass-through mode in the embodiment of the present application. In the process, the embodiment of the application can acquire the type of the voice activity and then determine whether the type of the voice activity is a conversation type. If the type of the voice activity is determined to be a conversation type, the embodiment of the application can control the mode of the earphone to be switched into a transparent mode.
As an alternative, obtaining the type of the voice activity may include: inputting the ear canal frequency domain signal into a semantic recognition model to obtain a semantic recognition result. The semantic recognition model can be obtained by performing machine learning training on an initial semantic recognition model using a semantic training set, where the semantic training set includes a dialog sample data set and a non-dialog sample data set.
In this embodiment of the application, the dialog sample data set may include a plurality of dialog sample data, each dialog sample data includes third audio data and a third tag corresponding to the third audio data, the third audio data contains a sound of a dialog between a user wearing the headset and another person, and the third tag is used to indicate that the corresponding third audio data is a target dialog audio. In addition, the non-dialog sample data set includes a plurality of non-dialog sample data, each non-dialog sample data includes fourth audio data and a fourth tag corresponding to the fourth audio data, the fourth audio data contains sound of a user wearing the headset not having a dialog with others, and the fourth tag is used for indicating that the corresponding fourth audio data is not a target dialog audio. On the basis, the type of the voice activity corresponding to the semantic recognition result is obtained. The sound of the user wearing the headset not having a conversation with other people may include a self-speaking sound, or a singing sound of the user wearing the headset.
Optionally, in a case where it is determined that there is voice activity in the user wearing the headset based on the voice detection result, if it is determined that the voice activity is of a non-conversation type, the embodiment of the present application may keep the mode of the headset unchanged. As an example, in the case that it is determined that there is voice activity of the user wearing the headset, if it is determined that the voice activity is self-speaking, the embodiment of the present application may keep the mode of the headset unchanged.
As another example, in a case that it is determined that there is voice activity of a user wearing an earphone, if it is determined that the voice activity is singing, the embodiment of the present application may keep the mode of the earphone unchanged, so that it may be avoided that the mode switching affects normal use of the earphone by the user.
It should be noted that, after the mode of the earphone is controlled to switch to the pass-through mode, the embodiment of the present application may continuously monitor whether the voice activity of the user wearing the earphone has ended. If it is determined that the voice activity has ended, the embodiment of the present application may switch the earphone from the pass-through mode back to the active noise reduction mode a specified duration after the voice activity ends. For example, 15 s after detecting the end of voice activity, the embodiment of the present application may switch the mode of the earphone back to the active noise reduction mode.
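A sketch of the mode switching with the hold time, assuming the caller supplies timestamps; class and constant names are hypothetical:

```python
ANC, TRANSPARENT = "active_noise_reduction", "pass_through"
HOLD_SECONDS = 15   # example hold time from the text

class ModeController:
    """Switches to pass-through on voice activity and back to ANC once
    no voice activity has been seen for HOLD_SECONDS."""
    def __init__(self):
        self.mode = ANC
        self.last_voice_t = None

    def update(self, voice_active, now):
        if voice_active:
            self.mode = TRANSPARENT
            self.last_voice_t = now
        elif (self.mode == TRANSPARENT
              and now - self.last_voice_t >= HOLD_SECONDS):
            self.mode = ANC
        return self.mode

ctl = ModeController()
ctl.update(True, now=0.0)                    # voice detected -> pass-through
mode_at_20 = ctl.update(False, now=20.0)     # 20 s of silence -> back to ANC
```

The hold time prevents the earphone from flapping between modes during natural pauses in a conversation.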
When the environment frequency domain signal and the ear canal frequency domain signal are acquired, the spectrum amplitude difference can be obtained from them. On this basis, if it is determined according to the spectrum amplitude difference and a preset voice detection strategy that the user wearing the earphone has voice activity, the mode of the earphone is controlled to switch to the pass-through mode. By using the spectrum amplitude difference and the preset voice detection strategy to determine whether the user wearing the earphone has voice activity, the present application not only improves the flexibility of earphone control but also reduces its power consumption; control of the earphone is achieved without user intervention, which in turn improves the user experience. In addition, the earphone in the embodiment of the application can automatically switch from the active noise reduction mode to the pass-through mode, so that the user can better communicate with others during a conversation without taking off the earphone, thereby providing a better intelligent experience for the user.
Fig. 5 is a block diagram illustrating a control apparatus 300 for a headset according to an exemplary embodiment. Referring to fig. 5, the apparatus includes a first obtaining module 301, a second obtaining module 302, a determining module 303, and a switching module 304.
The first obtaining module 301 is configured to obtain the environment frequency domain signal, where the environment frequency domain signal is a frequency domain representation of a sound signal in the environment around the earphone;
the second obtaining module 302 is configured to obtain the ear canal frequency domain signal, where the ear canal frequency domain signal is a frequency domain representation of a sound signal in an ear canal of a user wearing the earphone;
the determining module 303 is configured to derive a spectral amplitude difference from the ambient frequency domain signal and the ear canal frequency domain signal;
the switching module 304 is configured to control the mode of the headset to switch to the pass-through mode if it is determined that there is voice activity in the user wearing the headset according to the spectrum amplitude difference and a preset voice detection policy.
In some embodiments, the switching module 304 may include:
a mode acquisition submodule configured to acquire power consumption modes of the headset, the power consumption modes including a low power consumption mode and a high performance mode;
a first determining sub-module configured to determine whether voice activity exists in a user wearing the headset according to the spectrum amplitude difference and a preset first voice detection strategy if the power consumption mode is a high performance mode;
a second determining sub-module configured to determine whether there is voice activity for a user wearing the headset according to the spectrum amplitude difference and a preset second voice detection policy if the power consumption mode is a low power consumption mode.
In some embodiments, the mode obtaining submodule may be further configured to obtain a power consumption mode preset by a user, or obtain a performance parameter of the headset, and determine a corresponding power consumption mode according to the performance parameter of the headset, where the performance parameter of the headset includes at least one of a remaining battery capacity and a CPU load.
In some embodiments, the first determining sub-module may be further configured to input the spectral amplitude difference into a preset speech detection model to obtain a speech detection result, and to determine whether the user wearing the headset has voice activity based on the speech detection result. The speech detection model is obtained by performing machine learning training on an initial model using a training set that includes a positive sample data set and a negative sample data set. The positive sample data set includes a plurality of positive sample data, each including first audio data and a corresponding first label; the first audio data contains the speaking sound of a user wearing the headset, and the first label indicates that the corresponding first audio data is target audio. The negative sample data set includes a plurality of negative sample data, each including second audio data and a corresponding second label; the second audio data does not contain the speaking sound of a user wearing the headset, and the second label indicates that the corresponding second audio data is not target audio.
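The disclosure does not fix the model architecture. As a hedged illustration, the sketch below builds the labeled training set as described and stands in a trivial nearest-centroid classifier for the trained speech detection model (any machine-learned classifier could take its place):

```python
import numpy as np

def build_training_set(worn_speech_diffs, other_diffs):
    # Positive samples: spectral-difference features captured while the
    # wearer was speaking (label 1, "target audio"). Negative samples:
    # captures without wearer speech (label 0).
    X = [np.asarray(d, dtype=float) for d in list(worn_speech_diffs) + list(other_diffs)]
    y = [1] * len(worn_speech_diffs) + [0] * len(other_diffs)
    return np.stack(X), np.array(y)

class NearestCentroidVAD:
    # Trivial stand-in for the trained speech detection model; the
    # disclosure permits any machine-learned classifier here.
    def fit(self, X, y):
        self.pos = X[y == 1].mean(axis=0)
        self.neg = X[y == 0].mean(axis=0)
        return self

    def predict(self, diff):
        # Returns 1 (voice activity) if the feature vector is closer to
        # the positive centroid than to the negative one.
        d = np.asarray(diff, dtype=float)
        return int(np.linalg.norm(d - self.pos) < np.linalg.norm(d - self.neg))
```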
In some embodiments, the spectral amplitude difference comprises a plurality of sub-amplitude differences, and the second determining sub-module may be further configured to obtain a perceptual weighting value, which is the sum of the products of each sub-amplitude difference and its corresponding perceptual weighting coefficient, where the perceptual weighting coefficients are weight values assigned to different frequency bands; and to determine whether the user wearing the headset has voice activity based on the perceptual weighting value.
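The perceptual weighting computation itself is a plain weighted sum. In the sketch below the weights and the decision threshold are illustrative; the disclosure only states that the coefficients weight different frequency bands:

```python
def perceptual_weighting_value(sub_amplitude_diffs, weights):
    # Sum of each sub-band amplitude difference times its band weight,
    # as the second (low-power) detection strategy describes.
    if len(sub_amplitude_diffs) != len(weights):
        raise ValueError("one weight per sub-band is required")
    return sum(d * w for d, w in zip(sub_amplitude_diffs, weights))

def voice_activity_low_power(sub_diffs, weights, threshold=-5.0):
    # Hypothetical threshold: a strongly negative weighted sum means the
    # ear-canal microphone out-levels the ambient microphone in the
    # weighted bands, suggesting wearer speech.
    return perceptual_weighting_value(sub_diffs, weights) < threshold
```

Compared with the model-based strategy, this costs only one multiply-accumulate per sub-band, which is why it suits the low power consumption mode.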
In some embodiments, the switching module 304 may further include:
the type obtaining sub-module is configured to obtain the type of the voice activity if it is determined, according to the spectral amplitude difference and a preset voice detection strategy, that the user wearing the headset has voice activity;
a switching sub-module configured to control the mode of the headset to switch to the pass-through mode if the type of the voice activity is a conversation type;
a maintaining sub-module configured to keep the mode of the headset unchanged if the type of the voice activity is a non-conversation type.
In some embodiments, the determining module 303 may include:
a reference signal obtaining sub-module configured to obtain a speaker reference signal, where the speaker reference signal is the original sound signal of the audio to be played by a speaker of the headset, and the speaker reference signal comprises a plurality of audio frames;
an energy sum obtaining sub-module configured to determine the energy sum of the audio frames contained in the speaker reference signal;
a first spectral amplitude difference obtaining sub-module configured to obtain the spectral amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal if the energy sum of the audio frames is less than a preset energy.
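A sketch of this energy gate, with a hypothetical energy threshold; energy here is taken as the sum of squared samples, one common convention the disclosure does not mandate:

```python
import numpy as np

def frame_energy_sum(reference_frames):
    # Total energy across the audio frames of the speaker reference
    # signal, taking energy as the sum of squared samples per frame.
    return float(sum(np.sum(np.square(np.asarray(f, dtype=float)))
                     for f in reference_frames))

def needs_echo_cancellation(reference_frames, energy_threshold=1e-3):
    # Below the (hypothetical) threshold the speaker is effectively
    # silent, so the spectral difference can be computed directly;
    # otherwise speaker playback must first be echo-cancelled.
    return frame_energy_sum(reference_frames) >= energy_threshold
```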
In some embodiments, the determining module 303 may further include:
the echo cancellation sub-module is configured to perform echo cancellation on the environment frequency domain signal and the ear canal frequency domain signal based on the frequency domain amplitude corresponding to the speaker reference signal, if the energy sum of the audio frames is greater than or equal to the preset energy, to obtain an echo-cancelled environment frequency domain signal and an echo-cancelled ear canal frequency domain signal;
and the second spectral amplitude difference obtaining sub-module is configured to obtain the spectral amplitude difference according to the echo-cancelled environment frequency domain signal and the echo-cancelled ear canal frequency domain signal.
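The disclosure specifies only that cancellation is based on the frequency domain amplitude of the speaker reference. One simple realization is magnitude-domain spectral subtraction, sketched below with an assumed echo-path gain parameter `leak`:

```python
import numpy as np

def cancel_echo_magnitude(mic_mag, speaker_mag, leak=1.0):
    # Magnitude-domain spectral subtraction: remove the scaled speaker
    # reference magnitude from the microphone magnitude and clip at
    # zero. `leak` models the echo-path gain and is an assumption; the
    # disclosure only requires cancellation based on the reference's
    # frequency domain amplitude.
    mic = np.asarray(mic_mag, dtype=float)
    spk = np.asarray(speaker_mag, dtype=float)
    return np.maximum(mic - leak * spk, 0.0)
```

In practice `leak` would differ per microphone (the speaker couples far more strongly into the in-canal feedback microphone than into the ambient one) and would typically be estimated adaptively.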
In some embodiments, the first obtaining module 301 may include:
a first time domain signal obtaining sub-module configured to obtain an ambient time domain signal, where the ambient time domain signal is a time-domain representation of a sound signal in the environment around the headset;
a first time-frequency conversion sub-module configured to perform time-frequency conversion on the ambient time domain signal to obtain the environment frequency domain signal;
the second obtaining module 302 may include:
a second time domain signal obtaining sub-module configured to obtain an ear canal time domain signal, which is a time-domain representation of a sound signal in the ear canal of a user wearing the headset;
and a second time-frequency conversion sub-module configured to perform time-frequency conversion on the ear canal time domain signal to obtain the ear canal frequency domain signal.
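The time-frequency conversion in both branches can be realized as a windowed FFT per frame; the Hann window and the frame length in this sketch are implementation choices the disclosure leaves open:

```python
import numpy as np

def to_frequency_domain(time_signal, frame_len):
    # Windowed real FFT of one frame: the time-frequency conversion both
    # sub-modules perform. Returns frame_len // 2 + 1 complex bins whose
    # magnitudes feed the spectral amplitude difference.
    x = np.asarray(time_signal[:frame_len], dtype=float) * np.hanning(frame_len)
    return np.fft.rfft(x)
```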
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the control method of a headset provided by the present disclosure.
Fig. 6 is a block diagram of a headset 800 according to an exemplary embodiment. The headset 800 may include a speaker, a feedforward microphone, and a feedback microphone. The headset 800 may be a wireless earphone or a wired earphone, but is not limited thereto.
Referring to fig. 6, the headset 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the headset 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or some of the steps of the headset control method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the headset 800. Examples of such data include instructions for any application or method operating on the headset 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the headset 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the headset 800.
The multimedia component 808 includes a screen that provides an output interface between the headset 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the headset 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the headset 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The input/output interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the headset 800. For example, the sensor assembly 814 may detect the open/closed state of the headset 800 and the relative positioning of components such as the display and keypad of the headset 800. The sensor assembly 814 may also detect a change in the position of the headset 800 or of a component of the headset 800, the presence or absence of user contact with the headset 800, the orientation or acceleration/deceleration of the headset 800, and a change in the temperature of the headset 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the headset 800 and other devices. The headset 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the headset 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the control method of the headset.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the headset 800 to perform the headset control method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned control method of a headset when executed by the programmable apparatus.

Claims (12)

1. A method of controlling a headset, comprising:
acquiring an environment frequency domain signal, wherein the environment frequency domain signal is a frequency domain representation of a sound signal in the environment around the earphone;
acquiring an ear canal frequency domain signal, wherein the ear canal frequency domain signal is a frequency domain representation of a sound signal in an ear canal of a user wearing the earphone;
obtaining a spectral amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal, wherein the spectral amplitude difference represents a difference between the amplitude of the environment frequency domain signal and the amplitude of the ear canal frequency domain signal;
and if it is determined, according to the spectral amplitude difference and a preset voice detection strategy, that the user wearing the earphone has voice activity, controlling the mode of the earphone to switch to a pass-through mode.
2. The method for controlling the earphone according to claim 1, further comprising, before controlling the mode of the earphone to switch to the pass-through mode:
acquiring power consumption modes of the earphone, wherein the power consumption modes comprise a low power consumption mode and a high performance mode;
if the power consumption mode is a high-performance mode, determining whether voice activity exists in a user wearing the earphone according to the frequency spectrum amplitude difference and a preset first voice detection strategy;
and if the power consumption mode is a low power consumption mode, determining whether voice activity exists in the user wearing the earphone according to the frequency spectrum amplitude difference and a preset second voice detection strategy.
3. The method of claim 2, wherein the obtaining the power consumption mode of the headset comprises:
obtaining a power consumption mode preset by a user; or
obtaining a performance parameter of the earphone and determining a corresponding power consumption mode according to the performance parameter of the earphone, wherein the performance parameter of the earphone comprises at least one of a remaining battery capacity and a CPU load.
4. The method for controlling a headset according to any one of claims 2-3, wherein the determining whether voice activity exists in the user wearing the headset according to the spectrum amplitude difference and a preset first voice detection strategy comprises:
inputting the spectral amplitude difference into a preset voice detection model to obtain a voice detection result, wherein the voice detection model is obtained by performing machine learning training on an initial model using a training set, the training set comprising a positive sample data set and a negative sample data set; the positive sample data set comprises a plurality of positive sample data, each positive sample data comprising first audio data and a first label corresponding to the first audio data, the first audio data comprising the speaking sound of a user wearing the earphone, and the first label indicating that the corresponding first audio data is target audio; the negative sample data set comprises a plurality of negative sample data, each negative sample data comprising second audio data and a second label corresponding to the second audio data, the second audio data not comprising the speaking sound of a user wearing the earphone, and the second label indicating that the corresponding second audio data is not target audio;
determining whether voice activity exists for a user wearing the headset based on the voice detection result.
5. The method of controlling a headset according to any one of claims 2-3, wherein the spectral amplitude difference comprises a plurality of sub-amplitude differences, and wherein the determining whether voice activity is present in the user wearing the headset according to the spectral amplitude difference and a preset second voice detection strategy comprises:
acquiring a perceptual weighting value, wherein the perceptual weighting value is the sum of the products of each sub-amplitude difference and its corresponding perceptual weighting coefficient, and the perceptual weighting coefficients are weight values corresponding to different frequency bands;
determining whether voice activity is present for a user wearing the headset based on the perceptual weighting values.
6. The method of claim 1, wherein if it is determined that there is voice activity in the user wearing the headset according to the spectrum amplitude difference and a preset voice detection policy, controlling the mode of the headset to switch to a pass-through mode comprises:
if it is determined, according to the spectral amplitude difference and a preset voice detection strategy, that the user wearing the earphone has voice activity, acquiring the type of the voice activity;
if the type of the voice activity is a conversation type, controlling the mode of the earphone to switch to a pass-through mode;
and if the type of the voice activity is a non-conversation type, keeping the mode of the earphone unchanged.
7. The method for controlling the earphone according to claim 1, wherein the obtaining the spectral amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal comprises:
acquiring a speaker reference signal, wherein the speaker reference signal is the original sound signal of the audio to be played by a speaker of the earphone, and the speaker reference signal comprises a plurality of audio frames;
determining the energy sum of the audio frames contained in the speaker reference signal;
and if the energy sum of the audio frames is less than a preset energy, obtaining the spectral amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal.
8. The method of controlling a headset of claim 7, further comprising:
if the energy sum of the audio frames is greater than or equal to the preset energy, performing echo cancellation on the environment frequency domain signal and the ear canal frequency domain signal based on the frequency domain amplitude corresponding to the speaker reference signal, to obtain an echo-cancelled environment frequency domain signal and an echo-cancelled ear canal frequency domain signal;
and obtaining the spectral amplitude difference according to the echo-cancelled environment frequency domain signal and the echo-cancelled ear canal frequency domain signal.
9. The method for controlling the earphone according to claim 1, wherein the acquiring the environment frequency domain signal comprises:
acquiring an environment time domain signal, wherein the environment time domain signal is a time-domain representation of a sound signal in the environment around the earphone;
performing time-frequency conversion on the environment time domain signal to obtain the environment frequency domain signal;
the acquiring the ear canal frequency domain signal comprises:
acquiring an ear canal time domain signal, wherein the ear canal time domain signal is a time-domain representation of a sound signal in an ear canal of a user wearing the earphone;
and performing time-frequency conversion on the ear canal time domain signal to obtain the ear canal frequency domain signal.
10. A control device for a headset, the device comprising:
a first obtaining module configured to obtain an environment frequency domain signal, where the environment frequency domain signal is a frequency domain representation of a sound signal in an environment around the headset;
a second obtaining module configured to obtain an ear canal frequency domain signal, which is a frequency domain representation of a sound signal in an ear canal of a user wearing the earphone;
a determining module configured to obtain a spectral amplitude difference from the ambient frequency domain signal and the ear canal frequency domain signal, the spectral amplitude difference representing a difference between an amplitude of the ambient frequency domain signal and an amplitude of the ear canal frequency domain signal;
and a switching module configured to control the mode of the earphone to switch to a pass-through mode if it is determined, according to the spectral amplitude difference and a preset voice detection strategy, that the user wearing the earphone has voice activity.
11. An earphone, comprising:
a processor;
a memory for storing processor-executable instructions;
a feed-forward microphone and a feedback microphone;
wherein the processor is configured to:
acquiring an environment frequency domain signal, wherein the environment frequency domain signal is a frequency domain representation of a sound signal in the environment around the earphone;
acquiring an ear canal frequency domain signal, wherein the ear canal frequency domain signal is a frequency domain representation of a sound signal in an ear canal of a user wearing the earphone;
obtaining a spectral amplitude difference according to the environment frequency domain signal and the ear canal frequency domain signal, wherein the spectral amplitude difference represents a difference between the amplitude of the environment frequency domain signal and the amplitude of the ear canal frequency domain signal;
and if it is determined, according to the spectral amplitude difference and a preset voice detection strategy, that the user wearing the earphone has voice activity, controlling the mode of the earphone to switch to a pass-through mode.
12. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 9.
CN202211027381.3A 2022-08-25 2022-08-25 Earphone control method and device, earphone and computer readable storage medium Pending CN115396776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211027381.3A CN115396776A (en) 2022-08-25 2022-08-25 Earphone control method and device, earphone and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN115396776A true CN115396776A (en) 2022-11-25

Family

ID=84121943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211027381.3A Pending CN115396776A (en) 2022-08-25 2022-08-25 Earphone control method and device, earphone and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115396776A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117714942A (en) * 2023-08-31 2024-03-15 荣耀终端有限公司 Audio processing method, computer device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN105282345B (en) The adjusting method and device of In Call
US20200176013A1 (en) Method and device for spectral expansion of an audio signal
US20230045237A1 (en) Wearable apparatus for active substitution
CN107978316A (en) The method and device of control terminal
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN104991754A (en) Recording method and apparatus
US20180054688A1 (en) Personal Audio Lifestyle Analytics and Behavior Modification Feedback
US11741985B2 (en) Method and device for spectral expansion for an audio signal
JP2009178783A (en) Communication robot and its control method
CN112037825B (en) Audio signal processing method and device and storage medium
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
US11589173B2 (en) Hearing aid comprising a record and replay function
CN111988704B (en) Sound signal processing method, device and storage medium
CN115396776A (en) Earphone control method and device, earphone and computer readable storage medium
CN114095817B (en) Noise reduction method and device for earphone, earphone and storage medium
US11682412B2 (en) Information processing method, electronic equipment, and storage medium
CN115714948A (en) Audio signal processing method and device and storage medium
CN113810828A (en) Audio signal processing method and device, readable storage medium and earphone
CN113473304A (en) Howling suppression method, howling suppression device, earphone and storage medium
CN112866480A (en) Information processing method, information processing device, electronic equipment and storage medium
CN111736798A (en) Volume adjusting method, volume adjusting device and computer readable storage medium
CN111694539A (en) Method, apparatus and medium for switching between earpiece and speaker
EP4322548A1 (en) Earphone controlling method and apparatus, and storage medium
WO2024046416A1 (en) Volume adjustment method, electronic device and system
CN116320872A (en) Earphone mode switching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination