CN115278441A - Voice detection method, device, earphone and storage medium - Google Patents

Voice detection method, device, earphone and storage medium

Info

Publication number
CN115278441A
CN115278441A (application CN202211042440.4A)
Authority
CN
China
Prior art keywords
energy
value
voice signal
average
wearer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211042440.4A
Other languages
Chinese (zh)
Inventor
周岭松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202211042440.4A
Publication of CN115278441A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/10 Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1016 Earpieces of the intra-aural type
    • H04R29/00 Monitoring arrangements; Testing arrangements
    • H04R29/001 Monitoring arrangements; Testing arrangements for loudspeakers

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to a voice detection method, a voice detection device, an earphone and a storage medium. The voice detection method comprises the following steps: acquiring a first voice signal and a second voice signal, wherein the first voice signal is a sound signal acquired from the ear canal of a wearer and the second voice signal is a sound signal acquired from the environment in which the wearer is located; acquiring a first energy parameter and a second energy parameter, wherein the first energy parameter represents the energy value of the first voice signal in a preset frequency band, the second energy parameter represents the energy value of the second voice signal in the preset frequency band, and the preset frequency band is the frequency interval in which the occlusion effect occurs; and obtaining a numerical relationship between the first energy parameter and the second energy parameter, and determining that the wearer has emitted a sound signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter. With the disclosed method, whether the earphone wearer has emitted a sound signal can be accurately determined without adding extra sensor hardware and without forming a microphone array from the earphones.

Description

Voice detection method, device, earphone and storage medium
Technical Field
The present disclosure relates to the field of voice detection technologies, and in particular, to a voice detection method and apparatus, an earphone, and a storage medium.
Background
With the popularization of TWS (True Wireless Stereo) earphones, when a user wears an earphone and speaks, the earphone needs to transmit clear voice information so that the wearer can communicate with other people through the earphone; alternatively, when the wearer speaks, the voice recognition function of the earphone is started to recognize the wearer's voice information and perform related operations. Therefore, the earphone needs to accurately detect whether its wearer is speaking, and then, according to the detection result, either maintain the current state or enable other functions according to the wearer's voice signal.
Currently, in order to accurately detect whether the wearer of the headset is speaking, one solution in the related art is to provide auxiliary sensors on the headset, such as a vibration sensor or an acceleration sensor, to determine whether the user is speaking. This detection method requires an additional sensor, which increases the cost of the earphone. In another related art, a microphone array is provided on the headset to estimate the direction from which the sound arrives and thereby determine whether the wearer is speaking. In this scheme, estimation of the direction of arrival is easily disturbed by environmental sounds; when external sounds are louder than the wearer's speech, the direction of the sound source cannot be estimated accurately, so the detection result is inaccurate.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a voice detection method, apparatus, headset, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a voice detection method applied to an in-ear headphone, the voice detection method including:
acquiring a first voice signal and a second voice signal, wherein the first voice signal is a sound signal acquired from an ear canal of a wearer, and the second voice signal is a sound signal acquired from an environment in which the wearer is located;
acquiring a first energy parameter and a second energy parameter, wherein the first energy parameter represents an energy value of the first voice signal in a preset frequency band, the second energy parameter represents an energy value of the second voice signal in the preset frequency band, and the preset frequency band is the frequency interval in which the occlusion effect occurs;
obtaining a numerical relationship between the first energy parameter and the second energy parameter, and determining that the wearer has emitted a sound signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter.
In an exemplary embodiment, the preset frequency band includes a start frequency point and an end frequency point;
the acquiring the first energy parameter includes: acquiring a first initial energy value, a first termination energy value and a first average energy value of the first voice signal, wherein the first initial energy value represents an energy value of the first voice signal at the initial frequency point, the first termination energy value represents an energy value of the first voice signal at the termination frequency point, and the first average energy value represents an average energy value of the first voice signal in the preset frequency band;
the acquiring the second energy parameter includes: acquiring a second initial energy value, a second termination energy value and a second average energy value of the second voice signal, wherein the second initial energy value represents an energy value of the second voice signal at the initial frequency point, the second termination energy value represents an energy value of the second voice signal at the termination frequency point, and the second average energy value represents an average energy value of the second voice signal in the preset frequency band.
In an exemplary embodiment, the obtaining the first average energy value includes:
setting a first reference frequency point at intervals of a first preset frequency difference value in the preset frequency band, wherein the preset frequency band comprises a plurality of first reference frequency points;
respectively obtaining a first reference energy value of the first voice signal for each first reference frequency point;
summing each of the first reference energy values and taking an average value as the first average energy value;
the obtaining the second average energy value comprises:
setting a second reference frequency point at intervals of a second preset frequency difference value in the preset frequency band, wherein the preset frequency band comprises a plurality of second reference frequency points;
respectively obtaining a second reference energy value for each second reference frequency point in the second voice signal;
and summing each second reference energy value and taking an average value as the second average energy value.
In an exemplary embodiment, said determining that the wearer has emitted a sound signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter comprises:
obtaining a first energy difference value, a second energy difference value and an average energy difference value, wherein the first energy difference value represents the energy difference between the first initial energy value and the second initial energy value, the second energy difference value represents the energy difference between the first termination energy value and the second termination energy value, and the average energy difference value represents the energy difference between the first average energy value and the second average energy value;
determining that the wearer has emitted a sound signal based on the first energy difference value, the second energy difference value and the average energy difference value.
In an exemplary embodiment, said determining that the wearer has emitted a sound signal based on the first energy difference value, the second energy difference value, and the average energy difference value comprises:
and if the first energy difference value is larger than the second energy difference value and the average energy difference value is larger than a preset reference value, determining that the wearer sends out a sound signal.
In an exemplary embodiment, the voice detection method further includes:
if the first energy difference is less than or equal to the second energy difference, and/or the average energy difference is less than or equal to the preset reference value, determining that the wearer does not emit a sound signal.
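Taken together, the positive and negative branches above form a complete decision rule, which can be sketched as follows. The sketch is illustrative only and not part of the disclosure; the function and variable names and the sample reference value are assumptions.

```python
def wearer_emitted_sound(a_start, a_end, a_avg,
                         b_start, b_end, b_avg,
                         reference_value):
    """Decide whether the wearer has emitted a sound signal.

    a_* are energy values of the first (in-ear) voice signal and b_* of the
    second (environment) voice signal: at the initial frequency point, at the
    termination frequency point, and averaged over the preset frequency band.
    """
    first_energy_difference = a_start - b_start    # at the initial frequency point
    second_energy_difference = a_end - b_end       # at the termination frequency point
    average_energy_difference = a_avg - b_avg      # over the preset frequency band
    # Speaking is detected only when the first difference exceeds the second
    # AND the average difference exceeds the preset reference value.
    return (first_energy_difference > second_energy_difference
            and average_energy_difference > reference_value)
```

With illustrative energies, `wearer_emitted_sound(10.0, 4.0, 8.0, 2.0, 1.0, 3.0, 2.0)` returns `True`: the first difference 8 exceeds the second difference 3, and the average difference 5 exceeds the reference value 2.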
In an exemplary embodiment, the preset frequency band is 200 Hz to 500 Hz.
According to a second aspect of the embodiments of the present disclosure, there is provided a voice detection apparatus applied to an in-ear headphone, the voice detection apparatus including:
an acquisition module configured to acquire a first voice signal and a second voice signal, wherein the first voice signal is a sound signal acquired from the ear canal of the wearer, and the second voice signal is a sound signal acquired from the environment in which the wearer is located;
a calculation module configured to acquire a first energy parameter and a second energy parameter, wherein the first energy parameter represents an energy value of the first voice signal in a preset frequency band, the second energy parameter represents an energy value of the second voice signal in the preset frequency band, and the preset frequency band is the frequency interval in which the occlusion effect occurs;
a determination module configured to obtain a numerical relationship between the first energy parameter and the second energy parameter, and to determine that the wearer has emitted a sound signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter.
In an exemplary embodiment, the preset frequency band includes a start frequency point and an end frequency point;
the computing module is further configured to:
acquiring a first initial energy value, a first termination energy value and a first average energy value of the first voice signal, wherein the first initial energy value represents an energy value of the first voice signal at the initial frequency point, the first termination energy value represents an energy value of the first voice signal at the termination frequency point, and the first average energy value represents an average energy value of the first voice signal in the preset frequency band;
the computing module is further configured to:
and acquiring a second initial energy value, a second termination energy value and a second average energy value of the second voice signal, wherein the second initial energy value represents an energy value of the second voice signal at the initial frequency point, the second termination energy value represents an energy value of the second voice signal at the termination frequency point, and the second average energy value represents an average energy value of the second voice signal in the preset frequency band.
In an exemplary embodiment, the computing module is further configured to:
setting a first reference frequency point at intervals of a first preset frequency difference value in the preset frequency band, wherein the preset frequency band comprises a plurality of first reference frequency points;
respectively obtaining a first reference energy value for each first reference frequency point in the first voice signal;
summing each of the first reference energy values and taking an average value as the first average energy value;
the computing module is further configured to:
setting a second reference frequency point at intervals of a second preset frequency difference value in the preset frequency band, wherein the preset frequency band comprises a plurality of second reference frequency points;
respectively obtaining a second reference energy value for each second reference frequency point in the second voice signal;
and summing each second reference energy value and taking an average value as the second average energy value.
In an exemplary embodiment, the determining module is further configured to:
obtaining a first energy difference value, a second energy difference value and an average energy difference value, wherein the first energy difference value represents the energy difference between the first initial energy value and the second initial energy value, the second energy difference value represents the energy difference between the first termination energy value and the second termination energy value, and the average energy difference value represents the energy difference between the first average energy value and the second average energy value;
determining that the wearer has emitted a sound signal based on the first energy difference value, the second energy difference value and the average energy difference value.
In an exemplary embodiment, the determining module is further configured to:
and if the first energy difference value is larger than the second energy difference value and the average energy difference value is larger than a preset reference value, determining that the wearer sends out a sound signal.
In an exemplary embodiment, the determining module is further configured to:
if the first energy difference is less than or equal to the second energy difference, and/or the average energy difference is less than or equal to the preset reference value, determining that the wearer does not emit a sound signal.
In an exemplary embodiment, the preset frequency band is 200 Hz to 500 Hz.
According to a third aspect of embodiments of the present disclosure, there is provided a headset comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the speech detection method according to the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a headset, enable the headset to perform the voice detection method as set forth in the first aspect of the embodiments of the present disclosure.
The method of the disclosure achieves the following beneficial effect: whether the earphone wearer has emitted a sound signal can be accurately determined without adding extra sensor hardware and without forming a microphone array from the earphones.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of speech detection according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating the structure of an in-ear TWS headset according to an exemplary embodiment;
FIG. 3 is a block diagram of a speech detection apparatus according to an exemplary embodiment;
FIG. 4 is a block diagram illustrating a speech detection apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the related art, when a user wears a headset, there are two main methods for the headset to detect whether the wearer is speaking. In the first, an auxiliary sensor, such as a vibration sensor or an acceleration sensor, is built into the headset, and whether the wearer is speaking is judged from the sensor signal; however, this method requires the user to wear the headset in a preset posture, which affects the user experience, and the sensor adds cost. In the second, a microphone array is formed from a plurality of microphones on the earphones to collect sound signals; a sound direction-finding algorithm estimates the direction of arrival of the collected signals, determines whether that direction is the direction of the wearer's mouth, and thereby judges whether the wearer is speaking.
In an exemplary embodiment of the present disclosure, in order to overcome the problems in the related art, a voice detection method applied to an in-ear earphone is provided. A first voice signal is acquired from the ear canal of the wearer, and a second voice signal is acquired from the environment in which the wearer is located; the frequency interval in which the occlusion effect occurs is selected as a preset frequency band; a first energy parameter of the first voice signal in the preset frequency band and a second energy parameter of the second voice signal in the preset frequency band are acquired; a numerical relationship between the first energy parameter and the second energy parameter is obtained, and if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter, it is determined that the wearer has emitted a sound signal. Using the voice detection method of the disclosure, whether the earphone wearer has emitted a sound signal can be accurately determined without adding extra sensor hardware and without forming a microphone array from the earphones.
In an exemplary embodiment of the present disclosure, a voice detection method is provided, which is applied to an in-ear headphone. Fig. 1 is a flow chart illustrating a voice detection method according to an exemplary embodiment, as shown in fig. 1, the voice detection method includes the steps of:
step S101, acquiring a first voice signal and a second voice signal, wherein the first voice signal is a sound signal acquired from an ear canal of a wearer, and the second voice signal is a sound signal acquired from an environment where the wearer is located;
step S102, a first energy parameter and a second energy parameter are obtained, the first energy parameter represents the energy value of a first voice signal in a preset frequency band, the second energy parameter represents the energy value of a second voice signal in the preset frequency band, and the preset frequency band is a frequency interval generated by a blocking effect;
step S103, obtaining a numerical relation between the first energy parameter and the second energy parameter, and if the numerical relation indicates that the first energy parameter is larger than the second energy parameter, determining that the wearer sends out a sound signal.
In step S101, the structural characteristics of the in-ear earphone allow voice signals to be acquired separately from the ear canal of the wearer and from the environment in which the wearer is located. Fig. 2 is a schematic diagram illustrating the structure of an in-ear TWS earphone according to an exemplary embodiment. As shown in Fig. 2, the acoustic components of the in-ear earphone mainly include a feed-forward microphone 1, a feedback microphone 2 and a call microphone 3. The feed-forward microphone 1 is placed outside the pinna of the wearer and acquires sound signals of the environment in which the wearer is located. The feedback microphone 2 is placed inside the ear canal of the wearer and picks up sound signals in the ear canal. The call microphone 3 is disposed on the earphone stem and acquires the wearer's sound signal so that related operations can be performed on it. The first voice signal, which characterizes the sound signal in the wearer's ear canal, is obtained from the feedback microphone shown in Fig. 2, and the second voice signal, which characterizes the sound signal of the environment, is obtained from the feed-forward microphone. It should be noted that, besides the wireless in-ear earphone shown in Fig. 2, the voice detection method of the present disclosure can also be used in a wired in-ear earphone, as long as it provides a feed-forward microphone and a feedback microphone.
It should be noted that, because the present disclosure determines whether the earphone wearer has emitted a sound signal, when the wearer plays music, video or the like while wearing the earphone, an echo cancellation algorithm must be used to remove the interference of the playback, so that a human voice signal is extracted from the captured sound as the acquired voice signal. In addition, to ensure the clarity of the extracted voice signal, noise reduction processing may be applied to it after extraction.
In step S102, when the user does not wear the in-ear TWS earphone, a sound signal emitted by the user is conducted through the bones to the ear and then propagates out through the ear canal. When the user wears the in-ear earphone, however, the sound signal conducted through the bones is blocked inside the ear by the earphone and enhanced there. Consequently, when the user wears the in-ear earphone, the energy of the wearer's sound signal acquired from the ear canal is greater than the energy of the wearer's sound signal acquired from the environment outside the ear; this is called the occlusion effect. Because the occlusion effect occurs only in the frequency range of 200 Hz to 500 Hz, that range is determined as the preset frequency band, i.e., 200 Hz to 500 Hz, including the start frequency point of 200 Hz and the end frequency point of 500 Hz. A first energy parameter of the first voice signal in the preset frequency band and a second energy parameter of the second voice signal in the preset frequency band are acquired. The first and second energy parameters are parameters that reflect the energy characteristics of the respective voice signals in the preset frequency band, and may include the energy values at particular frequency points in the band, for example the start and end frequency points, as well as the average energy value over the band.
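One inexpensive way to read the energy at a single frequency point, such as the 200 Hz start point or the 500 Hz end point, is the Goertzel algorithm, which evaluates one DFT bin without computing a full FFT. This is an illustrative sketch under assumed parameters, not the computation specified by the disclosure.

```python
import numpy as np

def goertzel_energy(frame, fs, target_freq):
    """Squared magnitude of the DFT bin nearest `target_freq`, computed with
    the Goertzel recurrence instead of a full FFT."""
    n = len(frame)
    k = int(round(n * target_freq / fs))          # nearest DFT bin index
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:                                # second-order recurrence
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Squared magnitude of the k-th DFT bin from the final two states.
    return s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2
```

The result matches `abs(np.fft.rfft(frame)[k]) ** 2` up to rounding, so the energies at the start and end frequency points can be obtained with a handful of multiply-accumulates per sample.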
In step S103, because of the occlusion effect in the preset frequency band, when the user speaks while wearing the in-ear earphone, part of the sound signal is conducted into the ear canal through the bones and enhanced, so the energy of the sound signal in the ear canal is greater than the energy of the sound signal in the environment outside the ear; that is, the first energy parameter of the first voice signal is greater than the second energy parameter of the second voice signal. After the first and second energy parameters are obtained, a numerical relationship between them is calculated. This may be any relationship capable of expressing which parameter is larger, for example a direct comparison of magnitudes, the sign of their difference, or another composite operation. If the numerical relationship indicates that the first energy parameter is greater than the second energy parameter, the wearer is speaking while wearing the earphone, i.e., the wearer has emitted a sound signal.
In an exemplary embodiment of the disclosure, when a user wears an in-ear earphone, a first voice signal is acquired from the ear canal of the wearer and a second voice signal from the environment in which the wearer is located; the frequency interval in which the occlusion effect occurs is selected as the preset frequency band; a first energy parameter of the first voice signal and a second energy parameter of the second voice signal in the preset frequency band are acquired; and if the numerical relationship between them indicates that the first energy parameter is greater than the second, it is determined that the wearer has emitted a sound signal. Interference from external environmental sound is thereby avoided, and whether the earphone wearer has emitted a sound signal is accurately determined, without additional sensor hardware and without forming a microphone array from the earphones.
In an exemplary embodiment of the present disclosure, the obtaining a first energy parameter of the first speech signal in a preset frequency band in step S102 includes:
the method comprises the steps of obtaining a first initial energy value, a first termination energy value and a first average energy value of a first voice signal, wherein the first initial energy value represents the energy value of the first voice signal at an initial frequency point, the first termination energy value represents the energy value of the first voice signal at a termination frequency point, and the first average energy value represents the average energy value of the first voice signal in a preset frequency band.
The preset frequency band comprises an initial frequency point and a termination frequency point; when the preset frequency band is 200 Hz to 500 Hz, the initial frequency point is 200 Hz and the termination frequency point is 500 Hz. The first energy parameter includes a first initial energy value, a first termination energy value and a first average energy value. The first initial energy value is the energy of the first voice signal at the initial frequency point of the preset frequency band; when the initial frequency point is 200 Hz, it is denoted A_200Hz. The first termination energy value is the energy of the first voice signal at the termination frequency point of the preset frequency band; when the termination frequency point is 500 Hz, it is denoted A_500Hz. The first average energy value is the average energy of the first voice signal in the preset frequency band. When the preset frequency band is 200 Hz to 500 Hz, the way the first average energy value is calculated can be set according to actual requirements: it may be the average of the energies of all frequency points in the preset frequency band, or the average of the energies of a number of preset frequency points in the band. It is denoted A_200Hz~500Hz.
In one embodiment, a first average energy value of the first speech signal in a preset frequency band is obtained by:
in a preset frequency band, setting a first reference frequency point at intervals of a first preset frequency difference value, wherein the preset frequency band comprises a plurality of first reference frequency points;
respectively obtaining a first reference energy value for each first reference frequency point in the first voice signal;
and summing each first reference energy value and taking the average value as a first average energy value.
In the preset frequency band, a first reference frequency point is set at every interval of the first preset frequency difference value, which can be chosen according to actual requirements, so that the preset frequency band contains a plurality of first reference frequency points. The smaller the first preset frequency difference value, the more first reference frequency points there are, the larger the amount of calculation, and the higher the accuracy. After the first reference frequency points are obtained, a first reference energy value of the first voice signal is obtained for each of them, the first reference energy values are summed, and the average is taken as the first average energy value.
When the preset frequency band is 200 Hz to 500 Hz and, for example, the first preset frequency difference value is 60 Hz, the preset frequency band includes 5 first reference frequency points, whose corresponding first reference energy values are denoted E11, E12, E13, E14 and E15; the first average energy value is then A_200Hz~500Hz = (E11 + E12 + E13 + E14 + E15)/5.
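The averaging procedure above can be sketched as follows. The use of the DFT bin nearest each reference frequency point, and the inclusion of both band edges in the set of reference points, are assumptions made for illustration.

```python
import numpy as np

def average_band_energy(frame, fs, f_start=200.0, f_end=500.0, step=60.0):
    """Average energy value over the preset band: place a reference frequency
    point every `step` Hz, read the spectral energy at the DFT bin nearest
    each point, and average the readings."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    ref_points = np.arange(f_start, f_end + step / 2, step)  # 200, 260, ..., 500 Hz
    energies = [spectrum[int(np.argmin(np.abs(freqs - f)))] for f in ref_points]
    return float(np.mean(energies))
```

A smaller `step` places more reference points in the band, trading extra computation for a finer estimate of the band's average energy.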
In an exemplary embodiment of the present disclosure, the obtaining of the second energy parameter of the second speech signal in the preset frequency band in step S102 includes:
and acquiring a second initial energy value, a second termination energy value and a second average energy value of the second voice signal, wherein the second initial energy value represents the energy value of the second voice signal at the initial frequency point, the second termination energy value represents the energy value of the second voice signal at the termination frequency point, and the second average energy value represents the average energy value of the second voice signal in the preset frequency band.
When the preset frequency band is 200 Hz to 500 Hz, the initial frequency point is 200 Hz and the termination frequency point is 500 Hz. The second energy parameter includes a second initial energy value, a second termination energy value, and a second average energy value. The second initial energy value is the energy of the second voice signal at the initial frequency point of the preset frequency band; when the initial frequency point is 200 Hz, the second initial energy value is denoted B_200Hz. The second termination energy value is the energy of the second voice signal at the termination frequency point of the preset frequency band; when the termination frequency point is 500 Hz, the second termination energy value is denoted B_500Hz. The second average energy value is the average energy of the second voice signal over the preset frequency band. When the preset frequency band is 200 Hz to 500 Hz, the way the second average energy value is computed can be set according to actual requirements: it can be the mean of the energies of all frequency points in the preset frequency band, or the mean of the energies of several preset frequency points in the band. The second average energy value is denoted B_200Hz~500Hz.
In one embodiment, the second average energy value of the second speech signal in the preset frequency band is obtained by:
in the preset frequency band, setting a second reference frequency point at intervals of a second preset frequency difference value, wherein the preset frequency band comprises a plurality of second reference frequency points;
respectively obtaining a second reference energy value aiming at each second reference frequency point in the second voice signal;
and summing each second reference energy value, and taking the average value as a second average energy value.
In the preset frequency band, a second reference frequency point is set at each interval of a second preset frequency difference, which can be set according to actual requirements, so that the preset frequency band comprises a plurality of second reference frequency points. The smaller the second preset frequency difference, the more second reference frequency points there are, and the larger the amount of calculation but the higher the accuracy. After the plurality of second reference frequency points are obtained, a second reference energy value is obtained for each second reference frequency point in the second voice signal, the second reference energy values are summed, and the average is taken as the second average energy value.
When the preset frequency band is 200 Hz to 500 Hz and the second preset frequency difference is, for example, 60 Hz, the preset frequency band comprises 5 second reference frequency points, whose second reference energy values are denoted E21, E22, E23, E24, and E25; the second average energy value is then B_200Hz~500Hz = (E21 + E22 + E23 + E24 + E25)/5.
In an exemplary embodiment of the present disclosure, the method of obtaining the numerical relationship between the first energy parameter and the second energy parameter in step S103 includes:
and acquiring a first energy difference value, a second energy difference value and an average energy difference value, wherein the first energy difference value represents the energy difference value between the first initial energy value and the second initial energy value, the second energy difference value represents the energy difference value between the first termination energy value and the second termination energy value, and the average energy difference value represents the energy difference value between the first average energy value and the second average energy value.
A first energy difference between the first initial energy value and the second initial energy value is obtained: with the first initial energy value denoted A_200Hz, the second initial energy value denoted B_200Hz, and the first energy difference denoted Δ1, we have Δ1 = A_200Hz − B_200Hz. A second energy difference between the first termination energy value and the second termination energy value is obtained: with the first termination energy value denoted A_500Hz, the second termination energy value denoted B_500Hz, and the second energy difference denoted Δ2, we have Δ2 = A_500Hz − B_500Hz. The average energy difference between the first average energy value and the second average energy value is obtained: with the first average energy value denoted A_200Hz~500Hz, the second average energy value denoted B_200Hz~500Hz, and the average energy difference denoted Δ, we have Δ = A_200Hz~500Hz − B_200Hz~500Hz.
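The three differences reduce to simple subtractions of the six energy values. A trivial sketch with illustrative names (values assumed to be in dB):

```python
def energy_differences(a_start, a_stop, a_avg, b_start, b_stop, b_avg):
    """Δ1 = A_200Hz − B_200Hz, Δ2 = A_500Hz − B_500Hz,
    Δ  = A_200Hz~500Hz − B_200Hz~500Hz (all in dB)."""
    d1 = a_start - b_start     # first energy difference
    d2 = a_stop - b_stop       # second energy difference
    d_avg = a_avg - b_avg      # average energy difference
    return d1, d2, d_avg
```

For example, with an in-ear signal 12 dB hotter than ambient at 200 Hz but only 3 dB hotter at 500 Hz, `energy_differences(70.0, 55.0, 62.0, 58.0, 52.0, 53.0)` yields `(12.0, 3.0, 9.0)`.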
In an exemplary embodiment of the present disclosure, the determination of whether the wearer emits the sound signal according to the numerical relationship between the first energy parameter and the second energy parameter in step S103 includes the following two cases:
The first case: if the first energy difference is greater than the second energy difference and the average energy difference is greater than a preset reference value, it is determined that the wearer has emitted a sound signal;
The second case: if the first energy difference is less than or equal to the second energy difference, and/or the average energy difference is less than or equal to the preset reference value, it is determined that the wearer has not emitted a sound signal.
The preset reference value is denoted F, and its specific value can be set according to actual requirements, for example 6 dB. If the first energy difference Δ1 is greater than the second energy difference Δ2 and the average energy difference Δ is greater than the preset reference value F, it is determined that the wearer has emitted a sound signal; that is, when (A_200Hz − B_200Hz) > (A_500Hz − B_500Hz) and (A_200Hz~500Hz − B_200Hz~500Hz) > F are both satisfied, the wearer is determined to have emitted a sound signal. After it is determined that the wearer has emitted a sound signal, the sound pass-through function or the voice recognition function can be enabled, the wearer's voice information collected, and the corresponding instructions executed according to the collected voice information. If the first energy difference Δ1 is less than or equal to the second energy difference Δ2, and/or the average energy difference Δ is less than or equal to the preset reference value F, it is determined that the wearer has not emitted a sound signal; that is, when (A_200Hz − B_200Hz) > (A_500Hz − B_500Hz) and (A_200Hz~500Hz − B_200Hz~500Hz) > F are not both satisfied, the wearer is determined not to have emitted a sound signal. When it is determined that the wearer has not emitted a sound signal, the earphone maintains its current voice-signal receiving state and continues to detect in real time whether the wearer emits a sound signal.
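The two cases above collapse into a single boolean test. A hedged sketch (the function name is illustrative, and the 6 dB default for F is only the example value mentioned above):

```python
def wearer_is_speaking(d1, d2, d_avg, f_ref=6.0):
    """Occlusion-effect decision: the in-ear excess must be larger at the
    start of the band (Δ1) than at its end (Δ2), and the average excess Δ
    must exceed the preset reference value F; otherwise the wearer is
    deemed not to be speaking and the earphone keeps listening."""
    return d1 > d2 and d_avg > f_ref
```

If either condition fails, the result is False, matching the second case in which the earphone keeps its current receiving state.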
In an exemplary embodiment of the present disclosure, a voice detection device is provided, which is applied to an in-ear earphone. Fig. 3 is a block diagram illustrating a voice detection apparatus according to an exemplary embodiment, and as shown in fig. 3, the voice detection apparatus includes:
an obtaining module 301 configured to obtain a first voice signal and a second voice signal, wherein the first voice signal is a sound signal obtained from an ear canal of a wearer, and the second voice signal is a sound signal obtained from an environment in which the wearer is located;
a calculating module 302 configured to obtain a first energy parameter and a second energy parameter, where the first energy parameter represents an energy value of the first voice signal in a preset frequency band, and the second energy parameter represents an energy value of the second voice signal in the preset frequency band, where the preset frequency band is a frequency interval generated by a blocking effect;
a determining module 303 configured to obtain a numerical relationship between the first energy parameter and the second energy parameter, and determine that the wearer has emitted the sound signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter.
In an exemplary embodiment, the preset frequency band includes a start frequency point and an end frequency point;
the calculation module 302 is further configured to:
acquiring a first initial energy value, a first termination energy value and a first average energy value of a first voice signal, wherein the first initial energy value represents an energy value of the first voice signal at an initial frequency point, the first termination energy value represents an energy value of the first voice signal at a termination frequency point, and the first average energy value represents an average energy value of the first voice signal in a preset frequency band;
the calculation module 302 is further configured to:
and acquiring a second initial energy value, a second termination energy value and a second average energy value of the second voice signal, wherein the second initial energy value represents the energy value of the second voice signal at the initial frequency point, the second termination energy value represents the energy value of the second voice signal at the termination frequency point, and the second average energy value represents the average energy value of the second voice signal in the preset frequency band.
In an exemplary embodiment, the calculation module 302 is further configured to:
in a preset frequency band, setting a first reference frequency point at intervals of a first preset frequency difference value, wherein the preset frequency band comprises a plurality of first reference frequency points;
respectively obtaining a first reference energy value aiming at each first reference frequency point in the first voice signal;
summing each first reference energy value and taking an average value as a first average energy value;
the calculation module 302 is further configured to:
in the preset frequency band, setting a second reference frequency point at intervals of a second preset frequency difference value, wherein the preset frequency band comprises a plurality of second reference frequency points;
respectively obtaining a second reference energy value aiming at each second reference frequency point in the second voice signal;
and summing each second reference energy value, and taking the average value as a second average energy value.
In an exemplary embodiment, the determining module 303 is further configured to:
acquiring a first energy difference value, a second energy difference value and an average energy difference value, wherein the first energy difference value represents the energy difference value between a first initial energy value and a second initial energy value, the second energy difference value represents the energy difference value between a first termination energy value and a second termination energy value, and the average energy difference value represents the energy difference value between the first average energy value and the second average energy value;
determining that the wearer has emitted the sound signal based on the first energy difference value, the second energy difference value, and the average energy difference value.
In an exemplary embodiment, the determining module 303 is further configured to:
and if the first energy difference value is larger than the second energy difference value and the average energy difference value is larger than a preset reference value, determining that the wearer sends out the sound signal.
In an exemplary embodiment, the determining module 303 is further configured to:
and if the first energy difference value is less than or equal to the second energy difference value and/or the average energy difference value is less than or equal to a preset reference value, determining that the wearer does not send the sound signal.
In an exemplary embodiment, the preset frequency band is 200 Hz to 500 Hz.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The present disclosure also provides an earphone comprising a processor and a memory for storing instructions executable by the processor, the processor being configured to perform the voice detection method shown in the embodiments of the present disclosure.
Fig. 4 is a block diagram illustrating a speech detection apparatus 400 according to an example embodiment.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: processing components 402, memory 404, power components 406, multimedia components 408, audio components 410, input/output (I/O) interfaces 412, sensor components 414, and communication components 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the apparatus 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply components 406 provide power to the various components of device 400. The power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor component 414 may detect the open/closed state of the apparatus 400 and the relative positioning of components, such as the display and keypad of the apparatus 400; it may also detect a change in position of the apparatus 400 or of a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the apparatus 400 to perform the voice detection method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium, instructions in the storage medium, when executed by a processor of an apparatus, enable the apparatus to perform the speech detection method in the above embodiments.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes can be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A voice detection method applied to an in-ear headphone, the voice detection method comprising:
acquiring a first voice signal and a second voice signal, wherein the first voice signal is a sound signal acquired from an ear canal of a wearer, and the second voice signal is a sound signal acquired from an environment in which the wearer is located;
acquiring a first energy parameter and a second energy parameter, wherein the first energy parameter represents an energy value of the first voice signal in a preset frequency band, and the second energy parameter represents an energy value of the second voice signal in the preset frequency band, wherein the preset frequency band is a frequency interval generated by a blocking effect;
obtaining a numerical relationship between the first energy parameter and the second energy parameter, and determining that the wearer has emitted a sound signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter.
2. The voice detecting method of claim 1, wherein the predetermined frequency band comprises a start frequency point and an end frequency point;
the acquiring the first energy parameter includes: acquiring a first initial energy value, a first termination energy value and a first average energy value of the first voice signal, wherein the first initial energy value represents an energy value of the first voice signal at the initial frequency point, the first termination energy value represents an energy value of the first voice signal at the termination frequency point, and the first average energy value represents an average energy value of the first voice signal in the preset frequency band;
the acquiring the second energy parameter includes: and acquiring a second initial energy value, a second termination energy value and a second average energy value of the second voice signal, wherein the second initial energy value represents an energy value of the second voice signal at the initial frequency point, the second termination energy value represents an energy value of the second voice signal at the termination frequency point, and the second average energy value represents an average energy value of the second voice signal in the preset frequency band.
3. The speech detection method according to claim 2,
the obtaining the first average energy value comprises:
setting a first reference frequency point at intervals of a first preset frequency difference value in the preset frequency band, wherein the preset frequency band comprises a plurality of first reference frequency points;
respectively obtaining a first reference energy value for each first reference frequency point in the first voice signal;
summing each of the first reference energy values and taking an average value as the first average energy value;
the obtaining the second average energy value comprises:
setting a second reference frequency point at intervals of a second preset frequency difference value in the preset frequency band, wherein the preset frequency band comprises a plurality of second reference frequency points;
respectively obtaining a second reference energy value for each second reference frequency point in the second voice signal;
and summing each second reference energy value and taking an average value as the second average energy value.
4. The speech detection method of claim 3, wherein determining that the wearer uttered an audible signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter comprises:
obtaining a first energy difference value, a second energy difference value and an average energy difference value, wherein the first energy difference value represents the energy difference value between the first starting energy value and the second starting energy value, the second energy difference value represents the energy difference value between the first ending energy value and the second ending energy value, and the average energy difference value represents the energy difference value between the first average energy value and the second average energy value;
determining that the wearer has emitted an audible signal based on the first energy difference value, the second energy difference value, and the average energy difference value.
5. The method of claim 4, wherein determining that the wearer uttered a sound signal based on the first energy difference value, the second energy difference value, and the average energy difference value comprises:
and if the first energy difference value is larger than the second energy difference value and the average energy difference value is larger than a preset reference value, determining that the wearer sends out a sound signal.
6. The voice detection method according to claim 5, characterized in that the voice detection method further comprises:
if the first energy difference is less than or equal to the second energy difference, and/or the average energy difference is less than or equal to the preset reference value, determining that the wearer does not emit a sound signal.
7. The voice detection method according to any one of claims 1 to 6, wherein the preset frequency band is 200Hz to 500Hz.
8. A voice detection device, for use in an in-ear headset, the voice detection device comprising:
the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is configured to acquire a first voice signal and a second voice signal, the first voice signal is a sound signal acquired from an ear canal of a wearer, and the second voice signal is a sound signal acquired from an environment in which the wearer is located;
the calculation module is configured to obtain a first energy parameter and a second energy parameter, wherein the first energy parameter represents an energy value of the first voice signal in a preset frequency band, and the second energy parameter represents an energy value of the second voice signal in the preset frequency band, and the preset frequency band is a frequency interval generated by an occlusion effect;
a determination module configured to obtain a numerical relationship between the first energy parameter and the second energy parameter, and determine that the wearer has emitted an audible signal if the numerical relationship indicates that the first energy parameter is greater than the second energy parameter.
9. An earphone, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the speech detection method of any of claims 1-7.
10. A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of a headset, enable the headset to perform the voice detection method of any of claims 1-7.
CN202211042440.4A 2022-08-29 2022-08-29 Voice detection method, device, earphone and storage medium Pending CN115278441A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211042440.4A CN115278441A (en) 2022-08-29 2022-08-29 Voice detection method, device, earphone and storage medium


Publications (1)

Publication Number Publication Date
CN115278441A true CN115278441A (en) 2022-11-01

Family

ID=83755307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211042440.4A Pending CN115278441A (en) 2022-08-29 2022-08-29 Voice detection method, device, earphone and storage medium

Country Status (1)

Country Link
CN (1) CN115278441A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116668892A (en) * 2022-11-14 2023-08-29 荣耀终端有限公司 Audio signal processing method, electronic device and readable storage medium
CN116668892B (en) * 2022-11-14 2024-04-12 荣耀终端有限公司 Audio signal processing method, electronic device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination