CN111816217B - Self-adaptive endpoint detection voice recognition method and system and intelligent device - Google Patents

Self-adaptive endpoint detection voice recognition method and system and intelligent device

Info

Publication number
CN111816217B
Authority
CN
China
Prior art keywords
endpoint detection
audio data
intensity
end point
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010633139.5A
Other languages
Chinese (zh)
Other versions
CN111816217A (en)
Inventor
肖积涛
耿士顶
孙非凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Aoto Electronics Co ltd
Original Assignee
Nanjing Aoto Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Aoto Electronics Co ltd filed Critical Nanjing Aoto Electronics Co ltd
Priority to CN202010633139.5A priority Critical patent/CN111816217B/en
Publication of CN111816217A publication Critical patent/CN111816217A/en
Application granted granted Critical
Publication of CN111816217B publication Critical patent/CN111816217B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786 Adaptive threshold

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice recognition method, a voice recognition system and an intelligent device with self-adaptive endpoint detection. The voice recognition method comprises the steps of constructing environmental sounds with different intensity levels; playing a test sound source under the environmental sound of each intensity level, acquiring test audio data, and performing endpoint detection; determining an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the endpoint detection result reference value, and summarizing to obtain a mapping table of environmental sound intensity and endpoint detection threshold; acquiring the intensity of the environmental sound; obtaining the corresponding endpoint detection threshold from the mapping table according to the acquired environmental sound intensity; and performing endpoint detection on the audio data and then performing speech recognition. The method adapts well to the current environmental noise, so that endpoint detection is more accurate and the accuracy of voice recognition is higher and is not affected by the environmental noise.

Description

Self-adaptive endpoint detection voice recognition method and system and intelligent device
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and system for speech recognition with adaptive endpoint detection, and an intelligent device.
Background
With the continuous development of speech recognition technology, it is being applied in more and more scenarios. For example, at service outlets and places such as banks, business halls of communication carriers, administrative service halls and shopping malls, or in scenarios such as teleconference systems, corresponding operations, such as queuing, handling a specific business or identifying the speaker, need to be performed based on audio information.
In speech recognition, it is generally necessary to perform endpoint detection on the collected audio data to identify the speech segments, and then to pass those segments to a processor or a speech recognition engine for recognition. In this way, invalid sound data does not occupy resources such as storage space and processor time, resource waste is avoided, and system overhead is reduced.
Traditional voice endpoint detection technology is mainly based on feature extraction: a characteristic parameter in the time domain/frequency domain, such as a short-time energy value or a zero-crossing rate, is first extracted from the audio signal; the parameter is then compared with a preset threshold, and the segment is judged to be speech if the parameter exceeds the threshold.
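For concreteness, a minimal sketch of such a fixed-threshold scheme is given below, using short-time frame energy compared against a preset constant; the frame length, threshold value and synthetic test signal are illustrative assumptions rather than values from this disclosure.

```python
import numpy as np

def fixed_threshold_endpoint_detection(samples: np.ndarray, frame_len: int = 400,
                                       energy_threshold: float = 0.01):
    """Classic endpoint detection: mark frames whose short-time energy exceeds a
    fixed, preset threshold and merge consecutive voiced frames into segments."""
    voiced = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced.append(float(np.mean(frame ** 2)) > energy_threshold)  # short-time energy

    segments, start = [], None                     # (start_frame, end_frame) pairs
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(voiced) - 1))
    return segments

# One second of low-level noise with a louder burst in the middle (16 kHz).
rng = np.random.default_rng(0)
audio = 0.02 * rng.standard_normal(16000)
audio[6000:10000] += 0.5 * np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
print(fixed_threshold_endpoint_detection(audio))
```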
The thresholds for these characteristic parameters are typically set to fixed values. If the threshold of the characteristic parameter is set too low, a large amount of invalid audio data is passed on to subsequent speech recognition, which affects recognition efficiency and accuracy; moreover, because the additional, invalid audio data is processed, overhead increases. If the threshold of the characteristic parameter is set too high, part of the valid audio data may be filtered out, reducing speech recognition accuracy.
Therefore, the existing endpoint detection schemes cannot be effectively matched to the environment in which the recording equipment is located, so that voice recognition results under different environmental noises differ greatly and the accuracy of voice recognition is unstable; in particular, the accuracy of voice recognition is low in non-stationary and complex noise environments.
Disclosure of Invention
Based on the above, it is necessary to provide a voice recognition method, a system and an intelligent device with adaptive endpoint detection, addressing the problems of the existing endpoint detection schemes that voice recognition results differ greatly under different environmental noises and that the accuracy of voice recognition is unstable.
An embodiment of the present application provides a speech recognition method for adaptive endpoint detection, including:
constructing environmental sounds with different intensity levels;
playing a test sound source under the environment sound of each intensity level, and acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
performing endpoint detection on each piece of test audio data to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
acquiring a reference value of an endpoint detection result, and determining an endpoint detection threshold under the environmental sound of a corresponding intensity level according to an endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
acquiring the intensity of environmental sound;
obtaining a corresponding endpoint detection threshold value from a mapping table of the environmental sound intensity and the endpoint detection threshold value according to the acquired environmental sound intensity;
acquiring audio data, and performing endpoint detection on the audio data by using the obtained endpoint detection threshold;
and carrying out voice recognition on the audio data after the endpoint detection.
In some embodiments, the step of obtaining the reference value of the endpoint detection result, determining the endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each test audio data and the reference value of the endpoint detection result, specifically includes:
acquiring a reference value of an end point detection result, and constructing an end point detection result reference area according to the reference value of the end point detection result;
and determining the endpoint detection threshold value under the environment sound of the corresponding intensity level according to the intersection area of the endpoint detection threshold value-endpoint detection result curve of each test audio data and the endpoint detection result reference area.
In some embodiments, the step of obtaining the intensity of the environmental sound specifically includes: acquiring the environmental sound intensity in a preset time period at regular time; and counting the environmental sound intensity in a preset time period to obtain the environmental sound intensity.
In some embodiments, the reference value of the endpoint detection result is determined through testing, specifically: collecting dialogue audio at three speech speeds, namely slow, normal and fast; counting the durations of periods without dialogue content, calculating, under the assumption of a normal distribution, the minimum duration at the position of three times the standard deviation, and determining the reference value of the endpoint detection result from the ratio of this minimum duration to the minimum time interval of endpoint detection.
In some embodiments, before the step of acquiring the ambient sound intensity, further comprising: judging whether a user exists, and acquiring the environmental sound intensity when the user exists.
In some embodiments, the endpoint detection threshold is a volume threshold.
Another embodiment of the present application provides a speech recognition system for adaptive endpoint detection, comprising:
the environment sound construction module is used for constructing environment sounds with different intensity levels;
the test audio data acquisition module is used for playing a test sound source under the environment sound of each intensity level, acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
the first endpoint detection module is used for carrying out endpoint detection on each piece of test audio data to obtain an endpoint detection threshold value-endpoint detection result curve of the test audio data;
the threshold calculation module is used for obtaining a reference value of the endpoint detection result, and determining an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
the threshold mapping table module is used for summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
the environment sound detection module is used for acquiring the intensity of the environment sound;
the endpoint detection threshold determining module is used for obtaining a corresponding endpoint detection threshold from a mapping table of the environmental sound intensity and the endpoint detection threshold according to the acquired environmental sound intensity;
the second endpoint detection module is used for acquiring audio data and detecting endpoints of the audio data by utilizing the obtained endpoint detection threshold;
and the voice recognition module is used for carrying out voice recognition on the audio data after the endpoint detection.
In some embodiments, the system may further include a user detection module, configured to determine whether there is a user, and trigger the ambient sound detection module to acquire the ambient sound intensity when it is determined that there is a user.
An embodiment of the present application further provides an intelligent device, including a speech recognition system for adaptive endpoint detection according to any one of the preceding embodiments.
Another embodiment of the present application also provides a machine-readable storage medium, on which is stored a computer program which, when executed by a processor, implements the speech recognition method of adaptive endpoint detection of any of the previous embodiments.
According to the voice recognition scheme with self-adaptive endpoint detection, the corresponding endpoint detection threshold can be automatically matched according to the current environmental sound intensity, so the scheme adapts well to the current environmental noise, endpoint detection is more accurate, the accuracy of voice recognition is higher and is not affected by the environmental noise, and consistently accurate voice recognition results can be obtained even in non-stationary and complex noise environments.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of an endpoint detection threshold-endpoint detection result curve and an endpoint detection result reference region for testing audio data according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a frame structure of a speech recognition system according to an embodiment of the present application.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. In addition, embodiments of the present application and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present application discloses a speech recognition method for adaptive endpoint detection, which includes:
step S110, constructing environmental sounds with different intensity levels;
the voice recognition method in this embodiment may be performed by a voice recognition system, or may be performed by other functional devices. The following describes a speech recognition method using a speech recognition system as an execution subject.
In different speech recognition scenarios, the ambient sound, i.e. the ambient noise, may have different intensity levels. A plurality of intensity levels may be divided according to the intensity of the ambient sound. When constructing the environmental sounds with different intensity levels, a segment of environmental sound can be collected, and environmental sounds with multiple intensity levels can then be obtained by adjusting its gain.
For example, the intensity of ambient sound may be characterized in decibel values. Taking an ambient sound intensity coverage of 0-120 dB as an example, it can be divided into 7 intensity classes: 0-40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-120 decibels. When determining the gain magnitude of the ambient sound for each intensity level, the midpoint of the intensity level range may be taken as the gain value. For example, for the intensity level of 60-70, the corresponding gain value may be taken as 65.
It will be appreciated that other divisions of the intensity level of the ambient sound may be made as desired. The gain for each intensity level may take other values within the range of intensity levels.
It will be appreciated that different ambient sounds may be employed for different intensity levels when constructing ambient sounds of different intensity levels.
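As an illustration of this construction step, the following sketch scales one recorded ambient-noise clip so that its RMS level sits at the midpoint of each intensity class listed above; the dB reference, the helper names and the synthetic stand-in clip are assumptions made only for the example.

```python
import numpy as np

# Intensity classes (dB) from the example above; the midpoint of each range is
# used as the target level, as the description suggests.
INTENSITY_CLASSES = [(0, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 120)]

def rms_db(x: np.ndarray) -> float:
    """RMS level in dB relative to an assumed unit reference."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def build_leveled_ambient_sounds(ambient: np.ndarray) -> dict:
    """Scale one recorded ambient clip so its RMS level hits the midpoint of
    every intensity class, yielding one test background per class."""
    current = rms_db(ambient)
    leveled = {}
    for low, high in INTENSITY_CLASSES:
        target = (low + high) / 2.0                 # e.g. 65 for the 60-70 class
        gain = 10.0 ** ((target - current) / 20.0)  # linear gain for the needed dB change
        leveled[(low, high)] = ambient * gain
    return leveled

rng = np.random.default_rng(1)
clip = 0.05 * rng.standard_normal(48000)            # stand-in for a recorded noise clip
for level, sound in build_leveled_ambient_sounds(clip).items():
    print(level, round(rms_db(sound), 1))
```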
Step S120, playing a test sound source under the environment sound of each intensity level, and acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
the test sound source is played under ambient sound of each intensity level. Sound may be collected using a microphone, microphone array, microphone, or other sound pickup device. According to the relative positional relationship between the sound pickup apparatus and the test sound source playback apparatus, two areas of a directional enhancement angle and a non-directional enhancement angle can be formed. Sound acquired in the directional enhancement angle is directional audio data; the sound collected in the non-directional enhancement angle is non-directional audio data, and the sound and the non-directional audio data form test audio data.
Each piece of test audio data corresponds to one intensity level of ambient sound. When there are multiple intensity levels of ambient sound, the collection of test audio data is repeated under the ambient sound of each intensity level, yielding the corresponding number of pieces of test audio data.
By simultaneously collecting the directional audio data under the directional enhancement angle and the non-directional audio data under the non-directional enhancement angle, audio from the non-directional enhancement direction is taken into account when the endpoint detection threshold is determined, so that speech from the non-directional enhancement direction can be filtered out and the speech recognition effect is improved.
Step S130, performing endpoint detection on each piece of test audio data to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
endpoint detection may use common schemes for threshold-based feature decisions, such as threshold decisions for feature parameters in the time/frequency domain, e.g., volume, short-time energy value, zero crossing rate, etc. And the endpoint detection threshold value corresponds to the selected characteristic parameter.
By way of example, the scheme is described below using endpoint detection based on a volume-threshold judgment. During endpoint detection, if the audio exceeds the volume threshold, sound is judged to be present; otherwise, sound is judged to be absent. In this case, the endpoint detection threshold is the volume threshold.
When endpoint detection is performed on each piece of test audio data, a plurality of different endpoint detection thresholds are applied. For a given piece of test audio data, each endpoint detection threshold yields a corresponding endpoint detection result, such as the number of detected endpoints.
Because each piece of test audio data comprises directional audio data and non-directional audio data, endpoint detection needs to be performed on each of them separately. With the endpoint detection threshold as the abscissa and the endpoint detection result as the ordinate, an endpoint detection threshold-endpoint detection result curve is plotted for the directional audio data and for the non-directional audio data; the two curves together give the endpoint detection threshold-endpoint detection result curve of the test audio data, as shown in fig. 3.
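A minimal sketch of how such a curve might be produced is shown below: a simple volume-threshold endpoint counter is swept over a grid of candidate thresholds for both the directional and the non-directional recording; the frame length, threshold grid and synthetic test signals are illustrative assumptions.

```python
import numpy as np

def count_endpoints(samples: np.ndarray, threshold: float, frame_len: int = 400) -> int:
    """Count endpoints: every transition from 'no sound' to 'sound' (frame volume
    crossing the threshold) is counted as one detected starting endpoint."""
    endpoints, prev_voiced = 0, False
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = float(np.max(np.abs(frame))) > threshold   # volume-threshold decision
        if voiced and not prev_voiced:
            endpoints += 1
        prev_voiced = voiced
    return endpoints

def threshold_result_curve(samples: np.ndarray, thresholds: np.ndarray):
    """Endpoint detection threshold vs. endpoint detection result curve for one
    recording: one endpoint count per candidate threshold."""
    return [count_endpoints(samples, float(t)) for t in thresholds]

rng = np.random.default_rng(2)
directional = 0.05 * rng.standard_normal(16000)
directional[4000:8000] += 0.6                        # stand-in for the test sound source
non_directional = 0.05 * rng.standard_normal(16000)  # mostly background noise

grid = np.linspace(0.01, 1.0, 50)                    # candidate endpoint detection thresholds
print(threshold_result_curve(directional, grid)[:10])
print(threshold_result_curve(non_directional, grid)[:10])
```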
Step S140, obtaining the reference value of the endpoint detection result, and determining the endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve and the endpoint detection result reference value of each piece of test audio data;
the reference value of the end point detection result can be set empirically or can be determined through testing. For example, the reference value of the end point detection result may be determined by a test. The method can enable a person to read dialogue contents or enable a plurality of persons to conduct dialogue, the dialogue speed is divided into three types of slow speed, normal speed and fast speed, and dialogue audio frequencies under the three speeds are collected; and counting the time period without dialogue content, calculating the minimum value of the time period of the 3 sigma (sigma is the standard deviation of the normal distribution) position according to the normal distribution form, and obtaining the reference value of the endpoint detection result according to the ratio of the minimum value of the time period to the minimum time interval of the endpoint detection.
In some embodiments, in step S140, a horizontal line whose ordinate equals the reference value of the endpoint detection result, i.e. the endpoint detection result reference line, may be constructed; the intersection points of the endpoint detection threshold-endpoint detection result curves of the directional audio data and of the non-directional audio data with this reference line are then determined for each piece of test audio data, and the endpoint detection threshold corresponding to the centre point of these intersection points is taken as the endpoint detection threshold under the environmental sound of the corresponding intensity level.
In some embodiments, step S140 may specifically include:
acquiring a reference value of an end point detection result, and constructing an end point detection result reference area according to the reference value of the end point detection result;
and determining the endpoint detection threshold value under the environment sound of the corresponding intensity level according to the intersection area of the endpoint detection threshold value-endpoint detection result curve of each test audio data and the endpoint detection result reference area.
When constructing the endpoint detection result reference region, a fluctuation interval around the reference value of the endpoint detection result may be set. By way of example, as shown in fig. 3, taking a reference value of 12 for the endpoint detection result, the fluctuation interval may be set to ±3. That is, the endpoint detection result reference region is the band of total width 6 around the reference value, bounded below by an endpoint detection result of 9 and above by 15.
Then, for each piece of test audio data, the intersection of the endpoint detection threshold-endpoint detection result curve of the directional audio data, the endpoint detection threshold-endpoint detection result curve of the non-directional audio data and the endpoint detection result reference region is determined, and the endpoint detection threshold corresponding to the centre point of this intersection region is calculated as the endpoint detection threshold under the environmental sound of the corresponding intensity level.
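A sketch of this selection step is given below, under the assumption that the reference band of fig. 3 (reference value 12, fluctuation ±3) is applied to both curves and that the chosen threshold is the centre of all thresholds whose results fall inside the band; the stand-in curves are illustrative only.

```python
import numpy as np

def threshold_from_reference_region(thresholds: np.ndarray,
                                    curve_directional: np.ndarray,
                                    curve_non_directional: np.ndarray,
                                    reference: float, fluctuation: float) -> float:
    """Pick the endpoint detection threshold for one intensity level: collect all
    curve points whose endpoint detection result lies inside
    [reference - fluctuation, reference + fluctuation] and return the threshold
    at the centre of that intersection region."""
    low, high = reference - fluctuation, reference + fluctuation
    hits = []
    for curve in (curve_directional, curve_non_directional):
        mask = (curve >= low) & (curve <= high)
        hits.extend(thresholds[mask].tolist())
    if not hits:
        raise ValueError("no intersection with the reference region")
    return (min(hits) + max(hits)) / 2.0            # centre of the intersection region

grid = np.linspace(0.01, 1.0, 50)
# Stand-in curves: endpoint counts that fall off as the threshold rises.
curve_dir = np.maximum(0, 40 - 45 * grid).round()
curve_non = np.maximum(0, 25 - 40 * grid).round()
print(round(threshold_from_reference_region(grid, curve_dir, curve_non,
                                            reference=12, fluctuation=3), 3))
```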
Step S150, summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
for example, a mapping table of ambient sound intensity and endpoint detection threshold may be as shown in table 1 below.
Table 1 mapping table of ambient sound intensity and endpoint detection threshold
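The granted text refers to Table 1 without reproducing its values here, so the sketch below shows only the shape such a mapping table might take and how a threshold could be looked up from it; every threshold value is a made-up placeholder, not data from the patent.

```python
# Hypothetical mapping table: ambient sound intensity range (dB) mapped to an
# endpoint detection (volume) threshold. Threshold values are placeholders only.
INTENSITY_TO_THRESHOLD = [
    ((0, 40), 0.05),
    ((40, 50), 0.08),
    ((50, 60), 0.12),
    ((60, 70), 0.18),
    ((70, 80), 0.25),
    ((80, 90), 0.35),
    ((90, 120), 0.50),
]

def lookup_endpoint_threshold(ambient_db: float) -> float:
    """Return the endpoint detection threshold for the measured ambient sound
    intensity; values outside the table clamp to the nearest entry."""
    if ambient_db < INTENSITY_TO_THRESHOLD[0][0][0]:
        return INTENSITY_TO_THRESHOLD[0][1]
    for (low, high), threshold in INTENSITY_TO_THRESHOLD:
        if low <= ambient_db < high:
            return threshold
    return INTENSITY_TO_THRESHOLD[-1][1]

print(lookup_endpoint_threshold(65.0))   # falls in the 60-70 dB class -> 0.18
```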
Step S200, obtaining the intensity of the environmental sound;
the acquisition of the ambient sound intensity may be performed in real time or at regular time. In a speech recognition system, a sound sensor may be provided to capture the ambient sound intensity. The sound sensor can collect the environmental sound intensity in real time and can store the collected environmental sound intensity in a memory or a buffer. The ambient sound intensity in step S200 may be the ambient sound intensity collected in real time, or may be determined according to the ambient sound intensity within a preset time period.
In some embodiments, in step S200, the ambient sound intensity is determined according to the ambient sound intensity within a preset period of time. In general, there may be a sudden increase in ambient noise at a particular moment, but the overall trend of ambient noise changes slowly. Such abrupt environmental noise generally has limited impact on endpoint detection. The environmental sound intensity is determined by utilizing the environmental sound intensity in the preset time period, so that the endpoint detection threshold value obtained later is stable, and the fluctuation of the voice recognition accuracy is avoided.
Step S200 may specifically be: acquiring, at regular intervals, the environmental sound intensity within a preset time period; and performing statistics on the environmental sound intensity within the preset time period to obtain the environmental sound intensity. For example, step S200 may be performed once every minute, and the preset time period may be set to the 5 minutes preceding the execution time. When executing step S200, all the environmental sound intensities within the preceding 5 minutes are first obtained; the average of these intensities is then calculated as the environmental sound intensity. The statistics over the preset time period may also use the median, the mode, or a combination of them, for example the average of the mean and the median.
It is understood that the time interval of the timing and the length of the preset time period can be freely determined according to actual needs.
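A minimal sketch of such timed statistics over a rolling window is shown below; the one-minute cadence, five-minute window and use of the mean follow the example above, while the class and method names are assumptions.

```python
from collections import deque
from statistics import mean, median
from typing import Optional
import time

class AmbientSoundMonitor:
    """Keep recent ambient sound intensity samples and summarise them over a
    preset window (here the last 5 minutes), e.g. once per minute, by the mean."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.samples = deque()                       # (timestamp, intensity in dB)

    def add_sample(self, intensity_db: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples.append((now, intensity_db))
        # Drop samples that fall outside the preset time window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def current_intensity(self, use_median: bool = False) -> float:
        """Statistic over the window: mean by default, median as an alternative."""
        values = [v for _, v in self.samples]
        return median(values) if use_median else mean(values)

monitor = AmbientSoundMonitor()
for t, level in enumerate([62.0, 64.5, 70.0, 63.0, 65.5]):
    monitor.add_sample(level, now=float(t))
print(round(monitor.current_intensity(), 1))         # mean over the window: 65.0
```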
Step S300, according to the acquired environmental sound intensity, obtaining a corresponding endpoint detection threshold value from a mapping table of the environmental sound intensity and the endpoint detection threshold value;
step S400, obtaining audio data, and detecting the end point of the audio data by using the obtained end point detection threshold;
step S500, voice recognition is performed on the audio data after the end point detection.
Audio data may be acquired using a microphone, a microphone array, or another sound pickup device.
Endpoint detection may use common threshold-based feature decision schemes, such as threshold decisions on characteristic parameters in the time/frequency domain, e.g., volume, short-time energy value or zero-crossing rate. The endpoint detection threshold corresponds to the selected characteristic parameter. In some embodiments, endpoint detection uses a volume-threshold judgment: if the volume threshold is exceeded, sound is judged to be present; otherwise, sound is judged to be absent. In this case, the endpoint detection threshold is the volume threshold.
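Putting steps S200 to S500 together, one possible runtime flow is sketched below: measure the ambient intensity, look up the matching volume threshold, run volume-threshold endpoint detection on the captured audio, and pass only the detected segments on for recognition; the helper names, mapping values and test signal are assumptions for illustration.

```python
import numpy as np

# Hypothetical mapping table (see the earlier sketch); threshold values are placeholders.
INTENSITY_TO_THRESHOLD = [((0, 60), 0.10), ((60, 80), 0.20), ((80, 120), 0.35)]

def lookup_threshold(ambient_db: float) -> float:
    for (low, high), thr in INTENSITY_TO_THRESHOLD:
        if low <= ambient_db < high:
            return thr
    return INTENSITY_TO_THRESHOLD[-1][1]

def detect_segments(samples: np.ndarray, threshold: float, frame_len: int = 400):
    """Volume-threshold endpoint detection: return only the voiced segments."""
    segments, start = [], None
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = float(np.max(np.abs(frame))) > threshold
        if voiced and start is None:
            start = i * frame_len
        elif not voiced and start is not None:
            segments.append(samples[start:i * frame_len])
            start = None
    if start is not None:
        segments.append(samples[start:])
    return segments

def recognize(segment: np.ndarray) -> str:
    """Stand-in for handing the segment to the speech recognition engine."""
    return f"<{len(segment)} samples passed to the recognizer>"

ambient_db = 65.0                                     # assumed measured ambient intensity
threshold = lookup_threshold(ambient_db)              # adaptive endpoint detection threshold
rng = np.random.default_rng(3)
audio = 0.05 * rng.standard_normal(16000)
audio[5000:9000] += 0.6                               # stand-in for a spoken command
for seg in detect_segments(audio, threshold):
    print(recognize(seg))
```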
According to the voice recognition method with self-adaptive endpoint detection described above, the corresponding endpoint detection threshold can be automatically matched according to the current environmental sound intensity, so the method adapts well to the current environmental noise, endpoint detection is more accurate, the accuracy of voice recognition is higher and is not affected by the environmental noise, and consistently accurate voice recognition results can be obtained even in non-stationary and complex noise environments.
In some embodiments, as shown in fig. 2, before step S200, it may further include:
step S205, judging whether a user exists, and when the user exists, entering step S200 to acquire the intensity of the environmental sound; otherwise, continuing to judge whether the user is the user.
An infrared sensor may be provided in the speech recognition system to determine whether a user is present in the vicinity of the speech recognition system. If it is determined that there is no user, speech recognition need not be performed. Therefore, the voice recognition action can be triggered only when the user is nearby, so that the energy consumption can be effectively reduced, and the method is more energy-saving and environment-friendly.
In some embodiments, after step S200, it may further include:
when the environmental sound intensity exceeds a reminding threshold, the user may be prompted, for example reminded that the current environment is noisy and asked to speak louder or to speak closer to the sound pickup device.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments and that the acts referred to are not necessarily required by the embodiments of the present application.
As shown in fig. 4, an embodiment of the present application discloses a speech recognition system for adaptive endpoint detection, including:
an ambient sound construction module 110 for constructing ambient sounds of different intensity levels;
the test audio data obtaining module 120 is configured to play a test sound source under the environmental sound of each intensity level, collect test audio data, where each test audio data corresponds to the environmental sound of one intensity level, and each test audio data includes directional audio data collected under a directional enhancement angle and non-directional audio data collected under a non-directional enhancement angle;
a first endpoint detection module 130, configured to perform endpoint detection on each piece of test audio data, so as to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
the threshold calculation module 140 is configured to obtain a reference value of the endpoint detection result, and determine an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve and the endpoint detection result reference value of each piece of test audio data;
the threshold mapping table module 150 is configured to aggregate the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
an ambient sound detection module 200 for acquiring an ambient sound intensity;
the endpoint detection threshold determining module 300 is configured to obtain, according to the obtained environmental sound intensity, a corresponding endpoint detection threshold from a mapping table of the environmental sound intensity and the endpoint detection threshold;
a second endpoint detection module 400, configured to obtain audio data, and perform endpoint detection on the audio data by using the obtained endpoint detection threshold;
the voice recognition module 500 is configured to perform voice recognition on the audio data after the endpoint detection.
In some embodiments, the apparatus may further include a user detection module 205 configured to determine whether there is a user, and trigger the ambient sound detection module 200 to obtain the ambient sound intensity when it is determined that there is a user.
In some embodiments, an alert module may be further included, configured to prompt the user when the environmental sound intensity exceeds the reminding threshold, for example to remind the user that the current environment is relatively noisy and ask them to speak louder or to speak closer to the sound pickup device.
The specific operation modes of the environmental sound construction module 110, the test audio data acquisition module 120, the first endpoint detection module 130, the threshold calculation module 140, the threshold mapping table module 150, the environmental sound detection module 200, the endpoint detection threshold determination module 300, the second endpoint detection module 400, the voice recognition module 500, the user detection module 205, and the alert module may be referred to the description in the foregoing method embodiments, and will not be repeated herein.
According to the voice recognition scheme with self-adaptive endpoint detection, the corresponding endpoint detection threshold can be automatically matched according to the current environmental sound intensity, so the scheme adapts well to the current environmental noise, endpoint detection is more accurate, the accuracy of voice recognition is higher and is not affected by the environmental noise, and consistently accurate voice recognition results can be obtained even in non-stationary and complex noise environments.
An embodiment of the present application further provides an intelligent device, which may include the foregoing adaptive endpoint detection voice recognition system, or perform the foregoing adaptive endpoint detection voice recognition method.
An embodiment of the present application provides a machine-readable storage medium, on which is stored a computer program, which when executed by a processor, implements the speech recognition method of adaptive endpoint detection described in any of the above embodiments.
The components/modules/units integrated in the system/computer apparatus, if implemented as software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may also be implemented by a computer program instructing the relevant hardware to carry out all or part of the flow of the method of the above embodiments; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided herein, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the components is merely a logical functional division, and additional divisions may be implemented in practice.
In addition, each functional module/component in the embodiments of the present invention may be integrated in the same processing module/component, or each module/component may exist alone physically, or two or more modules/components may be integrated in the same module/component. The integrated modules/components described above may be implemented in hardware or in hardware plus software functional modules/components.
It will be evident to those skilled in the art that the embodiments of the invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units, modules or means recited in a system, means or terminal claim may also be implemented by means of software or hardware by means of one and the same unit, module or means. The terms first, second, etc. are used to denote a name, but not any particular order.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A method of speech recognition for adaptive endpoint detection, comprising:
constructing environmental sounds with different intensity levels;
playing a test sound source under the environment sound of each intensity level, and acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
performing endpoint detection on each piece of test audio data to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
acquiring a reference value of an endpoint detection result, and determining an endpoint detection threshold under the environmental sound of a corresponding intensity level according to an endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
acquiring the intensity of environmental sound;
obtaining a corresponding endpoint detection threshold value from a mapping table of the environmental sound intensity and the endpoint detection threshold value according to the acquired environmental sound intensity;
acquiring audio data, and performing endpoint detection on the audio data by using the obtained endpoint detection threshold;
and carrying out voice recognition on the audio data after the endpoint detection.
2. The method for speech recognition according to claim 1, wherein the step of obtaining the reference value of the endpoint detection result, and determining the endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve and the reference value of the endpoint detection result of each piece of test audio data, specifically comprises:
acquiring a reference value of an end point detection result, and constructing an end point detection result reference area according to the reference value of the end point detection result;
and determining the endpoint detection threshold value under the environment sound of the corresponding intensity level according to the intersection area of the endpoint detection threshold value-endpoint detection result curve of each test audio data and the endpoint detection result reference area.
3. The method for speech recognition according to claim 1, wherein the step of obtaining the intensity of the ambient sound comprises: acquiring the environmental sound intensity in a preset time period at regular time; and counting the environmental sound intensity in a preset time period to obtain the environmental sound intensity.
4. The method for speech recognition according to claim 1, wherein the reference value of the endpoint detection result is determined through testing, specifically: collecting dialogue audio at three speech speeds, namely slow, normal and fast; counting the durations of periods without dialogue content, calculating, under the assumption of a normal distribution, the minimum duration at the position of three times the standard deviation, and determining the reference value of the endpoint detection result from the ratio of this minimum duration to the minimum time interval of endpoint detection.
5. The method of claim 1, further comprising, prior to the step of obtaining the ambient sound intensity: judging whether a user exists, and acquiring the environmental sound intensity when the user exists.
6. The method of any of claims 1-5, wherein the endpoint detection threshold is a volume threshold.
7. A speech recognition system for adaptive endpoint detection, comprising:
the environment sound construction module is used for constructing environment sounds with different intensity levels;
the test audio data acquisition module is used for playing a test sound source under the environment sound of each intensity level, acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
the first endpoint detection module is used for carrying out endpoint detection on each piece of test audio data to obtain an endpoint detection threshold value-endpoint detection result curve of the test audio data;
the threshold calculation module is used for obtaining a reference value of the endpoint detection result, and determining an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
the threshold mapping table module is used for summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
the environment sound detection module is used for acquiring the intensity of the environment sound;
the endpoint detection threshold determining module is used for obtaining a corresponding endpoint detection threshold from a mapping table of the environmental sound intensity and the endpoint detection threshold according to the acquired environmental sound intensity;
the second endpoint detection module is used for acquiring audio data and detecting endpoints of the audio data by utilizing the obtained endpoint detection threshold;
and the voice recognition module is used for carrying out voice recognition on the audio data after the endpoint detection.
8. The adaptive end-point detected speech recognition system of claim 7, further comprising a user detection module configured to determine whether a user is present, and when it is determined that a user is present, trigger the ambient sound detection module to obtain the ambient sound intensity.
9. A smart device comprising the speech recognition system of adaptive endpoint detection of any of claims 7-8.
10. A machine readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method of adaptive endpoint detection according to any of claims 1-6.
CN202010633139.5A 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device Active CN111816217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010633139.5A CN111816217B (en) 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010633139.5A CN111816217B (en) 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device

Publications (2)

Publication Number Publication Date
CN111816217A CN111816217A (en) 2020-10-23
CN111816217B true CN111816217B (en) 2024-02-09

Family

ID=72855130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010633139.5A Active CN111816217B (en) 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device

Country Status (1)

Country Link
CN (1) CN111816217B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100082239A (en) * 2009-01-08 2010-07-16 주식회사 코아로직 Device and method for stabilizing voice source and communication apparatus comprising the same device
JP2015022112A (en) * 2013-07-18 2015-02-02 独立行政法人産業技術総合研究所 Voice activity detection device and method
CN106663445A (en) * 2014-08-18 2017-05-10 索尼公司 Voice processing device, voice processing method, and program
CN107331386A (en) * 2017-06-26 2017-11-07 上海智臻智能网络科技股份有限公司 End-point detecting method, device, processing system and the computer equipment of audio signal
US9852620B1 (en) * 2014-09-19 2017-12-26 Thomas John Hoeft System and method for detecting sound and performing an action on the detected sound
CN108877776A (en) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109473092A (en) * 2018-12-03 2019-03-15 珠海格力电器股份有限公司 A kind of sound end detecting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9524735B2 (en) * 2014-01-31 2016-12-20 Apple Inc. Threshold adaptation in two-channel noise estimation and voice activity detection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100082239A (en) * 2009-01-08 2010-07-16 주식회사 코아로직 Device and method for stabilizing voice source and communication apparatus comprising the same device
JP2015022112A (en) * 2013-07-18 2015-02-02 独立行政法人産業技術総合研究所 Voice activity detection device and method
CN106663445A (en) * 2014-08-18 2017-05-10 索尼公司 Voice processing device, voice processing method, and program
US9852620B1 (en) * 2014-09-19 2017-12-26 Thomas John Hoeft System and method for detecting sound and performing an action on the detected sound
CN107331386A (en) * 2017-06-26 2017-11-07 上海智臻智能网络科技股份有限公司 End-point detecting method, device, processing system and the computer equipment of audio signal
CN108877776A (en) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109473092A (en) * 2018-12-03 2019-03-15 珠海格力电器股份有限公司 A kind of sound end detecting method and device

Also Published As

Publication number Publication date
CN111816217A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN112004177B (en) Howling detection method, microphone volume adjustment method and storage medium
US6782363B2 (en) Method and apparatus for performing real-time endpoint detection in automatic speech recognition
CN103632666A (en) Voice recognition method, voice recognition equipment and electronic equipment
CN113129917A (en) Speech processing method based on scene recognition, and apparatus, medium, and system thereof
CN107331386B (en) Audio signal endpoint detection method and device, processing system and computer equipment
US20060100866A1 (en) Influencing automatic speech recognition signal-to-noise levels
CN110808030B (en) Voice awakening method, system, storage medium and electronic equipment
WO2023137861A1 (en) Divisive normalization method, device, audio feature extractor and a chip
US20190302916A1 (en) Near ultrasound based proximity sensing for mobile devices
CN112185408A (en) Audio noise reduction method and device, electronic equipment and storage medium
CN113014844A (en) Audio processing method and device, storage medium and electronic equipment
CN111048118A (en) Voice signal processing method and device and terminal
CN114627899A (en) Sound signal detection method and device, computer readable storage medium and terminal
CN112992153B (en) Audio processing method, voiceprint recognition device and computer equipment
CN111816217B (en) Self-adaptive endpoint detection voice recognition method and system and intelligent device
CN110556128B (en) Voice activity detection method and device and computer readable storage medium
CN111986694B (en) Audio processing method, device, equipment and medium based on transient noise suppression
CN113409800A (en) Processing method and device for monitoring audio, storage medium and electronic equipment
CN113270118B (en) Voice activity detection method and device, storage medium and electronic equipment
CN115359804A (en) Directional audio pickup method and system based on microphone array
CN108899041B (en) Voice signal noise adding method, device and storage medium
CN114255779A (en) Audio noise reduction method for VR device, electronic device and storage medium
CN112653979A (en) Adaptive dereverberation method and device
CN111128199A (en) Sensitive speaker monitoring and recording control method and system based on deep learning
EP4307297A1 (en) Method and apparatus for switching main microphone, voice detection method and apparatus for microphone, microphone-loudspeaker integrated device, and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant