CN111816217B - Self-adaptive endpoint detection voice recognition method and system and intelligent device - Google Patents

Self-adaptive endpoint detection voice recognition method and system and intelligent device

Info

Publication number
CN111816217B
Authority
CN
China
Prior art keywords
endpoint detection
audio data
intensity
end point
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010633139.5A
Other languages
Chinese (zh)
Other versions
CN111816217A (en)
Inventor
肖积涛
耿士顶
孙非凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Aoto Electronics Co ltd
Original Assignee
Nanjing Aoto Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Aoto Electronics Co ltd filed Critical Nanjing Aoto Electronics Co ltd
Priority to CN202010633139.5A priority Critical patent/CN111816217B/en
Publication of CN111816217A publication Critical patent/CN111816217A/en
Application granted granted Critical
Publication of CN111816217B publication Critical patent/CN111816217B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786 Adaptive threshold

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice recognition method, a voice recognition system and an intelligent device with self-adaptive endpoint detection. The voice recognition method comprises the steps of constructing environmental sounds with different intensity levels; playing a test sound source under the environmental sound of each intensity level, acquiring test audio data, and performing endpoint detection; determining an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the endpoint detection result reference value, and summarizing to obtain a mapping table of environmental sound intensity and endpoint detection threshold; acquiring the intensity of the environmental sound; obtaining the corresponding endpoint detection threshold from the mapping table according to the acquired environmental sound intensity; and performing endpoint detection on the audio data and then performing speech recognition. The method adapts well to the current environmental noise, so that endpoint detection is more accurate and the accuracy of voice recognition is higher and is not affected by the environmental noise.

Description

Self-adaptive endpoint detection voice recognition method and system and intelligent device
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and system for speech recognition with adaptive endpoint detection, and an intelligent device.
Background
With the continuous development of speech recognition technology, it is being applied in more and more scenarios. For example, at service outlets and places such as banks, business halls of communication carriers, administrative service halls and shopping malls, or in scenarios such as teleconference systems, corresponding operations, such as queuing, handling a specific business or identifying the speaker, need to be performed based on audio information.
In speech recognition, it is generally necessary to perform endpoint detection on the collected audio data to identify the speech segments, and then to pass those segments to a processor or a speech recognition engine for recognition. In this way, invalid sound data does not occupy resources such as storage space and processor time, resource waste is avoided, and system overhead is reduced.
Traditional voice endpoint detection technology is mainly based on feature extraction: a characteristic parameter in the time domain/frequency domain, such as a short-time energy value or a zero-crossing rate, is first extracted from the audio signal; the parameter is then compared with a preset threshold, and the segment is judged to be speech if the parameter exceeds the threshold.
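For concreteness, a minimal sketch of such a fixed-threshold scheme is given below, using short-time frame energy compared against a preset constant; the frame length, threshold value and synthetic test signal are illustrative assumptions rather than values from this disclosure.

```python
import numpy as np

def fixed_threshold_endpoint_detection(samples: np.ndarray, frame_len: int = 400,
                                       energy_threshold: float = 0.01):
    """Classic endpoint detection: mark frames whose short-time energy exceeds a
    fixed, preset threshold and merge consecutive voiced frames into segments."""
    voiced = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced.append(float(np.mean(frame ** 2)) > energy_threshold)  # short-time energy

    segments, start = [], None                     # (start_frame, end_frame) pairs
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(voiced) - 1))
    return segments

# One second of low-level noise with a louder burst in the middle (16 kHz).
rng = np.random.default_rng(0)
audio = 0.02 * rng.standard_normal(16000)
audio[6000:10000] += 0.5 * np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
print(fixed_threshold_endpoint_detection(audio))
```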
The thresholds for these characteristic parameters are typically set to fixed values. If the threshold of the characteristic parameter is set too low, a large amount of invalid audio data is passed on to subsequent speech recognition, which affects recognition efficiency and accuracy; moreover, because the additional, invalid audio data is processed, overhead increases. If the threshold of the characteristic parameter is set too high, part of the valid audio data may be filtered out, reducing speech recognition accuracy.
Therefore, the existing endpoint detection schemes cannot be effectively matched to the environment in which the recording equipment is located, so that voice recognition results under different environmental noises differ greatly and the accuracy of voice recognition is unstable; in particular, the accuracy of voice recognition is low in non-stationary and complex noise environments.
Disclosure of Invention
Based on the above, it is necessary to provide a voice recognition method, a system and an intelligent device with adaptive endpoint detection, addressing the problems of the existing endpoint detection schemes that voice recognition results differ greatly under different environmental noises and that the accuracy of voice recognition is unstable.
An embodiment of the present application provides a speech recognition method for adaptive endpoint detection, including:
constructing environmental sounds with different intensity levels;
playing a test sound source under the environment sound of each intensity level, and acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
performing endpoint detection on each piece of test audio data to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
acquiring a reference value of an endpoint detection result, and determining an endpoint detection threshold under the environmental sound of a corresponding intensity level according to an endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
acquiring the intensity of environmental sound;
obtaining a corresponding endpoint detection threshold value from a mapping table of the environmental sound intensity and the endpoint detection threshold value according to the acquired environmental sound intensity;
acquiring audio data, and performing endpoint detection on the audio data by using the obtained endpoint detection threshold;
and carrying out voice recognition on the audio data after the endpoint detection.
In some embodiments, the step of obtaining the reference value of the endpoint detection result, determining the endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each test audio data and the reference value of the endpoint detection result, specifically includes:
acquiring a reference value of an end point detection result, and constructing an end point detection result reference area according to the reference value of the end point detection result;
and determining the endpoint detection threshold value under the environment sound of the corresponding intensity level according to the intersection area of the endpoint detection threshold value-endpoint detection result curve of each test audio data and the endpoint detection result reference area.
In some embodiments, the step of obtaining the intensity of the environmental sound specifically includes: acquiring the environmental sound intensity in a preset time period at regular time; and counting the environmental sound intensity in a preset time period to obtain the environmental sound intensity.
In some embodiments, the reference value of the endpoint detection result is determined through testing, specifically: collecting dialogue audio at three speech speeds, namely slow, normal and fast; counting the durations of periods without dialogue content, calculating, under the assumption of a normal distribution, the minimum duration at the position of three times the standard deviation, and determining the reference value of the endpoint detection result from the ratio of this minimum duration to the minimum time interval of endpoint detection.
In some embodiments, before the step of acquiring the ambient sound intensity, further comprising: judging whether a user exists, and acquiring the environmental sound intensity when the user exists.
In some embodiments, the endpoint detection threshold is a volume threshold.
Another embodiment of the present application provides a speech recognition system for adaptive endpoint detection, comprising:
the environment sound construction module is used for constructing environment sounds with different intensity levels;
the test audio data acquisition module is used for playing a test sound source under the environment sound of each intensity level, acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
the first endpoint detection module is used for carrying out endpoint detection on each piece of test audio data to obtain an endpoint detection threshold value-endpoint detection result curve of the test audio data;
the threshold calculation module is used for obtaining a reference value of the endpoint detection result, and determining an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
the threshold mapping table module is used for summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
the environment sound detection module is used for acquiring the intensity of the environment sound;
the endpoint detection threshold determining module is used for obtaining a corresponding endpoint detection threshold from a mapping table of the environmental sound intensity and the endpoint detection threshold according to the acquired environmental sound intensity;
the second endpoint detection module is used for acquiring audio data and detecting endpoints of the audio data by utilizing the obtained endpoint detection threshold;
and the voice recognition module is used for carrying out voice recognition on the audio data after the endpoint detection.
In some embodiments, the system may further include a user detection module, configured to determine whether there is a user, and trigger the ambient sound detection module to acquire the ambient sound intensity when it is determined that there is a user.
An embodiment of the present application further provides an intelligent device, including a speech recognition system for adaptive endpoint detection according to any one of the preceding embodiments.
Another embodiment of the present application also provides a machine-readable storage medium, on which is stored a computer program which, when executed by a processor, implements the speech recognition method of adaptive endpoint detection of any of the previous embodiments.
According to the voice recognition scheme with self-adaptive endpoint detection, the corresponding endpoint detection threshold can be automatically matched according to the current environmental sound intensity, so the scheme adapts well to the current environmental noise, endpoint detection is more accurate, the accuracy of voice recognition is higher and is not affected by the environmental noise, and consistently accurate voice recognition results can be obtained even in non-stationary and complex noise environments.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of an endpoint detection threshold-endpoint detection result curve and an endpoint detection result reference region for testing audio data according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a frame structure of a speech recognition system according to an embodiment of the present application.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. In addition, embodiments of the present application and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present application discloses a speech recognition method for adaptive endpoint detection, which includes:
step S110, constructing environmental sounds with different intensity levels;
the voice recognition method in this embodiment may be performed by a voice recognition system, or may be performed by other functional devices. The following describes a speech recognition method using a speech recognition system as an execution subject.
In different speech recognition scenarios, the ambient sound, i.e. the ambient noise, may have different intensity levels. A plurality of intensity levels may be divided according to the intensity of the ambient sound. When constructing the environmental sounds with different intensity levels, a segment of environmental sound can be collected, and environmental sounds with multiple intensity levels can then be obtained by adjusting its gain.
For example, the intensity of ambient sound may be characterized in decibel values. Taking an ambient sound intensity coverage of 0-120 dB as an example, it can be divided into 7 intensity classes: 0-40, 40-50, 50-60, 60-70, 70-80, 80-90 and 90-120 decibels. When determining the gain magnitude of the ambient sound for each intensity level, the midpoint of the intensity level range may be taken as the gain value. For example, for the intensity level of 60-70, the corresponding gain value may be taken as 65.
It will be appreciated that other divisions of the intensity level of the ambient sound may be made as desired. The gain for each intensity level may take other values within the range of intensity levels.
It will be appreciated that different ambient sounds may be employed for different intensity levels when constructing ambient sounds of different intensity levels.
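As an illustration of this construction step, the following sketch scales one recorded ambient-noise clip so that its RMS level sits at the midpoint of each intensity class listed above; the dB reference, the helper names and the synthetic stand-in clip are assumptions made only for the example.

```python
import numpy as np

# Intensity classes (dB) from the example above; the midpoint of each range is
# used as the target level, as the description suggests.
INTENSITY_CLASSES = [(0, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 120)]

def rms_db(x: np.ndarray) -> float:
    """RMS level in dB relative to an assumed unit reference."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def build_leveled_ambient_sounds(ambient: np.ndarray) -> dict:
    """Scale one recorded ambient clip so its RMS level hits the midpoint of
    every intensity class, yielding one test background per class."""
    current = rms_db(ambient)
    leveled = {}
    for low, high in INTENSITY_CLASSES:
        target = (low + high) / 2.0                 # e.g. 65 for the 60-70 class
        gain = 10.0 ** ((target - current) / 20.0)  # linear gain for the needed dB change
        leveled[(low, high)] = ambient * gain
    return leveled

rng = np.random.default_rng(1)
clip = 0.05 * rng.standard_normal(48000)            # stand-in for a recorded noise clip
for level, sound in build_leveled_ambient_sounds(clip).items():
    print(level, round(rms_db(sound), 1))
```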
Step S120, playing a test sound source under the environment sound of each intensity level, and acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
the test sound source is played under ambient sound of each intensity level. Sound may be collected using a microphone, microphone array, microphone, or other sound pickup device. According to the relative positional relationship between the sound pickup apparatus and the test sound source playback apparatus, two areas of a directional enhancement angle and a non-directional enhancement angle can be formed. Sound acquired in the directional enhancement angle is directional audio data; the sound collected in the non-directional enhancement angle is non-directional audio data, and the sound and the non-directional audio data form test audio data.
Each piece of test audio data corresponds to one intensity level of ambient sound. When there are multiple intensity levels of ambient sound, the collection of test audio data is repeated under the ambient sound of each intensity level, yielding the corresponding number of pieces of test audio data.
By simultaneously collecting the directional audio data under the directional enhancement angle and the non-directional audio data under the non-directional enhancement angle, audio from the non-directional enhancement direction is taken into account when the endpoint detection threshold is determined, so that speech from the non-directional enhancement direction can be filtered out and the speech recognition effect is improved.
Step S130, performing endpoint detection on each piece of test audio data to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
endpoint detection may use common schemes for threshold-based feature decisions, such as threshold decisions for feature parameters in the time/frequency domain, e.g., volume, short-time energy value, zero crossing rate, etc. And the endpoint detection threshold value corresponds to the selected characteristic parameter.
By way of example, the scheme is described below using endpoint detection based on a volume-threshold judgment. During endpoint detection, if the audio exceeds the volume threshold, sound is judged to be present; otherwise, sound is judged to be absent. In this case, the endpoint detection threshold is the volume threshold.
When endpoint detection is performed on each piece of test audio data, a plurality of different endpoint detection thresholds are applied. For a given piece of test audio data, each endpoint detection threshold yields a corresponding endpoint detection result, such as the number of detected endpoints.
Because each piece of test audio data comprises directional audio data and non-directional audio data, endpoint detection needs to be performed on each of them separately. With the endpoint detection threshold as the abscissa and the endpoint detection result as the ordinate, an endpoint detection threshold-endpoint detection result curve is plotted for the directional audio data and for the non-directional audio data; the two curves together give the endpoint detection threshold-endpoint detection result curve of the test audio data, as shown in fig. 3.
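A minimal sketch of how such a curve might be produced is shown below: a simple volume-threshold endpoint counter is swept over a grid of candidate thresholds for both the directional and the non-directional recording; the frame length, threshold grid and synthetic test signals are illustrative assumptions.

```python
import numpy as np

def count_endpoints(samples: np.ndarray, threshold: float, frame_len: int = 400) -> int:
    """Count endpoints: every transition from 'no sound' to 'sound' (frame volume
    crossing the threshold) is counted as one detected starting endpoint."""
    endpoints, prev_voiced = 0, False
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = float(np.max(np.abs(frame))) > threshold   # volume-threshold decision
        if voiced and not prev_voiced:
            endpoints += 1
        prev_voiced = voiced
    return endpoints

def threshold_result_curve(samples: np.ndarray, thresholds: np.ndarray):
    """Endpoint detection threshold vs. endpoint detection result curve for one
    recording: one endpoint count per candidate threshold."""
    return [count_endpoints(samples, float(t)) for t in thresholds]

rng = np.random.default_rng(2)
directional = 0.05 * rng.standard_normal(16000)
directional[4000:8000] += 0.6                        # stand-in for the test sound source
non_directional = 0.05 * rng.standard_normal(16000)  # mostly background noise

grid = np.linspace(0.01, 1.0, 50)                    # candidate endpoint detection thresholds
print(threshold_result_curve(directional, grid)[:10])
print(threshold_result_curve(non_directional, grid)[:10])
```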
Step S140, obtaining the reference value of the endpoint detection result, and determining the endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve and the endpoint detection result reference value of each piece of test audio data;
the reference value of the end point detection result can be set empirically or can be determined through testing. For example, the reference value of the end point detection result may be determined by a test. The method can enable a person to read dialogue contents or enable a plurality of persons to conduct dialogue, the dialogue speed is divided into three types of slow speed, normal speed and fast speed, and dialogue audio frequencies under the three speeds are collected; and counting the time period without dialogue content, calculating the minimum value of the time period of the 3 sigma (sigma is the standard deviation of the normal distribution) position according to the normal distribution form, and obtaining the reference value of the endpoint detection result according to the ratio of the minimum value of the time period to the minimum time interval of the endpoint detection.
In some embodiments, in step S140, a horizontal line whose ordinate equals the reference value of the endpoint detection result, i.e. the endpoint detection result reference line, may be constructed; the intersection points of the endpoint detection threshold-endpoint detection result curves of the directional audio data and of the non-directional audio data with this reference line are then determined for each piece of test audio data, and the endpoint detection threshold corresponding to the centre point of these intersection points is taken as the endpoint detection threshold under the environmental sound of the corresponding intensity level.
In some embodiments, step S140 may specifically include:
acquiring a reference value of an end point detection result, and constructing an end point detection result reference area according to the reference value of the end point detection result;
and determining the endpoint detection threshold value under the environment sound of the corresponding intensity level according to the intersection area of the endpoint detection threshold value-endpoint detection result curve of each test audio data and the endpoint detection result reference area.
When constructing the endpoint detection result reference region, a fluctuation interval around the reference value of the endpoint detection result may be set. By way of example, as shown in fig. 3, taking a reference value of 12 for the endpoint detection result, the fluctuation interval may be set to ±3. That is, the endpoint detection result reference region is the band of total width 6 around the reference value, bounded below by an endpoint detection result of 9 and above by 15.
Then, for each piece of test audio data, the intersection of the endpoint detection threshold-endpoint detection result curve of the directional audio data, the endpoint detection threshold-endpoint detection result curve of the non-directional audio data and the endpoint detection result reference region is determined, and the endpoint detection threshold corresponding to the centre point of this intersection region is calculated as the endpoint detection threshold under the environmental sound of the corresponding intensity level.
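A sketch of this selection step is given below, under the assumption that the reference band of fig. 3 (reference value 12, fluctuation ±3) is applied to both curves and that the chosen threshold is the centre of all thresholds whose results fall inside the band; the stand-in curves are illustrative only.

```python
import numpy as np

def threshold_from_reference_region(thresholds: np.ndarray,
                                    curve_directional: np.ndarray,
                                    curve_non_directional: np.ndarray,
                                    reference: float, fluctuation: float) -> float:
    """Pick the endpoint detection threshold for one intensity level: collect all
    curve points whose endpoint detection result lies inside
    [reference - fluctuation, reference + fluctuation] and return the threshold
    at the centre of that intersection region."""
    low, high = reference - fluctuation, reference + fluctuation
    hits = []
    for curve in (curve_directional, curve_non_directional):
        mask = (curve >= low) & (curve <= high)
        hits.extend(thresholds[mask].tolist())
    if not hits:
        raise ValueError("no intersection with the reference region")
    return (min(hits) + max(hits)) / 2.0            # centre of the intersection region

grid = np.linspace(0.01, 1.0, 50)
# Stand-in curves: endpoint counts that fall off as the threshold rises.
curve_dir = np.maximum(0, 40 - 45 * grid).round()
curve_non = np.maximum(0, 25 - 40 * grid).round()
print(round(threshold_from_reference_region(grid, curve_dir, curve_non,
                                            reference=12, fluctuation=3), 3))
```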
Step S150, summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
for example, a mapping table of ambient sound intensity and endpoint detection threshold may be as shown in table 1 below.
Table 1 mapping table of ambient sound intensity and endpoint detection threshold
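The granted text refers to Table 1 without reproducing its values here, so the sketch below shows only the shape such a mapping table might take and how a threshold could be looked up from it; every threshold value is a made-up placeholder, not data from the patent.

```python
# Hypothetical mapping table: ambient sound intensity range (dB) mapped to an
# endpoint detection (volume) threshold. Threshold values are placeholders only.
INTENSITY_TO_THRESHOLD = [
    ((0, 40), 0.05),
    ((40, 50), 0.08),
    ((50, 60), 0.12),
    ((60, 70), 0.18),
    ((70, 80), 0.25),
    ((80, 90), 0.35),
    ((90, 120), 0.50),
]

def lookup_endpoint_threshold(ambient_db: float) -> float:
    """Return the endpoint detection threshold for the measured ambient sound
    intensity; values outside the table clamp to the nearest entry."""
    if ambient_db < INTENSITY_TO_THRESHOLD[0][0][0]:
        return INTENSITY_TO_THRESHOLD[0][1]
    for (low, high), threshold in INTENSITY_TO_THRESHOLD:
        if low <= ambient_db < high:
            return threshold
    return INTENSITY_TO_THRESHOLD[-1][1]

print(lookup_endpoint_threshold(65.0))   # falls in the 60-70 dB class -> 0.18
```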
Step S200, obtaining the intensity of the environmental sound;
the acquisition of the ambient sound intensity may be performed in real time or at regular time. In a speech recognition system, a sound sensor may be provided to capture the ambient sound intensity. The sound sensor can collect the environmental sound intensity in real time and can store the collected environmental sound intensity in a memory or a buffer. The ambient sound intensity in step S200 may be the ambient sound intensity collected in real time, or may be determined according to the ambient sound intensity within a preset time period.
In some embodiments, in step S200, the ambient sound intensity is determined according to the ambient sound intensity within a preset period of time. In general, there may be a sudden increase in ambient noise at a particular moment, but the overall trend of ambient noise changes slowly. Such abrupt environmental noise generally has limited impact on endpoint detection. The environmental sound intensity is determined by utilizing the environmental sound intensity in the preset time period, so that the endpoint detection threshold value obtained later is stable, and the fluctuation of the voice recognition accuracy is avoided.
Step S200 may specifically be: acquiring, at regular intervals, the environmental sound intensity within a preset time period; and performing statistics on the environmental sound intensity within the preset time period to obtain the environmental sound intensity. For example, step S200 may be performed once every minute, and the preset time period may be set to the 5 minutes preceding the execution time. When executing step S200, all the environmental sound intensities within the preceding 5 minutes are first obtained; the average of these intensities is then calculated as the environmental sound intensity. The statistics over the preset time period may also use the median, the mode, or a combination of them, for example the average of the mean and the median.
It is understood that the time interval of the timing and the length of the preset time period can be freely determined according to actual needs.
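A minimal sketch of such timed statistics over a rolling window is shown below; the one-minute cadence, five-minute window and use of the mean follow the example above, while the class and method names are assumptions.

```python
from collections import deque
from statistics import mean, median
from typing import Optional
import time

class AmbientSoundMonitor:
    """Keep recent ambient sound intensity samples and summarise them over a
    preset window (here the last 5 minutes), e.g. once per minute, by the mean."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.samples = deque()                       # (timestamp, intensity in dB)

    def add_sample(self, intensity_db: float, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples.append((now, intensity_db))
        # Drop samples that fall outside the preset time window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def current_intensity(self, use_median: bool = False) -> float:
        """Statistic over the window: mean by default, median as an alternative."""
        values = [v for _, v in self.samples]
        return median(values) if use_median else mean(values)

monitor = AmbientSoundMonitor()
for t, level in enumerate([62.0, 64.5, 70.0, 63.0, 65.5]):
    monitor.add_sample(level, now=float(t))
print(round(monitor.current_intensity(), 1))         # mean over the window: 65.0
```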
Step S300, according to the acquired environmental sound intensity, obtaining a corresponding endpoint detection threshold value from a mapping table of the environmental sound intensity and the endpoint detection threshold value;
step S400, obtaining audio data, and detecting the end point of the audio data by using the obtained end point detection threshold;
step S500, voice recognition is performed on the audio data after the end point detection.
Audio data may be acquired using a microphone, a microphone array, or another sound pickup device.
Endpoint detection may use common threshold-based feature decision schemes, such as threshold decisions on characteristic parameters in the time/frequency domain, e.g., volume, short-time energy value or zero-crossing rate. The endpoint detection threshold corresponds to the selected characteristic parameter. In some embodiments, endpoint detection uses a volume-threshold judgment: if the volume threshold is exceeded, sound is judged to be present; otherwise, sound is judged to be absent. In this case, the endpoint detection threshold is the volume threshold.
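Putting steps S200 to S500 together, one possible runtime flow is sketched below: measure the ambient intensity, look up the matching volume threshold, run volume-threshold endpoint detection on the captured audio, and pass only the detected segments on for recognition; the helper names, mapping values and test signal are assumptions for illustration.

```python
import numpy as np

# Hypothetical mapping table (see the earlier sketch); threshold values are placeholders.
INTENSITY_TO_THRESHOLD = [((0, 60), 0.10), ((60, 80), 0.20), ((80, 120), 0.35)]

def lookup_threshold(ambient_db: float) -> float:
    for (low, high), thr in INTENSITY_TO_THRESHOLD:
        if low <= ambient_db < high:
            return thr
    return INTENSITY_TO_THRESHOLD[-1][1]

def detect_segments(samples: np.ndarray, threshold: float, frame_len: int = 400):
    """Volume-threshold endpoint detection: return only the voiced segments."""
    segments, start = [], None
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = float(np.max(np.abs(frame))) > threshold
        if voiced and start is None:
            start = i * frame_len
        elif not voiced and start is not None:
            segments.append(samples[start:i * frame_len])
            start = None
    if start is not None:
        segments.append(samples[start:])
    return segments

def recognize(segment: np.ndarray) -> str:
    """Stand-in for handing the segment to the speech recognition engine."""
    return f"<{len(segment)} samples passed to the recognizer>"

ambient_db = 65.0                                     # assumed measured ambient intensity
threshold = lookup_threshold(ambient_db)              # adaptive endpoint detection threshold
rng = np.random.default_rng(3)
audio = 0.05 * rng.standard_normal(16000)
audio[5000:9000] += 0.6                               # stand-in for a spoken command
for seg in detect_segments(audio, threshold):
    print(recognize(seg))
```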
According to the voice recognition method with self-adaptive endpoint detection described above, the corresponding endpoint detection threshold can be automatically matched according to the current environmental sound intensity, so the method adapts well to the current environmental noise, endpoint detection is more accurate, the accuracy of voice recognition is higher and is not affected by the environmental noise, and consistently accurate voice recognition results can be obtained even in non-stationary and complex noise environments.
In some embodiments, as shown in fig. 2, before step S200, it may further include:
step S205, judging whether a user exists, and when the user exists, entering step S200 to acquire the intensity of the environmental sound; otherwise, continuing to judge whether the user is the user.
An infrared sensor may be provided in the speech recognition system to determine whether a user is present in the vicinity of the speech recognition system. If it is determined that there is no user, speech recognition need not be performed. Therefore, the voice recognition action can be triggered only when the user is nearby, so that the energy consumption can be effectively reduced, and the method is more energy-saving and environment-friendly.
In some embodiments, after step S200, it may further include:
when the environmental sound intensity exceeds a reminding threshold, the user may be prompted, for example reminded that the current environment is noisy and asked to speak louder or to speak closer to the sound pickup device.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments and that the acts referred to are not necessarily required by the embodiments of the present application.
As shown in fig. 4, an embodiment of the present application discloses a speech recognition system for adaptive endpoint detection, including:
an ambient sound construction module 110 for constructing ambient sounds of different intensity levels;
the test audio data obtaining module 120 is configured to play a test sound source under the environmental sound of each intensity level, collect test audio data, where each test audio data corresponds to the environmental sound of one intensity level, and each test audio data includes directional audio data collected under a directional enhancement angle and non-directional audio data collected under a non-directional enhancement angle;
a first endpoint detection module 130, configured to perform endpoint detection on each piece of test audio data, so as to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
the threshold calculation module 140 is configured to obtain a reference value of the endpoint detection result, and determine an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve and the endpoint detection result reference value of each piece of test audio data;
the threshold mapping table module 150 is configured to aggregate the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
an ambient sound detection module 200 for acquiring an ambient sound intensity;
the endpoint detection threshold determining module 300 is configured to obtain, according to the obtained environmental sound intensity, a corresponding endpoint detection threshold from a mapping table of the environmental sound intensity and the endpoint detection threshold;
a second endpoint detection module 400, configured to obtain audio data, and perform endpoint detection on the audio data by using the obtained endpoint detection threshold;
the voice recognition module 500 is configured to perform voice recognition on the audio data after the endpoint detection.
In some embodiments, the apparatus may further include a user detection module 205 configured to determine whether there is a user, and trigger the ambient sound detection module 200 to obtain the ambient sound intensity when it is determined that there is a user.
In some embodiments, an alert module may be further included, configured to prompt the user when the environmental sound intensity exceeds the reminding threshold, for example to remind the user that the current environment is relatively noisy and ask them to speak louder or to speak closer to the sound pickup device.
The specific operation modes of the environmental sound construction module 110, the test audio data acquisition module 120, the first endpoint detection module 130, the threshold calculation module 140, the threshold mapping table module 150, the environmental sound detection module 200, the endpoint detection threshold determination module 300, the second endpoint detection module 400, the voice recognition module 500, the user detection module 205, and the alert module may be referred to the description in the foregoing method embodiments, and will not be repeated herein.
According to the voice recognition scheme with self-adaptive endpoint detection, the corresponding endpoint detection threshold can be automatically matched according to the current environmental sound intensity, so the scheme adapts well to the current environmental noise, endpoint detection is more accurate, the accuracy of voice recognition is higher and is not affected by the environmental noise, and consistently accurate voice recognition results can be obtained even in non-stationary and complex noise environments.
An embodiment of the present application further provides an intelligent device, which may include the foregoing adaptive endpoint detection voice recognition system, or perform the foregoing adaptive endpoint detection voice recognition method.
An embodiment of the present application provides a machine-readable storage medium, on which is stored a computer program, which when executed by a processor, implements the speech recognition method of adaptive endpoint detection described in any of the above embodiments.
The components/modules/units integrated in the system/computer apparatus, if implemented as software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may also be implemented by a computer program instructing the relevant hardware to carry out all or part of the flow of the method of the above embodiments; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided herein, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the components is merely a logical functional division, and additional divisions may be implemented in practice.
In addition, each functional module/component in the embodiments of the present invention may be integrated in the same processing module/component, or each module/component may exist alone physically, or two or more modules/components may be integrated in the same module/component. The integrated modules/components described above may be implemented in hardware or in hardware plus software functional modules/components.
It will be evident to those skilled in the art that the embodiments of the invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units, modules or means recited in a system, means or terminal claim may also be implemented by means of software or hardware by means of one and the same unit, module or means. The terms first, second, etc. are used to denote a name, but not any particular order.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A method of speech recognition for adaptive endpoint detection, comprising:
constructing environmental sounds with different intensity levels;
playing a test sound source under the environment sound of each intensity level, and acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
performing endpoint detection on each piece of test audio data to obtain an endpoint detection threshold-endpoint detection result curve of the test audio data;
acquiring a reference value of an endpoint detection result, and determining an endpoint detection threshold under the environmental sound of a corresponding intensity level according to an endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
acquiring the intensity of environmental sound;
obtaining a corresponding endpoint detection threshold value from a mapping table of the environmental sound intensity and the endpoint detection threshold value according to the acquired environmental sound intensity;
acquiring audio data, and performing endpoint detection on the audio data by using the obtained endpoint detection threshold;
and carrying out voice recognition on the audio data after the endpoint detection.
2. The method for speech recognition according to claim 1, wherein the step of obtaining the reference value of the endpoint detection result, and determining the endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve and the reference value of the endpoint detection result of each piece of test audio data, specifically comprises:
acquiring a reference value of an end point detection result, and constructing an end point detection result reference area according to the reference value of the end point detection result;
and determining the endpoint detection threshold value under the environment sound of the corresponding intensity level according to the intersection area of the endpoint detection threshold value-endpoint detection result curve of each test audio data and the endpoint detection result reference area.
3. The method for speech recognition according to claim 1, wherein the step of obtaining the intensity of the ambient sound comprises: acquiring the environmental sound intensity in a preset time period at regular time; and counting the environmental sound intensity in a preset time period to obtain the environmental sound intensity.
4. The method for speech recognition according to claim 1, wherein the reference value of the endpoint detection result is determined through testing, specifically: collecting dialogue audio at three speech speeds, namely slow, normal and fast; counting the durations of periods without dialogue content, calculating, under the assumption of a normal distribution, the minimum duration at the position of three times the standard deviation, and determining the reference value of the endpoint detection result from the ratio of this minimum duration to the minimum time interval of endpoint detection.
5. The method of claim 1, further comprising, prior to the step of obtaining the ambient sound intensity: judging whether a user exists, and acquiring the environmental sound intensity when the user exists.
6. The method of any of claims 1-5, wherein the endpoint detection threshold is a volume threshold.
7. A speech recognition system for adaptive endpoint detection, comprising:
the environment sound construction module is used for constructing environment sounds with different intensity levels;
the test audio data acquisition module is used for playing a test sound source under the environment sound of each intensity level, acquiring test audio data, wherein each test audio data corresponds to the environment sound of one intensity level, and each test audio data comprises directional audio data acquired under a directional enhancement angle and non-directional audio data acquired under a non-directional enhancement angle;
the first endpoint detection module is used for carrying out endpoint detection on each piece of test audio data to obtain an endpoint detection threshold value-endpoint detection result curve of the test audio data;
the threshold calculation module is used for obtaining a reference value of the endpoint detection result, and determining an endpoint detection threshold under the environmental sound of the corresponding intensity level according to the endpoint detection threshold-endpoint detection result curve of each piece of test audio data and the reference value of the endpoint detection result;
the threshold mapping table module is used for summarizing the endpoint detection thresholds under the environmental sounds of all intensity levels to obtain a mapping table of environmental sound intensity and endpoint detection threshold;
the environment sound detection module is used for acquiring the intensity of the environment sound;
the endpoint detection threshold determining module is used for obtaining a corresponding endpoint detection threshold from a mapping table of the environmental sound intensity and the endpoint detection threshold according to the acquired environmental sound intensity;
the second endpoint detection module is used for acquiring audio data and detecting endpoints of the audio data by utilizing the obtained endpoint detection threshold;
and the voice recognition module is used for carrying out voice recognition on the audio data after the endpoint detection.
8. The adaptive end-point detected speech recognition system of claim 7, further comprising a user detection module configured to determine whether a user is present, and when it is determined that a user is present, trigger the ambient sound detection module to obtain the ambient sound intensity.
9. A smart device comprising the speech recognition system of adaptive endpoint detection of any of claims 7-8.
10. A machine readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method of adaptive endpoint detection according to any of claims 1-6.
CN202010633139.5A 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device Active CN111816217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010633139.5A CN111816217B (en) 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010633139.5A CN111816217B (en) 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device

Publications (2)

Publication Number Publication Date
CN111816217A CN111816217A (en) 2020-10-23
CN111816217B true CN111816217B (en) 2024-02-09

Family

ID=72855130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010633139.5A Active CN111816217B (en) 2020-07-02 2020-07-02 Self-adaptive endpoint detection voice recognition method and system and intelligent device

Country Status (1)

Country Link
CN (1) CN111816217B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100082239A (en) * 2009-01-08 2010-07-16 주식회사 코아로직 Device and method for stabilizing voice source and communication apparatus comprising the same device
JP2015022112A (en) * 2013-07-18 2015-02-02 独立行政法人産業技術総合研究所 Voice activity detection device and method
CN106663445A (en) * 2014-08-18 2017-05-10 索尼公司 Voice processing device, voice processing method, and program
CN107331386A (en) * 2017-06-26 2017-11-07 上海智臻智能网络科技股份有限公司 End-point detecting method, device, processing system and the computer equipment of audio signal
US9852620B1 (en) * 2014-09-19 2017-12-26 Thomas John Hoeft System and method for detecting sound and performing an action on the detected sound
CN108877776A (en) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109473092A (en) * 2018-12-03 2019-03-15 珠海格力电器股份有限公司 A kind of sound end detecting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9524735B2 (en) * 2014-01-31 2016-12-20 Apple Inc. Threshold adaptation in two-channel noise estimation and voice activity detection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100082239A (en) * 2009-01-08 2010-07-16 주식회사 코아로직 Device and method for stabilizing voice source and communication apparatus comprising the same device
JP2015022112A (en) * 2013-07-18 2015-02-02 独立行政法人産業技術総合研究所 Voice activity detection device and method
CN106663445A (en) * 2014-08-18 2017-05-10 索尼公司 Voice processing device, voice processing method, and program
US9852620B1 (en) * 2014-09-19 2017-12-26 Thomas John Hoeft System and method for detecting sound and performing an action on the detected sound
CN107331386A (en) * 2017-06-26 2017-11-07 上海智臻智能网络科技股份有限公司 End-point detecting method, device, processing system and the computer equipment of audio signal
CN108877776A (en) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109473092A (en) * 2018-12-03 2019-03-15 珠海格力电器股份有限公司 A kind of sound end detecting method and device

Also Published As

Publication number Publication date
CN111816217A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN112004177B (en) Howling detection method, microphone volume adjustment method and storage medium
US6782363B2 (en) Method and apparatus for performing real-time endpoint detection in automatic speech recognition
CN103632666A (en) Voice recognition method, voice recognition equipment and electronic equipment
CN113129917A (en) Speech processing method based on scene recognition, and apparatus, medium, and system thereof
CN107331386B (en) Audio signal endpoint detection method and device, processing system and computer equipment
US20060100866A1 (en) Influencing automatic speech recognition signal-to-noise levels
CN110808030B (en) Voice awakening method, system, storage medium and electronic equipment
WO2023137861A1 (en) Divisive normalization method, device, audio feature extractor and a chip
US20190302916A1 (en) Near ultrasound based proximity sensing for mobile devices
CN112185408A (en) Audio noise reduction method and device, electronic equipment and storage medium
CN113014844A (en) Audio processing method and device, storage medium and electronic equipment
CN111048118A (en) Voice signal processing method and device and terminal
CN114627899A (en) Sound signal detection method and device, computer readable storage medium and terminal
CN112992153B (en) Audio processing method, voiceprint recognition device and computer equipment
CN111816217B (en) Self-adaptive endpoint detection voice recognition method and system and intelligent device
CN110556128B (en) Voice activity detection method and device and computer readable storage medium
CN111986694B (en) Audio processing method, device, equipment and medium based on transient noise suppression
CN113409800A (en) Processing method and device for monitoring audio, storage medium and electronic equipment
CN113270118B (en) Voice activity detection method and device, storage medium and electronic equipment
CN115359804A (en) Directional audio pickup method and system based on microphone array
CN108899041B (en) Voice signal noise adding method, device and storage medium
CN114255779A (en) Audio noise reduction method for VR device, electronic device and storage medium
CN112653979A (en) Adaptive dereverberation method and device
CN111128199A (en) Sensitive speaker monitoring and recording control method and system based on deep learning
EP4307297A1 (en) Method and apparatus for switching main microphone, voice detection method and apparatus for microphone, microphone-loudspeaker integrated device, and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant