CN108986844B - Speech endpoint detection method based on speaker speech characteristics - Google Patents

Speech endpoint detection method based on speaker speech characteristics

Info

Publication number
CN108986844B
CN108986844B (application CN201810887035.XA)
Authority
CN
China
Prior art keywords
voice
frame
judgment area
sound
background noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810887035.XA
Other languages
Chinese (zh)
Other versions
CN108986844A (en)
Inventor
孝大宇
张淑蕾
王超
康雁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201810887035.XA priority Critical patent/CN108986844B/en
Publication of CN108986844A publication Critical patent/CN108986844A/en
Application granted granted Critical
Publication of CN108986844B publication Critical patent/CN108986844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice endpoint detection method based on the voice characteristics of a speaker. The method comprises the following steps: 100, pre-acquiring the voice characteristics of at least two persons; 101, collecting and preprocessing a voice signal of at least two speakers to obtain a background noise signal; 102, windowing the voice signal and the background noise signal respectively to obtain sound frames and background noise frames; 103, acquiring the short-time energy-zero product values of the sound frames and background noise frames, and a threshold; 104, applying the threshold to all sound frames to obtain the voiced segment of the voice signal; 105, updating the threshold according to the voice characteristics of the voiced segment and acquiring the endpoints of the voice signal. The method builds speaker recognition on top of traditional voice endpoint detection: while taking noise into account, it extracts and compares the speakers' voice characteristics, making voice endpoint detection and multi-speaker recognition more accurate.

Description

Speech endpoint detection method based on speaker speech characteristics
Technical Field
The invention relates to the technical field of voice information processing and mode recognition, in particular to a voice endpoint detection method based on voice characteristics of a speaker.
Background
Voice endpoint detection is an important link in voice analysis, voice synthesis, voice coding, and speaker recognition. In speech recognition and speaker recognition, the voiced and unvoiced segments of a speech signal are usually separated by an endpoint detection algorithm, and recognition is then performed on the voiced segments according to certain speech characteristics. Correct and effective voice endpoint detection reduces the amount of computation, shortens processing time, eliminates the noise interference of silent segments, and improves the accuracy of speech recognition and speaker recognition. Common voice endpoint detection methods include short-time average energy, short-time average zero-crossing rate, and short-time energy-zero product.
Under low signal-to-noise ratio conditions, traditional threshold-based voice endpoint detection is affected by noise and loses accuracy. In multi-speaker recognition scenes in particular, the utterances of different speakers are sometimes closely connected, so a voiced segment detected by general Voice Activity Detection (VAD) may contain several speakers, and the voice segments of the individual speakers are then difficult to detect.
In a multi-speaker recognition scene, the voiced segments detected by the traditional threshold-based voice endpoint detection method may thus contain different speakers, which degrades the accuracy of later speaker recognition; correct voice endpoint detection is a key factor in improving that accuracy. A method that detects voice endpoints more accurately in complex multi-speaker scenes is therefore needed, so that the accuracy of later multi-speaker recognition improves.
Disclosure of Invention
Technical problem to be solved
In order to solve the above problems in the prior art, the present invention provides a method for detecting a speech endpoint based on speech characteristics of a speaker.
(II) technical scheme
In order to achieve this purpose, the invention mainly adopts the following technical scheme; the method comprises the following steps:
100. obtaining the voice characteristics of at least two persons in advance from voice information samples;
101. collecting a voice signal containing the speech of at least two persons, preprocessing the voice signal, and taking the 0-100 ms portion of the preprocessed voice signal as the background noise signal;
102. windowing the preprocessed voice signal and the preprocessed background noise signal respectively to obtain at least two sound frames corresponding to the voice signal and at least one background noise frame corresponding to the background noise signal;
103. acquiring the short-time energy-zero product value of each sound frame, the short-time energy-zero product value of each background noise frame, and a threshold;
acquiring the average energy E and the short-time average zero-crossing rate Z by the following formula (1) and formula (2), respectively, for each sound frame and each background noise frame:
$$E = \frac{1}{N}\sum_{k=1}^{N} s_w^2(k) \tag{1}$$

$$Z = \frac{1}{2}\sum_{k=2}^{N}\left|\operatorname{sgn}\left[s_w(k)\right]-\operatorname{sgn}\left[s_w(k-1)\right]\right| \tag{2}$$

where $N$ denotes the length of the window and $s_w(k)$ denotes the windowed speech signal;
wherein the short-time energy-zero product is the product of the average energy E and the short-time average zero-crossing rate Z;
the threshold is the product of the average of the short-time energy-zero product values of all background noise frames and a constant C;
104. following the order of all sound frames in the voice signal, taking the first sound frame whose short-time energy-zero product value is larger than the threshold as the start frame, and taking the first sound frame after the start frame whose short-time energy-zero product value is smaller than the threshold as the end frame;
all sound frames from the start frame to the end frame constitute the voiced segment of the voice signal;
105. acquiring the voice characteristics of a first judgment area and of a second judgment area in the voiced segment, updating the threshold according to these voice characteristics, and acquiring the endpoints of the voice signal;
the first judgment area is at least one sound frame after the start frame of the voiced segment, and the voice characteristics of the first judgment area are obtained;
the second judgment area is at least one sound frame before the end frame of the voiced segment, and the voice characteristics of the second judgment area are obtained;
if the voice characteristics of the first judgment area and of the second judgment area both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, the endpoints of the voiced segment are taken as the endpoints of the voice signal;
otherwise, increasing the threshold by a preset value to update it, and executing step 104 to obtain an updated voiced segment according to the updated threshold;
and executing step 105 for the updated voiced segment to obtain the correspondingly updated first and second judgment areas and compare their voice characteristics, repeating the update up to a preset number of times until the updated first and second judgment areas both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, and taking the endpoints of the updated voiced segment as the endpoints of the voice signal.
Optionally, the voice information samples include:
at least two pieces of voice information, wherein each piece lasts more than one minute and each piece is the speech of a different person;
and acquiring a Gaussian mixture model of each piece of voice information to obtain the voice characteristics corresponding to it.
Optionally, the pre-processing comprises:
filtering the voice signal, with an upper cut-off frequency of 3400 Hz and a lower cut-off frequency of 60-100 Hz.
Optionally, the windowing process comprises:
in step 102, dividing the speech signal into at least two sound frames using a Hamming window.
Alternatively,
the frame length of each sound frame corresponding to the voice signal is 10 ms-30 ms, and the frame shift between adjacent sound frames is half of the frame length;
the frame length of each background noise frame corresponding to the background noise signal is 10ms, and the frame shift between adjacent background noise frames is half of the frame length.
Alternatively,
in step 105, the duration of the first judgment area and the duration of the second judgment area are both 1 s-3 s.
Alternatively,
in step 105, a gaussian mixture model is obtained for the first judgment area and the second judgment area of the voiced sound segment; the Gaussian mixture model of the first judgment area is the voice characteristic of the first judgment area;
the Gaussian mixture model of the second judgment area is the voice characteristic of the second judgment area.
Alternatively,
in step 105, the preset number of times of repeated updating is 10 times.
Alternatively,
in step 105, the preset value by which the threshold is increased is 5% of the threshold before updating.
(III) advantageous effects
The invention has the beneficial effects that:
the method combines speaker recognition based on the traditional voice endpoint detection, extracts and compares the characteristics of the speakers while considering the noise influence, so that the voice endpoint detection is more accurate, and the recognition of multiple speakers is more accurate.
Drawings
FIG. 1 is a flowchart illustrating a method for detecting a voice endpoint based on characteristics of a speaker according to an embodiment of the present invention;
FIG. 2(a) is a time domain diagram of speaker A pronunciation "0" according to an embodiment of the present invention;
FIG. 2(b) is a diagram of the frequency spectrum of speaker A pronunciation "0" according to an embodiment of the present invention;
FIG. 2(c) is a time domain diagram of the utterance "0" of speaker B according to an embodiment of the present invention;
FIG. 2(d) is a diagram of the frequency spectrum of the speaker B's pronunciation of "0" according to an embodiment of the present invention;
FIG. 3(a) is a diagram of a speaker's voice signal according to an embodiment of the present invention;
FIG. 3(b) is a short-time energy zero-product diagram of a speaker voice signal according to an embodiment of the present invention;
FIG. 3(c) is the voice endpoint detection result of the short-time energy-zero product method for the voice signal of the speaker according to an embodiment of the present invention;
FIG. 4 is a functional block diagram of speaker recognition according to an embodiment of the present invention;
fig. 5 is a flowchart of voice endpoint detection according to an embodiment of the present invention.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
As shown in fig. 1, the method of the present invention comprises the following steps:
100. Obtaining the voice characteristics of at least two persons in advance from voice information samples;
the voice information samples include at least two pieces of voice information, where each piece lasts more than one minute and each piece is the speech of a different person;
a Gaussian mixture model of each piece of voice information is acquired to obtain the corresponding voice characteristics. For example, this embodiment takes speaker A and speaker B: their speech information is collected in advance, and Gaussian mixture models of the two speakers' voice information are obtained and used as their voice characteristics;
as shown in fig. 2(a) and 2(b), a time domain diagram and a spectrogram of speaker a uttering "0" respectively;
as shown in fig. 2(c) and 2(d), a time domain diagram and a spectrogram of speaker B utterance "0", respectively;
Note that the present invention does not limit the content of the registered voice information; this embodiment is only illustrative.
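As an illustration of step 100, the sketch below fits one Gaussian mixture model per enrolled speaker. It assumes MFCC frame features and the librosa and scikit-learn libraries, none of which the invention prescribes; the function name train_speaker_gmm, the 8 kHz sampling rate, and the 16 mixture components are illustrative choices only.

```python
# Sketch of step 100 (enrollment), under the assumptions named above.
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(wav_path, sr=8000, n_components=16):
    """Fit a GMM to the MFCC frames of one enrollment recording (> 1 minute)."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T   # (frames, 13)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc)
    return gmm          # the GMM serves as this speaker's voice characteristics

# e.g. speaker_models = {"A": train_speaker_gmm("speaker_a.wav"),
#                        "B": train_speaker_gmm("speaker_b.wav")}
```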
101. Collecting a voice signal containing the speech of at least two persons, preprocessing it, and taking the 0-100 ms portion of the preprocessed voice signal as the background noise signal;
for example, since there is usually a silent region at the beginning of a recording, the first 100 ms of the signal are usually taken for background-noise analysis;
further, filtering is performed on the voice signal;
for example, the upper cut-off frequency of the filtering is 3400 Hz, and the lower cut-off frequency is 60-100 Hz.
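A minimal sketch of this preprocessing, assuming an 8 kHz sampling rate, a lower cut-off of 80 Hz (inside the stated 60-100 Hz range), and a 4th-order Butterworth band-pass filter from scipy; the patent fixes only the cut-off band, not the filter type or order.

```python
# Sketch of step 101: band-pass filtering plus extraction of the 0-100 ms noise segment.
from scipy.signal import butter, filtfilt

def preprocess(signal, fs=8000, low_hz=80.0, high_hz=3400.0):
    nyq = fs / 2.0
    b, a = butter(4, [low_hz / nyq, high_hz / nyq], btype="band")
    filtered = filtfilt(b, a, signal)       # zero-phase band-pass filtering
    noise = filtered[: int(0.1 * fs)]       # first 100 ms -> background noise signal
    return filtered, noise
```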
102. Windowing the preprocessed voice signal and the preprocessed background noise signal respectively to obtain at least two sound frames corresponding to the voice signal and at least one background noise frame corresponding to the background noise signal;
for example, a speech signal is time-varying but short-time stationary, so it is usually divided into a number of sound frames before its characteristic parameters are obtained. In this embodiment, a window function is applied to the speech signal; the window function is a Hamming window, given by formula (1) below:
$$w(n)=\begin{cases}0.54-0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0\le n\le N-1\\[4pt] 0, & \text{otherwise}\end{cases} \tag{1}$$
wherein N in formula (1) represents the length of the window;
for example, the frame length of each voice frame corresponding to the voice signal after windowing is 10ms to 30ms, and the frame shift between adjacent voice frames is half of the frame length.
The frame length of each background noise frame corresponding to the background noise signal after windowing is 10ms, and the frame shift between adjacent background noise frames is half of the frame length.
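The windowing of step 102 can be sketched as follows; the 20 ms frame length is one choice inside the stated 10-30 ms range, and frame_signal is an illustrative helper name.

```python
# Sketch of step 102: split a signal into 50%-overlapping Hamming-windowed frames.
import numpy as np

def frame_signal(signal, fs=8000, frame_ms=20):
    frame_len = int(fs * frame_ms / 1000)        # samples per frame
    hop = frame_len // 2                         # frame shift = half the frame length
    window = np.hamming(frame_len)               # formula (1): 0.54 - 0.46*cos(2*pi*n/(N-1))
    n_frames = max(0, (len(signal) - frame_len) // hop + 1)
    if n_frames == 0:
        return np.empty((0, frame_len))
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```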
103. Acquiring the short-time energy-zero product value of each sound frame, the short-time energy-zero product value of each background noise frame, and a threshold;
acquiring the average energy E and the short-time average zero-crossing rate Z by the following formula (2) and formula (3), respectively, for each sound frame and each background noise frame:
$$E = \frac{1}{N}\sum_{k=1}^{N} s_w^2(k) \tag{2}$$

$$Z = \frac{1}{2}\sum_{k=2}^{N}\left|\operatorname{sgn}\left[s_w(k)\right]-\operatorname{sgn}\left[s_w(k-1)\right]\right| \tag{3}$$

where $N$ denotes the length of the window and $s_w(k)$ denotes the windowed speech signal;
wherein the short-time energy-zero product is the product of the average energy E and the short-time average zero-crossing rate Z;
the threshold is the product of the average of the short-time energy-zero product values of all background noise frames and a constant C; for example, C = 1.2 as an empirical value;
in this embodiment, since the windowing in step 102 converts the original speech signal into corresponding sound frames, the background noise is likewise processed with 10 ms as one frame;
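Combining formulas (2) and (3), a sketch of the per-frame short-time energy-zero product and of the initial threshold with the empirical C = 1.2, taking frames as produced by frame_signal above:

```python
# Sketch of step 103: short-time energy-zero product EZ = E * Z and initial threshold.
import numpy as np

def energy_zero_product(frames):
    energy = np.mean(frames ** 2, axis=1)        # average energy E, formula (2)
    signs = np.sign(frames)
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)  # crossings, formula (3)
    return energy * zcr

def initial_threshold(noise_frames, c=1.2):
    return c * np.mean(energy_zero_product(noise_frames))
```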
104. Following the order of all sound frames in the voice signal, taking the first sound frame whose short-time energy-zero product value is larger than the threshold as the start frame, and taking the first sound frame after the start frame whose short-time energy-zero product value is smaller than the threshold as the end frame;
all sound frames from the start frame to the end frame constitute the voiced segment of the voice signal;
for example, fig. 3(a) shows a speech signal of a speaker collected in this embodiment, fig. 3(b) shows its short-time energy-zero product, and fig. 3(c) shows the voice endpoint detection result of the short-time energy-zero product method;
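Step 104 then reduces to a scan over the frame-wise energy-zero product values, as in this sketch (detect_segment is an illustrative helper; it returns inclusive start and end frame indices):

```python
# Sketch of step 104: locate the voiced segment from the EZ values and the threshold.
import numpy as np

def detect_segment(ez_values, threshold):
    above = ez_values > threshold
    starts = np.flatnonzero(above)
    if starts.size == 0:
        return None                              # no frame exceeds the threshold
    start = starts[0]                            # first frame above the threshold
    below_after = np.flatnonzero(~above[start:])
    end = start + below_after[0] - 1 if below_after.size else len(ez_values) - 1
    return start, end
```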
105. Acquiring the voice characteristics of a first judgment area and of a second judgment area in the voiced segment, updating the threshold according to these voice characteristics, and acquiring the endpoints of the voice signal;
the first judgment area is at least one sound frame after the start frame of the voiced segment, and the voice characteristics of the first judgment area are obtained;
the second judgment area is at least one sound frame before the end frame of the voiced segment, and the voice characteristics of the second judgment area are obtained;
as shown in FIG. 4, FIG. 4 is a block diagram illustrating the principle of using speaker recognition in the present embodiment
Specifically, for example, the first judgment area consists of the sound frames corresponding to 1 s-3 s after the start frame of the voiced segment, and its voice characteristics are obtained as the Gaussian mixture model of the first judgment area;
the second judgment area consists of the sound frames corresponding to 1 s-3 s before the end frame of the voiced segment, and its voice characteristics are obtained as the Gaussian mixture model of the second judgment area;
Specifically, as shown in fig. 5, if the voice characteristics of the first judgment area and of the second judgment area both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, the endpoints of the voiced segment are taken as the endpoints of the voice signal;
for example, when the voice characteristics of the first judgment area match those of speaker A and the voice characteristics of the second judgment area also match those of speaker A, the endpoints of the voiced segment are taken as the endpoints of the voice signal. The pre-acquired voice characteristics here are only illustrative, not limiting; both judgment areas could equally match the voice characteristics of speaker B.
Otherwise, the threshold is increased by a preset value (for example, 5% of the threshold before updating) to update it, and step 104 is executed to obtain an updated voiced segment according to the updated threshold;
for example, when the voice characteristics of the first judgment area match those of speaker A but the voice characteristics of the second judgment area match those of speaker B, the threshold is updated and the updated voiced segment is acquired according to the updated threshold;
Step 105 is then executed for the updated voiced segment to obtain the correspondingly updated first and second judgment areas and to compare their voice characteristics; the update is repeated up to a preset number of times (for example, 10 times) until the updated first and second judgment areas both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, after which the endpoints of the updated voiced segment are taken as the endpoints of the voice signal.
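Tying the pieces together, a sketch of the whole step-105 loop under the assumptions above: each judgment area is scored against every enrolled GMM, the highest average log-likelihood (GaussianMixture.score) decides the match, and the threshold grows by 5% per round for at most 10 rounds. The feature helper mfcc_of (frame block to MFCC matrix) and the likelihood-based matching rule are assumptions for illustration, not requirements of the patent.

```python
# Sketch of step 105: iterative threshold update driven by speaker matching.
def detect_endpoints(frames, ez, threshold, speaker_models, mfcc_of,
                     judge_frames=100, max_updates=10):   # ~1 s of 10 ms hops
    def best_speaker(region):
        feats = mfcc_of(region)                  # hypothetical frames -> MFCC helper
        return max(speaker_models, key=lambda s: speaker_models[s].score(feats))

    start = end = None
    for _ in range(max_updates):
        seg = detect_segment(ez, threshold)      # step 104 on the current threshold
        if seg is None:
            break
        start, end = seg
        first = best_speaker(frames[start : start + judge_frames])
        second = best_speaker(frames[max(start, end - judge_frames) : end + 1])
        if first == second:                      # both judgment areas match one speaker
            break
        threshold *= 1.05                        # raise the threshold by 5%
    return (start, end) if start is not None else None
```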
The method builds speaker recognition on top of traditional voice endpoint detection: while taking noise into account, it extracts and compares the speakers' characteristics, making voice endpoint detection and multi-speaker recognition more accurate.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A voice endpoint detection method based on speaker voice characteristics is characterized by comprising the following steps:
100. obtaining the voice characteristics of at least two persons in advance from voice information samples;
101. collecting a voice signal containing the speech of at least two persons, preprocessing the voice signal, and taking the 0-100 ms portion of the preprocessed voice signal as the background noise signal;
102. windowing the preprocessed voice signal and the preprocessed background noise signal respectively to obtain at least two sound frames corresponding to the voice signal and at least one background noise frame corresponding to the background noise signal;
103. acquiring the short-time energy-zero product value of each sound frame, the short-time energy-zero product value of each background noise frame, and a threshold;
acquiring the average energy E and the short-time average zero-crossing rate Z by the following formula (1) and formula (2), respectively, for each sound frame and each background noise frame:
$$E = \frac{1}{N}\sum_{k=1}^{N} s_w^2(k) \tag{1}$$

$$Z = \frac{1}{2}\sum_{k=2}^{N}\left|\operatorname{sgn}\left[s_w(k)\right]-\operatorname{sgn}\left[s_w(k-1)\right]\right| \tag{2}$$

where $N$ denotes the length of the window and $s_w(k)$ denotes the windowed speech signal;
wherein the short-time energy-zero product is the product of the average energy E and the short-time average zero-crossing rate Z;
the threshold is the product of the average of the short-time energy-zero product values of all background noise frames and a constant C;
104. following the order of all sound frames in the voice signal, taking the first sound frame whose short-time energy-zero product value is larger than the threshold as the start frame, and taking the first sound frame after the start frame whose short-time energy-zero product value is smaller than the threshold as the end frame;
all sound frames from the start frame to the end frame constitute the voiced segment of the voice signal;
105. acquiring the voice characteristics of a first judgment area and of a second judgment area in the voiced segment, updating the threshold according to these voice characteristics, and acquiring the endpoints of the voice signal;
the first judgment area is at least one sound frame after the start frame of the voiced segment, and the voice characteristics of the first judgment area are obtained;
the second judgment area is at least one sound frame before the end frame of the voiced segment, and the voice characteristics of the second judgment area are obtained;
if the voice characteristics of the first judgment area and of the second judgment area both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, the endpoints of the voiced segment are taken as the endpoints of the voice signal;
otherwise, increasing the threshold by a preset value to update it, and executing step 104 to obtain an updated voiced segment according to the updated threshold;
and executing step 105 for the updated voiced segment to obtain the correspondingly updated first and second judgment areas and compare their voice characteristics, repeating the update up to a preset number of times until the updated first and second judgment areas both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, and taking the endpoints of the updated voiced segment as the endpoints of the voice signal.
2. The method of claim 1, wherein the speech information samples comprise:
at least two pieces of voice information, wherein each piece lasts more than one minute and each piece is the speech of a different person;
and acquiring a Gaussian mixture model of each piece of voice information to obtain the voice characteristics corresponding to it.
3. The method of claim 2, wherein pre-processing comprises:
filtering the voice signal, with an upper cut-off frequency of 3400 Hz and a lower cut-off frequency of 60-100 Hz.
4. The method of claim 3, wherein the windowing comprises:
in step 102, the speech signal is divided into at least two sound frames using a Hamming window.
5. The method of claim 4,
the frame length of each sound frame corresponding to the voice signal is 10 ms-30 ms, and the frame shift between adjacent sound frames is half of the frame length;
the frame length of each background noise frame corresponding to the background noise signal is 10ms, and the frame shift between adjacent background noise frames is half of the frame length.
6. The method of claim 5,
in step 105, the duration of the first judgment area and the duration of the second judgment area are both 1 s-3 s.
7. The method of claim 6,
in step 105, a gaussian mixture model is obtained for the first judgment area and the second judgment area of the voiced sound segment; the Gaussian mixture model of the first judgment area is the voice characteristic of the first judgment area;
the Gaussian mixture model of the second judgment area is the voice characteristic of the second judgment area.
8. The method of claim 7,
in step 105, the preset number of times of repeated updating is 10 times.
9. The method of claim 8,
in step 105, the preset value by which the threshold is increased is 5% of the threshold before updating.
CN201810887035.XA 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics Active CN108986844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810887035.XA CN108986844B (en) 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810887035.XA CN108986844B (en) 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics

Publications (2)

Publication Number Publication Date
CN108986844A CN108986844A (en) 2018-12-11
CN108986844B true CN108986844B (en) 2020-08-28

Family

ID=64554966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810887035.XA Active CN108986844B (en) 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics

Country Status (1)

Country Link
CN (1) CN108986844B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109616097B (en) * 2019-01-04 2024-05-10 平安科技(深圳)有限公司 Voice data processing method, device, equipment and storage medium
CN111613250B (en) * 2020-07-06 2023-07-18 泰康保险集团股份有限公司 Long voice endpoint detection method and device, storage medium and electronic equipment
CN112820292B (en) * 2020-12-29 2023-07-18 平安银行股份有限公司 Method, device, electronic device and storage medium for generating meeting summary

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100737358B1 (en) * 2004-12-08 2007-07-09 한국전자통신연구원 Method for verifying speech/non-speech and voice recognition apparatus using the same
JP5088741B2 (en) * 2008-03-07 2012-12-05 インターナショナル・ビジネス・マシーンズ・コーポレーション System, method and program for processing voice data of dialogue between two parties
SG189182A1 (en) * 2010-10-29 2013-05-31 Anhui Ustc Iflytek Co Ltd Method and system for endpoint automatic detection of audio record
CN102522081B (en) * 2011-12-29 2015-08-05 北京百度网讯科技有限公司 A kind of method and system detecting sound end
CN103117067B (en) * 2013-01-19 2015-07-15 渤海大学 Voice endpoint detection method under low signal-to-noise ratio
US8923929B2 (en) * 2013-02-01 2014-12-30 Xerox Corporation Method and apparatus for allowing any orientation answering of a call on a mobile endpoint device
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
WO2018100391A1 (en) * 2016-12-02 2018-06-07 Cirrus Logic International Semiconductor Limited Speaker identification
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection

Also Published As

Publication number Publication date
CN108986844A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN111816218B (en) Voice endpoint detection method, device, equipment and storage medium
CN108986844B (en) Speech endpoint detection method based on speaker speech characteristics
CN101625857A (en) Self-adaptive voice endpoint detection method
JPH08508107A (en) Method and apparatus for speaker recognition
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
JP2003513319A (en) Emphasis of short-term transient speech features
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN108682432B (en) Speech emotion recognition device
CN101625862B (en) Method for detecting voice interval in automatic caption generating system
CN101625860A (en) Method for self-adaptively adjusting background noise in voice endpoint detection
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN101625858A (en) Method for extracting short-time energy frequency value in voice endpoint detection
Sorin et al. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation
JPH0449952B2 (en)
Sudhakar et al. Automatic speech segmentation to improve speech synthesis performance
Kasap et al. A unified approach to speech enhancement and voice activity detection
Jayan et al. Detection of stop landmarks using Gaussian mixture modeling of speech spectrum
Hahn et al. An improved speech detection algorithm for isolated Korean utterances
Jayan et al. Detection of burst onset landmarks in speech using rate of change of spectral moments
CN108573712B (en) Voice activity detection model generation method and system and voice activity detection method and system
CN112489692A (en) Voice endpoint detection method and device
Kyriakides et al. Isolated word endpoint detection using time-frequency variance kernels
RU2174714C2 (en) Method for separating the basic tone
Kacur et al. ZCPA features for speech recognition
Seman et al. Evaluating endpoint detection algorithms for isolated word from Malay parliamentary speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant