CN105405439B - Speech playing method and device - Google Patents


Info

Publication number
CN105405439B
Authority
CN
China
Prior art keywords
voice
speech segment
voice segments
key message
played
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510757786.6A
Other languages
Chinese (zh)
Other versions
CN105405439A (en)
Inventor
高建清
王智国
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201510757786.6A priority Critical patent/CN105405439B/en
Publication of CN105405439A publication Critical patent/CN105405439A/en
Application granted granted Critical
Publication of CN105405439B publication Critical patent/CN105405439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a speech playing method and device. The method comprises: receiving voice data to be played; performing endpoint detection on the voice data to be played to obtain voice segments; determining whether each voice segment is a key information segment; and, when playing the voice data to be played, adjusting its speech rate according to the key information segments. The invention helps users quickly and accurately find the voice segments they are interested in.

Description

Speech playing method and device
Technical field
The present invention relates to the field of speech signal processing, and in particular to a speech playing method and device.
Background technique
Currently, more and more people prefer to record required information as audio rather than as text: meeting content is recorded for later review; interviews are recorded and later transcribed into articles; students record the points of a lecture they did not understand so they can revisit them. However, when the amount of recorded data is large, it is difficult to quickly and accurately find the valuable content. To reduce playback time, existing speech playing methods generally use endpoint detection to find pure-noise or silent segments, skip them, and play the remaining voice data at normal speed. However, recordings often capture unimportant content as well, so during playback the user frequently has to switch to fast-forward manually or skip the unimportant content directly. In particular, when the recording environment is poor, the recorded voice data is often of low quality, and the user has to replay passages repeatedly to make out the content, which greatly degrades the user experience.
Summary of the invention
The present invention provides a speech playing method and device to help users quickly and accurately find the voice segments they are interested in.
To this end, the present invention provides the following technical scheme:
A speech playing method, comprising:
Receiving voice data to be played;
Performing endpoint detection on the voice data to be played to obtain voice segments;
Determining whether each voice segment is a key information segment according to the voice content and/or voiceprint features of the voice segment;
When playing the voice data to be played, adjusting the speech rate of the voice data to be played according to the key information segments.
Preferably, determining whether a voice segment is a key information segment according to its voice content comprises:
Performing speech recognition on each voice segment to obtain its recognition text;
Determining, according to the recognition text of each voice segment, whether the voice segment is a key information segment.
Preferably, determining whether a voice segment is a key information segment according to its recognition text comprises:
Determining whether the recognition text of each voice segment contains a preset keyword;
If so, determining that the voice segment is a key information segment.
Preferably, determining whether a voice segment is a key information segment according to its recognition text comprises:
Extracting summary voice segments from all voice segments in an iterative manner; after a set number of iterations is reached, multiple summary voice segments are obtained and taken as the key information segments.
Preferably, extracting summary voice segments from all voice segments comprises:
Computing the similarity between the recognition text of the current voice segment and the recognition text of the voice data to be played, obtaining a first value;
Computing the similarity between the recognition text of the current voice segment and the recognition text of the already-extracted summary voice segments, obtaining a second value;
Computing the difference between the first value and the second value, obtaining the summary score of the current voice segment;
After the summary scores of all voice segments are obtained, selecting the voice segment with the highest summary score as a summary voice segment.
Preferably, determining whether a voice segment is a key information segment according to its voiceprint features comprises:
If the voice data to be played contains the voice data of multiple speakers, extracting the voiceprint features of each voice segment;
Determining, according to the voiceprint features and the voiceprint model of a specific speaker, whether the voice segment is voice data of the specific speaker;
If so, determining that the voice segment is a key information segment.
Preferably, determining whether a voice segment is a key information segment according to its voiceprint features comprises:
If the voice data to be played contains the voice data of multiple speakers, determining the main speaker by speaker separation;
Taking the voice segments of the main speaker as the key information segments.
Preferably, adjusting the speech rate of the voice data to be played according to the key information segments comprises:
If the current voice segment is a key information segment, playing it at normal speed, and otherwise playing it at fast speed; or
If the current voice segment is a key information segment, playing it at slow speed, and otherwise playing it at normal or fast speed.
Preferably, the method further comprises:
Obtaining the confidence of each voice segment;
The speech rate of the voice data to be played then being adjusted according to both the key information segments and the confidence of each voice segment.
Preferably, adjusting the speech rate of the voice data to be played according to the key information segments and the confidence of each voice segment comprises:
If the current voice segment is a key information segment: when its confidence is greater than a second threshold, playing it at fast speed; otherwise playing it at slow speed;
If the current voice segment is not a key information segment: when its confidence is greater than the second threshold, skipping the segment; when its confidence is less than or equal to a first threshold, playing it at slow speed, the first threshold being smaller than the second threshold.
Preferably, the method further comprises:
Performing signal-level analysis on each voice segment, the signal-level analysis including any one or more of: volume variation, reverberation, and noise conditions;
When playing the voice data to be played, optimizing the voice segments according to the analysis results, the optimization including any one or more of:
(1) If the amplitude of several consecutive frames of voice data in the current voice segment exceeds an upper limit, lowering the amplitude of the segment; if the amplitude of several consecutive frames is below a lower limit, raising the amplitude of the segment;
(2) If the reverberation time of the current voice segment exceeds a threshold, performing dereverberation on the segment;
(3) If the signal-to-noise ratio of the current voice segment is below an SNR threshold, denoising the segment.
A voice playing device, comprising:
A receiving module, configured to receive voice data to be played;
An endpoint detection module, configured to perform endpoint detection on the voice data to be played to obtain voice segments;
A key information segment determining module, comprising a first determining module and/or a second determining module, the first determining module being configured to determine whether a voice segment is a key information segment according to its voice content, and the second determining module being configured to determine whether a voice segment is a key information segment according to its voiceprint features;
A playing module, configured to play the voice data to be played;
A speech rate adjusting module, configured to adjust, when the playing module plays the voice data to be played, the speech rate of the voice data to be played according to the key information segments.
Preferably, the first determining module comprises:
A speech recognition unit, configured to perform speech recognition on each voice segment to obtain its recognition text;
A determination unit, configured to determine, according to the recognition text of each voice segment, whether the voice segment is a key information segment.
Preferably, the determination unit is specifically configured to determine whether the recognition text of each voice segment contains a preset keyword and, if so, to determine that the voice segment is a key information segment.
Preferably, the determination unit comprises:
An iteration-count setting subunit, configured to set the number of iterations;
A summary extraction subunit, configured to extract summary voice segments from all voice segments in an iterative manner;
A judgment subunit, configured to judge whether the set number of iterations has been reached and, when it has, to trigger the summary extraction subunit to stop the iteration;
A key information segment obtaining subunit, configured to obtain, after the summary extraction subunit stops iterating, all current summary voice segments and take them as the key information segments.
Preferably, the summary extraction subunit comprises:
A first computation subunit, configured to compute the similarity between the recognition text of the current voice segment and the recognition text of the voice data to be played, obtaining a first value;
A second computation subunit, configured to compute the similarity between the recognition text of the current voice segment and the recognition text of the already-extracted summary voice segments, obtaining a second value;
A difference computation subunit, configured to compute the difference between the first value and the second value, obtaining the summary score of the current voice segment;
A selection subunit, configured to select, after the summary scores of all voice segments are obtained, the voice segment with the highest summary score as a summary voice segment.
Preferably, the second determining module comprises:
A voiceprint feature extraction unit, configured to extract, when the voice data to be played contains the voice data of multiple speakers, the voiceprint features of each voice segment;
A voiceprint recognition unit, configured to determine, according to the voiceprint features and the voiceprint model of a specific speaker, whether the voice segment is voice data of the specific speaker and, if so, to determine that the voice segment is a key information segment.
Preferably, the second determining module comprises:
A voiceprint feature extraction unit, configured to extract, when the voice data to be played contains the voice data of multiple speakers, the voiceprint features of each voice segment;
A speaker separation unit, configured to determine the main speaker from the voiceprint features by speaker separation and to take the voice segments of the main speaker as the key information segments.
Preferably, the speech rate adjusting module is specifically configured to set the playback rate to normal speed when the current voice segment is a key information segment and to fast speed otherwise; or to set the playback rate to slow speed when the current voice segment is a key information segment and to normal or fast speed otherwise.
Preferably, the device further comprises:
A confidence obtaining module, configured to obtain the confidence of each voice segment;
The speech rate adjusting module then being specifically configured to adjust the speech rate of the voice data to be played according to both the key information segments and the confidence of each voice segment.
Preferably, the device further comprises:
A signal analysis and processing module, configured to perform signal-level analysis on each voice segment and to optimize the voice segments according to the analysis results; the signal analysis and processing module comprising any one or more of the following units:
A volume analysis and processing unit, configured to compute the amplitude of each frame of voice data, to lower the amplitude of the current voice segment when the amplitude of several consecutive frames exceeds an upper limit, and to raise the amplitude of the current voice segment when the amplitude of several consecutive frames is below a lower limit;
A reverberation analysis and processing unit, configured to compute the reverberation time of each voice segment and to perform dereverberation on the current voice segment when its reverberation time exceeds a threshold;
A noise analysis and processing unit, configured to compute the noise of each voice segment and to denoise the current voice segment when its signal-to-noise ratio is below an SNR threshold.
The speech playing method and device provided by the embodiments of the present invention analyze the voice data to be played, determine the key information segments in it, and, when playing the voice data, adjust its speech rate according to the key information segments. This helps users quickly and accurately find the voice segments of interest, or the valuable voice segments, in a large amount of voice data; for example, it helps users quickly find the voice segments relevant to the meeting topic in a large amount of meeting recordings.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present application and in the prior art more clearly, the accompanying drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments recorded in the present invention; those of ordinary skill in the art can also obtain other drawings based on these drawings.
Fig. 1 is the flow chart of speech playing method of the embodiment of the present invention;
Fig. 2 is a kind of structural schematic diagram of voice playing device of the embodiment of the present invention;
Fig. 3 is another structural schematic diagram of voice playing device of the embodiment of the present invention;
Fig. 4 is another structural schematic diagram of voice playing device of the embodiment of the present invention.
Specific embodiment
To enable those skilled in the art to better understand the scheme of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.
As shown in Fig. 1, which is the flow chart of the speech playing method of an embodiment of the present invention, the method comprises the following steps:
Step 101: receive the voice data to be played.
The voice data may be a TV program recording, an interview recording, a meeting recording, etc.
Step 102: perform endpoint detection on the voice data to be played to obtain voice segments.
The endpoint detection may use existing detection techniques, for example endpoint detection based on short-time energy and short-time average zero-crossing rate, endpoint detection based on cepstral features, or endpoint detection based on entropy.
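As an illustration of the energy-and-zero-crossing-rate variant mentioned above, the following is a minimal sketch that splits a signal into voiced runs of frames; the frame length and both thresholds are illustrative assumptions, not values from the patent.

```python
import numpy as np

def endpoint_detect(signal, frame_len=256, energy_thresh=0.01, zcr_thresh=0.25):
    """Split a mono signal into voiced segments using short-time energy and
    short-time zero-crossing rate; frames failing the test count as silence."""
    n_frames = len(signal) // frame_len
    voiced = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
        voiced.append(energy > energy_thresh and zcr < zcr_thresh)
    segments, start = [], None          # merge voiced frames into (start, end) runs
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, n_frames))
    return segments
```

Speech has high energy with moderate zero-crossing rate, while silence fails the energy test and broadband noise tends to fail the zero-crossing test, which is why both conditions must hold.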
Step 103: determine whether each voice segment is a key information segment according to its voice content and/or voiceprint features.
These cases are described in detail below.
1. Determining key information segments according to the voice content of each voice segment
First, speech recognition is performed on each voice segment to obtain its recognition text; then, according to the recognition text of each segment, it is determined whether the segment is a key information segment. Specifically, a preset-keyword method or a summary extraction method may be used.
The preset-keyword method requires the user to preset keywords for the voice data to be played. For each voice segment, it is then judged whether the recognition text of the segment contains any of the keywords; when judging, exact matching or fuzzy matching may be used to match the words in the recognition text against the keywords. If the match succeeds, the segment is determined to contain the keyword and is accordingly taken as a key information segment.
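The keyword test just described can be sketched as follows; the fuzzy variant here uses the standard library's `difflib` similarity purely as an illustrative stand-in for whatever fuzzy matcher an implementation would actually choose.

```python
import difflib

def is_key_segment(recognition_text, keywords, fuzzy=False):
    """Return True if the recognition text of a segment contains any preset
    keyword; optionally accept near-matches (an illustrative fuzzy matcher)."""
    words = recognition_text.split()
    for kw in keywords:
        if kw in recognition_text:   # exact (substring) match
            return True
        if fuzzy and difflib.get_close_matches(kw, words, n=1, cutoff=0.8):
            return True
    return False
```

Fuzzy matching is useful here because the recognizer may misspell a keyword slightly while an exact substring test would miss it.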
The summary extraction method extracts summary voice segments according to the recognition text of the voice data to be played and takes them as the key information segments. Specifically, summary voice segments may be extracted from all voice segments in an iterative manner; after a set number of iterations is reached, multiple summary voice segments are obtained and taken as the key information segments. In each iteration, the voice segment most relevant to the entire recognition text is selected as a summary voice segment, for example using the maximal marginal relevance (MMR) method. When computing the relevance, the similarity between the recognition text of the current voice segment and the recognition text of the voice data to be played is computed first, giving a first value; then the similarity between the recognition text of the current voice segment and that of the already-extracted summary segments is computed, giving a second value; finally, the difference between the two values is taken as the score of the current segment, as shown in formula (1).
MMR(S_i) = α · Sim1(S_i, D) - (1 - α) · Sim2(S_i, Sum)    (1)
where MMR(S_i) is the score of the i-th voice segment, D is the recognition-text vector of the voice data to be played, S_i is the recognition-text vector of the current voice segment, Sum is the recognition-text vector of the already-extracted summary voice segments, and α is a weight parameter whose value can be set from experience or experimental results. Sim1(S_i, D) is the similarity between the recognition text of the current voice segment and that of the voice data to be played; Sim2(S_i, Sum) is the similarity between the recognition text of the current voice segment and that of the already-extracted summary voice segments.
It should be noted that text vectorization may use the prior art, which is not repeated here. The similarity may be measured by the distance between vectors, such as the cosine distance; the specific computation may also use the prior art and is not detailed here.
After the scores of all voice segments have been computed, the segment with the highest score is selected as a summary voice segment, and the next iteration begins.
After several iterations, multiple summary voice segments are obtained, and these serve as the key information segments.
It should be noted that the summary may also be extracted with other existing summarization methods; the embodiment of the present invention places no limitation on this.
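Formula (1) can be sketched as an iterative selection loop over precomputed recognition-text vectors. The cosine similarity, the value α = 0.7, and the toy vectors in the usage are illustrative assumptions; the patent leaves the similarity measure and α open.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; the small epsilon guards against zero vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def extract_summary_segments(segment_vecs, doc_vec, n_iters=3, alpha=0.7):
    """Pick summary segments iteratively by formula (1):
    MMR(S_i) = alpha * Sim1(S_i, D) - (1 - alpha) * Sim2(S_i, Sum)."""
    doc_vec = np.asarray(doc_vec, dtype=float)
    selected = []
    sum_vec = np.zeros_like(doc_vec)          # text vector of the chosen summary
    for _ in range(min(n_iters, len(segment_vecs))):
        best, best_score = None, float("-inf")
        for i, s in enumerate(segment_vecs):
            if i in selected:
                continue
            sim1 = cosine(s, doc_vec)                       # relevance to whole text
            sim2 = cosine(s, sum_vec) if selected else 0.0  # redundancy penalty
            score = alpha * sim1 - (1 - alpha) * sim2
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        sum_vec = sum_vec + np.asarray(segment_vecs[best], dtype=float)
    return selected
```

The redundancy term makes later picks avoid repeating what earlier summary segments already covered, which is the point of MMR over plain relevance ranking.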
2. Determining key information segments according to the voiceprint features of each voice segment
If the voice data to be played contains the voice data of multiple speakers, the voice data of a specific speaker or of the main speaker can be taken as the key information segments.
A specific speaker can be preset by the user; voiceprint recognition then determines whether each voice segment is voice data of that speaker, and if so, the segment is determined to be a key information segment. It should be noted that in this embodiment the voice data of the specific speaker must be collected in advance to train the speaker's voiceprint model. During voiceprint recognition, the voiceprint features of each voice segment are extracted, and the features together with the voiceprint model are used to identify whether the segment is voice data of the specific speaker.
For the main speaker, speaker separation can be applied, based on the voiceprint features of each segment, to separate the voice data of each speaker in the recording; the speaker with the most voice data, or the speaker playing the leading role, is then taken as the main speaker. The voice data of a leading speaker is usually distributed throughout the voice data to be played, for example appearing at the beginning, the end, and the middle; the position of each piece of voice data in the original recording can be determined from the timestamps of the separated voice data. Speaker separation may use the prior art, for example clustering algorithms.
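A toy version of the clustering-based main-speaker selection mentioned above might look like this, assuming each segment already has a voiceprint embedding vector; a real system would use i-vector or x-vector embeddings and a proper diarization pipeline, so this is only a sketch of the idea.

```python
import numpy as np

def main_speaker_segments(segment_embeddings, n_speakers=2, n_iters=20):
    """Cluster per-segment voiceprint embeddings with plain k-means and take
    the largest cluster as the main speaker; the first segments seed the
    centers so this toy example stays deterministic."""
    X = np.asarray(segment_embeddings, dtype=float)
    centers = X[:n_speakers].copy()
    for _ in range(n_iters):                              # Lloyd iterations
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    main = np.bincount(labels, minlength=n_speakers).argmax()
    return [i for i, lab in enumerate(labels) if lab == main]
```

Choosing the largest cluster implements the "speaker with the most voice data" criterion; the "leading role" criterion (presence at the beginning, middle, and end) would instead score clusters by how their segments are distributed over time.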
3. Determining key information segments according to both the voice content and the voiceprint features
This case combines the two previous ones. For example, in a meeting recording with multiple speakers, the user may only care about what one particular person said regarding environmental protection measures. In this case, voiceprint recognition with that speaker's voiceprint model is used first to identify the speaker's voice segments; preset keywords are then used to identify, among those segments, the ones containing the keywords, and these segments are taken as the key information segments.
It can be seen that the present invention satisfies both the need of users to focus on particular content in the voice data to be played and, when there are multiple speakers, the need to focus on particular speakers.
Step 104: when playing the voice data to be played, adjust the speech rate of the voice data according to the key information segments.
In the embodiment of the present invention, the speech rate is adjusted segment by segment; the specific adjustment scheme can be chosen according to the application, and the embodiment places no limitation on it. For example, when a user writes an interview article from an interview recording, key information segments can be played at normal speed and non-key segments at fast speed; when a student studies from a classroom recording, key information segments can be played at slow speed and non-key segments at normal or fast speed.
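The two example policies just given amount to a small per-segment rate table; the numeric rates below (1.0, 1.5, 0.8) are illustrative assumptions, not values from the patent.

```python
def playback_rate(is_key, mode="transcribe"):
    """Map a segment to a playback rate following the two example policies;
    the numeric rates are illustrative only."""
    if mode == "transcribe":    # writing up an interview recording
        return 1.0 if is_key else 1.5
    if mode == "study":         # studying from a classroom recording
        return 0.8 if is_key else 1.0
    raise ValueError("unknown mode: " + mode)
```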
The speech playing method of the embodiment of the present invention analyzes the voice data to be played at the voice-content and/or voiceprint level, determines the key information segments in it, and, when playing the voice data, adjusts its speech rate according to the key information segments, thereby helping users quickly and accurately find the voice segments of interest, or the valuable voice segments, in a large amount of voice data, for example quickly finding the voice segments relevant to the meeting topic in a large amount of meeting recordings.
To further improve the playback effect and guarantee the validity of the key information segments, in another embodiment of the method the confidence of each voice segment can also be taken into account when adjusting the speech rate according to the key information segments. Specifically, during speech recognition the posterior probability of each word in a voice segment is available; averaging the posterior probabilities of all words in the segment gives the posterior probability of the segment, which is taken as its confidence. Of course, the confidence can also be computed by other methods, for example by extracting text features of the segment (mainly semantic features of the recognition text) or acoustic features and applying statistical modeling. It should be noted that if the key information segments are determined solely from voiceprint features, the posterior probability of each word can still be obtained through the speech recognition decoding process, and the confidence of each segment derived from it.
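The averaged-posterior confidence described above is a one-liner; the word posteriors here are assumed to come from the recognizer's decoding process, as the text describes.

```python
def segment_confidence(word_posteriors):
    """Average the per-word posterior probabilities of a segment; an empty
    segment gets confidence 0.0 by convention (an assumption made here)."""
    if not word_posteriors:
        return 0.0
    return sum(word_posteriors) / len(word_posteriors)
```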
In this embodiment, when the speech rate of the voice data to be played is adjusted, it is adjusted according to both the key information segments and the confidence of each voice segment.
Specifically, the confidence values can be divided into several grades in advance by setting one or more confidence thresholds. For example, with two thresholds, a first threshold and a second threshold where the second is larger, the confidence values fall into three grades: high, medium, and low. The speech rate is then adjusted according to the confidence grade and whether the segment is a key information segment, being set to fast, normal, or slow speed; the values of the fast and slow rates can be chosen according to the application. For example, for the recording of an important meeting, the fast rate can be set 5% faster than normal and the slow rate 10% slower than normal.
Taking two confidence thresholds as an example, the speech-rate adjustment process for the voice data to be played is as follows:
Step a) obtain the confidence of the current speech segment;
Step b) compare the confidence with the preset thresholds: if it is greater than the second threshold, go to step c); if it lies between the first and second thresholds, go to step d); if it is less than the first threshold, go to step e);
Step c) judge whether the current segment is a key information segment; if so, set its speech rate to fast, otherwise skip the segment directly;
Step d) judge whether the current segment is a key information segment; if so, set its speech rate to slow, otherwise leave it unadjusted, i.e. play it at normal rate;
Step e) set the speech rate of the current segment to slow.
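Steps a) through e) amount to a small decision table. A sketch of the mapping for one segment, with hypothetical threshold values (the description leaves the concrete first/second thresholds to the application):

```python
def playback_action(confidence, is_key, first_threshold=0.4, second_threshold=0.7):
    """Map a segment's confidence level and key/non-key status to a
    playback action, following steps a)-e) above."""
    if confidence > second_threshold:        # high confidence, step c)
        return "fast" if is_key else "skip"
    if confidence >= first_threshold:        # middle confidence, step d)
        return "slow" if is_key else "normal"
    return "slow"                            # low confidence, step e)
```

A confident key segment is thus sped up, a confident non-key segment skipped, and anything of doubtful recognition quality slowed down so the listener can verify it by ear.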
Of course, adjusting the speech rate of the voice data to be played according to the key information segments and the confidence of each segment is not confined to the specific scheme above; other adjustment schemes may be used according to the application demand, and this embodiment of the present invention places no limitation on them.
To further improve the quality of the voice data to be played and the listening experience of the user, in another embodiment of the speech playing method of the present invention, the voice data to be played may also be analyzed at the signal level, and the speech segments optimized according to the analysis result. The signal-level analysis may cover any one or more of the following: volume variation of the voice data, amount of reverberation, and noise conditions. Correspondingly, one or more of these signal analyses and the matching processing may be performed when optimizing the speech segments. The different situations are explained separately below.
1) Optimizing the voice data according to the volume analysis result
In the volume analysis, the amplitude of each frame of voice data is computed frame by frame. If the amplitude of several consecutive frames exceeds an upper limit, clipping is considered to occur in the current segment; the volume would be too high during playback and hurt the listening experience, so the amplitude of the current segment needs to be turned down. If the amplitude of several consecutive frames is below a lower limit, the amplitude of the current segment is considered too small for the user to catch the speech content, and it needs to be turned up somewhat.
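A minimal sketch of this per-frame volume check; the limit values and the number of consecutive frames are hypothetical, as the description does not fix them:

```python
def volume_action(frame_amplitudes, upper=0.9, lower=0.05, min_run=5):
    """Decide how to adjust a segment's amplitude: 'turn_down' if min_run
    consecutive frames exceed the upper limit (clipping), 'turn_up' if
    min_run consecutive frames fall below the lower limit, else 'keep'."""
    def has_run(pred):
        run = 0
        for amp in frame_amplitudes:
            run = run + 1 if pred(amp) else 0
            if run >= min_run:
                return True
        return False

    if has_run(lambda a: a > upper):
        return "turn_down"
    if has_run(lambda a: a < lower):
        return "turn_up"
    return "keep"
```

Requiring a run of consecutive frames, rather than a single outlier, keeps a one-frame click or pause from triggering a gain change.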
2) Optimizing the voice data according to the reverberation analysis result
In the reverberation analysis, the current segment may be checked with a reverberation detection model constructed in advance. The model can be obtained by collecting a large amount of voice data and building a statistical model, such as a deep neural network. The spectral features of the current segment are extracted as the input of the reverberation detection model, and the reverberation time T60 of the segment (the time needed for the acoustic energy to decay by 60 dB after the source stops sounding) is computed. If T60 exceeds a set threshold, the reverberation of the current segment is considered excessive and is removed with a dereverberation method, for example one based on inverse filtering, to guarantee the clarity of the segment.
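The embodiment computes T60 with a trained detection model. As a point of reference only, T60 can also be estimated directly from a measured room impulse response by classical Schroeder backward integration, fitting the early decay and extrapolating to 60 dB; this sketch is that classical estimate, not the model-based method described above:

```python
import numpy as np

def rt60_schroeder(impulse_response, sample_rate):
    """Estimate T60 from a room impulse response: build the Schroeder
    energy decay curve, least-squares fit its -5 dB..-25 dB portion,
    and extrapolate the fitted slope to a full 60 dB decay."""
    energy = np.asarray(impulse_response, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]              # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)   # decay curve in dB
    t = np.arange(len(edc_db)) / sample_rate
    fit = (edc_db <= -5.0) & (edc_db >= -25.0)       # usable decay range
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)    # dB per second
    return -60.0 / slope
```

On an ideal exponential decay the fitted slope recovers the true T60 almost exactly; the model-based detector of the embodiment exists precisely because a clean impulse response is rarely available from running speech.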
3) Optimizing the voice data according to the noise analysis result
The signal-to-noise ratio of the current voice data is computed; if it is below the preset SNR threshold, noise reduction is applied to the current segment, for example by speech enhancement. Speech enhancement extracts original voice data that is as clean as possible from noisy voice data, eliminating background noise so as to improve the voice quality and keep the user from tiring. Concrete speech enhancement methods belong to the prior art; for example, enhancement with a neural network: the amplitude spectrum features of a large amount of noisy speech serve as the input of an enhancement model and the amplitude spectrum features of the corresponding clean speech as its output, and the model is trained. The amplitude spectrum features of the current noisy voice data are then fed to the trained model to obtain the amplitude spectrum features of clean voice data, and finally the clean voice data is recovered from those features together with the phase information of the original voice data.
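Before deciding whether to denoise, the segment's signal-to-noise ratio has to be estimated from the signal itself. One common rough estimate (an illustrative sketch with assumed frame length and noise fraction, not the enhancement network described above) treats the quietest frames of the segment as the noise floor:

```python
import numpy as np

def estimate_snr_db(samples, frame_len=256):
    """Rough segment SNR in dB: take the quietest 10% of frames as the
    noise-power estimate and the remaining frames as speech plus noise."""
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples, dtype=float)[: n_frames * frame_len]
    power = (frames.reshape(n_frames, frame_len) ** 2).mean(axis=1)
    power.sort()
    k = max(1, n_frames // 10)
    noise = power[:k].mean()
    speech = power[k:].mean() - noise            # subtract the noise floor
    return 10.0 * np.log10(max(speech, 1e-12) / max(noise, 1e-12))
```

The resulting value is then compared against the SNR threshold above to decide whether the segment enters the enhancement path.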
It should be noted that the optimization above may be applied to all speech segments of the voice data to be played, or only to the key information segments. Moreover, the processing may take place before or after the key information segments are determined; this embodiment of the present invention places no limitation on it.
To help the user find valuable voice data quickly and efficiently, the speech playing method of the embodiment of the present invention analyzes the voice data before playback both at the level of voice content and/or voiceprint features and at the signal level. From the content/voiceprint analysis result, the key information segments of the voice data to be played are obtained; from the signal-level analysis result, it is judged whether the voice data has quality problems, and if so the voice data is optimized for those problems, which improves the quality of the voice data to be played and the listening experience of the user.
Correspondingly, an embodiment of the present invention also provides a voice playing device; Fig. 2 is a schematic structural diagram of the device.
In this embodiment, the device comprises:
a receiving module 201, configured to receive the voice data to be played;
an endpoint detection module 202, configured to perform endpoint detection on the voice data to be played to obtain the speech segments;
a key information segment determining module 203, comprising a first determining module 231 and/or a second determining module 232, where the first determining module 231 is configured to determine whether a speech segment is a key information segment according to the voice content of each segment, and the second determining module 232 is configured to determine whether a speech segment is a key information segment according to the voiceprint features of each segment;
a playing module 204, configured to play the voice data to be played;
a speech rate adjusting module 205, configured to adjust, while the playing module 204 plays the voice data to be played, the speech rate of the voice data according to the key information segments.
Fig. 2 shows the case where the key information segment determining module 203 comprises both the first determining module 231 and the second determining module 232.
The first determining module 231 determines whether a speech segment is a key information segment according to the voice content of each segment. It comprises a speech recognition unit and a determination unit: the speech recognition unit performs speech recognition on each speech segment to obtain its recognition text, and the determination unit determines from the recognition text of each segment whether the segment is a key information segment.
In a particular application, the determination unit may use predetermined keywords or abstract extraction to determine whether each speech segment is a key information segment.
For example, in one embodiment the determination unit determines whether the recognition text of each segment contains a preset keyword; if so, the segment is determined to be a key information segment, otherwise it is not.
For another example, in another embodiment the determination unit extracts abstract speech segments from all segments iteratively and, after a set number of iterations is reached, obtains multiple abstract segments, which are taken as the key information segments. Correspondingly, its specific structure may comprise the following subunits:
an iteration number setting subunit, configured to set the number of iterations;
an abstract extraction subunit, configured to extract abstract speech segments from all segments iteratively;
a judgment subunit, configured to judge whether the set number of iterations has been reached and, once it has, trigger the abstract extraction subunit to stop the iterative process;
a key information segment obtaining subunit, configured to take, after the abstract extraction subunit stops iterating, all abstract segments obtained so far as the key information segments.
The abstract extraction subunit comprises:
a first computing subunit, configured to compute the similarity between the recognition text of the current segment and the recognition text of the whole voice data to be played, obtaining a first value;
a second computing subunit, configured to compute the similarity between the recognition text of the current segment and the recognition text of the abstract segments already extracted, obtaining a second value;
a difference computing subunit, configured to take the difference between the first value and the second value as the abstract score of the current segment;
a selecting subunit, configured to select, once the abstract scores of all segments have been obtained, the segment with the largest abstract score as an abstract segment.
The concrete calculation of the abstract score of a segment is described at formula (1) above and is not repeated here.
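The score "first value minus second value" is the familiar maximal-marginal-relevance pattern: reward similarity to the whole recording, penalize redundancy with what has already been picked. A minimal sketch using bag-of-words cosine similarity over the recognition texts; the similarity measure itself is an assumption here, since formula (1) is defined earlier in the description:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(cnt * b[word] for word, cnt in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def extract_abstract(segments, n_iterations):
    """Iteratively pick the segment whose abstract score
    sim(segment, whole text) - sim(segment, abstract so far) is largest."""
    whole = Counter(w for text in segments for w in text.split())
    chosen = []
    for _ in range(n_iterations):
        summary = Counter(w for text in chosen for w in text.split())
        remaining = [s for s in segments if s not in chosen]
        best = max(remaining, key=lambda s: cosine(Counter(s.split()), whole)
                                          - cosine(Counter(s.split()), summary))
        chosen.append(best)
    return chosen
```

The redundancy penalty is what makes the second pick diverge from the first: the segment most similar to an already-chosen one is no longer the most attractive candidate.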
The second determining module 232 determines whether a speech segment is a key information segment according to the voiceprint features of each segment. Specifically, if the voice data to be played contains the voice data of multiple speakers, the voice data of a specific speaker or of the main speaker may be taken as the key information segments.
For a specific speaker, one concrete structure of the second determining module 232 may comprise the following units:
a voiceprint feature extraction unit, configured to extract the voiceprint features of each speech segment when the voice data to be played contains the voice data of multiple speakers;
a voiceprint recognition unit, configured to determine, according to the voiceprint features and the voiceprint model of the specific speaker, whether a segment is voice data of that speaker; if so, the segment is determined to be a key information segment.
For the main speaker, one concrete structure of the second determining module 232 may comprise the following units:
a voiceprint feature extraction unit, configured to extract the voiceprint features of each speech segment when the voice data to be played contains the voice data of multiple speakers;
a speaker separation unit, configured to separate the voice data of each speaker using speaker separation techniques based on the voiceprint features, determine the main speaker, and take the main speaker's segments as the key information segments. Speaker separation may use the prior art, for example clustering algorithms.
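A toy illustration of the separation-by-clustering idea mentioned above: group per-segment voiceprint vectors greedily by cosine similarity and call the largest cluster the main speaker. Real systems use stronger embeddings and clustering methods; the threshold and the greedy single pass here are assumptions made purely for illustration:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def main_speaker_segments(voiceprints, threshold=0.9):
    """Greedy single-pass clustering of per-segment voiceprint vectors;
    returns the indices of the segments in the largest cluster, i.e. the
    segments attributed to the main speaker."""
    clusters = []  # each entry: (representative vector, [segment indices])
    for i, vec in enumerate(voiceprints):
        for rep, members in clusters:
            if _cos(rep, vec) >= threshold:
                members.append(i)
                break
        else:  # no cluster close enough: start a new one
            clusters.append((vec, [i]))
    return max(clusters, key=lambda c: len(c[1]))[1]
```

Choosing the main speaker by segment count is itself a simplification; total speaking time per cluster would be the more natural criterion in practice.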
When the key information segment determining module 203 comprises both the first determining module 231 and the second determining module 232, the segments that satisfy both requirements at once may be taken as the key information segments. For example, the second determining module 232 may first perform voiceprint recognition with the voiceprint model of a specific speaker to identify that speaker's segments, and the first determining module 231 may then pick out, from those segments, the ones whose recognition text contains the predetermined keywords and take them as the key information segments. Alternatively, the first determining module 231 may first extract multiple abstract segments from the recognition text of the voice data to be played, and the second determining module 232 may then perform voiceprint recognition on those abstract segments with the specific speaker's voiceprint model, taking the abstract segments belonging to that speaker as the key information segments.
In the embodiment of the present invention, the speech rate adjusting module 205 adjusts the speech rate of the voice data to be played segment by segment. Different adjustment schemes are possible depending on the application demand; for example, when the current segment is a key information segment its playback rate is set to normal and otherwise to fast, or when the current segment is a key information segment its playback rate is set to slow and otherwise to normal or fast.
The voice playing device of the embodiment of the present invention analyzes the voice data to be played at the level of voice content and/or voiceprint features, determines the key information segments therein, and adjusts the speech rate of the voice data according to those segments during playback. It can therefore help the user find the segments of interest or of value quickly and accurately within a large amount of voice data, for example segments relevant to the topic of a meeting within a large amount of meeting recordings.
To further improve the playback effect and guarantee the validity of the key information segments, as shown in Fig. 3, another embodiment of the voice playing device of the present invention further comprises a confidence obtaining module 206, configured to obtain the confidence of each speech segment.
It should be noted that the confidence obtaining module 206 may obtain the posterior probability of each speech segment through the speech recognition process and take that posterior probability as the confidence of the segment.
Since the key information segment determining module 203 has different implementations depending on the application demand, when it comprises the first determining module 231 (which contains a speech recognition unit and a determination unit), the confidence obtaining module 206 may be integrated with the key information segment determining module 203. When the key information segment determining module 203 comprises only the second determining module 232, the confidence obtaining module 206 may work as a standalone module: it extracts the features of each segment, decodes the segment with the extracted features and pre-trained acoustic and language models to obtain the posterior probability of each word in the segment, averages the posterior probabilities of all words in the segment to obtain the posterior probability of the segment, and takes that posterior probability as the confidence of the segment.
That is, the structure shown in Fig. 3 is only a schematic diagram given to ease understanding of the voice playing device of the present invention, not its physical structure in application. Moreover, some modules may need adaptive adjustment according to the application demand.
Correspondingly, in this embodiment the speech rate adjusting module 205 adjusts the speech rate of the voice data to be played according to both the key information segments and the confidence of each segment. Again, different adjustment schemes are possible depending on the application demand. For example: if the current segment is a key information segment, it is played at fast rate when its confidence is greater than the second threshold and at slow rate otherwise; if the current segment is not a key information segment, it is skipped when its confidence is greater than the second threshold and played at slow rate when its confidence is less than or equal to the first threshold (the first threshold being less than the second threshold).
Compared with the embodiment shown in Fig. 2, the voice playing device of this embodiment adjusts the speech rate of the voice data to be played more flexibly, better guarantees the validity of the key information segments, further improves the playback effect, and satisfies different application demands of the user.
To further improve the quality of the voice data to be played and the listening experience of the user, as shown in Fig. 4, another embodiment of the voice playing device of the present invention may further comprise a signal analysis and processing module 207, configured to analyze each speech segment at the signal level and optimize the segment according to the analysis result.
The signal analysis and processing module 207 comprises any one or more of the following units:
a volume analysis and processing unit, configured to compute the amplitude of each frame of voice data frame by frame, turn down the amplitude of the current segment when the amplitude of several consecutive frames exceeds an upper limit, and turn up the amplitude of the current segment when the amplitude of several consecutive frames is below a lower limit;
a reverberation analysis and processing unit, configured to compute the reverberation time of each segment and perform dereverberation on the current segment when its reverberation time exceeds a threshold;
a noise analysis and processing unit, configured to compute the signal-to-noise ratio of each segment and denoise the current segment when its signal-to-noise ratio is below the SNR threshold.
The computation and detection of the amplitude, reverberation time, and signal-to-noise ratio of the speech segments are described in the method embodiments above and not repeated here.
It should be noted that the optimization by the signal analysis and processing module 207 may apply to all speech segments of the voice data to be played or only to the key information segments, and the processing may take place before or after the key information segments are determined; this embodiment of the present invention places no limitation on it. In addition, practical applications are not limited to the structure shown in Fig. 4: another embodiment of the device of the present invention may comprise both the confidence obtaining module 206 and the signal analysis and processing module 207, and the concrete structure of each module may be adapted to the application demand.
The voice playing device of this embodiment analyzes the voice data before playback both at the level of voice content and/or voiceprint features and at the signal level. From the content/voiceprint analysis result, the key information segments of the voice data to be played are obtained; from the signal-level analysis result, it is judged whether the voice data has quality problems, and if so the voice data is optimized for those problems, which improves the quality of the voice data to be played and the listening experience of the user.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments may refer to one another, and each embodiment focuses on its differences from the others. The device embodiments in particular are described briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The device embodiments described above are merely exemplary: units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment, which those of ordinary skill in the art can understand and implement without creative work.
The embodiments of the present invention have been described in detail above with specific examples, which serve only to help understand the method and device of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the application scope according to the idea of the present invention. In summary, the contents of this specification are not to be construed as limiting the invention.

Claims (21)

1. A speech playing method, characterized by comprising:
receiving voice data to be played;
performing endpoint detection on the voice data to be played to obtain speech segments;
determining whether a speech segment is a key information segment according to the voice content and/or voiceprint features of each speech segment;
when playing the voice data to be played, adjusting the speech rate of the voice data to be played according to the key information segment.
2. The method according to claim 1, characterized in that determining whether the speech segment is a key information segment according to the voice content of each speech segment comprises:
performing speech recognition on each speech segment to obtain the recognition text of each speech segment;
determining whether the speech segment is a key information segment according to the recognition text of each speech segment.
3. The method according to claim 2, characterized in that determining whether the speech segment is a key information segment according to the recognition text of each speech segment comprises:
determining whether the recognition text of each speech segment contains a preset keyword;
if so, determining that the speech segment is a key information segment.
4. The method according to claim 2, characterized in that determining whether the speech segment is a key information segment according to the recognition text of each speech segment comprises:
extracting abstract speech segments from all speech segments iteratively, obtaining multiple abstract speech segments after a set number of iterations is reached, and taking the multiple abstract speech segments as key information segments.
5. The method according to claim 4, characterized in that extracting an abstract speech segment from all speech segments comprises:
computing the similarity between the recognition text of the current speech segment and the recognition text of the voice data to be played to obtain a first value;
computing the similarity between the recognition text of the current speech segment and the recognition text of the abstract speech segments already extracted to obtain a second value;
computing the difference between the first value and the second value to obtain the abstract score of the current speech segment;
after obtaining the abstract scores of all speech segments, selecting the speech segment with the largest abstract score as an abstract speech segment.
6. The method according to claim 1, characterized in that determining whether the speech segment is a key information segment according to the voiceprint features of each speech segment comprises:
if the voice data to be played contains the voice data of multiple speakers, extracting the voiceprint features of each speech segment;
determining, according to the voiceprint features and the voiceprint model of a specific speaker, whether the speech segment is voice data of the specific speaker;
if so, determining that the speech segment is a key information segment.
7. The method according to claim 1, characterized in that determining whether the speech segment is a key information segment according to the voiceprint features of each speech segment comprises:
if the voice data to be played contains the voice data of multiple speakers, determining the main speaker by speaker separation techniques;
taking the speech segments of the main speaker as key information segments.
8. The method according to any one of claims 1 to 7, characterized in that adjusting the speech rate of the voice data to be played according to the key information segment comprises:
if the current speech segment is a key information segment, playing the current speech segment at normal speech rate, otherwise playing it at fast speech rate; or
if the current speech segment is a key information segment, playing the current speech segment at slow speech rate, otherwise playing it at normal or fast speech rate.
9. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
obtaining the confidence of each speech segment;
the adjustment of the speech rate of the voice data to be played being specifically: adjusting the speech rate of the voice data to be played according to the key information segment and the confidence of each speech segment.
10. The method according to claim 9, characterized in that adjusting the speech rate of the voice data to be played according to the key information segment and the confidence of each speech segment comprises:
if the current speech segment is a key information segment, playing the current speech segment at fast speech rate when its confidence is greater than a second threshold, otherwise playing it at slow speech rate;
if the current speech segment is not a key information segment, skipping the current speech segment when its confidence is greater than the second threshold, and playing it at slow speech rate when its confidence is less than or equal to a first threshold, the first threshold being less than the second threshold.
11. The method according to claim 1, characterized in that the method further comprises:
analyzing each speech segment at the signal level, the signal-level analysis comprising any one or more of the following: volume variation, reverberation, and noise conditions;
when playing the voice data to be played, optimizing the speech segment according to the analysis result, the optimization comprising any one or more of the following:
(1) if the amplitude of several consecutive frames of voice data in the current speech segment exceeds an upper limit, turning down the amplitude of the current speech segment; if the amplitude of several consecutive frames of voice data in the current speech segment is below a lower limit, turning up the amplitude of the current speech segment;
(2) if the reverberation time of the current speech segment exceeds a threshold, performing dereverberation on the current speech segment;
(3) if the signal-to-noise ratio of the current speech segment is below the SNR threshold, denoising the current speech segment.
12. A voice playing device, characterized by comprising:
a receiving module, configured to receive voice data to be played;
an endpoint detection module, configured to perform endpoint detection on the voice data to be played to obtain speech segments;
a key information segment determining module, comprising a first determining module and/or a second determining module, the first determining module being configured to determine whether the speech segment is a key information segment according to the voice content of each speech segment, and the second determining module being configured to determine whether the speech segment is a key information segment according to the voiceprint features of each speech segment;
a playing module, configured to play the voice data to be played;
a speech rate adjusting module, configured to adjust, while the playing module plays the voice data to be played, the speech rate of the voice data to be played according to the key information segment.
13. The device according to claim 12, characterized in that the first determining module comprises:
a speech recognition unit, configured to perform speech recognition on each speech segment to obtain the recognition text of each speech segment;
a determination unit, configured to determine whether the speech segment is a key information segment according to the recognition text of each speech segment.
14. The device according to claim 13, characterized in that
the determination unit is specifically configured to determine whether the recognition text of each speech segment contains a preset keyword and, if so, determine that the speech segment is a key information segment.
15. The device according to claim 13, characterized in that the determination unit comprises:
an iteration number setting subunit, configured to set the number of iterations;
an abstract extraction subunit, configured to extract abstract speech segments from all speech segments iteratively;
a judgment subunit, configured to judge whether the set number of iterations has been reached and, once it has, trigger the abstract extraction subunit to stop the iterative process;
a key information segment obtaining subunit, configured to take, after the abstract extraction subunit stops the iterative process, all abstract speech segments obtained so far as key information segments.
16. The device according to claim 15, which is characterized in that the abstract extracting subunit includes:
A first computation subunit, configured to calculate the similarity between the recognition text of the current speech segment and the recognition text of the whole voice data to be played, to obtain a first calculated value;
A second computation subunit, configured to calculate the similarity between the recognition text of the current speech segment and the recognition text of the already-extracted abstract speech segments, to obtain a second calculated value;
A difference computation subunit, configured to calculate the difference between the first calculated value and the second calculated value, to obtain the abstract score of the current speech segment;
A selection subunit, configured to select, after the abstract scores of all speech segments have been obtained, the speech segment with the largest abstract score as an abstract speech segment.
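Claims 15 and 16 together describe a maximal-marginal-relevance-style summarization loop: each iteration picks the segment most similar to the whole document and least similar to what was already picked. A minimal sketch in Python, assuming a bag-of-words cosine similarity over the recognition texts (the similarity measure and all names here are illustrative, not specified by the patent):

```python
from collections import Counter
import math

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts over raw word counts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_abstract_segments(segments, num_iterations):
    """Iteratively extract abstract speech segments (claim 15/16)."""
    full_text = " ".join(segments)
    picked = []
    for _ in range(num_iterations):              # iteration-count subunit
        best, best_score = None, float("-inf")
        for seg in segments:
            if seg in picked:
                continue
            first = similarity(seg, full_text)   # first calculated value
            second = similarity(seg, " ".join(picked)) if picked else 0.0
            score = first - second               # abstract score (difference)
            if score > best_score:
                best, best_score = seg, score
        if best is None:                         # ran out of segments early
            break
        picked.append(best)                      # new abstract speech segment
    return picked                                # used as key message sections
```

The `first - second` scoring penalizes redundancy, so successive picks cover different parts of the recording rather than repeating the most central segment.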
17. The device according to claim 12, which is characterized in that the second determining module includes:
A voiceprint feature extraction unit, configured to extract the voiceprint feature of each speech segment when the voice data to be played includes the voice data of multiple speakers;
A voiceprint recognition unit, configured to determine, according to the voiceprint feature and the voiceprint model of a specific speaker, whether a speech segment is voice data of the specific speaker, and if so, to determine that the speech segment is a key message section.
18. The device according to claim 12, which is characterized in that the second determining module includes:
A voiceprint feature extraction unit, configured to extract the voiceprint feature of each speech segment when the voice data to be played includes the voice data of multiple speakers;
A speaker separation unit, configured to determine the main speaker according to the voiceprint features by using a speaker separation technique, and to use the speech segments of the main speaker as key message sections.
19. The device according to any one of claims 12 to 18, which is characterized in that
the word speed adjusting module is specifically configured to: when the current speech segment is a key message section, adjust its playing word speed to a normal word speed, and otherwise adjust its playing word speed to a fast word speed; or, when the current speech segment is a key message section, adjust its playing word speed to a slow word speed, and otherwise adjust its playing word speed to a normal or fast word speed.
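The two alternative rate policies of claim 19 reduce to a small lookup per segment. A hypothetical sketch; the numeric rate multipliers and policy names are illustrative, not values from the patent:

```python
SLOW, NORMAL, FAST = 0.8, 1.0, 1.25   # illustrative playback-rate multipliers

def playback_rate(is_key_segment: bool, policy: str = "speed_up_rest") -> float:
    """Return the playback-rate multiplier for a speech segment.

    "speed_up_rest": key message sections play at normal rate, the rest fast.
    "slow_down_key": key message sections are slowed, the rest play normally.
    """
    if policy == "speed_up_rest":
        return NORMAL if is_key_segment else FAST
    if policy == "slow_down_key":
        return SLOW if is_key_segment else NORMAL
    raise ValueError("unknown policy: " + policy)
```

Either way the effect is the same: relative to the rest of the recording, key message sections get more listening time.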
20. The device according to any one of claims 12 to 18, which is characterized in that the device further includes:
A confidence obtaining module, configured to obtain the confidence of each speech segment;
wherein the word speed adjusting module is specifically configured to adjust the word speed of the voice data to be played according to the key message sections and the confidence of each speech segment.
21. The device according to claim 12, which is characterized in that the device further includes:
A signal analysis and processing module, configured to analyze each speech segment at the voice-signal level and to optimize the speech segment based on the analysis results; the signal analysis and processing module includes any one or more of the following units:
A volume analysis and processing unit, configured to calculate the amplitude of each frame of voice data frame by frame, to turn down the amplitude of the current speech segment when the amplitude of consecutive frames of voice data in the current speech segment exceeds an upper limit, and to turn up the amplitude of the current speech segment when the amplitude of consecutive frames of voice data in the current speech segment is below a lower limit;
A reverberation analysis and processing unit, configured to calculate the reverberation time of each speech segment, and to perform reverberation elimination on the current speech segment when its reverberation time exceeds a threshold;
A noise analysis and processing unit, configured to calculate the noise of each speech segment, and to perform denoising on the current speech segment when its noise exceeds a signal-to-noise-ratio threshold.
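The volume analysis unit of claim 21 works on per-frame amplitudes and triggers only on runs of consecutive frames. A rough sketch of that logic; the limits, run length, and gain factors are illustrative choices for this example, not values given in the patent:

```python
def has_run(values, pred, min_run):
    """True if `pred` holds for at least `min_run` consecutive values."""
    run = 0
    for v in values:
        run = run + 1 if pred(v) else 0
        if run >= min_run:
            return True
    return False

def adjust_segment_volume(frame_amplitudes, upper=20000, lower=500, min_run=5):
    """Return a gain for the whole segment: attenuate when consecutive
    frames exceed the upper limit, boost when consecutive frames fall
    below the lower limit, otherwise leave the segment unchanged."""
    if has_run(frame_amplitudes, lambda a: a > upper, min_run):
        return 0.5   # illustrative attenuation factor
    if has_run(frame_amplitudes, lambda a: a < lower, min_run):
        return 2.0   # illustrative boost factor
    return 1.0
```

Requiring a run of consecutive frames, rather than reacting to any single frame, keeps one transient click or pause from rescaling the whole segment.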
CN201510757786.6A 2015-11-04 2015-11-04 Speech playing method and device Active CN105405439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510757786.6A CN105405439B (en) 2015-11-04 2015-11-04 Speech playing method and device


Publications (2)

Publication Number Publication Date
CN105405439A CN105405439A (en) 2016-03-16
CN105405439B true CN105405439B (en) 2019-07-05

Family

ID=55470882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510757786.6A Active CN105405439B (en) 2015-11-04 2015-11-04 Speech playing method and device

Country Status (1)

Country Link
CN (1) CN105405439B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105933128A (en) * 2016-04-25 2016-09-07 四川联友电讯技术有限公司 Audio conference minute push method based on noise filtering and identity authentication
CN105869626B (en) * 2016-05-31 2019-02-05 宇龙计算机通信科技(深圳)有限公司 A kind of method and terminal of word speed automatic adjustment
CN107464570A (en) * 2016-06-06 2017-12-12 中兴通讯股份有限公司 A kind of voice filtering method, apparatus and system
CN106098057B (en) * 2016-06-13 2020-02-07 北京云知声信息技术有限公司 Playing speech rate management method and device
WO2018036555A1 (en) * 2016-08-25 2018-03-01 腾讯科技(深圳)有限公司 Session processing method and apparatus
CN106504744B (en) * 2016-10-26 2020-05-01 科大讯飞股份有限公司 Voice processing method and device
CN106910500B (en) * 2016-12-23 2020-04-17 北京小鸟听听科技有限公司 Method and device for voice control of device with microphone array
CN106782571A (en) * 2017-01-19 2017-05-31 广东美的厨房电器制造有限公司 The display methods and device of a kind of control interface
CN107369085A (en) * 2017-06-28 2017-11-21 深圳市佰仟金融服务有限公司 A kind of information output method, device and terminal device
CN107464563B (en) * 2017-08-11 2020-08-04 广州迪宝乐电子有限公司 Voice interaction toy
CN107689227A (en) * 2017-08-23 2018-02-13 上海爱优威软件开发有限公司 A kind of voice de-noising method and system based on data fusion
CN107610718A (en) * 2017-08-29 2018-01-19 深圳市买买提乐购金融服务有限公司 A kind of method and device that voice document content is marked
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN110709921A (en) * 2018-05-28 2020-01-17 深圳市大疆创新科技有限公司 Noise reduction method and device and unmanned aerial vehicle
CN109256153B (en) * 2018-08-29 2021-03-02 云知声智能科技股份有限公司 Sound source positioning method and system
CN109147802B (en) * 2018-10-22 2020-10-20 珠海格力电器股份有限公司 Playing speed adjusting method and device
CN109119093A (en) * 2018-10-30 2019-01-01 Oppo广东移动通信有限公司 Voice de-noising method, device, storage medium and mobile terminal
CN109979467B (en) * 2019-01-25 2021-02-23 出门问问信息科技有限公司 Human voice filtering method, device, equipment and storage medium
CN109979474B (en) * 2019-03-01 2021-04-13 珠海格力电器股份有限公司 Voice equipment and user speech rate correction method and device thereof and storage medium
CN110300001B (en) * 2019-05-21 2022-03-15 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN112133279A (en) * 2019-06-06 2020-12-25 Tcl集团股份有限公司 Vehicle-mounted information broadcasting method and device and terminal equipment
CN110534127A (en) * 2019-09-24 2019-12-03 华南理工大学 Applied to the microphone array voice enhancement method and device in indoor environment
CN113571054B (en) * 2020-04-28 2023-08-15 ***通信集团浙江有限公司 Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN111899742B (en) * 2020-08-06 2021-03-23 广州科天视畅信息科技有限公司 Method and system for improving conference efficiency
CN113364665B (en) * 2021-05-24 2023-10-24 维沃移动通信有限公司 Information broadcasting method and electronic equipment
CN114566164A (en) * 2022-02-23 2022-05-31 成都智元汇信息技术股份有限公司 Manual broadcast audio self-adaption method, display terminal and system based on public transport

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787070A (en) * 2005-12-09 2006-06-14 北京凌声芯语音科技有限公司 Chip upper system for language learner
KR20080039704A (en) * 2006-11-01 2008-05-07 엘지전자 주식회사 Portable audio player and controlling method thereof
CN103377203A (en) * 2012-04-18 2013-10-30 宇龙计算机通信科技(深圳)有限公司 Terminal and sound record management method
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN103929539A (en) * 2014-04-10 2014-07-16 惠州Tcl移动通信有限公司 Mobile terminal notepad processing method and system based on voice recognition
CN104125340A (en) * 2014-07-25 2014-10-29 广东欧珀移动通信有限公司 Generating managing method and system for call sound recording files
CN104505108A (en) * 2014-12-04 2015-04-08 广东欧珀移动通信有限公司 Information positioning method and terminal


Also Published As

Publication number Publication date
CN105405439A (en) 2016-03-16

Similar Documents

Publication Publication Date Title
CN105405439B (en) Speech playing method and device
US20200349956A1 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
US11276407B2 (en) Metadata-based diarization of teleconferences
CN105161093B (en) A kind of method and system judging speaker's number
Zhou et al. Efficient audio stream segmentation via the combined T/sup 2/statistic and Bayesian information criterion
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
Chaudhuri et al. Ava-speech: A densely labeled dataset of speech activity in movies
US20030231775A1 (en) Robust detection and classification of objects in audio using limited training data
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
CN111128223A (en) Text information-based auxiliary speaker separation method and related device
Zelenák et al. Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign
CN103530432A (en) Conference recorder with speech extracting function and speech extracting method
CN106791579A (en) The processing method and system of a kind of Video Frequency Conference Quality
KR101616112B1 (en) Speaker separation system and method using voice feature vectors
CN112270933B (en) Audio identification method and device
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN107358947A (en) Speaker recognition methods and system again
Lailler et al. Semi-supervised and unsupervised data extraction targeting speakers: From speaker roles to fame?
WO2017045429A1 (en) Audio data detection method and system and storage medium
WO2023088448A1 (en) Speech processing method and device, and storage medium
Ghaemmaghami et al. Complete-linkage clustering for voice activity detection in audio and visual speech
Ghosal et al. Automatic male-female voice discrimination
CN107507627B (en) Voice data heat analysis method and system
Li et al. Instructional video content analysis using audio information
Rentzeperis et al. The 2006 athens information technology speech activity detection and speaker diarization systems

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant