CN100461179C - Audio analysis system based on content - Google Patents


Info

Publication number
CN100461179C
CN100461179C CNB2006101408314A CN200610140831A
Authority
CN
China
Prior art keywords
audio
audio stream
module
content
analysis system
Prior art date
Legal status
Expired - Fee Related
Application number
CNB2006101408314A
Other languages
Chinese (zh)
Other versions
CN101021854A (en)
Inventor
张弛
苏磊
鲍东山
Current Assignee
Beijing Nufront Software Technology Co., Ltd.
Original Assignee
BEIJING NUFRONT SOFTWARE TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by BEIJING NUFRONT SOFTWARE TECHNOLOGY Co Ltd
Priority to CNB2006101408314A
Publication of CN101021854A
Application granted
Publication of CN100461179C
Status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A content-based audio analysis system comprising: an audio stream acquisition module, an audio stream segmentation module, an audio stream type identification module, a speech stream analysis module, and a pinyin-sequence-based keyword retrieval module. The acquisition module obtains an audio stream from an external source and sends it to the segmentation module, which divides the stream so that each segment has a single acoustic characteristic. Each single-characteristic segment is analyzed by the type identification module to determine its acoustic type; segments judged to be non-speech are discarded, while speech segments are sent to the speech stream analysis module, which produces a pinyin graph. The pinyin-sequence-based keyword retrieval module then searches the pinyin graph for keywords of interest and obtains their positions in the audio stream.

Description

Content-based audio analysis system
Technical field
The invention provides a content-based automatic audio analysis system and method. More specifically, it provides an audio analysis system and method for determining the position of specified content within an audio fragment.
Background art
With the development of technology, more and more audio information has been stored in digital form in recent years. To put this information to effective use, and to let people find the information they need quickly and accurately, it is necessary to build an effective audio retrieval system.
Audio retrieval is the process of finding, within audio resources, the particular audio that meets a user's needs. At present, audio retrieval is mostly carried out against manually entered attribute descriptions; but as audio resources grow richer and users' retrieval demands increase, systems built this way can no longer accomplish the required tasks well. Content-based audio retrieval therefore needs to be studied; its basic idea is to retrieve by analyzing the audio features in the audio and their contextual relations.
Summary of the invention
The invention provides a content-based automatic audio analysis system and method.
The content-based automatic audio analysis system of the present invention comprises: an audio stream acquisition module, which obtains an audio stream from an external audio source according to a given decoding rule; an audio stream segmentation module, which segments the audio stream obtained from the acquisition module so that each resulting paragraph has a single acoustic characteristic; an audio stream type identification module, which analyzes the single-characteristic audio streams output by the segmentation module and determines their acoustic characteristic; a speech stream analysis module, which recognizes the audio streams identified as speech and obtains a pinyin graph; and a pinyin-sequence-based keyword retrieval module, which searches the pinyin graph obtained by the speech stream analysis module for keywords of interest and obtains the position of each keyword in the audio stream.
In the content-based automatic audio analysis system shown in Fig. 1, the audio stream acquisition module 100 obtains an audio stream from an external audio source according to a given decoding rule. The external source may be an audio file 101, a video file 102, or an audio input device 103. Audio and video files must be decoded according to the appropriate rule so that an audio stream containing only the data portion is obtained; for an audio input device such as a microphone, an interface to the device must be provided so that the analysis system likewise receives a data-only audio stream. The resulting audio stream is sent to the audio stream segmentation module for the next stage of processing.
The audio stream segmentation module 200 segments the audio stream obtained from the acquisition module; after segmentation, each segment has a single acoustic characteristic. Segmentation first looks for silence points in the stream using two units, energy 201 and energy variance 202: silence points are obtained by computing the energy and the energy variance of the audio signal. When the energy of the signal falls below a threshold, the system judges that a silence point has been found in the stream; when the variance of the energy values falls below a threshold, the system judges that the stream contains a silence point, and after silence has been detected it continues computing the energy variance until the variance rises above the threshold, which marks the end of the silent section and completes the detection of that silence point. The outputs of units 201 and 202 are combined into unified breakpoint information. The stream, with its silence points known, is then fed to the audio feature detection unit 203 to obtain finer audio change points: a change point of the audio characteristics is declared when the feature dissimilarity between adjacent audio fragments in the stream exceeds a threshold. Each segment output by unit 203 has a single acoustic characteristic, but because of the principles unit 203 applies to the audio, the segmentation at this stage can be too fine, so a breakpoint merging unit 204 is needed. Unit 204 examines how the acoustic characteristics change across adjacent segments; if the characteristics of two adjacent segments are sufficiently similar, it decides that the two segments should be merged. The segmented stream is then sent to the audio stream type identification module to determine the specific type of each segment.
The audio stream type identification module 300 analyzes the single-characteristic audio streams output by the segmentation module and determines their acoustic characteristic. Two analysis methods are used: a time-domain feature classification unit 301 and a frequency-domain feature classification unit 302. Unit 301 judges the type of a single audio stream from one or more time-domain features among zero-crossing rate, short-time energy, short-time energy standard deviation, silent-frame ratio, and sub-band energy distribution; unit 302 judges the type from one or both of the frequency-domain features linear prediction cepstral coefficients and Mel cepstral coefficients. The two units process the audio stream in parallel, and their outputs are fused after a confidence-based arbitration, which determines the type of the acoustic characteristic. After passing through units 301 and 302, the acoustic characteristic of each single-characteristic stream is determined.
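For illustration, the following Python sketch computes several of the time-domain features named above (short-time energy, its deviation, zero-crossing rate, and silent-frame ratio) from a mono PCM signal held in a NumPy array. The function names, framing parameters, and silence threshold are assumptions of this sketch, not part of the patent.

    import numpy as np

    def frame_signal(x, frame_len=400, hop=160):
        """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz).
        Assumes len(x) >= frame_len."""
        n = (len(x) - frame_len) // hop + 1
        return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

    def time_domain_features(x, silence_db=-40.0):
        frames = frame_signal(x)
        energy = (frames ** 2).mean(axis=1)                         # short-time energy
        zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)  # zero-crossing rate
        e_db = 10 * np.log10(energy + 1e-12)
        return {
            "mean_energy": float(energy.mean()),
            "energy_std": float(energy.std()),        # short-time energy deviation
            "mean_zcr": float(zcr.mean()),
            "silent_frame_ratio": float((e_db < silence_db).mean()),
        }

Speech typically shows a larger energy deviation and a higher silent-frame ratio than music, which is what makes such simple statistics usable for type classification.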
The speech stream analysis module 400 recognizes the audio streams whose acoustic characteristic the type identification module has determined to be speech, and obtains a pinyin graph. In the system of the present invention the module analyzes only the speech portion: non-speech streams are discarded after the type identification module, and only speech streams enter the speech stream analysis module for content analysis and recognition. Speech analysis is mainly based on frame-level feature vectors, so a speech stream coming from the type identification module first enters the feature vector sequence extraction unit 401, which produces the feature vector sequence representing that stream. In the model matching unit 402 this sequence is matched against the acoustic models of the pinyin syllables, and a matching distance is computed for each candidate pinyin sequence. After the inter-pinyin statistical dependency unit 403 the matching distances are recomputed with the statistical dependencies between pinyin syllables taken into account, yielding a pinyin graph composed of many candidate syllables. At this point the graph is still at the stage of the coarse pinyin graph 404. After adaptive correction and smoothing are applied to the coarse graph, the refined pinyin graph 405 is obtained and stored, which completes the analysis of the speech stream.
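The patent does not specify how the pinyin graph is represented. As one plausible reading, the sketch below (in Python; the class names are invented here) models the graph as a set of time-stamped arcs, each carrying a candidate syllable and the matching score assigned by unit 402:

    from dataclasses import dataclass, field

    @dataclass
    class PinyinArc:
        start: int       # start frame of the candidate syllable
        end: int         # end frame of the candidate syllable
        syllable: str    # candidate pinyin, e.g. "zhong1"
        score: float     # matching distance / score from the model matching unit

    @dataclass
    class PinyinGraph:
        arcs: list = field(default_factory=list)

        def add(self, start, end, syllable, score):
            self.arcs.append(PinyinArc(start, end, syllable, score))

        def successors(self, arc):
            """Candidate syllables that can follow `arc` in time."""
            return [a for a in self.arcs if a.start == arc.end]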
The pinyin-sequence-based keyword retrieval module 500 searches the pinyin graph obtained by the speech stream analysis module. The pinyin sequence unit 502 stores the pinyin sequence to be retrieved; that sequence and the pinyin graph output by unit 405 are fed together into the confidence computation unit 501, which computes the confidence. The confidence computation uses the forward-backward algorithm to calculate the posterior probability of the pinyin string, and a preset confidence threshold then decides whether the queried pinyin sequence occurs in the audio fragment. If the sequence is judged to occur, unit 501 also obtains its position in the audio stream.
Description of drawings
Fig. 1 is a block diagram of the content-based automatic audio analysis system of the present invention.
Embodiment
Referring to Fig. 1: for the audio analysis system, the audio stream acquisition module is the foundation of the whole analysis; it is the stage that preprocesses the data. The decoder can apply different decoding procedures to different audio sources. There are many possible sources, including asf/wma/wmv/avi/wav/mpeg/mp3/aiff/pcm/raw/vox, and the sampling frequency and sampling resolution differ from source to source: telephone audio is typically sampled at 8000 Hz while CD audio is typically sampled at 44100 Hz, and the sampling resolution likewise varies from 8 bits to 24 bits depending on the source. After the acquisition module, streams from these diverse sources are all converted into a single format; this unified stream has one sampling frequency and one sampling resolution and contains only the digitized audio itself.
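The patent leaves the decoder itself unspecified. Purely as an illustration of this unification step, an external tool such as ffmpeg (assumed to be installed; the target format below is an arbitrary choice of this sketch) can convert the listed container formats into a single PCM format:

    import subprocess

    def to_unified_pcm(src: str, dst: str = "unified.wav", rate: int = 16000) -> str:
        """Decode any supported file and emit 16 kHz, 16-bit mono PCM."""
        subprocess.run(
            ["ffmpeg", "-y", "-i", src,
             "-ac", "1",            # mono: keep only the audio data itself
             "-ar", str(rate),      # unified sampling frequency
             "-sample_fmt", "s16",  # unified 16-bit sampling resolution
             dst],
            check=True)
        return dst

Headerless sources (pcm/raw/vox) would additionally need their sample format stated explicitly, which is why the patent treats the decoding rule as source-dependent.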
For an audio stream of unknown character, its acoustic characteristics must be analyzed, which requires the methods of audio segmentation and audio type discrimination. Segmentation and type discrimination are the foundation of audio retrieval and are of crucial importance in the front-end signal processing for speech recognition. In a content-based speech analysis system, the input to the speech analysis module should be an audio fragment with a single acoustic characteristic; but speech collected in real environments rarely satisfies this condition, with various characteristics mixed together. Broadcast news is a typical example: its acoustic environment is complex and changeable and sentence boundaries are unknown, so feeding it into the speech analysis module without front-end processing would greatly degrade the module's performance. The speech stream must therefore be preprocessed by splitting it.
A pause, as an important prosodic feature, reflects the structure of the material. Pauses normally occur within and between sentences and between speech and non-speech; they appear as silence, where the signal is only background noise. By detecting silence, a continuous audio stream can be split, achieving a preliminary segmentation.
Judging whether a silence point exists by the level of the energy is the easiest approach to realize, and in the segmentation module energy is used as one criterion for silence. But because the acoustic environment in practice is not constant, the energy of silence varies from high to low, so relying on the energy level alone is not sufficient for segmentation; the variance of the energy is therefore used as a second criterion.
The energy variance is defined as follows:
σ = (1/N) · Σ_{i=1..N} (e_i − μ)², where μ = (1/N) · Σ_{i=1..N} e_i; e_i is the energy of frame i; N is the number of energy frames and depends on the minimum pause length: if the minimum pause is set to 300 ms and the energy frame rate is 100 frames per second, then N is 30.
The energy e of a frame is computed with the following formula:
e = (1/T) · Σ_{t=1..T} x²(t), where x(t) is the t-th sample and T is the total number of samples in a frame.
A threshold T_var is set, and the energy variance is computed over a window of length N. The threshold T_var is computed as follows:
T_var = α · log₁₀(σ_global), where σ_global is the energy variance over the entire audio stream and α is a scale factor taking a value between 0.7 and 1.0.
If the variance computed in a window is greater than the set threshold, the window contains no silence; the window is then moved by a fixed step and the energy variance recomputed. If the variance falls below the threshold, the signal inside the window contains a silence point. To find the end of the silence, the window length is increased until the computed variance exceeds the threshold, at which point one silence point has been found. The window length is then reset to N and the search continues, so that all silence points in the stream are found. The silence points cut the continuous stream into audio sections, which can then be processed section by section.
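A minimal Python sketch of the variance-based search just described. Note one interpretation made here: the patent compares a variance against the threshold T_var = α·log₁₀(σ_global), which mixes a raw value with a log; the sketch compares log₁₀ of the window variance against T_var so that both sides are on the same scale. That choice, and all parameter values, are assumptions of the sketch.

    import numpy as np

    def frame_energies(x, frame_len=160):
        """Non-overlapping 10 ms frame energies at 16 kHz (100 frames/s)."""
        n = len(x) // frame_len
        frames = x[:n * frame_len].reshape(n, frame_len)
        return (frames ** 2).mean(axis=1)

    def find_silences(x, n=30, alpha=0.85, step=3):
        """Return (start_frame, end_frame) spans judged silent by the variance rule."""
        e = frame_energies(x)
        thr = alpha * np.log10(e.var() + 1e-12)   # T_var = alpha * log10(sigma_global)
        silences, i = [], 0
        while i + n <= len(e):
            end = i + n
            if np.log10(e[i:end].var() + 1e-12) < thr:       # window contains silence
                while end < len(e) and np.log10(e[i:end + 1].var() + 1e-12) < thr:
                    end += 1                                 # grow until variance exceeds thr
                silences.append((i, end))
                i = end                                      # reset window, keep searching
            else:
                i += step                                    # slide the window by a step
        return silences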
Detecting a change point of the acoustic characteristics is, in essence, computing the distance between two models. The audio signal is first modeled with acoustic feature vectors, and the distance between the feature vectors of two adjacent windows is then computed. Computing that distance directly is difficult, so an indirect approach is taken: the acoustic feature vectors obtained this way are generally assumed to follow a Gaussian distribution, so the feature vectors within each window are first fitted with a Gaussian, and the distance between the two Gaussians is computed. The distance between acoustic feature vectors is thus converted into a distance between statistical models, and many methods exist for measuring that distance.
Suppose there are two Gaussian distributions, N(μ₁, Σ₁) and N(μ₂, Σ₂). Several common ways of computing the distance between Gaussians are as follows:
Kullback-Leibler-2 distance:
d_KL = (1/2)(μ₁ − μ₂)ᵀ(Σ₁⁻¹ + Σ₂⁻¹)(μ₁ − μ₂) + (1/2) tr(Σ₁⁻¹Σ₂ + Σ₂⁻¹Σ₁ − 2I)
Mahalanobis distance:
d_M = (1/2)(μ₁ − μ₂)ᵀ(Σ₁Σ₂)⁻¹(μ₁ − μ₂)
Bhattacharyya distance:
d_B = (1/4)(μ₁ − μ₂)ᵀ(Σ₁ + Σ₂)⁻¹(μ₁ − μ₂) + (1/2) log(|Σ₁ + Σ₂| / (2√(|Σ₁| · |Σ₂|)))
Computing the model distance is not limited to the three methods above; here the Kullback-Leibler-2 distance is used to compute the distance between the two models.
Clearly, if the distance between the two distributions is large, the point is likely an acoustic-characteristic jump point; if the distance is small, the two parts lie in similar acoustic environments and their acoustic characteristics should be the same.
On a continuous audio signal stream, two adjacent windows each take a segment of signal; the feature vectors in each window are fitted to a Gaussian distribution, and the distance between the two Gaussians is computed. The two windows are then moved by a fixed step and the distance computed again, producing a distance-measure curve. Finally, possible acoustic change points are picked from the curve according to a threshold-setting rule.
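Under the Gaussian assumption above, the two-window procedure with the Kullback-Leibler-2 distance can be sketched as follows (the small diagonal term added to each covariance is a numerical convenience of this sketch, not part of the patent):

    import numpy as np

    def fit_gaussian(a):
        """Mean and (regularized) covariance of a (frames x dims) feature block."""
        d = a.shape[1]
        return a.mean(axis=0), np.cov(a.T) + 1e-6 * np.eye(d)

    def kl2_distance(mu1, cov1, mu2, cov2):
        """Symmetric Kullback-Leibler-2 distance between two Gaussians."""
        inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
        dmu = mu1 - mu2
        quad = 0.5 * dmu @ (inv1 + inv2) @ dmu
        trace = 0.5 * np.trace(inv1 @ cov2 + inv2 @ cov1 - 2 * np.eye(len(mu1)))
        return quad + trace

    def distance_curve(feats, win=100, step=10):
        """Slide two adjacent windows over the feature sequence; peaks in the
        returned curve are candidate acoustic change points."""
        curve = []
        for t in range(win, len(feats) - win + 1, step):
            mu1, c1 = fit_gaussian(feats[t - win:t])
            mu2, c2 = fit_gaussian(feats[t:t + win])
            curve.append((t, kl2_distance(mu1, c1, mu2, c2)))
        return curve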
This unit is rather sensitive to changes in the environment: its recall is very high, but it also detects many redundant points, so its false-alarm rate is too high. It is precisely because of this behavior of the change-point detection unit that the breakpoint merging unit is necessary.
The breakpoint merging unit decides, given a candidate cut point, whether the two adjacent audio sections should be merged.
Suppose x₁, x₂, …, x_N ~ N(μ, Σ), that the audio fragment contains only one cut point, and that the jump occurs at time i, dividing the whole audio section into two parts. For the two parts after the split:
x₁, x₂, …, x_i ~ N(μ₁, Σ₁) and x_{i+1}, x_{i+2}, …, x_N ~ N(μ₂, Σ₂)
where Σ, Σ₁, Σ₂ are the covariance matrices of all the audio data, of the first i data points, and of the last N − i data points, respectively.
Whether to merge can thus be viewed as a model selection problem. Model one describes all the data with a single Gaussian distribution; model two takes the cut point as a boundary, divides the data into two parts, and describes each part with its own Gaussian distribution. The BIC value of the two models can be expressed with the following formula:
BIC = N log|Σ| − N₁ log|Σ₁| − N₂ log|Σ₂| − λ · (1/2)(d + (1/2) d(d + 1)) log N
where N, N₁, N₂ are the numbers of feature vectors describing the respective Gaussian distributions, d is the dimension of the feature space, and λ is a penalty factor, generally set to 1.
If the BIC value is less than 0, the two audio sections are considered to belong to the same distribution and should be merged; otherwise they are not merged.
The penalty factor λ in the formula above may also take different values, so different λ values can be set for different situations in order to obtain better results.
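The merge test translates directly into code. This sketch estimates the three covariances from the two candidate sections and applies the BIC formula with the penalty stated above; the diagonal loading term is an assumption added for numerical safety.

    import numpy as np

    def _logdet_cov(a):
        """log-determinant of the covariance of a (frames x dims) block."""
        d = a.shape[1]
        return np.linalg.slogdet(np.cov(a.T) + 1e-6 * np.eye(d))[1]

    def bic_score(x1, x2, lam=1.0):
        """N log|S| - N1 log|S1| - N2 log|S2| - lam * (1/2)(d + d(d+1)/2) log N."""
        x = np.vstack([x1, x2])
        n, d = x.shape
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return (n * _logdet_cov(x)
                - len(x1) * _logdet_cov(x1)
                - len(x2) * _logdet_cov(x2)
                - lam * penalty)

    def should_merge(x1, x2, lam=1.0):
        """Negative BIC: one Gaussian explains both sections, so merge them."""
        return bic_score(x1, x2, lam) < 0.0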
The breakpoints that remain after merging reveal the characteristic information of the audio stream; the sections whose audio characteristic is speech are used by the speech stream analysis module for analysis.
After a speech stream enters the speech analysis module, a feature vector sequence is first extracted from it. There are many kinds of speech features, such as LPC coefficients, Mel-frequency cepstral coefficients (MFCC), and perceptual linear predictive (PLP) parameters. The present invention does not depend on which parameter is adopted and is applicable to any characteristic parameter; MFCC coefficients are used here as an example.
The MFCC coefficients adopted here are 14-dimensional. The 14 MFCC coefficients, their first- and second-order differences, the log energy, and the first- and second-order differences of the log energy together constitute 45-dimensional feature vectors. The vectors extracted frame by frame are assembled into the feature vector sequence.
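A sketch of this 45-dimensional front end, using librosa as a stand-in feature extractor (the patent names no library; the sample rate and framing defaults below are assumptions of the sketch):

    import librosa
    import numpy as np

    def feature_sequence(path):
        """14 MFCCs + log energy (15 dims), plus first- and second-order
        differences of all 15, giving 45-dimensional vectors per frame."""
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14)       # (14, T)
        loge = np.log(librosa.feature.rms(y=y) ** 2 + 1e-12)     # (1, T) log energy
        base = np.vstack([mfcc, loge])                           # (15, T)
        d1 = librosa.feature.delta(base)                         # first-order differences
        d2 = librosa.feature.delta(base, order=2)                # second-order differences
        return np.vstack([base, d1, d2]).T                       # (T, 45)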
The model matching unit can likewise adopt several kinds of acoustic models: the monophone model, the biphone model, the triphone model, or polyphone models with more context. For ease of description the monophone model is adopted here. A phone-level language model is also added in the model matching unit, i.e. the statistical dependencies between pinyin syllables are used during model matching, in order to obtain a better pinyin graph.
The model matching unit yields the coarse pinyin graph. Once the coarse graph is obtained, adaptive and smoothing methods can be applied to revise it so that it better reflects the information carried by the audio stream. Common adaptation methods include maximum a posteriori (MAP) adaptation and maximum likelihood linear regression (MLLR); the MLLR adaptation method is adopted here. In this way the refined pinyin graph is obtained, and this graph can be used for keyword retrieval.
Keyword retrieval systems generally fall into the following two classes:
The first is the single-stage system, in which the search runs on a network of keyword models connected in parallel with non-keyword models (also called garbage models). Whenever the keyword changes, the system must analyze the speech stream again, so retrieval is slow when the same audio stream is queried repeatedly; this class is therefore unsuitable for occasions where the user needs to revise the query repeatedly.
The second is the two-stage system. The first stage is a preprocessing stage, in which the speech analysis module converts the speech stream into a pinyin graph, a word graph, or text; each audio stream needs to be processed only once, and responding to later user queries only requires searching for matches in the pinyin graph, word graph, or text.
In the present system the audio retrieval task is to return results for different query requests over a fixed database, so the two-stage architecture is taken as the system scheme.
Having obtained the refined pinyin graph above, the input pinyin sequence to be retrieved is now processed together with the refined pinyin graph, which yields the retrieval result.
During retrieval, the forward-backward algorithm is used to compute the posterior probability of the pinyin sequence to be retrieved, and thereby its confidence; a preset confidence threshold then decides whether the queried pinyin sequence occurs in the audio fragment. If the sequence is present, its position in the audio stream is obtained at the same time.
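As a simplified illustration of this second retrieval stage, suppose the forward-backward pass has already attached a posterior probability to every arc of the refined pinyin graph; the Arc class, the path-scoring rule (product of arc posteriors), and the threshold value below are inventions of this sketch rather than the patent's exact confidence measure. Matching then reduces to walking connected arc paths that spell the queried sequence:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Arc:
        start: int        # start frame of the candidate syllable
        end: int          # end frame of the candidate syllable
        syllable: str     # candidate pinyin, e.g. "bei3"
        posterior: float  # arc posterior from the forward-backward pass

    def search_keyword(arcs: List[Arc], query: List[str],
                       threshold: float = 0.3) -> List[Tuple[int, int, float]]:
        """Return (start_frame, end_frame, confidence) for each occurrence of `query`."""
        hits: List[Tuple[int, int, float]] = []

        def extend(arc: Arc, depth: int, conf: float, origin: int) -> None:
            conf *= arc.posterior
            if depth == len(query) - 1:
                if conf >= threshold:
                    hits.append((origin, arc.end, conf))
                return
            for nxt in arcs:  # follow arcs adjacent in time carrying the next syllable
                if nxt.start == arc.end and nxt.syllable == query[depth + 1]:
                    extend(nxt, depth + 1, conf, origin)

        for a in arcs:
            if a.syllable == query[0]:
                extend(a, 0, 1.0, a.start)
        return hits

    # Example: two connected arcs spell "bei3 jing1" with confidence 0.8 * 0.7 = 0.56.
    lattice = [Arc(0, 10, "bei3", 0.8), Arc(10, 20, "jing1", 0.7), Arc(10, 20, "jin1", 0.2)]
    print(search_keyword(lattice, ["bei3", "jing1"]))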

Claims (19)

1. A content-based audio analysis system for satisfying different content-based retrieval demands through analysis of an audio signal, the system comprising:
an audio stream acquisition module, for obtaining an audio stream from an external audio source according to a given decoding rule;
an audio stream segmentation module, for segmenting the audio stream obtained from the audio stream acquisition module so that each part after segmentation has a single acoustic characteristic; the audio stream segmentation module comprising: a silence point detection submodule, which detects silence points in the audio stream in order to segment the audio stream; an audio characteristic change point detection submodule, which detects audio characteristic change points in the audio stream in order to segment the audio stream; and a breakpoint confidence judgment submodule, which judges the reasonableness of breakpoints by the BIC criterion and removes unreasonable breakpoints in order to segment the audio stream;
an audio stream type identification module, for analyzing the single-characteristic audio streams output by the audio stream segmentation module and determining their acoustic characteristic;
a speech stream analysis module, for recognizing the audio streams whose acoustic characteristic the audio stream type identification module has determined to be speech and obtaining a pinyin graph;
a pinyin-sequence-based keyword retrieval module, for searching the pinyin graph obtained by the speech stream analysis module for a keyword of interest and obtaining the position of that keyword in the audio stream.
2. The content-based audio analysis system of claim 1, wherein the external audio source in the audio stream acquisition module is one of an audio file, a video file, and an audio input device.
3. The content-based audio analysis system of claim 1, wherein the sampling rate of the external audio source in the audio stream acquisition module ranges from 8000 Hz to 44100 Hz.
4. The content-based audio analysis system of claim 1, wherein the sampling resolution of the external audio source in the audio stream acquisition module ranges from 8 bits to 24 bits.
5. The content-based audio analysis system of claim 1, wherein the given decoding rule in the audio stream acquisition module comprises decoding rules for asf/wma/wmv/avi/wav/mpeg/mp3/aiff/pcm/raw/vox files.
6. The content-based audio analysis system of claim 1, wherein the audio stream in the audio stream acquisition module is raw-format data.
7. The content-based audio analysis system of claim 1, wherein the silence points of the audio signal in the silence point detection submodule are obtained by computing the energy of the audio signal: when the energy of the audio signal falls below a threshold, a silence point in the audio stream is found.
8. The content-based audio analysis system of claim 7, wherein the threshold on the energy of the audio signal in the silence point detection submodule is obtained by estimating the energy of a complete audio signal fragment.
9. The content-based audio analysis system of claim 1, wherein the silence points of the audio signal in the silence point detection submodule may also be obtained by computing the variance of the audio signal energy, specifically:
when the variance of the energy values of the audio signal falls below a threshold, a silence point is determined to exist in the audio stream; after the silence point has been found, the variance of the energy values continues to be computed, and when that variance rises above the threshold, the end of the silent section in the audio stream is determined, completing the detection of the silence point.
10. The content-based audio analysis system of claim 9, wherein the threshold on the variance of the audio signal energy in the silence point detection submodule is obtained by estimating the variance of the energy of a complete audio signal fragment.
11. The content-based audio analysis system of claim 1, wherein the audio characteristic change points in the audio characteristic change point detection submodule are obtained by taking adjacent audio fragments from the audio stream and computing the dissimilarity of their audio feature vector sequences; a change point is determined when the dissimilarity reaches a threshold.
12. The content-based audio analysis system of claim 11, wherein the feature vector sequence of an input audio section in the audio characteristic change point detection submodule is obtained by extracting audio characteristic parameters from the input audio section.
13. The content-based audio analysis system of claim 1, wherein the audio stream type identification module comprises two submodules:
a time-domain analysis submodule, which classifies the audio stream by analyzing time-domain features of the audio;
a frequency-domain analysis submodule, which classifies the audio stream by analyzing frequency-domain features of the audio.
14. The content-based audio analysis system of claim 13, wherein the time-domain features of the audio in the time-domain analysis submodule comprise one or more of zero-crossing rate, short-time energy, short-time energy standard deviation, silent-frame ratio, and sub-band energy distribution.
15. The content-based audio analysis system of claim 13, wherein the frequency-domain features of the audio in the frequency-domain analysis submodule comprise one or both of linear prediction cepstral coefficients and Mel cepstral coefficients.
16. The content-based audio analysis system of claim 1, wherein the speech stream analysis module comprises three submodules:
a feature vector extraction submodule, which divides the speech stream into frames and extracts speech characteristic parameters to obtain the feature vector sequence of the speech stream;
a model matching submodule, which matches the feature vector sequence against the acoustic models of the pinyin syllables, then also computes the matching distances of the pinyin sequences using the statistical dependencies between pinyin syllables and sorts the matching distances, obtaining a pinyin graph composed of many candidate syllables, i.e. the coarse pinyin graph;
a model revision submodule, which applies adaptive correction and smoothing to the coarse pinyin graph to obtain the refined pinyin graph.
17. The content-based audio analysis system of claim 1, wherein the pinyin-sequence-based keyword retrieval module performs a confidence computation on the pinyin sequence corresponding to the query term.
18. The content-based audio analysis system of claim 17, wherein the pinyin-sequence-based keyword retrieval module computes the posterior probability of the pinyin string with the forward-backward algorithm and uses it to determine the confidence of the query term.
19. The content-based audio analysis system of claim 17, wherein the pinyin-sequence-based keyword retrieval module further determines different confidence thresholds according to different application needs.
CNB2006101408314A 2006-10-11 2006-10-11 Audio analysis system based on content Expired - Fee Related CN100461179C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101408314A CN100461179C (en) 2006-10-11 2006-10-11 Audio analysis system based on content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101408314A CN100461179C (en) 2006-10-11 2006-10-11 Audio analysis system based on content

Publications (2)

Publication Number Publication Date
CN101021854A CN101021854A (en) 2007-08-22
CN100461179C true CN100461179C (en) 2009-02-11

Family

ID=38709622

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101408314A Expired - Fee Related CN100461179C (en) 2006-10-11 2006-10-11 Audio analysis system based on content

Country Status (1)

Country Link
CN (1) CN100461179C (en)


Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101159834B (en) * 2007-10-25 2012-01-11 中国科学院计算技术研究所 Method and system for detecting repeatable video and audio program fragment
KR100999655B1 (en) * 2009-05-18 2010-12-13 윤재민 Digital video recorder system and application method thereof
CN101566999B (en) * 2009-06-02 2010-11-17 哈尔滨工业大学 A quick audio retrieval method
CN102073635B (en) * 2009-10-30 2015-08-26 索尼株式会社 Program endpoint time detection apparatus and method and programme information searching system
CN102591892A (en) * 2011-01-13 2012-07-18 索尼公司 Data segmenting device and method
CN102841932A (en) * 2012-08-06 2012-12-26 河海大学 Content-based voice frequency semantic feature similarity comparative method
CN103247293B (en) * 2013-05-14 2015-04-08 中国科学院自动化研究所 Coding method and decoding method for voice data
CN103414948A (en) * 2013-08-01 2013-11-27 王强 Method and device for playing video
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN104183245A * 2014-09-04 2014-12-03 福建星网视易信息系统有限公司 Method and device for recommending music stars with tones similar to those of singers
CN104464754A (en) * 2014-12-11 2015-03-25 北京中细软移动互联科技有限公司 Sound brand search method
CN104464726B (en) * 2014-12-30 2017-10-27 北京奇艺世纪科技有限公司 A kind of determination method and device of similar audio
CN104866604B (en) * 2015-06-01 2018-10-30 腾讯科技(北京)有限公司 A kind of information processing method and server
CN106601243B (en) * 2015-10-20 2020-11-06 阿里巴巴集团控股有限公司 Video file identification method and device
CN105898556A (en) * 2015-12-30 2016-08-24 乐视致新电子科技(天津)有限公司 Plug-in subtitle automatic synchronization method and device
CN105719642A (en) * 2016-02-29 2016-06-29 黄博 Continuous and long voice recognition method and system and hardware equipment
CN109545190B (en) * 2018-12-29 2021-06-29 联动优势科技有限公司 Speech recognition method based on keywords
CN113205800B (en) * 2021-04-22 2024-03-01 京东科技控股股份有限公司 Audio identification method, device, computer equipment and storage medium
CN113506584B (en) * 2021-07-06 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Data processing method and device
CN115359809B (en) * 2022-08-24 2024-04-19 济南大学 Self-adaptive second-order segmentation method and system for long-term emotion voice

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1270361A (en) * 1999-04-09 2000-10-18 国际商业机器公司 Method and device for audio information searching by content and loudspeaker information
JP2001344251A (en) * 2000-06-05 2001-12-14 Nippon Telegr & Teleph Corp <Ntt> Method and device for contents retrieval, and recording medium with contents retrieval program recorded thereon
US20030084037A1 (en) * 2001-10-31 2003-05-01 Kabushiki Kaisha Toshiba Search server and contents providing system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
基于拼音图的两阶段关键词检索系统 [A two-stage keyword retrieval system based on pinyin graphs]. 罗骏, 欧智坚, 王作英. 清华大学学报(自然科学版), Vol. 45, No. 10, 2005. *
基于段长分布的HMM语音识别模型 [An HMM speech recognition model based on segment-length distribution]. 王作英, 肖熙. 电子学报, Vol. 32, No. 1, 2004. *
多功能语音/音频信息检索系统的研究与实现 [Research and implementation of a multifunctional speech/audio information retrieval system]. 欧智坚, 罗骏, 谢达东, 赵贤宇, 林晖, 王作英. 全国网络与信息安全技术研讨会'2004论文集, 2004. *
多路并行的语音识别引擎的设计与实现 [Design and implementation of a multi-channel parallel speech recognition engine]. 吉鸿雁, 刘鹏, 吴及, 王作英. 计算机应用, Vol. 23, No. 8, 2003. *
实时语音识别系统语言层的改进 [Improvements to the language layer of a real-time speech recognition system]. 鄢翔, 王作英. 计算机工程与应用, No. 19, 2002. *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI416367B (en) * 2009-12-16 2013-11-21 Hon Hai Prec Ind Co Ltd Electronic device and method of audio data copyright protection thereof

Also Published As

Publication number Publication date
CN101021854A (en) 2007-08-22

Similar Documents

Publication Publication Date Title
CN100461179C (en) Audio analysis system based on content
CN107480152A (en) A kind of audio analysis and search method and system
EP1043665A2 (en) Methods and apparatus for retrieving audio information using content and speaker information
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
EP1803072A1 (en) Computer implemented method for indexing and retrieving documents stored in a database and system for indexing and retrieving documents
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
CN108831506B (en) GMM-BIC-based digital audio tamper point detection method and system
CN103164403A (en) Generation method of video indexing data and system
CN110689906A (en) Law enforcement detection method and system based on voice processing technology
Manchala et al. GMM based language identification system using robust features
Gandhe et al. Using web text to improve keyword spotting in speech
CN115148211A (en) Audio sensitive content detection method, computer device and computer program product
CA2596126A1 (en) Speech recognition by statistical language using square-root discounting
Sangeetha et al. A novel spoken keyword spotting system using support vector machine
Ramabhadran et al. Fast decoding for open vocabulary spoken term detection
Chandra et al. Keyword spotting: an audio mining technique in speech processing–a survey
Wang Mandarin spoken document retrieval based on syllable lattice matching
Tarján et al. A bilingual study on the prediction of morph-based improvement.
Mubarak et al. Novel features for effective speech and music discrimination
Nouza et al. Continual on-line monitoring of Czech spoken broadcast programs
Rebai et al. Spoken keyword search system using improved ASR engine and novel template-based keyword scoring
Huang et al. Sports audio segmentation and classification
Hsiao et al. The CMU-interACT 2008 Mandarin transcription system.
Jin et al. On continuous speech recognition of Indian English
Charhad et al. Speaker identity indexing in audio-visual documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: BEIJING NUFRONT NETWORK TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: BAO DONGSHAN

Effective date: 20071214

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20071214

Address after: A, building 16, building 1, building 8, Qinghua science park, No. 100084, Zhongguancun East Road, Beijing, Haidian District, China

Applicant after: Beijing Nufront Software Technology Co., Ltd.

Address before: A, building 16, building 1, building 8, Qinghua science park, No. 100084, Zhongguancun East Road, Beijing, Haidian District, China

Applicant before: Bao Dongshan

C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090211

Termination date: 20131011