CN103700370A - Broadcast television voice recognition method and system - Google Patents

Broadcast television voice recognition method and system

Info

Publication number
CN103700370A
CN103700370A CN103700370B CN201310648375.4A
Authority
CN
China
Prior art keywords
voice
identification
sign
data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310648375.4A
Other languages
Chinese (zh)
Other versions
CN103700370B (en)
Inventor
陈鑫玮
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING PATTEK Co Ltd
Original Assignee
BEIJING PATTEK Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING PATTEK Co Ltd filed Critical BEIJING PATTEK Co Ltd
Priority to CN201310648375.4A priority Critical patent/CN103700370B/en
Publication of CN103700370A publication Critical patent/CN103700370A/en
Application granted
Publication of CN103700370B publication Critical patent/CN103700370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a broadcast television voice recognition method and system. The method comprises the following steps: extracting audio data from broadcast television data; pre-processing the audio data to obtain feature text data; transmitting the feature text data to a cloud server for recognition to obtain male/female voice recognition, speaker recognition and speech recognition results; and fusing these results with structured text markup to generate a structured speech recognition result. The method improves on conventional speech recognition by integrating several broadcast television data pre-processing technologies with broadcast television speech recognition, so that speech data is recognized in a way that meets the data-processing requirements of the broadcast television industry. The various recognition results are fused into a structured speech recognition result that can provide basic data for the intelligent processing of other subsequent broadcast television services, increasing processing speed and improving accuracy.

Description

Broadcast television speech recognition method and system
Technical field
The present invention relates to the field of audio and video processing technology, and in particular to a broadcast television speech recognition method and system.
Background art
At present, speech recognition in the broadcast television field mainly relies on conventional speech recognition methods designed for general industry use. Traditional speech recognition mainly adopts pattern matching and is divided into a training stage and a recognition stage. In the training stage, the user reads each word or phrase in the vocabulary in turn, and its feature vector is stored in a template library as a template. In the recognition stage, the feature vector of the input speech is compared for similarity against each template in the template library, and the template with the highest similarity is output as the recognition result.
However, applying such speech recognition in the broadcast television field suffers from the following problems:
1) The broadcast television industry has processing and operational requirements for speech recognition that differ from those of other industries. Because the traditional speech recognition described above is designed for general use across industries, it is not tailored to broadcast television and cannot filter out non-speech content from broadcast television data according to the characteristics of the industry. Non-speech content falls outside the scope of speech recognition in this industry, so if it is not filtered out it must still be transmitted and processed, wasting transmission and computation resources; its presence also causes additional recognition errors and reduces processing speed.
2) Because traditional speech recognition technology lacks recognition functions specific to the broadcast television industry, its results are insufficiently complete. For a segment of broadcast television data, for example, it cannot determine important information such as the scene in which the speech occurs or the identity of the speaker, cannot segment the speech content by speaker, and cannot identify the timestamp of each spoken word, and therefore cannot provide any valuable reference information for the intelligent, automated processing of other subsequent broadcast television services.
In summary, applying traditional speech recognition methods in the broadcast television industry leads to problems such as wasted resources, slow processing, low accuracy, and insufficient information in the output.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to perform speech recognition tailored to the characteristics of the broadcast television industry, avoiding the shortcomings of conventional speech recognition methods in broadcast television applications, and providing sufficient usable basic data for the intelligent, automated processing of other subsequent broadcast television services.
(2) Technical solution
To solve the above technical problem, the invention provides a broadcast television speech recognition method, comprising:
S1, extracting audio data from broadcast television data;
S2, pre-processing the audio data to obtain feature text data;
S3, sending the feature text data to a cloud server for recognition processing to obtain male/female voice recognition, speaker recognition and speech recognition results;
S4, fusing the data pre-processing, male/female voice recognition, speaker recognition and speech recognition results with structured text markup to generate a structured speech recognition result.
Further, in step S2 the pre-processing of the audio data specifically comprises:
S21, segmenting and fragmenting the audio data to generate a number of sentence files;
S22, applying non-speech filtering to the sentence files, retaining only the speech sentence files;
S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files identified as wideband signals and a narrowband flag to files identified as narrowband signals;
S24, performing audio feature extraction on the flagged speech sentence files to obtain feature text data, wherein the feature text data comprises the start and end times of each speech sentence, its phonetic feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband flag.
Further, in step S3 the recognition processing performed by the cloud server comprises male/female voice recognition, speaker recognition, speech content recognition and punctuation mark recognition, generating a speech recognition result containing the corresponding markers.
Further, in step S4 the fusion and structured text markup of the speech recognition results specifically comprises:
S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain;
S42, marking up the sorted recognition results in a structured format, including the speaker gender marker, speaker marker, speech content, punctuation marks and timestamps.
Further, the recognition processing in step S3 is performed with reference to a language model library, which is continuously updated through network text collection and network text learning.
To solve the above technical problem, the invention also provides a broadcast television speech recognition system, comprising:
an extraction unit, which extracts audio data from broadcast television data;
a pre-processing terminal, which pre-processes the audio data to obtain feature text data and sends it to a cloud server;
a cloud server, which performs recognition processing on the feature text data to obtain speech recognition results, and fuses these results with structured text markup to generate a structured speech recognition result.
Further, the pre-processing terminal comprises:
a segmentation module, which segments and fragments the audio data to generate a number of sentence files;
a non-speech filtering module, which applies non-speech filtering to the sentence files, retaining only the speech sentence files;
a wideband/narrowband discrimination module, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files identified as wideband signals and a narrowband flag to files identified as narrowband signals;
an audio feature extraction module, which performs audio feature extraction on the flagged speech sentence files to obtain feature text data, wherein the feature text data comprises the start and end times of each speech sentence, the name of the audio/video file to which it belongs, and the corresponding wideband/narrowband flag.
Further, the cloud server comprises:
a male/female voice recognition module, for performing male/female voice recognition on the feature text data;
a speaker recognition module, for performing speaker recognition on the feature text;
a speech content and punctuation recognition module, for performing speech content recognition and punctuation mark recognition on the feature text and generating a speech recognition result containing punctuation markers;
a recognition result processing module, which fuses the speech recognition results with structured text markup to generate a structured speech recognition result.
Further, the recognition result processing module further comprises:
a collection and sorting module, for collecting and aligning the individual recognition results and sorting them by the start and end times they contain;
a markup module, for marking up the sorted recognition results in a structured format, including the speaker gender marker, speaker marker, speech content, punctuation marks and timestamps.
Further, the cloud server also comprises a language model intelligent learning module, which regularly collects network text, updates the language model library through learning from that text, and makes the regularly updated language model library available to the recognition process.
(3) Beneficial effects
The embodiments of the present invention provide a broadcast television speech recognition method and system, wherein the method comprises: extracting audio data from broadcast television data; pre-processing the audio data to obtain feature text data; sending the feature text data to a cloud server for recognition processing to obtain male/female voice recognition, speaker recognition and speech recognition results; and fusing the data pre-processing, male/female voice recognition, speaker recognition and speech recognition results with structured text markup to generate a structured speech recognition result. Based on cloud computing, the method improves on existing speech recognition by integrating broadcast television data pre-processing technology, male/female voice recognition technology, speaker recognition technology and broadcast television speech recognition. The speech data is pre-processed specifically for the data-processing requirements of the broadcast television industry before recognition, and the pre-processing result, male/female voice recognition result, speaker recognition result and speech recognition result are fused with structured text markup to generate a structured speech recognition result. This result can provide basic data for later intelligent processing functions such as speech retrieval, subtitle recognition and presenter identification in broadcast television programs, and can both accelerate broadcast television speech recognition and improve its accuracy.
The basic data provided for the intelligent, automated processing of other subsequent broadcast television services specifically includes the following:
1) the speech recognition results and the timestamp markers of each spoken word can provide basic data for broadcast television speech content retrieval services;
2) the segmentation time-point markers of the speech sentences, together with the wideband/narrowband discrimination results, can provide boundary time-point references for splitting broadcast television programs;
3) the recognition of speech content and punctuation marks in broadcast television can provide a content reference for subtitle recognition in broadcast television programs;
4) the speaker recognition and wideband/narrowband discrimination results for each speech sentence can provide a basis for presenter identification, guest identification and speaking-scene recognition (indoor or outdoor scenes) in broadcast television programs.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of a broadcast television speech recognition method provided by embodiment one of the present invention;
Fig. 2 is a flow chart of the steps of the pre-processing operation provided by embodiment one;
Fig. 3 is a schematic diagram of the technical framework of the audio classification method used in the speech/non-speech discrimination process of embodiment one;
Fig. 4 is a detailed flow chart of speech recognition on broadcast television data as provided by embodiment one;
Fig. 5 is a schematic diagram of the composition of a broadcast television speech recognition system provided by embodiment two;
Fig. 6 is a schematic diagram of the composition of the pre-processing terminal provided by embodiment two;
Fig. 7 is a schematic diagram of the composition of the cloud server provided by embodiment two;
Fig. 8 is a workflow diagram of the speech content and punctuation mark recognition module provided by embodiment two;
Fig. 9 is a schematic diagram of the cloud service platform architecture provided by embodiment two.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below in conjunction with the drawings and examples. The following examples serve to illustrate the invention but not to limit its scope.
Embodiment one
Embodiment one of the present invention provides a broadcast television speech recognition method whose step flow is shown in Fig. 1 and specifically comprises the following steps:
Step S1, extracting audio data from broadcast television data.
Step S2, pre-processing the audio data to obtain feature text data.
Step S3, sending the feature text data to a cloud server for recognition processing to obtain male/female voice recognition, speaker recognition and speech recognition results.
Step S4, fusing the data pre-processing, male/female voice recognition, speaker recognition and speech recognition results with structured text markup to generate a structured speech recognition result.
The above method first extracts audio data from the broadcast television data (that is, audio/video data) to be recognized as provided by the user, pre-processes it to obtain feature text data, and has the cloud server perform recognition processing on it; finally the resulting data pre-processing, male/female voice recognition, speaker recognition and speech recognition results are fused with structured text markup, generating a structured speech recognition result that is returned to the user in the Extensible Markup Language (XML) format. Adding markers such as per-word timestamps, sentence timestamps, male/female voice labels and speaker labels to the recognition result can provide a basis for broadcast television speech content retrieval, subtitle recognition, presenter identification and so on, facilitating the intelligent, automated processing of other subsequent broadcast television services and providing basic data for various operations and processing.
Preferably, before step S1 the present embodiment also comprises: receiving the broadcast television data sent by the user, where this data comprises audio/video data, understood as audio data plus video data. After receiving the broadcast television data, the system first judges whether it is of an audio/video data type supported by the speech recognition system; if it is not a supported, recognizable type, processing is refused.
Audio/video decoding in this embodiment adopts the G.711 codec standard and uses the ffmpeg decoding tool to decode the audio and video, extracting the audio portion and saving it in PCM format. It is compatible with the current mainstream broadcast television audio/video formats, for example wmv, wma, wav, mp3, asf, rm, mp4, avi and flv. If the data is judged to be recognizable audio/video data, it is decoded, the audio portion is extracted from it, and the resulting audio data becomes the pending data for step S2.
Preferably, in step S2 of this embodiment the pre-processing of the audio data mainly comprises segmenting and fragmenting it according to standards suitable for speech recognition, performing speech/non-speech and wideband/narrowband discrimination and flagging on the fragmented sentence files, and finally extracting the feature text data containing the phonetic features. The step flow of the pre-processing operation is shown in Fig. 2 and specifically comprises the following steps:
Step S21, segmenting and fragmenting the audio data to generate a number of sentence files.
Because the received audio data is a relatively complete data block, it must be segmented and fragmented to generate a number of small sentence files suitable for processing by the speech recognition system. The specific segmentation process is as follows:
First, the audio data is parsed and the energy value of each audio sample point is analyzed to find the silent positions; in this embodiment, 50 frames with 200 sample points per frame are used as the silence-detection window, and when a point crosses the silence threshold it is taken as a silent position. After the silent positions are found, the audio data is segmented at them, fragmentation generates discrete sentence files, each sentence file is stamped with a time marker, and the resulting sentence files are saved in PCM format.
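The segmentation step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame size (200 samples) and window (50 frames) follow the embodiment, but the numeric energy threshold is an assumption, since the patent does not state one.

```python
# Hypothetical sketch of energy-based silence segmentation.
import numpy as np

FRAME = 200              # samples per frame (from the embodiment)
WINDOW = 50              # frames per analysis window (from the embodiment)
ENERGY_THRESHOLD = 1e-3  # assumed normalized-energy threshold

def find_silent_positions(samples: np.ndarray) -> list[int]:
    """Return sample offsets of windows whose mean energy falls below threshold."""
    silent = []
    win_len = FRAME * WINDOW
    for start in range(0, len(samples) - win_len + 1, win_len):
        window = samples[start:start + win_len]
        energy = float(np.mean(window.astype(np.float64) ** 2))
        if energy < ENERGY_THRESHOLD:
            silent.append(start)
    return silent

def segment_at_silence(samples: np.ndarray) -> list[np.ndarray]:
    """Cut the audio into sentence fragments at the detected silent positions."""
    cuts = sorted({0, *find_silent_positions(samples), len(samples)})
    return [samples[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]
```

In practice each returned fragment would then be time-stamped and written out as a PCM sentence file, as the embodiment describes.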
Step S22, applying non-speech filtering to the sentence files, retaining only the speech sentence files.
Because step S21 only segments the audio data at silent positions, the result still contains a large amount of non-speech content. This content is of no help to subsequent audio recognition; on the contrary, its transmission and processing increase the load on the speech recognition system and can cause recognition errors. Therefore the generated sentence files must be filtered for non-speech: the fragmented sentence files undergo speech/non-speech discrimination and only the speech sentence files are retained. This step is specifically as follows:
First, each fragmented sentence file is parsed and, according to a speech/non-speech classification model, a classifier discriminates speech from non-speech for each sentence file;
Secondly, according to the discrimination result, the non-speech sentence files are marked for deletion and the time positions of the sentences are recorded.
This embodiment uses an audio classification method based on a support vector machine (SVM). It first divides short sentences into silent and non-silent segments based on an energy threshold, and then, by selecting effective and robust audio features, divides the non-silent signal into four classes: speech (pure speech, non-pure speech) and non-speech (music, ambient sound). The method achieves high classification accuracy and processing speed; the technical framework of this audio classification method is shown in Fig. 3.
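A toy version of such an SVM audio classifier can be sketched with scikit-learn. The two summary features used here (mean energy and zero-crossing rate) and the synthetic training clips are illustrative assumptions; the patent says only that "effective and robust audio features" are selected.

```python
# Hypothetical SVM speech/non-speech classifier on toy data.
import numpy as np
from sklearn.svm import SVC

def clip_features(samples: np.ndarray) -> list[float]:
    """Summarize a clip as [mean energy, zero-crossing rate]."""
    x = samples.astype(np.float64)
    energy = float(np.mean(x ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(x))) > 0))
    return [energy, zcr]

# Toy training data: tonal low-frequency "speech" vs. broadband "non-speech".
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
speech_like = [np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(8000)
               for _ in range(10)]
noise_like = [rng.standard_normal(8000) for _ in range(10)]

X = [clip_features(c) for c in speech_like + noise_like]
y = [1] * 10 + [0] * 10          # 1 = speech, 0 = non-speech
clf = SVC(kernel="rbf").fit(X, y)
```

A production system would extend this to the four-class scheme (pure speech, non-pure speech, music, ambient sound) with a richer feature set.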
Step S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files identified as wideband signals and a narrowband flag to files identified as narrowband signals.
Each speech sentence undergoes wideband/narrowband discrimination so that the discrimination result can serve as a reference when choosing the speech recognition model in subsequent recognition. This step is specifically as follows:
First, the speech sentence fragments remaining after filtering are analyzed one by one to determine whether each is wideband (high sampling rate) or narrowband (low sampling rate), the result providing a reference for the choice of speech recognition model in subsequent recognition;
Secondly, each speech sentence is flagged for bandwidth: wideband-signal sentence files receive a wideband flag and narrowband-signal sentence files receive a narrowband flag.
Specifically, in this embodiment the wideband/narrowband discrimination analyzes the spectral energy of the audio signal: when the proportion of spectral energy above 8 kHz is greater than 0.1 the signal is wideband, and when it is less than or equal to 0.1 the signal is narrowband.
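The spectral-energy test above can be sketched directly with an FFT. The 8 kHz cutoff and 0.1 ratio come from the embodiment; the function name and the exact normalization are illustrative assumptions.

```python
# Hypothetical wideband/narrowband test via spectral energy ratio.
import numpy as np

def is_wideband(samples: np.ndarray, sample_rate: int,
                cutoff_hz: float = 8000.0) -> bool:
    """True if more than 10% of the spectral energy lies above cutoff_hz."""
    spectrum = np.abs(np.fft.rfft(samples.astype(np.float64))) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return False
    high = spectrum[freqs > cutoff_hz].sum()
    return (high / total) > 0.1
```

Note that the test is only meaningful when the file's sampling rate exceeds 16 kHz, since content above 8 kHz cannot exist otherwise.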
Step S24, performing audio feature extraction on the speech sentence files flagged as wideband or narrowband, obtaining feature text data that comprises the start and end times of each speech sentence, its phonetic feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband flag.
To save network bandwidth, after the speech sentence files are flagged, audio features are extracted and the audio data is converted into textual feature data to reduce the volume of network transmission, specifically as follows:
First, the flagged speech sentence files are analyzed one by one to extract MFCC (Mel-frequency cepstral coefficient) and PLP (perceptual linear prediction) features, two phonetic features in common use in the speech recognition field;
Secondly, each extracted feature sequence is time-marked, so that the final feature text data contains the start and end times of the speech sentence, the name of the audio/video file it belongs to, and the corresponding wideband/narrowband flag.
It should be noted that this step not only converts the input speech signal into robust, discriminative phonetic features capable of distinguishing different speakers, but also applies a degree of normalization on top of the feature extraction, comprising:
1) cepstral mean normalization (CMN), which mainly reduces channel effects;
2) cepstral variance normalization (CVN), which mainly reduces the impact of additive noise;
3) vocal tract length normalization (VTLN), which mainly reduces the impact of vocal tract differences;
4) Gaussianization, an extension of CMN+CVN;
5) anti-noise algorithms, which reduce the impact of background noise on system performance, using the AWF and VTS algorithms.
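The first two normalizations in the list above (CMN and CVN) can be sketched on a per-utterance feature matrix. This is a generic formulation under the assumption of per-utterance statistics; the patent does not specify which variant is used.

```python
# Hypothetical CMN + CVN over a (frames x coefficients) feature matrix, e.g. MFCCs.
import numpy as np

def cmn_cvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Subtract the per-coefficient mean (CMN), then divide by the std (CVN)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

After this step each cepstral coefficient has roughly zero mean and unit variance across the utterance, which is what removes the constant channel offset and scales away additive-noise variance.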
Preferably, in step S3 of this embodiment the feature text data is sent to the cloud server and enters the speech recognition flow. The cloud server invocation module in this embodiment adopts the Web Service interface protocol, and the broadcast television task information to be recognized is sent to the server side as an XML message for speech recognition. The XML message for a recognition task comprises the following content:
1) the name of the broadcast television file to be recognized;
2) the list of fragmented sentence files;
3) the speech/non-speech flag of each sentence file;
4) the wideband/narrowband flag of each sentence file;
5) the phonetic feature text of each sentence file identified as speech;
6) the start and end time markers of each sentence file.
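A message carrying the six items above might be assembled as follows with Python's standard library. All element and attribute names here are assumptions for illustration; the patent does not define an XML schema.

```python
# Hypothetical construction of the recognition-task XML message.
import xml.etree.ElementTree as ET

def build_task_message(file_name: str, sentences: list[dict]) -> bytes:
    task = ET.Element("recognition_task")
    ET.SubElement(task, "file_name").text = file_name
    sent_list = ET.SubElement(task, "sentences")
    for s in sentences:
        node = ET.SubElement(sent_list, "sentence", {
            "speech": "1" if s["is_speech"] else "0",  # speech/non-speech flag
            "band": s["band"],                         # "wide" or "narrow"
            "start": str(s["start"]),                  # start time marker
            "end": str(s["end"]),                      # end time marker
        })
        if s["is_speech"]:
            node.text = s["features"]                  # phonetic feature text
    return ET.tostring(task, encoding="utf-8")
```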
After the cloud server receives the recognition task, its recognition processing comprises male/female voice recognition, speaker recognition, speech content recognition and punctuation mark recognition, generating a speech recognition result containing markers. This step is specifically as follows:
(1) The phonetic feature text corresponding to each speech sentence file to be recognized is sent, one by one, as XML (Extensible Markup Language) messages to the remote server for broadcast television speech recognition processing; besides the phonetic feature text data, each XML message also contains the start and end times of the speech sentence file, the name of the broadcast television audio/video file to which it belongs, and its wideband/narrowband flag;
(2) The speech recognition system in the cloud server is built on a cloud computing framework; when the feature text of a speech sentence is sent to the broadcast television speech recognition cloud, a controller allocates computational resources for the recognition of that speech sentence file according to the current resource usage in the cloud server;
(3) The speech recognition system invokes the allocated computational resources to perform male/female voice recognition, speaker recognition, and speech content and punctuation recognition on the phonetic features. Male/female voice recognition uses a male/female voice classification model, with a classifier discriminating and flagging each sentence; speaker recognition identifies and flags the speaker of each sentence according to a speaker model library; speech content and punctuation recognition recognizes the content of each sentence, marking punctuation as it goes and time-labeling each recognized word.
Preferably, in step S4 of this embodiment the fusion and structured text markup of the speech recognition results specifically comprises:
Step S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain. Specifically, the recognition results of each speech sentence are merged, collected and arranged by the broadcast television audio/video file to which they belong, and the different recognition results for each sentence (male/female voice recognition, speaker recognition, speech content and punctuation recognition) are aligned by time point and sorted in time order.
Step S42, marking up the sorted recognition results in a structured format, including the speaker gender marker, speaker marker, speech content, punctuation marks and timestamps. Specifically, the sorted recognition results are marked up as text in a specific structured format, the markup content comprising, for each sentence file, the speaker's gender, the speaker, the speech content of the sentence, the timestamp of each spoken word in the sentence, and the punctuation at the sentence breaks.
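The align-and-merge logic of step S41 can be sketched as a join on the sentence's file and time span, producing one structured record per sentence. The record field names are illustrative assumptions, not the patent's format.

```python
# Hypothetical fusion of per-sentence results from the different recognizers.
from collections import defaultdict

def fuse_results(*result_streams: list) -> list:
    """Merge per-sentence results keyed by (file, start, end), sorted by time."""
    merged = defaultdict(dict)
    for stream in result_streams:
        for r in stream:
            key = (r["file"], r["start"], r["end"])
            merged[key].update(r)
    return sorted(merged.values(), key=lambda r: (r["file"], r["start"]))

# One sentence seen by three recognizers:
gender = [{"file": "a.mp4", "start": 0.0, "end": 2.0, "gender": "F"}]
speaker = [{"file": "a.mp4", "start": 0.0, "end": 2.0, "speaker": "host"}]
content = [{"file": "a.mp4", "start": 0.0, "end": 2.0, "text": "hello."}]
fused = fuse_results(gender, speaker, content)
```

Step S42 would then serialize each fused record into the structured markup (gender, speaker, content, punctuation, timestamps).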
Finally the structured speech recognition result is generated and fed back to the user in the form of an XML message, which comprises the following content:
1) the name of the recognized broadcast television file;
2) the list of fragmented sentence files;
3) the speech/non-speech flag of each sentence file;
4) the wideband/narrowband flag of each sentence file;
5) the speech recognition result of each sentence file;
6) the speaker marker of each sentence file;
7) the male/female voice marker of each sentence file;
8) the start and end time markers of each sentence file.
Preferably, to ensure the accuracy of speech recognition, the recognition processing in step S3 of this embodiment is performed with reference to an acoustic model library and a language model library, where the language model library is continuously updated through the collection of and learning from network text. Network text is regularly collected over the internet, and the language model library is regularly optimized by learning from it, specifically as follows:
1) Network text is regularly collected from the internet: a web crawler regularly captures web links from the major search engines (such as Baidu, Google, Soso, Sogou, Soku, etc.) and the portal websites related to broadcast television (such as CCTV.com, local network platforms, Sina, Sohu, etc.), collecting popular vocabulary and network articles.
2) The collected network articles are segmented into words, and word frequencies and word counts are tallied; the segmentation results, the collected popular network vocabulary and the statistics are entered into the language model library of the speech recognition system for reference by each recognition module, achieving the regular updating of the language model library and ensuring the accuracy of broadcast television speech recognition.
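The word-frequency tally in point 2) can be sketched with the standard library. Real Chinese text would require a dedicated word segmenter; whitespace tokenization here is a simplifying assumption for illustration.

```python
# Hypothetical word-frequency statistics for the language-model update.
from collections import Counter

def update_word_counts(counts: Counter, articles: list) -> Counter:
    """Tally token frequencies from newly collected articles into counts."""
    for article in articles:
        counts.update(article.split())
    return counts

lm_counts = Counter()
update_word_counts(lm_counts, ["breaking news tonight", "news update"])
```

In the described system these counts would feed the n-gram statistics of the language model library on each scheduled update.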
Based on the above, the specific flow in which this embodiment performs speech recognition on broadcast television data is shown in Figure 4 and comprises the following steps:
First, the broadcast television data is received and sent to the pre-processing terminal for audio/video decoding, from which the audio data is extracted. The audio is then segmented and fragmented, and each fragmented sentence file is classified as speech or non-speech: speech files continue to the next step, while non-speech files are marked as such and not processed further. Each speech sentence file then undergoes wideband/narrowband discrimination and speech feature extraction, after which the resulting feature text data is sent to the cloud server as a speech recognition task in the form of an XML message. The cloud service platform on the cloud server performs male/female voice identification, speaker identification, speech content recognition and punctuation mark recognition on the task, fuses the recognition results, and feeds them back to the service platform; meanwhile the language model library on the cloud service platform is regularly updated with new network words, popular vocabulary, etc. learned from the network, guaranteeing speech recognition accuracy. Finally, the cloud server feeds the recognition result, i.e. the structured speech recognition result, back to the user in XML form for further intelligent processing such as reference and retrieval.
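The front-end decisions in this flow (speech/non-speech filtering, wideband/narrowband flagging, feature-task assembly) can be sketched as follows; the 16 kHz sample-rate threshold and the record fields are illustrative assumptions, not from the patent:

```python
def preprocess(segments):
    """Front-end pipeline sketch: keep speech segments, tag their
    bandwidth, and emit one feature-task record per sentence file.

    Each input segment is a dict produced by audio segmentation;
    the `features` field is a placeholder for real acoustic features
    (e.g. MFCCs), which are out of scope for this sketch.
    """
    tasks = []
    for seg in segments:
        if not seg["is_speech"]:       # speech/non-speech discrimination
            continue                   # non-speech: marked, not processed further
        band = "wideband" if seg["sample_rate"] >= 16000 else "narrowband"
        tasks.append({
            "file": seg["file"],
            "band": band,
            "start": seg["start"],
            "end": seg["end"],
            "features": f"feat({seg['file']})",  # placeholder feature handle
        })
    return tasks

tasks = preprocess([
    {"file": "a.ts", "is_speech": True,  "sample_rate": 16000, "start": 0.0, "end": 3.0},
    {"file": "a.ts", "is_speech": False, "sample_rate": 16000, "start": 3.0, "end": 5.0},
    {"file": "a.ts", "is_speech": True,  "sample_rate": 8000,  "start": 5.0, "end": 9.0},
])
```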
The recognition method provided by this embodiment improves on existing speech recognition methods on the basis of cloud computing. It fuses broadcast television data pre-processing technology, male/female voice identification technology, speaker identification technology and a broadcast television speech recognition method: the speech data is pre-processed specifically for the data-processing requirements of the broadcast television industry before recognition, and the pre-processing results, male/female voice identification results, speaker identification results and speech recognition results are fused and marked as structured text to generate a structured speech recognition result. This result can provide basic data for the intelligent, automated processing of subsequent broadcast television services, specifically including the following:
5) the recognition result of the speech and the timestamp marks of each recognized word can provide basic data for broadcast television speech content retrieval services;
6) the segmentation time-point marks of the speech sentences and the wideband/narrowband discrimination results can provide reference boundary time points for splitting broadcast television programs;
7) the recognition of speech content and punctuation marks in broadcast television can provide a content reference for subtitle recognition in broadcast television programs;
8) the speaker identification results of the speech sentences and the wideband/narrowband discrimination results can provide a basis for host identification, guest identification and speaking-scene recognition (indoor scene, outdoor scene) in broadcast television programs.
In addition, processing speed is increased and the method can cope with speech recognition over massive data; and because the language model library is regularly learned and updated, speech recognition accuracy can be improved.
Embodiment 2
Embodiment 2 of the present invention also provides a broadcast television speech recognition system, whose composition is shown schematically in Figure 5. The system comprises:
Extraction unit 10, which extracts audio data from the broadcast television data;
Pre-processing terminal 20, which pre-processes the audio data to obtain feature text data and sends it to the cloud server 30;
Cloud server 30, which performs recognition processing on the feature text data to obtain speech recognition results, then fuses the results and marks them as structured text to generate the structured speech recognition result.
Preferably, the composition of the pre-processing terminal 20 in this embodiment is shown schematically in Figure 6 and specifically comprises:
Segmentation module 21, which segments and fragments the audio data to generate several sentence files;
Non-speech filtering module 22, which filters non-speech out of the sentence files, leaving the speech sentence files;
Wideband/narrowband discrimination module 23, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files judged to be wideband signals and a narrowband flag to files judged to be narrowband signals;
Audio feature extraction module 24, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband flags to obtain feature text data, where the feature text data comprises the start and end time of the speech sentence, the speech feature information, the name of the audio-video file the sentence belongs to, and the corresponding wideband/narrowband flag.
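One common way to realize the wideband/narrowband discrimination performed by module 23 is to compare spectral energy above and below roughly 4 kHz, since narrowband (telephone-quality) audio carries almost no energy there. A minimal sketch using a plain DFT follows; the 4 kHz cutoff and the 10% energy ratio are illustrative thresholds, not taken from the patent:

```python
import math

def band_flag(samples, sample_rate):
    """Wideband/narrowband discrimination sketch: compute a plain DFT
    and compare spectral power above and below a 4 kHz cutoff.
    Narrowband audio has almost no power above the cutoff."""
    n = len(samples)
    low = high = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        re = sum(samples[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(samples[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        power = re * re + im * im
        if freq < 4000.0:
            low += power
        else:
            high += power
    # Illustrative decision rule: call it wideband if the high band
    # holds more than 10% of the low-band energy.
    return "wideband" if high > 0.10 * low else "narrowband"
```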
Preferably, the composition of the cloud server 30 in this embodiment is shown schematically in Figure 7 and specifically comprises:
Male/female voice identification module 31, for performing male/female voice identification on the feature text data.
Physiologically and psychologically, male and female speech differ markedly: for example in the fundamental frequency produced by the vocal cords, the formant frequencies produced by the structure of the vocal tract (laryngopharynx, tongue, palate, lips, teeth, etc.), and the volume and strength of the exhaled airflow. The speech signal therefore carries the speaker's gender characteristics. In this embodiment, male/female voice identification (i.e. speaker gender identification) is built on the GMM-SVM (Gaussian Mixture Model - Support Vector Machine) framework with Total Variability Modeling. Total variability modeling no longer distinguishes the speaker space from the channel space when training the space matrix; both are represented by a single total space, which simplifies the mathematical representation of the space and greatly reduces the dependence on training data. The final gender decision is produced by fusing multiple systems.
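The multi-system fusion that produces the final gender decision can be sketched as a weighted score fusion; the weights and the sign convention are illustrative assumptions, and the GMM-SVM total-variability front end that would produce the subsystem scores is omitted:

```python
def fuse_gender_decisions(subsystem_scores):
    """Multi-system fusion sketch for the final gender decision.

    Each subsystem emits a score in [-1, 1] (positive = male,
    negative = female) with an associated fusion weight; the final
    decision is the sign of the weighted sum. Both the weights and
    the sign convention are illustrative.
    """
    total = sum(weight * score for weight, score in subsystem_scores)
    return "male" if total >= 0 else "female"

# Three hypothetical subsystems with (weight, score) pairs.
decision = fuse_gender_decisions([(0.5, 0.8), (0.3, -0.2), (0.2, 0.4)])
```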
Speaker identification module 32, for performing speaker identification on the feature text.
In this embodiment, speaker identification is based on two classes of differences between speakers: first, the vocal-tract spectral characteristics of different speakers' pronunciation differ inherently, and this difference is reflected in different distributions of the speech features of their pronunciation; second, the high-level features of different speakers differ, formed over time by different living environments and backgrounds, such as idioms, prosody and language structure. Mainstream speaker identification systems internationally are essentially all based on these features and solve the speaker identification problem by statistical modeling. Specifically, the speaker identification system comprises the following two modules:
A. Speaker modeling tool module: models each speaker either by a discriminative training method, such as a support vector machine (SVM), or by a statistical modeling method, such as a Gaussian mixture model (GMM), characterizing each speaker's distribution in feature space so that different speakers can be distinguished.
B. Speaker discrimination algorithm module: matches the features of the input speech against the corresponding speaker models and determines the speaker identity of the input speech from the degree of match.
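The discrimination step of module B can be sketched as follows, with each enrolled speaker reduced to a mean feature vector and Euclidean distance standing in for the GMM/SVM scoring a real system would use:

```python
import math

def identify_speaker(feature_vec, speaker_models):
    """Speaker discrimination sketch: score the input feature vector
    against each enrolled speaker model and return the best match.

    Each model is a mean feature vector and the match score is the
    (negated) Euclidean distance; a real system would instead score
    against trained GMM or SVM speaker models.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best_name, _ = min(speaker_models.items(),
                       key=lambda item: dist(feature_vec, item[1]))
    return best_name

# Two hypothetical enrolled speakers in a toy 2-D feature space.
models = {"host": [1.0, 0.0], "guest": [0.0, 1.0]}
who = identify_speaker([0.9, 0.1], models)
```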
Speech content and punctuation mark recognition module 33, for performing speech content recognition and punctuation mark recognition on the feature text and generating a speech recognition result containing marks.
The module comprises four components: an acoustic model library, a language model library, search and decoding, and punctuation mark generation; its workflow is shown in Figure 8. After the speech features are input, the search-and-decoding component selects and calls, according to whether the features come from a wideband or a narrowband signal, the acoustic model library and the intelligently learned language model library to recognize the speech content. The text (sentence) generated by recognition is then sent to the punctuation mark generation component for punctuation recognition, finally producing a speech recognition result with punctuation marks.
The recognition technologies adopted by the four components are described below:
A. Acoustic model library: this embodiment adopts a CD-DNN-HMM (context-dependent deep neural network - hidden Markov model) acoustic model library, whose recognition accuracy is higher than that of a traditional GMM-HMM (Gaussian mixture model - hidden Markov model) acoustic model library.
B. Language model library: this embodiment adopts an N-Gram (N-gram) language model. The model is based on the assumption that the appearance of the n-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting the number of times N words co-occur in a corpus. The N-Gram language model is simple and effective and is widely used in the speech recognition industry.
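A minimal bigram (N=2) instance of the model just described, with conditional probabilities taken directly from corpus counts and no smoothing applied:

```python
from collections import Counter

def train_bigram(corpus):
    """Bigram language model sketch: estimate P(w_n | w_{n-1}) from raw
    counts and score a sentence as the product of the conditional
    terms, as described above. No smoothing is applied."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()          # sentence-start token
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

    def sentence_prob(sentence):
        toks = ["<s>"] + sentence.split()
        p = 1.0
        for prev, cur in zip(toks, toks[1:]):
            p *= bigrams[(prev, cur)] / unigrams[prev]
        return p

    return sentence_prob

prob = train_bigram(["the news starts", "the news ends"])
p = prob("the news starts")   # 1.0 * 1.0 * 0.5
```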
C. Search and decoding: this embodiment adopts dynamic programming methods such as the Viterbi search algorithm, searching for the optimal result under the given models. The dynamic-programming Viterbi algorithm computes, for each state at each time point, the posterior probability of the decoded state sequence given the observation sequence, retains the maximum-probability path, and records the corresponding state information at each node so that the word decoding sequence can finally be recovered by backtracking. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the nonlinear time alignment between the HMM state sequence and the acoustic observation sequence, word boundary detection and word recognition in continuous speech recognition, and it is also the basic strategy of conventional speech recognition search.
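The Viterbi recursion just described can be sketched compactly over a toy HMM (rather than real acoustic models): keep, per state and time step, the best path probability and a backpointer, then backtrack from the best final state.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding sketch: dynamic programming over HMM states.

    For each time point and state, retain the maximum-probability path
    and record the predecessor state, then backtrack to recover the
    optimal state sequence.
    """
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    last = max(V[-1], key=V[-1].get)      # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):  # backtrack via recorded pointers
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy HMM with two states and two observation symbols.
states = ("A", "B")
start_p = {"A": 0.9, "B": 0.1}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.1, "y": 0.9}}
path = viterbi(["x", "y"], states, start_p, trans_p, emit_p)
```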
D. Punctuation mark generation: this embodiment adopts a method that uses plain text information to add punctuation to the ends of spoken Chinese sentences. Working from different sentence granularities, the method models the relationship between global lexical information and punctuation, and fuses the punctuation models obtained at the different granularities with a multilayer perceptron, thereby realizing punctuation generation (full stop, question mark and exclamation mark).
Recognition result processing module 34, which fuses the speech recognition results and marks them as structured text to generate the structured speech recognition result. In this embodiment, the recognition result processing module 34 first gathers and fuses the speech recognition results of each speech sentence file in the broadcast television data (with punctuation marks, and with a timestamp for each recognized word).
Preferably, the recognition result processing module 34 in this embodiment further comprises:
Gathering and ordering module, for gathering and aligning the speech recognition results and sorting them by the start and end times they contain;
Mark adding module, for marking the sorted speech recognition results according to a structured format, including speaker gender flag, speaker flag, speech content, punctuation marks and timestamps.
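The gathering, ordering and mark-adding steps above can be sketched as follows; the field names are illustrative assumptions:

```python
def merge_results(results):
    """Result-fusion sketch for the two submodules above: align all
    per-sentence recognition results, sort them by start time, and
    attach the structured marks (gender flag, speaker flag, content,
    timestamps) in a fixed record format."""
    ordered = sorted(results, key=lambda r: r["start"])
    return [
        {
            "gender": r["gender"],    # speaker gender flag
            "speaker": r["speaker"],  # speaker flag
            "text": r["text"],        # speech content with punctuation
            "start": r["start"],      # start time (s)
            "end": r["end"],          # end time (s)
        }
        for r in ordered
    ]

merged = merge_results([
    {"start": 5.0, "end": 8.0, "gender": "female", "speaker": "spk2", "text": "second"},
    {"start": 0.0, "end": 4.2, "gender": "male", "speaker": "spk1", "text": "first"},
])
```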
Preferably, the cloud server 30 in this embodiment further comprises: language model intelligent learning module 35, for regularly collecting network text and regularly updating the language model library by learning from it; recognition is performed against the regularly updated language model library so as to guarantee speech recognition accuracy.
The cloud server 30 in this embodiment is implemented on a speech recognition cloud service platform 36. Specifically, the speech recognition cloud service platform is built on a cloud service platform architecture combining ICE with SOA: distributed computing is completed through the ICE framework, cloud services are provided externally through the SOA framework, and the communication of recognition tasks and recognition results is carried out over Web Services.
In the service platform of this embodiment, the various identification modules (i.e. the male/female voice identification module 31, the speaker identification module 32, the speech content and punctuation mark recognition module 33 and the recognition result processing module 34) are encapsulated as plug-ins to form standard cloud services, which are configured in the architecture and become part of the cloud service platform. The identification modules can easily be added to and unloaded from the platform without affecting normal system operation; when the volume of data to be recognized grows, the cloud service platform adds identification modules adaptively so as to complete massive broadcast television speech recognition tasks.
The cloud service platform architecture is shown in Figure 9. After the broadcast television data has been pre-processed, the speech recognition task is passed to the control unit as an XML task message by calling the data access interface. The control unit then dynamically decides the optimal computing resources to execute the recognition task according to the state of the current computing resources (collected by the monitoring unit, chiefly CPU, memory and network state), combined with the task execution status of each recognition node, the task priority, and prior knowledge of execution efficiency.
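The control unit's resource-aware dispatch can be sketched as follows; the load-scoring weights and the load ceiling applied to high-priority tasks are illustrative assumptions, not from the patent:

```python
def pick_node(nodes, task_priority):
    """Scheduling sketch for the control unit: score each recognition
    node from its monitored CPU/memory/network load and running-task
    count, and dispatch the task to the least-loaded node.

    High-priority tasks (priority >= 2) additionally refuse nodes
    whose load score exceeds 0.8; both the weights and the ceiling
    are illustrative choices.
    """
    def load(n):
        return 0.5 * n["cpu"] + 0.3 * n["mem"] + 0.2 * n["net"] + 0.05 * n["tasks"]
    candidates = [n for n in nodes if task_priority < 2 or load(n) < 0.8]
    return min(candidates, key=load)["name"]

node = pick_node(
    [{"name": "n1", "cpu": 0.9, "mem": 0.8, "net": 0.5, "tasks": 4},
     {"name": "n2", "cpu": 0.2, "mem": 0.3, "net": 0.1, "tasks": 1}],
    task_priority=1,
)
```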
In summary, the recognition system provided by this embodiment fuses broadcast television data pre-processing technology, male/female voice identification technology, speaker identification technology and a broadcast television speech recognition method: the speech data is pre-processed specifically for the data-processing requirements of the broadcast television industry before recognition, and the pre-processing results, male/female voice identification results, speaker identification results and speech recognition results are fused and marked as structured text to generate a structured speech recognition result, which can provide basic data for the intelligent, automated processing of subsequent broadcast television services. In addition, because the fragmented speech data is processed in parallel, processing speed is increased and the system can cope with speech recognition over massive data; and because the language model library is regularly learned and updated intelligently, speech recognition accuracy can be improved.
The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; therefore all equivalent technical schemes also belong to the scope of the present invention, and the patent protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A broadcast television speech recognition method, characterized in that it comprises:
S1, extracting audio data from broadcast television data;
S2, pre-processing said audio data to obtain feature text data;
S3, sending said feature text data to a cloud server for recognition processing to obtain male/female voice identification, speaker identification and speech recognition results;
S4, fusing said data pre-processing, male/female voice identification, speaker identification and speech recognition results and marking them as structured text to generate a structured speech recognition result.
2. The broadcast television speech recognition method as claimed in claim 1, characterized in that the pre-processing of said audio data in step S2 specifically comprises:
S21, segmenting and fragmenting said audio data to generate several sentence files;
S22, filtering non-speech out of said sentence files, leaving the speech sentence files;
S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to speech sentence files judged to be wideband signals and a narrowband flag to those judged to be narrowband signals;
S24, performing audio feature extraction on the speech sentence files carrying wideband or narrowband flags to obtain feature text data, wherein said feature text data comprises the start and end time of the speech sentence, the speech feature information, the name of the audio-video file the sentence belongs to, and the corresponding wideband/narrowband flag.
3. The broadcast television speech recognition method as claimed in claim 1, characterized in that sending said feature text data to the cloud server for recognition processing in step S3 comprises: male/female voice identification, speaker identification, speech content recognition and punctuation mark recognition, generating a speech recognition result containing marks.
4. The broadcast television speech recognition method as claimed in claim 1, characterized in that fusing said speech recognition results and marking them as structured text in step S4 specifically comprises:
S41, gathering and aligning the speech recognition results and sorting them by the start and end times they contain;
S42, marking the sorted speech recognition results according to a structured format, including speaker gender flag, speaker flag, speech content, punctuation marks and timestamps.
5. The broadcast television speech recognition method as claimed in claim 1, characterized in that the recognition processing in step S3 is performed against a language model library, said language model library being continuously updated through network text collection and network text learning.
6. A broadcast television speech recognition system, characterized in that the system comprises:
an extraction unit, which extracts audio data from broadcast television data;
a pre-processing terminal, which pre-processes said audio data to obtain feature text data and sends it to a cloud server;
a cloud server, which performs recognition processing on said feature text data to obtain speech recognition results, and fuses said speech recognition results and marks them as structured text to generate a structured speech recognition result.
7. The broadcast television speech recognition system as claimed in claim 6, characterized in that said pre-processing terminal comprises:
a segmentation module, which segments and fragments said audio data to generate several sentence files;
a non-speech filtering module, which filters non-speech out of said sentence files, leaving the speech sentence files;
a wideband/narrowband discrimination module, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to speech sentence files judged to be wideband signals and a narrowband flag to those judged to be narrowband signals;
an audio feature extraction module, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband flags to obtain feature text data, wherein said feature text data comprises the start and end time of the speech sentence, the name of the audio-video file it belongs to, and the corresponding wideband/narrowband flag.
8. The broadcast television speech recognition system as claimed in claim 6, characterized in that said cloud server comprises:
a male/female voice identification module, for performing male/female voice identification on said feature text data;
a speaker identification module, for performing speaker identification on said feature text;
a speech content and punctuation mark recognition module, for performing speech content recognition and punctuation mark recognition on said feature text and generating a speech recognition result containing punctuation marks;
a recognition result processing module, which fuses said speech recognition results and marks them as structured text to generate a structured speech recognition result.
9. The broadcast television speech recognition system as claimed in claim 8, characterized in that said recognition result processing module further comprises:
a gathering and ordering module, for gathering and aligning the speech recognition results and sorting them by the start and end times they contain;
a mark adding module, for marking the sorted speech recognition results according to a structured format, including speaker gender flag, speaker flag, speech content, punctuation marks and timestamps.
10. The broadcast television speech recognition system as claimed in claim 6, characterized in that said cloud server further comprises: a language model intelligent learning module, for regularly collecting network text and regularly updating the language model library by learning from it, recognition being performed against the regularly updated language model library.
CN201310648375.4A 2013-12-04 2013-12-04 A kind of radio and television speech recognition system method and system Active CN103700370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310648375.4A CN103700370B (en) 2013-12-04 2013-12-04 A kind of radio and television speech recognition system method and system

Publications (2)

Publication Number Publication Date
CN103700370A true CN103700370A (en) 2014-04-02
CN103700370B CN103700370B (en) 2016-08-17

Family

ID=50361876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310648375.4A Active CN103700370B (en) 2013-12-04 2013-12-04 A kind of radio and television speech recognition system method and system

Country Status (1)

Country Link
CN (1) CN103700370B (en)

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104469616A (en) * 2014-12-17 2015-03-25 天脉聚源(北京)教育科技有限公司 Method and device for transmitting sound signals of intelligent teaching system
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
CN104936020A (en) * 2015-06-25 2015-09-23 四川迪佳通电子有限公司 Far field pick-up remote-control method and system based on set top box
CN104994400A (en) * 2015-07-06 2015-10-21 无锡天脉聚源传媒科技有限公司 Method and device for indexing video by means of acquisition of host name
CN105679319A (en) * 2015-12-29 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN105895104A (en) * 2014-05-04 2016-08-24 讯飞智元信息科技有限公司 Adaptive speaker identification method and system
CN105897360A (en) * 2016-05-18 2016-08-24 国家新闻出版广电总局监管中心 Method and system for judging broadcast quality and effect
CN105895102A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
CN105957517A (en) * 2016-04-29 2016-09-21 中国南方电网有限责任公司电网技术研究中心 Voice data structural transformation method based on open source API and system thereof
CN106100777A (en) * 2016-05-27 2016-11-09 西华大学 Broadcast support method based on speech recognition technology
CN106162319A (en) * 2015-04-20 2016-11-23 中兴通讯股份有限公司 A kind of method and device of Voice command electronic programming
CN106649643A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Audio data processing method and device
CN106683661A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN106877955A (en) * 2017-03-29 2017-06-20 西华大学 Fm broadcast signal based on hidden Markov model gives the correct time characteristic recognition method
CN106953887A (en) * 2017-01-05 2017-07-14 北京中瑞鸿程科技开发有限公司 A kind of personalized Organisation recommendations method of fine granularity radio station audio content
CN106971721A (en) * 2017-03-29 2017-07-21 沃航(武汉)科技有限公司 A kind of accent speech recognition system based on embedded mobile device
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN107291676A (en) * 2017-06-20 2017-10-24 广东小天才科技有限公司 Block method, terminal device and the computer-readable storage medium of voice document
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN108648758A (en) * 2018-03-12 2018-10-12 北京云知声信息技术有限公司 The method and system of invalid voice are detached in medical scene
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 A kind of method, apparatus and readable storage medium storing program for executing of dynamic inversely decoding
CN110580907A (en) * 2019-08-28 2019-12-17 云知声智能科技股份有限公司 Voice recognition method and system for multi-person speaking scene
CN110910863A (en) * 2019-11-29 2020-03-24 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
US20200220869A1 (en) * 2019-01-08 2020-07-09 Fidelity Information Services, Llc Systems and methods for contactless authentication using voice recognition
CN112037792A (en) * 2020-08-20 2020-12-04 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN113016189A (en) * 2018-11-16 2021-06-22 三星电子株式会社 Electronic device and method for recognizing audio scene
CN113348504A (en) * 2018-10-31 2021-09-03 雷夫.康姆有限公司 System and method for quadratic segmentation clustering, automatic speech recognition and transcription generation
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data
CN113825009A (en) * 2021-10-29 2021-12-21 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN115456150A (en) * 2022-10-18 2022-12-09 北京鼎成智造科技有限公司 Reinforced learning model construction method and system
CN113825009B (en) * 2021-10-29 2024-06-04 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0952737A2 (en) * 1998-04-21 1999-10-27 International Business Machines Corporation System and method for identifying and selecting portions of information streams for a television system
CN101539929A (en) * 2009-04-17 2009-09-23 无锡天脉聚源传媒科技有限公司 Method for indexing TV news by utilizing computer system
CN101924863A (en) * 2010-05-21 2010-12-22 中山大学 Digital television equipment
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103413557A (en) * 2013-07-08 2013-11-27 深圳Tcl新技术有限公司 Voice signal bandwidth expansion method and device thereof

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895104B (en) * 2014-05-04 2019-09-03 讯飞智元信息科技有限公司 Speaker adaptation recognition methods and system
CN105895104A (en) * 2014-05-04 2016-08-24 讯飞智元信息科技有限公司 Adaptive speaker identification method and system
CN104469616A (en) * 2014-12-17 2015-03-25 天脉聚源(北京)教育科技有限公司 Method and device for transmitting sound signals of intelligent teaching system
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
CN106162319A (en) * 2015-04-20 2016-11-23 中兴通讯股份有限公司 A kind of method and device of Voice command electronic programming
CN104936020A (en) * 2015-06-25 2015-09-23 四川迪佳通电子有限公司 Far field pick-up remote-control method and system based on set top box
CN104994400A (en) * 2015-07-06 2015-10-21 无锡天脉聚源传媒科技有限公司 Method and device for indexing video by means of acquisition of host name
CN106683661A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN105895102A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
WO2017080235A1 (en) * 2015-11-15 2017-05-18 乐视控股(北京)有限公司 Audio recording editing method and recording device
CN105679319B (en) * 2015-12-29 2019-09-03 百度在线网络技术(北京)有限公司 Voice recognition processing method and device
CN105679319A (en) * 2015-12-29 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN105957517A (en) * 2016-04-29 2016-09-21 中国南方电网有限责任公司电网技术研究中心 Voice data structural transformation method based on open source API and system thereof
CN105897360B (en) * 2016-05-18 2018-12-11 国家新闻出版广电总局监管中心 A kind of broadcasting-quality and effect method of discrimination and system
CN105897360A (en) * 2016-05-18 2016-08-24 国家新闻出版广电总局监管中心 Method and system for judging broadcast quality and effect
CN106100777A (en) * 2016-05-27 2016-11-09 西华大学 Broadcast support method based on speech recognition technology
CN106100777B (en) * 2016-05-27 2018-08-17 西华大学 Broadcast support method based on speech recognition technology
CN108062954B (en) * 2016-11-08 2020-12-08 科大讯飞股份有限公司 Speech recognition method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN106649643A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Audio data processing method and device
CN106649643B (en) * 2016-12-08 2019-10-22 腾讯音乐娱乐(深圳)有限公司 Audio data processing method and device
CN106953887A (en) * 2017-01-05 2017-07-14 北京中瑞鸿程科技开发有限公司 Personalized recommendation method for fine-grained radio audio content
CN106971721A (en) * 2017-03-29 2017-07-21 沃航(武汉)科技有限公司 Accent speech recognition system based on embedded mobile devices
CN106877955A (en) * 2017-03-29 2017-06-20 西华大学 Method for recognizing time-announcement features of FM broadcast signals based on a hidden Markov model
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN107291676A (en) * 2017-06-20 2017-10-24 广东小天才科技有限公司 Method for blocking voice files, terminal device, and computer-readable storage medium
CN108648758A (en) * 2018-03-12 2018-10-12 北京云知声信息技术有限公司 Method and system for separating invalid speech in medical scenarios
CN108648758B (en) * 2018-03-12 2020-09-01 北京云知声信息技术有限公司 Method and system for separating invalid voice in medical scene
CN113348504A (en) * 2018-10-31 2021-09-03 雷夫.康姆有限公司 Systems and methods for secondary segmentation clustering, automatic speech recognition, and transcription generation
CN113016189B (en) * 2018-11-16 2023-12-19 三星电子株式会社 Electronic device and method for recognizing audio scene
CN113016189A (en) * 2018-11-16 2021-06-22 三星电子株式会社 Electronic device and method for recognizing audio scene
US20200220869A1 (en) * 2019-01-08 2020-07-09 Fidelity Information Services, Llc Systems and methods for contactless authentication using voice recognition
CN110110294B (en) * 2019-03-26 2021-02-02 北京捷通华声科技股份有限公司 Dynamic reverse decoding method, device and readable storage medium
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 Dynamic reverse decoding method, device, and readable storage medium
CN110580907A (en) * 2019-08-28 2019-12-17 云知声智能科技股份有限公司 Voice recognition method and system for multi-person speaking scene
CN110910863A (en) * 2019-11-29 2020-03-24 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
CN110910863B (en) * 2019-11-29 2023-01-31 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
CN112037792A (en) * 2020-08-20 2020-12-04 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112037792B (en) * 2020-08-20 2022-06-17 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice
CN112818906B (en) * 2021-02-22 2023-07-11 浙江传媒学院 Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data
CN113825009A (en) * 2021-10-29 2021-12-21 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN113825009B (en) * 2021-10-29 2024-06-04 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN115456150B (en) * 2022-10-18 2023-05-16 北京鼎成智造科技有限公司 Reinforcement learning model construction method and system
CN115456150A (en) * 2022-10-18 2022-12-09 北京鼎成智造科技有限公司 Reinforcement learning model construction method and system

Also Published As

Publication number Publication date
CN103700370B (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN103700370A (en) Broadcast television voice recognition method and system
CN102122506B (en) Method for recognizing voice
CN105405439B (en) Speech playing method and device
CN101447185B (en) Content-based rapid audio classification method
CN101477798B (en) Method for analyzing and extracting audio data of a preset scene
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN107403619B (en) Voice control method and system applied to bicycle environment
CN103500579B (en) Speech recognition method, apparatus and system
CN102723078A (en) Emotion speech recognition method based on natural language comprehension
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
CN103871424A (en) Online speaker clustering analysis method based on the Bayesian information criterion
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN100508587C (en) News video retrieval method based on speech classification and recognition
CN111461173A (en) Attention mechanism-based multi-speaker clustering system and method
US11823685B2 (en) Speech recognition
Wang et al. Exploring audio semantic concepts for event-based video retrieval
CN110428853A (en) Voice activity detection method, voice activity detection device, and electronic device
CN111369981A (en) Dialect region identification method and device, electronic equipment and storage medium
CN101867742A (en) Television system based on sound control
CN111968628B (en) Signal accuracy adjusting system and method for voice instruction capture
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN107180629B (en) Voice acquisition and recognition method and system
CN108520740B (en) Audio content consistency analysis method and analysis system based on multiple characteristics
CN113516963B (en) Audio data generation method and device, server and intelligent sound box

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Chen Xinwei

Inventor before: Chen Xinwei

Inventor before: Xu Bo

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: CHEN XINWEI XU BO TO: CHEN XINWEI

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant