CN103700370A - Broadcast television voice recognition method and system - Google Patents

Broadcast television voice recognition method and system

Info

Publication number
CN103700370A
CN103700370A CN103700370B CN201310648375.4A
Authority
CN
China
Prior art keywords
voice
identification
sign
data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310648375.4A
Other languages
Chinese (zh)
Other versions
CN103700370B (en)
Inventor
陈鑫玮
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING PATTEK Co Ltd
Original Assignee
BEIJING PATTEK Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING PATTEK Co Ltd filed Critical BEIJING PATTEK Co Ltd
Priority to CN201310648375.4A priority Critical patent/CN103700370B/en
Publication of CN103700370A publication Critical patent/CN103700370A/en
Application granted
Publication of CN103700370B publication Critical patent/CN103700370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a broadcast television voice recognition method and system. The method comprises the following steps: extracting audio data from broadcast television data; pre-processing the audio data to obtain feature text data; transmitting the feature text data to a cloud server for recognition to obtain male/female voice recognition, speaker recognition and speech recognition results; and fusing these results with structured text markup to generate a structured speech recognition result. The method improves on conventional speech recognition by integrating several broadcast television data pre-processing technologies with broadcast television speech recognition, so that speech data is recognized in a way that meets the data-processing requirements of the broadcast television industry. The various recognition results are fused into a structured speech recognition result that can provide basic data for the intelligent processing of other subsequent broadcast television services, increasing processing speed and improving accuracy.

Description

Broadcast television speech recognition method and system
Technical field
The present invention relates to the field of audio and video processing technology, and in particular to a broadcast television speech recognition method and system.
Background art
At present, speech recognition in the broadcast television field mainly relies on conventional speech recognition methods designed for general industry use. Traditional speech recognition mainly adopts pattern matching and is divided into a training stage and a recognition stage. In the training stage, the user reads each word or phrase in the vocabulary in turn, and its feature vector is stored in a template library as a template. In the recognition stage, the feature vector of the input speech is compared for similarity against each template in the template library, and the template with the highest similarity is output as the recognition result.
However, applying such speech recognition in the broadcast television field suffers from the following problems:
1) The broadcast television industry has processing and operational requirements for speech recognition that differ from those of other industries. Because the traditional speech recognition described above is designed for general use across industries, it is not tailored to broadcast television and cannot filter out non-speech content from broadcast television data according to the characteristics of the industry. Non-speech content falls outside the scope of speech recognition in this industry, so if it is not filtered out it must still be transmitted and processed, wasting transmission and computation resources; its presence also causes additional recognition errors and reduces processing speed.
2) Because traditional speech recognition technology lacks recognition functions specific to the broadcast television industry, its results are insufficiently complete. For a segment of broadcast television data, for example, it cannot determine important information such as the scene in which the speech occurs or the identity of the speaker, cannot segment the speech content by speaker, and cannot identify the timestamp of each spoken word, and therefore cannot provide any valuable reference information for the intelligent, automated processing of other subsequent broadcast television services.
In summary, applying traditional speech recognition methods in the broadcast television industry leads to problems such as wasted resources, slow processing, low accuracy, and insufficient information in the output.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to perform speech recognition tailored to the characteristics of the broadcast television industry, avoiding the shortcomings of conventional speech recognition methods in broadcast television applications, and providing sufficient usable basic data for the intelligent, automated processing of other subsequent broadcast television services.
(2) Technical solution
To solve the above technical problem, the invention provides a broadcast television speech recognition method, comprising:
S1, extracting audio data from broadcast television data;
S2, pre-processing the audio data to obtain feature text data;
S3, sending the feature text data to a cloud server for recognition processing to obtain male/female voice recognition, speaker recognition and speech recognition results;
S4, fusing the data pre-processing, male/female voice recognition, speaker recognition and speech recognition results with structured text markup to generate a structured speech recognition result.
Further, in step S2 the pre-processing of the audio data specifically comprises:
S21, segmenting and fragmenting the audio data to generate a number of sentence files;
S22, applying non-speech filtering to the sentence files, retaining only the speech sentence files;
S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files identified as wideband signals and a narrowband flag to files identified as narrowband signals;
S24, performing audio feature extraction on the flagged speech sentence files to obtain feature text data, wherein the feature text data comprises the start and end times of each speech sentence, its phonetic feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband flag.
Further, in step S3 the recognition processing performed by the cloud server comprises male/female voice recognition, speaker recognition, speech content recognition and punctuation mark recognition, generating a speech recognition result containing the corresponding markers.
Further, in step S4 the fusion and structured text markup of the speech recognition results specifically comprises:
S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain;
S42, marking up the sorted recognition results in a structured format, including the speaker gender marker, speaker marker, speech content, punctuation marks and timestamps.
Further, the recognition processing in step S3 is performed with reference to a language model library, which is continuously updated through network text collection and network text learning.
To solve the above technical problem, the invention also provides a broadcast television speech recognition system, comprising:
an extraction unit, which extracts audio data from broadcast television data;
a pre-processing terminal, which pre-processes the audio data to obtain feature text data and sends it to a cloud server;
a cloud server, which performs recognition processing on the feature text data to obtain speech recognition results, and fuses these results with structured text markup to generate a structured speech recognition result.
Further, the pre-processing terminal comprises:
a segmentation module, which segments and fragments the audio data to generate a number of sentence files;
a non-speech filtering module, which applies non-speech filtering to the sentence files, retaining only the speech sentence files;
a wideband/narrowband discrimination module, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files identified as wideband signals and a narrowband flag to files identified as narrowband signals;
an audio feature extraction module, which performs audio feature extraction on the flagged speech sentence files to obtain feature text data, wherein the feature text data comprises the start and end times of each speech sentence, the name of the audio/video file to which it belongs, and the corresponding wideband/narrowband flag.
Further, the cloud server comprises:
a male/female voice recognition module, for performing male/female voice recognition on the feature text data;
a speaker recognition module, for performing speaker recognition on the feature text;
a speech content and punctuation recognition module, for performing speech content recognition and punctuation mark recognition on the feature text and generating a speech recognition result containing punctuation markers;
a recognition result processing module, which fuses the speech recognition results with structured text markup to generate a structured speech recognition result.
Further, the recognition result processing module further comprises:
a collection and sorting module, for collecting and aligning the individual recognition results and sorting them by the start and end times they contain;
a markup module, for marking up the sorted recognition results in a structured format, including the speaker gender marker, speaker marker, speech content, punctuation marks and timestamps.
Further, the cloud server also comprises a language model intelligent learning module, which regularly collects network text, updates the language model library through learning from that text, and makes the regularly updated language model library available to the recognition process.
(3) Beneficial effects
The embodiments of the present invention provide a broadcast television speech recognition method and system, wherein the method comprises: extracting audio data from broadcast television data; pre-processing the audio data to obtain feature text data; sending the feature text data to a cloud server for recognition processing to obtain male/female voice recognition, speaker recognition and speech recognition results; and fusing the data pre-processing, male/female voice recognition, speaker recognition and speech recognition results with structured text markup to generate a structured speech recognition result. Based on cloud computing, the method improves on existing speech recognition by integrating broadcast television data pre-processing technology, male/female voice recognition technology, speaker recognition technology and broadcast television speech recognition. The speech data is pre-processed specifically for the data-processing requirements of the broadcast television industry before recognition, and the pre-processing result, male/female voice recognition result, speaker recognition result and speech recognition result are fused with structured text markup to generate a structured speech recognition result. This result can provide basic data for later intelligent processing functions such as speech retrieval, subtitle recognition and presenter identification in broadcast television programs, and can both accelerate broadcast television speech recognition and improve its accuracy.
The basic data provided for the intelligent, automated processing of other subsequent broadcast television services specifically includes the following:
1) the speech recognition results and the timestamp markers of each spoken word can provide basic data for broadcast television speech content retrieval services;
2) the segmentation time-point markers of the speech sentences, together with the wideband/narrowband discrimination results, can provide boundary time-point references for splitting broadcast television programs;
3) the recognition of speech content and punctuation marks in broadcast television can provide a content reference for subtitle recognition in broadcast television programs;
4) the speaker recognition and wideband/narrowband discrimination results for each speech sentence can provide a basis for presenter identification, guest identification and speaking-scene recognition (indoor or outdoor scenes) in broadcast television programs.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of a broadcast television speech recognition method provided by embodiment one of the present invention;
Fig. 2 is a flow chart of the steps of the pre-processing operation provided by embodiment one;
Fig. 3 is a schematic diagram of the technical framework of the audio classification method used in the speech/non-speech discrimination process of embodiment one;
Fig. 4 is a detailed flow chart of speech recognition on broadcast television data as provided by embodiment one;
Fig. 5 is a schematic diagram of the composition of a broadcast television speech recognition system provided by embodiment two;
Fig. 6 is a schematic diagram of the composition of the pre-processing terminal provided by embodiment two;
Fig. 7 is a schematic diagram of the composition of the cloud server provided by embodiment two;
Fig. 8 is a workflow diagram of the speech content and punctuation mark recognition module provided by embodiment two;
Fig. 9 is a schematic diagram of the cloud service platform architecture provided by embodiment two.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below in conjunction with the drawings and examples. The following examples serve to illustrate the invention but not to limit its scope.
Embodiment one
Embodiment one of the present invention provides a broadcast television speech recognition method whose step flow is shown in Fig. 1 and specifically comprises the following steps:
Step S1, extracting audio data from broadcast television data.
Step S2, pre-processing the audio data to obtain feature text data.
Step S3, sending the feature text data to a cloud server for recognition processing to obtain male/female voice recognition, speaker recognition and speech recognition results.
Step S4, fusing the data pre-processing, male/female voice recognition, speaker recognition and speech recognition results with structured text markup to generate a structured speech recognition result.
The above method first extracts audio data from the broadcast television data (that is, audio/video data) to be recognized as provided by the user, pre-processes it to obtain feature text data, and has the cloud server perform recognition processing on it; finally the resulting data pre-processing, male/female voice recognition, speaker recognition and speech recognition results are fused with structured text markup, generating a structured speech recognition result that is returned to the user in the Extensible Markup Language (XML) format. Adding markers such as per-word timestamps, sentence timestamps, male/female voice labels and speaker labels to the recognition result can provide a basis for broadcast television speech content retrieval, subtitle recognition, presenter identification and so on, facilitating the intelligent, automated processing of other subsequent broadcast television services and providing basic data for various operations and processing.
Preferably, before step S1 the present embodiment also comprises: receiving the broadcast television data sent by the user, where this data comprises audio/video data, understood as audio data plus video data. After receiving the broadcast television data, the system first judges whether it is of an audio/video data type supported by the speech recognition system; if it is not a supported, recognizable type, processing is refused.
Audio/video decoding in this embodiment adopts the G.711 codec standard and uses the ffmpeg decoding tool to decode the audio and video, extracting the audio portion and saving it in PCM format. It is compatible with the current mainstream broadcast television audio/video formats, for example wmv, wma, wav, mp3, asf, rm, mp4, avi and flv. If the data is judged to be recognizable audio/video data, it is decoded, the audio portion is extracted from it, and the resulting audio data becomes the pending data for step S2.
Preferably, in step S2 of this embodiment the pre-processing of the audio data mainly comprises segmenting and fragmenting it according to standards suitable for speech recognition, performing speech/non-speech and wideband/narrowband discrimination and flagging on the fragmented sentence files, and finally extracting the feature text data containing the phonetic features. The step flow of the pre-processing operation is shown in Fig. 2 and specifically comprises the following steps:
Step S21, segmenting and fragmenting the audio data to generate a number of sentence files.
Because the received audio data is a relatively complete data block, it must be segmented and fragmented to generate a number of small sentence files suitable for processing by the speech recognition system. The specific segmentation process is as follows:
First, the audio data is parsed and the energy value of each audio sample point is analyzed to find the silent positions; in this embodiment, 50 frames with 200 sample points per frame are used as the silence-detection window, and when a point crosses the silence threshold it is taken as a silent position. After the silent positions are found, the audio data is segmented at them, fragmentation generates discrete sentence files, each sentence file is stamped with a time marker, and the resulting sentence files are saved in PCM format.
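The segmentation step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame size (200 samples) and window (50 frames) follow the embodiment, but the numeric energy threshold is an assumption, since the patent does not state one.

```python
# Hypothetical sketch of energy-based silence segmentation.
import numpy as np

FRAME = 200              # samples per frame (from the embodiment)
WINDOW = 50              # frames per analysis window (from the embodiment)
ENERGY_THRESHOLD = 1e-3  # assumed normalized-energy threshold

def find_silent_positions(samples: np.ndarray) -> list[int]:
    """Return sample offsets of windows whose mean energy falls below threshold."""
    silent = []
    win_len = FRAME * WINDOW
    for start in range(0, len(samples) - win_len + 1, win_len):
        window = samples[start:start + win_len]
        energy = float(np.mean(window.astype(np.float64) ** 2))
        if energy < ENERGY_THRESHOLD:
            silent.append(start)
    return silent

def segment_at_silence(samples: np.ndarray) -> list[np.ndarray]:
    """Cut the audio into sentence fragments at the detected silent positions."""
    cuts = sorted({0, *find_silent_positions(samples), len(samples)})
    return [samples[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]
```

In practice each returned fragment would then be time-stamped and written out as a PCM sentence file, as the embodiment describes.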
Step S22, applying non-speech filtering to the sentence files, retaining only the speech sentence files.
Because step S21 only segments the audio data at silent positions, the result still contains a large amount of non-speech content. This content is of no help to subsequent audio recognition; on the contrary, its transmission and processing increase the load on the speech recognition system and can cause recognition errors. Therefore the generated sentence files must be filtered for non-speech: the fragmented sentence files undergo speech/non-speech discrimination and only the speech sentence files are retained. This step is specifically as follows:
First, each fragmented sentence file is parsed and, according to a speech/non-speech classification model, a classifier discriminates speech from non-speech for each sentence file;
Secondly, according to the discrimination result, the non-speech sentence files are marked for deletion and the time positions of the sentences are recorded.
This embodiment uses an audio classification method based on a support vector machine (SVM). It first divides short sentences into silent and non-silent segments based on an energy threshold, and then, by selecting effective and robust audio features, divides the non-silent signal into four classes: speech (pure speech, non-pure speech) and non-speech (music, ambient sound). The method achieves high classification accuracy and processing speed; the technical framework of this audio classification method is shown in Fig. 3.
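A toy version of such an SVM audio classifier can be sketched with scikit-learn. The two summary features used here (mean energy and zero-crossing rate) and the synthetic training clips are illustrative assumptions; the patent says only that "effective and robust audio features" are selected.

```python
# Hypothetical SVM speech/non-speech classifier on toy data.
import numpy as np
from sklearn.svm import SVC

def clip_features(samples: np.ndarray) -> list[float]:
    """Summarize a clip as [mean energy, zero-crossing rate]."""
    x = samples.astype(np.float64)
    energy = float(np.mean(x ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(x))) > 0))
    return [energy, zcr]

# Toy training data: tonal low-frequency "speech" vs. broadband "non-speech".
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
speech_like = [np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(8000)
               for _ in range(10)]
noise_like = [rng.standard_normal(8000) for _ in range(10)]

X = [clip_features(c) for c in speech_like + noise_like]
y = [1] * 10 + [0] * 10          # 1 = speech, 0 = non-speech
clf = SVC(kernel="rbf").fit(X, y)
```

A production system would extend this to the four-class scheme (pure speech, non-pure speech, music, ambient sound) with a richer feature set.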
Step S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files identified as wideband signals and a narrowband flag to files identified as narrowband signals.
Each speech sentence undergoes wideband/narrowband discrimination so that the discrimination result can serve as a reference when choosing the speech recognition model in subsequent recognition. This step is specifically as follows:
First, the speech sentence fragments remaining after filtering are analyzed one by one to determine whether each is wideband (high sampling rate) or narrowband (low sampling rate), the result providing a reference for the choice of speech recognition model in subsequent recognition;
Secondly, each speech sentence is flagged for bandwidth: wideband-signal sentence files receive a wideband flag and narrowband-signal sentence files receive a narrowband flag.
Specifically, in this embodiment the wideband/narrowband discrimination analyzes the spectral energy of the audio signal: when the proportion of spectral energy above 8 kHz is greater than 0.1 the signal is wideband, and when it is less than or equal to 0.1 the signal is narrowband.
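The spectral-energy test above can be sketched directly with an FFT. The 8 kHz cutoff and 0.1 ratio come from the embodiment; the function name and the exact normalization are illustrative assumptions.

```python
# Hypothetical wideband/narrowband test via spectral energy ratio.
import numpy as np

def is_wideband(samples: np.ndarray, sample_rate: int,
                cutoff_hz: float = 8000.0) -> bool:
    """True if more than 10% of the spectral energy lies above cutoff_hz."""
    spectrum = np.abs(np.fft.rfft(samples.astype(np.float64))) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return False
    high = spectrum[freqs > cutoff_hz].sum()
    return (high / total) > 0.1
```

Note that the test is only meaningful when the file's sampling rate exceeds 16 kHz, since content above 8 kHz cannot exist otherwise.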
Step S24, performing audio feature extraction on the speech sentence files flagged as wideband or narrowband, obtaining feature text data that comprises the start and end times of each speech sentence, its phonetic feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband flag.
To save network bandwidth, after the speech sentence files are flagged, audio features are extracted and the audio data is converted into textual feature data to reduce the volume of network transmission, specifically as follows:
First, the flagged speech sentence files are analyzed one by one to extract MFCC (Mel-frequency cepstral coefficient) and PLP (perceptual linear prediction) features, two phonetic features in common use in the speech recognition field;
Secondly, each extracted feature sequence is time-marked, so that the final feature text data contains the start and end times of the speech sentence, the name of the audio/video file it belongs to, and the corresponding wideband/narrowband flag.
It should be noted that this step not only converts the input speech signal into robust, discriminative phonetic features capable of distinguishing different speakers, but also applies a degree of normalization on top of the feature extraction, comprising:
1) cepstral mean normalization (CMN), which mainly reduces channel effects;
2) cepstral variance normalization (CVN), which mainly reduces the impact of additive noise;
3) vocal tract length normalization (VTLN), which mainly reduces the impact of vocal tract differences;
4) Gaussianization, an extension of CMN+CVN;
5) anti-noise algorithms, which reduce the impact of background noise on system performance, using the AWF and VTS algorithms.
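The first two normalizations in the list above (CMN and CVN) can be sketched on a per-utterance feature matrix. This is a generic formulation under the assumption of per-utterance statistics; the patent does not specify which variant is used.

```python
# Hypothetical CMN + CVN over a (frames x coefficients) feature matrix, e.g. MFCCs.
import numpy as np

def cmn_cvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Subtract the per-coefficient mean (CMN), then divide by the std (CVN)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

After this step each cepstral coefficient has roughly zero mean and unit variance across the utterance, which is what removes the constant channel offset and scales away additive-noise variance.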
Preferably, in step S3 of this embodiment the feature text data is sent to the cloud server and enters the speech recognition flow. The cloud server invocation module in this embodiment adopts the Web Service interface protocol, and the broadcast television task information to be recognized is sent to the server side as an XML message for speech recognition. The XML message for a recognition task comprises the following content:
1) the name of the broadcast television file to be recognized;
2) the list of fragmented sentence files;
3) the speech/non-speech flag of each sentence file;
4) the wideband/narrowband flag of each sentence file;
5) the phonetic feature text of each sentence file identified as speech;
6) the start and end time markers of each sentence file.
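A message carrying the six items above might be assembled as follows with Python's standard library. All element and attribute names here are assumptions for illustration; the patent does not define an XML schema.

```python
# Hypothetical construction of the recognition-task XML message.
import xml.etree.ElementTree as ET

def build_task_message(file_name: str, sentences: list[dict]) -> bytes:
    task = ET.Element("recognition_task")
    ET.SubElement(task, "file_name").text = file_name
    sent_list = ET.SubElement(task, "sentences")
    for s in sentences:
        node = ET.SubElement(sent_list, "sentence", {
            "speech": "1" if s["is_speech"] else "0",  # speech/non-speech flag
            "band": s["band"],                         # "wide" or "narrow"
            "start": str(s["start"]),                  # start time marker
            "end": str(s["end"]),                      # end time marker
        })
        if s["is_speech"]:
            node.text = s["features"]                  # phonetic feature text
    return ET.tostring(task, encoding="utf-8")
```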
After the cloud server receives the recognition task, its recognition processing comprises male/female voice recognition, speaker recognition, speech content recognition and punctuation mark recognition, generating a speech recognition result containing markers. This step is specifically as follows:
(1) The phonetic feature text corresponding to each speech sentence file to be recognized is sent, one by one, as XML (Extensible Markup Language) messages to the remote server for broadcast television speech recognition processing; besides the phonetic feature text data, each XML message also contains the start and end times of the speech sentence file, the name of the broadcast television audio/video file to which it belongs, and its wideband/narrowband flag;
(2) The speech recognition system in the cloud server is built on a cloud computing framework; when the feature text of a speech sentence is sent to the broadcast television speech recognition cloud, a controller allocates computational resources for the recognition of that speech sentence file according to the current resource usage in the cloud server;
(3) The speech recognition system invokes the allocated computational resources to perform male/female voice recognition, speaker recognition, and speech content and punctuation recognition on the phonetic features. Male/female voice recognition uses a male/female voice classification model, with a classifier discriminating and flagging each sentence; speaker recognition identifies and flags the speaker of each sentence according to a speaker model library; speech content and punctuation recognition recognizes the content of each sentence, marking punctuation as it goes and time-labeling each recognized word.
Preferably, in step S4 of this embodiment the fusion and structured text markup of the speech recognition results specifically comprises:
Step S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain. Specifically, the recognition results of each speech sentence are merged, collected and arranged by the broadcast television audio/video file to which they belong, and the different recognition results for each sentence (male/female voice recognition, speaker recognition, speech content and punctuation recognition) are aligned by time point and sorted in time order.
Step S42, marking up the sorted recognition results in a structured format, including the speaker gender marker, speaker marker, speech content, punctuation marks and timestamps. Specifically, the sorted recognition results are marked up as text in a specific structured format, the markup content comprising, for each sentence file, the speaker's gender, the speaker, the speech content of the sentence, the timestamp of each spoken word in the sentence, and the punctuation at the sentence breaks.
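The align-and-merge logic of step S41 can be sketched as a join on the sentence's file and time span, producing one structured record per sentence. The record field names are illustrative assumptions, not the patent's format.

```python
# Hypothetical fusion of per-sentence results from the different recognizers.
from collections import defaultdict

def fuse_results(*result_streams: list) -> list:
    """Merge per-sentence results keyed by (file, start, end), sorted by time."""
    merged = defaultdict(dict)
    for stream in result_streams:
        for r in stream:
            key = (r["file"], r["start"], r["end"])
            merged[key].update(r)
    return sorted(merged.values(), key=lambda r: (r["file"], r["start"]))

# One sentence seen by three recognizers:
gender = [{"file": "a.mp4", "start": 0.0, "end": 2.0, "gender": "F"}]
speaker = [{"file": "a.mp4", "start": 0.0, "end": 2.0, "speaker": "host"}]
content = [{"file": "a.mp4", "start": 0.0, "end": 2.0, "text": "hello."}]
fused = fuse_results(gender, speaker, content)
```

Step S42 would then serialize each fused record into the structured markup (gender, speaker, content, punctuation, timestamps).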
Finally the structured speech recognition result is generated and fed back to the user in the form of an XML message, which comprises the following content:
1) the name of the recognized broadcast television file;
2) the list of fragmented sentence files;
3) the speech/non-speech flag of each sentence file;
4) the wideband/narrowband flag of each sentence file;
5) the speech recognition result of each sentence file;
6) the speaker marker of each sentence file;
7) the male/female voice marker of each sentence file;
8) the start and end time markers of each sentence file.
Preferably, to ensure the accuracy of speech recognition, the recognition processing in step S3 of this embodiment is performed with reference to an acoustic model library and a language model library, where the language model library is continuously updated through the collection of and learning from network text. Network text is regularly collected over the internet, and the language model library is regularly optimized by learning from it, specifically as follows:
1) Network text is regularly collected from the internet: a web crawler regularly captures web links from the major search engines (such as Baidu, Google, Soso, Sogou, Soku, etc.) and the portal websites related to broadcast television (such as CCTV.com, local network platforms, Sina, Sohu, etc.), collecting popular vocabulary and network articles.
2) The collected network articles are segmented into words, and word frequencies and word counts are tallied; the segmentation results, the collected popular network vocabulary and the statistics are entered into the language model library of the speech recognition system for reference by each recognition module, achieving the regular updating of the language model library and ensuring the accuracy of broadcast television speech recognition.
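The word-frequency tally in point 2) can be sketched with the standard library. Real Chinese text would require a dedicated word segmenter; whitespace tokenization here is a simplifying assumption for illustration.

```python
# Hypothetical word-frequency statistics for the language-model update.
from collections import Counter

def update_word_counts(counts: Counter, articles: list) -> Counter:
    """Tally token frequencies from newly collected articles into counts."""
    for article in articles:
        counts.update(article.split())
    return counts

lm_counts = Counter()
update_word_counts(lm_counts, ["breaking news tonight", "news update"])
```

In the described system these counts would feed the n-gram statistics of the language model library on each scheduled update.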
Based on the above, the specific flow in which this embodiment performs speech recognition on broadcast television data is shown in Figure 4 and comprises the following steps:
First, the broadcast television data is received and sent to the pre-processing terminal for audio/video decoding, from which the audio data is extracted. The audio is then segmented and fragmented, and each fragmented sentence file is classified as speech or non-speech: speech files continue to the next step, while non-speech files are marked as such and not processed further. Each speech sentence file then undergoes wideband/narrowband discrimination and speech feature extraction, after which the resulting feature text data is sent to the cloud server as a speech recognition task in the form of an XML message. The cloud service platform on the cloud server performs male/female voice identification, speaker identification, speech content recognition and punctuation mark recognition on the task, fuses the recognition results, and feeds them back to the service platform; meanwhile the language model library on the cloud service platform is regularly updated with new network words, popular vocabulary, etc. learned from the network, guaranteeing speech recognition accuracy. Finally, the cloud server feeds the recognition result, i.e. the structured speech recognition result, back to the user in XML form for further intelligent processing such as reference and retrieval.
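The front-end decisions in this flow (speech/non-speech filtering, wideband/narrowband flagging, feature-task assembly) can be sketched as follows; the 16 kHz sample-rate threshold and the record fields are illustrative assumptions, not from the patent:

```python
def preprocess(segments):
    """Front-end pipeline sketch: keep speech segments, tag their
    bandwidth, and emit one feature-task record per sentence file.

    Each input segment is a dict produced by audio segmentation;
    the `features` field is a placeholder for real acoustic features
    (e.g. MFCCs), which are out of scope for this sketch.
    """
    tasks = []
    for seg in segments:
        if not seg["is_speech"]:       # speech/non-speech discrimination
            continue                   # non-speech: marked, not processed further
        band = "wideband" if seg["sample_rate"] >= 16000 else "narrowband"
        tasks.append({
            "file": seg["file"],
            "band": band,
            "start": seg["start"],
            "end": seg["end"],
            "features": f"feat({seg['file']})",  # placeholder feature handle
        })
    return tasks

tasks = preprocess([
    {"file": "a.ts", "is_speech": True,  "sample_rate": 16000, "start": 0.0, "end": 3.0},
    {"file": "a.ts", "is_speech": False, "sample_rate": 16000, "start": 3.0, "end": 5.0},
    {"file": "a.ts", "is_speech": True,  "sample_rate": 8000,  "start": 5.0, "end": 9.0},
])
```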
The recognition method provided by this embodiment improves on existing speech recognition methods on the basis of cloud computing. It fuses broadcast television data pre-processing technology, male/female voice identification technology, speaker identification technology and a broadcast television speech recognition method: the speech data is pre-processed specifically for the data-processing requirements of the broadcast television industry before recognition, and the pre-processing results, male/female voice identification results, speaker identification results and speech recognition results are fused and marked as structured text to generate a structured speech recognition result. This result can provide basic data for the intelligent, automated processing of subsequent broadcast television services, specifically including the following:
5) the recognition result of the speech and the timestamp marks of each recognized word can provide basic data for broadcast television speech content retrieval services;
6) the segmentation time-point marks of the speech sentences and the wideband/narrowband discrimination results can provide reference boundary time points for splitting broadcast television programs;
7) the recognition of speech content and punctuation marks in broadcast television can provide a content reference for subtitle recognition in broadcast television programs;
8) the speaker identification results of the speech sentences and the wideband/narrowband discrimination results can provide a basis for host identification, guest identification and speaking-scene recognition (indoor scene, outdoor scene) in broadcast television programs.
In addition, processing speed is increased and the method can cope with speech recognition over massive data; and because the language model library is regularly learned and updated, speech recognition accuracy can be improved.
Embodiment 2
Embodiment 2 of the present invention also provides a broadcast television speech recognition system, whose composition is shown schematically in Figure 5. The system comprises:
Extraction unit 10, which extracts audio data from the broadcast television data;
Pre-processing terminal 20, which pre-processes the audio data to obtain feature text data and sends it to the cloud server 30;
Cloud server 30, which performs recognition processing on the feature text data to obtain speech recognition results, then fuses the results and marks them as structured text to generate the structured speech recognition result.
Preferably, the composition of the pre-processing terminal 20 in this embodiment is shown schematically in Figure 6 and specifically comprises:
Segmentation module 21, which segments and fragments the audio data to generate several sentence files;
Non-speech filtering module 22, which filters non-speech out of the sentence files, leaving the speech sentence files;
Wideband/narrowband discrimination module 23, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to files judged to be wideband signals and a narrowband flag to files judged to be narrowband signals;
Audio feature extraction module 24, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband flags to obtain feature text data, where the feature text data comprises the start and end time of the speech sentence, the speech feature information, the name of the audio-video file the sentence belongs to, and the corresponding wideband/narrowband flag.
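One common way to realize the wideband/narrowband discrimination performed by module 23 is to compare spectral energy above and below roughly 4 kHz, since narrowband (telephone-quality) audio carries almost no energy there. A minimal sketch using a plain DFT follows; the 4 kHz cutoff and the 10% energy ratio are illustrative thresholds, not taken from the patent:

```python
import math

def band_flag(samples, sample_rate):
    """Wideband/narrowband discrimination sketch: compute a plain DFT
    and compare spectral power above and below a 4 kHz cutoff.
    Narrowband audio has almost no power above the cutoff."""
    n = len(samples)
    low = high = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        re = sum(samples[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(samples[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        power = re * re + im * im
        if freq < 4000.0:
            low += power
        else:
            high += power
    # Illustrative decision rule: call it wideband if the high band
    # holds more than 10% of the low-band energy.
    return "wideband" if high > 0.10 * low else "narrowband"
```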
Preferably, the composition of the cloud server 30 in this embodiment is shown schematically in Figure 7 and specifically comprises:
Male/female voice identification module 31, for performing male/female voice identification on the feature text data.
Physiologically and psychologically, male and female speech differ markedly: for example in the fundamental frequency produced by the vocal cords, the formant frequencies produced by the structure of the vocal tract (laryngopharynx, tongue, palate, lips, teeth, etc.), and the volume and strength of the exhaled airflow. The speech signal therefore carries the speaker's gender characteristics. In this embodiment, male/female voice identification (i.e. speaker gender identification) is built on the GMM-SVM (Gaussian Mixture Model - Support Vector Machine) framework with Total Variability Modeling. Total variability modeling no longer distinguishes the speaker space from the channel space when training the space matrix; both are represented by a single total space, which simplifies the mathematical representation of the space and greatly reduces the dependence on training data. The final gender decision is produced by fusing multiple systems.
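The multi-system fusion that produces the final gender decision can be sketched as a weighted score fusion; the weights and the sign convention are illustrative assumptions, and the GMM-SVM total-variability front end that would produce the subsystem scores is omitted:

```python
def fuse_gender_decisions(subsystem_scores):
    """Multi-system fusion sketch for the final gender decision.

    Each subsystem emits a score in [-1, 1] (positive = male,
    negative = female) with an associated fusion weight; the final
    decision is the sign of the weighted sum. Both the weights and
    the sign convention are illustrative.
    """
    total = sum(weight * score for weight, score in subsystem_scores)
    return "male" if total >= 0 else "female"

# Three hypothetical subsystems with (weight, score) pairs.
decision = fuse_gender_decisions([(0.5, 0.8), (0.3, -0.2), (0.2, 0.4)])
```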
Speaker identification module 32, for performing speaker identification on the feature text.
In this embodiment, speaker identification is based on two classes of differences between speakers: first, the vocal-tract spectral characteristics of different speakers' pronunciation differ inherently, and this difference is reflected in different distributions of the speech features of their pronunciation; second, the high-level features of different speakers differ, formed over time by different living environments and backgrounds, such as idioms, prosody and language structure. Mainstream speaker identification systems internationally are essentially all based on these features and solve the speaker identification problem by statistical modeling. Specifically, the speaker identification system comprises the following two modules:
A. Speaker modeling tool module: models each speaker either by a discriminative training method, such as a support vector machine (SVM), or by a statistical modeling method, such as a Gaussian mixture model (GMM), characterizing each speaker's distribution in feature space so that different speakers can be distinguished.
B. Speaker discrimination algorithm module: matches the features of the input speech against the corresponding speaker models and determines the speaker identity of the input speech from the degree of match.
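The discrimination step of module B can be sketched as follows, with each enrolled speaker reduced to a mean feature vector and Euclidean distance standing in for the GMM/SVM scoring a real system would use:

```python
import math

def identify_speaker(feature_vec, speaker_models):
    """Speaker discrimination sketch: score the input feature vector
    against each enrolled speaker model and return the best match.

    Each model is a mean feature vector and the match score is the
    (negated) Euclidean distance; a real system would instead score
    against trained GMM or SVM speaker models.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best_name, _ = min(speaker_models.items(),
                       key=lambda item: dist(feature_vec, item[1]))
    return best_name

# Two hypothetical enrolled speakers in a toy 2-D feature space.
models = {"host": [1.0, 0.0], "guest": [0.0, 1.0]}
who = identify_speaker([0.9, 0.1], models)
```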
Speech content and punctuation mark recognition module 33, for performing speech content recognition and punctuation mark recognition on the feature text and generating a speech recognition result containing marks.
The module comprises four components: an acoustic model library, a language model library, search and decoding, and punctuation mark generation; its workflow is shown in Figure 8. After the speech features are input, the search-and-decoding component selects and calls, according to whether the features come from a wideband or a narrowband signal, the acoustic model library and the intelligently learned language model library to recognize the speech content. The text (sentence) generated by recognition is then sent to the punctuation mark generation component for punctuation recognition, finally producing a speech recognition result with punctuation marks.
The recognition technologies adopted by the four components are described below:
A. Acoustic model library: this embodiment adopts a CD-DNN-HMM (context-dependent deep neural network - hidden Markov model) acoustic model library, whose recognition accuracy is higher than that of a traditional GMM-HMM (Gaussian mixture model - hidden Markov model) acoustic model library.
B. Language model library: this embodiment adopts an N-Gram (N-gram) language model. The model is based on the assumption that the appearance of the n-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting the number of times N words co-occur in a corpus. The N-Gram language model is simple and effective and is widely used in the speech recognition industry.
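A minimal bigram (N=2) instance of the model just described, with conditional probabilities taken directly from corpus counts and no smoothing applied:

```python
from collections import Counter

def train_bigram(corpus):
    """Bigram language model sketch: estimate P(w_n | w_{n-1}) from raw
    counts and score a sentence as the product of the conditional
    terms, as described above. No smoothing is applied."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()          # sentence-start token
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

    def sentence_prob(sentence):
        toks = ["<s>"] + sentence.split()
        p = 1.0
        for prev, cur in zip(toks, toks[1:]):
            p *= bigrams[(prev, cur)] / unigrams[prev]
        return p

    return sentence_prob

prob = train_bigram(["the news starts", "the news ends"])
p = prob("the news starts")   # 1.0 * 1.0 * 0.5
```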
C. Search and decoding: this embodiment adopts dynamic programming methods such as the Viterbi search algorithm, searching for the optimal result under the given models. The dynamic-programming Viterbi algorithm computes, for each state at each time point, the posterior probability of the decoded state sequence given the observation sequence, retains the maximum-probability path, and records the corresponding state information at each node so that the word decoding sequence can finally be recovered by backtracking. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the nonlinear time alignment between the HMM state sequence and the acoustic observation sequence, word boundary detection and word recognition in continuous speech recognition, and it is also the basic strategy of conventional speech recognition search.
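The Viterbi recursion just described can be sketched compactly over a toy HMM (rather than real acoustic models): keep, per state and time step, the best path probability and a backpointer, then backtrack from the best final state.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding sketch: dynamic programming over HMM states.

    For each time point and state, retain the maximum-probability path
    and record the predecessor state, then backtrack to recover the
    optimal state sequence.
    """
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    last = max(V[-1], key=V[-1].get)      # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):  # backtrack via recorded pointers
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy HMM with two states and two observation symbols.
states = ("A", "B")
start_p = {"A": 0.9, "B": 0.1}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.1, "y": 0.9}}
path = viterbi(["x", "y"], states, start_p, trans_p, emit_p)
```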
D. Punctuation mark generation: this embodiment adopts a method that uses plain text information to add punctuation to the ends of spoken Chinese sentences. Working from different sentence granularities, the method models the relationship between global lexical information and punctuation, and fuses the punctuation models obtained at the different granularities with a multilayer perceptron, thereby realizing punctuation generation (full stop, question mark and exclamation mark).
Recognition result processing module 34, which fuses the speech recognition results and marks them as structured text to generate the structured speech recognition result. In this embodiment, the recognition result processing module 34 first gathers and fuses the speech recognition results of each speech sentence file in the broadcast television data (with punctuation marks, and with a timestamp for each recognized word).
Preferably, the recognition result processing module 34 in this embodiment further comprises:
Gathering and ordering module, for gathering and aligning the speech recognition results and sorting them by the start and end times they contain;
Mark adding module, for marking the sorted speech recognition results according to a structured format, including speaker gender flag, speaker flag, speech content, punctuation marks and timestamps.
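The gathering, ordering and mark-adding steps above can be sketched as follows; the field names are illustrative assumptions:

```python
def merge_results(results):
    """Result-fusion sketch for the two submodules above: align all
    per-sentence recognition results, sort them by start time, and
    attach the structured marks (gender flag, speaker flag, content,
    timestamps) in a fixed record format."""
    ordered = sorted(results, key=lambda r: r["start"])
    return [
        {
            "gender": r["gender"],    # speaker gender flag
            "speaker": r["speaker"],  # speaker flag
            "text": r["text"],        # speech content with punctuation
            "start": r["start"],      # start time (s)
            "end": r["end"],          # end time (s)
        }
        for r in ordered
    ]

merged = merge_results([
    {"start": 5.0, "end": 8.0, "gender": "female", "speaker": "spk2", "text": "second"},
    {"start": 0.0, "end": 4.2, "gender": "male", "speaker": "spk1", "text": "first"},
])
```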
Preferably, the cloud server 30 in this embodiment further comprises: language model intelligent learning module 35, for regularly collecting network text and regularly updating the language model library by learning from it; recognition is performed against the regularly updated language model library so as to guarantee speech recognition accuracy.
The cloud server 30 in this embodiment is implemented on a speech recognition cloud service platform 36. Specifically, the speech recognition cloud service platform is built on a cloud service platform architecture combining ICE with SOA: distributed computing is completed through the ICE framework, cloud services are provided externally through the SOA framework, and the communication of recognition tasks and recognition results is carried out over Web Services.
In the service platform of this embodiment, the various identification modules (i.e. the male/female voice identification module 31, the speaker identification module 32, the speech content and punctuation mark recognition module 33 and the recognition result processing module 34) are encapsulated as plug-ins to form standard cloud services, which are configured in the architecture and become part of the cloud service platform. The identification modules can easily be added to and unloaded from the platform without affecting normal system operation; when the volume of data to be recognized grows, the cloud service platform adds identification modules adaptively so as to complete massive broadcast television speech recognition tasks.
The cloud service platform architecture is shown in Figure 9. After the broadcast television data has been pre-processed, the speech recognition task is passed to the control unit as an XML task message by calling the data access interface. The control unit then dynamically decides the optimal computing resources to execute the recognition task according to the state of the current computing resources (collected by the monitoring unit, chiefly CPU, memory and network state), combined with the task execution status of each recognition node, the task priority, and prior knowledge of execution efficiency.
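The control unit's resource-aware dispatch can be sketched as follows; the load-scoring weights and the load ceiling applied to high-priority tasks are illustrative assumptions, not from the patent:

```python
def pick_node(nodes, task_priority):
    """Scheduling sketch for the control unit: score each recognition
    node from its monitored CPU/memory/network load and running-task
    count, and dispatch the task to the least-loaded node.

    High-priority tasks (priority >= 2) additionally refuse nodes
    whose load score exceeds 0.8; both the weights and the ceiling
    are illustrative choices.
    """
    def load(n):
        return 0.5 * n["cpu"] + 0.3 * n["mem"] + 0.2 * n["net"] + 0.05 * n["tasks"]
    candidates = [n for n in nodes if task_priority < 2 or load(n) < 0.8]
    return min(candidates, key=load)["name"]

node = pick_node(
    [{"name": "n1", "cpu": 0.9, "mem": 0.8, "net": 0.5, "tasks": 4},
     {"name": "n2", "cpu": 0.2, "mem": 0.3, "net": 0.1, "tasks": 1}],
    task_priority=1,
)
```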
In summary, the recognition system provided by this embodiment fuses broadcast television data pre-processing technology, male/female voice identification technology, speaker identification technology and a broadcast television speech recognition method: the speech data is pre-processed specifically for the data-processing requirements of the broadcast television industry before recognition, and the pre-processing results, male/female voice identification results, speaker identification results and speech recognition results are fused and marked as structured text to generate a structured speech recognition result, which can provide basic data for the intelligent, automated processing of subsequent broadcast television services. In addition, because the fragmented speech data is processed in parallel, processing speed is increased and the system can cope with speech recognition over massive data; and because the language model library is regularly learned and updated intelligently, speech recognition accuracy can be improved.
The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; therefore all equivalent technical schemes also belong to the scope of the present invention, and the patent protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A broadcast television speech recognition method, characterized in that it comprises:
S1, extracting audio data from broadcast television data;
S2, pre-processing said audio data to obtain feature text data;
S3, sending said feature text data to a cloud server for recognition processing to obtain male/female voice identification, speaker identification and speech recognition results;
S4, fusing said data pre-processing, male/female voice identification, speaker identification and speech recognition results and marking them as structured text to generate a structured speech recognition result.
2. The broadcast television speech recognition method as claimed in claim 1, characterized in that the pre-processing of said audio data in step S2 specifically comprises:
S21, segmenting and fragmenting said audio data to generate several sentence files;
S22, filtering non-speech out of said sentence files, leaving the speech sentence files;
S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to speech sentence files judged to be wideband signals and a narrowband flag to those judged to be narrowband signals;
S24, performing audio feature extraction on the speech sentence files carrying wideband or narrowband flags to obtain feature text data, wherein said feature text data comprises the start and end time of the speech sentence, the speech feature information, the name of the audio-video file the sentence belongs to, and the corresponding wideband/narrowband flag.
3. The broadcast television speech recognition method as claimed in claim 1, characterized in that sending said feature text data to the cloud server for recognition processing in step S3 comprises: male/female voice identification, speaker identification, speech content recognition and punctuation mark recognition, generating a speech recognition result containing marks.
4. The broadcast television speech recognition method as claimed in claim 1, characterized in that fusing said speech recognition results and marking them as structured text in step S4 specifically comprises:
S41, gathering and aligning the speech recognition results and sorting them by the start and end times they contain;
S42, marking the sorted speech recognition results according to a structured format, including speaker gender flag, speaker flag, speech content, punctuation marks and timestamps.
5. The broadcast television speech recognition method as claimed in claim 1, characterized in that the recognition processing in step S3 is performed against a language model library, said language model library being continuously updated through network text collection and network text learning.
6. A broadcast television speech recognition system, characterized in that the system comprises:
an extraction unit, which extracts audio data from broadcast television data;
a pre-processing terminal, which pre-processes said audio data to obtain feature text data and sends it to a cloud server;
a cloud server, which performs recognition processing on said feature text data to obtain speech recognition results, and fuses said speech recognition results and marks them as structured text to generate a structured speech recognition result.
7. The broadcast television speech recognition system as claimed in claim 6, characterized in that said pre-processing terminal comprises:
a segmentation module, which segments and fragments said audio data to generate several sentence files;
a non-speech filtering module, which filters non-speech out of said sentence files, leaving the speech sentence files;
a wideband/narrowband discrimination module, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband flag to speech sentence files judged to be wideband signals and a narrowband flag to those judged to be narrowband signals;
an audio feature extraction module, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband flags to obtain feature text data, wherein said feature text data comprises the start and end time of the speech sentence, the name of the audio-video file it belongs to, and the corresponding wideband/narrowband flag.
8. The broadcast television speech recognition system as claimed in claim 6, characterized in that said cloud server comprises:
a male/female voice identification module, for performing male/female voice identification on said feature text data;
a speaker identification module, for performing speaker identification on said feature text;
a speech content and punctuation mark recognition module, for performing speech content recognition and punctuation mark recognition on said feature text and generating a speech recognition result containing punctuation marks;
a recognition result processing module, which fuses said speech recognition results and marks them as structured text to generate a structured speech recognition result.
9. The broadcast television speech recognition system as claimed in claim 8, characterized in that said recognition result processing module further comprises:
a gathering and ordering module, for gathering and aligning the speech recognition results and sorting them by the start and end times they contain;
a mark adding module, for marking the sorted speech recognition results according to a structured format, including speaker gender flag, speaker flag, speech content, punctuation marks and timestamps.
10. The broadcast television speech recognition system as claimed in claim 6, characterized in that said cloud server further comprises: a language model intelligent learning module, for regularly collecting network text and regularly updating the language model library by learning from it, recognition being performed against the regularly updated language model library.
CN201310648375.4A 2013-12-04 2013-12-04 A kind of radio and television speech recognition system method and system Active CN103700370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310648375.4A CN103700370B (en) 2013-12-04 2013-12-04 A kind of radio and television speech recognition system method and system

Publications (2)

Publication Number Publication Date
CN103700370A true CN103700370A (en) 2014-04-02
CN103700370B CN103700370B (en) 2016-08-17

Family

ID=50361876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310648375.4A Active CN103700370B (en) 2013-12-04 2013-12-04 A kind of radio and television speech recognition system method and system

Country Status (1)

Country Link
CN (1) CN103700370B (en)

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104469616A (en) * 2014-12-17 2015-03-25 天脉聚源(北京)教育科技有限公司 Method and device for transmitting sound signals of intelligent teaching system
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
CN104936020A (en) * 2015-06-25 2015-09-23 四川迪佳通电子有限公司 Far field pick-up remote-control method and system based on set top box
CN104994400A (en) * 2015-07-06 2015-10-21 无锡天脉聚源传媒科技有限公司 Method and device for indexing video by means of acquisition of host name
CN105679319A (en) * 2015-12-29 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN105895104A (en) * 2014-05-04 2016-08-24 讯飞智元信息科技有限公司 Adaptive speaker identification method and system
CN105897360A (en) * 2016-05-18 2016-08-24 国家新闻出版广电总局监管中心 Method and system for judging broadcast quality and effect
CN105895102A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
CN105957517A (en) * 2016-04-29 2016-09-21 中国南方电网有限责任公司电网技术研究中心 Voice data structural transformation method based on open source API and system thereof
CN106100777A (en) * 2016-05-27 2016-11-09 西华大学 Broadcast support method based on speech recognition technology
CN106162319A (en) * 2015-04-20 2016-11-23 中兴通讯股份有限公司 A kind of method and device of Voice command electronic programming
CN106649643A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Audio data processing method and device
CN106683661A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN106877955A (en) * 2017-03-29 2017-06-20 西华大学 Fm broadcast signal based on hidden Markov model gives the correct time characteristic recognition method
CN106953887A (en) * 2017-01-05 2017-07-14 北京中瑞鸿程科技开发有限公司 A kind of personalized Organisation recommendations method of fine granularity radio station audio content
CN106971721A (en) * 2017-03-29 2017-07-21 沃航(武汉)科技有限公司 A kind of accent speech recognition system based on embedded mobile device
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN107291676A (en) * 2017-06-20 2017-10-24 广东小天才科技有限公司 Block method, terminal device and the computer-readable storage medium of voice document
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN108648758A (en) * 2018-03-12 2018-10-12 北京云知声信息技术有限公司 The method and system of invalid voice are detached in medical scene
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 A kind of method, apparatus and readable storage medium storing program for executing of dynamic inversely decoding
CN110580907A (en) * 2019-08-28 2019-12-17 云知声智能科技股份有限公司 Voice recognition method and system for multi-person speaking scene
CN110910863A (en) * 2019-11-29 2020-03-24 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
US20200220869A1 (en) * 2019-01-08 2020-07-09 Fidelity Information Services, Llc Systems and methods for contactless authentication using voice recognition
CN112037792A (en) * 2020-08-20 2020-12-04 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN113016189A (en) * 2018-11-16 2021-06-22 三星电子株式会社 Electronic device and method for recognizing audio scene
CN113348504A (en) * 2018-10-31 2021-09-03 雷夫.康姆有限公司 System and method for quadratic segmentation clustering, automatic speech recognition and transcription generation
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data
CN113825009A (en) * 2021-10-29 2021-12-21 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN115456150A (en) * 2022-10-18 2022-12-09 北京鼎成智造科技有限公司 Reinforced learning model construction method and system
CN113825009B (en) * 2021-10-29 2024-06-04 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0952737A2 (en) * 1998-04-21 1999-10-27 International Business Machines Corporation System and method for identifying and selecting portions of information streams for a television system
CN101539929A (en) * 2009-04-17 2009-09-23 无锡天脉聚源传媒科技有限公司 Method for indexing TV news by utilizing computer system
CN101924863A (en) * 2010-05-21 2010-12-22 中山大学 Digital television equipment
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103413557A (en) * 2013-07-08 2013-11-27 深圳Tcl新技术有限公司 Voice signal bandwidth expansion method and device thereof

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895104B (en) * 2014-05-04 2019-09-03 讯飞智元信息科技有限公司 Speaker adaptation recognition methods and system
CN105895104A (en) * 2014-05-04 2016-08-24 讯飞智元信息科技有限公司 Adaptive speaker identification method and system
CN104469616A (en) * 2014-12-17 2015-03-25 天脉聚源(北京)教育科技有限公司 Method and device for transmitting sound signals of intelligent teaching system
CN104751847A (en) * 2015-03-31 2015-07-01 刘畅 Data acquisition method and system based on overprint recognition
CN106162319A (en) * 2015-04-20 2016-11-23 中兴通讯股份有限公司 A kind of method and device of Voice command electronic programming
CN104936020A (en) * 2015-06-25 2015-09-23 四川迪佳通电子有限公司 Far field pick-up remote-control method and system based on set top box
CN104994400A (en) * 2015-07-06 2015-10-21 无锡天脉聚源传媒科技有限公司 Method and device for indexing video by means of acquisition of host name
CN106683661A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN105895102A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
WO2017080235A1 (en) * 2015-11-15 2017-05-18 乐视控股(北京)有限公司 Audio recording editing method and recording device
CN105679319B (en) * 2015-12-29 2019-09-03 百度在线网络技术(北京)有限公司 Voice recognition processing method and device
CN105679319A (en) * 2015-12-29 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN105957517A (en) * 2016-04-29 2016-09-21 中国南方电网有限责任公司电网技术研究中心 Voice data structural transformation method based on open source API and system thereof
CN105897360B (en) * 2016-05-18 2018-12-11 国家新闻出版广电总局监管中心 A kind of broadcasting-quality and effect method of discrimination and system
CN105897360A (en) * 2016-05-18 2016-08-24 国家新闻出版广电总局监管中心 Method and system for judging broadcast quality and effect
CN106100777A (en) * 2016-05-27 2016-11-09 西华大学 Broadcast support method based on speech recognition technology
CN106100777B (en) * 2016-05-27 2018-08-17 西华大学 Broadcast support method based on speech recognition technology
CN108062954B (en) * 2016-11-08 2020-12-08 科大讯飞股份有限公司 Speech recognition method and device
CN108062954A (en) * 2016-11-08 2018-05-22 科大讯飞股份有限公司 Audio recognition method and device
CN106649643A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Audio data processing method and device
CN106649643B (en) * 2016-12-08 2019-10-22 腾讯音乐娱乐(深圳)有限公司 Audio data processing method and device
CN106953887A (en) * 2017-01-05 2017-07-14 北京中瑞鸿程科技开发有限公司 Personalized recommendation method for fine-grained radio audio content
CN106971721A (en) * 2017-03-29 2017-07-21 沃航(武汉)科技有限公司 Accent speech recognition system based on embedded mobile devices
CN106877955A (en) * 2017-03-29 2017-06-20 西华大学 Method for recognizing time-announcement features of FM broadcast signals based on a hidden Markov model
CN107203616A (en) * 2017-05-24 2017-09-26 苏州百智通信息技术有限公司 The mask method and device of video file
CN107291676A (en) * 2017-06-20 2017-10-24 广东小天才科技有限公司 Method for blocking voice files, terminal device, and computer-readable storage medium
CN108648758A (en) * 2018-03-12 2018-10-12 北京云知声信息技术有限公司 Method and system for separating invalid speech in medical scenarios
CN108648758B (en) * 2018-03-12 2020-09-01 北京云知声信息技术有限公司 Method and system for separating invalid voice in medical scene
CN113348504A (en) * 2018-10-31 2021-09-03 雷夫.康姆有限公司 Systems and methods for secondary segmentation clustering, automatic speech recognition, and transcription generation
CN113016189B (en) * 2018-11-16 2023-12-19 三星电子株式会社 Electronic device and method for recognizing audio scene
CN113016189A (en) * 2018-11-16 2021-06-22 三星电子株式会社 Electronic device and method for recognizing audio scene
US20200220869A1 (en) * 2019-01-08 2020-07-09 Fidelity Information Services, Llc Systems and methods for contactless authentication using voice recognition
CN110110294B (en) * 2019-03-26 2021-02-02 北京捷通华声科技股份有限公司 Dynamic reverse decoding method, device and readable storage medium
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 Dynamic reverse decoding method, device, and readable storage medium
CN110580907A (en) * 2019-08-28 2019-12-17 云知声智能科技股份有限公司 Voice recognition method and system for multi-person speaking scene
CN110910863A (en) * 2019-11-29 2020-03-24 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
CN110910863B (en) * 2019-11-29 2023-01-31 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
CN112037792A (en) * 2020-08-20 2020-12-04 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112037792B (en) * 2020-08-20 2022-06-17 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice
CN112818906B (en) * 2021-02-22 2023-07-11 浙江传媒学院 Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data
CN113825009A (en) * 2021-10-29 2021-12-21 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN113825009B (en) * 2021-10-29 2024-06-04 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN115456150B (en) * 2022-10-18 2023-05-16 北京鼎成智造科技有限公司 Reinforcement learning model construction method and system
CN115456150A (en) * 2022-10-18 2022-12-09 北京鼎成智造科技有限公司 Reinforcement learning model construction method and system

Also Published As

Publication number Publication date
CN103700370B (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN103700370A (en) Broadcast television voice recognition method and system
CN102122506B (en) Method for recognizing voice
CN105405439B (en) Speech playing method and device
CN101447185B (en) Content-based rapid audio classification method
CN101477798B (en) Method for analyzing and extracting audio data of a preset scene
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN107403619B (en) Voice control method and system applied to bicycle environment
CN103500579B (en) Speech recognition method, apparatus and system
CN102723078A (en) Emotion speech recognition method based on natural language comprehension
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
CN103871424A (en) Online speaker clustering analysis method based on the Bayesian information criterion
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN100508587C (en) News video retrieval method based on speech classification and recognition
CN111461173A (en) Attention mechanism-based multi-speaker clustering system and method
US11823685B2 (en) Speech recognition
Wang et al. Exploring audio semantic concepts for event-based video retrieval
CN110428853A (en) Voice activity detection method, voice activity detection device, and electronic device
CN111369981A (en) Dialect region identification method and device, electronic equipment and storage medium
CN101867742A (en) Television system based on sound control
CN111968628B (en) Signal accuracy adjusting system and method for voice instruction capture
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN107180629B (en) Voice acquisition and recognition method and system
CN108520740B (en) Audio content consistency analysis method and analysis system based on multiple characteristics
CN113516963B (en) Audio data generation method and device, server and intelligent sound box

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Chen Xinwei

Inventor before: Chen Xinwei

Inventor before: Xu Bo

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: CHEN XINWEI XU BO TO: CHEN XINWEI

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant