CN110427459A - Visualized generation method, system and platform for a speech recognition network - Google Patents

Visualized generation method, system and platform for a speech recognition network

Info

Publication number
CN110427459A
CN110427459A CN201910719492.2A CN201910719492A
Authority
CN
China
Prior art keywords
language model
general
speech recognition
corpus
wfst
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910719492.2A
Other languages
Chinese (zh)
Other versions
CN110427459B (en)
Inventor
王雪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN201910719492.2A priority Critical patent/CN110427459B/en
Publication of CN110427459A publication Critical patent/CN110427459A/en
Application granted granted Critical
Publication of CN110427459B publication Critical patent/CN110427459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0638 - Interactive procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a visualized generation method for a speech recognition network. The method comprises: receiving a keyword through a human-computer interaction interface; selecting a current domain field from a plurality of preset general domain fields, where each general domain field corresponds to a plurality of preset crawl words and a plurality of corresponding preset Web crawl pages; obtaining a general corpus; obtaining a specific corpus; training the corpora to obtain a general language model and a specific language model; and, after connecting the WFST speech recognition network of the general language model in parallel with the WFST speech recognition network of the specific language model, combining them with an acoustic model and a pronunciation dictionary and synthesizing the final WFST speech recognition network through composition, determinization, and minimization operations. By configuring the system on a single platform, the training speed of the language model is accelerated, the production cycle is shortened, and labor consumption and labor cost are reduced. Meanwhile, merging the general language model network with the specific language model improves the accuracy and efficiency of speech recognition.

Description

Visualized generation method, system and platform for a speech recognition network
Technical field
The invention belongs to the technical field of speech recognition, and more particularly relates to a visualized generation method, system, and platform for a speech recognition network.
Background technique
There are currently few visualized language-model production systems on the market; most language model production is customized at the command-line level. Language model production plays a pivotal role in speech recognition, and every speech company has its own team responsible for models, but most models are built on the command line. In the prior art, the customization process carried out on the command line is uncontrollable, versions are hard to manage, risk is uncontrollable, and the workflow is insufficiently streamlined. The cause of these drawbacks is that model training is driven by manually entered commands and assorted scripts on the command line. Manual training on the command line lacks continuous, effective supervision and checking, so both the process and the risk are uncontrollable. The inefficiency of command-line operation cannot satisfy multi-task language model training, and the workflow is complicated. Meanwhile, in the prior art, visualization during model production is poor, which hinders model production.
In view of the above problems, the methods currently on the market for solving them are as follows: formulating a standard process for language model training, standardizing script management, managing data in a unified way, developing more effective scripts to automate each step, and arranging multiple people to cross-check step by step. These solutions are not integrated with one another, nor do they solve the problems holistically with one complete system.
It follows that, in the prior art, when a visualized speech recognition network is used for speech recognition, each customization process during generation is uncontrollable, versions are hard to manage, and multi-task language model training cannot be satisfied. Meanwhile, poor visualization during model production makes simultaneous editing by multiple users difficult, which reduces the generation efficiency and accuracy of speech recognition models.
Summary of the invention
Embodiments of the present invention provide a language model generation method and apparatus that solve at least one of the above technical problems.
In a first aspect, a visualized generation method for a speech recognition network is provided. The method can run on the Web side and includes:
Step S101: receiving a keyword through a human-computer interaction interface, and selecting a current domain field from a plurality of preset general domain fields, where each general domain field corresponds to a plurality of preset crawl words and a plurality of corresponding preset Web crawl pages.
Step S102: obtaining the corresponding preset crawl words according to the current domain field, crawling, according to the preset crawl words, the plurality of preset Web crawl pages corresponding to the current domain field to obtain a first crawl result, and obtaining a general corpus according to the first crawl result.
Step S103: setting the keyword as the current crawl word, crawling, according to the current crawl word, the result pages returned by a set search engine on the Web side to obtain a second crawl result, and obtaining a specific corpus according to the second crawl result.
Step S104: training a general language model in arpa format based on the general corpus, and training a specific language model in arpa format based on the specific corpus. The file information of the general language model and the file information of the specific language model each contain a version number serving as an identifier.
Step S105: merging the general language model and the specific language model, and synthesizing a WFST speech recognition network after combining them with an acoustic model and pronunciation dictionary data.
In a preferred embodiment of the present invention, step S105 is followed by step S106: testing the WFST speech recognition network separately according to the set test sets of a plurality of configured interfaces, obtaining test recognition data for each configured interface, and displaying the test recognition data of the configured interfaces, where the test recognition data includes identification information of the corresponding configured interface.
In a preferred embodiment of the present invention, step S102 further includes step S1021: scoring the entries in the general corpus with a scoring language model to obtain a score for each entry; if an entry's score is greater than a set threshold, the entry is retained, otherwise the entry is deleted from the general corpus.
In a preferred embodiment of the present invention, step S103 further includes step S1031: obtaining the ranking of each entry in the specific corpus in the set search engine, intercepting a set number of entries from the front of the search-engine ranking, and updating the specific corpus accordingly.
In a preferred embodiment of the present invention, the step of training the general language model in arpa format based on the general corpus in step S104 includes: adding a set mandatory-parameter button on the human-computer interaction interface; if selection information of the set mandatory-parameter button is received, training the general language model in arpa format based on the general corpus.
The step of testing the WFST speech recognition network separately according to the set test sets of the plurality of configured interfaces in step S106 includes: adding a set mandatory-parameter button on the human-computer interaction interface; if selection information of the set mandatory-parameter button is received, testing the WFST speech recognition network according to the set test set.
In a preferred embodiment of the present invention, the step of merging the general language model and the specific language model in step S105 is: converting the general language model into WFST form, converting the specific language model into WFST form, adding a start node before the first nodes of the general language model converted into WFST form and of the specific language model converted into WFST form, and thereby merging the general language model and the specific language model.
In a preferred embodiment of the present invention, step S102 further includes generating a run key for step S102 on the human-computer interaction interface; if the run of step S101 has finished, the run key of step S102 is enabled. Step S103 further includes generating a run key for step S103 on the human-computer interaction interface; if the run of step S102 has finished, the run key of step S103 is enabled.
Step S104 further includes generating a run key for step S104 on the human-computer interaction interface; if the run of step S103 has finished, the run key of step S104 is enabled. Step S105 further includes generating a run key for step S105 on the human-computer interaction interface; if the run of step S104 has finished, the run key of step S105 is enabled.
In a second aspect, a visualized production system for a speech recognition network is provided, including a user interaction unit, a general corpus acquisition unit, a specific corpus acquisition unit, a language model acquisition unit, and a WFST speech recognition network acquisition unit.
The user interaction unit is configured to receive a keyword through a human-computer interaction interface and to select a current domain field from a plurality of preset general domain fields, where each general domain field corresponds to a plurality of preset crawl words and a plurality of corresponding preset Web crawl pages.
The general corpus acquisition unit is configured to obtain the corresponding preset crawl words according to the current domain field, crawl, according to the preset crawl words, the plurality of preset Web crawl pages corresponding to the current domain field to obtain a first crawl result, and obtain a general corpus according to the first crawl result.
The specific corpus acquisition unit is configured to set the keyword as the current crawl word, crawl, according to the current crawl word, the result pages returned by a set search engine on the Web side to obtain a second crawl result, and obtain a specific corpus according to the second crawl result.
The language model acquisition unit is configured to train a general language model in arpa format based on the general corpus and to train a specific language model in arpa format based on the specific corpus. The file information of the general language model and the file information of the specific language model each contain a version number serving as an identifier.
The WFST speech recognition network acquisition unit merges the general language model and the specific language model and synthesizes a WFST speech recognition network after combining them with the acoustic model and pronunciation dictionary data.
A preferred embodiment of the visualized production system of the present invention further includes a test unit.
The test unit is configured to test the WFST speech recognition network separately according to the set test sets of a plurality of configured interfaces, obtain the test recognition data of each configured interface, and display the test recognition data of the configured interfaces, where the test recognition data includes identification information of the corresponding configured interface.
In a third aspect, a visualized production platform for a speech recognition network of the present invention is provided. The platform loads the visualized production system for a speech recognition network of the present invention. The system allows multiple development groups to operate simultaneously; each development group includes multiple developers, and each developer can use one independent unit. An independent unit is a single unit in the visualized production system for a speech recognition network of the present invention.
The visualized production platform is configured to store the general language models and specific language models generated or used by the multiple development groups, and the visualized production platform establishes multiple version-number correspondences according to the version numbers of the general language models and the version numbers of the specific language models generated or used by the multiple development groups.
A current development group can select a current model from the general language models and specific language models stored on the visualized production platform. If the current development group deletes, replaces, or edits the current model, the visualized production platform notifies the corresponding development groups according to the multiple version-number correspondences, and the current development group operates on the current model according to the return information of the corresponding development groups.
In a fourth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the method of any embodiment of the present invention.
In a fifth aspect, an embodiment of the present invention also provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to perform the steps of the method of any embodiment of the present invention.
By configuring the system on a single platform, the present invention accelerates the training speed of the language model, shortens the production cycle, and helps isolate the products of different users from one another. It reduces labor consumption and saves labor costs. Meanwhile, merging the general language model network with the specific language model improves the accuracy and efficiency of speech recognition.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of the visualized generation method for a speech recognition network provided by an embodiment of the present invention.
Fig. 2 is a flowchart of the visualized generation method for a speech recognition network provided by another embodiment of the present invention.
Fig. 3 is a sub-flowchart of step S102 provided by an embodiment of the present invention.
Fig. 4 is a sub-flowchart of step S103 provided by an embodiment of the present invention.
Fig. 5 is a schematic composition diagram of the visualized production system for a speech recognition network provided by an embodiment of the present invention.
Fig. 6 is a schematic composition diagram of the visualized production system for a speech recognition network provided by another embodiment of the present invention.
Fig. 7 is a flowchart of the visualized generation method for a speech recognition network provided by yet another embodiment of the present invention.
Fig. 8 is a structural schematic diagram of the electronic device provided by an embodiment of the present invention.
Specific embodiments
To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In one aspect, the present invention provides a visualized generation method for a speech recognition network. The method can run on the Web side. As shown in Fig. 1, the visualized generation method of the present invention includes:
Step S101: obtaining a keyword and a general domain field.
In this step, a keyword is received through the human-computer interaction interface, and a current domain field is selected from a plurality of preset general domain fields; each general domain field corresponds to a plurality of preset crawl words and a plurality of corresponding preset Web crawl pages.
For example, the plurality of general domain fields includes the three fields "circuits", "chemistry", and "machinery". These three fields are displayed on the interactive interface of the user terminal. The user, for example, selects "electronics" on the interactive interface as the current general domain field. The device that displays the interactive interface of the user terminal is an intelligent terminal or a touch-screen device. Locally at the user terminal, or at a remote end that can be connected to the user terminal, the plurality of preset crawlers or crawl information corresponding to "electronics", "chemistry", and "machinery" is prestored, together with the Web crawl page information corresponding to each field. For example, the Web crawl page information corresponding to "electronics" consists of Web pages used in occasions such as electronics-industry science popularization and applications.
In addition, the user inputs a keyword on the interactive interface. The keyword refers to the specific sub-field, within the field corresponding to the general domain field, that particularly needs to be recognized. For example, when the domain field selected by the user is "circuits", the input keyword can be a specialized circuit term such as "discrete device circuits", "integrated device circuits", or "analog circuits". This helps improve the preparation of the corpus.
Step S102: obtaining a general corpus.
In this step, the corresponding preset crawl words are obtained according to the current domain field; the plurality of preset Web crawl pages corresponding to the current domain field is crawled according to the preset crawl words to obtain a first crawl result, and the general corpus is obtained according to the first crawl result.
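Step S102 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: live HTTP crawling is replaced by an in-memory stand-in for the preset Web crawl pages, and the field names, crawl words, and page texts are all hypothetical, not from the patent.

```python
# Sketch of step S102: preset crawl words select domain-relevant sentences
# from the preset Web crawl pages (simulated here as in-memory text).
PRESET_CRAWL_WORDS = {  # hypothetical preset mapping: domain field -> crawl words
    "circuits": ["resistor", "voltage", "amplifier"],
}
PRESET_PAGES = {        # hypothetical preset Web crawl pages per domain field
    "circuits": [
        "A resistor limits current. The weather is nice today.",
        "An amplifier raises voltage levels in a circuit.",
    ],
}

def build_general_corpus(field: str) -> list[str]:
    """Keep only sentences containing at least one preset crawl word
    (the 'first crawl result'), forming the general corpus."""
    words = PRESET_CRAWL_WORDS[field]
    corpus = []
    for page in PRESET_PAGES[field]:
        for sentence in page.split(". "):
            s = sentence.rstrip(".")
            if any(w in s.lower() for w in words):
                corpus.append(s)
    return corpus
```

A real implementation would fetch the prestored page URLs and apply more careful tokenization, but the filtering idea is the same.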
Step S103: obtaining a specific corpus.
In this step, the keyword is set as the current crawl word; the result pages returned by a set search engine on the Web side are crawled according to the current crawl word to obtain a second crawl result, and the specific corpus is obtained according to the second crawl result.
Step S104: obtaining the general language model and the specific language model.
In this step, a general language model in arpa format is trained based on the general corpus, and a specific language model in arpa format is trained based on the specific corpus; the file information of the general language model and the file information of the specific language model each contain a version number serving as an identifier.
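As a toy illustration of what "training a language model in arpa format" produces, the sketch below estimates unigram probabilities and writes them in the ARPA text layout. It is deliberately minimal and makes assumptions beyond the patent: real ARPA files also carry higher-order n-grams, backoff weights, and sentence markers, and the leading comment line is only one possible way to carry the version number the method requires (many tools ignore text before the `\data\` header).

```python
import math
from collections import Counter

def train_unigram_arpa(corpus: list[str], version: str) -> str:
    """Minimal sketch of step S104: estimate unigram log10 probabilities
    from the corpus and emit them in the ARPA text format, carrying the
    required version number in a leading comment line."""
    counts = Counter(tok for line in corpus for tok in line.split())
    total = sum(counts.values())
    lines = [f"# model-version: {version}", "\\data\\",
             f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word, c in sorted(counts.items()):
        lines.append(f"{math.log10(c / total):.4f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines)
```

For example, `train_unigram_arpa(["a b", "a"], "v1.0")` yields a model where "a" has probability 2/3 (log10 about -0.1761).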
Step S105: synthesizing the WFST speech recognition network.
The general language model and the specific language model are merged, and the WFST speech recognition network is synthesized after combining them with the acoustic model and pronunciation dictionary data.
By connecting the WFST speech recognition network of the general language model in parallel with the WFST speech recognition network of the specific language model, both general-language recognition and specific-language recognition can be accommodated, and the two recognition paths are aggregated within the same recognition network, which improves the accuracy of recognition in a given specific domain.
In a preferred embodiment, as shown in Fig. 2, step S105 is followed by:
Step S106: testing the WFST speech recognition network.
In this step, the WFST speech recognition network is tested separately according to the set test sets of a plurality of configured interfaces; the test recognition data of each configured interface is obtained and displayed, and the test recognition data includes identification information of the corresponding configured interface.
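Step S106 can be sketched as a small test harness. Assumptions are labeled: `recognize` stands in for the real WFST decoder, each configured interface is modeled as a dict carrying its own test set, and accuracy is used as the displayed metric only for illustration.

```python
# Sketch of step S106: run each configured interface's test set through a
# recognizer and collect per-interface test recognition data, tagged with
# the interface's identification information.
def run_interface_tests(interfaces, recognize):
    results = []
    for iface in interfaces:
        correct = sum(1 for audio, expected in iface["test_set"]
                      if recognize(audio) == expected)
        results.append({
            "interface_id": iface["id"],  # identification information
            "accuracy": correct / len(iface["test_set"]),
        })
    return results
```

The returned list is what a Web front end would render so that each configuration's recognition quality is visible side by side.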
In a preferred embodiment, as shown in Fig. 3, step S102 further includes:
Step S1021: scoring the entries in the general corpus.
In this step, the entries in the general corpus are scored by a scoring language model to obtain a score for each entry; if an entry's score is greater than a set threshold, the entry is retained in the general corpus, otherwise it is deleted from the general corpus. Screening the entries in the general corpus in this way reduces the deviation rate of the entries, reduces the storage space they occupy, and increases the speed of subsequent computation on them.
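A minimal sketch of step S1021, with one loud assumption: the patent's scoring language model is replaced here by a toy scorer (fraction of an entry's tokens found in a reference vocabulary), since the actual scoring model is not specified.

```python
def make_score_fn(vocabulary: set):
    """Toy stand-in for the scoring language model: scores an entry by the
    fraction of its tokens that appear in a reference vocabulary."""
    def score(entry: str) -> float:
        toks = entry.split()
        return sum(t in vocabulary for t in toks) / len(toks)
    return score

def filter_general_corpus(entries, score_fn, threshold):
    """Retain an entry only if its score exceeds the set threshold;
    otherwise it is dropped from the general corpus."""
    return [e for e in entries if score_fn(e) > threshold]
```

Any real scoring model (e.g. perplexity under a reference LM) can be dropped in as `score_fn` without changing the filtering step.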
In a preferred embodiment, as shown in Fig. 4, step S103 further includes:
Step S1031: optimizing the entries of the specific corpus.
In this step, the ranking of each entry of the specific corpus in the set search engine is obtained, a set number of entries is intercepted from the front of the search-engine ranking, and the specific corpus is updated accordingly. Optimizing the entries of the specific corpus in this way selects the more frequently used words, improves the generality of the entries, reduces their storage space, and thereby increases the computation speed.
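Step S1031 reduces to keeping the best-ranked entries. A minimal sketch, assuming each entry arrives paired with its search-engine rank (1 = first result); how ranks are fetched from the search engine is outside this illustration.

```python
def truncate_by_rank(entries_with_rank, keep: int):
    """Sketch of step S1031: sort entries by search-engine rank and keep
    only the `keep` entries nearest the front of the ranking."""
    ranked = sorted(entries_with_rank, key=lambda pair: pair[1])
    return [entry for entry, _ in ranked[:keep]]
```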
In a preferred embodiment, the step of training the general language model in arpa format based on the general corpus in step S104 includes:
adding a set mandatory-parameter button on the human-computer interaction interface; if the selection information of the set mandatory-parameter button is received, the general language model in arpa format is trained based on the general corpus.
The step of testing the WFST speech recognition network separately according to the set test sets of the plurality of configured interfaces in step S106 includes:
adding a set mandatory-parameter button on the human-computer interaction interface; if the selection information of the set mandatory-parameter button is received, the WFST speech recognition network is tested according to the set test set.
Setting a "mandatory parameter" in this way reduces developers' error rate during development and thereby improves development quality.
In a preferred embodiment, the step of connecting the WFST speech recognition network of the general language model in parallel with the WFST speech recognition network of the specific language model in step S105 is: converting the general language model into WFST form, converting the specific language model into WFST form, adding a start node before the first nodes of the general language model converted into WFST form and of the specific language model converted into WFST form, and thereby merging the general language model and the specific language model.
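The parallel connection above is essentially an FST union: a fresh start node with epsilon arcs into both models' first nodes. The sketch below reduces a WFST to `(start_state, arcs)` with arcs as `(src, label, dst)` triples; this is an assumption for illustration only, as real toolkits (e.g. OpenFst) also carry weights, output labels, and final states, and the union would be followed by composition with the lexicon and acoustic model, then determinization and minimization.

```python
# Minimal sketch of the parallel connection in step S105.
def parallel_union(fst_a, fst_b):
    """Add a fresh start node 0 with epsilon arcs into both models' first
    nodes, renumbering states so the two graphs don't collide."""
    start_a, arcs_a = fst_a
    start_b, arcs_b = fst_b
    offset = 1 + max(max(s, d) for s, _, d in arcs_a)  # room for fst_a + start
    merged = [(s + 1, lab, d + 1) for s, lab, d in arcs_a]
    merged += [(s + offset + 1, lab, d + offset + 1) for s, lab, d in arcs_b]
    merged.append((0, "<eps>", start_a + 1))           # into general model
    merged.append((0, "<eps>", start_b + offset + 1))  # into specific model
    return 0, merged
```

After the union, a decoder entering state 0 can follow either epsilon arc, so general and specific recognition paths coexist in one network.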
In a preferred embodiment, step S102 further includes generating a run key for step S102 on the human-computer interaction interface; if the run of step S101 has finished, the run key of step S102 is enabled. Step S103 further includes generating a run key for step S103 on the human-computer interaction interface; if the run of step S102 has finished, the run key of step S103 is enabled.
Step S104 further includes generating a run key for step S104 on the human-computer interaction interface; if the run of step S103 has finished, the run key of step S104 is enabled.
Step S105 further includes generating a run key for step S105 on the human-computer interaction interface; if the run of step S104 has finished, the run key of step S105 is enabled.
On the one hand this improves the visualization of the operating process and constrains developers to execute and develop in order, improving the consistency and standardization of speech recognition network generation; at the same time it improves development efficiency by reducing erroneous operations.
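The run-key gating described above amounts to a simple ordered pipeline: each step's button is enabled only once its predecessor has finished. A sketch under assumptions (a real Web UI would drive this from button callbacks; the class and method names are hypothetical):

```python
# Sketch of the sequential run-key gating for steps S101..S105.
class StepRunner:
    ORDER = ["S101", "S102", "S103", "S104", "S105"]

    def __init__(self):
        self.finished = set()

    def is_enabled(self, step: str) -> bool:
        """A step's run key is enabled only when its predecessor finished."""
        i = self.ORDER.index(step)
        return i == 0 or self.ORDER[i - 1] in self.finished

    def run(self, step: str) -> bool:
        if not self.is_enabled(step):
            return False          # key is disabled: nothing happens
        self.finished.add(step)   # a real system would execute the step here
        return True
```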
In another aspect of the present invention, as shown in Fig. 5, a visualized production system for a speech recognition network is also provided. The system includes a user interaction unit 101, a general corpus acquisition unit 201, a specific corpus acquisition unit 301, a language model acquisition unit 401, and a WFST speech recognition network acquisition unit 501.
The user interaction unit 101 is configured to receive a keyword through a human-computer interaction interface and to select a current domain field from a plurality of preset general domain fields; each general domain field corresponds to a plurality of preset crawl words and a plurality of corresponding preset Web crawl pages.
The general corpus acquisition unit 201 is configured to obtain the corresponding preset crawl words according to the current domain field, crawl, according to the preset crawl words, the plurality of preset Web crawl pages corresponding to the current domain field to obtain a first crawl result, and obtain a general corpus according to the first crawl result.
The specific corpus acquisition unit 301 is configured to set the keyword as the current crawl word, crawl, according to the current crawl word, the result pages returned by a set search engine on the Web side to obtain a second crawl result, and obtain a specific corpus according to the second crawl result.
The language model acquisition unit 401 is configured to train a general language model in arpa format based on the general corpus and to train a specific language model in arpa format based on the specific corpus; the file information of the general language model and the file information of the specific language model each contain a version number serving as an identifier.
The WFST speech recognition network acquisition unit 501 is configured to merge the general language model and the specific language model and to synthesize a WFST speech recognition network after combining them with the acoustic model and pronunciation dictionary data.
In one embodiment of the visualized generation system for a speech recognition network of the invention, as shown in Fig. 6, the system further includes a test unit 601. The test unit 601 is configured to test the WFST speech recognition network according to the set test sets of multiple configured interfaces respectively, obtain test recognition data for the multiple configured interfaces, and display those test recognition data; the test recognition data include identification information of the corresponding configured interface.
In another aspect of the invention, a visualized generation platform for a speech recognition network is also provided, on which the visualized generation system of the present invention is loaded. The system allows multiple development groups to operate simultaneously; each development group includes multiple developers, and each developer can use a separate unit. A separate unit is a single unit of the visualized generation system for a speech recognition network, for example one of the user interaction unit 101, the general corpus acquiring unit 201, the specific corpus acquiring unit 301, the language model acquiring unit 401 and the WFST speech recognition network acquiring unit 501.
The visualized generation platform is configured to store the general language models and specific language models generated or used by the multiple development groups, and to establish multiple version-number correspondences according to the version numbers of the general language models and of the specific language models generated or used by the multiple development groups.
A current development group can select a current model from the general language models and specific language models stored on the platform. If the current development group deletes, replaces or edits the current model, the platform notifies the corresponding development groups according to the version-number correspondences, and the current development group then operates on the current model according to the return information of those groups. This avoids the resource conflicts that resource sharing would otherwise cause when multiple developers work on the same platform, and improves the reliability and consistency of the development platform.
It is worth noting that the units in the embodiments disclosed by the invention are not limited to the disclosed scheme; the related function modules may also be implemented by a hardware processor — for example, a separation module may likewise be realized with a processor — which is not repeated here.
In another embodiment of the invention, a further visualized generation method of a speech recognition network is provided. The method comprises:
1) Because the software system is a complete platform, flow control can be enforced by the server within a single program, avoiding steps being forgotten through manual operation.
The flow control includes three aspects:
1. Mandatory parameter buttons are set during operations such as model training and testing; if a required option is not selected, the operation cannot continue.
2. From training to testing is a complete workflow with a fixed order; when a preceding step has not been performed, the subsequent steps are shown in grey.
3. The system verifies parameters, for example checking whether the pronunciation dictionary matches the word dictionary; if not, an error message is returned.
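The dictionary-consistency check just described can be sketched as follows — a minimal Python illustration, with hypothetical sample data, of verifying that every word in the word dictionary has an entry in the pronunciation dictionary:

```python
def check_lexicon_coverage(word_dict, lexicon):
    """Return the words that lack a pronunciation entry.

    An empty list means the pronunciation dictionary matches
    the word dictionary."""
    return [w for w in word_dict if w not in lexicon]


# hypothetical sample data
word_dict = ["hello", "world", "speech"]
lexicon = {"hello": "HH AH0 L OW1", "world": "W ER1 L D"}

missing = check_lexicon_coverage(word_dict, lexicon)
if missing:
    # the system would return this as the error message
    print("error: words missing pronunciations:", missing)
```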
2) For versioning, the software system provides dedicated visual version control and checks inter-version dependencies programmatically, avoiding one version being changed as a side effect of deleting or modifying another. Dependency checking here means: when an operation such as deletion or modification is performed, the system checks by table lookup whether any other model uses the version and returns a confirmation prompt; the deletion or modification is carried out only after the user clicks to confirm.
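The table-lookup dependency check can be sketched as below — a minimal sketch in which a deletion request returns a confirmation prompt whenever another model still uses the version (class and method names are hypothetical, not from the patent):

```python
class VersionRegistry:
    """Tracks which model versions depend on which, as a lookup table."""

    def __init__(self):
        self._dependents = {}  # version -> set of versions that use it

    def add_dependency(self, user_version, used_version):
        self._dependents.setdefault(used_version, set()).add(user_version)

    def request_delete(self, version):
        """Return a confirmation prompt if the version is still in use, else None."""
        users = self._dependents.get(version, set())
        if users:
            return f"version {version} is used by {sorted(users)}; confirm to delete"
        return None  # nothing depends on it: safe to delete without confirmation


registry = VersionRegistry()
registry.add_dependency("general-v2", "general-v1")
print(registry.request_delete("general-v1"))  # prompts for confirmation
print(registry.request_delete("general-v3"))  # None: no dependents
```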
3) Because the software system provides a web operation interface, it is simpler to use than the command line and simplifies flow operations.
4) Carrying out flow control, version control and simplified flows programmatically in this way reduces the risk of language model training.
With reference to Fig. 7: first, producing a language model requires data support, so the first step is corpus management. Corpus collection here includes crawling network corpora and generating artificial corpora. Corpus management includes operations such as normalizing, deleting and moving corpora.
The crawled corpora are of two kinds. First, fixed-website crawlers are preset in the system; when data for a certain domain are needed, the user chooses that domain on the web page and crawling starts. Second, keyword crawling is provided: the user fills in a keyword at the Web side, the system crawler searches the major search engines, and text is extracted from the returned entries. The extracted text is then screened in two ways: the general language model scores each entry, and the entry is retained when its score exceeds a threshold, otherwise deleted; or the top-N entries are kept according to their ranking in the search engine.
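The two screening methods — the score threshold and the top-N ranking cut — can be sketched as follows. This is a minimal illustration; `score_fn` is a stand-in for the general language model's scoring function, which is an assumption, not the actual scorer:

```python
def screen_by_score(entries, score_fn, threshold):
    """Keep entries whose language-model score exceeds the threshold."""
    return [e for e in entries if score_fn(e) > threshold]


def screen_by_rank(entries, n):
    """Keep the top-N entries; `entries` are assumed already in
    search-engine rank order."""
    return entries[:n]


# toy stand-in scorer: longer entries score higher
entries = ["ok", "turn on the light", "play some music"]
kept = screen_by_score(entries, score_fn=len, threshold=5)
top = screen_by_rank(entries, n=2)
```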
Second, once the corpora exist, arpa language model training is carried out, comprising three parts: general language model generation, custom language model generation and language model management. Language model management provides delete and move buttons; deletion removes the referenced language model from the file system, and moving mainly changes its storage location within the file system.
As for training, a general language model is obtained mainly by the user choosing corpora from the various large domains preset in the system and training on them, whereas a custom language model is trained on corpora crawled from user-provided keywords and on corpora supplied directly by the user. The two differ mainly in corpus selection; they are operated on different pages, and the difference is reflected in the model ID.
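As a minimal sketch of what an arpa-format model looks like, the function below trains an unsmoothed unigram model from a toy corpus and renders it in ARPA layout. Real training would use an open-source toolkit with higher-order n-grams and smoothing; this is only illustrative:

```python
import math
from collections import Counter


def train_unigram_arpa(corpus_lines):
    """Count unigrams and render them as a tiny ARPA-format language model."""
    counts = Counter(w for line in corpus_lines for w in line.split())
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word in sorted(counts):
        log10_prob = math.log10(counts[word] / total)  # ARPA stores log10 probs
        lines.append(f"{log10_prob:.6f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines)


print(train_unigram_arpa(["hello world", "hello speech"]))
```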
Third, the resource management module mainly combines the generated language models with the acoustic model and pronunciation dictionary to generate the WFST speech recognition network. The network is obtained by merging the language model with the acoustic model and pronunciation dictionary and then applying composition, determinization and minimization. The union of the general and custom language model WFST networks is formed mainly by adding a start node in front of the two networks so that they are connected in parallel; during decoding, both the general and the custom WFST speech recognition networks can then be searched.
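The parallel union just described — a fresh start node with epsilon arcs into both networks — can be sketched on a toy WFST representation. A production system would use an FST toolkit such as OpenFst; the tuple encoding below is purely illustrative:

```python
EPS = "<eps>"


def parallel_union(fst_a, fst_b):
    """Join two WFSTs under one new start node.

    Each WFST is (start_state, arcs), with arcs as tuples
    (src, dst, input_label, output_label, weight). State ids are
    kept disjoint by tagging them with "A:" / "B:" prefixes."""
    start_a, arcs_a = fst_a
    start_b, arcs_b = fst_b
    arcs = [("S", f"A:{start_a}", EPS, EPS, 0.0),  # epsilon arc into network A
            ("S", f"B:{start_b}", EPS, EPS, 0.0)]  # epsilon arc into network B
    arcs += [(f"A:{s}", f"A:{d}", i, o, w) for s, d, i, o, w in arcs_a]
    arcs += [(f"B:{s}", f"B:{d}", i, o, w) for s, d, i, o, w in arcs_b]
    return ("S", arcs)  # the decoder can now search both networks in parallel


general = (0, [(0, 1, "play", "play", 0.5)])
custom = (0, [(0, 1, "navigate", "navigate", 0.3)])
start, arcs = parallel_union(general, custom)
```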
To realize project-specific configuration, operations on the union of the two WFST speech recognition networks are provided. Resource management supplies the input of the decoder module, and thus connects the language model module and the decoder module. The module also provides WFST resource management functions.
Fourth, decoding test and management mainly provide various configuration interfaces for statistically measuring the performance of a test set on the new resources.
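A minimal sketch of the kind of statistic such a test interface could report — word error rate of the new resources over a set test set (the specific metric is an assumption; the patent only speaks of test recognition data):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev = cur
    return d[len(hyp)]


def word_error_rate(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


# one substitution over four reference words -> 0.25
print(word_error_rate(["turn on the light"], ["turn off the light"]))
```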
The Web side is built with front-end technologies such as html and css to realize the visualization.
The server side uses flask to build the interfaces that call data processing, model training and testing, and communicates with the front end; data are transmitted in json form.
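The patent builds these interfaces with flask; as a dependency-free sketch of the same json request/response contract (the endpoint fields and names below are hypothetical), a training handler might look like:

```python
import json


def handle_train_request(raw_body):
    """Parse a json training request from the front end and return a json reply."""
    request = json.loads(raw_body)
    corpus_id = request.get("corpus_id")
    if corpus_id is None:
        return json.dumps({"status": "error", "message": "corpus_id is required"})
    # ... would kick off arpa training on the named corpus here ...
    return json.dumps({"status": "ok", "model_id": f"lm-{corpus_id}"})


reply = handle_train_request(json.dumps({"corpus_id": 42}))
print(reply)
```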
The underlying operations such as data processing and model training combine open-source toolkits, with the source code written in the python language.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium. The computer storage medium stores computer-executable instructions that can execute the speech signal processing and application methods in any of the above method embodiments.
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions set as follows:
Step S101: receiving a keyword through a human-computer interaction interface, and choosing a current domain field from multiple preset general domain fields, each general domain field corresponding to multiple preset crawl words and multiple corresponding preset Web crawl pages;
Step S102: crawling the multiple preset Web crawl pages corresponding to the current domain field according to the preset crawl words corresponding to the current domain, and obtaining a general corpus from the crawl result;
Step S103: setting the keyword as the current crawl word, crawling the return pages of a set search engine at the Web side, and obtaining a specific corpus from the crawl result;
Step S104: training an arpa language model on the general corpus to obtain a general language model, and training an arpa language model on the specific corpus to obtain a specific language model, the file information of the general language model and the file information of the specific language model each including a version number serving as an identifier;
Step S105: connecting the WFST speech recognition network of the general language model and the WFST speech recognition network of the specific language model in parallel, combining them with the acoustic model and pronunciation dictionary, and synthesizing the WFST speech recognition network through composition, determinization and minimization operations.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the audio signal processing method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the audio signal processing method in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created according to the use of the speech signal processing unit, etc. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk memory, flash memory device or other non-volatile solid-state memory device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely relative to the processor; these remote memories can be connected to the speech signal processing unit through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
An embodiment of the present invention also provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute any of the above audio signal processing methods.
Fig. 8 is a structural schematic diagram of the electronic device provided by an embodiment of the present invention. As shown in Fig. 8, the device includes one or more processors 710 and a memory 720, with one processor 710 taken as an example in Fig. 8. The device for the audio signal processing method may also include an input unit 730 and an output unit 740. The processor 710, memory 720, input unit 730 and output unit 740 may be connected by a bus or in other ways, connection by a bus being taken as an example in Fig. 8. The memory 720 is the above-mentioned non-volatile computer-readable storage medium. By running the non-volatile software programs, instructions and modules stored in the memory 720, the processor 710 executes the various function applications and data processing of the server, that is, realizes the audio signal processing method of the above method embodiments. The input unit 730 can receive input numeric or character information and generate key signal inputs related to the user settings and function control of the signal processing unit. The output unit 740 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
As an implementation, the above electronic device can be applied in the visualized generation platform of a speech recognition network, and comprises: at least one processor; and a memory communicatively connected with the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor is able to:
receive a keyword through a human-computer interaction interface, and choose a current domain field from multiple preset general domain fields, each general domain field corresponding to multiple preset crawl words and multiple corresponding preset Web crawl pages;
crawl the multiple preset Web crawl pages corresponding to the current domain field according to the preset crawl words corresponding to the current domain, and obtain a general corpus from the crawl result;
set the keyword as the current crawl word, crawl the return pages of a set search engine at the Web side, and obtain a specific corpus from the crawl result;
train an arpa language model on the general corpus to obtain a general language model, and train an arpa language model on the specific corpus to obtain a specific language model, the file information of the general language model and the file information of the specific language model each including a version number serving as an identifier;
connect the WFST speech recognition network of the general language model and the WFST speech recognition network of the specific language model in parallel, combine them with the acoustic model and pronunciation dictionary, and synthesize the WFST speech recognition network through composition, determinization and minimization operations.
The electronic device of the embodiments of the present invention exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication functions, with voice and data communication as the main goal. This type of terminal includes smart phones (such as the iPhone), multimedia phones, functional phones, low-end phones, etc.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. This type of terminal includes PDA, MID and UMPC devices, such as the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content, and include audio and video players (such as the iPod), handheld devices, e-book readers, intelligent toys and portable car-mounted navigation devices.
(4) Servers: devices providing computing services. A server is composed of a processor, hard disk, memory, system bus, etc.; its architecture is similar to that of a general-purpose computer, but since highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, manageability, etc. are higher.
(5) Other electronic devices with data interaction functions.
The unit embodiments described above are only illustrative; the units illustrated as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the embodiments. Those of ordinary skill in the art can understand and implement them without creative labour.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solution, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or a CD, and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) execute the methods of all or part of the embodiments.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present invention, rather than limiting it. Although the invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions recorded in the foregoing embodiments, or some of the technical features can be replaced by equivalents; these modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A visualized generation method of a speech recognition network, the method being operable at the Web side, the method comprising:
Step S101: receiving a keyword through a human-computer interaction interface, and choosing a current domain field from multiple preset general domain fields, each general domain field corresponding to multiple preset crawl words and multiple corresponding preset Web crawl pages;
Step S102: obtaining the corresponding preset crawl words according to the current domain field, crawling the multiple preset Web crawl pages corresponding to the current domain field according to the preset crawl words to obtain a first crawl result, and obtaining a general corpus according to the first crawl result;
Step S103: setting the keyword as a current crawler crawl word, crawling the return pages of a set search engine at the Web side according to the current crawler crawl word to obtain a second crawl result, and obtaining a specific corpus according to the second crawl result;
Step S104: training a general language model in arpa format based on the general corpus, and training a specific language model in arpa format based on the specific corpus, the file information of the general language model and the file information of the specific language model each including a version number serving as an identifier;
Step S105: merging the general language model and the specific language model, and synthesizing a WFST speech recognition network after combining them with acoustic model and pronunciation dictionary data.
2. The method according to claim 1, further comprising, after step S105:
Step S106: testing the WFST speech recognition network according to the set test sets of multiple configured interfaces respectively, obtaining test recognition data of the multiple configured interfaces, and displaying the test recognition data of the multiple configured interfaces, the test recognition data including identification information of the corresponding configured interface.
3. The method according to claim 1, wherein step S102 further comprises:
Step S1021: scoring each entry in the general corpus with a scoring language model to obtain the entry's score; if the score of the entry is greater than a set threshold, retaining the entry, otherwise deleting the entry from the general corpus.
4. The method according to claim 1, wherein step S103 further comprises:
Step S1031: obtaining the ranking of each entry of the specific corpus in the set search engine, and keeping a set number of entries counted from the first in the search engine ranking to update the specific corpus.
5. The method according to claim 2, wherein the step of training the general language model in arpa format based on the general corpus in step S104 comprises:
adding a set mandatory parameter button on the human-computer interaction interface, and, if selection information of the set mandatory parameter button is received, training the general language model in arpa format based on the general corpus;
and the step of testing the WFST speech recognition network according to the set test sets of the multiple configured interfaces in step S106 comprises:
adding a set mandatory parameter button on the human-computer interaction interface, and, if selection information of the set mandatory parameter button is received, testing the WFST speech recognition network according to the set test set.
6. The method according to claim 1, wherein the step of merging the general language model and the specific language model in step S105 is:
converting the general language model into WFST form, converting the specific language model into WFST form, and adding a start node before the first nodes of the general language model converted into WFST form and of the specific language model converted into WFST form, thereby merging the general language model and the specific language model.
7. The method according to claim 1, wherein:
step S102 further comprises generating an operation key for step S102 on the human-computer interaction interface; if the run of step S101 has finished, the operation key of step S102 is enabled;
step S103 further comprises generating an operation key for step S103 on the human-computer interaction interface; if the run of step S102 has finished, the operation key of step S103 is enabled;
step S104 further comprises generating an operation key for step S104 on the human-computer interaction interface; if the run of step S103 has finished, the operation key of step S104 is enabled;
step S105 further comprises generating an operation key for step S105 on the human-computer interaction interface; if the run of step S104 has finished, the operation key of step S105 is enabled.
8. A visualized generation system of a speech recognition network, comprising a user interaction unit, a general corpus acquiring unit, a specific corpus acquiring unit, a language model acquiring unit and a WFST speech recognition network acquiring unit;
the user interaction unit being configured to receive a keyword through a human-computer interaction interface and to choose a current domain field from multiple preset general domain fields, each general domain field corresponding to multiple preset crawl words and multiple corresponding preset Web crawl pages;
the general corpus acquiring unit being configured to obtain the corresponding preset crawl words according to the current domain field, crawl the multiple preset Web crawl pages corresponding to the current domain field according to the preset crawl words to obtain a first crawl result, and obtain a general corpus according to the first crawl result;
the specific corpus acquiring unit being configured to set the keyword as a current crawler crawl word, crawl the return pages of a set search engine at the Web side according to the current crawler crawl word to obtain a second crawl result, and obtain a specific corpus according to the second crawl result;
the language model acquiring unit being configured to train a general language model in arpa format based on the general corpus and a specific language model in arpa format based on the specific corpus, the file information of the general language model and the file information of the specific language model each including a version number serving as an identifier;
the WFST speech recognition network acquiring unit merging the general language model and the specific language model, and synthesizing a WFST speech recognition network after combining them with acoustic model and pronunciation dictionary data.
9. The system according to claim 8, further comprising a test unit;
the test unit being configured to test the WFST speech recognition network according to the set test sets of multiple configured interfaces respectively, obtain the test recognition data of the multiple configured interfaces, and display the test recognition data of the multiple configured interfaces, the test recognition data including identification information of the corresponding configured interface.
10. A visualized generation platform of a speech recognition network, the platform being loaded with the system of claim 8 or 9, the system allowing multiple development groups to operate simultaneously, each of the multiple development groups including multiple developers, each developer being able to use a separate unit, the separate unit being one of the units of the visualized generation system of claim 8 or 9;
the visualized generation platform being configured to store the general language models and specific language models generated or used by the multiple development groups, and to establish multiple version-number correspondences according to the version numbers of the general language models and specific language models generated or used by the multiple development groups;
a current development group being able to select a current model from the general language models and specific language models stored on the visualized generation platform; if the current development group deletes, replaces or edits the current model, the visualized generation platform notifying the corresponding development groups according to the multiple version-number correspondences, and the current development group operating on the current model according to the return information of the corresponding development groups.
CN201910719492.2A 2019-08-05 2019-08-05 Visual generation method, system and platform of voice recognition network Active CN110427459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910719492.2A CN110427459B (en) 2019-08-05 2019-08-05 Visual generation method, system and platform of voice recognition network


Publications (2)

Publication Number Publication Date
CN110427459A true CN110427459A (en) 2019-11-08
CN110427459B CN110427459B (en) 2021-09-17

Family

ID=68414250


Country Status (1)

Country Link
CN (1) CN110427459B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145727A (en) * 2019-12-02 2020-05-12 云知声智能科技股份有限公司 Method and device for recognizing digital string by voice
CN111933146A (en) * 2020-10-13 2020-11-13 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium
CN113111642A (en) * 2020-01-13 2021-07-13 京东方科技集团股份有限公司 Natural language identification model generation method, natural language processing method and equipment
CN113223522A (en) * 2021-04-26 2021-08-06 北京百度网讯科技有限公司 Speech recognition method, apparatus, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
EP2309487A1 (en) * 2009-09-11 2011-04-13 Honda Research Institute Europe GmbH Automatic speech recognition system integrating multiple sequence alignment for model bootstrapping
CN102760436A (en) * 2012-08-09 2012-10-31 河南省烟草公司开封市公司 Voice lexicon screening method
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
CN109976702A (en) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 A kind of audio recognition method, device and terminal


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PALASKAR,SHRUTI等: "END-TO-END MULTIMODAL SPEECH RECOGNITION", 《2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 *
张志楠: "语音Corpus的自动构建和语音最小化标注的研究", 《中国优秀硕士学位论文全文数据库(电子期刊)》 *


Also Published As

Publication number Publication date
CN110427459B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN110427459A (en) Visualized generation method, system and platform for speech recognition network
US11030412B2 (en) System and method for chatbot conversation construction and management
CN111177569B (en) Recommendation processing method, device and equipment based on artificial intelligence
CN106570106A (en) Method and device for converting voice information into expressions during input
CN107077841A (en) Superstructure recurrent neural networks for text-to-speech
US20050080628A1 (en) System, method, and programming language for developing and running dialogs between a user and a virtual agent
CN110222827A (en) Training method for a text-based depression judgment network model
CN109710137A (en) Skill priority configuration method and system for voice dialogue platform
CA2365743A1 (en) Apparatus for design and simulation of dialogue
CN109948151A (en) Method for constructing a voice assistant
CN108959436A (en) Dictionary editing method and system for voice dialogue platform
CN109697979A (en) Voice assistant skill adding method, device, storage medium and server
CN109313668B (en) System and method for constructing session understanding system
CN110136689A (en) Song synthesis method, device and storage medium based on transfer learning
CN109119067A (en) Speech synthesis method and device
CN111081280A (en) Text-independent speech emotion recognition method and device, and emotion recognition algorithm model generation method
CN112000330B (en) Configuration method, device, equipment and computer storage medium for modeling parameters
CN110349569A (en) Training and recognition method and device for customized product language models
CN111145745A (en) Dialogue flow customization method and device
CN109032731A (en) Voice interaction method and system based on semantic understanding, oriented to operating systems
CN110032355A (en) Speech playing method, device, terminal device and computer storage medium
CN108170676A (en) Method, system and terminal for story creation
CN108831444A (en) Semantic resource training method and system for voice dialogue platform
CN106844499A (en) Multi-turn dialogue interaction method and device
CN109657125A (en) Data processing method, device, equipment and storage medium based on web crawler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant