CN110427459B - Visual generation method, system and platform of voice recognition network


Info

Publication number
CN110427459B
Authority
CN
China
Prior art keywords
language model
crawling
general
wfst
corpus
Prior art date
Legal status
Active
Application number
CN201910719492.2A
Other languages
Chinese (zh)
Other versions
CN110427459A (en)
Inventor
王雪志 (Wang Xuezhi)
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date: 2019-08-05
Filing date: 2019-08-05
Application filed by Sipic Technology Co Ltd
Priority to CN201910719492.2A
Publication of CN110427459A
Application granted
Publication of CN110427459B

Classifications

    • G06F16/3344 Query execution using natural language analysis (information retrieval of unstructured textual data)
    • G06F16/9535 Search customisation based on user profiles and personalisation (retrieval from the web)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting (pattern recognition)
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/28 Constructional details of speech recognition systems
    • G10L2015/0638 Interactive procedures (speech recognition training)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a visual generation method for a speech recognition network, comprising the following steps: receiving keywords through a human-computer interaction interface and selecting a current domain field from a plurality of preset general domain fields, where each general domain field corresponds to a plurality of preset crawl keywords and preset Web crawl pages; acquiring a general corpus; acquiring a specific corpus; training on the two corpora to obtain a general language model and a specific language model; and, after connecting the WFST recognition network of the general language model in parallel with that of the specific language model, synthesizing the WFST speech recognition network with the acoustic model and the pronunciation dictionary through composition, determinization and minimization operations. By configuring the whole workflow on one platform, the method speeds up language model training, shortens the product cycle, reduces labor consumption and saves labor cost. Meanwhile, combining the general language model network with the specific language model improves the accuracy and efficiency of speech recognition.

Description

Visual generation method, system and platform of voice recognition network
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a visual generation method, a visual generation system and a visual generation platform of a voice recognition network.
Background
At present there are few visual language-model production systems on the market, and most language-model customization is done at the command-line level. Language model production is of great importance in speech recognition, and each speech company has its own team responsible for the models, but most of that work is done at the command line. In the prior art, the command-line customization process is uncontrollable, versions are hard to manage, risk is uncontrollable, and the workflow is not streamlined enough. These drawbacks arise because models are trained by manually entering commands into assorted scripts at the command line. Manual command-line work lacks continuous, effective supervision and review, so both the process and the risk become uncontrollable. The inefficiency of command-line operation cannot support multi-task language model training, and the procedure is complex. Meanwhile, visualization during model production in the prior art is poor, which makes model production inconvenient.
In view of the above problems, the approaches currently used in the market are: establishing a standard language-model training workflow, standardizing script management, managing data in a unified way, developing more effective scripts, automating each step, arranging cross-review by multiple people, and the like. These measures do not integrate the pieces, nor do they solve the problem as a whole with a complete system.
Therefore, in the prior art, the generation process of the speech recognition network used in speech recognition has an uncontrollable customization flow and inconvenient version management, and cannot meet the needs of multi-task language model training. Meanwhile, visibility during model production is poor and simultaneous editing by multiple users is inconvenient, which reduces the generation efficiency and accuracy of the speech recognition model.
Disclosure of Invention
The embodiments of the invention provide a visual generation method, system and platform for a speech recognition network, which are used to solve at least one of the above technical problems.
In a first aspect, a visual generation method for a speech recognition network is provided, where the method can be executed on a Web side, and the method includes:
and step S101, receiving keywords through a human-computer interaction interface. And selecting a current field from a plurality of preset general field fields, wherein each general field corresponds to a plurality of preset crawling words and a plurality of preset Web crawling pages.
Step S102: acquiring the preset crawl keywords corresponding to the current domain field, crawling the plurality of preset Web crawl pages corresponding to the current domain field according to those keywords to acquire a first crawl result, and acquiring a general corpus from the first crawl result.
Step S103: setting the keyword as the current crawl keyword, crawling a second crawl result from the result pages returned by a designated search engine on the Web side according to the current crawl keyword, and acquiring a specific corpus from the second crawl result.
Step S104: training on the general corpus to generate a general language model in ARPA format, and training on the specific corpus to generate a specific language model in ARPA format. The file information of the general language model and of the specific language model includes a version number serving as an identifier.
Step S105: merging the general language model and the specific language model, combining them with the acoustic model and pronunciation dictionary data, and synthesizing the WFST speech recognition network.
In a preferred embodiment of the present invention, the method further includes, after step S105, a step S106 of testing the WFST speech recognition network against a configured test set for each of a plurality of configured interfaces, acquiring the test recognition data of the plurality of configured interfaces, and displaying that data, where the test recognition data includes the identification information of the corresponding configured interface.
In a preferred embodiment of the present invention, step S102 further includes: step S1021, scoring the entries in the general corpus with a scoring language model to acquire a score for each entry; if an entry's score exceeds a set threshold the entry is kept, otherwise it is deleted from the general corpus.
In a preferred embodiment of the present invention, step S103 further includes step S1031, obtaining the ranking of each entry in the specific corpus and keeping a set number of entries, counted from the first position of the designated search engine's ranking, to update the specific corpus.
In a preferred embodiment of the present invention, the step of training on the general corpus to generate the general language model in ARPA format in step S104 includes adding a required-parameter button on the human-computer interaction interface and, if selection information for that required-parameter button is received, training on the general corpus to generate the general language model in ARPA format.
The step of testing the WFST speech recognition network against the test sets of the plurality of configured interfaces in step S106 includes adding a required-parameter button on the human-computer interaction interface and, if selection information for that required-parameter button is received, testing the WFST speech recognition network against the configured test set.
In a preferred embodiment of the present invention, the step of merging the general language model and the specific language model in step S105 is: converting the general language model into WFST form, converting the specific language model into WFST form, and adding a new initial node in front of the initial node of the general language model in WFST form and the initial node of the specific language model in WFST form, thereby merging the general language model and the specific language model.
In a preferred embodiment of the present invention, step S102 further includes generating an operation button for step S102 on the human-computer interaction interface, the button being enabled once the operation of step S101 has finished. Step S103 further includes generating an operation button for step S103 on the human-computer interaction interface, enabled once the operation of step S102 has finished.
Step S104 further includes generating an operation button for step S104 on the human-computer interaction interface, enabled once the operation of step S103 has finished. Step S105 further includes generating an operation button for step S105 on the human-computer interaction interface, enabled once the operation of step S104 has finished.
In a second aspect, a visual generation system for a speech recognition network is provided, which includes a user interaction unit, a general corpus acquisition unit, a specific corpus acquisition unit, a language model acquisition unit and a WFST speech recognition network acquisition unit.
The user interaction unit is configured to receive keywords through the human-computer interaction interface and to select a current domain field from a plurality of preset general domain fields, where each general domain field corresponds to a plurality of preset crawl keywords and a plurality of preset Web crawl pages.
The general corpus acquisition unit is configured to acquire the preset crawl keywords corresponding to the current domain field, crawl the plurality of preset Web crawl pages corresponding to the current domain field according to those keywords to acquire a first crawl result, and acquire a general corpus from the first crawl result.
The specific corpus acquisition unit is configured to set the keyword as the current crawl keyword, crawl a second crawl result from the result pages returned by the designated search engine on the Web side according to the current crawl keyword, and acquire a specific corpus from the second crawl result.
The language model acquisition unit is configured to train on the general corpus to generate a general language model in ARPA format and to train on the specific corpus to generate a specific language model in ARPA format. The file information of the general language model and of the specific language model includes a version number serving as an identifier.
The WFST speech recognition network acquisition unit merges the general language model and the specific language model, combines them with the acoustic model and pronunciation dictionary data, and synthesizes the WFST speech recognition network.
In a preferred embodiment, the visual generation system of the invention further comprises a testing unit.
The testing unit is configured to test the WFST speech recognition network against a configured test set for each of a plurality of configured interfaces, to acquire the test recognition data of the plurality of configured interfaces, and to display that data, where the test recognition data includes the identification information of the corresponding configured interface.
In a third aspect, the invention provides a visual generation platform for a speech recognition network, on which the above visual generation system is loaded. The system enables a plurality of development groups to operate simultaneously; each of the plurality of development groups comprises a plurality of developers, and each developer can use an independent unit, an independent unit being a single unit of the visual generation system of the speech recognition network.
The visual generation platform is configured to store the general language models and the specific language models generated or used in the plurality of development groups, and establishes correspondences among version numbers according to the version numbers of the general language models and of the specific language models generated or used in the plurality of development groups.
The current development group can select a current model from the general and specific language models stored on the visual generation platform. If the current development group deletes, replaces or edits the current model, the platform notifies the corresponding development groups according to the version-number correspondences, and the current development group then operates on the current model according to the information returned by those groups.
In a fourth aspect, an electronic device is provided, comprising: the system comprises at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any embodiment of the invention.
In a fifth aspect, the embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any of the embodiments of the present invention.
By configuring the whole workflow on one platform, the invention speeds up language model training, shortens the product cycle, and keeps the products of multiple users isolated from one another. It reduces labor consumption and saves labor cost. Meanwhile, combining the general language model network with the specific language model improves the accuracy and efficiency of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of a visualization generation method for a speech recognition network according to an embodiment of the present invention.
Fig. 2 is a flowchart of a visualization generation method for a speech recognition network according to another embodiment of the present invention.
Fig. 3 is a flowchart of the subdivision process in step S102 according to an embodiment of the present invention.
Fig. 4 is a flowchart of the subdivision process in step S103 according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a visual generation system for a speech recognition network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a visual generation system for a speech recognition network according to another embodiment of the present invention.
Fig. 7 is a flowchart of a visualization generation method for a speech recognition network according to another embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In one aspect, the invention provides a visual generation method for a speech recognition network, which can run on the Web side. As shown in FIG. 1, the visual generation method of the invention comprises the following steps:
and step S101, acquiring keywords and general field fields.
In the step, keywords are received through a human-computer interaction interface; and selecting a current field from a plurality of preset general field fields, wherein each general field corresponds to a plurality of preset crawling words and a plurality of preset Web crawling pages.
For example, the plurality of domain fields includes the three fields "electronics", "chemistry" and "machinery". These three fields are displayed on the interactive interface of the user side, and the user selects, for example, "electronics" as the current general domain field. The device displaying the interactive interface at the user side is an intelligent terminal or a touch-screen device. A plurality of preset crawler programs or crawl keywords corresponding to "electronics", "chemistry" and "machinery", together with the Web crawl page information corresponding to each, are pre-stored locally at the user side or at a remote side that can be connected to the user side. For example, the Web crawl pages corresponding to "electronics" are pages of websites used in the electronics industry in fields such as popular science and applications.
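As an illustration only, the preset mapping described above can be thought of as a table from each general domain field to its crawl keywords and crawl pages. The following Python sketch shows one possible in-memory layout; every name and URL in it is a hypothetical placeholder, not something specified by the patent.

```python
# Hypothetical layout of the preset domain configuration (illustrative only).
DOMAIN_PRESETS = {
    "electronics": {
        "crawl_keywords": ["integrated circuit", "semiconductor", "analog circuit"],
        "crawl_pages": ["https://ee-news.example.com", "https://ee-wiki.example.org"],
    },
    "chemistry": {
        "crawl_keywords": ["catalyst", "polymer", "titration"],
        "crawl_pages": ["https://chem-portal.example.com"],
    },
    "machinery": {
        "crawl_keywords": ["bearing", "gearbox", "CNC machining"],
        "crawl_pages": ["https://mech-handbook.example.org"],
    },
}

def presets_for(domain_field: str) -> tuple[list[str], list[str]]:
    """Return the preset crawl keywords and crawl pages for a selected domain."""
    preset = DOMAIN_PRESETS[domain_field]
    return preset["crawl_keywords"], preset["crawl_pages"]
```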
In addition, the user inputs keywords on the interactive interface. A keyword is a term, within the domain corresponding to the selected general domain field, that the user particularly needs recognized. For example, when the domain field selected by the user is "electronics", the input keyword may be a specific circuit term such as "discrete device circuit", "integrated device circuit" or "analog circuit". This helps improve the preparation of the corpus.
Step S102: acquiring a general corpus.
In this step, the preset crawl keywords corresponding to the current domain field are acquired, a first crawl result is obtained by crawling the plurality of preset Web crawl pages corresponding to the current domain field according to those keywords, and the general corpus is acquired from the first crawl result.
Step S103: acquiring a specific corpus.
In this step, the keyword is set as the current crawl keyword, a second crawl result is crawled from the result pages returned by a designated search engine on the Web side according to the current crawl keyword, and the specific corpus is acquired from the second crawl result.
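The patent does not fix a particular crawler implementation. Purely as a sketch, step S103 could look like the following, where the search-engine URL and the result-page CSS selector are assumptions; a real deployment would respect the engine's terms of service or use an official API.

```python
# Hedged sketch of step S103: use the user's keyword as the current crawl
# keyword, fetch the designated search engine's result page, and keep the
# extracted snippets as the specific corpus. URL pattern and selector are
# hypothetical.
import requests
from bs4 import BeautifulSoup

def crawl_specific_corpus(keyword: str, max_entries: int = 50) -> list[str]:
    resp = requests.get(
        "https://search.example.com/search",  # hypothetical engine
        params={"q": keyword},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assume each returned entry's text sits in an element of class "snippet".
    entries = [el.get_text(" ", strip=True) for el in soup.select(".snippet")]
    return entries[:max_entries]
```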
Step S104: acquiring a general language model and a specific language model.
In this step, training is carried out on the general corpus to generate a general language model in ARPA format, and on the specific corpus to generate a specific language model in ARPA format; the file information of the general language model and of the specific language model includes a version number serving as an identifier.
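The patent does not name its language-model toolkit. As one common open-source route, KenLM's lmplz builds an n-gram model in ARPA format from plain text; the sketch below assumes KenLM is installed and uses illustrative file names and an order-3 model.

```python
# Minimal sketch of step S104 under the assumption that KenLM's `lmplz` is
# used; the patent only requires training that yields an ARPA-format model.
import subprocess

def train_arpa(corpus_path: str, arpa_path: str, order: int = 3) -> None:
    """Train an n-gram language model in ARPA format from one-sentence-per-line text."""
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(["lmplz", "-o", str(order)], stdin=src, stdout=dst, check=True)

train_arpa("general_corpus.txt", "general.arpa")    # general language model
train_arpa("specific_corpus.txt", "specific.arpa")  # specific language model
```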
Step S105: synthesizing the WFST speech recognition network.
In this step, the general language model and the specific language model are merged, combined with the acoustic model and pronunciation dictionary data, and the WFST speech recognition network is synthesized.
In this way, by connecting the WFST network of the general language model in parallel with the WFST network of the specific language model, speech recognition can take both general and domain-specific recognition into account: the two recognition modes are aggregated in one recognition network, which improves recognition accuracy in the specific domain.
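For readers unfamiliar with WFST toolchains, the following sketch shows how the parallel connection and the composition, determinization and minimization operations could be expressed with OpenFst's Python wrapper (pywrapfst). The toolkit choice, the file names, and the simplified lexicon-grammar-only pipeline are assumptions; a production decoder would also fold in the context and acoustic-model transducers.

```python
# Hedged sketch of step S105 with pywrapfst (OpenFst); illustrative only.
import pywrapfst as fst

g_general = fst.Fst.read("G_general.fst")    # general LM compiled to a WFST
g_specific = fst.Fst.read("G_specific.fst")  # specific LM compiled to a WFST
lexicon = fst.Fst.read("L.fst")              # pronunciation dictionary as a WFST

# Union puts the two grammars in parallel behind a shared start state,
# matching the "add an initial node in front of both networks" description.
g_general.union(g_specific)

# Compose the lexicon with the merged grammar, then determinize and minimize.
lg = fst.determinize(
    fst.compose(lexicon.arcsort("olabel"), g_general.arcsort("ilabel"))
)
lg.minimize()          # in-place minimization
lg.write("LG.fst")     # input to the rest of the decoding-graph build
```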
In a preferred embodiment, as shown in fig. 2, step S105 is followed by:
Step S106: testing the WFST speech recognition network.
In this step, the WFST speech recognition network is tested against a configured test set for each of a plurality of configured interfaces, the test recognition data of the plurality of configured interfaces is acquired, and that data is displayed, where the test recognition data includes the identification information of the corresponding configured interface.
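The patent leaves the test metric unspecified; word error rate is the usual choice for such a test set. The sketch below assumes each configured interface exposes some decode function producing one hypothesis per test utterance, and computes WER from (reference, hypothesis) pairs.

```python
# Illustrative WER computation for step S106; the pairing of references and
# decoder hypotheses per configured interface is assumed.
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word sequences (single-row dynamic program)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, diag + (r != h))
    return row[len(hyp)]

def word_error_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (reference transcript, hypothesis returned by the WFST network)."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / max(words, 1)
```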
In a preferred embodiment, as shown in fig. 3, step S102 further includes:
and step S1021, scoring the entries in the universal corpus set.
In this step, the entries in the general corpus are scored through the scoring language model, scores corresponding to the entries are obtained, if the scores of the entries are larger than a set threshold value, the entries are reserved from the general corpus, and if not, the entries are deleted from the general corpus. Therefore, the entries in the general corpus set are screened, the deviation rate of the entries is reduced, the entry storage space is reduced, and the entry operation speed is increased.
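As a concrete illustration of this screening, the sketch below uses KenLM's Python bindings as the scoring language model and a length-normalized log-probability threshold; both the binding choice and the threshold value are assumptions, since the patent only specifies scoring against a set threshold.

```python
# Hedged sketch of step S1021: keep entries whose LM score exceeds a threshold.
import kenlm

scorer = kenlm.Model("scoring_lm.arpa")  # hypothetical scoring language model

def filter_general_corpus(entries: list[str], threshold: float = -4.0) -> list[str]:
    kept = []
    for entry in entries:
        words = entry.split()
        if not words:
            continue
        # Average per-word log10 probability so longer entries are not penalized.
        score = scorer.score(entry, bos=True, eos=True) / len(words)
        if score > threshold:
            kept.append(entry)
    return kept
```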
In a preferred embodiment, as shown in fig. 4, step S103 further includes:
and step S1031, optimizing the entries of the specific corpus.
In this step, the entries in the specific corpus are obtained and sorted in the set search engine, and the entries with the number of entries set from the first backward sorting in the set search engine sorting are intercepted to update the specific corpus. Therefore, the entries in the specific corpus set are optimized, the words with high use frequency are selected, the universality of the entries is improved, the entry storage space is reduced, and the entry operation speed is further improved.
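Since the crawled entries arrive in the order the search engine ranked them, this update reduces to keeping the first N entries; a minimal sketch, with N as an illustrative parameter:

```python
# Step S1031 as a one-step truncation; entries are assumed to already be in
# search-engine rank order.
def truncate_by_rank(ranked_entries: list[str], n: int = 100) -> list[str]:
    """Keep the top-n entries, counted from the first position of the ranking."""
    return ranked_entries[:n]
```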
In a preferred embodiment, the step of training on the general corpus to generate a general language model in ARPA format in step S104 includes:
adding a required-parameter button on the human-computer interaction interface and, if selection information for that required-parameter button is received, training on the general corpus to generate the general language model in ARPA format.
The step of separately testing the WFST speech recognition network against the test sets of the plurality of configured interfaces in step S106 includes:
adding a required-parameter button on the human-computer interaction interface and, if selection information for that required-parameter button is received, testing the WFST speech recognition network against the configured test set.
Requiring these parameters lowers the developer's error rate during development and thereby improves development quality.
In a preferred embodiment, the step of connecting the WFST network of the general language model in parallel with the WFST network of the specific language model in step S105 comprises: converting the general language model into WFST form, converting the specific language model into WFST form, and adding a new initial node in front of the initial node of the general language model in WFST form and the initial node of the specific language model in WFST form, thereby merging the general language model and the specific language model.
In a preferred embodiment, step S102 further includes generating an operation button for step S102 on the human-computer interaction interface, the button being enabled once the operation of step S101 has finished. Step S103 further includes generating an operation button for step S103 on the human-computer interaction interface, enabled once the operation of step S102 has finished.
Step S104 further includes generating an operation button for step S104 on the human-computer interaction interface, enabled once the operation of step S103 has finished.
Step S105 further includes generating an operation button for step S105 on the human-computer interaction interface, enabled once the operation of step S104 has finished.
On the one hand this improves the visibility of the operation flow; it also constrains developers to execute the steps in order, which improves the consistency and normalization of speech recognition network generation, while the reduction in misoperations improves development efficiency.
In another aspect, as shown in FIG. 5, the invention also provides a visual generation system for a speech recognition network. The system comprises a user interaction unit 101, a general corpus acquisition unit 201, a specific corpus acquisition unit 301, a language model acquisition unit 401 and a WFST speech recognition network acquisition unit 501.
The user interaction unit 101 is configured to receive keywords through the human-computer interaction interface and to select a current domain field from a plurality of preset general domain fields, where each general domain field corresponds to a plurality of preset crawl keywords and a plurality of preset Web crawl pages.
The general corpus acquisition unit 201 is configured to acquire the preset crawl keywords corresponding to the current domain field, crawl the plurality of preset Web crawl pages corresponding to the current domain field according to those keywords to acquire a first crawl result, and acquire the general corpus from the first crawl result.
The specific corpus acquisition unit 301 is configured to set the keyword as the current crawl keyword, crawl a second crawl result from the result pages returned by the designated search engine on the Web side according to the current crawl keyword, and acquire the specific corpus from the second crawl result.
The language model acquisition unit 401 is configured to train on the general corpus to generate a general language model in ARPA format and to train on the specific corpus to generate a specific language model in ARPA format; the file information of the general language model and of the specific language model includes a version number serving as an identifier.
The WFST speech recognition network acquisition unit 501 is configured to merge the general language model and the specific language model, combine them with the acoustic model and pronunciation dictionary data, and synthesize the WFST speech recognition network.
In an embodiment of the visual generation system of the present invention, as shown in fig. 6, the system further includes a testing unit 601. The testing unit 601 is configured to test the WFST speech recognition network against a configured test set for each of a plurality of configured interfaces, acquire the test recognition data of the plurality of configured interfaces, and display that data, where the test recognition data includes the identification information of the corresponding configured interface.
In still another aspect of the present invention, a visual generation platform for the speech recognition network is provided, on which the visual generation system of the invention is loaded. The system enables a plurality of development groups to operate simultaneously; each development group comprises a plurality of developers, and each developer can use an independent unit. An independent unit is a single unit of the visual generation system, for example one of the user interaction unit 101, the general corpus acquisition unit 201, the specific corpus acquisition unit 301, the language model acquisition unit 401 and the WFST speech recognition network acquisition unit 501.
The visual generation platform is configured to store the general language models and the specific language models generated or used in the plurality of development groups, and establishes correspondences among version numbers according to the version numbers of those models.
The current development group can select a current model from the general and specific language models stored on the visual generation platform. If the current development group deletes, replaces or edits the current model, the platform notifies the corresponding development groups according to the version-number correspondences, and the current development group then operates on the current model according to the information those groups return. This avoids the resource conflicts that arise from resource sharing when multiple developers work on the same platform, and improves the reliability and consistency of the development platform.
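A minimal sketch of the version-dependency lookup behind this notification mechanism, assuming an in-memory table in place of the platform's real version store; all identifiers are invented for illustration.

```python
# Hypothetical dependency table: version id -> ids of models that use it.
DEPENDENCY_TABLE = {
    "general-lm-v3": {"wfst-net-v7", "wfst-net-v8"},
    "specific-lm-v2": {"wfst-net-v7"},
}

def request_delete(version_id: str) -> str:
    """Return a confirmation prompt if other models depend on this version."""
    dependents = DEPENDENCY_TABLE.get(version_id, set())
    if dependents:
        return (f"Version {version_id} is used by {sorted(dependents)}; "
                "confirm before deleting.")
    return f"Version {version_id} deleted."
```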
It should be noted that the units in the embodiments of the present disclosure do not limit the scheme of the disclosure; in addition, the related functional modules may also be implemented by a hardware processor, which will not be described again here.
In another embodiment of the present invention, another method for generating a visualization of a speech recognition network is provided. The method comprises the following steps:
1) Because the software system is a complete platform, flow control can be enforced by the server in the program, avoiding a step being forgotten through human error.
the flow control includes three aspects:
1. when the necessary parameter buttons are set in the processes of training the model, testing and the like, the necessary options are not selected, and the operation cannot be continuously executed.
2. The training to testing is a complete set of flow with a sequence, and when the previous steps are not operated, the following steps are displayed in grey.
3. The system may verify parameters such as: checking whether the pronunciation dictionary is matched with the word dictionary, and returning error information if the pronunciation dictionary is not matched with the word dictionary.
2) The software system provides dedicated visual version control and can check related dependencies programmatically, preventing other versions from being changed through deletion or modification operations between versions. "Related dependencies" here means: when a deletion, modification or similar operation is executed, the system checks by table lookup whether any other model uses this version's model, returns prompt information to be confirmed, and only deletes or modifies the model after confirmation is clicked.
3) Because the software system provides a Web-operated interface, operation is simpler than at the command line, and the flow of operations is simplified.
4) Carrying out flow control, version control and flow simplification programmatically in the system reduces the risk of language model training.
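The parameter verification described in point 3 of the flow-control list above can be as simple as a set-membership check between the two dictionaries. The file formats assumed here, one word per line in the word dictionary and a word followed by its pronunciation per line in the pronunciation dictionary, are illustrative:

```python
# Hedged sketch of the pronunciation/word dictionary consistency check.
def check_lexicon(word_dict_path: str, pron_dict_path: str) -> list[str]:
    """Return the words that have no pronunciation entry (empty list = matched)."""
    with open(pron_dict_path, encoding="utf-8") as f:
        covered = {line.split()[0] for line in f if line.strip()}
    with open(word_dict_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    return [w for w in words if w not in covered]
```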
Referring to fig. 7: first, since building a language model requires data for support, the first step is corpus management. Corpus gathering includes crawling network corpora and generating artificial corpora; corpus management includes operations such as normalizing, deleting and moving corpora.
Crawled corpora fall into two categories. First, crawlers for certain fixed domains are already configured in the system; when data for some domain is needed, that domain is selected on the Web page to start crawling. Second, keyword-driven crawling: the user fills in keywords on the Web side, the system's crawler then searches each major search engine, and text is extracted from the returned entries. The text is screened in two ways: the general language model scores each entry, and an entry is kept when its score exceeds a certain threshold and deleted otherwise; and the first N entries are extracted according to the order in which the search engine ranks them.
Secondly, once the corpora exist, an ARPA language model needs to be trained. This comprises three parts: general language model generation, custom language model generation and language model management. Language model management provides delete and move buttons, where delete means deleting the language model from the file system and move mainly means changing its storage location in the file system.
As for training, the general language model is obtained mainly by the user ticking the corpora of the various broad domains configured in the system, while the custom language model is trained on corpora crawled by keyword and on corpora provided directly by the user. The two differ mainly in corpus selection; they can be operated on different pages, and the difference between them is reflected in the model ID.
Thirdly, the resource management module mainly combines the generated language model with the acoustic model and the pronunciation dictionary to generate the WFST speech recognition network, which is obtained through composition, determinization and minimization operations. The WFST network unit of the general and custom language models is formed mainly by adding a start node at the very front of the two networks so that they are connected in parallel. When decoding, the recognizer can then search both the general and the custom language model within the WFST speech recognition network.
Operations on the two WFST network units are provided for the purpose of project customization. Resource management supplies the input of the decoding module, thereby connecting the language model module with the decoding module. This module also provides WFST resource management functions.
Fourthly, decoding test management reports the performance of the test set on the new resources, mainly by providing interfaces for the various configurations.
The Web front end is built with HTML, CSS and other front-end technologies to achieve the visualization effect.
The server uses Flask to build the interfaces for invoking data processing, model training and testing, and communicates with the front end; data is transmitted as JSON.
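As a minimal sketch of that server side, assuming a single training endpoint and an invented JSON payload (the patent states only that Flask and JSON are used):

```python
# Illustrative Flask endpoint; route name and payload fields are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/train", methods=["POST"])
def train():
    payload = request.get_json()  # e.g. {"corpus": "general_corpus.txt", "order": 3}
    # ... dispatch corpus preparation and ARPA model training here ...
    return jsonify({"status": "started", "corpus": payload.get("corpus")})

if __name__ == "__main__":
    app.run(port=5000)
```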
The specific underlying operations such as data processing and model training are implemented in combination with open-source toolkits, with Python as the carrier language of the source code.
In other embodiments, the present invention further provides a non-transitory computer storage medium storing computer-executable instructions that can execute the speech signal processing method of any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
step S101, receiving keywords through a human-computer interaction interface; selecting a current domain field from a plurality of preset general domain fields, wherein each general domain field corresponds to a plurality of preset crawl keywords and a plurality of preset Web crawl pages;
step S102, crawling the plurality of preset Web crawl pages corresponding to the current domain field according to the preset crawl keywords corresponding to that field, and acquiring a general corpus according to the crawl result;
step S103, setting the keywords as the current crawl keywords, crawling from the result pages returned by a designated search engine on the Web side according to those keywords, and acquiring a specific corpus according to the crawl result;
step S104, training the general corpus into a general language model in ARPA format and training the specific corpus into a specific language model in ARPA format, wherein the file information of the general language model and of the specific language model includes a version number serving as an identifier;
step S105, after connecting the WFST recognition network of the general language model in parallel with the WFST recognition network of the specific language model, synthesizing the WFST speech recognition network with the acoustic model and the pronunciation dictionary through composition, determinization and minimization operations.
As a nonvolatile computer readable storage medium, it can be used to store nonvolatile software programs, nonvolatile computer executable programs, and modules, such as program instructions/modules corresponding to the voice signal processing method in the embodiment of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium, which when executed by a processor, perform the speech signal processing method of any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the voice signal processing unit, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice signal processing unit over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speech signal processing methods described above.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the electronic device includes one or more processors 710 and a memory 720; one processor 710 is taken as an example in fig. 8. The apparatus for the speech signal processing method may further include an input unit 730 and an output unit 740. The processor 710, the memory 720, the input unit 730 and the output unit 740 may be connected by a bus or in other ways; connection by a bus is taken as the example in fig. 8. The memory 720 is a non-volatile computer-readable storage medium as described above. The processor 710 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 720, thereby implementing the method of the above method embodiment. The input unit 730 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the information delivery unit. The output unit 740 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device may be applied to a visualization generation platform of a speech recognition network, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
receiving keywords through a human-computer interaction interface; selecting a current domain field from a plurality of preset general domain fields, wherein each general domain field corresponds to a plurality of preset crawl keywords and a plurality of preset Web crawl pages;
crawling the plurality of preset Web crawl pages corresponding to the current domain field according to the preset crawl keywords corresponding to that field, and acquiring a general corpus according to the crawl result;
setting the keywords as the current crawl keywords, crawling from the result pages returned by a designated search engine on the Web side according to those keywords, and acquiring a specific corpus according to the crawl result;
training the general corpus into a general language model in ARPA format and training the specific corpus into a specific language model in ARPA format, wherein the file information of the general language model and of the specific language model includes a version number serving as an identifier;
after connecting the WFST recognition network of the general language model in parallel with the WFST recognition network of the specific language model, synthesizing the WFST speech recognition network with the acoustic model and the pronunciation dictionary through composition, determinization and minimization operations.
The electronic device of embodiments of the present invention exists in a variety of forms, including but not limited to:
(1) a mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.
(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic units with data interaction functions.
The above-described unit embodiments are merely illustrative; units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A visual generation method for a speech recognition network, executable on a Web side, comprising the following steps:
step S101, receiving keywords through a human-computer interaction interface; selecting a current domain field from a plurality of preset general domain fields, wherein each general domain field corresponds to a plurality of preset crawl keywords and a plurality of preset Web crawl pages;
step S102, acquiring the preset crawl keywords corresponding to the current domain field, crawling the plurality of preset Web crawl pages corresponding to the current domain field according to those keywords to acquire a first crawl result, and acquiring a general corpus according to the first crawl result;
step S103, setting the keyword as the current crawl keyword, crawling a second crawl result from the result pages returned by a designated search engine on the Web side according to the current crawl keyword, and acquiring a specific corpus according to the second crawl result;
step S104, training on the general corpus to generate a general language model in ARPA format, and training on the specific corpus to generate a specific language model in ARPA format, wherein the file information of the general language model and of the specific language model includes a version number serving as an identifier;
step S105, merging the general language model and the specific language model, combining them with an acoustic model and pronunciation dictionary data, and synthesizing a WFST speech recognition network;
wherein the step of merging the general language model and the specific language model comprises:
converting the general language model into WFST form, converting the specific language model into WFST form, and adding a new initial node in front of the initial node of the general language model in WFST form and the initial node of the specific language model in WFST form, thereby merging the general language model and the specific language model.
2. The method of claim 1, further comprising, after step S105,
step S106, testing the WFST speech recognition network against a configured test set for each of a plurality of configured interfaces, acquiring the test recognition data of the plurality of configured interfaces, and displaying that data, wherein the test recognition data includes the identification information of the corresponding configured interface.
3. The method according to claim 1, wherein step S102 further comprises:
step S1021, scoring the entries in the general corpus with a scoring language model to acquire a score for each entry; if an entry's score exceeds a set threshold the entry is kept, otherwise it is deleted from the general corpus.
4. The method according to claim 1, wherein step S103 further comprises,
step S1031, acquiring the ranking of each entry in the specific corpus, and keeping a set number of entries, counted from the first position of the designated search engine's ranking, to update the specific corpus.
5. The method according to claim 2, wherein the step of training on the general corpus to generate a general language model in ARPA format in step S104 comprises,
adding a required-parameter button on the human-computer interaction interface and, if selection information for that required-parameter button is received, training on the general corpus to generate the general language model in ARPA format;
and the step of testing the WFST speech recognition network against the test sets of the plurality of configured interfaces in step S106 comprises,
adding a required-parameter button on the human-computer interaction interface and, if selection information for that required-parameter button is received, testing the WFST speech recognition network against the configured test set.
6. The method of claim 1, wherein,
step S102 further comprises generating an operation button for step S102 on the human-computer interaction interface, the button being enabled once the operation of step S101 has finished;
step S103 further comprises generating an operation button for step S103 on the human-computer interaction interface, the button being enabled once the operation of step S102 has finished;
step S104 further comprises generating an operation button for step S104 on the human-computer interaction interface, the button being enabled once the operation of step S103 has finished;
step S105 further comprises generating an operation button for step S105 on the human-computer interaction interface, the button being enabled once the operation of step S104 has finished.
7. A visual generation system for a speech recognition network, comprising a user interaction unit, a general corpus acquisition unit, a specific corpus acquisition unit, a language model acquisition unit and a WFST speech recognition network acquisition unit;
the user interaction unit is configured to receive a keyword through a human-computer interaction interface and to select a current field from a plurality of preset general fields, wherein each general field corresponds to a plurality of preset crawling words and a plurality of preset Web pages to be crawled;
the general corpus acquisition unit is configured to acquire the preset crawling words corresponding to the current field, crawl the plurality of preset Web pages corresponding to the current field according to the preset crawling words to obtain a first crawling result, and build the general corpus from the first crawling result;
the specific corpus acquisition unit is configured to set the keyword as the current crawling word, crawl the pages returned by a set search engine on the Web according to the current crawling word to obtain a second crawling result, and build the specific corpus from the second crawling result;
the language model acquisition unit is configured to train on the general corpus to generate a general language model in the arpa format, and to train on the specific corpus to generate a specific language model in the arpa format; the file information of the general language model and of the specific language model each include a version number serving as an identifier;
the WFST speech recognition network acquisition unit is configured to merge the general language model and the specific language model and, after combining them with an acoustic model and pronunciation dictionary data, synthesize the WFST speech recognition network;
wherein the step of merging the general language model and the specific language model comprises:
converting the general language model into WFST form, converting the specific language model into WFST form, adding a new initial node ahead of the initial node of the general language model in WFST form and the initial node of the specific language model in WFST form, and thereby merging the two models.
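This merging step is the classical WFST union construction: a fresh initial state with ε-arcs into the initial states of both operand machines. A sketch using the pynini/OpenFst Python bindings; the patent does not prescribe a library, and the function name is illustrative:

```python
import pynini

EPSILON = 0  # label 0 denotes epsilon by OpenFst convention

def merge_language_models(general: pynini.Fst, specific: pynini.Fst) -> pynini.Fst:
    """Union of two LM WFSTs via a new initial node, as in claim 7."""
    merged = pynini.Fst(arc_type=general.arc_type())
    start = merged.add_state()           # the added initial node
    merged.set_start(start)
    one = pynini.Weight.one(merged.weight_type())
    for lm in (general, specific):
        offset = merged.num_states()
        for _ in lm.states():            # replicate the operand's states
            merged.add_state()
        for s in lm.states():            # replicate its final weights and arcs
            merged.set_final(offset + s, lm.final(s))
            for arc in lm.arcs(s):
                merged.add_arc(offset + s,
                               pynini.Arc(arc.ilabel, arc.olabel,
                                          arc.weight, offset + arc.nextstate))
        # epsilon arc from the new initial node to the operand's old start
        merged.add_arc(start,
                       pynini.Arc(EPSILON, EPSILON, one, offset + lm.start()))
    return merged
```

OpenFst's built-in `union` operation yields an equivalent machine; the explicit construction above simply mirrors the wording of the claim.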
8. The system of claim 7, further comprising a test unit;
the test unit is configured to test the WFST speech recognition network against the set test sets of a plurality of configured interfaces respectively, acquire the test recognition data of the plurality of configured interfaces, and display that test recognition data, wherein the test recognition data includes identification information of the corresponding configured interface.
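A sketch of the test unit's behavior: the same test set is run through every configured interface, and each result is tagged with the identifier of the interface that produced it. The `recognize` callables are hypothetical stand-ins for the real decoding endpoints:

```python
from typing import Callable, Dict, List, Tuple

def run_test_set(interfaces: Dict[str, Callable[[bytes], str]],
                 test_set: List[Tuple[bytes, str]]) -> List[dict]:
    """Test each configured interface on (audio, expected transcript) pairs."""
    results = []
    for interface_id, recognize in interfaces.items():
        correct = sum(1 for audio, expected in test_set
                      if recognize(audio) == expected)
        results.append({
            "interface": interface_id,             # identification information
            "accuracy": correct / len(test_set),   # sentence-level accuracy
        })
    return results
```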
9. A visual generation platform for a speech recognition network, on which the system of claim 7 or 8 is deployed, the system enabling a plurality of development groups to operate simultaneously, each of the plurality of development groups comprising a plurality of developers, and each developer being able to use an independent unit, an independent unit being a single unit of the visual generation system of claim 7 or 8;
the visual generation platform is configured to store the general language models and the specific language models generated or used by the plurality of development groups, and establishes a correspondence among the version numbers of the general language models and the specific language models generated or used by the plurality of development groups;
a current development group can select a current model from among the general language models and the specific language models stored on the visual generation platform; and if the current development group deletes, replaces or edits the current model, the platform notifies the corresponding development groups according to the version-number correspondence, and the current development group then operates on the current model according to the information returned by those groups.
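The platform behavior of claim 9 amounts to a registry keyed by model version numbers plus a notification hook that fires before a shared model is deleted, replaced, or edited. A minimal in-memory sketch (all names are illustrative; a real platform would persist this mapping):

```python
from collections import defaultdict

class ModelRegistry:
    """Track which development groups generated or use each model version."""

    def __init__(self):
        self.groups = defaultdict(set)  # version number -> development groups

    def register(self, version, group):
        self.groups[version].add(group)

    def request_change(self, version, acting_group, notify):
        """Notify every other group tied to `version` and collect replies."""
        others = self.groups[version] - {acting_group}
        replies = {g: notify(g, version) for g in others}
        # the acting group proceeds according to the returned information
        return all(replies.values()), replies

registry = ModelRegistry()
registry.register("general-lm-v3", "group-a")
registry.register("general-lm-v3", "group-b")
approved, _ = registry.request_change("general-lm-v3", "group-a",
                                      notify=lambda g, v: True)  # stub reply
```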
CN201910719492.2A 2019-08-05 2019-08-05 Visual generation method, system and platform of voice recognition network Active CN110427459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910719492.2A CN110427459B (en) 2019-08-05 2019-08-05 Visual generation method, system and platform of voice recognition network

Publications (2)

Publication Number Publication Date
CN110427459A CN110427459A (en) 2019-11-08
CN110427459B true CN110427459B (en) 2021-09-17

Family

ID=68414250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910719492.2A Active CN110427459B (en) 2019-08-05 2019-08-05 Visual generation method, system and platform of voice recognition network

Country Status (1)

Country Link
CN (1) CN110427459B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145727B (en) * 2019-12-02 2022-04-22 云知声智能科技股份有限公司 Method and device for recognizing digital string by voice
CN111951788A (en) * 2020-08-10 2020-11-17 百度在线网络技术(北京)有限公司 Language model optimization method and device, electronic equipment and storage medium
CN111933146B (en) * 2020-10-13 2021-02-02 苏州思必驰信息科技有限公司 Speech recognition system and method
CN113223522B (en) * 2021-04-26 2022-05-03 北京百度网讯科技有限公司 Speech recognition method, apparatus, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923854A (en) * 2010-08-31 2010-12-22 中国科学院计算技术研究所 Interactive speech recognition system and method
EP2309487A1 (en) * 2009-09-11 2011-04-13 Honda Research Institute Europe GmbH Automatic speech recognition system integrating multiple sequence alignment for model bootstrapping
CN102760436A (en) * 2012-08-09 2012-10-31 河南省烟草公司开封市公司 Voice lexicon screening method
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
CN109976702A (en) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 A kind of audio recognition method, device and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
End-to-End Multimodal Speech Recognition; Palaskar, Shruti et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-12-31; pp. 5774-5778 *
Research on Automatic Construction of a Speech Corpus and Minimal Speech Annotation; Zhang Zhinan; China Master's Theses Full-text Database (Electronic Journal); 2014-03-15; pp. I138-1157 *

Similar Documents

Publication Publication Date Title
CN110427459B (en) Visual generation method, system and platform of voice recognition network
US11527233B2 (en) Method, apparatus, device and computer storage medium for generating speech packet
US20160225369A1 (en) Dynamic inference of voice command for software operation from user manipulation of electronic device
CN111737411A (en) Response method in man-machine conversation, conversation system and storage medium
CN112084315B (en) Question-answer interaction method, device, storage medium and equipment
EP3251001A1 (en) Dynamic inference of voice command for software operation from help information
WO2018085760A1 (en) Data collection for a new conversational dialogue system
CN109313668B (en) System and method for constructing session understanding system
CN111881316A (en) Search method, search device, server and computer-readable storage medium
JP7093397B2 (en) Question answering robot generation method and equipment
CN111881042B (en) Automatic test script generation method and device and electronic equipment
CN114168718A (en) Information processing apparatus, method and information recording medium
CN111145745A (en) Conversation process customizing method and device
CN112000330B (en) Configuration method, device, equipment and computer storage medium of modeling parameters
CN110660391A (en) Method and system for customizing voice control of large-screen terminal based on RPA (resilient packet Access) interface
CN111553138A (en) Auxiliary writing method and device for standardizing content structure document
CN111681658A (en) Voice control method and device for vehicle-mounted APP
CN112286486B (en) Operation method of application program on intelligent terminal, intelligent terminal and storage medium
CN109408815A (en) Dictionary management method and system for voice dialogue platform
CN112784024B (en) Man-machine conversation method, device, equipment and storage medium
US20210098012A1 (en) Voice Skill Recommendation Method, Apparatus, Device and Storage Medium
CN104090915B (en) Method and device for updating user data
CN109891410B (en) Data collection for new session dialog systems
CN116432573A (en) Performance simulation method, electronic device and storage medium
CN111158648A (en) Interactive help system development method based on live-action semantic understanding and platform thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant