CN116524905A - Training method, device, equipment and storage medium of voice recognition model - Google Patents

Training method, device, equipment and storage medium of voice recognition model

Info

Publication number
CN116524905A
CN116524905A (application CN202310123430.1A)
Authority
CN
China
Prior art keywords
result
recognition model
correction
voice recognition
open source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310123430.1A
Other languages
Chinese (zh)
Inventor
张超
王乐
滕勇
丁希剑
李健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaovo Technology Co ltd
Original Assignee
Xiaovo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaovo Technology Co ltd
Priority to CN202310123430.1A
Publication of CN116524905A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/26 - Speech to text systems
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a training method, apparatus, device and storage medium for a speech recognition model. The method comprises: acquiring unlabeled speech data in a target field, and inputting the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model; correcting each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model; determining a labeling result of the unlabeled speech data according to the correction results; and training a speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy a preset evaluation condition. This scheme addresses the high training cost of speech recognition models: training can be performed with unlabeled speech data, greatly reducing training cost while preserving recognition accuracy.

Description

Training method, device, equipment and storage medium of voice recognition model
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training a speech recognition model.
Background
At present, speech recognition models are mainly obtained through supervised training on large-scale speech data and the label data matched with that speech data.
However, labeling large-scale speech data incurs high labor costs and takes a great deal of time. A training method that obtains a speech recognition model from unlabeled speech data is therefore of great practical significance.
Disclosure of Invention
The invention provides a training method, apparatus, device and storage medium for a speech recognition model, to address the high training cost of speech recognition models: training can be performed with unlabeled speech data, greatly reducing training cost while preserving recognition accuracy.
According to an aspect of the present invention, there is provided a training method for a speech recognition model, the method comprising:
acquiring unlabeled speech data in a target field, and inputting the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model;
correcting each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model, wherein the text correction principle is determined based on text data of the target field;
determining a labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models; and
training a speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy a preset evaluation condition.
According to another aspect of the present invention, there is provided a training apparatus for a speech recognition model, the apparatus comprising:
a recognition result determining module, configured to acquire unlabeled speech data in a target field and input the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model;
a correction result determining module, configured to correct each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model, wherein the text correction principle is determined based on text data of the target field;
a labeling result determining module, configured to determine a labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models; and
a recognition model training module, configured to train a speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy a preset evaluation condition.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of training a speech recognition model according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the training method of the speech recognition model according to any of the embodiments of the present invention when executed.
According to the technical scheme, unlabeled speech data in the target field are input into at least two open-source speech recognition models respectively, yielding a recognition result matched with each model; each recognition result is then corrected according to a preset text correction principle, yielding a correction result matched with each open-source speech recognition model; a labeling result of the unlabeled speech data is determined from those correction results; and the speech recognition model to be trained is trained on the unlabeled speech data and the labeling result, producing a speech recognition model whose recognition results on speech data in the target field satisfy the preset evaluation condition. This scheme addresses the high training cost of speech recognition models: training can be performed with unlabeled speech data, greatly reducing training cost while preserving recognition accuracy.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. The drawings described here are only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a training method of a speech recognition model according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a training method of a speech recognition model according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for determining a training data set according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training device for a speech recognition model according to a third embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device implementing a training method of a speech recognition model according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The data acquisition, storage, use, processing and the like in the technical scheme meet the relevant regulations of national laws and regulations.
Example 1
FIG. 1 is a flowchart of a training method for a speech recognition model according to a first embodiment of the present invention. The method may be performed by a training apparatus for a speech recognition model; the apparatus may be implemented in hardware and/or software and may be configured in an electronic device. As shown in FIG. 1, the method includes:
S110, acquiring unlabeled speech data in the target field, and inputting the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model.
The scheme may be executed by an electronic device such as a computer or a server. In professional fields such as telephone customer service, medicine and finance, speech data usually contain a large amount of specialized vocabulary, and the spoken content is highly domain-correlated, so a dedicated speech recognition model is needed to recognize such speech data. During recognition, the electronic device may read a pre-constructed speech data set, which may include large-scale unlabeled speech data in the target field. The target field may be a vertical field such as medicine or finance. Unlabeled speech data are speech data with no matching text label, i.e., speech data that have not been manually annotated.
The electronic device may input the unlabeled speech data into a plurality of open-source speech recognition models simultaneously for recognition. Each open-source speech recognition model is a model whose recognition results on speech data in non-vertical fields satisfy a preset evaluation condition. In other words, the open-source speech recognition models have a certain recognition capability for the unlabeled speech data and reach a certain accuracy on speech data in non-vertical fields; for example, their recognition accuracy on such data may exceed 80%. It should be noted that the open-source speech recognition models may differ from one another: in model structure, in training procedure, or in training data.
After recognizing the unlabeled speech data, each open-source speech recognition model produces its own recognition result.
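As a minimal sketch of this step (in Python; the model wrappers are hypothetical placeholders, since the patent names no concrete API), the per-model transcription can be pictured as follows:

```python
from typing import Callable, Dict, List

# Each open-source model is abstracted as a callable mapping an audio file
# path to a transcript. These wrappers are assumptions for illustration.
Transcriber = Callable[[str], str]

def collect_recognition_results(
    audio_paths: List[str],
    models: Dict[str, Transcriber],
) -> Dict[str, Dict[str, str]]:
    """Return {audio_path: {model_name: transcript}} for the unlabeled data."""
    assert len(models) >= 2, "the scheme requires at least two models"
    return {
        path: {name: transcribe(path) for name, transcribe in models.items()}
        for path in audio_paths
    }
```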
S120, correcting each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model.
It can be appreciated that the electronic device may collect text data of the target field, such as books, papers and journals, and compile statistics on the specialized vocabulary in that text data. Meanwhile, the electronic device may extract text features from the text data to train models for tasks such as text error detection and text correction. From the text data, the electronic device can determine a text correction principle, and then perform operations such as vocabulary replacement and text correction on the recognition results according to that principle, thereby obtaining a correction result corresponding to each recognition result.
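For illustration, a toy frequency count over such a domain corpus might look like the sketch below; the tokenizer and the threshold are assumptions (real Chinese text would need a proper word segmenter, and the patent leaves the threshold preset):

```python
import re
from collections import Counter
from typing import Iterable, Set

def count_domain_terms(corpus: Iterable[str]) -> Counter:
    # Toy \w+ tokenizer, purely for illustration.
    return Counter(tok for doc in corpus
                   for tok in re.findall(r"\w+", doc.lower()))

def target_words(corpus: Iterable[str], min_count: int = 50) -> Set[str]:
    # Specialized words above the (assumed) frequency threshold become the
    # target words used to build the target phrase set (see S220 below).
    counts = count_domain_terms(list(corpus))
    return {word for word, count in counts.items() if count >= min_count}
```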
S130, determining the labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models.
The electronic device may compare the correction results corresponding to the open-source speech recognition models and determine the labeling result of the unlabeled speech data from the comparison. Specifically, if all correction results are the same, that correction result is taken as the labeling result of the unlabeled speech data; if at least one correction result differs from the others, it is determined that the unlabeled speech data has no labeling result.
Alternatively, the electronic device may compute the similarity between the correction results of every pair of open-source speech recognition models and determine the labeling result from those similarities. For example, suppose there are three open-source speech recognition models, model A, model B and model C, whose matched correction results are result A, result B and result C respectively. If the similarity between result A and result B is 100%, between result A and result C is 90%, and between result B and result C is 90%, the electronic device may take result A (or result B) as the labeling result of the unlabeled speech data.
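A minimal sketch of this pairwise-similarity variant, assuming a character-level similarity measure (difflib's ratio, purely for illustration) and full agreement between at least one pair as the acceptance condition:

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, Optional

def label_from_pairwise_similarity(corrections: Dict[str, str],
                                   threshold: float = 1.0) -> Optional[str]:
    # Compare every pair of corrected results; if one pair is similar
    # enough, either member of that pair serves as the labeling result.
    for (_, text_a), (_, text_b) in combinations(corrections.items(), 2):
        if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
            return text_a
    return None  # no pair agrees: the utterance gets no labeling result

# Mirrors the example above: A and B agree fully, so their text is the label.
label = label_from_pairwise_similarity(
    {"A": "result text", "B": "result text", "C": "result text variant"})
```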
S140, training the speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy the preset evaluation condition.
The unlabeled speech data and their matched labeling results serve as the data set of the speech recognition model to be trained, on which that model is iteratively trained. Based on the recognition results on speech data in the target field, the electronic device can output a speech recognition model that satisfies the evaluation condition. The evaluation condition may be determined from evaluation indexes such as recognition accuracy, loss, precision, recall and F1 score; in particular, it may include one or more of: recognition accuracy greater than a preset accuracy threshold, loss less than a preset loss threshold, precision greater than a preset precision threshold, recall greater than a preset recall threshold, and so on. The electronic device may divide the data set into a training set and a test set, iteratively train the speech recognition model to be trained on the training set, and test the trained model on the test set. From the test results, the electronic device can determine evaluation indexes such as recognition accuracy, loss, precision, recall and F1 score, and by judging whether the evaluation conditions are satisfied, obtain a speech recognition model that meets the speech recognition requirements of the target field.
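A hedged sketch of the split-and-evaluate loop; the metric names and thresholds are illustrative assumptions, since the patent only lists candidate evaluation indexes and leaves the concrete values preset:

```python
import random
from typing import Dict, List, Tuple

def split_dataset(pairs: List[Tuple[str, str]], test_ratio: float = 0.1,
                  seed: int = 0) -> Tuple[list, list]:
    # (audio, label) pairs -> training set and test set, as described above.
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_ratio))
    return pairs[:cut], pairs[cut:]

def meets_evaluation_condition(metrics: Dict[str, float]) -> bool:
    # Assumed thresholds for illustration only.
    return (metrics["accuracy"] > 0.90
            and metrics["loss"] < 0.10
            and metrics["recall"] > 0.85)
```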
According to the technical scheme of this embodiment, unlabeled speech data in the target field are input into at least two open-source speech recognition models respectively, yielding a recognition result matched with each model; each recognition result is then corrected according to a preset text correction principle, yielding a correction result matched with each open-source speech recognition model; a labeling result of the unlabeled speech data is determined from those correction results; and the speech recognition model to be trained is trained on the unlabeled speech data and the labeling result, producing a speech recognition model whose recognition results on speech data in the target field satisfy the preset evaluation condition. This scheme addresses the high training cost of speech recognition models: training can be performed with unlabeled speech data, greatly reducing training cost while preserving recognition accuracy.
Example two
Fig. 2 is a flowchart of a training method of a speech recognition model according to a second embodiment of the present invention, which is refined based on the foregoing embodiment. As shown in fig. 2, the method includes:
s210, acquiring unlabeled voice data in the target field, and respectively inputting the unlabeled voice data into at least two open source voice recognition models to obtain recognition results matched with the open source voice recognition models.
S220, respectively carrying out target vocabulary replacement on each recognition result according to a preset target phrase set to obtain a replacement result matched with each open source voice recognition model;
on the basis of the scheme, the target phrase set is determined according to the vocabulary frequency statistical result in the text data of the target field.
It is easy to understand that the electronic device can use the professional vocabulary with the word frequency larger than the preset threshold value as the target vocabulary according to the word frequency statistical result of the professional vocabulary in the text data. The electronic device may form a target word group from a target word and a word associated with the target word, where the word associated with the target word may be a synonym, a homonym, etc. of the target word, for example, the target word is 5G, and the target word group may include 5G homonyms such as electrodeless, five-level, and roof, etc., and may further include synonyms such as fifth-generation communication, fifth-generation mobile communication technology, etc.
The electronic device may compare each recognition result with each target phrase in the target phrase set in sequence, replace all the vocabulary related to the target phrase in each recognition result with the target word, and output the replacement result of each recognition result.
According to the method and the device for identifying the target vocabulary, the same target vocabulary replacement is carried out on each identification result, the identification difference of different open source voice identification models on the target field professional vocabulary can be eliminated, and accurate judgment of the labeling result is facilitated.
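The phrase groups below are assumptions modeled on the 5G example above; longer variants are replaced first so that overlapping variants do not interfere with one another:

```python
from typing import Dict, List

# Hypothetical target phrase set: canonical target word -> associated
# synonyms (modeled on the 5G example; not taken from the patent itself).
TARGET_PHRASES: Dict[str, List[str]] = {
    "5G": ["fifth-generation mobile communication technology",
           "fifth-generation communication"],
}

def replace_target_vocabulary(
    text: str,
    phrases: Dict[str, List[str]] = TARGET_PHRASES,
) -> str:
    # Replace every associated word with its canonical target word,
    # longest variant first.
    for target_word, variants in phrases.items():
        for variant in sorted(variants, key=len, reverse=True):
            text = text.replace(variant, target_word)
    return text
```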
S230, performing text error correction on each replacement result based on a pre-trained error correction model, to obtain a correction result matched with each open-source speech recognition model.
The electronic device may train an error correction model in advance, based on a deep learning algorithm, from text data of the target field. The error correction model may be used to locate text errors in the replacement results, such as missing characters, extra characters and wrong characters. For each text error identified by the error correction model, the electronic device corrects the replacement result according to the error type, thereby obtaining a correction result matched with each open-source speech recognition model.
Optionally, the error correction model is trained using text data of the target field and error correction labels of that text data.
In this way, the trained error correction model performs text error correction on the replacement results, so that accurate correction results are obtained. The sketch below illustrates how typed corrections might be applied.
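The sketch abstracts the error correction model as a callable returning typed edits; the edit format covers the three error types named above (missing, extra, wrong characters) and is an assumption for illustration:

```python
from typing import Callable, List, Tuple

# (error_type, character_position, payload). For "wrong" and "missing" the
# payload is the correct character; for "extra" it is ignored.
Edit = Tuple[str, int, str]

def apply_error_corrections(text: str,
                            detect: Callable[[str], List[Edit]]) -> str:
    chars = list(text)
    # Apply edits from right to left so earlier positions stay valid.
    for err_type, pos, payload in sorted(detect(text), key=lambda e: e[1],
                                         reverse=True):
        if err_type == "wrong":
            chars[pos] = payload
        elif err_type == "missing":
            chars.insert(pos, payload)
        elif err_type == "extra":
            del chars[pos]
    return "".join(chars)
```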
S240, judging whether the number of open-source speech recognition models with the same correction result meets a preset number requirement; if so, executing S250; if not, executing S260.
The electronic device may compare the correction results corresponding to the open-source speech recognition models and determine the labeling result of the unlabeled speech data from the proportion of identical correction results among all correction results. Specifically, if 8 out of 10 correction results are the same, that correction result is taken as the labeling result of the unlabeled speech data; if only 2 out of 10 correction results are the same, it is determined that the unlabeled speech data has no labeling result.
S250, taking the correction result as the labeling result of the unlabeled speech data.
S260, determining that the unlabeled speech data has no labeling result.
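A minimal sketch of this count-based vote; the 0.8 default mirrors the 8-out-of-10 example above and is otherwise an assumption:

```python
from collections import Counter
from typing import List, Optional

def majority_label(corrections: List[str],
                   min_ratio: float = 0.8) -> Optional[str]:
    # Keep the label only if the most common corrected result covers at
    # least min_ratio of all corrected results (S250); otherwise the
    # utterance gets no labeling result (S260).
    if not corrections:
        return None
    text, count = Counter(corrections).most_common(1)[0]
    return text if count / len(corrections) >= min_ratio else None
```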
S270, taking the unlabeled speech data that have labeling results as input to the speech recognition model to be trained, and iteratively training the speech recognition model to be trained according to its output results and the labeling results matched with the unlabeled speech data.
The electronic device may screen out, from all the unlabeled speech data, those that have obtained labeling results, and use these data together with their labeling results as the training data set of the speech recognition model to be trained. According to the output results of the speech recognition model to be trained and the labeling results matched with the unlabeled speech data, the electronic device can iteratively train the model. It should be noted that the speech recognition model to be trained may be one of the open-source speech recognition models, or an independently built speech recognition model that has not undergone any training.
FIG. 3 is a schematic diagram of a process for determining a training data set according to the second embodiment of the present invention. As shown in FIG. 3, in a concrete scheme, the electronic device inputs the unlabeled speech data into open-source speech recognition model A and open-source speech recognition model B simultaneously, obtaining recognition result A and recognition result B respectively. High-frequency vocabulary replacement for the target field is performed on recognition result A and recognition result B, yielding replacement result A and replacement result B; a high-frequency word may be one whose number of occurrences in the text data of the target field exceeds a preset count threshold. Text correction is then performed on replacement result A and replacement result B by the text correction model, and the electronic device obtains correction result A and correction result B. By comparing whether correction result A and correction result B are the same, the electronic device can determine whether the unlabeled speech data can obtain a labeling result, and whether they need to be added to the training data set of the speech recognition model to be trained. Specifically, if correction result A and correction result B are the same, the electronic device may take correction result A (or correction result B) as the labeling result of the unlabeled speech data and add the unlabeled speech data together with the labeling result to the data set; if correction result A and correction result B differ, the electronic device may discard the unlabeled speech data and not add them to the data set.
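Tying the sketches together, the following hypothetical pipeline reproduces the FIG. 3 flow for two models; it reuses replace_target_vocabulary and apply_error_corrections from the earlier sketches, and the model wrappers remain placeholder callables rather than any real API:

```python
from typing import Callable, List, Optional, Tuple

def build_training_pair(
    audio_path: str,
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    detect: Callable[[str], List[Tuple[str, int, str]]],
) -> Optional[Tuple[str, str]]:
    # Recognize, replace target vocabulary, then correct, per model.
    corrected = [
        apply_error_corrections(replace_target_vocabulary(result), detect)
        for result in (model_a(audio_path), model_b(audio_path))
    ]
    if corrected[0] == corrected[1]:
        return audio_path, corrected[0]  # keep (audio, label) in the data set
    return None  # correction results differ: discard the utterance
```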
According to the technical scheme of this embodiment, unlabeled speech data in the target field are input into at least two open-source speech recognition models respectively, yielding a recognition result matched with each model; each recognition result is then corrected according to a preset text correction principle, yielding a correction result matched with each open-source speech recognition model; a labeling result of the unlabeled speech data is determined from those correction results; and the speech recognition model to be trained is trained on the unlabeled speech data and the labeling result, producing a speech recognition model whose recognition results on speech data in the target field satisfy the preset evaluation condition. This scheme addresses the high training cost of speech recognition models: training can be performed with unlabeled speech data, greatly reducing training cost while preserving recognition accuracy.
Example III
Fig. 4 is a schematic structural diagram of a training device for a speech recognition model according to a third embodiment of the present invention. As shown in fig. 4, the apparatus includes:
the recognition result determining module 310, configured to acquire unlabeled speech data in the target field and input the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model;
the correction result determining module 320, configured to correct each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model, wherein the text correction principle is determined based on text data of the target field;
the labeling result determining module 330, configured to determine the labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models; and
the recognition model training module 340, configured to train the speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy a preset evaluation condition.
In this scheme, optionally, the target field is a vertical field, and the recognition results of the open-source speech recognition models on speech data in non-vertical fields satisfy a preset evaluation condition.
In one possible implementation, the text correction principle includes target vocabulary replacement and text error correction.
The correction result determining module 320 includes:
a vocabulary replacement unit, configured to perform target vocabulary replacement on each recognition result according to a preset target phrase set, to obtain a replacement result matched with each open-source speech recognition model; and
a text error correction unit, configured to perform text error correction on each replacement result based on a pre-trained error correction model, to obtain a correction result matched with each open-source speech recognition model.
On the basis of this scheme, optionally, the target phrase set is determined from vocabulary frequency statistics over text data of the target field.
In this embodiment, optionally, the error correction model is trained using text data of the target field and error correction labels of that text data.
In a preferred embodiment, the labeling result determining module 330 is specifically configured to:
take the correction result as the labeling result of the unlabeled speech data if the number of open-source speech recognition models with the same correction result meets a preset number requirement; and
determine that the unlabeled speech data has no labeling result if the number of open-source speech recognition models with the same correction result does not meet the preset number requirement.
On the basis of the above scheme, optionally, the recognition model training module 340 is specifically configured to:
take the unlabeled speech data that have labeling results as input to the speech recognition model to be trained, and iteratively train the speech recognition model to be trained according to its output results and the labeling results matched with the unlabeled speech data.
The training apparatus for a speech recognition model provided by this embodiment of the present invention can execute the training method for a speech recognition model provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the executed method.
Example IV
FIG. 5 shows a schematic diagram of an electronic device 410 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in FIG. 5, the electronic device 410 includes at least one processor 411, and a memory communicatively connected to the at least one processor 411, such as a read-only memory (ROM) 412 and a random access memory (RAM) 413, wherein the memory stores computer programs executable by the at least one processor; the processor 411 may perform various suitable actions and processes according to the computer program stored in the ROM 412 or the computer program loaded from the storage unit 418 into the RAM 413. The RAM 413 may also store various programs and data required for the operation of the electronic device 410. The processor 411, the ROM 412, and the RAM 413 are connected to each other through a bus 414. An input/output (I/O) interface 415 is also connected to the bus 414.
Various components in the electronic device 410 are connected to the I/O interface 415, including: an input unit 416 such as a keyboard, a mouse, etc.; an output unit 417 such as various types of displays, speakers, and the like; a storage unit 418, such as a magnetic disk, optical disk, or the like; and a communication unit 419 such as a network card, modem, wireless communication transceiver, etc. The communication unit 419 allows the electronic device 410 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The processor 411 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 411 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 411 performs the various methods and processes described above, such as the training method of the speech recognition model.
In some embodiments, the method of training a speech recognition model may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 418. In some embodiments, some or all of the computer program may be loaded and/or installed onto the electronic device 410 via the ROM 412 and/or the communication unit 419. When the computer program is loaded into RAM 413 and executed by processor 411, one or more steps of the above-described training method of the speech recognition model may be performed. Alternatively, in other embodiments, the processor 411 may be configured to perform the training method of the speech recognition model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of training a speech recognition model, the method comprising:
acquiring unlabeled speech data in a target field, and inputting the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model;
correcting each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model, wherein the text correction principle is determined based on text data of the target field;
determining a labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models; and
training a speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy a preset evaluation condition.
2. The method of claim 1, wherein the target field is a vertical field, and the recognition results of the open-source speech recognition models on speech data in non-vertical fields satisfy a preset evaluation condition.
3. The method of claim 1, wherein the text correction principle includes target vocabulary replacement and text error correction; and
correcting each recognition result according to the preset text correction principle to obtain the correction result matched with each open-source speech recognition model comprises:
performing target vocabulary replacement on each recognition result according to a preset target phrase set, to obtain a replacement result matched with each open-source speech recognition model; and
performing text error correction on each replacement result based on a pre-trained error correction model, to obtain the correction result matched with each open-source speech recognition model.
4. The method of claim 3, wherein the target phrase set is determined from vocabulary frequency statistics over the text data of the target field.
5. The method of claim 3, wherein the error correction model is trained using the text data of the target field and error correction labels of the text data.
6. The method of claim 1, wherein determining the labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models comprises:
if the number of open-source speech recognition models with the same correction result meets a preset number requirement, taking that correction result as the labeling result of the unlabeled speech data; and
if the number of open-source speech recognition models with the same correction result does not meet the preset number requirement, determining that the unlabeled speech data has no labeling result.
7. The method of claim 1, wherein training the speech recognition model to be trained according to the unlabeled speech data and the labeling result comprises:
taking the unlabeled speech data that have labeling results as input to the speech recognition model to be trained, and iteratively training the speech recognition model to be trained according to its output results and the labeling results matched with the unlabeled speech data.
8. A training device for a speech recognition model, comprising:
a recognition result determining module, configured to acquire unlabeled speech data in a target field and input the unlabeled speech data into at least two open-source speech recognition models respectively, to obtain a recognition result matched with each open-source speech recognition model;
a correction result determining module, configured to correct each recognition result according to a preset text correction principle, to obtain a correction result matched with each open-source speech recognition model, wherein the text correction principle is determined based on text data of the target field;
a labeling result determining module, configured to determine a labeling result of the unlabeled speech data according to the correction results matched with the open-source speech recognition models; and
a recognition model training module, configured to train a speech recognition model to be trained according to the unlabeled speech data and the labeling result, to obtain a speech recognition model whose recognition results on speech data in the target field satisfy a preset evaluation condition.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of training a speech recognition model according to any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of training a speech recognition model according to any one of claims 1-7.
CN202310123430.1A (priority date 2023-02-15, filing date 2023-02-15) - Training method, device, equipment and storage medium of voice recognition model - Pending - CN116524905A (en)

Priority Applications (1)

Application Number: CN202310123430.1A
Title: Training method, device, equipment and storage medium of voice recognition model

Applications Claiming Priority (1)

Application Number: CN202310123430.1A
Title: Training method, device, equipment and storage medium of voice recognition model

Publications (1)

Publication Number: CN116524905A
Publication Date: 2023-08-01

Family

ID=87392827

Family Applications (1)

Application Number: CN202310123430.1A (Pending)
Title: Training method, device, equipment and storage medium of voice recognition model

Country Status (1)

CN: CN116524905A (en)


Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination