CN115148196A - Training method and device for end-to-end model, electronic equipment and storage medium


Info

Publication number
CN115148196A
CN115148196A
Authority
CN
China
Prior art keywords
phonetic notation
text
corpus
information
polyphonic
Prior art date
Legal status
Pending
Application number
CN202210760230.2A
Other languages
Chinese (zh)
Inventor
郑晓明
李键
陈明
武卫东
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202210760230.2A priority Critical patent/CN115148196A/en
Publication of CN115148196A publication Critical patent/CN115148196A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems


Abstract

An embodiment of the invention provides a training method and apparatus for an end-to-end model, an electronic device, and a storage medium. The method determines polyphonic words in a target training text; performs phonetic notation on the polyphonic words and generates a corpus, the corpus comprising the polyphonic words and the phonetic notation information for them; acquires audio information corresponding to the phonetic notation information; and trains an end-to-end model with the corpus and the audio information to generate an acoustic model, thereby improving the accuracy of training the end-to-end model.

Description

Training method and device for end-to-end model, electronic equipment and storage medium
Technical Field
The present invention relates to the field of end-to-end model training technologies, and in particular, to a training method for an end-to-end model, a training apparatus for an end-to-end model, an electronic device, and a computer-readable storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that studies how to convert human speech into text. It can be applied to services such as voice dialing, voice navigation, indoor device control, voice document retrieval, and simple dictation data entry.
The end-to-end model is an important model for realizing speech recognition. In general, Chinese speech recognition based on an end-to-end model selects Chinese characters as the modeling units. However, polyphonic characters exist in Chinese: if Chinese characters are used directly as modeling units, then during training two occurrences of the same character with different pronunciations correspond to the same label, which makes model training inaccurate.
Disclosure of Invention
The embodiment of the invention provides a training method, a training device, electronic equipment and a computer readable storage medium for an end-to-end model, and aims to solve the problem that polyphones cannot be recognized by speech recognition.
The embodiment of the invention discloses a training method for an end-to-end model, which comprises the following steps:
determining polyphonic words in the target training text;
performing phonetic notation on the polyphonic words and generating a corpus; the corpus comprises the polyphonic words and phonetic notation information for the polyphonic words;
acquiring audio information corresponding to the phonetic notation information;
and training an end-to-end model by adopting the corpus and the audio information, and generating an acoustic model.
Optionally, the acoustic model is configured to output a text sequence vector having a phonetic notation feature sequence vector for expressing the phonetic notation information, and the method may further include:
recognizing the audio information using the acoustic model and generating a recognition text; the recognition text comprises recognition-result phonetic notation information;
and deleting the recognition-result phonetic notation information.
Optionally, the end-to-end model has a corresponding text-to-speech module, and the step of performing phonetic notation on the polyphonic words may further include:
and performing phonetic notation on the polyphonic words by adopting the text-to-speech module.
Optionally, the step of performing phonetic notation on the polyphonic words may further include:
and performing phonetic notation on the polyphonic words based on a speech recognition alignment algorithm.
The embodiment of the invention also discloses a training apparatus for an end-to-end model, which comprises:
the polyphonic word determining module is used for determining polyphonic words in the target training text;
the corpus generating module is used for performing phonetic notation on the polyphonic words and generating a corpus; the corpus comprises the polyphonic words and phonetic notation information for the polyphonic words;
the audio information acquisition module is used for acquiring audio information corresponding to the phonetic notation information;
and the acoustic model generating module is used for training an end-to-end model by adopting the corpus and the audio information and generating an acoustic model.
Optionally, the acoustic model is configured to output a text sequence vector having a phonetic notation feature sequence vector for expressing the phonetic notation information, and the apparatus may further include:
the recognition text generation module is used for recognizing the audio information by using the acoustic model and generating a recognition text; the recognition text comprises recognition-result phonetic notation information;
and the recognition-result phonetic notation information deleting module is used for deleting the recognition-result phonetic notation information.
Optionally, the end-to-end model has a corresponding text-to-speech module, and the corpus generation module may further include:
and the first corpus generation submodule is used for performing phonetic notation on the polyphonic words by adopting the text-to-speech module.
Optionally, the corpus generating module may further include:
and the second corpus generation submodule is used for performing phonetic notation on the polyphonic words based on a speech recognition alignment algorithm.
The embodiment of the invention also discloses an electronic device, which comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method according to the embodiment of the present invention when executing the program stored in the memory.
Also disclosed is a computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform a method according to an embodiment of the invention.
The embodiment of the invention has the following advantages:
In the embodiment of the invention, the polyphonic words in the target training text are determined; phonetic notation is performed on the polyphonic words and a corpus is generated, the corpus comprising the polyphonic words and the phonetic notation information for them; audio information corresponding to the phonetic notation information is acquired; and an end-to-end model is trained with the corpus and the audio information to generate an acoustic model, thereby improving the accuracy of training the end-to-end model.
Drawings
FIG. 1 is a flow chart of the steps of a training method for an end-to-end model provided in an embodiment of the present invention;
FIG. 2 is a structural block diagram of a training apparatus for an end-to-end model provided in an embodiment of the present invention;
fig. 3 is a block diagram of a hardware structure of an electronic device provided in the embodiments of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Traditional speech recognition systems are composed of a number of modules, including an acoustic model, a pronunciation dictionary, and a language model, of which the acoustic model and the language model are trained. These modules are generally trained independently, each with its own objective function: for example, the acoustic model is trained to maximize the likelihood of the training speech, while the language model is trained to minimize perplexity.
Because the modules cannot compensate for each other's deficiencies during training, their objective functions deviate from the overall performance metric of the system, generally the Word Error Rate (WER), and the trained network therefore often fails to achieve optimal performance.
To address this problem, the end-to-end model was developed.
In an end-to-end model, the system no longer contains independent modules such as an acoustic model, a pronunciation dictionary, and a language model. Instead, a single neural network connects the input end (a speech waveform or feature sequence vector) directly to the output end (a word or character sequence vector) and takes over the functions of all the original modules. An example is Connectionist Temporal Classification (CTC), a neural-network-based temporal classification that can also be understood as a recognition method for continuous sequence vectors. For speech recognition specifically, an input sequence vector (the audio) X = [x1, x2, ..., xT] is mapped to a corresponding output sequence vector Y = [y1, y2, ..., yU]. The training goal of CTC is to maximize the output probability P(Y|X); maximizing P(Y|X) is equivalent to establishing an accurate mapping between X and Y.
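As a concrete illustration of the CTC objective just described, the following minimal sketch trains one step of a CTC-based model (assuming PyTorch; the random network stand-in, tensor shapes, and vocabulary size are illustrative and not taken from the patent):

```python
import torch
import torch.nn as nn

# Illustrative sizes only: T audio frames, N utterances, C output symbols
# (index 0 is reserved for the CTC blank), U target labels per utterance.
T, N, C, U = 50, 4, 30, 10

# Frame-level log-probabilities, standing in for the output of an
# end-to-end acoustic network over the input sequence X = [x1, ..., xT].
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Target label sequences Y = [y1, ..., yU] for each utterance.
targets = torch.randint(1, C, (N, U), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

# Minimizing the CTC loss maximizes the output probability P(Y | X).
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```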
In practical applications, Chinese speech recognition based on an end-to-end model selects Chinese characters as the modeling units. However, polyphonic characters exist in Chinese: if Chinese characters are used directly as modeling units, then during training two occurrences of the same character with different pronunciations correspond to the same label, which makes model training inaccurate.
Referring to fig. 1, a flowchart illustrating steps of a training method for an end-to-end model provided in the embodiment of the present invention is shown, which specifically includes the following steps:
step 101, determining polyphonic words in a target training text;
step 102, performing phonetic notation on the polyphonic words and generating a corpus; the corpus comprises the polyphonic words and phonetic notation information for the polyphonic words;
step 103, acquiring audio information corresponding to the phonetic notation information;
and step 104, training an end-to-end model by adopting the corpus and the audio information, and generating an acoustic model.
In a specific implementation, the training text of the embodiment of the present invention may be text information including words, and the embodiment of the present invention may determine polyphonic words in the target training text first.
Illustratively, since the standard pronunciation of a character may change over time, a polyphone table may be constructed based on the most recently specified Chinese pronunciation standard (e.g., the latest edition of the Xinhua Dictionary), and the polyphonic words in the target training text can then be determined by matching the words in the target training text against the polyphone table.
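A minimal sketch of this lookup step follows (the table contents, candidate readings, and function name are hypothetical, assuming single-character polyphones matched directly against the text):

```python
# Hypothetical polyphone table: character -> candidate pinyin readings.
POLYPHONE_TABLE = {
    "的": ["de", "dí", "dì"],
    "了": ["le", "liǎo"],
    "行": ["xíng", "háng"],
}

def find_polyphones(text: str) -> list[tuple[int, str]]:
    """Return (position, character) pairs for every polyphone in the text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in POLYPHONE_TABLE]

print(find_polyphones("我的目的已经达到了"))
# [(1, '的'), (3, '的'), (8, '了')]
```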
After determining the polyphonic words in the target training text, the embodiment of the present invention may perform phonetic notation on them. For example, the character 的 reads de in 我的 ("my") but dí in 的确 ("indeed"), so the annotated results are 的(de) and 的(dí) respectively. The corpus is generated after the phonetic notation is completed.
In practical applications, a corpus is language material. The corpus of the embodiment of the present invention may include the polyphones in the target training text and the phonetic notation information for those polyphones. Optionally, the phonetic notation information may use Chinese pinyin or Chinese zhuyin; for example, a corpus entry may be 除了(le) in pinyin, or 除了(ㄌㄜ) in zhuyin.
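The annotation step itself can be sketched as follows. The inline "character(reading)" format mirrors the corpus examples above; the `annotate` helper and its `readings` interface are hypothetical:

```python
def annotate(text: str, readings: dict[int, str]) -> str:
    """Insert phonetic notation after each polyphone.

    `readings` maps a character position to the reading chosen for it,
    e.g. by a TTS front end or a human annotator (hypothetical interface);
    pinyin and zhuyin strings both work, since the notation is opaque here.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i in readings:
            out.append(f"({readings[i]})")
    return "".join(out)

print(annotate("我的", {1: "de"}))    # 我的(de)
print(annotate("除了", {1: "ㄌㄜ"}))  # 除了(ㄌㄜ)
```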
The embodiment of the present invention can acquire the audio information corresponding to the phonetic notation information. Specifically, this audio information is audio whose pronunciation matches the phonetic notation information; for example, if the phonetic notation of 了 in 除了 is (le), audio pronounced 除了(le) can be acquired locally or from a server. Of course, the audio information can also be acquired through manual entry; the embodiment of the present invention is not limited in this respect.
After the corpus is generated and the audio information corresponding to the phonetic notation information is obtained, the embodiment of the invention can train the end-to-end model by adopting the corpus and the audio information and generate the acoustic model.
For example, the corpus, which includes the polyphones and the phonetic notation information for them, may be converted into a text sequence vector, and the audio information corresponding to the phonetic notation information may be converted into an audio sequence vector. The text sequence vector expressing the corpus is input, together with the audio sequence vector, into the end-to-end model until the model converges, finally training the end-to-end model into an acoustic model. A text sequence vector expressing a corpus entry that carries phonetic notation information has its own independent label; that is, it does not correspond to the same label as other text sequence vectors expressing different pronunciations of the same character.
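The key point of this step is that an annotated polyphone becomes its own modeling unit with its own label. A minimal sketch of the label construction (the unit segmentation and vocabulary layout are assumptions for illustration):

```python
# Pre-segmented corpus units: each annotated polyphone is one unit, so
# 的(de) and 的(dì) receive different labels instead of sharing one.
corpus_units = ["我", "的(de)", "目", "的(dì)", "已", "经", "达", "到", "了(le)"]

# Index 0 is reserved for the CTC blank; every distinct unit gets a label.
vocab = {u: i for i, u in enumerate(sorted(set(corpus_units)), start=1)}

text_sequence = [vocab[u] for u in corpus_units]
print(vocab["的(de)"], vocab["的(dì)"])  # two distinct labels
print(text_sequence)
```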
In the embodiment of the invention, the polyphonic words in the target training text are determined; phonetic notation is performed on the polyphonic words and a corpus is generated, the corpus comprising the polyphonic words and the phonetic notation information for them; audio information corresponding to the phonetic notation information is acquired; and an end-to-end model is trained with the corpus and the audio information to generate an acoustic model, thereby improving the accuracy of training the end-to-end model.
On the basis of the above-described embodiment, a modified embodiment of the above-described embodiment is proposed, and it is to be noted herein that, in order to make the description brief, only the differences from the above-described embodiment are described in the modified embodiment.
In an optional embodiment of the invention, the method further comprises:
recognizing the audio information using the acoustic model and generating a recognition text; the recognition text comprises recognition-result phonetic notation information;
and deleting the recognition-result phonetic notation information.
In practical applications, the text generated when the acoustic model performs speech recognition may still include phonetic notation information; if this information is not processed, it causes data redundancy. Therefore, to prevent the recognition result from still containing phonetic notation information and avoid redundant recognition-result information, the embodiment of the present invention may delete the recognition-result phonetic notation information when the acoustic model is used to recognize audio information and generates a recognition text containing such information.
For example, suppose the target training text includes the character 的. After the end-to-end model is trained using the corpus and the audio information and the acoustic model is generated, audio pronounced dì may be provided to the acoustic model; when the acoustic model recognizes that audio, the generated recognition text is 的(dì). At this point, (dì) may be deleted, so that the recognition text 的 no longer includes the recognition-result phonetic notation information (dì).
In a specific implementation, when the acoustic model recognizes the audio information, it may first output a text sequence vector and generate the recognition text based on that vector; the text sequence vector output by the acoustic model includes a phonetic notation feature sequence vector for expressing the recognition-result phonetic notation information.
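A minimal sketch of the deletion step (the parenthesized notation format and the regular expression are assumptions based on the corpus examples above, covering ASCII letters, tone-marked pinyin vowels, and zhuyin letters):

```python
import re

# Matches parenthesized notation such as "(dì)" or "(ㄌㄜ)": ASCII letters,
# tone-marked pinyin vowels, and the Bopomofo block U+3105-U+312F.
NOTATION = re.compile(r"\([a-zA-Zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ\u3105-\u312F]+\)")

def strip_notation(recognized: str) -> str:
    """Delete recognition-result phonetic notation from the output text."""
    return NOTATION.sub("", recognized)

print(strip_notation("目的(dì)已经达到了(le)"))  # 目的已经达到了
```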
According to the embodiment of the invention, the audio information is recognized using the acoustic model and a recognition text is generated, the recognition text comprising recognition-result phonetic notation information; the recognition-result phonetic notation information is then deleted, which prevents the recognition result from still containing phonetic notation information and avoids redundant recognition-result information.
In an optional embodiment of the present invention, the end-to-end model has a corresponding text-to-speech module, and the step of performing phonetic notation on the polyphonic words further comprises:
and performing phonetic notation on the polyphonic words by adopting the text-to-speech module.
In practical applications, the embodiment of the present invention may configure a text-to-speech (TTS) module, also referred to as a front-end module, for the end-to-end model and annotate the polyphonic words through this TTS module. Specifically, the TTS module may first analyze the text structure, for example determining what language the polyphonic words in the target training text are in; after determining the language, the TTS module may segment the sentences in the target training text. After the text structure has been analyzed, the TTS module may normalize the target training text: in practical applications, text normalization converts punctuation or digits that are not Chinese characters into Chinese characters, for example converting the digits "666" in a sentence into their Chinese character reading. After the TTS module completes normalization, the polyphonic words in the target training text can be converted into phonemes, that is, converted into pinyin.
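As an illustration of this front-end annotation step, the sketch below uses the open-source pypinyin library as a stand-in for the TTS front end (the patent does not name a specific library, and the exact readings chosen depend on pypinyin's phrase dictionary):

```python
from pypinyin import pinyin, Style

def tts_annotate(text: str, polyphones: set[str]) -> str:
    """Annotate each polyphone with the context-dependent reading chosen
    by the grapheme-to-phoneme front end."""
    readings = pinyin(text, style=Style.TONE)  # one reading per character
    out = []
    for ch, reading in zip(text, readings):
        out.append(ch)
        if ch in polyphones:
            out.append(f"({reading[0]})")
    return "".join(out)

print(tts_annotate("我的目的", {"的"}))  # e.g. 我的(de)目的(dì)
```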
Of course, because polyphones exist in Chinese, the accuracy of the TTS module's phonetic notation cannot be guaranteed. Therefore, in an optional embodiment of the present invention, when the polyphonic words are annotated with the text-to-speech module and the result of a pronunciation test differs from the phonetic notation information, the text-to-speech module may be used again to annotate the polyphonic words and generate a new corpus. For example, suppose the target training text contains the word 的确 and the TTS module annotates 的 as (de). After the end-to-end model is trained with that corpus and an acoustic model is generated, text containing 的确 can be provided for a reading test; if the tested pronunciation is 的(de) rather than the expected 的(dí), the phonetic notation is judged to have failed, so the TTS module re-annotates the polyphone and the process repeats until the pronunciation of 的 in 的确 is determined to be (dí).
According to the embodiment of the invention, the polyphonic words are annotated using the text-to-speech module, thereby realizing automatic phonetic notation of the polyphonic words and improving the training efficiency of the end-to-end model.
In an optional embodiment of the present invention, the step of performing phonetic notation on the polyphonic words further comprises:
and performing phonetic notation on the polyphonic words based on a speech recognition alignment algorithm.
In a specific implementation, the embodiment of the invention can annotate polyphonic words based on a speech recognition alignment algorithm.
For example, a phone set for Chinese characters may first be established, including all initials and finals (a, aa, ai, an, ang, ao, b, c, ch, d, e, ee, ei, en, eng, er, f, g, h, i, ia, ian, iang, iao, ie, ii, in, ing, iong, iu, ix, iy, iz, j, k, l, m, n, o, ong, oo, ou, p, q, r, s, sh, t, u, ua, uai, uan, uang, ueng, ui, un, uo, uu, v, van, ve, vn, vv, x, z, zh), along with the mapping relationships between all Chinese characters and the initials and between all Chinese characters and the finals. When the audio is aligned with the target training text, the pronunciation of each polyphone can then be determined from the shortest alignment path through the candidate initial-and-final sequences, and the polyphone can be annotated accordingly.
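The selection among candidate readings can be sketched as a best-path score over frame-level phone posteriors (everything below, including the phone inventory, the posteriors, and the one-phone-per-frame alignment, is a simplified assumption; a real aligner would use Viterbi decoding over the full initial-final lattice):

```python
import numpy as np

# Hypothetical frame-level phone log-posteriors from an aligner; the
# phone inventory and values are illustrative only.
PHONES = ["d", "e", "i4"]
PHONE_INDEX = {p: i for i, p in enumerate(PHONES)}
log_post = np.log(np.array([
    [0.90, 0.05, 0.05],  # frame 1: strong evidence for initial "d"
    [0.10, 0.10, 0.80],  # frame 2: strong evidence for final "i4"
]))

# Candidate initial+final decompositions for the polyphone 的.
CANDIDATES = {"de": ["d", "e"], "di4": ["d", "i4"]}

def best_reading(candidates: dict[str, list[str]]) -> str:
    """Pick the reading whose initial-final sequence has the highest
    alignment score (equivalently, the shortest negative-log path)."""
    scores = {
        reading: sum(log_post[t, PHONE_INDEX[p]] for t, p in enumerate(seq))
        for reading, seq in candidates.items()
    }
    return max(scores, key=scores.get)

print(best_reading(CANDIDATES))  # di4
```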
According to the embodiment of the invention, the polyphonic words are annotated based on a speech recognition alignment algorithm, thereby realizing automatic phonetic notation of polyphonic words and improving the training efficiency of the end-to-end model.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 2, a structural block diagram of a training apparatus for an end-to-end model provided in the embodiment of the present invention is shown, and specifically, the structural block diagram may include the following modules:
a polyphonic word determination module 201, configured to determine polyphonic words in the target training text;
a corpus generating module 202, configured to perform phonetic notation on the polyphonic words and generate a corpus; the corpus comprises the polyphonic words and phonetic notation information for the polyphonic words;
an audio information obtaining module 203, configured to obtain audio information corresponding to the phonetic notation information;
an acoustic model generating module 204, configured to train an end-to-end model by using the corpus and the audio information, and generate an acoustic model.
Optionally, the acoustic model is configured to output a text sequence vector having a phonetic notation feature sequence vector for expressing the phonetic notation information, and the apparatus may further include:
the recognition text generation module is used for recognizing the audio information by using the acoustic model and generating a recognition text; the recognition text comprises recognition-result phonetic notation information;
and the recognition-result phonetic notation information deleting module is used for deleting the recognition-result phonetic notation information.
Optionally, the end-to-end model has a corresponding text-to-speech module, and the corpus generation module may further include:
and the first corpus generation submodule is used for performing phonetic notation on the polyphonic words by adopting the text-to-speech module.
Optionally, the corpus generating module may further include:
and the second corpus generation submodule is used for performing phonetic notation on the polyphonic words based on a speech recognition alignment algorithm.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
In addition, an embodiment of the present invention further provides an electronic device, including: the processor, the memory, and the computer program stored in the memory and capable of running on the processor, when executed by the processor, implement the processes of the above-mentioned training method embodiment for the end-to-end model, and can achieve the same technical effects, and in order to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the above-mentioned training method for an end-to-end model, and can achieve the same technical effect, and is not described here again to avoid repetition. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Fig. 3 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.
The electronic device 300 includes, but is not limited to: radio frequency unit 301, network module 302, audio output unit 303, input unit 304, sensor 305, display unit 306, user input unit 307, interface unit 308, memory 309, processor 310, and power supply 311. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 3 does not constitute a limitation of electronic devices, which may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 301 may be used for receiving and sending signals during a message transceiving or call process; specifically, it receives downlink data from a base station and forwards it to the processor 310 for processing, and it transmits uplink data to the base station. In general, the radio frequency unit 301 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 301 can also communicate with a network and other devices through a wireless communication system.
The electronic device provides the user with wireless broadband internet access via the network module 302, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 303 may convert audio data received by the radio frequency unit 301 or the network module 302 or stored in the memory 309 into an audio signal and output as sound. Also, the audio output unit 303 may also provide audio output related to a specific function performed by the electronic apparatus 300 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 303 includes a speaker, a buzzer, a receiver, and the like.
The input unit 304 is used to receive audio or video signals. The input unit 304 may include a Graphics Processing Unit (GPU) 3041 and a microphone 3042. The graphics processor 3041 processes image data of still pictures or video obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode, and the processed image frames may be displayed on the display unit 306. The image frames processed by the graphics processor 3041 may be stored in the memory 309 (or another storage medium) or transmitted via the radio frequency unit 301 or the network module 302. The microphone 3042 may receive sounds and process them into audio data. In the phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station and output via the radio frequency unit 301.
The electronic device 300 also includes at least one sensor 305, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that adjusts the brightness of the display panel 3061 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 3061 and/or the backlight when the electronic device 300 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 305 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 306 is used to display information input by the user or information provided to the user. The display unit 306 may include a display panel 3061, and the display panel 3061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 307 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 307 includes a touch panel 3071 and other input devices 3072. The touch panel 3071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 3071 using a finger, a stylus, or any suitable object or attachment). The touch panel 3071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch orientation and the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 310, and receives and executes commands sent by the processor 310. In addition, the touch panel 3071 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 3071, the user input unit 307 may include other input devices 3072. Specifically, the other input devices 3072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described here.
Further, the touch panel 3071 may be overlaid on the display panel 3061, and when the touch panel 3071 detects a touch operation on or near the touch panel, the touch operation is transmitted to the processor 310 to determine the type of the touch event, and then the processor 310 provides a corresponding visual output on the display panel 3061 according to the type of the touch event. Although the touch panel 3071 and the display panel 3061 are shown in fig. 3 as two separate components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 3071 and the display panel 3061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 308 is an interface for connecting an external device to the electronic apparatus 300. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 308 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 300 or may be used to transmit data between the electronic apparatus 300 and an external device.
The memory 309 may be used to store software programs as well as various data. The memory 309 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the phone (such as audio data, a phonebook, etc.). Further, the memory 309 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 310 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 309 and calling data stored in the memory 309, thereby integrally monitoring the electronic device. Processor 310 may include one or more processing units; preferably, the processor 310 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 310.
The electronic device 300 may further include a power supply 311 (such as a battery) for supplying power to various components, and preferably, the power supply 311 may be logically connected to the processor 310 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
In addition, the electronic device 300 includes some functional modules that are not shown, and are not described in detail herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of training an end-to-end model, comprising:
determining polyphonic words in the target training text;
performing phonetic notation on the polyphonic words and generating a corpus; the corpus comprises the polyphonic words and phonetic notation information for the polyphonic words;
acquiring audio information corresponding to the phonetic notation information;
and training an end-to-end model by adopting the corpus and the audio information, and generating an acoustic model.
2. The method of claim 1, further comprising:
recognizing the audio information using the acoustic model and generating a recognition text; the recognition text comprises recognition-result phonetic notation information;
and deleting the recognition-result phonetic notation information.
3. The method of claim 1, wherein the end-to-end model has a corresponding text-to-speech module, and wherein the step of performing phonetic notation on the polyphonic words further comprises:
and performing phonetic notation on the polyphonic words by adopting the text-to-speech module.
4. The method of claim 1, wherein the step of performing phonetic notation on the polyphonic words further comprises:
and performing phonetic notation on the polyphonic words based on a speech recognition alignment algorithm.
5. A training apparatus for an end-to-end model, comprising:
the polyphonic word determining module is used for determining polyphonic words in the target training text;
the corpus generating module is used for performing phonetic notation on the polyphonic words and generating a corpus; the corpus comprises the polyphonic words and phonetic notation information for the polyphonic words;
the audio information acquisition module is used for acquiring audio information corresponding to the phonetic notation information;
and the acoustic model generating module is used for training an end-to-end model by adopting the corpus and the audio information and generating an acoustic model.
6. The apparatus of claim 5, further comprising:
the recognition text generation module is used for recognizing the audio information by using the acoustic model and generating a recognition text; the recognition text comprises recognition-result phonetic notation information;
and the recognition-result phonetic notation information deleting module is used for deleting the recognition-result phonetic notation information.
7. The apparatus of claim 5, wherein the end-to-end model has a corresponding text-to-speech module, and wherein the corpus generation module further comprises:
and the first corpus generation submodule is used for performing phonetic notation on the polyphonic words by adopting the text-to-speech module.
8. The apparatus of claim 5, wherein the corpus generation module further comprises:
and the second corpus generation submodule is used for performing phonetic notation on the polyphonic words based on a speech recognition alignment algorithm.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor, when executing a program stored on the memory, implementing the method of any one of claims 1-4.
10. A computer-readable storage medium having stored thereon instructions, which when executed by one or more processors, cause the processors to perform the method of any one of claims 1-4.
CN202210760230.2A 2022-06-30 2022-06-30 Training method and device for end-to-end model, electronic equipment and storage medium Pending CN115148196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210760230.2A CN115148196A (en) 2022-06-30 2022-06-30 Training method and device for end-to-end model, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115148196A true CN115148196A (en) 2022-10-04

Family

ID=83409691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210760230.2A Pending CN115148196A (en) 2022-06-30 2022-06-30 Training method and device for end-to-end model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115148196A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination