CN114078470A - Model processing method and device, and voice recognition method and device


Info

Publication number
CN114078470A
Authority
CN
China
Prior art keywords: target, model, word, slot, name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010825574.8A
Other languages
Chinese (zh)
Inventor
李威
朱海
魏娟
郑昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010825574.8A priority Critical patent/CN114078470A/en
Publication of CN114078470A publication Critical patent/CN114078470A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/26 Speech to text systems
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/173 Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/12 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use

Abstract

The embodiments of this specification provide a model processing method and apparatus and a speech recognition method and apparatus. One implementation of the model processing method includes the following steps: acquiring at least one training sample, where the training sample includes speech information containing a word of a target category and an annotation text, the annotation text representing the semantics of the speech information and carrying a slot mark corresponding to the target category, the slot mark being added at the original appearance position of the target-category word in the annotation text; and training an end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for speech recognition and has a slot prediction function.

Description

Model processing method and device, and voice recognition method and device
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to a model processing method and apparatus, a speech recognition method and apparatus, and an interaction device.
Background
Existing speech recognition solutions are generally based on deep learning and adopt end-to-end (end2end) modeling. End-to-end speech recognition works well in general scenarios, but performs poorly in scenarios involving words such as person names and place names.
Therefore, a reasonable and reliable scheme is needed to improve recognition in scenarios where person names, place names, and similar words are spoken.
Disclosure of Invention
The embodiment of the specification provides a model processing method and device, a voice recognition method and device and interaction equipment.
In a first aspect, an embodiment of the present specification provides a model processing method, including obtaining at least one training sample, where the training sample includes speech information including a word of a target category and a tagged text, where the tagged text is used to represent semantics of the speech information and is added with a slot position mark corresponding to the target category, and the slot position mark is added to an original appearance position of the word of the target category in the tagged text; and training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a slot position prediction function.
In some embodiments, the target categories include at least one of the following categories: name of person, place name, organization name.
In some embodiments, when the target category includes a person name, the person-name slot mark added in the annotation text is used to indicate that a person name should appear at the position it occupies; when the target category includes a place name, the place-name slot mark added in the annotation text is used to indicate that a place name should appear at the position it occupies; and when the target category includes an organization name, the organization-name slot mark added in the annotation text is used to indicate that an organization name should appear at the position it occupies.
In some embodiments, a dictionary adopted by the end-to-end model to be trained is added with a slot label corresponding to the target class.
In some embodiments, the end-to-end model to be trained comprises a natural language processing model based on a self-attention mechanism and employing an encoder-decoder architecture.
In some embodiments, the annotation text is text in the form of a word sequence; and training the end-to-end model to be trained according to the at least one training sample includes: training the end-to-end model to be trained by taking the speech information included in each of the at least one training sample as input and the annotation text corresponding to the speech information as the label.
In some embodiments, the annotation text is text that has not been segmented into words; and training the end-to-end model to be trained according to the at least one training sample includes: for each of the at least one training sample, performing word segmentation on the annotation text included in the training sample, and forming the words obtained by segmentation into a word sequence; and training the end-to-end model to be trained by taking the speech information included in each of the at least one training sample as input and the word sequence corresponding to the speech information as the label.
In a second aspect, the present specification provides a speech recognition method applied to an optimization processor in a speech recognition system, where the speech recognition system further includes a target end-to-end model with a slot prediction function for speech recognition, and the method includes: obtaining a prediction result output by the target end-to-end model, wherein the prediction result comprises a plurality of pieces of text information; in response to reading the slot position marks corresponding to the target categories at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the slot position marks and appear behind the slot position marks from the plurality of pieces of text information; determining a first score corresponding to the extracted word according to the slot position mark; and determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the slot position mark.
In some embodiments, the target end-to-end model includes a target end-to-end model with a slot prediction function for speech recognition, which is trained by the implementation manner in the first aspect.
In some embodiments, the determining, according to the slot position marker, a first score corresponding to the extracted word includes: and determining a first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark.
In some embodiments, the scoring model includes a pre-established data mapping table for characterizing a correspondence between words in the target category and the first score; and determining a first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark, wherein the determining comprises the following steps: and searching records comprising the extracted words in the scoring model, and determining the first score in the searched records as the first score corresponding to the extracted words.
In some embodiments, the scoring model includes a pre-trained prediction model for predicting a first score corresponding to a word in a target category; and determining a first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark, wherein the determining comprises the following steps: and inputting the extracted words into the scoring model to obtain a first score output by the scoring model.
In some embodiments, the prediction result further includes second scores respectively corresponding to the pieces of text information; and determining a target word from the extracted words according to the determined first score includes: for each extracted word, determining a screening score corresponding to the word according to the first score corresponding to the word and the second score corresponding to the text information where the word is located; and determining the target word from the extracted words according to the determined screening scores.
In some embodiments, the scoring model corresponds to a preset score factor; and determining the screening score corresponding to the word according to the first score corresponding to the word and the second score corresponding to the text information where the word is located includes: determining the product of the first score corresponding to the word and the score factor; and determining the sum of the product and the second score corresponding to the text information where the word is located as the screening score corresponding to the word.
In a third aspect, an embodiment of the present specification provides a model processing method, including obtaining at least one training sample, where the training sample includes voice information including a name of a person, and a tagged text, where the tagged text is used to represent semantics of the voice information and is added with a name slot tag, and the name slot tag is added to an original appearance position of a word belonging to the name of the person in the tagged text; and training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a name slot position prediction function.
In a fourth aspect, an embodiment of the present specification provides a speech recognition method applied to an optimization processor in a speech recognition system, where the speech recognition system further includes a target end-to-end model with a name slot prediction function for speech recognition, and the method includes: obtaining a prediction result output by the target end-to-end model, wherein the prediction result comprises a plurality of pieces of text information; in response to reading the name slot position marks at the same positions of the text messages, respectively extracting words which are adjacent to the name slot position marks and appear behind the name slot position marks from the text messages; determining a first score corresponding to the extracted word according to the name slot position mark; and determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the name slot position mark.
In some embodiments, the target end-to-end model includes a target end-to-end model trained by the implementation manner in the third aspect and used for voice recognition and having a name slot prediction function.
In some embodiments, the determining, according to the name slot position mark, a first score corresponding to the extracted word includes: and determining a first score corresponding to the extracted word by using the scoring model corresponding to the name slot position mark.
In a fifth aspect, an embodiment of the present specification provides a model processing method, including: acquiring at least one training sample, wherein the training sample comprises text information including words of a target category and slot position marking information, and the slot position marking information shows the appearance position of the words of the target category and slot position marks corresponding to the target category; and training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model for slot position prediction.
In a sixth aspect, an embodiment of the present specification provides a speech recognition method applied to an optimization processor in a speech recognition system, where the speech recognition system further includes a speech recognition model and a target end-to-end model for slot prediction, and the method includes: acquiring a first prediction result output by the voice recognition model, wherein the first prediction result comprises a plurality of pieces of text information; obtaining a second prediction result output by the target end-to-end model, wherein the second prediction result is obtained by performing slot position prediction on the first prediction result; responding to the second prediction result showing the appearance position of the word in the target category and the slot position mark corresponding to the target category, reading the word in the appearance position from the plurality of pieces of text information, and determining a first score corresponding to the read word in the appearance position according to the slot position mark; and according to the determined first score, determining a target word from the read words located at the appearance position, wherein the target word is used as a recognition result corresponding to the appearance position.
In some embodiments, the target end-to-end model comprises a target end-to-end model trained using the implementation in the fifth aspect for slot prediction.
In some embodiments, the determining, according to the slot position marker, a first score corresponding to the read word located at the occurrence position includes: and determining a first score corresponding to the read word positioned at the appearance position by utilizing a scoring model corresponding to the slot position mark.
In a seventh aspect, an embodiment of the present specification provides a model processing apparatus, including an obtaining unit, configured to obtain at least one training sample, where the training sample includes speech information including a word of a target category, and a tagged text, where the tagged text is used to represent the semantics of the speech information and is added with a slot tag corresponding to the target category, and the slot tag is added to an original appearance position of the word of the target category in the tagged text; and the model training unit is configured to train the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a slot position prediction function.
In an eighth aspect, an embodiment of the present specification provides a speech recognition apparatus applied to an optimization processor in a speech recognition system, where the speech recognition system further includes a target end-to-end model with a slot prediction function for speech recognition, and the apparatus includes: an obtaining unit configured to obtain a prediction result output by the target end-to-end model, the prediction result including a plurality of pieces of text information; an extracting unit configured to extract, from the plurality of pieces of text information, words that are adjacent to and appear after slot marks, respectively, in response to reading the slot marks corresponding to target categories at the same positions of the plurality of pieces of text information; the score determining unit is configured to determine a first score corresponding to the extracted word according to the slot position mark; and the identification result determining unit is configured to determine a target word from the extracted words according to the determined first score, wherein the target word is used as an identification result corresponding to the position occupied by the slot position mark.
In some embodiments, the target end-to-end model includes a target end-to-end model with a slot prediction function for speech recognition, which is trained by the implementation manner in the first aspect.
In some embodiments, the score determination unit is further configured to: and determining a first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark.
In a ninth aspect, an embodiment of the present specification provides a model processing apparatus, including an obtaining unit, configured to obtain at least one training sample, where the training sample includes voice information including a name of a person, and a tagged text, where the tagged text is used to represent semantics of the voice information and is added with a name slot mark, and the name slot mark is added to an original appearance position of a word belonging to the name of the person in the tagged text; and the model training unit is configured to train the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a name slot position prediction function.
In a tenth aspect, an embodiment of the present specification provides a speech recognition apparatus, applied to an optimization processor in a speech recognition system, where the speech recognition system further includes a target end-to-end model with a name slot prediction function for speech recognition, and the apparatus includes: an obtaining unit configured to obtain a prediction result output by the target end-to-end model, the prediction result including a plurality of pieces of text information; an extracting unit configured to extract, from the plurality of pieces of text information, words that are adjacent to and appear after the name slot mark positions, respectively, in response to reading the name slot mark at the same position of the plurality of pieces of text information; the score determining unit is configured to determine a first score corresponding to the extracted word according to the name slot position mark; and the identification result determining unit is configured to determine a target word from the extracted words according to the determined first score, wherein the target word is used as an identification result corresponding to the position occupied by the name slot mark.
In some embodiments, the target end-to-end model includes a target end-to-end model trained by the implementation manner in the third aspect and used for voice recognition and having a name slot prediction function.
In some embodiments, the score determination unit is further configured to: and determining a first score corresponding to the extracted word by using the scoring model corresponding to the name slot position mark.
In an eleventh aspect, an embodiment of the present specification provides a model processing apparatus, including: an obtaining unit configured to obtain at least one training sample, where the training sample includes text information including a word of a target category and slot position marking information, and the slot position marking information shows an appearance position of the word of the target category and a slot position mark corresponding to the target category; and the model training unit is configured to train the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model for slot position prediction.
In a twelfth aspect, an embodiment of the present specification provides a speech recognition apparatus applied to an optimization processor in a speech recognition system, where the speech recognition system further includes a speech recognition model and a target end-to-end model for slot prediction, and the apparatus includes: a first acquisition unit configured to acquire a first prediction result output by the speech recognition model, the first prediction result including a plurality of pieces of text information; a second obtaining unit configured to obtain a second prediction result output by the target end-to-end model, the second prediction result being obtained by performing slot prediction on the first prediction result; a score determining unit configured to determine, in response to the second prediction result showing an appearance position of a word of a target category and a slot mark corresponding to the target category, and reading a word located at the appearance position among the plurality of pieces of text information, a first score corresponding to the read word located at the appearance position according to the slot mark; and the recognition result determining unit is configured to determine a target word from the read words located at the appearance positions according to the determined first score, wherein the target word is used as a recognition result corresponding to the appearance positions.
In some embodiments, the target end-to-end model comprises a target end-to-end model trained using the implementation in the fifth aspect for slot prediction.
In some embodiments, the score determination unit is further configured to: and determining a first score corresponding to the read word positioned at the appearance position by utilizing a scoring model corresponding to the slot position mark.
In a thirteenth aspect, an embodiment of the present specification provides an interaction device, including an optimization processor; the optimization processor is configured to: obtaining a prediction result output by a target end-to-end model, wherein the target end-to-end model is used for voice recognition and has a slot position prediction function, and the prediction result comprises a plurality of pieces of text information; in response to reading the slot position marks corresponding to the target categories at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the slot position marks and appear behind the slot position marks from the plurality of pieces of text information; determining a first score corresponding to the extracted word according to the slot position mark; and determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the slot position mark.
In a fourteenth aspect, an embodiment of the present specification provides an interaction device, including an optimization processor; the optimization processor is configured to: acquiring a first prediction result output by a voice recognition model, wherein the first prediction result comprises a plurality of pieces of text information; obtaining a second prediction result output by a target end-to-end model, wherein the target end-to-end model is used for slot position prediction, and the second prediction result is obtained by performing slot position prediction on the first prediction result; responding to the second prediction result showing the appearance position of the word in the target category and the slot position mark corresponding to the target category, reading the word in the appearance position from the plurality of pieces of text information, and determining a first score corresponding to the read word in the appearance position according to the slot position mark; and according to the determined first score, determining a target word from the read words located at the appearance position, wherein the target word is used as a recognition result corresponding to the appearance position.
In a fifteenth aspect, the present specification provides a computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed in a computer, the computer is caused to execute the method described in any implementation manner of the first to sixth aspects.
In a sixteenth aspect, the present specification provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method as described in any one of the implementation manners of the first aspect to the sixth aspect.
In the model processing method and apparatus provided by the above embodiments of this specification, at least one training sample is obtained, each sample including speech information containing a word of a target category and an annotation text that represents the semantics of the speech information and carries a slot mark corresponding to the target category, the slot mark being placed at the original position where the target-category word appears in the annotation text; an end-to-end model to be trained is then trained according to the at least one training sample, yielding a target end-to-end model which is used for speech recognition and has a slot prediction function. In a speech recognition scenario, when a slot mark corresponding to the target category appears in the prediction result output by the target end-to-end model, the optimization processor can locate the target-category word to be optimized according to the slot mark and recognize it accurately. This improves the recognition accuracy for words of the target category, reduces the false alarm rate, and therefore improves the speech recognition effect for speech information that includes words of the target category.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in this specification, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments disclosed in this specification, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is an exemplary system architecture diagram to which some embodiments of the present description may be applied;
FIG. 2 is a flow diagram for one embodiment of a model processing method in accordance with the present description;
FIG. 3a is a schematic diagram of adding a slot marker to text information to be marked;
FIG. 3b is another schematic diagram of adding a slot marker to the text information to be marked;
FIG. 4 is a flow diagram of yet another embodiment of a model processing method according to the present description;
FIG. 5 is a flow diagram for one embodiment of a speech recognition method according to the present description;
FIG. 6 is a flow diagram of one embodiment of a method of determining a recognition result corresponding to a position occupied by a slot marker corresponding to a target category;
FIG. 7 is a flow diagram of yet another embodiment of a speech recognition method according to the present description;
FIG. 8 is a diagram of yet another exemplary system architecture to which some embodiments of the present description may be applied;
FIG. 9 is a flow diagram of yet another embodiment of a model processing method according to the present description;
FIG. 10 is a flow diagram of yet another embodiment of a speech recognition method according to the present description;
FIG. 11 is a schematic view of a model processing apparatus according to the present description;
fig. 12 is a schematic view of a structure of a speech recognition apparatus according to the present specification;
FIG. 13 is a schematic view of yet another configuration of a model processing apparatus according to the present description;
FIG. 14 is a schematic view of yet another arrangement of a speech recognition apparatus according to the present description;
FIG. 15 is a schematic view of yet another configuration of a model processing apparatus according to the present description;
fig. 16 is a schematic view of still another structure of the speech recognition apparatus according to the present specification;
FIG. 17 is a schematic diagram of a scenario of an interaction device according to the present description;
FIG. 18 is a schematic diagram of another scenario of an interaction device according to the present description.
Detailed Description
The present specification will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the scope of protection of this application.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present description may be combined with each other without conflict.
As described above, end-to-end speech recognition works well in general scenarios, but performs poorly in scenarios involving words such as person names and place names.
The inventors found that when the recognition of a certain kind of word needs to be optimized, that optimization can be integrated into the training of the end-to-end model, so that the strong modeling capability of the end-to-end model can be leveraged to improve recognition. Because the model itself can directly output information telling the optimization processor when a word should be optimized, recognition efficiency can be effectively improved.
Specifically, when the recognition of a certain type of word is to be improved, a corresponding slot mark may be set for that type of word in advance. For example, to improve recognition of person names, a person-name slot mark may be set in advance; to improve recognition of place names, a place-name slot mark may be set in advance. Such a type of word is referred to here as a target category. In the training data preparation phase, the slot mark corresponding to the target category is associated with the positions at which target-category words appear in the text information serving as training data. In this way, during training, the end-to-end model acquires a slot prediction function through the slot mark corresponding to the target category, that is, it can predict where words of the target category appear. Subsequently, the optimization processor can accurately recognize words of the target category according to the slot mark shown in the prediction result output by the end-to-end model and the position corresponding to that mark.
In addition, the inventors found that a target end-to-end model with a slot prediction function can be trained through either of the following two model processing schemes, and the optimization processor can accurately recognize words of the target category based on the prediction result output by the target end-to-end model.
Model processing scheme one: slot mark prediction is fused into the training stage of the end-to-end model for speech recognition, so that the trained target end-to-end model, when receiving speech information that includes a word of a target category, outputs text information that represents the semantics of the speech information and carries the slot mark corresponding to the target category. The slot mark is added at the predicted appearance position of the target-category word. In a speech recognition scenario, the prediction result output by the target end-to-end model serves as the input of the optimization processor.
Model processing scheme two: an end-to-end model dedicated to slot prediction is trained separately. The trained target end-to-end model receives the prediction result output by a speech recognition model, performs slot prediction on that result, and outputs a corresponding slot prediction result. In a speech recognition scenario, the prediction results output by the speech recognition model and the target end-to-end model both serve as inputs of the optimization processor.
In the following, the details related to the first model processing scheme will be described.
Some embodiments of the present specification disclose a model processing method and a speech recognition method respectively associated with a first model processing scheme. In particular, FIG. 1 illustrates an exemplary system architecture diagram suitable for use with these embodiments.
As shown in FIG. 1, the system architecture includes a model training system and a speech recognition system. The model training system trains an end-to-end model to be trained according to training samples corresponding to speech information that includes words of the target category, so as to obtain a target end-to-end model which is used for speech recognition and has a slot prediction function. It should be noted that, in addition to the training samples corresponding to speech information that includes words of the target category, training samples corresponding to speech information that does not include words of the target category may also be used.
The target categories may include name-related categories. Further, the target categories may include at least one of the following: person name, place name, organization name, animal name, audio name, video name, and so on. Audio names may include, but are not limited to, song names. Video names may include, but are not limited to, names of at least one of the following kinds of video: live video, television series, movies, variety shows, and the like.
In practice, the target end-to-end model is part of a speech recognition system. In addition, the speech recognition system further comprises an optimization processor. In a speech recognition scenario, speech information to be recognized may be input into the target end-to-end model. The target end-to-end model can perform speech recognition on the speech information and output a prediction result obtained through the speech recognition to the optimization processor. The optimization processor can optimize the prediction result and output the voice recognition result obtained after the optimization processing.
It should be noted that, when the speech information to be recognized is not speech information including a word of a target category, the prediction result output by the target end-to-end model is similar to the prediction result output by the existing end-to-end model for speech recognition, and the optimization processor may perform optimization processing on the prediction result by using a conventional processing method.
When the speech information to be recognized includes a word of the target category, the prediction result output by the target end-to-end model differs from that of an existing end-to-end model for speech recognition: the text information in the prediction result carries slot marks that help the optimization processor accurately recognize the words of the target category.
The following describes specific implementation steps of the above method with reference to specific examples.
Referring to FIG. 2, a flow 200 of one embodiment of a model processing method is shown. The implementation subject of the method may be the model training system shown in fig. 1. The method comprises the following steps:
step 201, obtaining at least one training sample, where the training sample includes voice information including words of a target category and a tagged text, the tagged text is used for representing semantics of the voice information and is added with a slot position mark corresponding to the target category, and the slot position mark is added at an original appearance position of the words of the target category in the tagged text;
step 202, training an end-to-end model to be trained according to at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a slot position prediction function.
The following describes the steps 201-202 in detail.
In step 201, the speech information included in each of the at least one training sample may be spectral information, and the spectral information may be extracted by using a feature extraction method such as Mel-Frequency Cepstral Coefficients (MFCCs) or Linear Predictive Cepstral Coefficients (LPCCs).
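As an illustration only, the following sketch shows one way such spectral features might be extracted; the use of the librosa library, the sampling rate, and the number of coefficients are assumptions for illustration and are not specified by this embodiment.

# Hedged sketch: extract MFCC features from one utterance. librosa and all
# parameter values here are illustrative assumptions, not part of this disclosure.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=40):
    """Return a (num_frames, n_mfcc) matrix of spectral features."""
    audio, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one feature vector per frame

features = extract_mfcc("utterance.wav")  # hypothetical file path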
The at least one training sample may be pre-generated and stored in a specific storage location, for example, locally on the execution subject or on another server communicatively connected to the execution subject. The execution subject may obtain the at least one training sample from that storage location.
In practice, the slot marks in the labeling text included in each of the at least one training sample may be manually labeled, or may be automatically labeled by using a specific tool (e.g., a tool for named entity recognition). For example, when it is desired to identify a named entity (e.g., a person name, a place name, or an organization name), a category of the named entity and a slot tag corresponding to the category may be set in the tool. And then the tool can identify a word belonging to the category from the text to be labeled according to the category, and adds a slot mark corresponding to the category in front of the word, so that the slot mark occupies the original appearance position of the word in the text.
Take the target category being a person name and the person-name slot mark being "CLASS_PERSON" as an example. FIG. 3a shows a schematic diagram of adding a slot mark to text information to be annotated. In FIG. 3a, the text information to be annotated is "my name is Xiaoming", where "Xiaoming" is a person name. The person-name slot mark "CLASS_PERSON" can be added in front of (for example, to the left of) "Xiaoming" by manual labeling or automatic labeling. As shown in FIG. 3a, the annotation text after adding the person-name slot mark may be "my name is CLASS_PERSON Xiaoming", where "CLASS_PERSON" occupies the original appearance position of the name "Xiaoming". "CLASS_PERSON" may be used to indicate that a person name should appear at the position it occupies.
Next, take the target category being a place name and the place-name slot mark being "CLASS_PLACE" as an example. FIG. 3b shows another schematic diagram of adding a slot mark to text information to be annotated. In FIG. 3b, the text information to be annotated is "I live in Beijing", where "Beijing" is a place name. The place-name slot mark "CLASS_PLACE" can be added in front of (for example, to the left of) "Beijing" by manual labeling or automatic labeling. As shown in FIG. 3b, the annotation text after adding the place-name slot mark may be "I live in CLASS_PLACE Beijing", where "CLASS_PLACE" occupies the original appearance position of the place name "Beijing". "CLASS_PLACE" may be used to indicate that a place name should appear at the position it occupies.
It should be understood that when the target category includes a person name, the person-name slot mark added in the annotation text may be used to indicate that a person name should appear at the position it occupies. When the target category includes a place name, the place-name slot mark added in the annotation text may be used to indicate that a place name should appear at the position it occupies. When the target category includes an organization name, the organization-name slot mark added in the annotation text may be used to indicate that an organization name should appear at the position it occupies. When the target category includes other categories, what the corresponding slot marks added in the annotation text indicate can be obtained by analogy with the above and is not described in detail here.
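Purely as an illustration of the labeling described above, the following sketch inserts a slot mark in front of each target-category word so that the mark occupies the word's original appearance position. The SLOT_MARKS table, the entity spans, and the example sentence are assumptions for illustration; in practice the spans could come from a named entity recognition tool or from manual labeling.

# Illustrative sketch: place a slot mark in front of each target-category word
# so the mark sits at the word's original appearance position in the text.
SLOT_MARKS = {"person": "CLASS_PERSON", "place": "CLASS_PLACE"}  # assumed table

def add_slot_marks(text, entities):
    """entities: list of (start, end, category) character spans in the text."""
    out, cursor = [], 0
    for start, end, category in sorted(entities):
        out.append(text[cursor:start])
        out.append(SLOT_MARKS[category] + " ")   # mark occupies the word's position
        out.append(text[start:end])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# "My name is Xiaoming" with "Xiaoming" tagged as a person name:
print(add_slot_marks("My name is Xiaoming", [(11, 19, "person")]))
# -> "My name is CLASS_PERSON Xiaoming"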
Alternatively, the at least one training sample may be a part of a pre-collected set of training samples. The training sample set may also include training samples corresponding to speech information that does not include words of the target category. The training sample includes speech information that does not include words of the target class, and text information that characterizes semantics of the speech information. It should be appreciated that the training sample set may be used for training of the end-to-end model to be trained. On the basis of training the end-to-end model to be trained according to the training sample corresponding to the voice information of the word including the target category, the end-to-end model to be trained can also be trained according to the training sample corresponding to the voice information of the word not including the target category.
In step 202, the executing entity may train an end-to-end model to be trained according to the at least one training sample, so as to obtain a target end-to-end model for speech recognition and having a slot prediction function.
The end-to-end model to be trained may be an untrained model or a model whose training has not yet been completed. In addition, the end-to-end model to be trained may be any kind of model that is suitable for speech recognition and adopts an end-to-end architecture. Further, the end-to-end model to be trained may include, but is not limited to, a Transformer-based model, which may be referred to as a Transformer model. The Transformer model is a natural language processing model based on the self-attention mechanism and employing an encoder-decoder architecture. It typically processes all words or symbols in a sequence in parallel, while using self-attention to incorporate context from more distant words: all words are processed in parallel, and over a number of processing steps each word attends to the other words in the sentence. In addition, the Transformer model generally trains quickly and performs well.
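The following is a minimal sketch of an encoder-decoder model of this general kind; the use of PyTorch's nn.Transformer and all dimensions are assumptions made for illustration and do not describe the specific model of this embodiment.

# Minimal encoder-decoder Transformer sketch (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    def __init__(self, feat_dim=40, vocab_size=5000, d_model=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)       # acoustic frames -> model dim
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # output tokens, incl. slot marks
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frames, tokens):
        memory_in = self.feat_proj(frames)   # (batch, num_frames, d_model)
        target_in = self.tok_embed(tokens)   # (batch, num_tokens, d_model)
        hidden = self.transformer(memory_in, target_in)
        return self.out(hidden)              # per-token vocabulary logits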
It should be noted that the slot mark corresponding to the target category is added to the dictionary used by the end-to-end model to be trained, so that the end-to-end model to be trained can learn slot prediction.
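For instance, extending the dictionary might look like the following sketch; the base vocabulary shown is a made-up assumption, the point being only that the slot marks become ordinary entries the model can predict.

# Sketch: add the slot marks to the model's output dictionary as ordinary tokens.
# The base vocabulary here is an illustrative assumption.
vocab = ["<pad>", "<sos>", "<eos>", "my", "name", "is", "live", "in"]
slot_marks = ["CLASS_PERSON", "CLASS_PLACE"]
vocab.extend(mark for mark in slot_marks if mark not in vocab)
token_to_id = {token: idx for idx, token in enumerate(vocab)}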
In practice, the labeled text included in each of the training samples may be a text in the form of a word sequence, or a text without word segmentation.
When the labeled text included in each of the at least one training sample is a text in a word sequence form, the speech information included in each of the at least one training sample may be used as an input, and the labeled text corresponding to the speech information may be used as a label to train the end-to-end model to be trained.
When the annotation texts included in the at least one training sample are texts that have not been segmented into words, for each of the at least one training sample, the annotation text included in the training sample may be segmented, and the resulting words combined into a word sequence. Then, the speech information included in each of the at least one training sample is used as input, and the word sequence corresponding to the speech information is used as the label, to train the end-to-end model to be trained. In practice, existing text segmentation techniques can be used to segment the annotation text; the specific segmentation method is not described in detail here.
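As a sketch of this preprocessing step, assuming jieba as the segmenter (an assumption; any word segmenter would serve) and assuming the slot marks are separated from the surrounding text by spaces, as in the figures above:

# Sketch: turn an unsegmented annotation text into a word-sequence label,
# keeping slot marks as single, atomic tokens. jieba is an assumed segmenter.
import jieba

SLOT_MARKS = {"CLASS_PERSON", "CLASS_PLACE"}

def to_word_sequence(annotation_text):
    words = []
    for chunk in annotation_text.split():
        if chunk in SLOT_MARKS:
            words.append(chunk)              # slot marks stay whole
        else:
            words.extend(jieba.lcut(chunk))  # ordinary text is segmented
    return words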
In the model processing method provided by this embodiment, the target end-to-end model used for speech recognition and having a slot position prediction function is obtained by obtaining at least one training sample as described above and training the end-to-end model to be trained according to the at least one training sample. Therefore, the prediction result output by the target end-to-end model can help the optimization processor to accurately identify the words of the target category, improve the identification accuracy of the words of the target category, and reduce the false alarm rate, so that the voice identification effect of the voice information of the words comprising the target category can be improved.
Next, the application of the scheme provided by the embodiment corresponding to fig. 2 in the scenario of saying the name of a person is described.
In a scenario where a person name is spoken, the target category is the person name. Before model training, a person-name slot mark may be set. In addition, at least one piece of speech information in which a person name is spoken (for example, spectral information obtained by preprocessing) may be collected in advance, and corresponding text information may be generated for the collected speech information. The text information represents the semantics of the corresponding speech information. Then, a person-name slot mark may be added, manually or automatically, in front of the words belonging to person names in the generated text information. It should be understood that the person-name slot mark is located at the original appearance position of the word in the text information. Here, the text information with the person-name slot mark added is referred to as the annotation text.
Then, for each piece of speech information in the collected at least one piece of speech information, the speech information and the labeled text may be combined into a training sample. In addition, the composed training samples may also be stored to a specific storage location for model training.
FIG. 4 shows a flow 400 of yet another embodiment of a model processing method. The execution subject of the method may be the model training system shown in FIG. 1. The method comprises the following steps:
step 401, obtaining at least one training sample, where the training sample includes voice information including a name and a tagged text, the tagged text is used for representing semantics of the voice information and is added with a name slot mark, and the name slot mark is added at an original appearance position of a word belonging to the name in the tagged text;
step 402, training an end-to-end model to be trained according to at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a name slot position prediction function.
In the scenario of speaking a person name, the target end-to-end model trained through steps 401 and 402 is used for speech recognition and has a person-name slot prediction function. The prediction result output by the target end-to-end model can help the optimization processor accurately recognize person names, improve recognition accuracy for person names, and reduce the false alarm rate, thereby improving the speech recognition effect for speech information that includes person names.
It should be noted that, from this application of the model processing method to the scenario of speaking a person name, those skilled in the art can derive by analogy its application to scenarios in which words of other categories are spoken; no further examples are given here.
After the target end-to-end model for speech recognition and with the slot prediction function is obtained by training using the scheme provided by the embodiment corresponding to fig. 2, the model may be applied to a speech recognition system, so that a prediction result output by the model is used as an input of an optimization processor in the speech recognition system.
In practice, the prediction result output by the target end-to-end model typically comprises a plurality of pieces of textual information, which have the same length. The optimization processor needs to read the words located at the same position from the plurality of pieces of text information in a specified reading order (e.g., left-to-right, right-to-left, etc.), and determine the words that serve as the recognition result from the read words. It should be noted that, for the words in the prediction result that are not related to the slot position mark, a conventional processing method may be used to perform the optimization processing. For words related to slot markers, the optimization process may be performed using the flow shown in fig. 5.
FIG. 5 illustrates a flow 500 of one embodiment of a speech recognition method. The execution subject of the speech recognition method may be an optimization processor in the speech recognition system as shown in fig. 1. The voice recognition method specifically shows an optimization processing process of words related to slot position marks, and comprises the following steps:
step 501, obtaining a prediction result output by a target end-to-end model, wherein the prediction result comprises a plurality of pieces of text information;
step 502, in response to reading the slot position marks corresponding to the target categories at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the positions of the slot position marks and appear behind the slot position marks from the plurality of pieces of text information;
step 503, determining a first score corresponding to the extracted word according to the slot position mark;
step 504, determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the slot position mark.
In this embodiment, to distinguish the score corresponding to a word from the score corresponding to a piece of text information mentioned below, the score corresponding to the word is referred to as the first score, and the score corresponding to the text information is referred to as the second score. It should be noted that both the first score and the second score may be values in the interval [0, 1].
In practice, each target category may be provided with a corresponding scoring model. The slot indicia corresponding to the target category may include an identification of the scoring model. The scoring model may be used to characterize the correspondence between the words under their corresponding target categories and the first score. Specifically, the scoring model may include a pre-established data mapping table for characterizing the correspondence between the words in the target category and the first score. Alternatively, the scoring model may include a pre-trained predictive model for predicting a first score corresponding to a word under the target category.
Next, the detailed description of step 501-504 is provided.
In step 501, the target end-to-end model may include a target end-to-end model with a slot prediction function for speech recognition, which is trained by the method described in the corresponding embodiment of fig. 2. If the prediction result is obtained by performing speech recognition on the speech information including the words of the target category, the slot marks corresponding to the target category may be added to the plurality of pieces of text information in the prediction result, respectively, and the positions occupied by the slot marks are the predicted positions where the words of the target category appear. If the prediction result is obtained by performing voice recognition on the voice information not including the word of the target category, the slot position mark corresponding to the target category is not added to the plurality of pieces of text information in the prediction result.
In step 502, in the process of optimizing the prediction result, the execution subject may extract, from the plurality of pieces of text information, the words that appear adjacent to and after the slot mark position, in response to reading the slot mark corresponding to the target category at the same position of the plurality of pieces of text information in the prediction result. When a left-to-right reading order is adopted, the word appearing after the slot mark refers to the word located on the right side of the slot mark. When a right-to-left reading order is adopted, the word appearing after the slot mark refers to the word located on the left side of the slot mark.
As an example, assume that the prediction result contains the following two pieces of text information: "my name is CLASS_PERSON Xiaoming" and "my name is CLASS_PERSON Xiaomin", and that the execution subject adopts a left-to-right reading order. After "my name is" is read from the two pieces of text information, the person name slot mark "CLASS_PERSON" is read. Reading the person name slot mark indicates that a word belonging to a person name should appear at the position it occupies, and the words adjacent to the slot mark and located on its right side, namely "Xiaoming" and "Xiaomin", can be determined as candidate person names corresponding to that position. Therefore, "Xiaoming" and "Xiaomin" can be extracted from the two pieces of text information, respectively.
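For illustration only, the extraction in step 502 can be sketched in Python as follows; the tokenized hypotheses, the tag name CLASS_PERSON and the helper function are assumptions made for this sketch and are not the patent's actual implementation:

# Minimal sketch: extract the candidate word that immediately follows a slot
# mark in each hypothesis (left-to-right reading order). All names and data
# below are illustrative assumptions.
SLOT_TAG = "CLASS_PERSON"

def extract_candidates(hypotheses, slot_tag=SLOT_TAG):
    """Return the word right after the slot tag in each tokenized hypothesis."""
    candidates = []
    for tokens in hypotheses:
        if slot_tag in tokens:
            idx = tokens.index(slot_tag)
            if idx + 1 < len(tokens):  # word adjacent to and after the slot mark
                candidates.append(tokens[idx + 1])
    return candidates

hypotheses = [
    ["my", "name", "is", "CLASS_PERSON", "Xiaoming"],
    ["my", "name", "is", "CLASS_PERSON", "Xiaomin"],
]
print(extract_candidates(hypotheses))  # ['Xiaoming', 'Xiaomin']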
In step 503, a first score corresponding to each extracted word may be determined by using a scoring model corresponding to the slot position mark, so as to apply a score-based incentive to each extracted word. In this way, each extracted word can be optimized according to its score.
As an implementation manner, when the scoring model is the data mapping table as described above, for each extracted word, a record including the word may be searched in the scoring model, and a first score in the searched record is determined as a first score corresponding to the word. Alternatively, if no record including the word is found, the number 0 may be determined as the first score corresponding to the word.
As another implementation, when the scoring model is the prediction model as described above, each extracted word may be input into the scoring model to obtain the first score output by the scoring model.
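Both variants of the scoring model can be sketched roughly as follows; the table contents and the stand-in predictor are hypothetical placeholders, not the pre-established table or pre-trained model described above:

# Sketch of the two scoring-model variants: a pre-built data mapping table
# and a pre-trained prediction model. Contents and logic are placeholders.
name_score_table = {"Xiaoming": 0.9, "Xiaomin": 0.2}  # hypothetical mapping table

def score_with_table(word, table=name_score_table):
    # Words missing from the table receive a first score of 0.
    return table.get(word, 0.0)

class DummyPredictor:
    """Stand-in for a pre-trained prediction model returning a score in [0, 1]."""
    def __call__(self, word):
        return min(1.0, len(word) / 10.0)  # toy heuristic for illustration only

score_with_model = DummyPredictor()
print(score_with_table("Xiaoming"))   # 0.9
print(score_with_model("Xiaoming"))   # 0.8 (toy value)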
In step 504, the executing entity may determine, according to the determined first score, a target word serving as a recognition result corresponding to a position occupied by the slot position mark from the extracted words.
As an implementation manner, the word corresponding to the highest first score among the extracted words may be determined as the target word. Continuing with the above two pieces of text information as an example, assuming that the first score corresponding to "Xiaoming" is 0.9 and the first score corresponding to "Xiaomin" is 0.2, "Xiaoming" may be determined as the recognition result corresponding to the position occupied by the person name slot mark "CLASS_PERSON". At this time, the final speech recognition result corresponding to the two pieces of text information may include "my name is Xiaoming".
As another implementation manner, the prediction result in step 501 further includes second scores respectively corresponding to the plurality of pieces of text information. The second score corresponding to each piece of text information may be calculated by the target end-to-end model according to the probability distribution of each character in the text information. For each extracted word, a screening score corresponding to the word can be determined according to the first score corresponding to the word and the second score corresponding to the text information where the word is located. For example, the sum of the first score corresponding to the word and the second score corresponding to the text information where the word is located may be determined as the screening score corresponding to the word. Then, a target word may be determined from the extracted words according to the determined screening scores; for example, the word corresponding to the highest screening score may be determined as the target word.
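A rough sketch of this combination is given below, with all scores assumed for illustration:

# Sketch: screening score = first score (word) + second score (hypothesis),
# then keep the candidate with the highest screening score. Scores are made up.
candidates = [
    {"word": "Xiaoming", "first": 0.9, "second": 0.6},
    {"word": "Xiaomin",  "first": 0.2, "second": 0.7},
]

def pick_target_word(cands):
    for c in cands:
        c["screening"] = c["first"] + c["second"]  # sum of first and second scores
    return max(cands, key=lambda c: c["screening"])["word"]

print(pick_target_word(candidates))  # 'Xiaoming' (1.5 vs 0.9)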
Alternatively, the scoring model may be preset with a score adjustment coefficient. The score adjustment coefficient is used for adjusting the first score corresponding to an extracted word when the score incentive is applied to the extracted word. The score adjustment coefficient may be a value within the interval [0, 1]. In practice, a set of words corresponding to the target category is collected in advance, and speech recognition statistical information corresponding to the set of words is associated with the target category; the statistical information may include, but is not limited to, the word correctness rate, the recognition accuracy rate of words of the target category, and the like. The score adjustment coefficient may be determined based on the statistical information.
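Purely as an assumption about how such a coefficient might be derived from the statistics mentioned above, the following sketch uses a historical recognition accuracy directly as the coefficient:

# Sketch: derive a score adjustment coefficient from historical recognition
# statistics for words of the target category. The statistics are fabricated
# for illustration and this derivation is only one possible choice.
history = {"total": 200, "correctly_recognized": 150}

def adjustment_coefficient(stats):
    accuracy = stats["correctly_recognized"] / stats["total"]
    return max(0.0, min(1.0, accuracy))  # clamp into [0, 1]

print(adjustment_coefficient(history))  # 0.75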
In the case that the scoring model is preset with a score adjustment coefficient, step 504 can be implemented by the flow shown in fig. 6. Fig. 6 shows a flow of an embodiment of a determination method of a recognition result corresponding to a position occupied by a slot mark corresponding to a target category. The determination method comprises the following steps:
step 5041, for each extracted word, determining the product of the first score corresponding to the word and the score adjustment coefficient of the scoring model corresponding to the slot position mark, and determining the sum of the product and the second score corresponding to the text information where the word is located as the screening score corresponding to the word;
step 5042, determining a target word from the extracted words according to the determined screening score, the target word serving as a recognition result corresponding to a position occupied by the slot position mark corresponding to the target category.
In step 5042, the word corresponding to the highest screening score may be determined as the target word.
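A minimal sketch of steps 5041-5042 follows, with the coefficient value and scores assumed for illustration:

# Sketch of steps 5041-5042: screening score = first score * adjustment
# coefficient + second score; the highest screening score wins.
ADJUSTMENT_COEFFICIENT = 0.75  # assumed value

def screening_score(first, second, coeff=ADJUSTMENT_COEFFICIENT):
    return first * coeff + second

scores = {
    "Xiaoming": screening_score(0.9, 0.6),  # 1.275
    "Xiaomin":  screening_score(0.2, 0.7),  # 0.85
}
target_word = max(scores, key=scores.get)
print(target_word)  # 'Xiaoming'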
In the speech recognition method provided in the embodiment corresponding to fig. 5, in the process of optimizing the prediction result, in response to reading the slot position mark corresponding to the target category at the same position of the pieces of text information in the prediction result, the words that are adjacent to the slot position mark and appear after the slot position mark are respectively extracted from the pieces of text information, and a first score corresponding to the extracted words is then determined according to the slot position mark (for example, by using the scoring model corresponding to the slot position mark), so that a target word serving as the recognition result corresponding to the position occupied by the slot position mark is determined from the extracted words according to the determined first score. Therefore, words corresponding to the positions occupied by the slot position marks can be accurately identified according to the slot position marks corresponding to the target categories and the scoring models corresponding to the slot position marks. Therefore, the recognition accuracy of the words in the target category can be improved, the false alarm rate can be reduced, and the voice recognition effect of the voice information including the words in the target category can be improved.
Next, the application of the scheme provided by the embodiment corresponding to fig. 5 in the scenario of saying the name of a person is described.
In the scenario of speaking the name of a person, after a target end-to-end model for speech recognition and having a name slot prediction function is obtained by training using the scheme provided by the embodiment corresponding to fig. 4, the model may be applied to a speech recognition system, so that a prediction result output by the model is used as an input of an optimization processor in the speech recognition system.
In a speech recognition scene, after a piece of speech information to be recognized including a name is input into a target end-to-end model, the target end-to-end model analyzes and processes the speech information, and then a corresponding prediction result can be output, wherein the prediction result can include a plurality of pieces of text information which are used for representing the semantics of the speech information and are added with name slot position marks. After receiving the prediction result, the optimization processor may perform optimization processing on the prediction result and output a speech recognition result obtained after the optimization processing. For words in the prediction result which are irrelevant to the name slot position mark, a conventional processing method can be adopted for optimization processing. For words related to the name slot mark, the optimization process may be performed by using the flow shown in fig. 7.
FIG. 7 illustrates a flow 700 of one embodiment of a speech recognition method. The execution subject of the speech recognition method may be an optimization processor in the speech recognition system as shown in fig. 1. The speech recognition method specifically illustrates the optimization processing of words related to the name slot position mark, and comprises the following steps:
step 701, obtaining a prediction result output by a target end-to-end model, wherein the prediction result comprises a plurality of pieces of text information;
step 702, in response to reading the name slot marks at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the name slot mark positions and appear behind the name slot marks from the plurality of pieces of text information;
step 703, determining a first score corresponding to the extracted word according to the name slot position mark;
step 704, according to the determined first score, determining a target word from the extracted words, wherein the target word is used as a recognition result corresponding to the position occupied by the name slot position mark.
In step 701, the target end-to-end model may include a target end-to-end model which is trained by the method described in the embodiment corresponding to fig. 4 and used for speech recognition and has a name slot mark prediction function.
In step 703, a first score corresponding to the extracted word may be determined by using the scoring model corresponding to the name slot mark.
In the scenario of speaking the name of a person, the optimization processor can accurately identify the name using steps 701 to 704. Therefore, the recognition accuracy rate for the name of the person can be improved, the false alarm rate can be reduced, and the voice recognition effect for the voice information including the name of the person can be improved.
It should be noted that, according to the application of the speech recognition method in the scenario of speaking the name of the person, those skilled in the art can analogize to obtain the application scheme of the speech recognition method in the scenario of speaking the words of other categories, which is not illustrated here.
The above description relates to the model processing scheme one. Next, the contents related to the model processing scheme two are described. It should be noted that, in order to facilitate distinguishing the prediction result output by the target end-to-end model from the prediction result output by the speech recognition model, hereinafter, the prediction result output by the speech recognition model is referred to as a first prediction result, and the prediction result output by the target end-to-end model is referred to as a second prediction result.
Some embodiments of the present specification disclose a model processing method and a speech recognition method respectively associated with the second model processing scheme. In particular, FIG. 8 illustrates an exemplary system architecture diagram suitable for use with these embodiments.
As shown in FIG. 8, the system architecture includes a model training system and a speech recognition system. The model training system is used for training an end-to-end model to be trained according to training samples comprising text information and slot position marking information, so as to obtain, through training, a target end-to-end model for slot position prediction. Wherein the text information comprises words of the target category. The slot position marking information shows the appearance position of the words and the slot position mark corresponding to the target category. Here, the target category is similar to the target category mentioned in the foregoing, and is not explained in detail here.
In practice, the target end-to-end model is part of a speech recognition system. In addition, the speech recognition system includes a speech recognition model and an optimization processor. The speech recognition model may be a model employing an end-to-end architecture. In a speech recognition scene, when speech recognition is performed on speech information to be recognized, the speech information can be input into the speech recognition model, so that the speech recognition model outputs a first prediction result for the speech information to the target end-to-end model and the optimization processor, respectively. The target end-to-end model can perform slot position prediction on the first prediction result and output a second prediction result obtained through slot position prediction to the optimization processor. The optimization processor can perform optimization processing on the first prediction result according to the second prediction result and output a voice recognition result obtained after the optimization processing.
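The data flow of this second scheme can be illustrated with the following sketch; all three components are stubs introduced purely to show how the first and second prediction results move between them, and do not reflect real model internals:

# Sketch of the scheme-two pipeline: speech recognition model -> target
# end-to-end model (slot prediction) -> optimization processor.
# Every component below is a stub with fabricated outputs.
def speech_recognition_model(audio):
    # First prediction result: several equal-length hypotheses.
    return [["my", "name", "is", "Xiaoming"],
            ["my", "name", "is", "Xiaomin"]]

def target_end_to_end_model(first_result):
    # Second prediction result: occurrence position and slot mark.
    return {"position": 3, "slot_mark": "CLASS_PERSON"}

def optimization_processor(first_result, second_result):
    pos = second_result["position"]
    words = [hyp[pos] for hyp in first_result]
    first_scores = {"Xiaoming": 0.9, "Xiaomin": 0.2}  # assumed scoring model
    return max(words, key=lambda w: first_scores.get(w, 0.0))

first = speech_recognition_model(b"fake-audio-bytes")
second = target_end_to_end_model(first)
print(optimization_processor(first, second))  # 'Xiaoming'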
The following describes specific implementation steps of the above method with reference to specific examples.
Referring to FIG. 9, a flow 900 of one embodiment of a model processing method is shown. The implementation subject of the method may be a model training system as shown in fig. 8. The method comprises the following steps:
step 901, obtaining at least one training sample, where the training sample includes text information including words of a target category and slot position marking information, and the slot position marking information shows an appearance position of a word of the target category and a slot position mark corresponding to the target category;
step 902, training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model for slot position prediction.
The following describes steps 901-902 in detail.
In step 901, the at least one training sample may be generated in advance and stored in a specific storage location, for example, locally on the execution subject or on another server communicatively connected to the execution subject. The execution subject may obtain the at least one training sample from the storage location.
In practice, the slot marking information in the at least one training sample may be manually marked or automatically marked, and is not specifically limited herein.
In step 902, the executing agent may train the end-to-end model to be trained according to the at least one training sample, so as to obtain a target end-to-end model for slot prediction.
Wherein the end-to-end model to be trained may be an untrained model or a model whose training has not been completed. In addition, the end-to-end model to be trained may be any kind of model that is suitable for text processing and that employs an end-to-end architecture. Further, the end-to-end model to be trained may include, but is not limited to, a Transformer-based model, and the like.
In practice, the text information included in each of the at least one training sample may be a text in the form of a word sequence, or may be a text without word segmentation.
When the text information included in the at least one training sample is a text in a word sequence form, the text information included in the at least one training sample may be used as an input, and the slot position marking information corresponding to the text information may be used as a tag to train the end-to-end model to be trained.
When the text information included in each of the at least one training sample is a text without word segmentation, for each training sample in the at least one training sample, word segmentation may be performed on the text information included in the training sample, and words obtained through word segmentation may be formed into a word sequence. Then, the word sequence corresponding to the text information included in the at least one training sample is used as an input, and the slot position marking information corresponding to the text information is used as a label to train the end-to-end model to be trained.
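The two input formats described above (a text already in word-sequence form versus a text without word segmentation) can be sketched as follows; the whitespace-based segmenter, the sample contents, and the label format are assumptions made only for this illustration:

# Sketch: build (input, label) pairs for the slot-prediction model from
# training samples whose text may or may not already be segmented into words.
def segment(text):
    return text.split()  # stand-in for a real word segmenter

def build_training_pair(sample):
    text, slot_info = sample["text"], sample["slot_info"]
    words = text if isinstance(text, list) else segment(text)
    return words, slot_info  # word sequence as input, slot marking info as label

sample = {"text": "my name is Xiaoming",
          "slot_info": ["O", "O", "O", "CLASS_PERSON"]}  # assumed label format
print(build_training_pair(sample))
# (['my', 'name', 'is', 'Xiaoming'], ['O', 'O', 'O', 'CLASS_PERSON'])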
In the model processing method provided in this embodiment, the target end-to-end model for slot position prediction is obtained by obtaining the at least one training sample, and then training the end-to-end model to be trained according to the at least one training sample. In this way, the second prediction result output by the target end-to-end model can help the optimization processor accurately recognize the words of the target category from the first prediction result output by the speech recognition model, so that the recognition accuracy of the words of the target category is improved, the false alarm rate is reduced, and therefore the speech recognition effect of the speech information including the words of the target category can be improved.
After the target end-to-end model for slot prediction is obtained by training using the scheme provided by the embodiment corresponding to fig. 9, the model may be applied to a speech recognition system, so that a second prediction result output by the model is used as an input of an optimization processor in the speech recognition system.
In practice, the first prediction result output by the speech recognition model typically comprises a plurality of pieces of text information, which have the same length. The optimization processor needs to read the words located at the same position from the plurality of pieces of text information according to a specified reading order (for example, from left to right, or from right to left, etc.), and determine the word serving as the recognition result corresponding to that position from the read words. It should be noted that, for the words in the first prediction result that are not related to the slot position mark, a conventional processing method may be used to perform the optimization processing. For words related to slot position marks, the optimization processing may be performed using the flow shown in fig. 10.
FIG. 10 illustrates a flow 1000 of one embodiment of a speech recognition method. The execution subject of the speech recognition method may be an optimization processor in the speech recognition system as shown in fig. 8. The voice recognition method shows an optimization processing process of words related to slot position marks under the condition that a second prediction result shows the appearance position of the words of the target category and the slot position marks corresponding to the target category, and the voice recognition method comprises the following steps:
step 1001, acquiring a first prediction result output by a voice recognition model, wherein the first prediction result comprises a plurality of pieces of text information;
step 1002, obtaining a second prediction result output by the target end-to-end model, wherein the second prediction result is obtained by performing slot position prediction on the first prediction result;
step 1003, in response to the second prediction result showing the appearance position of a word of the target category and the slot position mark corresponding to the target category, reading the word at the appearance position from the plurality of pieces of text information, and determining a first score corresponding to the read word at the appearance position according to the slot position mark shown by the second prediction result;
step 1004, according to the determined first score, determining a target word from the read words located at the appearance position, wherein the target word is used as a recognition result corresponding to the appearance position.
In step 1002, the target end-to-end model may include a target end-to-end model for slot prediction, which is obtained by training using the method described in the embodiment corresponding to fig. 9.
In step 1003, in response to the second prediction result showing the occurrence position of the word in the target category and the slot mark corresponding to the target category and reading the word at the occurrence position from the plurality of pieces of text information, a first score corresponding to the read word at the occurrence position may be determined by using a scoring model corresponding to the slot mark shown by the second prediction result.
In this embodiment, a method for determining a first score corresponding to the read word located at the occurrence position by using a scoring model corresponding to the slot position mark, and determining a target word from the read word located at the occurrence position according to the determined first score may refer to the related description in the embodiment corresponding to fig. 5.
In the speech recognition method provided in this embodiment, when the second prediction result shows the occurrence position of the word in the target category and the slot mark corresponding to the target category, in a process of performing optimization processing on the first prediction result, in response to reading a word located at the occurrence position from the plurality of pieces of text information in the first prediction result, a first score corresponding to the read word located at the occurrence position is determined according to the slot mark (for example, by using a scoring model corresponding to the slot mark), so that a target word is determined from the read word located at the occurrence position according to the determined first score, and the target word serves as a recognition result corresponding to the occurrence position. Therefore, words corresponding to the appearance positions shown by the second prediction result can be accurately identified according to the second prediction result and the scoring model corresponding to the slot position marks shown by the second prediction result. Therefore, the recognition accuracy of the words in the target category can be improved, the false alarm rate can be reduced, and the voice recognition effect of the voice information including the words in the target category can be improved.
With further reference to FIG. 11, as an implementation of the methods shown in some of the above figures, the present specification provides an embodiment of a model processing apparatus, which corresponds to the embodiment of the method shown in FIG. 2, and which may be applied to a model training system as shown in FIG. 1.
As shown in fig. 11, the model processing apparatus 1100 of the present embodiment includes: an acquisition unit 1101 and a model training unit 1102. The obtaining unit 1101 is configured to obtain at least one training sample, where the training sample includes voice information including words of a target category, and a tagged text, the tagged text is used for representing semantics of the voice information and is added with a slot position mark corresponding to the target category, and the slot position mark is added at an original appearance position of the words of the target category in the tagged text; the model training unit 1102 is configured to train the end-to-end model to be trained according to the at least one training sample, so as to obtain a target end-to-end model for speech recognition and having a slot prediction function.
In this embodiment, specific processing of the obtaining unit 1101 and the model training unit 1102 and technical effects brought by the specific processing can refer to related descriptions of step 201 and step 202 in the corresponding embodiment of fig. 2, which are not repeated herein.
Optionally, the target category may include at least one of the following categories: a person name, a place name, an organization name, an animal name, an audio name, a video name, etc.
Optionally, a dictionary adopted by the end-to-end model to be trained is added with a slot position mark corresponding to the target class. The end-to-end model to be trained may comprise a natural language processing model based on a self-attention mechanism and employing an encoder-decoder architecture.
Optionally, when the target category comprises a person name, the person name slot mark added in the annotation text is used for indicating that a person name should appear at the position occupied by the person name slot mark; when the target category comprises a place name, the place name slot mark added in the annotation text is used for indicating that a place name should appear at the position occupied by the place name slot mark; when the target category comprises an organization name, the organization name slot mark added in the annotation text is used for indicating that an organization name should appear at the position occupied by the organization name slot mark.
Optionally, when the annotation text is a text in the form of a word sequence, the model training unit 1102 may be further configured to: and taking the voice information respectively included by the at least one training sample as input, taking the labeled text corresponding to the voice information as a label, and training the end-to-end model to be trained.
Optionally, when the annotation text is a text without word segmentation, the model training unit 1102 may be further configured to: for each training sample in the at least one training sample, performing word segmentation on the labeled text included in the training sample, and forming words obtained through word segmentation into word sequences; and taking the voice information respectively included by the at least one training sample as input, taking a word sequence corresponding to the voice information as a label, and training the end-to-end model to be trained.
The model processing apparatus provided in this embodiment obtains the at least one training sample by the obtaining unit, and then trains the end-to-end model to be trained by the model training unit according to the at least one training sample to obtain the target end-to-end model for speech recognition and having a slot position prediction function, so that a prediction result output by the target end-to-end model can help the optimization processor to accurately recognize words of a target category, improve recognition accuracy of the words of the target category, and reduce a false alarm rate, and therefore, a speech recognition effect of speech information for the words including the target category can be improved.
With further reference to fig. 12, as an implementation of the methods shown in some of the above figures, the present specification provides an embodiment of a speech recognition apparatus, which corresponds to the embodiment of the method shown in fig. 5, and which is applied to an optimization processor in the speech recognition system shown in fig. 1. The speech recognition system also includes a target end-to-end model for speech recognition with slot prediction. The optimization processor is used for optimizing the prediction result output by the target end-to-end model.
As shown in fig. 12, the speech recognition apparatus 1200 of the present embodiment includes: an acquisition unit 1201, an extraction unit 1202, a score determination unit 1203, and a recognition result determination unit 1204. The obtaining unit 1201 is configured to obtain a prediction result output by the target end-to-end model, where the prediction result includes a plurality of pieces of text information; the extracting unit 1202 is configured to extract, in response to reading of a slot mark corresponding to a target category at the same position of the pieces of text information, a word that is adjacent to the slot mark position and appears after the slot mark, from the pieces of text information, respectively; the score determining unit 1203 is configured to determine a first score corresponding to the extracted word according to the slot position mark; the recognition result determining unit 1204 is configured to determine a target word from the extracted words according to the determined first score, the target word serving as a recognition result corresponding to a position occupied by the slot position mark.
In this embodiment, specific processes of the obtaining unit 1201, the extracting unit 1202, the score determining unit 1203, and the recognition result determining unit 1204 and technical effects brought by the specific processes may refer to related descriptions of step 501, step 502, step 503, and step 504 in the corresponding embodiment of fig. 5, and are not described herein again.
Optionally, the target end-to-end model may include a target end-to-end model with a slot prediction function for speech recognition, which is trained by the method described in the corresponding embodiment of fig. 2.
Optionally, the score determining unit 1203 may be further configured to: and determining a first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark.
Optionally, the scoring model may include a pre-established data mapping table for characterizing a correspondence between words in the target category and the first score; and the score determination unit 1203 may be further configured to: and searching the record comprising the extracted word in the scoring model, and determining the first score in the searched record as the first score corresponding to the extracted word.
Optionally, the scoring model may include a pre-trained prediction model for predicting a first score corresponding to a word in the target category; and the score determination unit 1203 may be further configured to: and inputting the extracted words into a scoring model to obtain a first score output by the scoring model.
Optionally, the prediction result may further include second scores corresponding to the text messages respectively; and the recognition result determining unit 1204 may be further configured to: for each extracted word, determining a screening score corresponding to the word according to a first score corresponding to the word and a second score corresponding to the text information where the word is located; and determining the target word from the extracted words according to the determined screening score.
Optionally, the scoring model may correspond to a preset score adjustment coefficient; and the recognition result determining unit 1204 may be further configured to: for each extracted word, determining the product of a first score and a tuning score corresponding to the word; and determining the sum of the product and a second score corresponding to the text information where the word is positioned as a screening score corresponding to the word.
In the speech recognition apparatus provided in this embodiment, the obtaining unit obtains a prediction result output by the target end-to-end model, where the prediction result includes a plurality of pieces of text information, the extracting unit then responds to reading of a slot position mark corresponding to a target category at the same position of the plurality of pieces of text information, extracts words from the plurality of pieces of text information, which are adjacent to the slot position mark position and appear after the slot position mark, and then the score determining unit determines a first score corresponding to the extracted words according to the slot position mark (for example, by using a scoring model corresponding to the slot position mark), so that the recognition result determining unit determines a target word from the extracted words according to the determined first score, where the target word serves as a recognition result corresponding to a position occupied by the slot position mark. Therefore, words corresponding to the positions occupied by the slot position marks can be accurately identified according to the slot position marks and the scoring models corresponding to the slot position marks. Therefore, the recognition accuracy of the words in the target category can be improved, the false alarm rate can be reduced, and the voice recognition effect of the voice information including the words in the target category can be improved.
With further reference to FIG. 13, as an implementation of the methods shown in some of the above figures, the present specification provides yet another embodiment of a model processing apparatus, which corresponds to the method embodiment shown in FIG. 4, and which may be applied to a model training system as shown in FIG. 1.
As shown in fig. 13, the model processing apparatus 1300 of the present embodiment includes: an acquisition unit 1301 and a model training unit 1302. The obtaining unit 1301 is configured to obtain at least one training sample, where the training sample includes voice information including a name of a person, and a tagged text, the tagged text is used for representing semantics of the voice information and is added with a name slot mark, and the name slot mark is added at an original appearance position of a word belonging to the name of the person in the tagged text; the model training unit 1302 is configured to train the end-to-end model to be trained according to the at least one training sample, so as to obtain a target end-to-end model for speech recognition and having a name slot prediction function.
In this embodiment, specific processing of the obtaining unit 1301 and the model training unit 1302 and technical effects brought by the specific processing may refer to related descriptions of step 401 and step 402 in the embodiment corresponding to fig. 4, which are not described herein again.
In the model processing apparatus provided in this embodiment, the obtaining unit obtains the at least one training sample, and then the model training unit trains the end-to-end model to be trained according to the at least one training sample, so as to obtain a target end-to-end model which is used for speech recognition and has a name slot prediction function. The prediction result output by the target end-to-end model can help the optimization processor to accurately identify the name of the person, improve the accuracy rate of identification aiming at the name of the person, reduce the false alarm rate, and therefore, improve the voice identification effect aiming at the voice information comprising the name of the person.
With further reference to fig. 14, as an implementation of the methods shown in some of the above figures, the present specification provides an embodiment of a speech recognition apparatus, which corresponds to the embodiment of the method shown in fig. 7, and which is applied to an optimization processor in the speech recognition system shown in fig. 1. The voice recognition system also includes a target end-to-end model for voice recognition with a name slot prediction function. The optimization processor is used for optimizing the prediction result output by the target end-to-end model.
As shown in fig. 14, the speech recognition apparatus 1400 of the present embodiment includes: an acquisition unit 1401, an extraction unit 1402, a score determination unit 1403, and a recognition result determination unit 1404. Wherein, the obtaining unit 1401 is configured to obtain a prediction result output by the target end-to-end model, and the prediction result includes a plurality of pieces of text information; the extracting unit 1402 is configured to extract, in response to reading of the person name slot mark at the same position of the pieces of text information, words that are adjacent to and appear after the person name slot mark from the pieces of text information, respectively; the score determining unit 1403 is configured to determine a first score corresponding to the extracted word according to the name slot position mark; the recognition result determining unit 1404 is configured to determine a target word from the extracted words according to the determined first score, the target word serving as a recognition result corresponding to a position occupied by the name slot mark.
In this embodiment, specific processes of the obtaining unit 1401, the extracting unit 1402, the score determining unit 1403, and the recognition result determining unit 1404 and technical effects brought by the specific processes may refer to related descriptions of step 701, step 702, step 703, and step 704 in the corresponding embodiment of fig. 7, which are not described herein again.
Optionally, the target end-to-end model may include a target end-to-end model which is trained by the method described in the embodiment corresponding to fig. 4 and used for voice recognition and has a name slot prediction function.
Optionally, the score determining unit 1403 may be further configured to: and determining a first score corresponding to the extracted word by using the scoring model corresponding to the name slot position mark.
In the speech recognition device provided by this embodiment, the obtaining unit obtains a prediction result output by the target end-to-end model, where the prediction result includes a plurality of pieces of text information; the extracting unit then, in response to reading the name slot mark at the same position of the plurality of pieces of text information, respectively extracts from the plurality of pieces of text information the words that are adjacent to the position of the name slot mark and appear after the name slot mark; the score determining unit then determines a first score corresponding to the extracted words according to the name slot mark (for example, by using a scoring model corresponding to the name slot mark), so that the recognition result determining unit determines a target word from the extracted words according to the determined first score, the target word serving as the recognition result corresponding to the position occupied by the name slot mark. Therefore, the name can be accurately identified according to the name slot position mark and the scoring model corresponding to the name slot position mark. Therefore, the recognition accuracy rate for the name of the person can be improved, the false alarm rate can be reduced, and the voice recognition effect for the voice information including the name of the person can be improved.
With further reference to FIG. 15, as an implementation of the methods shown in some of the above figures, the present specification provides yet another embodiment of a model processing apparatus, corresponding to the method embodiment shown in FIG. 9, which may be applied to a model training system as shown in FIG. 8.
As shown in fig. 15, the model processing apparatus 1500 of the present embodiment includes: an acquisition unit 1501 and a model training unit 1502. The acquisition unit 1501 is configured to acquire at least one training sample, where the training sample includes text information including a word of a target category and slot position marking information, and the slot position marking information shows an appearance position of the word of the target category and a slot position mark corresponding to the target category; the model training unit 1502 is configured to train the end-to-end model to be trained according to the at least one training sample, resulting in a target end-to-end model for slot prediction.
In this embodiment, specific processing of the obtaining unit 1501 and the model training unit 1502 and technical effects brought by the specific processing can refer to related descriptions of step 901 and step 902 in the corresponding embodiment of fig. 9, which are not described herein again.
In the model processing apparatus provided in this embodiment, the obtaining unit obtains the at least one training sample, and then the model training unit trains the end-to-end model to be trained according to the at least one training sample, so as to train and obtain the target end-to-end model for slot position prediction. In this way, the second prediction result output by the target end-to-end model can help the optimization processor accurately recognize the words of the target category from the first prediction result output by the speech recognition model, so that the recognition accuracy of the words of the target category is improved, the false alarm rate is reduced, and therefore the speech recognition effect of the speech information including the words of the target category can be improved.
With further reference to fig. 16, as an implementation of the methods shown in some of the above figures, the present specification provides an embodiment of a speech recognition apparatus, which corresponds to the embodiment of the method shown in fig. 10, and which is applied to an optimization processor in the speech recognition system shown in fig. 8. The speech recognition system also includes a speech recognition model and a target end-to-end model for slot prediction. The target end-to-end model is used for carrying out slot position prediction on a first prediction result output by the voice recognition model and outputting a second prediction result obtained through slot position prediction to the optimization processor. And the optimization processor is used for optimizing the first prediction result according to the second prediction result.
As shown in fig. 16, the speech recognition apparatus 1600 of the present embodiment includes: a first acquisition unit 1601, a second acquisition unit 1602, a score determination unit 1603, and a recognition result determination unit 1604. Wherein the first obtaining unit 1601 is configured to obtain a first prediction result output by the speech recognition model, the first prediction result including a plurality of pieces of text information; the second obtaining unit 1602 is configured to obtain a second prediction result output by the target end-to-end model, where the second prediction result is obtained by performing slot position prediction on the first prediction result; the score determining unit 1603 is configured to, in response to the second prediction result showing the appearance position of the word in the target category and the slot mark corresponding to the target category, and to read the word at the appearance position from the plurality of pieces of text information, determine a first score corresponding to the read word at the appearance position according to the slot mark; the recognition result determination unit 1604 is configured to determine a target word from the read words located at the appearance position according to the determined first score, the target word serving as a recognition result corresponding to the appearance position.
In this embodiment, specific processes of the first obtaining unit 1601, the second obtaining unit 1602, the score determining unit 1603, and the recognition result determining unit 1604, and technical effects brought by the specific processes, may refer to related descriptions of step 1001, step 1002, step 1003, and step 1004 in the corresponding embodiment of fig. 10, and are not described herein again.
Optionally, the target end-to-end model may include a target end-to-end model for slot prediction, which is trained by the method described in the corresponding embodiment of fig. 9.
Optionally, the score determining unit 1603 may be further configured to: and in response to the second prediction result showing the appearance position of the word in the target category and the slot marks corresponding to the target category, reading the word at the appearance position from the plurality of pieces of text information, and determining a first score corresponding to the read word at the appearance position by using a scoring model corresponding to the slot marks.
The speech recognition apparatus provided in this embodiment obtains, by a first obtaining unit, a first prediction result output by a speech recognition model, where the first prediction result includes a plurality of pieces of text information, obtains, by a second obtaining unit, a second prediction result output by a target end-to-end model, where the second prediction result is obtained by performing slot prediction on the first prediction result, and then, by a score determining unit, in response to the second prediction result, shows an appearance position of a word in a target category and a slot mark corresponding to the target category, and reads a word located at the appearance position from the plurality of pieces of text information, determines, according to the slot mark (for example, a scoring model corresponding to the slot mark), a first score corresponding to the read word located at the appearance position, so that the recognition result determining unit determines, according to the determined first score, a target word from the read word located at the appearance position, the target word is used as a recognition result corresponding to the appearance position. Therefore, words corresponding to the appearance positions shown by the second prediction result can be accurately identified according to the second prediction result and the scoring model corresponding to the slot position marks shown by the second prediction result. Therefore, the recognition accuracy of the words in the target category can be improved, the false alarm rate can be reduced, and the overall voice recognition effect of the voice information including the words in the target category can be improved.
With further reference to FIG. 17, a schematic view of a scenario of an interaction device according to the present description is shown.
As shown in fig. 17, the interaction device may include an optimization processor. The optimization processor may be connected to the target end-to-end model. The target end-to-end model is used for voice recognition and is provided with a slot position prediction function.
It should be noted that the target end-to-end model may be included in the interactive device, or may be included in other devices, and is not specifically limited herein. Here, the target end-to-end model is included in the interactive device as an example for explanation.
Specifically, the interactive device may obtain voice information of the user and input the voice information into the target end-to-end model. The target end-to-end model can perform voice recognition and slot prediction on voice information and output prediction results to the optimization processor. The prediction result may include a plurality of pieces of text information, and the plurality of pieces of text information are respectively used for representing semantics of the speech information. It should be understood that when the voice information of the user includes the word of the target category, the slot mark corresponding to the target category is added to the plurality of pieces of text information, and the word of the target category should appear at the position occupied by the slot mark. Thereafter, the optimization processor may extract from the plurality of pieces of text information, respectively, words that appear adjacent to and after the slot mark position, in response to reading the slot mark corresponding to the target category at the same position of the plurality of pieces of text information. Then, the optimization processor determines a target word from the extracted words, wherein the target word is used as a recognition result corresponding to the position occupied by the slot position mark. Specifically, the optimization processor may determine a first score corresponding to the extracted word according to the slot position marker. The optimization processor may then determine a target word from the extracted words based on the determined first score.
Optionally, the target end-to-end model may include a target end-to-end model with a slot prediction function for speech recognition, which is trained by the method described in the corresponding embodiment of fig. 2.
Optionally, the optimization processor may be further configured to: and determining a first score corresponding to the extracted word according to the scoring model corresponding to the slot position mark.
It should be noted that, for a detailed explanation of the operations performed by the optimization processor and the technical effects brought by the optimization processor, reference may be made to the relevant description in the corresponding embodiment of fig. 5, which is not described in detail herein.
In the interaction device provided by the embodiment of the present invention, the optimization processor included in the interaction device enables the interaction device to achieve a higher recognition accuracy for words of the target category in the voice information, and a better voice recognition effect for voice information including words of the target category.
With further reference to FIG. 18, another scene schematic of an interaction device according to the present description is shown.
As shown in fig. 18, the interaction device may include an optimization processor. The optimization processor may be connected to the speech recognition model and the target end-to-end model. The target end-to-end model is used for slot prediction.
It should be noted that the speech recognition model and/or the target end-to-end model may be included in the interactive device, or may be included in other devices, and are not limited in this respect. Here, the speech recognition model and the target end-to-end model are included in the interactive device for example.
Specifically, the interactive device may obtain voice information of the user and input the voice information into the voice recognition model. The voice recognition model can perform voice recognition on the voice information to obtain a first prediction result, and output the first prediction result to the target end-to-end model and the optimization processor, respectively. Wherein the first prediction result comprises a plurality of pieces of text information. The pieces of text information are respectively used for representing the semantics of the voice information.
The target end-to-end model can perform slot position prediction on the received first prediction result to obtain a second prediction result, and the second prediction result is output to the optimization processor. It should be understood that when the voice information of the user includes the word of the target category, the second prediction result may show the occurrence position of the word of the target category and the slot mark corresponding to the target category.
After receiving the first prediction result and the second prediction result, the optimization processor may perform processing such as analysis on the first prediction result and the second prediction result. Specifically, the optimization processor may determine, in response to the second prediction result showing the occurrence position of the word in the target category and the slot mark corresponding to the target category, and reading the word located at the occurrence position from the plurality of pieces of text information, a target word that serves as the recognition result corresponding to the occurrence position from the read word located at the occurrence position. Specifically, the optimization processor may determine, according to the slot position mark, a first score corresponding to the read word located at the appearance position, and determine, according to the determined first score, a target word from the read word located at the appearance position.
Optionally, the target end-to-end model may include a target end-to-end model for slot prediction, which is trained by the method described in the corresponding embodiment of fig. 9.
Optionally, the optimization processor may be further configured to: and determining a first score corresponding to the read word positioned at the appearance position according to the scoring model corresponding to the slot position mark.
It should be noted that, for the detailed explanation of the operation performed by the optimization processor and the technical effect brought by the operation, reference may be made to the relevant description in the corresponding embodiment of fig. 10, and details are not described here.
In the interaction device provided by the embodiment of the present invention, the optimization processor included in the interaction device enables the interaction device to achieve a higher recognition accuracy for words of the target category in the voice information, and a better voice recognition effect for voice information including words of the target category.
It should be particularly noted that the interactive devices in the embodiments respectively corresponding to fig. 17 and 18 may be any kind of devices having voice recognition and interaction functions, and may include, but are not limited to, a smart speaker, a smart robot, and the like.
The present specification also provides a computer readable storage medium, on which a computer program is stored, wherein when the computer program is executed in a computer, the computer is caused to execute the methods respectively shown in the above method embodiments.
The present specification further provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the methods respectively shown in the above method embodiments.
The present specification also provides a computer program product, which when executed on a data processing apparatus, causes the data processing apparatus to implement the methods respectively shown in the above method embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above-mentioned embodiments, objects, technical solutions and advantages of the embodiments disclosed in the present specification are further described in detail, it should be understood that the above-mentioned embodiments are only specific embodiments of the embodiments disclosed in the present specification, and are not intended to limit the scope of the embodiments disclosed in the present specification, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the embodiments disclosed in the present specification should be included in the scope of the embodiments disclosed in the present specification.

Claims (30)

1. A model processing method, comprising:
acquiring at least one training sample, wherein the training sample comprises voice information including words of a target category and a labeling text, the labeling text is used for representing the semantics of the voice information and is added with a slot position mark corresponding to the target category, and the slot position mark is added at the original appearance position of the words of the target category in the labeling text;
and training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a slot position prediction function.
2. The method of claim 1, wherein the target category comprises at least one of: a person name, a place name, and an organization name.
3. The method of claim 2, wherein,
when the target category comprises a person name, the person name slot mark added in the annotation text is used for representing that a person name should appear at the position occupied by the person name slot mark;
when the target category comprises a place name, the place name slot mark added in the annotation text is used for representing that a place name should appear at the position occupied by the place name slot mark;
and when the target category comprises an organization name, the organization name slot mark added in the annotation text is used for representing that an organization name should appear at the position occupied by the organization name slot mark.
4. The method of claim 1, wherein a dictionary adopted by the end-to-end model to be trained is added with a slot label corresponding to the target class.
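As an illustration of claim 4 (the token strings here are invented), adding one slot label per target category to the dictionary simply makes the slot marks ordinary output symbols of the end-to-end model:

    base_vocab = ["<blank>", "<sos>", "<eos>", "打", "电", "话", "给", "张", "三"]
    slot_labels = ["<PERSON>", "<PLACE>", "<ORG>"]      # one slot label per target category
    vocab = base_vocab + slot_labels
    token_to_id = {tok: i for i, tok in enumerate(vocab)}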
5. The method of claim 1, wherein the end-to-end model to be trained comprises a natural language processing model based on a self-attention mechanism and employing an encoder-decoder architecture.
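A minimal sketch of such a model, assuming PyTorch, 80-dimensional acoustic features, and arbitrary layer sizes; this is only one possible instantiation of a self-attention encoder-decoder and omits details such as the causal decoder mask and positional encodings.

    import torch.nn as nn

    class SpeechTransformer(nn.Module):
        def __init__(self, vocab_size: int, feat_dim: int = 80, d_model: int = 256):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, d_model)       # acoustic frames -> model dim
            self.tok_embed = nn.Embedding(vocab_size, d_model)  # previous output tokens
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=4,
                num_encoder_layers=6, num_decoder_layers=3,
                batch_first=True,
            )
            # scores over the dictionary, including the slot labels
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, feats, prev_tokens):
            enc_in = self.feat_proj(feats)            # (batch, frames, d_model)
            dec_in = self.tok_embed(prev_tokens)      # (batch, steps, d_model)
            dec_out = self.transformer(enc_in, dec_in)
            return self.out(dec_out)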
6. The method according to one of claims 1 to 5, wherein the annotation text is text in the form of a sequence of words; and
the training of the end-to-end model to be trained according to the at least one training sample comprises:
and taking the voice information included in each of the at least one training sample as input, taking the annotation text corresponding to the voice information as a label, and training the end-to-end model to be trained.
7. The method according to one of claims 1 to 5, wherein the annotation text is text that has not been word-segmented; and
the training of the end-to-end model to be trained according to the at least one training sample comprises:
for each training sample in the at least one training sample, performing word segmentation on the annotation text included in the training sample, and forming the words obtained through word segmentation into a word sequence;
and taking the voice information included in each of the at least one training sample as input, taking the word sequence corresponding to the voice information as a label, and training the end-to-end model to be trained.
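A sketch of the word-segmentation step of claim 7; the segmenter itself is passed in, since the claim does not prescribe a particular tool, and the slot-tag strings are the hypothetical ones used above. The annotation text is split into words, slot marks are kept as single tokens, and the resulting word sequence serves as the label.

    import re

    SLOT_PATTERN = re.compile(r"(<PERSON>|<PLACE>|<ORG>)")

    def to_word_sequence(annotation: str, segment_words) -> list:
        """Split an un-segmented annotation text into a word sequence for use as a label."""
        words = []
        for piece in SLOT_PATTERN.split(annotation):
            if not piece:
                continue
            if SLOT_PATTERN.fullmatch(piece):
                words.append(piece)                  # keep the slot mark as one token
            else:
                words.extend(segment_words(piece))   # e.g. a Chinese word segmenter
        return words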
8. A speech recognition method applied to an optimization processor in a speech recognition system, the speech recognition system further comprising a target end-to-end model with a slot prediction function for speech recognition, the method comprising:
obtaining a prediction result output by the target end-to-end model, wherein the prediction result comprises a plurality of pieces of text information;
in response to reading the slot position marks corresponding to the target category at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the slot position marks and appear after the slot position marks from the plurality of pieces of text information;
determining a first score corresponding to the extracted word according to the slot position mark;
and determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the slot position mark.
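For illustration only (not part of the claims), a sketch of the steps of claim 8 with hypothetical names: each piece of text information is treated as a token list output by the target end-to-end model, and first_score_of stands in for the scoring described in claims 10 to 12.

    SLOT = "<PERSON>"

    def recognize_slot_word(hypotheses, first_score_of):
        """hypotheses: the pieces of text information in the prediction result."""
        extracted = []
        for tokens in hypotheses:
            if SLOT in tokens:
                pos = tokens.index(SLOT)
                if pos + 1 < len(tokens):
                    extracted.append(tokens[pos + 1])    # word adjacent to and after the slot mark
        if not extracted:
            return None
        return max(extracted, key=first_score_of)        # target word = highest first score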
9. The method of claim 8, wherein the target end-to-end model comprises a target end-to-end model which is used for speech recognition, has a slot position prediction function, and is trained using the method of claim 1.
10. The method of claim 8, wherein the determining a first score corresponding to the extracted word according to the slot marker comprises:
and determining a first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark.
11. The method according to claim 10, wherein the scoring model comprises a pre-established data mapping table for characterizing a correspondence between words of the target category and first scores; and
the determining the first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark comprises the following steps:
and searching records comprising the extracted words in the scoring model, and determining the first score in the searched records as the first score corresponding to the extracted words.
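A sketch of the table-lookup scoring model of claim 11; the table entries here are invented examples (for instance drawn from a contact list), and a default score is assumed for words not found in the table.

    person_score_table = {"张三": 0.9, "章三": 0.2}     # word of the target category -> first score

    def first_score_of(word: str, default: float = 0.0) -> float:
        return person_score_table.get(word, default)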
12. The method of claim 10, wherein the scoring model comprises a pre-trained predictive model for predicting a first score corresponding to a word under a target category; and
the determining the first score corresponding to the extracted word by using the scoring model corresponding to the slot position mark comprises the following steps:
and inputting the extracted words into the scoring model to obtain a first score output by the scoring model.
13. The method according to one of claims 8 to 12, wherein the prediction result further comprises second scores corresponding to the plurality of pieces of text information respectively; and
determining a target word from the extracted words according to the determined first score, including:
for each extracted word, determining a screening score corresponding to the word according to a first score corresponding to the word and a second score corresponding to the text information where the word is located;
and determining the target word from the extracted words according to the determined screening score.
14. The method of claim 13, wherein the scoring model corresponds to a preset score factor; and
the determining the screening score corresponding to the word according to the first score corresponding to the word and the second score corresponding to the text information where the word is located comprises:
determining the product of the first score corresponding to the word and the score factor;
and determining the sum of the product and the second score corresponding to the text information where the word is located as the screening score corresponding to the word.
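The screening score of claim 14, written out as a one-line sketch (function and parameter names invented):

    def screening_score(first_score: float, second_score: float, score_factor: float) -> float:
        # product of the first score and the preset score factor, plus the second score
        return second_score + score_factor * first_score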
15. A model processing method, comprising:
acquiring at least one training sample, wherein the training sample comprises voice information including a person name and an annotation text, the annotation text is used for representing the semantics of the voice information and is added with a name slot position mark, and the name slot position mark is added at the original appearance position of a word belonging to the person name in the annotation text;
and training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a name slot position prediction function.
16. A speech recognition method applied to an optimization processor in a speech recognition system, the speech recognition system further comprising a target end-to-end model with a name slot prediction function for speech recognition, the method comprising:
obtaining a prediction result output by the target end-to-end model, wherein the prediction result comprises a plurality of pieces of text information;
in response to reading the name slot position marks at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the name slot position marks and appear after the name slot position marks from the plurality of pieces of text information;
determining a first score corresponding to the extracted word according to the name slot position mark;
and determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the name slot position mark.
17. The method of claim 16, wherein the target end-to-end model comprises a target end-to-end model which is used for speech recognition, has a name slot position prediction function, and is trained using the method of claim 15.
18. A model processing method, comprising:
acquiring at least one training sample, wherein the training sample comprises text information including words of a target category and slot position marking information, and the slot position marking information shows the appearance position of the words of the target category and slot position marks corresponding to the target category;
and training the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model for slot position prediction.
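For illustration only, one possible shape of a claim-18 training sample (the field names are invented): the text information contains a person name, and the slot position marking information records where that word appears together with the corresponding slot mark.

    sample = {
        "text": "打电话给张三",
        "slots": [
            {"start": 4, "end": 6, "slot": "<PERSON>"},   # appearance position + slot mark
        ],
    }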
19. A speech recognition method applied to an optimization processor in a speech recognition system, the speech recognition system further comprising a speech recognition model and a target end-to-end model for slot prediction, the method comprising:
acquiring a first prediction result output by the voice recognition model, wherein the first prediction result comprises a plurality of pieces of text information;
obtaining a second prediction result output by the target end-to-end model, wherein the second prediction result is obtained by performing slot position prediction on the first prediction result;
in response to the second prediction result showing the appearance position of a word of the target category and the slot position mark corresponding to the target category, reading the word located at the appearance position from the plurality of pieces of text information, and determining a first score corresponding to the read word located at the appearance position according to the slot position mark;
and according to the determined first score, determining a target word from the read words located at the appearance position, wherein the target word is used as a recognition result corresponding to the appearance position.
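A hypothetical sketch of the two-model pipeline of claim 19: the speech recognition model supplies the first prediction result (several recognized texts), a separate slot-prediction model supplies the second prediction result, and the optimization processor re-scores the word found at the predicted appearance position. slot_model.predict and first_score_of are assumed interfaces, not part of the claims.

    def optimize(asr_nbest, slot_model, first_score_of):
        """asr_nbest: list of recognized text strings (the first prediction result)."""
        best = None
        for text in asr_nbest:
            for slot in slot_model.predict(text):          # second prediction result
                word = text[slot["start"]:slot["end"]]      # word at the appearance position
                score = first_score_of(word)                # first score from the slot mark
                if best is None or score > best[0]:
                    best = (score, word)
        return best[1] if best else None                    # target word / recognition result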
20. The method of claim 19, wherein the target end-to-end model comprises a target end-to-end model for slot prediction trained using the method of claim 18.
21. A model processing apparatus comprising:
an acquisition unit configured to acquire at least one training sample, wherein the training sample comprises voice information including words of a target category and an annotation text, the annotation text is used for representing the semantics of the voice information and is added with a slot position mark corresponding to the target category, and the slot position mark is added at the original appearance position of the words of the target category in the annotation text;
and the model training unit is configured to train the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a slot position prediction function.
22. A speech recognition apparatus for use in an optimization processor in a speech recognition system, the speech recognition system further comprising a target end-to-end model with slot prediction for speech recognition, the apparatus comprising:
an obtaining unit configured to obtain a prediction result output by the target end-to-end model, the prediction result including a plurality of pieces of text information;
an extracting unit configured to extract, from the plurality of pieces of text information, words that are adjacent to and appear after slot marks, respectively, in response to reading the slot marks corresponding to target categories at the same positions of the plurality of pieces of text information;
the score determining unit is configured to determine a first score corresponding to the extracted word according to the slot position mark;
and the identification result determining unit is configured to determine a target word from the extracted words according to the determined first score, wherein the target word is used as an identification result corresponding to the position occupied by the slot position mark.
23. A model processing apparatus comprising:
an acquisition unit configured to acquire at least one training sample, the training sample including voice information including a name of a person and an annotation text, the annotation text being used to represent the semantics of the voice information and being added with a name slot mark, the name slot mark being added at the original appearance position of a word belonging to the name of the person in the annotation text;
and the model training unit is configured to train the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model which is used for voice recognition and has a name slot position prediction function.
24. A speech recognition apparatus for use in an optimization processor in a speech recognition system, the speech recognition system further comprising a target end-to-end model with name slot prediction for speech recognition, the apparatus comprising:
an obtaining unit configured to obtain a prediction result output by the target end-to-end model, the prediction result including a plurality of pieces of text information;
an extracting unit configured to extract, from the plurality of pieces of text information, words that are adjacent to and appear after the name slot mark, respectively, in response to reading the name slot mark at the same position of the plurality of pieces of text information;
the score determining unit is configured to determine a first score corresponding to the extracted word according to the name slot position mark;
and the identification result determining unit is configured to determine a target word from the extracted words according to the determined first score, wherein the target word is used as an identification result corresponding to the position occupied by the name slot mark.
25. A model processing apparatus comprising:
an obtaining unit configured to obtain at least one training sample, where the training sample includes text information including a word of a target category and slot position marking information, and the slot position marking information shows an appearance position of the word of the target category and a slot position mark corresponding to the target category;
and the model training unit is configured to train the end-to-end model to be trained according to the at least one training sample to obtain a target end-to-end model for slot position prediction.
26. A speech recognition apparatus for use in an optimization processor in a speech recognition system, the speech recognition system further comprising a speech recognition model and a target end-to-end model for slot prediction, the apparatus comprising:
a first acquisition unit configured to acquire a first prediction result output by the speech recognition model, the first prediction result including a plurality of pieces of text information;
a second obtaining unit configured to obtain a second prediction result output by the target end-to-end model, the second prediction result being obtained by performing slot prediction on the first prediction result;
a score determining unit configured to, in response to the second prediction result showing the appearance position of a word of the target category and the slot mark corresponding to the target category, read the word located at the appearance position from the plurality of pieces of text information and determine, according to the slot mark, a first score corresponding to the read word located at the appearance position;
and the recognition result determining unit is configured to determine a target word from the read words located at the appearance positions according to the determined first score, wherein the target word is used as a recognition result corresponding to the appearance positions.
27. An interaction device comprising an optimization processor;
the optimization processor is configured to:
obtaining a prediction result output by a target end-to-end model, wherein the target end-to-end model is used for voice recognition and has a slot position prediction function, and the prediction result comprises a plurality of pieces of text information;
in response to reading the slot position marks corresponding to the target category at the same positions of the plurality of pieces of text information, respectively extracting words which are adjacent to the slot position marks and appear after the slot position marks from the plurality of pieces of text information;
determining a first score corresponding to the extracted word according to the slot position mark;
and determining a target word from the extracted words according to the determined first score, wherein the target word is used as a recognition result corresponding to the position occupied by the slot position mark.
28. An interaction device comprising an optimization processor;
the optimization processor is configured to:
acquiring a first prediction result output by a voice recognition model, wherein the first prediction result comprises a plurality of pieces of text information;
obtaining a second prediction result output by a target end-to-end model, wherein the target end-to-end model is used for slot position prediction, and the second prediction result is obtained by performing slot position prediction on the first prediction result;
in response to the second prediction result showing the appearance position of a word of the target category and the slot position mark corresponding to the target category, reading the word located at the appearance position from the plurality of pieces of text information, and determining a first score corresponding to the read word located at the appearance position according to the slot position mark;
and according to the determined first score, determining a target word from the read words located at the appearance position, wherein the target word is used as a recognition result corresponding to the appearance position.
29. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-20.
30. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that when executed by the processor implements the method of any of claims 1-20.
CN202010825574.8A 2020-08-17 2020-08-17 Model processing method and device, and voice recognition method and device Pending CN114078470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010825574.8A CN114078470A (en) 2020-08-17 2020-08-17 Model processing method and device, and voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010825574.8A CN114078470A (en) 2020-08-17 2020-08-17 Model processing method and device, and voice recognition method and device

Publications (1)

Publication Number Publication Date
CN114078470A true CN114078470A (en) 2022-02-22

Family

ID=80281024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010825574.8A Pending CN114078470A (en) 2020-08-17 2020-08-17 Model processing method and device, and voice recognition method and device

Country Status (1)

Country Link
CN (1) CN114078470A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457960A (en) * 2022-11-09 2022-12-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination