WO2022121251A1 - 文本处理模型训练方法、装置、计算机设备和存储介质 - Google Patents

文本处理模型训练方法、装置、计算机设备和存储介质

Info

Publication number
WO2022121251A1
WO2022121251A1 · PCT/CN2021/096582
Authority
WO
WIPO (PCT)
Prior art keywords
model
text
wubi
data
pinyin
Prior art date
Application number
PCT/CN2021/096582
Other languages
English (en)
French (fr)
Inventor
吴天博
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022121251A1 publication Critical patent/WO2022121251A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/274Converting codes to words; Guess-ahead of partial word inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233Character input methods
    • G06F3/0237Character input methods using prediction or retrieval techniques

Definitions

  • the present application relates to a text processing model training method, apparatus, computer equipment and storage medium.
  • Chinese error correction is a basic task in natural language processing, which often affects the accuracy of upstream tasks.
  • cheaply available text data often contains a wide variety of Chinese errors, and because changing even a few characters can drastically change the meaning of a Chinese sentence, Chinese error correction is often used as an underlying module that provides higher-quality text for upstream tasks.
  • the inventors recognized that Bert, the current mainstream pre-trained language model in the traditional technology, introduces noise into roughly 10%-15% of tokens through the mask mechanism of its MLM pre-training task, so Bert has a certain error detection ability; however, because only that small fraction of noise is introduced, Bert often performs poorly at text error detection, making it difficult to obtain high-quality text data.
  • a text processing model training method is provided.
  • a text processing model training method comprising:
  • a text data acquisition device comprising:
  • a first training sample set obtaining module used for obtaining the first text sample set to be trained
  • a word vector training module for performing model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods
  • the second training sample set acquisition module is used to obtain the second to-be-trained text sample set and the pre-trained language model
  • an encoding data extracting module for extracting encoding data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model;
  • the model training module is used to perform model training according to the encoded data to obtain a text processing model.
  • a method for acquiring text data comprising:
  • acquiring text data to be processed; and inputting the text data to be processed into a pre-trained text processing model, so as to perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is trained with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • a text data acquisition device includes:
  • an acquisition module for acquiring the text data to be processed
  • the processing module is used to input the text data to be processed into the pre-trained text processing model, so as to perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is trained with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • a computer device comprising a memory and one or more processors, the memory having computer-readable instructions stored therein, the computer-readable instructions, when executed by the one or more processors, causing the one or more processors to execute the following steps:
  • one or more computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the above-mentioned method for acquiring text data acquires text data to be processed, inputs the text data to be processed into a pre-trained text processing model, and performs data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is trained with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • FIG. 1 is an application environment diagram of a text processing model training method according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of a text processing model training method according to one or more embodiments.
  • FIG. 3 is a structural diagram of a text processing model provided in accordance with one or more embodiments.
  • FIG. 4 is a structural block diagram of an apparatus for training a text processing model according to one or more embodiments.
  • FIG. 5 is a schematic flowchart of a method for acquiring text data according to one or more embodiments.
  • FIG. 6 is a structural block diagram of an apparatus for acquiring text data according to one or more embodiments.
  • FIG. 7 is a block diagram of a computer device in accordance with one or more embodiments.
  • the text data acquisition method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 obtains the first text sample set to be trained uploaded by the terminal 102; performs model training based on the first text sample set to be trained to obtain Wubi and pinyin word vector models corresponding to different input methods; obtains a second text sample set to be trained and a pre-trained language model; extracts, based on the language model, the Wubi word vector model and the pinyin word vector model respectively, the encoded data corresponding to the second text sample set to be trained; and performs model training according to the encoded data to obtain a text processing model.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • a text processing model training method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • Step 202 Obtain a first text sample set to be trained.
  • the first training text sample set includes multiple text data, and specifically may include multiple text sentences.
  • the text data in the first training text sample set may include text data requiring error correction processing, that is, the first training text sample set may include erroneous text information.
  • the sources of the first training text sample set include Chinese Wikipedia, historical telemarketing records, news crawled on the Internet, Baidu Q&A and other data, which are not limited here.
  • Step 204 respectively performing model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods.
  • the input methods specifically include the pinyin input method and the Wubi input method, which identify text using different coding algorithms: the pinyin input method identifies text with a pinyin coding algorithm, while the Wubi input method identifies text with a Wubi coding algorithm.
  • it should be noted that, for the same text content, the pinyin coding algorithm and the Wubi coding algorithm can produce different codes; for example, the pinyin of the character "字" is "zi" while its Wubi code is "PBF". Therefore, word vector models corresponding to different input methods can be trained separately based on the different encoding methods.
  • the word vector models corresponding to different input methods include the pinyin word vector model and the Wubi word vector model; the pinyin word vector model is trained on pinyin-encoded data and the Wubi word vector model is trained on Wubi-encoded data. Because they are trained on different encodings, these word vector models characterize the same text data from different dimensions, which makes the characterization of the text data more accurate and reliable.
  • specifically, depending on the input method, the word vector model includes a pinyin word vector model and a Wubi word vector model, and for the same text the pinyin word vector model can be used to obtain the corresponding pinyin-encoded data while the Wubi word vector model can be used to obtain the corresponding Wubi-encoded data, as illustrated in the sketch below.
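  • For illustration, a minimal sketch of deriving the two input-method encodings for the same text, assuming the open-source pypinyin library for pinyin and a small hand-written Wubi table; the table below is a hypothetical subset used only for this example, not data from the patent:

```python
# Sketch: the two parallel input-method views of the same text.
# pypinyin is one common pinyin library; the Wubi table is an illustrative stand-in.
from pypinyin import lazy_pinyin

WUBI_CODES = {"字": "PBF", "中": "KHK", "国": "LGYI"}  # hypothetical subset of a full table

def encode(text):
    pinyin_seq = lazy_pinyin(text)                       # e.g. ["zi"] for "字"
    wubi_seq = [WUBI_CODES.get(ch, "") for ch in text]   # e.g. ["PBF"] for "字"
    return pinyin_seq, wubi_seq

print(encode("字"))  # (['zi'], ['PBF'])
```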
  • Step 206 Obtain a second text sample set to be trained and a pre-trained language model.
  • the server obtains the second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which are not limited herein.
  • the language model is a model with language prediction ability, and specifically, it may be a Bert (Bidirectional Encoder Representation from Transformers) language model.
  • Bert Bidirectional Encoder Representation from Transformers
  • MLM mask language model
  • NSP next sentence prediction
  • the MLM task predicts the text content at a masked position, and the NSP task judges whether two adjacent sentences are continuous.
  • Step 208 based on the language model, the Wubi word vector model and the pinyin word vector model, respectively extract the encoded data corresponding to the second text sample set to be trained.
  • since the language model, the Wubi word vector model and the pinyin word vector model express information about the same text data in different dimensions, at least three different representations of the same text data can be obtained from the different models.
  • the information content of the encoded data obtained by expressing the second training sample set in different dimensions is more abundant, so the text processing model obtained when the model training is performed based on the encoded data has higher text processing accuracy.
  • Step 210 Perform model training according to the encoded data to obtain a text processing model.
  • the text processing model is a model for performing error correction processing on the text data to be processed, and is used for processing the text data to be processed into text data with higher precision.
  • model training can be performed based on text data with higher accuracy as training data, thereby improving the accuracy of model training.
  • in the above text processing model training method, apparatus, computer device and storage medium, a first text sample set to be trained is obtained; model training is performed based on the first text sample set to be trained to obtain Wubi and pinyin word vector models corresponding to different input methods; a second text sample set to be trained and a pre-trained language model are obtained; the encoded data corresponding to the second text sample set to be trained is extracted based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and model training is performed according to the encoded data to obtain the text processing model.
  • based on the training sample set, word vector models corresponding to different input methods are trained first, and then model training is performed again based on the trained word vector models and the language model to obtain the text processing model, which ensures that text information from more dimensions is integrated during training, so the resulting text processing model has higher precision and higher prediction accuracy.
  • the trained text processing model can be used to process input text data to be processed, so that more text information is taken into account during text processing, which improves the ability to process text data and makes it possible to obtain high-quality text data.
  • performing model training based on the first text sample set to be trained to obtain Wubi and pinyin word vector models corresponding to different input methods includes: converting the first text sample set to be trained into corresponding pinyin encoding vectors, traversing the pinyin encoding vectors in turn with a pre-configured sliding window, taking the traversed pinyin encoding vector as the current pinyin vector to be processed, predicting the pinyin encoding vector at a preset position in the current pinyin vector based on the current word vector model corresponding to the current pinyin model parameters, determining target pinyin model parameters from the predicted and the real pinyin encoding vectors, and obtaining the pinyin word vector model from the determined target pinyin model parameters; and converting the first text sample set to be trained into corresponding Wubi encoding vectors, traversing them in the same way, predicting the Wubi encoding vector at a preset position in the current Wubi vector to be processed, determining target Wubi model parameters from the predicted and the real Wubi encoding vectors, and obtaining the Wubi word vector model from the determined target Wubi model parameters.
  • specifically, the server obtains the first text sample set to be trained, converts it into corresponding pinyin encoding vectors and trains a word vector model on them to obtain the pinyin word vector model, and converts it into corresponding Wubi encoding vectors and trains a word vector model on them to obtain the Wubi word vector model.
  • in one specific embodiment, the server converts each text in the first training text sample set into corresponding pinyin data to obtain pinyin encoding vectors, which are used as the input data for training the pinyin word vector model to obtain the trained pinyin word vector model; likewise, the server converts each text into corresponding Wubi data to obtain Wubi encoding vectors, which are used as the input data for training the Wubi word vector model to obtain the trained Wubi word vector model.
  • the training method of the word vector encoding model may be a training method based on a Bert language model, or a training method based on a word vector such as word2vec, which is not limited here.
  • training the word vector model in a word2vec-style manner includes: converting the text of the first training text sample set into Wubi encoding vectors and setting a predefined sliding window, for example of size 5; the server then traverses the Wubi encoding vectors of the text data in steps of the sliding window size, takes the currently traversed Wubi encoding vectors as the Wubi encoding vectors to be processed, and performs a data prediction step on them.
  • specifically, in each iteration, the current Wubi word vector model corresponding to the current Wubi model parameters uses the Wubi codes of the two characters before and the two characters after to predict the Wubi code of the character in the middle position; the predicted Wubi encoding vector is compared with the actual one, the current Wubi word vector model parameters are adjusted according to the comparison result to obtain the target Wubi word vector model parameters, and the target Wubi word vector model is finally obtained from those parameters. The pinyin word vector model is obtained in the same way; a sketch of this sliding-window training is shown below.
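  • A minimal sketch of this sliding-window (CBOW-style) training, assuming gensim's Word2Vec as the word2vec implementation; the corpus, vector size and code strings are illustrative assumptions, not values taken from the patent:

```python
# Each "sentence" is the per-character Wubi code sequence of one training text.
# window=2 means two codes on each side predict the middle one (a 5-character window in total).
from gensim.models import Word2Vec

wubi_corpus = [
    ["KHK", "LGYI", "PBF", "YYGF"],   # illustrative Wubi code sequence for one sentence
    ["PBF", "KHK"],                   # another sentence
]

wubi_w2v = Word2Vec(
    sentences=wubi_corpus,
    vector_size=128,   # embedding dimension (an assumption)
    window=2,          # two codes before and two after the target position
    sg=0,              # CBOW: predict the middle code from its context
    min_count=1,
)
vec = wubi_w2v.wv["PBF"]   # learned Wubi embedding for the code "PBF"
```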
  • in the above embodiment, representing the training sample set with different input-method encodings expresses the same text data in multiple dimensions, so the model can obtain multi-dimensional information about the same text data, which provides more data information for training and improves training precision.
  • moreover, the models corresponding to different input methods can be trained as word vectors; word vector training is low-cost and efficient, which further improves the efficiency of model training.
  • a method for acquiring text data is provided, which is described by taking the method applied to the server in FIG. 1 as an example, including the following steps:
  • Step 502 acquiring text data to be processed.
  • the text data to be processed is text data that needs to be processed for error correction, that is, the text data to be processed may include erroneous text information.
  • in a specific embodiment, the text data to be processed can be used as a training sample set for model training; if it contains erroneous text information, performing model training with it will greatly affect the accuracy of training, so the text data to be processed needs to be processed to remove or correct the erroneous data it contains.
  • the sources of the text data to be processed include Chinese Wikipedia, historical telephone sales records, news crawled on the Internet, Baidu Q&A and other data, which are not limited here.
  • Step 504: Input the text data to be processed into a pre-trained text processing model, and perform data processing on it according to the model parameters in the text processing model to obtain target text data; the text processing model is trained with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • the text processing model is a model for performing error correction processing on the text data to be processed, and is used for processing the text data to be processed into text data with higher precision.
  • model training can be performed based on text data with higher accuracy as training data, thereby improving the accuracy of model training.
  • the target text data is the data obtained after performing data error correction on the text data to be processed, that is to say, the target text data has high data accuracy and can be used as a training sample set during model training.
  • the input methods specifically include the pinyin input method and the Wubi input method, which identify text using different coding algorithms: the pinyin input method identifies text with a pinyin coding algorithm, while the Wubi input method identifies text with a Wubi coding algorithm.
  • for the same text content, the pinyin coding algorithm and the Wubi coding algorithm can produce different codes; for example, the pinyin of the character "字" is "zi" while its Wubi code is "PBF". Therefore, word vector models corresponding to different input methods can be trained separately based on the different encoding methods.
  • the word vector model includes a pinyin word vector model and a Wubi word vector model.
  • the pinyin word vector model can be used to obtain the corresponding pinyin encoded data
  • the Wubi word vector model can be used to obtain the corresponding Wubi encoded data.
  • the language model is a model with language prediction ability, and specifically, it may be a Bert (Bidirectional Encoder Representation from Transformers) language model.
  • Bert Bidirectional Encoder Representation from Transformers
  • MLM mask language model
  • NSP next sentence prediction
  • the MLM task predicts the text content at a masked position, and the NSP task judges whether two adjacent sentences are continuous.
  • the above-mentioned method for acquiring text data acquires text data to be processed, inputs the text data to be processed into a pre-trained text processing model, and performs data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is trained with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model. By training word vector models corresponding to different input methods and processing the text data jointly with those word vector models and the language model, more text information is taken into account during text processing, which improves the ability to process text data and makes it possible to obtain high-quality text data.
  • Chinese error correction is often used as a low-level module to provide higher-quality texts for upstream tasks. Therefore, one of the purposes of this application is to perform error correction processing on the erroneous data in the text data to be processed, so as to ensure the acquisition of target text data with a high accuracy rate, and to use the target text data as a training sample set for model training.
  • the present application creatively introduces word vector models corresponding to different input methods, which brings more reference information to the model, thereby improving the processing capability of the text data to be processed, and making it possible to obtain target data with higher precision.
  • in one embodiment, extracting the encoded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model respectively includes: extracting Wubi encoded data from the second text sample set to be trained based on the pre-trained Wubi word vector model; extracting pinyin encoded data from the second text sample set to be trained based on the pre-trained pinyin word vector model; and obtaining the pre-trained language model and extracting multi-dimensional language encoded data from the second training sample set based on the language model.
  • performing model training according to the encoded data to obtain the text processing model includes: using the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data, and performing model training according to the input data to obtain the text processing model.
  • specifically, the text processing model is obtained by joint training with the already-trained word vector models and the language model. That is, in the specific training process, the pinyin word vector model and the Wubi word vector model are first trained on the training sample set and the pre-trained language model is obtained, and then model training is performed again based on the trained pinyin word vector model, the Wubi word vector model and the language model to obtain the final text processing model.
  • in other words, this application includes at least two layers of model training: the first layer trains the input-method-based word vector models, and the second layer trains the text processing model again based on the input-method-based word vector models obtained from the first layer together with the language model.
  • the server obtains the second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which are not limited herein. Then input the second set of text samples to be trained into the trained word vector model to obtain pinyin coding data and Wubi coding data respectively, and input the second set of text samples to be trained into the trained language model to obtain multi-dimensional language encoded data. Then, the obtained pinyin coded data, Wubi coded data and multi-dimensional language coded data are used as input data to train the model again, and then a text processing model is obtained.
  • in this process, the input data for training the text processing model comes from multiple models, specifically the word vector models corresponding to different input methods and the pre-trained, high-precision language model, so the data sources during training are more accurate and the information is richer, which makes the training accuracy of the model higher.
  • the multi-dimensional language encoding data includes one or more of word vector encoding data (token embedding), classification encoding data (type embedding) and position encoding data (position embedding).
  • the Bert embedding layer in the language model has three inputs recorded as multi-dimensional language encoding data, and multi-dimensional language encoding data can express text information from various aspects.
  • the multi-dimensional language encoding data corresponds to token-embedding, segment-embedding and position-embedding, respectively.
  • Token-embedding is used to convert words into a fixed-dimensional vector representation, and each word is represented as a 768-dimensional vector in Bert-base.
  • Segment-embedding: when Bert solves a two-sentence classification task (such as judging whether two texts are semantically similar), the two texts are simply concatenated and fed into the model, and segment-embedding is what allows the model to distinguish between them.
  • the segment-embedding part of the first sentence is all 0, and the segment-embedding part of the second sentence is all 1.
  • BERT uses the transformer encoder to learn the representation of the sentence through the self-attention mechanism. The self-attention does not pay attention to the position information of the token, so in order for the transformer to learn the position information of the token, position-embedding is added to the input.
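  • A minimal sketch of how the three Bert input embeddings are summed per position, assuming Bert-base sizes (hidden size 768) and the bert-base-chinese vocabulary size; this illustrates the standard Bert embedding layer rather than code from the patent:

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, max_len=512, type_vocab=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # token-embedding
        self.segment = nn.Embedding(type_vocab, hidden)   # segment-embedding (0 / 1 per sentence)
        self.position = nn.Embedding(max_len, hidden)     # position-embedding

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # the three embeddings are summed element-wise at every position
        return self.token(input_ids) + self.segment(segment_ids) + self.position(positions)
```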
  • in addition, during training of the text processing model, Wubi and pinyin information is introduced for information enhancement, and to further improve training efficiency the Wubi embedding and pinyin embedding in the word vector models (the Wubi word vector model and the pinyin word vector model) use word2vec; using word2vec reduces the amount of data and thereby improves the efficiency of model training.
  • in one embodiment, using the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data and performing model training on them to obtain the text processing model includes: splicing the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data to obtain spliced encoded data; performing prediction processing on the spliced encoded data based on the language model to obtain the prediction probability at each position; determining the initial predicted text at the corresponding position according to the magnitude of the prediction probability; and adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
  • the server performs splicing processing on the acquired Wubi coded data, pinyin coded data and multi-dimensional language data to obtain spliced coded data, and inputs the spliced coded data into the prediction module to obtain the corresponding prediction probability at each position.
  • the data with the predicted probability greater than the preset threshold is extracted as the initial predicted data, for example, the data with the predicted probability ranked in the top 5 can be used as the initial predicted data.
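  • A minimal sketch of the splicing and top-5 extraction described above; the tensor shapes, the prediction head and the fixed top-5 cutoff are assumptions chosen only for illustration:

```python
import torch

def top5_per_position(wubi_enc, pinyin_enc, lang_enc, prediction_head):
    # wubi_enc / pinyin_enc: [batch, seq_len, d_wubi] / [batch, seq_len, d_pinyin]
    # lang_enc: [batch, seq_len, 768] from the Bert encoder
    spliced = torch.cat([wubi_enc, pinyin_enc, lang_enc], dim=-1)  # spliced encoded data
    logits = prediction_head(spliced)                # [batch, seq_len, vocab]
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(5, dim=-1)       # five highest-probability characters per position
    return top_probs, top_ids
```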
  • in one embodiment, determining the initial predicted text at the corresponding position according to the magnitude of the prediction probability includes: obtaining the predicted text whose prediction probability is greater than a preset value, and extracting the initial predicted text from the predicted text based on the homophone principle and the pinyin principle; the initial predicted text is stored in a blockchain node.
  • FIG. 3 is a structural diagram of a text processing model provided in an embodiment.
  • this application does not add token embedding to the last layer of the error correction module for classification output, but directly outputs through the error correction module and uses pinyin features to constrain the output.
  • this application makes full use of the characteristics of language model training to detect errors in texts.
  • current language models basically predict the current position given the words to its left and right (or, alternatively, predict the left and right words given a central word); through this kind of training the model learns which words a word tends to be adjacent to and with what probability, and the same holds when training on pinyin.
  • for example, the correct pinyin of 中国熊猫 ("Chinese panda") is "zhong guo xiong mao" and the erroneous pinyin is "zhong guo xun mao"; if the data used to train the pinyin word vector model is of high quality, the probability of "mao" being preceded by "xiong" is higher than that of "xun", and likewise the probability of "guo" being followed by "xiong" is higher than that of "xun". As long as the pinyin word vector model is trained on high-quality data, when the error detection module predicts the position of the "xun" before "mao", the probability of "xiong" will be far higher than that of "xun", which is exactly what enables error detection.
  • this is also the reason the pinyin word vector model can be frozen during training of the text processing model: freezing prevents the correct pinyin word vectors from being affected by lower-quality data.
  • continuing with FIG. 3, the Bert model used in the error correction part produces a softmax output for every character; if the output differs from the input, the character needs to be corrected. For example, for the typo 熏 ("xun"), the five highest-scoring characters in the Bert softmax output are 熊, 寻, 大, 好 and 勋. The output is then further filtered by pinyin: according to the output of the preceding pinyin embedding and dense layer, the pinyin predicted for this position is "xiong", so the Bert result is filtered on that basis, characters with other pinyin are removed, and only 熊 is finally retained; other positions are handled in the same way.
  • in other words, the pinyin prediction result is used in the error correction module to screen the prediction results: besides homophone screening of the top-5 predictions, the pinyin prediction result is also used for screening.
  • for example, in 中国熏猫 with the typo 熏, suppose the top-5 predictions are 熊, 寻, 大, 好 and 勋; if only homophones of the input were kept, filtering by "xun" would remove 熊, the character with the highest probability, and error correction would fail. But if the pinyin predicted for 熏 is "xiong", then filtering by both "xiong" and "xun" keeps 熊, 寻 and 勋, and taking the most probable of these, 熊, makes the correction succeed.
  • the above-mentioned initial predicted text can also be stored in a node of a blockchain.
  • in the above embodiment, using the pinyin embedding together with a bidirectional GRU + Dense layer for vocabulary screening achieves dynamic screening rather than fixed homophone screening: the output of the pinyin model, rather than the original input pinyin, is used to screen the subsequent Bert error correction results, which improves the accuracy of error correction and yields text data of higher accuracy; a sketch of this filtering step follows.
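  • A minimal sketch of this dynamic screening step, assuming a hypothetical char_pinyin lookup table and top-5 candidate lists like the 熊/寻/大/好/勋 example above:

```python
def filter_by_pinyin(candidates, probs, predicted_pinyin, char_pinyin):
    # candidates: e.g. ["熊", "寻", "大", "好", "勋"]; probs aligned with candidates.
    # char_pinyin: hypothetical character-to-pinyin dictionary, e.g. {"熊": "xiong", ...}.
    survivors = [(c, p) for c, p in zip(candidates, probs)
                 if char_pinyin.get(c) == predicted_pinyin]
    if not survivors:                 # nothing matches the predicted pinyin: fall back to raw top-1
        return candidates[0]
    return max(survivors, key=lambda cp: cp[1])[0]

# With predicted_pinyin == "xiong", only 熊 survives and is output as the correction.
```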
  • the model parameters of the text processing model include pinyin model parameters and Wubi model parameters;
  • the target model parameters are obtained by adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text, and determining the text processing model according to the target model parameters, including: adjusting the initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain the target Wubi model parameters; according to the pinyin model parameters and the target Wubi model parameters Determine the text processing model.
  • specifically, during training of the text processing model the pinyin embedding is fixed and immutable while the Wubi embedding is variable: "variable" means its parameters participate in the back-propagation updates during training, whereas the pinyin embedding stays fixed and is not updated during training, as in the sketch below.
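  • A minimal PyTorch-style sketch of this freezing scheme; the vocabulary sizes, embedding dimensions and weight tensors are assumptions, standing in for weights exported from the pre-trained word2vec models:

```python
import torch
import torch.nn as nn

# stand-ins for weights exported from the pre-trained pinyin / Wubi word2vec models
pinyin_w2v_weights = torch.randn(400, 128)    # e.g. 400 pinyin syllables, dim 128 (assumed)
wubi_w2v_weights = torch.randn(5000, 128)     # e.g. 5000 Wubi codes, dim 128 (assumed)

pinyin_emb = nn.Embedding.from_pretrained(pinyin_w2v_weights, freeze=True)   # fixed, never updated
wubi_emb = nn.Embedding.from_pretrained(wubi_w2v_weights, freeze=False)      # updated by backprop
```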
  • the word vector model is obtained by training based on word2vector, and the language model is obtained by training based on Bert model.
  • although the Bert language model is very strong, building a pinyin Bert is very costly, and because the quality of the pre-training text cannot be guaranteed, even a pinyin Bert could only provide information enhancement and would not be suitable for pinyin error detection; the model therefore invests in the quality of the pinyin embedding's training data and uses the lighter word2vector model instead of Bert. Moreover, because the word2vec pre-training objective is closely related to downstream error detection, its error detection ability is not expected to be much worse than Bert's.
  • the vector of Wubi is the same as the pinyin vector, which is obtained by the method of Word2Vector.
  • the training method of word2vector includes: converting all characters into Wubi codes, setting the sliding window to 5, that is, using the codes of the two characters before and after each time to predict the code of the middle character.
  • in the above embodiment, introducing the Wubi embedding and pinyin embedding of high-quality text into the error detection module for information enhancement can significantly improve the capability of the original Soft-Mask error detection network, and freezing the pinyin embedding ensures that it does not absorb the pinyin of the current erroneous text, which enables error correction.
  • in addition, the homophone screening of the top-5 candidates in the error correction module effectively controls the text output, and the correct pinyin predicted by the pinyin embedding + bidirectional GRU + Dense layer dynamically screens the results, which also reduces the probability that homophone screening filters out the correct character; a sketch of this pinyin-prediction branch is shown below.
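  • A minimal sketch of the pinyin embedding + bidirectional GRU + Dense branch that predicts the correct pinyin at each position; the layer sizes are assumptions chosen only for illustration:

```python
import torch
import torch.nn as nn

class PinyinPredictor(nn.Module):
    def __init__(self, num_pinyin=400, emb_dim=128, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(num_pinyin, emb_dim)          # pinyin embedding (frozen in practice)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * hidden, num_pinyin)               # per-position pinyin logits

    def forward(self, pinyin_ids):                  # pinyin_ids: [batch, seq_len]
        h, _ = self.gru(self.embedding(pinyin_ids))  # [batch, seq_len, 2*hidden]
        return self.dense(h)                         # predicted pinyin distribution per position
```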
  • a text processing model training apparatus including:
  • the first training sample set obtaining module 402 is configured to obtain a first text sample set to be trained.
  • the word vector training module 404 is configured to perform model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods.
  • the second training sample set obtaining module 406 is configured to obtain a second to-be-trained text sample set and a pre-trained language model.
  • the coded data extraction module 408 is configured to extract coded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model, respectively.
  • the model training module 410 is configured to perform model training according to the encoded data to obtain a text processing model.
  • in one embodiment, the encoded data extraction module 408 is further configured to convert the first text sample set to be trained into corresponding pinyin encoding vectors, traverse the pinyin encoding vectors in turn according to the pre-configured sliding window, take the traversed pinyin encoding vector as the current pinyin vector to be processed, predict the pinyin encoding vector at the preset position in the current pinyin vector based on the current word vector model corresponding to the current pinyin model parameters, determine the target pinyin model parameters according to the predicted and the real pinyin encoding vectors, and obtain the pinyin word vector model according to the determined target pinyin model parameters; and to convert the first text sample set to be trained into corresponding Wubi encoding vectors, traverse the Wubi encoding vectors in turn according to the pre-configured sliding window, take the traversed Wubi encoding vector as the current Wubi vector to be processed, predict the Wubi encoding vector at the preset position in the current Wubi vector based on the current word vector model corresponding to the current Wubi model parameters, determine the target Wubi model parameters according to the predicted and the real Wubi encoding vectors, and obtain the Wubi word vector model according to the determined target Wubi model parameters.
  • in one embodiment, the encoded data extraction module 408 is further configured to extract Wubi encoded data from the second text sample set to be trained based on the pre-trained Wubi word vector model, extract pinyin encoded data from the second text sample set to be trained based on the pre-trained pinyin word vector model, and obtain the pre-trained language model and extract multi-dimensional language encoded data from the second training sample set based on the language model; the model training module 410 is further configured to use the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data and perform model training according to the input data to obtain the text processing model.
  • in one embodiment, the model training module 410 is further configured to splice the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data to obtain spliced encoded data, perform prediction processing on the spliced encoded data based on the language model to obtain the prediction probability at each position, determine the initial predicted text at the corresponding position according to the magnitude of the prediction probability, adjust the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determine the text processing model according to the target model parameters.
  • in one embodiment, the model training module 410 is further configured to obtain the predicted text whose prediction probability is greater than a preset value, extract the initial predicted text from the predicted text based on the homophone principle and the pinyin principle, and store the initial predicted text in a blockchain node.
  • in one embodiment, the model training module 410 is further configured to adjust the initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target Wubi model parameters, and to determine the text processing model according to the pinyin model parameters and the target Wubi model parameters.
  • a text data acquisition device including:
  • the obtaining module 602 is used for obtaining the text data to be processed.
  • the processing module 604 is used to input the text data to be processed into a pre-trained text processing model, so as to perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is trained with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • Each module in the above-mentioned text data acquisition device and text processing model training device can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device in one of the embodiments, the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7 .
  • the computer device includes a processor, memory, and a network interface connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the computer device's database is used to store textual data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions when executed by the processor, implement a text data acquisition method and a text processing model training method.
  • those skilled in the art can understand that FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processor, cause the one or more processors to perform the steps of the method in any one of the foregoing embodiments.
  • one or more computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method in any one of the foregoing embodiments.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database, a chain of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A text processing model training method, relating to the field of artificial intelligence, comprising: obtaining a first text sample set to be trained (202); performing model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods (204); obtaining a second text sample set to be trained and a pre-trained language model (206); extracting, based on the language model, the Wubi word vector model and the pinyin word vector model respectively, the encoded data corresponding to the second text sample set to be trained (208); and performing model training according to the encoded data to obtain a text processing model (210).

Description

Text processing model training method, apparatus, computer device and storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application with application number 2020114479642, entitled "Text processing model training method, apparatus, computer device and storage medium", filed with the China Patent Office on December 11, 2020, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to a text processing model training method, apparatus, computer device and storage medium.
Background
Chinese error correction is a basic task in natural language processing and often affects the accuracy of upstream tasks. Cheaply available text data often contains a wide variety of Chinese errors, yet in Chinese changing even a few characters can drastically change the meaning of a sentence, so Chinese error correction is often used as an underlying module that provides higher-quality text for upstream tasks.
However, the inventors realized that Bert, the current mainstream pre-trained language model in the traditional technology, introduces noise into roughly 10%-15% of tokens through the mask mechanism of its MLM pre-training task; this gives Bert a certain error detection ability, but because only that small fraction of noise is introduced, Bert often performs poorly at text error detection, which makes it difficult to obtain high-quality text data.
发明内容
根据本申请公开的各种实施例,提供一种文本处理模型训练方法、装置、计算机设备和存储介质。
一种文本处理模型训练方法,包括:
获取第一待训练文本样本集;
基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型;
获取第二待训练文本样本集以及预训练的语言模型;
基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据;及
根据编码数据执行模型训练得到文本处理模型。
一种文本数据获取装置,包括:
第一训练样本集获取模块,用于获取第一待训练文本样本集;
词向量训练模块,用于基于第一待训练文本样本集分别执行模型训练得到不同输入法 对应的五笔词向量模型以及拼音词向量模型;
第二训练样本集获取模块,用于获取第二待训练文本样本集以及预训练的语言模型;
编码数据提取模块,用于基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据;及
模型训练模块,用于根据编码数据执行模型训练得到文本处理模型。
一种文本数据获取方法,方法包括:
获取待处理文本数据;及
将待处理文本数据输入至预先训练的文本处理模型中,以根据文本处理模型中的模型参数对待处理文本数据进行数据处理得到目标文本数据;文本处理模型是基于不同输入法对应的词向量编码数据以及语言编码数据作为输入数据进行训练得到,且词向量编码数据是基于预训练的词向量模型得到,语言编码数据是基于预训练的语言模型得到。
一种文本数据获取装置,装置包括:
获取模块,用于获取待处理文本数据;及
处理模块,用于将待处理文本数据输入至预先训练的文本处理模型中,以根据文本处理模型中的模型参数对待处理文本数据进行数据处理得到目标文本数据;文本处理模型是基于不同输入法对应的词向量编码数据以及语言编码数据作为输入数据进行训练得到,且词向量编码数据是基于预训练的词向量模型得到,语言编码数据是基于预训练的语言模型得到。
一种计算机设备,包括存储器和一个或多个处理器,所述存储器中储存有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述一个或多个处理器执行以下步骤:
获取第一待训练文本样本集;
基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型;
获取第二待训练文本样本集以及预训练的语言模型;
基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据;及
根据编码数据执行模型训练得到文本处理模型。
计算机可读指令计算机可读指令一个或多个存储有计算机可读指令的计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:
获取第一待训练文本样本集;
基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型;
获取第二待训练文本样本集以及预训练的语言模型;
基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据;及
根据编码数据执行模型训练得到文本处理模型。
计算机可读指令计算机可读指令
上述文本数据获取方法,获取待处理文本数据;将待处理文本数据输入至预先训练的文本处理模型中,根据文本处理模型中的模型参数对待处理文本数据进行数据处理得到目标文本数据;文本处理模型是基于不同输入法对应的词向量编码数据以及语言编码数据作为输入数据进行训练得到,且词向量编码数据是基于预训练的词向量模型得到,语言编码数据是基于预训练的语言模型得到。通过训练训练不同输入法对应的词向量模型,并基于不同输入法对应的词向量模型以及语言模型综合进行文本数据的处理,使得在文本处理的过程中考虑到了更多的文本信息,进而提高了对文本数据的处理能力,进而使得获取高质量的文本数据成为了可能。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。
图1为根据一个或多个实施例中文本处理模型训练方法的应用环境图。
图2为根据一个或多个实施例中文本处理模型训练方法的流程示意图。
图3为根据一个或多个实施例中提供的一种文本处理模型的结构图。
图4为根据一个或多个实施例中文本处理模型训练装置的结构框图。
图5为根据一个或多个实施例中文本数据获取方法的流程示意图。
图6为根据一个或多个实施例中文本数据获取装置的结构框图。
图7为根据一个或多个实施例中计算机设备的框图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
本申请提供的文本数据获取方法,可以应用于如图1所示的应用环境中。终端102通过网络与服务器104进行通信。服务器104获取终端102上传的第一待训练文本样本集;基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型;获取第二待训练文本样本集以及预训练的语言模型;基于语言模型、 五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据;根据编码数据执行模型训练得到文本处理模型。终端102可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备,服务器104可以用独立的服务器或者是多个服务器组成的服务器集群来实现。
在其中一个实施例中,如图2所示,提供了一种文本处理模型训练方法,以该方法应用于图1中的服务器为例进行说明,包括以下步骤:
步骤202,获取第一待训练文本样本集。
其中,第一训练文本样本集中包括多个文本数据,具体可以是包括多个文本句子。需要说明的是,第一训练文本样本集中的文本数据可能包括需要进行纠错处理的文本数据,也就是说,第一训练文本样本集中可能包括错误的文本信息。具体地,第一训练文本样本集的来源包括中文***、历史电话销售记录、网上爬取的新闻、百度问答等数据,在此不做限制。
步骤204,基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型。
输入法具体包括拼音输入法以及五笔输入法,分别对应为利用不同的编码算法对文本进行标识,如拼音输入法是一种利用拼音编码算法对文本进行标识,五笔输入法是一种利用五笔编码算法对文本进行标识。并且,需要说明的是,对于同一个文本内容而言,基于不同的编码算法(拼音编码算法以及五笔编码算法)可以具有不同的编码内容,如“字”的拼音对应为“zi”五笔对应为“PBF”。故而,基于不同的编码方法,可以分别训练不同输入法对应的词向量模型。
具体地,不同输入法对应的词向量模型包括拼音词向量模型以及五笔词向量模型,并且拼音词向量模型是基于拼音编码数据进行训练得到,五笔词向量模型是基于五笔编码数据进行训练得到。故而,由于不同输入法对应的词向量模型是基于不同的编码数据进行训练得到,故而不同输入法对应的词向量模型是同不同维度表征文本数据的,并且通过不同维度表征文本数据使得文本数据的表征更加精准、可靠。具体地,基于输入法的不同,词向量模型包括拼音词向量模型以及五笔词向量模型。并且对于同一个文本而言,可以同时利用拼音词向量模型获取对应的拼音编码数据,利用五笔词向量模型获取对应的五笔编码数据。
步骤206,获取第二待训练文本样本集以及预训练的语言模型。
具体地,服务器获取第二待训练文本样本集,其中第二待训练文本样本集与第一待训练文本样本集可以是相同或者不同的样本集,在此不作限制。
其中,语言模型是具有语言预测能力的模型,具体可以是Bert(Bidirectional Encoder Representation from Transformers)语言模型。具体地,Bert模型的训练任务分别有2个,MLM(masked language model)和NSP(next sentence prediction),MLM任务是预测对应位置处的文本内容,NSP的任务是要判断前后两句话不是连续的。
步骤208,基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据。
由于语言模型、五笔词向量模型以及拼音词向量模型是分别在不同的维度对同一个文本数据进行信息表达,故而基于不同的模型可以得到同一个文本数据的至少三种不同的表达方式。通过对第二训练样本集进行不同的维度表达得到的编码数据的信息量更加丰富,故而基于编码数据执行模型训练时得到的文本处理模型的文本处理准确性更高。
步骤210,根据编码数据执行模型训练得到文本处理模型。
其中,文本处理模型是用于对待处理文本数据进行纠错处理的模型,用于将待处理文本数据处理为精度较高的文本数据。在具体业务中,可以根据精度较高的文本数据作为训练数据,进行模型训练,进而提高模型训练的精度。
上述文本处理模型训练方法、装置、计算机设备和存储介质,获取第一待训练文本样本集;基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型;获取第二待训练文本样本集以及预训练的语言模型;基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据;根据编码数据执行模型训练得到文本处理模型。基于训练样本集首先训练得到不同输入法对应的词向量模型,然后基于训练好的词向量模型以及语言模型再次执行模型训练得到文本处理模型,保证在训练文本处理模型的过程中能够综合更多维度的文本信息,得到的文本处理模型的精度更高,预测准确率更高。并且利用训练得到的文本处理模型可以用于对输入的待处理文本数据进行处理,使得在文本处理的过程中考虑到了更多的文本信息,进而提高了对文本数据的处理能力,进而使得获取高质量的文本数据成为了可能。
在其中一个实施例中,基于第一待训练文本样本集分别执行模型训练得到不同输入法对应的五笔词向量模型以及拼音词向量模型,包括:将第一待训练文本样本集转换为对应的拼音编码向量,根据预配置的滑动窗口依次遍历拼音编码向量,并将遍历到的拼音编码向量作为当前待处理拼音向量,基于当前拼音模型参数对应的当前词向量模型在当前待处理拼音向量中预测预设位置处的拼音编码向量,并根据预测的拼音编码向量以及真实的拼音编码向量确定目标拼音模型参数,根据确定的目标拼音模型参数得到拼音词向量模型;将第一待训练文本样本集转换为对应的五笔编码向量,根据预配置的滑动窗口依次遍历五笔编码向量,并将遍历到的五笔编码向量作为当前待处理五笔向量,基于当前五笔模型参数对应的当前词向量模型在当前待处理五笔向量中预测预设位置处的五笔编码向量,并根据预测的五笔编码向量以及真实的五笔编码向量确定目标五笔模型参数,根据确定的目标五笔模型参数得到五笔词向量模型。
具体地,服务器获取第一待训练文本样本集,将第一待训练文本样本集转换为对应的拼音编码向量,根据拼音编码向量进行词向量模型训练得到拼音词向量模型,将第一待训练文本样本集转换为对应的五笔编码向量,根据五笔编码向量进行词向量模型训练得到五笔词向量模型。
在其中一个具体的实施例中,服务器获取第一训练文本样本集,并将第一训练文本样本集中的每一个文本转换为对应的拼音数据得到拼音编码向量,将得到的拼音编码向量作为训练拼音词向量模型的输入数据,进而得到训练好的拼音词向量模型。以及,服务器获取第一训练文本样本集,并将第一训练文本样本集中的每一个文本转换为对应的五笔数据得到五笔编码向量,将得到的五笔编码向量作为训练五笔词向量模型的输入数据,进而得到训练好的五笔词向量模型。具体地,词向量编码模型的训练方式可以是基于Bert语言模型的训练方式,也可以是基于词向量如word2vec的训练方式,在此不作限制。
具体地,基于词向量如word2vec的训练方式执行词向量模型的训练方式,包括:将第一训练文本样本集对应的文字转换为五笔编码向量,并设置预定义的滑动窗口,如可以设定滑动窗口的大小为5,然后服务器基于滑动窗口对应的大小为单位步长,依次遍历文本数据对应的五笔编码向量,并将当前遍历到的五笔编码向量作为当前待处理五笔编码向量,并在当前待处理五笔编码向量中执行数据预测步骤。具体可以是在每次循环过程中,在当前五笔模型参数对应的当前五笔词向量模型中,使用前后各两个文字的五笔编码向量来预测中间位置处的文字的五笔编码向量,并将预测得到的五笔编码向量与实际的五笔编码向量进行比对,以根据比对结果对当前五笔词向量模型参数进行调整得到目标五笔词向量模型参数,最终根据目标五笔词向量模型参数得到目标的五笔词向量模型,同理,可以得到拼音词向量模型。
上述实施例中,通过利用不同的输入表示法将训练样本集进行分别表示,进而能够实现在多个维度表达同一个文本数据,使得模型能够获取同一个文本数据的多维度信息,进而为训练模型提供了更多的数据信息,以提高模型训练的精度,并且,不同输入法对应的模型可以基于词向量的方式进行训练,词向量的训练方式成本交底,效率较高,进一步都提高了模型训练的效率。
在一个实施例中,如图5所示,提供了一种文本数据获取方法,以该方法应用于图1中的服务器为例进行说明,包括以下步骤:
步骤502,获取待处理文本数据。
其中,待处理文本数据是需要进行纠错处理的文本数据,也就是说,待处理文本数据中可能包括错误的文本信息。在一个具体的实施例中,待处理文本数据可以作为模型训练的训练样本集,故而当待处理文本数据中包括错误文本信息时,利用包括错误信息的待处理文本数据进行模型训练时会对模型训练的精度造成很大的影响,故而,需要对待处理文本数据进行数据处理,以实现对包括的错误数据进行去除或者进行纠错处理。
具体地,待处理文本数据的来源包括中文***、历史电话销售记录、网上爬取的新闻、百度问答等数据,在此不做限制。
步骤504,将待处理文本数据输入至预先训练的文本处理模型中,根据文本处理模型中的模型参数对待处理文本数据进行数据处理得到目标文本数据;文本处理模型是基于不同输入法对应的词向量编码数据以及语言编码数据作为输入数据进行训练得到,且词向量 编码数据是基于预训练的词向量模型得到,语言编码数据是基于预训练的语言模型得到。
其中,文本处理模型是用于对待处理文本数据进行纠错处理的模型,用于将待处理文本数据处理为精度较高的文本数据。在具体业务中,可以根据精度较高的文本数据作为训练数据,进行模型训练,进而提高模型训练的精度。
目标文本数据是对待处理文本数据进行数据纠错处理后得到的数据,也就是说目标文本数据的数据精度较高,可以作为模型训练时的训练样本集。输入法具体包括拼音输入法以及五笔输入法,分别对应为利用不同的编码算法对文本进行标识,如拼音输入法是一种利用拼音编码算法对文本进行标识,五笔输入法是一种利用五笔编码算法对文本进行标识。并且,需要说明的是,对于同一个文本内容而言,基于不同的编码算法(拼音编码算法以及五笔编码算法)可以具有不同的编码内容,如“字”的拼音对应为“zi”五笔对应为“PBF”。故而,基于不同的编码方法,可以分别训练不同输入法对应的词向量模型。
具体地,基于输入法的不同,词向量模型包括拼音词向量模型以及五笔词向量模型。并且对于同一个文本而言,可以同时利用拼音词向量模型获取对应的拼音编码数据,利用五笔词向量模型获取对应的五笔编码数据。
其中,语言模型是具有语言预测能力的模型,具体可以是Bert(Bidirectional Encoder Representation from Transformers)语言模型。具体地,Bert模型的训练任务分别有2个,MLM(masked language model)和NSP(next sentence prediction),MLM任务是预测对应位置处的文本内容,NSP的任务是要判断前后两句话不是连续的。
上述文本数据获取方法,获取待处理文本数据;将待处理文本数据输入至预先训练的文本处理模型中,根据文本处理模型中的模型参数对待处理文本数据进行数据处理得到目标文本数据;文本处理模型是基于不同输入法对应的词向量编码数据以及语言编码数据作为输入数据进行训练得到,且词向量编码数据是基于预训练的词向量模型得到,语言编码数据是基于预训练的语言模型得到。通过训练训练不同输入法对应的词向量模型,并基于不同输入法对应的词向量模型以及语言模型综合进行文本数据的处理,使得在文本处理的过程中考虑到了更多的文本信息,进而提高了对文本数据的处理能力,进而使得获取高质量的文本数据成为了可能。
基于单独利用语言模型进行文本处理,如在语言模型中的检错模块中只使用了基于文本的token embedding特征,难以很好地解决在具体的落地应用中遇到的同音不同字和字形相近偏旁不同的问题,尤其是对于自动语音识别技术(ASR)对应的语音识别场景,发音是个非常重要的纠错线索。在本申请中通过增加不同输入法对应的词向量模型(如拼音和五笔)与语言模型共同配合进行对待处理文本数据的处理,实现了给与模型更多的文本参考信息,进而实现了提高了文本处理模型对待处理文本数据的处理能力。
在具体的业务场景中,中文纠错是自然语言处理中的一个基础任务,它常常影响着上游任务的准确性,在可获得的廉价文本数据中,常常包含着各种各样的中文错误,简单的有由于用户输入法导致的拼音错误以及五笔错误。而在ASR识别中,会出现一些同音词 的替换,比如逆境被ASR识别为泥金,这样虽然发声一致,但是文本意思却发生了天翻地覆的变化,也可能或多或少引入一些噪声,比如你好变成了你你好。这样的噪声文本送到深度学习模型中,其实会大大影响模型的准确率,毕竟对于博大精深的中文而言,改变几个字可能语义也会发生天翻地覆的变化。因此中文纠错常常作为底层模块为上游任务提供较高质量的文本。故而,本申请的目的之一在于实现对待处理文本数据中的错误数据进行纠错处理,以保证获取准确率较高的目标文本数据,并将目标文本数据作为训练样本集进行模型的训练。
本申请创造性引入了不同输入法对应的词向量模型,为模型带来更多可参考的信息,进而提高了对待处理文本数据的处理能力,使得获取精度更高的目标数据成为了可能。
在其中一个实施例中,基于语言模型、五笔词向量模型以及拼音词向量模型分别提取第二待训练文本样本集对应的编码数据,包括:基于预训练的五笔词向量模型从第二待训练文本样本集中提取五笔编码数据;基于预训练的拼音词向量模型从第二待训练文本样本集中提取拼音编码数据;获取预训练的语言模型,并基于语言模型从第二训练样本集中提取多维语言编码数据;根据编码数据执行模型训练得到文本处理模型,包括:将五笔编码数据、拼音编码数据以及多维编码数据作为输入数据,并根据输入数据进行模型训练得到文本处理模型。
具体地,文本处理模型是基于训练好的词向量模型以及语言模型共同进行训练得到的。也就是说,在具体训练过程中,首先基于训练样本集训练得到拼音词向量模型以及五笔词向量模型,以及获取预训练的语言模型,然后基于训练好的拼音词向量模型、五笔词向量模型以及语言模型再次进行模型训练得到最终的文本处理模型。换言之,在本申请中至少包括两层的模型训练过程,第一层是基于输入法的词向量模型的训练,另一层是基于第一层训练得到的基于输入法的词向量模型以及语言模型再次进行模型训练得到的文本处理模型。
具体地,服务器获取第二待训练文本样本集,其中第二待训练文本样本集与第一待训练文本样本集可以是相同或者不同的样本集,在此不作限制。然后将第二待训练文本样本集输入至训练好的词向量模型中,分别得到拼音编码数据以及五笔编码数据,以及将第二待训练文本样本集输入至训练好的语言模型中,得到多维语言编码数据。然后将得到的拼音编码数据、五笔编码数据以及多维语言编码数据再次作为输入数据进行模型的训练,进而得到文本处理模型。在此过程中,在训练文本处理模型的过程中,输入数据的来源包括多个模型,具体包括不同输入法对应的词向量模型以及预训练的精度较高的语言模型,进而使得训练文本处理模型的过程中数据的来源更加精准,信息更加丰富,进而使得模型的训练精度更高。
在其中一个具体地实施例中,多维语言编码数据中包括词向量编码数据(token embedding)、分类编码数据(type embedding)以及位置编码数据(position embedding)中的一种或者多种。具体地,语言模型中的Bert embedding layer有三个输入记为多维语言 编码数据,多维语言编码数据可以从多方面表达文本信息。具体地,多维语言编码数据分别对应为token-embedding、segment-embedding和position-embedding。具体地,Token-embedding是用于将单词转换为固定维的向量表示形式,在Bert-base中每个单词都表示为一个768维的向量。Segment-embedding是用于Bert在解决双句分类任务(如判断两段文本在语义上是否相似)时是直接把这两段文本拼接起来输入到模型中,那么模型是如何区分这两段文本呢,答案就是通过segment-embedding。对于两个句子,第一个句子的segment-embedding部分全是0,第二个句子的segment-embedding部分全是1。BERT使用transformer编码器,通过self-attention机制学习句子的表征,self-attention不关注token的位置信息,所以为了能让transformer学习到token的位置信息,在输入时增加了position-embedding。
并且,在训练文本处理模型的过程中,通过引入五笔和拼音信息来进行信息增强。并且为了进一步提高模型训练的效率,词向量模型如五笔词向量模型以及拼音词向量模型中的五笔embedding以及拼音embedding使用word2vec。通过使用word2vec可以减少数据量,进而实现了提高模型训练的效率。
在其中一个实施例中,将五笔编码数据、拼音编码数据以及多维语言编码数据作为输入数据,并根据输入数据进行模型训练得到文本处理模型,包括:将五笔编码数据、拼音编码数据以及多维语言编码数据进行拼接处理得到拼接编码数据;基于语言模型对拼接编码数据进行预测处理得到每一个位置处对应的预测概率;根据预测概率的大小确定对应位置处的初始预测文本;基于初始预测文本与真实标签文本之间的差异对初始文本处理模型的初始模型参数进行调整得到目标模型参数,并根据目标模型参数确定文本处理模型。
具体地,服务器将获取到的五笔编码数据、拼音编码数据以及多维语言数据进行拼接处理得到拼接编码数据,并将拼接编码数据输入至预测模块中,得到每一个位置处对应的预测概率。并且提取预测概率大于预设阈值的数据作为初始预测数据,如可以将预测概率排名在前5的数据作为初始预测数据。
在其中一个实施例中,根据预测概率的大小确定对应位置处的初始预测文本,包括:获取预测概率值大于预设值的预测文本;基于同音原则以及拼音原则从预测文本中提取初始预测文本,初始预测文本存储至区块链节点中。
参考图3,图3为一个实施例中提供的一种文本处理模型的结构图。具体地,在文本处理模型的纠错模块中,本申请不是将token embedding加入到纠错模块的最后一层进行分类输出,而是通过纠错模块直接输出并利用拼音特征对输出加以约束。具体地,本申请充分利用语言模型训练的特点进行文本的检错,目前的语言模型基本都是给定左边的词以及右边的词然后预测当前位置,也存在给定中心词预测左右两边的词,通过这种训练,模型可以学到一个词与哪些词邻近以及邻近的概率,使用拼音训练也是如此。举例来说,中国熊猫的正确拼音是“zhong guo xiong mao”,错误拼音是“zhong guo xun mao”,那么如果训练拼音词向量模型的数据质量较高,那么“mao”拼音前面是“xiong”的概率要高于 “xun”,同理“guo”后面跟“xiong”的概率要高于“xun”,这样可以看到只要拼音词向量模型的训练数据质量较高,那么在检错模块中在预测“mao”前面的“xun”时,“xiong”的概率要远高“xun”,这样起到了检错的作用。故而,在一些实施例中,还可以通过在模型训练过程中冷冻拼音词向量模型的原因,以通过冷冻处理使得正确的拼音词向量不会受较低质量数据的影响。
继续参考图3,具体地,纠错部分使用的Bert模型,它会对每个字都进行softmax输出,如果输出结果跟输入的不一样,说明该字需要修正。举例来说,针对熏这个错别字,bert模型的softmax输出结果最高分的5个为熊、寻、大、好、勋。此时希望进一步根据拼音进行筛选,根据前面拼音embedding接dense的输出结果,该位置的拼音预测为xiong,基于此对bert结果进行过滤,去掉其他拼音,最终只保留下“熊”,其他位置同理。
也就是说,利用拼音预测的结果在纠错模块做预测结果的筛选,除了对预测结果的Top5做同音词的筛选外,还添加拼音预测的结果做筛选。比如中国熏猫,对于熏该错别字而言,假设预测结果的Top5为熊、寻、大、好、勋,如果只做同音词的筛选,那么经过“xun”筛选,将会过滤掉概率最高的熊,导致纠错失败,但是如果熏经拼音预测的结果为“xiong”,这样经过“xiong”和“xun”的过滤就会得到熊、寻、勋,然后取概率最大的熊,进而纠错成功。
需要强调的是,为进一步保证上述初始预测文本的私密和安全性,上述初始预测文本还可以存储于一区块链的节点中。
上述实施例中,通过拼音Embedding以及双向GRU+Dense做词表筛选可以实现动态筛选,而不仅仅是固定的同音筛选。具体来讲就是对后面Bert纠错结果的筛选使用了拼音模型的结果,而不是原始的输入拼音,进而提高了纠错的准确性,以得到准确率较高的文本数据。
在其中一个实施例中,文本处理模型的模型参数包括拼音模型参数以及五笔模型参数;基于初始预测文本与真实标签文本之间的差异对初始文本处理模型的初始模型参数进行调整得到目标模型参数,并根据目标模型参数确定文本处理模型,包括:基于初始预测文本与真实标签文本之间的差异对初始文本处理模型的初始五笔参数进行调整得到目标五笔模型参数;根据拼音模型参数以及目标五笔模型参数确定文本处理模型。
具体地,在训练文本处理模型的过程中,拼音embedding是固定不可变的,五笔embedding是固定可变的。其中,可变是指参数可变,即五笔embedding参与到训练过程中反向传播的参数更新,拼音embedding固定不变。也就是训练过程中不会更新。
在其中一个实施例中,词向量模型是基于word2vector训练得到,语言模型是基于Bert模型训练得到。
虽然Bert语言模型很强,但是做拼音Bert的成本很高,而且由于不能保证预训练文本的质量,即便做了拼音Bert也只能做下信息增强,不适合用于拼音检错上,所以模型选择在拼音Embedding的训练数据质量上下功夫,选用较为轻小的word2vector语言模型而 放弃Bert,同时也认为对于word2vec,由于预训练过程和下游的检错相关,所以其检错能力也不会相比Bert差很多。五笔的向量跟拼音向量一样,都是通过Word2Vector的方法训练得到的。
具体地,word2vector的训练方法包括:将所有文字转换为五笔编码,设置滑动窗口为5,即每次使用前后各个2字的编码来预测中间的那个字的编码。
上述实施例中,在检错模块中引入高质量文本的五笔Embedding和拼音Embedding进行信息增强,可以显著提高原始Soft-mask检错网络的能力。而通过冷冻拼音Embedding,可以保证该Embedding不引入当前错误文本的拼音信息,从而实现纠错的能力。其次在纠错模块中对Top5做同音筛选可以有效控制文本的输出,而利用拼音Embedding+双向GRU+Dense层预测的正确拼音可以对结果实现动态的筛选,这样也能降低同音筛选过滤掉正确词的概率。
整体来讲本申请中的方案突出了语音特征的重要性,弥补了只是用语言模型对于ASR语音识别纠错场景的不足。
应该理解的是,虽然图2的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。
In one embodiment, as shown in FIG. 4, a text processing model training apparatus is provided, including:
a first training sample set obtaining module 402, configured to obtain a first to-be-trained text sample set;
a word vector training module 404, configured to perform model training based on the first to-be-trained text sample set to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods respectively;
a second training sample set obtaining module 406, configured to obtain a second to-be-trained text sample set and a pre-trained language model;
an encoded data extraction module 408, configured to extract the encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and
a model training module 410, configured to perform model training according to the encoding data to obtain a text processing model.
In one embodiment, the encoded data extraction module 408 is further configured to convert the first to-be-trained text sample set into corresponding pinyin code vectors, traverse the pinyin code vectors in turn according to a pre-configured sliding window, take the traversed pinyin code vector as the current to-be-processed pinyin vector, predict the pinyin code vector at a preset position in the current to-be-processed pinyin vector based on the current word vector model corresponding to the current pinyin model parameters, determine target pinyin model parameters according to the predicted pinyin code vector and the true pinyin code vector, and obtain the pinyin word vector model according to the determined target pinyin model parameters; and to convert the first to-be-trained text sample set into corresponding Wubi code vectors, traverse the Wubi code vectors in turn according to a pre-configured sliding window, take the traversed Wubi code vector as the current to-be-processed Wubi vector, predict the Wubi code vector at a preset position in the current to-be-processed Wubi vector based on the current word vector model corresponding to the current Wubi model parameters, determine target Wubi model parameters according to the predicted Wubi code vector and the true Wubi code vector, and obtain the Wubi word vector model according to the determined target Wubi model parameters.
In one embodiment, the encoded data extraction module 408 is further configured to extract Wubi encoding data from the second to-be-trained text sample set based on the pre-trained Wubi word vector model, extract pinyin encoding data from the second to-be-trained text sample set based on the pre-trained pinyin word vector model, and obtain a pre-trained language model and extract multi-dimensional language encoding data from the second training sample set based on the language model; the model training module 410 is further configured to use the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data as input data, and perform model training according to the input data to obtain the text processing model.
In one embodiment, the model training module 410 is further configured to concatenate the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data to obtain concatenated encoding data; perform prediction on the concatenated encoding data based on the language model to obtain a prediction probability corresponding to each position; determine the initial predicted text at the corresponding position according to the magnitude of the prediction probability; and adjust the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the ground-truth label text to obtain target model parameters, and determine the text processing model according to the target model parameters.
In one embodiment, the model training module 410 is further configured to obtain the predicted text whose prediction probability value is greater than a preset value, and extract the initial predicted text from the predicted text based on the homophone principle and the pinyin principle, the initial predicted text being stored in a blockchain node.
In one embodiment, the model training module 410 is further configured to adjust the initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the ground-truth label text to obtain target Wubi model parameters, and determine the text processing model according to the pinyin model parameters and the target Wubi model parameters.
In one embodiment, as shown in FIG. 6, a text data acquisition apparatus is provided, including:
an acquisition module 602, configured to obtain to-be-processed text data; and
a processing module 604, configured to input the to-be-processed text data into a pre-trained text processing model, so as to process the to-be-processed text data according to the model parameters in the text processing model to obtain target text data; the text processing model is trained using word vector encoding data corresponding to different input methods and language encoding data as input data, the word vector encoding data being obtained based on pre-trained word vector models and the language encoding data being obtained based on a pre-trained language model.
For the specific limitations of the text data acquisition apparatus and the text processing model training apparatus, reference may be made to the above limitations of the text data acquisition method and the text processing model training method, which are not repeated here. Each module in the above text data acquisition apparatus and text processing model training apparatus may be implemented in whole or in part by software, hardware or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each of the above modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory and a network interface connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store text data. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer-readable instructions implement a text data acquisition method and a text processing model training method.
Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the processors, cause the one or more processors to perform the steps of the method in any one of the above embodiments.
One or more computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method in any one of the above embodiments.
The computer-readable storage medium may be non-volatile or volatile.
The blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium, which may be a volatile or non-volatile computer-readable storage medium, and when executed the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, databases or other media used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical memory, etc. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, they should be regarded as within the scope of this specification.
The above embodiments express only several implementations of the present application and are described in relatively specific detail, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims (20)

  1. A text processing model training method, wherein the text processing model training method comprises:
    obtaining a first to-be-trained text sample set;
    performing model training based on the first to-be-trained text sample set to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods respectively;
    obtaining a second to-be-trained text sample set and a pre-trained language model;
    extracting encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and
    performing model training according to the encoding data to obtain a text processing model.
  2. The text processing model training method according to claim 1, wherein performing model training based on the first to-be-trained text sample set to obtain the Wubi word vector model and the pinyin word vector model corresponding to different input methods respectively comprises:
    converting the first to-be-trained text sample set into corresponding pinyin code vectors, traversing the pinyin code vectors in turn according to a pre-configured sliding window, taking the traversed pinyin code vector as a current to-be-processed pinyin vector, predicting a pinyin code vector at a preset position in the current to-be-processed pinyin vector based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin code vector and the true pinyin code vector, and obtaining the pinyin word vector model according to the determined target pinyin model parameters; and
    converting the first to-be-trained text sample set into corresponding Wubi code vectors, traversing the Wubi code vectors in turn according to a pre-configured sliding window, taking the traversed Wubi code vector as a current to-be-processed Wubi vector, predicting a Wubi code vector at a preset position in the current to-be-processed Wubi vector based on a current word vector model corresponding to current Wubi model parameters, determining target Wubi model parameters according to the predicted Wubi code vector and the true Wubi code vector, and obtaining the Wubi word vector model according to the determined target Wubi model parameters.
  3. The text processing model training method according to claim 1, wherein extracting the encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively comprises:
    extracting Wubi encoding data from the second to-be-trained text sample set based on the pre-trained Wubi word vector model;
    extracting pinyin encoding data from the second to-be-trained text sample set based on the pre-trained pinyin word vector model; and
    obtaining a pre-trained language model, and extracting multi-dimensional language encoding data from the second training sample set based on the language model; and
    performing model training according to the encoding data to obtain the text processing model comprises:
    using the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data as input data, and performing model training according to the input data to obtain the text processing model.
  4. The text processing model training method according to claim 3, wherein using the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data as input data and performing model training according to the input data to obtain the text processing model comprises:
    concatenating the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data to obtain concatenated encoding data;
    performing prediction on the concatenated encoding data based on the language model to obtain a prediction probability corresponding to each position;
    determining an initial predicted text at the corresponding position according to the magnitude of the prediction probability; and
    adjusting initial model parameters of an initial text processing model based on the difference between the initial predicted text and a ground-truth label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
  5. The text processing model training method according to claim 4, wherein determining the initial predicted text at the corresponding position according to the magnitude of the prediction probability comprises:
    obtaining predicted text whose prediction probability value is greater than a preset value; and
    extracting the initial predicted text from the predicted text based on a homophone principle and a pinyin principle, the initial predicted text being stored in a blockchain node.
  6. The text processing model training method according to claim 4, wherein the model parameters of the text processing model include pinyin model parameters and Wubi model parameters; and adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the ground-truth label text to obtain the target model parameters and determining the text processing model according to the target model parameters comprises:
    adjusting initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the ground-truth label text to obtain target Wubi model parameters; and
    determining the text processing model according to the pinyin model parameters and the target Wubi model parameters.
  7. A text data acquisition method, wherein the text data acquisition method comprises:
    obtaining to-be-processed text data; and
    inputting the to-be-processed text data into a pre-trained text processing model, so as to process the to-be-processed text data according to model parameters in the text processing model to obtain target text data; the text processing model being trained using word vector encoding data corresponding to different input methods and language encoding data as input data, the word vector encoding data being obtained based on pre-trained word vector models, and the language encoding data being obtained based on a pre-trained language model.
  8. A text processing model training apparatus, wherein the text processing model training apparatus comprises:
    a first training sample set obtaining module, configured to obtain a first to-be-trained text sample set;
    a word vector training module, configured to perform model training based on the first to-be-trained text sample set to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods respectively;
    a second training sample set obtaining module, configured to obtain a second to-be-trained text sample set and a pre-trained language model;
    an encoded data extraction module, configured to extract encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and
    a model training module, configured to perform model training according to the encoding data to obtain a text processing model.
  9. A text data acquisition apparatus, wherein the text data acquisition apparatus comprises:
    an acquisition module, configured to obtain to-be-processed text data; and
    a processing module, configured to input the to-be-processed text data into a pre-trained text processing model, so as to process the to-be-processed text data according to model parameters in the text processing model to obtain target text data; the text processing model being trained using word vector encoding data corresponding to different input methods and language encoding data as input data, the word vector encoding data being obtained based on pre-trained word vector models, and the language encoding data being obtained based on a pre-trained language model.
  10. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining a first to-be-trained text sample set;
    performing model training based on the first to-be-trained text sample set to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods respectively;
    obtaining a second to-be-trained text sample set and a pre-trained language model;
    extracting encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and
    performing model training according to the encoding data to obtain a text processing model.
  11. The computer device according to claim 10, wherein performing model training based on the first to-be-trained text sample set to obtain the Wubi word vector model and the pinyin word vector model corresponding to different input methods respectively, as implemented when the processor executes the computer-readable instructions, comprises:
    converting the first to-be-trained text sample set into corresponding pinyin code vectors, traversing the pinyin code vectors in turn according to a pre-configured sliding window, taking the traversed pinyin code vector as a current to-be-processed pinyin vector, predicting a pinyin code vector at a preset position in the current to-be-processed pinyin vector based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin code vector and the true pinyin code vector, and obtaining the pinyin word vector model according to the determined target pinyin model parameters; and
    converting the first to-be-trained text sample set into corresponding Wubi code vectors, traversing the Wubi code vectors in turn according to a pre-configured sliding window, taking the traversed Wubi code vector as a current to-be-processed Wubi vector, predicting a Wubi code vector at a preset position in the current to-be-processed Wubi vector based on a current word vector model corresponding to current Wubi model parameters, determining target Wubi model parameters according to the predicted Wubi code vector and the true Wubi code vector, and obtaining the Wubi word vector model according to the determined target Wubi model parameters.
  12. The computer device according to claim 10, wherein extracting the encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively, as implemented when the processor executes the computer-readable instructions, comprises:
    extracting Wubi encoding data from the second to-be-trained text sample set based on the pre-trained Wubi word vector model;
    extracting pinyin encoding data from the second to-be-trained text sample set based on the pre-trained pinyin word vector model; and
    obtaining a pre-trained language model, and extracting multi-dimensional language encoding data from the second training sample set based on the language model; and
    performing model training according to the encoding data to obtain the text processing model comprises:
    using the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data as input data, and performing model training according to the input data to obtain the text processing model.
  13. The computer device according to claim 12, wherein using the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data as input data and performing model training according to the input data to obtain the text processing model, as implemented when the processor executes the computer-readable instructions, comprises:
    concatenating the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data to obtain concatenated encoding data;
    performing prediction on the concatenated encoding data based on the language model to obtain a prediction probability corresponding to each position;
    determining an initial predicted text at the corresponding position according to the magnitude of the prediction probability; and
    adjusting initial model parameters of an initial text processing model based on the difference between the initial predicted text and a ground-truth label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
  14. The computer device according to claim 13, wherein determining the initial predicted text at the corresponding position according to the magnitude of the prediction probability, as implemented when the processor executes the computer-readable instructions, comprises:
    obtaining predicted text whose prediction probability value is greater than a preset value; and
    extracting the initial predicted text from the predicted text based on a homophone principle and a pinyin principle, the initial predicted text being stored in a blockchain node.
  15. The computer device according to claim 13, wherein the model parameters of the text processing model involved when the processor executes the computer-readable instructions include pinyin model parameters and Wubi model parameters; and adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the ground-truth label text to obtain the target model parameters and determining the text processing model according to the target model parameters, as implemented when the processor executes the computer-readable instructions, comprises:
    adjusting initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the ground-truth label text to obtain target Wubi model parameters; and
    determining the text processing model according to the pinyin model parameters and the target Wubi model parameters.
  16. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining to-be-processed text data; and
    inputting the to-be-processed text data into a pre-trained text processing model, so as to process the to-be-processed text data according to model parameters in the text processing model to obtain target text data; the text processing model being trained using word vector encoding data corresponding to different input methods and language encoding data as input data, the word vector encoding data being obtained based on pre-trained word vector models, and the language encoding data being obtained based on a pre-trained language model.
  17. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a first to-be-trained text sample set;
    performing model training based on the first to-be-trained text sample set to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods respectively;
    obtaining a second to-be-trained text sample set and a pre-trained language model;
    extracting encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and
    performing model training according to the encoding data to obtain a text processing model.
  18. The storage media according to claim 17, wherein performing model training based on the first to-be-trained text sample set to obtain the Wubi word vector model and the pinyin word vector model corresponding to different input methods respectively, as implemented when the computer-readable instructions are executed by the processor, comprises:
    converting the first to-be-trained text sample set into corresponding pinyin code vectors, traversing the pinyin code vectors in turn according to a pre-configured sliding window, taking the traversed pinyin code vector as a current to-be-processed pinyin vector, predicting a pinyin code vector at a preset position in the current to-be-processed pinyin vector based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin code vector and the true pinyin code vector, and obtaining the pinyin word vector model according to the determined target pinyin model parameters; and
    converting the first to-be-trained text sample set into corresponding Wubi code vectors, traversing the Wubi code vectors in turn according to a pre-configured sliding window, taking the traversed Wubi code vector as a current to-be-processed Wubi vector, predicting a Wubi code vector at a preset position in the current to-be-processed Wubi vector based on a current word vector model corresponding to current Wubi model parameters, determining target Wubi model parameters according to the predicted Wubi code vector and the true Wubi code vector, and obtaining the Wubi word vector model according to the determined target Wubi model parameters.
  19. The storage media according to claim 17, wherein extracting the encoding data corresponding to the second to-be-trained text sample set based on the language model, the Wubi word vector model and the pinyin word vector model respectively, as implemented when the computer-readable instructions are executed by the processor, comprises:
    extracting Wubi encoding data from the second to-be-trained text sample set based on the pre-trained Wubi word vector model;
    extracting pinyin encoding data from the second to-be-trained text sample set based on the pre-trained pinyin word vector model; and
    obtaining a pre-trained language model, and extracting multi-dimensional language encoding data from the second training sample set based on the language model; and
    performing model training according to the encoding data to obtain the text processing model comprises:
    using the Wubi encoding data, the pinyin encoding data and the multi-dimensional language encoding data as input data, and performing model training according to the input data to obtain the text processing model.
  20. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining to-be-processed text data; and
    inputting the to-be-processed text data into a pre-trained text processing model, so as to process the to-be-processed text data according to model parameters in the text processing model to obtain target text data; the text processing model being trained using word vector encoding data corresponding to different input methods and language encoding data as input data, the word vector encoding data being obtained based on pre-trained word vector models, and the language encoding data being obtained based on a pre-trained language model.
PCT/CN2021/096582 2020-12-11 2021-05-28 文本处理模型训练方法、装置、计算机设备和存储介质 WO2022121251A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011447964.2 2020-12-11
CN202011447964.2A CN112528637B (zh) 2020-12-11 2020-12-11 文本处理模型训练方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2022121251A1 true WO2022121251A1 (zh) 2022-06-16

Family

ID=74998573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096582 WO2022121251A1 (zh) 2020-12-11 2021-05-28 文本处理模型训练方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN112528637B (zh)
WO (1) WO2022121251A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528637B (zh) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 文本处理模型训练方法、装置、计算机设备和存储介质
CN113434699B (zh) * 2021-06-30 2023-07-18 平安科技(深圳)有限公司 用于文本匹配的bert模型的预训练方法、计算机装置和存储介质
CN113609157B (zh) * 2021-08-09 2023-06-30 平安科技(深圳)有限公司 语言转换模型训练、语言转换方法、装置、设备及介质
CN113988055A (zh) * 2021-10-18 2022-01-28 浙江香侬慧语科技有限责任公司 一种预训练模型的中文训练方法、装置及存储介质
CN114139524B (zh) * 2021-11-29 2022-09-13 浙江大学 故事文本的预测方法、装置以及电子设备

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
US20180349327A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing)Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN110750959A (zh) * 2019-10-28 2020-02-04 腾讯科技(深圳)有限公司 文本信息处理的方法、模型训练的方法以及相关装置
CN111310443A (zh) * 2020-02-12 2020-06-19 新华智云科技有限公司 一种文本纠错方法和***
CN111476036A (zh) * 2020-04-10 2020-07-31 电子科技大学 一种基于中文单词特征子串的词嵌入学习方法
CN111523306A (zh) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 文本的纠错方法、装置和***
CN111597815A (zh) * 2020-05-22 2020-08-28 北京慧闻科技(集团)有限公司 一种多嵌入命名实体识别方法、装置、设备及存储介质
CN112528637A (zh) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 文本处理模型训练方法、装置、计算机设备和存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107170453B (zh) * 2017-05-18 2020-11-03 百度在线网络技术(北京)有限公司 基于人工智能的跨语种语音转录方法、设备及可读介质
CN110472251B (zh) * 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 翻译模型训练的方法、语句翻译的方法、设备及存储介质
CN110110041B (zh) * 2019-03-15 2022-02-15 平安科技(深圳)有限公司 错词纠正方法、装置、计算机装置及存储介质
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 语音识别方法、模型的训练方法、装置、设备及存储介质
CN110795935A (zh) * 2020-01-06 2020-02-14 广东博智林机器人有限公司 文字词向量模型的训练方法、装置、终端及存储介质
CN111488466B (zh) * 2020-04-16 2023-06-06 清华大学 中文带标记错误语料生成方法、计算装置和存储介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116667326A (zh) * 2023-05-30 2023-08-29 淮阴工学院 一种电动汽车充电负荷预测方法
CN116667326B (zh) * 2023-05-30 2024-02-23 淮阴工学院 一种电动汽车充电负荷预测方法
CN117609781A (zh) * 2023-11-20 2024-02-27 北京中关村科金技术有限公司 文本评估模型的训练方法、文本评估方法及装置
CN117609781B (zh) * 2023-11-20 2024-05-28 北京中关村科金技术有限公司 文本评估模型的训练方法、文本评估方法及装置
CN117831573A (zh) * 2024-03-06 2024-04-05 青岛理工大学 基于多模态的语言障碍人群言语录音分析方法及***
CN117831573B (zh) * 2024-03-06 2024-05-14 青岛理工大学 基于多模态的语言障碍人群言语录音分析方法及***

Also Published As

Publication number Publication date
CN112528637A (zh) 2021-03-19
CN112528637B (zh) 2024-03-29

Similar Documents

Publication Publication Date Title
WO2022121251A1 (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
JP5901001B1 (ja) 音響言語モデルトレーニングのための方法およびデバイス
CN114580382A (zh) 文本纠错方法以及装置
CN111310441A (zh) 基于bert的语音识别后文本修正方法、装置、终端及介质
US11636272B2 (en) Hybrid natural language understanding
CN113053367B (zh) 语音识别方法、语音识别的模型训练方法以及装置
CN114218932B (zh) 基于故障因果图谱的航空故障文本摘要生成方法及其装置
WO2021143206A1 (zh) 单语句自然语言处理方法、装置、计算机设备及可读存储介质
WO2017052817A1 (en) Dynamic adaptation of language models and semantic tracking for automatic speech recognition
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
CN116956835B (zh) 一种基于预训练语言模型的文书生成方法
US20230237993A1 (en) Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models
CN116050425A (zh) 建立预训练语言模型的方法、文本预测方法及装置
EP4214643A1 (en) Dynamic language models for continuously evolving content
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN113160820A (zh) 语音识别的方法、语音识别模型的训练方法、装置及设备
US11687723B2 (en) Natural language processing with missing tokens in a corpus
TWI818427B (zh) 使用基於文本的說話者變更檢測的說話者劃分糾正方法及系統
CN115858776A (zh) 一种变体文本分类识别方法、***、存储介质和电子设备
CN115525749A (zh) 语音问答方法、装置、电子设备和存储介质
CN111090720B (zh) 一种热词的添加方法和装置
US20230116268A1 (en) System and a method for phonetic-based transliteration
US20230252225A1 (en) Automatic Text Summarisation Post-processing for Removal of Erroneous Sentences
CN118261248A (zh) 文本检测方法、训练方法、装置、设备、介质和程序产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901964

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901964

Country of ref document: EP

Kind code of ref document: A1