CN117973402B - Text conversion preprocessing method and device, storage medium and electronic equipment - Google Patents

Text conversion preprocessing method and device, storage medium and electronic equipment

Info

Publication number
CN117973402B
CN117973402B
Authority
CN
China
Prior art keywords
sample
text
texts
target
source
Legal status
Active
Application number
CN202410387525.9A
Other languages
Chinese (zh)
Other versions
CN117973402A (en)
Inventor
王思嘉
吴建伟
郑仲富
卿佳
梁有宁
刘海龙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410387525.9A
Publication of CN117973402A
Application granted
Publication of CN117973402B
Status: Active
Anticipated expiration


Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text conversion preprocessing method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring an initial sample text set from application data of a target application; determining, from the initial sample text set, a first group of sample texts that do not meet a symbol configuration condition; determining, from the initial sample text set, a second group of sample texts whose text similarity is greater than a preset threshold; removing the first group of sample texts and a first subgroup of sample texts in the second group of sample texts from the initial sample text set to obtain a third group of sample texts; determining the third group of sample texts as a positive sample set, and determining the first group of sample texts and the first subgroup of sample texts as a negative sample set; and training with the positive sample set and the negative sample set to obtain a text conversion model. The application thereby addresses the technical problem in the related art of low accuracy in the preprocessing stage of text conversion.

Description

Text conversion preprocessing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computers, and in particular, to a text conversion preprocessing method and apparatus, a storage medium, and an electronic device.
Background
When converting the text content of application data of a target application, a trained machine learning model is typically used to complete the conversion quickly. Taking translation as an example, a machine translation model can directly convert data in a source language into data in other languages; for instance, when a target application is published in different countries, its configuration data or interaction data is translated into the official language of each corresponding country.
Still taking translation as an example, to improve the performance of the machine translation model, historical translation data generated while the application runs is generally extracted from the translation configuration table, the machine translation model is trained with this historical translation data, and its accuracy is improved by continuously adjusting its structural parameters during training.
However, owing to the limited performance of the machine translation model, the historical translation data may contain translations whose meaning is inconsistent with the original text, abnormal symbols, and similar defects; that is, a large amount of dirty data is mixed into the training sample data fed to the machine translation model. When the training sample data input to the model is inaccurate, the accuracy of the translation results produced by iterative training degrades. In other words, the text conversion preprocessing methods provided by the related art suffer from low processing accuracy.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present application provide a text conversion preprocessing method and apparatus, a storage medium, and an electronic device, to at least solve the technical problem of low accuracy in the preprocessing stage of text conversion.
According to an aspect of the embodiments of the present application, there is provided a text conversion preprocessing method, including: acquiring an initial sample text set from application data of a target application, where each pair of sample texts in the initial sample text set includes a source sample text in a source language and a target sample text in a target language, the target sample text being obtained by content conversion based on the text content of the source sample text; determining, from the initial sample text set, a first group of sample texts that do not meet a symbol configuration condition, where the symbol configuration condition indicates format requirements on the symbols contained in the text content of a sample text; determining, from the initial sample text set, a second group of sample texts whose text similarity is greater than a preset threshold, where the second group of sample texts includes a first subgroup of sample texts and a second subgroup of sample texts, the text similarity between the i-th sample text in the first subgroup and the j-th sample text in the second subgroup is greater than the preset threshold, the text similarity indicates the degree of content similarity between the i-th and j-th sample texts, and i and j are positive integers greater than or equal to 1; removing, from the initial sample text set, the first group of sample texts and the first subgroup of sample texts in the second group of sample texts to obtain a third group of sample texts; determining the third group of sample texts as a positive sample set, and determining the first group of sample texts and the first subgroup of sample texts as a negative sample set; and training with the positive sample set and the negative sample set to obtain a text conversion model for converting source sample texts in the source language into target sample texts in the target language.
Optionally, determining, from the initial sample text set, the second group of sample texts whose text similarity is greater than the preset threshold includes: dividing the source sample texts in the initial sample text set into F pairs of source sample texts, where F is a positive integer greater than or equal to 2; determining the text similarity between the two source sample texts in each pair to obtain F text similarities, one for each of the F pairs of source sample texts; determining, from the F text similarities, M pairs of source sample texts whose text similarity is greater than the preset threshold, where M is a positive integer greater than or equal to 1 and less than or equal to F; determining the first subgroup of sample texts according to the M pairs of source sample texts; and determining the remaining sample texts in the M pairs of source sample texts, excluding the first subgroup of sample texts, as the second subgroup of sample texts.
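To make this pairing-and-selection step concrete, the following is a minimal Python sketch, assuming exhaustive pairwise traversal and a text_similarity function such as the N-gram cosine similarity defined below; all names are illustrative, not the patent's implementation.

```python
from itertools import combinations
from typing import Callable

def select_duplicate_pairs(
    source_texts: list[str],
    threshold: float,
    text_similarity: Callable[[str, str], float],
) -> list[tuple[int, int]]:
    """Divide the source sample texts into pairs by traversal, compute the
    text similarity of each pair, and keep the M pairs above the threshold."""
    duplicate_pairs = []
    for i, j in combinations(range(len(source_texts)), 2):  # traversal pairing
        similarity = text_similarity(source_texts[i], source_texts[j])
        if similarity > threshold:
            duplicate_pairs.append((i, j))  # one of the M duplicate pairs
    return duplicate_pairs
```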
Optionally, determining the text similarity between the two source sample texts in each pair of source sample texts to obtain F text similarities includes determining a kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts by: determining a first word sequence from the first source sample text, where the s-th word in the first word sequence begins with the last character of its preceding adjacent word, the first word sequence includes Q words each of N characters, N is a positive integer greater than or equal to a preset value, and s is a positive integer greater than or equal to 2 and less than or equal to Q; determining a second word sequence from the second source sample text, where the t-th word in the second word sequence begins with the last character of its preceding adjacent word, the second word sequence includes R words each of N characters, Q and R are positive integers greater than or equal to 2, and t is a positive integer greater than or equal to 2 and less than or equal to R; and determining the kth text similarity from the first word sequence and the second word sequence, where the F text similarities include the kth text similarity.
Optionally, the determining the kth text similarity according to the first word sequence and the second word sequence includes: determining a target word sequence according to the first word sequence and the second word sequence, wherein the target word sequence is a word sequence obtained by performing de-duplication on words in the first word sequence and the second word sequence and then splicing, the target word sequence comprises W words, W is a positive integer greater than or equal to 2 and less than or equal to the sum of the number of target words, and the sum of the number of target words is the sum of the number of words in the first word sequence and the number of words in the second word sequence; determining a first word frequency vector according to the first word sequence and the target word sequence, wherein elements in the first word frequency vector are used for indicating whether words in the first word sequence appear in the target word sequence; determining a second word frequency vector according to the second word sequence and the target word sequence, wherein elements in the second word frequency vector are used for indicating whether words in the second word sequence appear in the target word sequence; and determining the kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts according to the first word frequency vector and the second word frequency vector.
Optionally, determining the first word frequency vector according to the first word sequence and the target word sequence includes: in the case that the W words include a first part of the words in the first word sequence, setting 1 at the positions in the target word sequence corresponding to that first part of words, to obtain a first word frequency vector of dimension 1×W.
Optionally, determining the second word frequency vector according to the second word sequence and the target word sequence includes: in the case that the W words include a second part of the words in the second word sequence, setting 1 at the positions in the target word sequence corresponding to that second part of words, to obtain a second word frequency vector of dimension 1×W.
Optionally, determining the kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts according to the first word frequency vector and the second word frequency vector includes: determining the cosine similarity between the first word frequency vector and the second word frequency vector, and determining the cosine similarity as the kth text similarity.
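Taken together, the optional steps above amount to the following procedure. This is a minimal sketch assuming character-level grams; the function names are illustrative.

```python
import math

def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Cut the text into overlapping n-character words; each word after the
    first begins with the trailing characters of its predecessor."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def spliced_target_sequence(seq1: list[str], seq2: list[str]) -> list[str]:
    """De-duplicate the words of both sequences, then splice them in order."""
    seen: set[str] = set()
    merged: list[str] = []
    for gram in seq1 + seq2:
        if gram not in seen:
            seen.add(gram)
            merged.append(gram)
    return merged

def word_frequency_vector(seq: list[str], target: list[str]) -> list[int]:
    """1 x W vector: 1 where the target-sequence word occurs in seq, else 0."""
    present = set(seq)
    return [1 if gram in present else 0 for gram in target]

def cosine_similarity(a: list[int], b: list[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kth_text_similarity(text_a: str, text_b: str, n: int = 2) -> float:
    seq_a, seq_b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    target = spliced_target_sequence(seq_a, seq_b)
    return cosine_similarity(word_frequency_vector(seq_a, target),
                             word_frequency_vector(seq_b, target))
```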
Optionally, determining, from the initial sample text set, the first group of sample texts that do not satisfy the symbol configuration condition includes: searching the initial sample text set for sample texts containing abnormal characters to obtain first-type sample texts, where each pair of texts in the first-type sample texts includes a source sample text in the source language and a target sample text in the target language; searching the initial sample text set for second-type sample texts whose language is other than the source language and the target language; and searching the initial sample text set for third-type sample texts whose text content is a null value, where the first group of sample texts includes the first-type, second-type, and third-type sample texts. The symbol configuration condition includes at least one of the following: no abnormal symbol appears in the sample text, the languages of the sample text are the source language and the target language, and the text content of the sample text contains no null value.
Optionally, before the training with the positive sample set and the negative sample set to obtain a text conversion model for converting source sample texts in the source language into target sample texts in the target language, the method further includes: obtaining a target preprocessing model by inputting the positive sample set and the negative sample set into a preprocessing model, where the target preprocessing model is a model for identifying the categories of the sample texts in the positive sample set and the negative sample set.
Optionally, obtaining the target preprocessing model by inputting the positive sample set and the negative sample set into the preprocessing model includes: converting positive samples in the positive sample set into a first multidimensional vector, where the first multidimensional vector includes first category information and semantic information of the positive samples, the first category information indicating that the category of the sample text is positive; converting negative samples in the negative sample set into a second multidimensional vector, where the second multidimensional vector includes second category information and semantic information of the negative samples, the second category information indicating that the category of the sample text is negative; inputting the first multidimensional vector and the second multidimensional vector into the preprocessing model to obtain a group of prediction probabilities, including the probabilities that the categories of the positive samples belong to the first and second category information and the probabilities that the categories of the negative samples belong to the first and second category information; determining a target loss function according to the group of prediction probabilities, the first category information of the positive samples, and the second category information of the negative samples; and stopping training when the value of the target loss function satisfies a convergence condition, to obtain the target preprocessing model.
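A minimal sketch of such a training loop follows, assuming each sample text has already been converted into a multidimensional vector; a plain linear classification layer in PyTorch stands in for the preprocessing model (the patent's BERT discrimination model and its encoder are omitted), and all names are illustrative.

```python
import torch
import torch.nn as nn

class PreprocessingModel(nn.Module):
    """Stand-in preprocessing model: a linear classification layer over
    precomputed sample-text vectors."""
    def __init__(self, dim: int):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)  # two categories: positive / negative

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(x)  # logits over the two category labels

def train_preprocessing_model(vectors: torch.Tensor, labels: torch.Tensor,
                              max_steps: int = 1000, tol: float = 1e-4):
    model = PreprocessingModel(vectors.shape[1])
    loss_fn = nn.CrossEntropyLoss()  # target loss over the prediction probabilities
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    previous = float("inf")
    for _ in range(max_steps):
        optimizer.zero_grad()
        loss = loss_fn(model(vectors), labels)
        loss.backward()
        optimizer.step()
        if abs(previous - loss.item()) < tol:  # convergence condition met
            break                              # the target preprocessing model is obtained
        previous = loss.item()
    return model
```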
Optionally, the acquiring the initial sample text set from the application data of the target application includes: and determining a historical text set generated by the target application in a historical operation period, a preconfigured original text set and an associated text set as an initial sample text set, wherein the original text set comprises texts predefined in a configuration table of the target application, and the associated text set is a text set generated by other applications with the same functions and categories as the target application in the operation process.
According to still another aspect of the embodiments of the present application, there is also provided a text conversion preprocessing apparatus, including: a first acquisition unit, configured to acquire an initial sample text set from application data of a target application, where each pair of sample texts in the initial sample text set includes a source sample text in a source language and a target sample text in a target language, the target sample text being obtained by content conversion based on the text content of the source sample text; a first processing unit, configured to determine, from the initial sample text set, a first group of sample texts that do not satisfy a symbol configuration condition, where the symbol configuration condition indicates format requirements on the symbols contained in the text content of a sample text; a second processing unit, configured to determine, from the initial sample text set, a second group of sample texts whose text similarity is greater than a preset threshold, where the second group of sample texts includes a first subgroup of sample texts and a second subgroup of sample texts, the text similarity between the i-th sample text in the first subgroup and the j-th sample text in the second subgroup is greater than the preset threshold, the text similarity indicates the degree of content similarity between the i-th and j-th sample texts, and i and j are positive integers greater than or equal to 1; a third processing unit, configured to remove, from the initial sample text set, the first group of sample texts and the first subgroup of sample texts in the second group of sample texts to obtain a third group of sample texts; a fourth processing unit, configured to determine the third group of sample texts as a positive sample set and determine the first group of sample texts and the first subgroup of sample texts as a negative sample set; and a training unit, configured to train with the positive sample set and the negative sample set to obtain a text conversion model for converting source sample texts in the source language into target sample texts in the target language.
According to still another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described text conversion preprocessing method when executed by an electronic device.
According to a further aspect of embodiments of the present application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the preprocessing method of text conversion by the computer program.
In this manner, sample texts containing abnormal symbols are removed from the initial sample text set by judging whether the sample texts in the initial sample text set meet the symbol configuration condition; repeated sample texts are removed by determining a second group of sample texts whose text similarity is greater than a preset threshold; and the text conversion model is trained with the sample set from which abnormal sample texts have been removed, improving the accuracy of the text conversion model's results. In other words, preset judgment conditions are used to clean abnormal sample texts out of the initial sample text set, which ensures the accuracy of the sample texts input into the text conversion model and achieves the technical effect of improving the accuracy of text conversion preprocessing results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application.
Fig. 1 is a schematic diagram of an application scenario of an alternative text conversion preprocessing method according to an embodiment of the present application.
Fig. 2 is a flow chart of an alternative text conversion preprocessing method according to an embodiment of the present application.
FIG. 3 is a comparative schematic of the translation quality of machine translation models obtained before and after training using cleaned sample text.
Fig. 4 is an overall schematic diagram of an alternative text conversion preprocessing method in accordance with an embodiment of the present application.
Fig. 5 is an example of an alternative similar sample text in accordance with an embodiment of the present application.
Fig. 6 shows embodiment 1 of determining the second set of sample texts based on text similarity.
Fig. 7 shows embodiment 2 of determining the second set of sample texts based on text similarity.
Fig. 8 is a schematic diagram of calculating text similarity between two source sample texts.
FIG. 9 is a schematic diagram of an alternative deduplication process for similar sample text, in accordance with an embodiment of the present application.
Fig. 10 is an example of an alternative sample text containing anomaly symbols in accordance with an embodiment of the present application.
FIG. 11 is an example of another alternative sample text containing anomaly symbols in accordance with an embodiment of the present application.
FIG. 12 is a flow chart of determining a first set of sample text that does not satisfy a symbol configuration condition.
Fig. 13 is a flowchart of training the BERT discrimination model using the preprocessed positive and negative sample sets.
Fig. 14 is a schematic diagram of format conversion of a positive sample set and a negative sample set.
Fig. 15 shows implementation code for feeding the format-converted data into the linear classification layer of the BERT discrimination model.
FIG. 16 is a flow chart for training the BERT discriminant model using cleaned sample text.
Fig. 17 is a schematic structural view of an alternative text conversion preprocessing apparatus according to an embodiment of the present application.
Fig. 18 is a schematic structural view of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions in the embodiments of the present application comply with applicable laws and regulations during implementation; the data used in performing the described operations does not involve user privacy, and data security is ensured while the operation process remains compliant.
In addition, when the above embodiments of the present application are applied to a specific product or technology, user approval or consent needs to be obtained, and the collection, use and processing of relevant data needs to comply with relevant regulations and standards of the relevant country or region.
Definitions of terms:
Corpus: translation content predefined within a project for game proper nouns, such as ranks and credits.
According to one aspect of the embodiments of the present application, a text conversion preprocessing method is provided. As an alternative embodiment, the method may be applied to, but is not limited to, the application scenario shown in fig. 1. In this scenario, the target terminal 102 may communicate, but is not limited to communicating, with the server 106 via the network 104, and the server 106 may perform, but is not limited to performing, operations on the database 108, such as writing or reading data. The target terminal 102 may include, but is not limited to, a man-machine interaction screen, a processor, and a memory. The man-machine interaction screen may be used, but is not limited to being used, for displaying a page of the target application, source sample text, and the like. The processor may be configured, but is not limited to being configured, to perform a corresponding operation in response to man-machine interaction, or to generate a corresponding instruction and send it to the server 106. The memory is used for storing related processing data such as the first group of sample texts, the initial sample text set, and the third group of sample texts.
Alternatively, in this embodiment, the target terminal may be a terminal configured with a target client, and may include, but is not limited to, at least one of the following: a mobile phone (such as an Android phone or an iOS phone), a notebook computer, a tablet computer, a palm computer, an MID (Mobile Internet Device), a PAD, a desktop computer, a smart television, etc. The target client may be a video client, an instant messaging client, a browser client, an education client, or the like. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, WiFi, and other networks enabling wireless communication. The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server.
The technical solution in the embodiments of the present application may be applied, but is not limited, to preprocessing (e.g., data cleaning) of the input data of a text machine translation model, cleaning the input data of a conversion model that converts ordinary-language text into a code language, cleaning the input data of a machine model that converts plain language into terms of art in a specific vertical field, and the like.
In order to solve the problem of low processing accuracy in the text conversion preprocessing process, a text conversion preprocessing method is provided in the embodiment of the present application, and fig. 2 is a flowchart of a text conversion preprocessing method according to an embodiment of the present application, where the flowchart includes the following steps S202 to S212.
It should be noted that, the preprocessing method of the text conversion shown in step S202 to step S212 may be performed by, but not limited to, an electronic device, which may be, but not limited to, a target terminal or a server as shown in fig. 1.
Step S202, an initial sample text set is obtained from application data of a target application, wherein each pair of sample texts in the initial sample text set comprises a source sample text of a source language and a target sample text of a target language, and the target sample text is obtained after content conversion based on text content of the source sample text.
Step S204, determining a first group of sample texts which do not meet symbol configuration conditions from the initial sample text set, wherein the symbol configuration conditions are used for indicating format requirements of symbols contained in text contents of the sample texts.
Step S206, determining a second group of sample texts with the text similarity larger than a preset threshold value from the initial sample text set, wherein the second group of sample texts comprises a first subgroup of sample texts and a second subgroup of sample texts, the text similarity between an ith sample text in the first subgroup of sample texts and a jth sample text in the second subgroup of sample texts is larger than the preset threshold value, and the text similarity is used for indicating the content similarity degree between the ith sample text and the jth sample text, and i and j are positive integers larger than or equal to 1.
Step S208, the first group of sample texts and the first subgroup of sample texts in the second group of sample texts are removed from the initial sample text set to obtain a third group of sample texts.
In step S210, the third set of sample texts is determined as a positive sample set, and the first set of sample texts and the first subset of sample texts are determined as a negative sample set.
Step S212, training is performed by using the positive sample set and the negative sample set, and a text conversion model for converting the source sample text of the source language into the target sample text of the target language is obtained.
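The flow of steps S202 to S212 can be summarized in code as follows. This is a minimal sketch; the helper functions acquire_initial_sample_texts, satisfies_symbol_condition, find_similar_pairs, and train_text_conversion_model are hypothetical stand-ins, not the patent's implementation.

```python
def preprocess_for_text_conversion(application_data, threshold: float = 0.9):
    """Sketch of steps S202 to S212; all helper names are hypothetical."""
    # S202: each item is a (source sample text, target sample text) tuple.
    initial_set = acquire_initial_sample_texts(application_data)
    # S204: first group -- pairs that do not meet the symbol configuration condition.
    first_group = [p for p in initial_set if not satisfies_symbol_condition(p)]
    # S206: second group -- similar pairs; first_subgroup holds the redundant
    # member of each pair whose text similarity exceeds the preset threshold.
    first_subgroup, second_subgroup = find_similar_pairs(initial_set, threshold)
    # S208: third group -- the initial set minus the first group and the
    # first subgroup (pairs assumed hashable, e.g. tuples).
    removed = set(first_group) | set(first_subgroup)
    third_group = [p for p in initial_set if p not in removed]
    # S210: positive and negative sample sets.
    positive_set, negative_set = third_group, first_group + first_subgroup
    # S212: train the text conversion model with both sets.
    return train_text_conversion_model(positive_set, negative_set)
```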
In order to facilitate understanding of the above text conversion preprocessing method, in the embodiment of the present application, data cleaning of a machine translation model in a game application is taken as an example, and the above text conversion preprocessing method is explained.
The machine translation model in the game application may be, but is not limited to, a model built into the game application for converting text data in a source language into text data in a target language, for example translating text in a configuration table of the game application from Chinese to English, as shown in fig. 3.
As shown in (a) of fig. 3, owing to the limitations of the machine translation model itself, the translated English text cannot completely and accurately convey the semantic information of the Chinese text. It is therefore necessary to ensure the accuracy of the translation results output by the machine translation model.
In general, the performance of the machine translation model is improved mainly by optimizing the structural parameters of the model during training; as the translation result in (b) of fig. 3 shows, the translation quality of the adjusted machine translation model is markedly improved.
In the related art, historical translation data generated in the running process of game application is generally extracted from a translation configuration table, a machine translation model is trained by utilizing the historical translation data, and the accuracy of the machine translation model is improved by continuously adjusting structural parameters of the machine translation model in the training process.
However, because the historical translation data may contain translations whose meaning is inconsistent with the original text, abnormal symbols, and similar defects, the training data fed to the machine translation model may be mixed with dirty data (e.g., abnormal symbols and repeated data) to varying degrees. If this dirty data is not cleaned out of the historical translation data, overlong training times and poor training results inevitably follow.
To solve the above-mentioned problems, the embodiment of the present application proposes a text conversion preprocessing method whose processing procedure is shown in fig. 4.
As shown in fig. 4, the system performing the above text conversion preprocessing method includes a data source extraction module and a data source cleaning module. The data source extraction module extracts historical translation data (data generated by running the target application over a historical period) from the in-game configuration translation table and extracts corresponding data information from the in-game corpus.
The historical translation data extracted from the in-game configuration translation table is typically long text, for example text describing skills; the data extracted from the in-game corpus is typically short text, for example phrases predefined by the project developer, such as numbers, ranks, and points.
To further enrich the data, context data crawling logic is also added to the data source extraction module to retrieve data from other game applications related or similar to the target application.
The data source cleaning module internally comprises the following 3 algorithms, each of which is briefly introduced below.
(1) The first algorithm mainly handles various obviously abnormal data, such as null values, special-symbol translations, non-target-language translations, and text translations used for internal project testing.
(2) The second algorithm performs cosine similarity after N-gram word frequency statistics and filters out similar sample texts, for example similar sentences.
(3) The third algorithm feeds the cleaned data, together with difference data classified by the Bleurt algorithm, into a classification model for training; the resulting multi-class training model is then repeatedly verified and optimized on a verification set, finally forming a multi-class model with normal scoring logic. The 3 algorithms are described separately below in connection with specific embodiments.
After the initial sample text set is obtained through the data source extraction module, the first algorithm is used to search it for a first group of sample texts that do not meet the symbol configuration condition, for example sample texts containing special characters, sample texts violating naming rules, and sample texts in non-standard languages.
Then the second algorithm is used to search the initial sample text set for a second group of sample texts whose similarity is greater than the preset threshold; the sample texts found in this second group are determined to be similar texts, and the similar texts are de-duplicated.
The de-duplicated data is then classified: for example, the sample texts found by the first algorithm and the second algorithm are determined as the negative sample set, the remaining sample texts in the cleaned initial sample text set are taken as the positive sample set, and a text conversion model (which may be understood as a machine translation model) is trained with the positive sample set and the negative sample set to obtain the target text conversion model.
The machine translation model is only an example and not a limitation; the text conversion model may also be, for example, a text format conversion model that converts ordinary-language text into a code language, or a language conversion model that converts plain language into terms of art in a specific vertical field.
The implementation process of the above-described text conversion preprocessing method will be described below taking a text conversion model as an example of a text format conversion model.
An initial sample text set is acquired, where each pair of sample texts in the initial sample text set includes a source sample text in a source format and a target sample text in a target format, the target sample text being obtained by converting the text content of the source sample text.
A first set of sample texts that do not satisfy character format requirements is determined from the initial sample text set, the requirements indicating the arrangement rules and character types of the characters contained in the sample texts.
A second group of sample texts whose text similarity is greater than a preset threshold is determined from the initial sample text set, where the second group of sample texts includes a first subgroup of sample texts and a second subgroup of sample texts, the text similarity between the i-th sample text in the first subgroup and the j-th sample text in the second subgroup is greater than the preset threshold, and i and j are positive integers greater than or equal to 1.
The first group of sample texts and the first subgroup of sample texts are removed from the initial sample text set to obtain a third group of sample texts.
The third group of sample texts is determined as the positive sample set, and the first group of sample texts and the first subgroup of sample texts are determined as the negative sample set.
Training is performed by using the positive sample set and the negative sample set, and a text format conversion model for converting the source sample text in the source format into the target sample text in the target format is obtained.
By adopting this method, sample texts containing abnormal symbols are removed from the initial sample text set by judging whether the sample texts meet the symbol configuration condition; repeated sample texts are removed by determining the second group of sample texts whose text similarity is greater than the preset threshold; and the text conversion model is trained with a sample set from which abnormal sample texts have been removed, which improves the accuracy of the model's results. In other words, the preset judgment conditions clean abnormal sample texts out of the initial sample text set, ensuring the accuracy of the sample texts fed to the text conversion model and achieving the technical effect of improving the accuracy of text conversion preprocessing.
It is readily appreciated why similar sample texts are de-duplicated: when two pieces of data have nearly identical source sample texts (e.g., Chinese texts) but the translation results output by the machine translation model differ greatly, such data is fatal to model training. It confuses the knowledge the model has learned, and the machine translation model cannot know which word it should correctly predict. For example, as shown in fig. 5, when training with such a data source, suppose the machine translation model is used to translate "fighting mountain and river"; it may ultimately output an English text unrelated to the original semantics of that phrase.
Such divergent translation results may arise because different translators produced different renderings historically, so two similar Chinese sentences end up with two very different output English sentences.
To de-duplicate the sample texts, the embodiment of the present application processes repeated data with an N-gram algorithm and cosine similarity. The N-gram is an algorithm based on a statistical language model, also called a first-order Markov chain. Its basic idea is to slide a window of size N over the text content byte by byte, forming a sequence of byte fragments of length N, each fragment being called a gram. The occurrence frequencies of all grams are counted and filtered against a preset threshold to form a key gram list, i.e., the vector feature space of the text, in which each gram in the list is one feature-vector dimension. The implementation of the N-gram algorithm is described below in connection with specific embodiments.
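As a sketch of the sliding-window and frequency-filtering idea just described (window size and threshold are illustrative):

```python
from collections import Counter

def key_gram_list(texts: list[str], n: int = 2, min_count: int = 2) -> list[str]:
    """Slide an n-sized window over each text, count every gram's occurrence
    frequency, and keep the grams at or above a preset threshold: the key
    gram list, i.e. the vector feature space of the text."""
    counts: Counter = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1  # each n-character fragment is a gram
    return [gram for gram, count in counts.items() if count >= min_count]
```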
As an optional example, determining, from the initial sample text set, the second group of sample texts whose text similarity is greater than the preset threshold includes: dividing the source sample texts in the initial sample text set into F pairs of source sample texts, where F is a positive integer greater than or equal to 2; determining the text similarity between the two source sample texts in each pair to obtain F text similarities, one for each of the F pairs; determining, from the F text similarities, M pairs of source sample texts whose text similarity is greater than the preset threshold, where M is a positive integer greater than or equal to 1 and less than or equal to F; determining the first subgroup of sample texts according to the M pairs of source sample texts; and determining the remaining sample texts in the M pairs, excluding the first subgroup, as the second subgroup of sample texts.
In the embodiment of the present application, the texts in the initial sample text set come in pairs; for example, when the source language is Chinese and the target language is English, each pair includes a Chinese text and an English text.
Assume a traversal method is adopted: the source sample texts (e.g., all Chinese sentences) in the initial sample text set are divided into F pairs, and the text similarity between the two source sample texts in each of the F pairs is then calculated, yielding F text similarities.
M pairs of source sample texts whose text similarity is greater than the preset threshold are then found among the F pairs; the first subgroup of sample texts is determined from these M pairs, and the remaining sample texts in the M pairs, excluding the first subgroup, are determined as the second subgroup of sample texts.
For example, as shown in fig. 6, assume that there are 10 pairs of sample text in the initial sample text set, and each pair of sample text includes one source sample text (e.g., chinese sentence 1) and one target sample text (e.g., english sentence 1).
The Chinese sentences among the 10 pairs are paired two by two using the traversal method to obtain 5 pairs of Chinese sentences, where the two Chinese sentences in a pair may be partially identical in content or semantically similar. By calculating the text similarity of each of the 5 pairs of Chinese sentences, the text similarity of 3 pairs is determined to be greater than the preset threshold, for example Chinese sentence 1 and Chinese sentence 2, Chinese sentence 3 and Chinese sentence 4, and Chinese sentence 5 and Chinese sentence 6 as shown in fig. 6.
Chinese sentence 1 and Chinese sentence 2 are then determined to be repeated sentences; similarly, Chinese sentence 3 and Chinese sentence 4, and Chinese sentence 5 and Chinese sentence 6, are each determined to be repeated sentences. To avoid the interference repeated data causes in model training, the repeated data is de-duplicated.
Specifically, the second source sample text in each pair whose text similarity is greater than the preset threshold is deleted (e.g., Chinese sentence 1 being the first source sample text and Chinese sentence 2 the second), together with the second target sample text corresponding to it; for example, Chinese sentence 2 and English sentence 2 are deleted, then Chinese sentence 4 and English sentence 4, and so on.
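This deletion step over paired data can be sketched as follows, assuming the duplicate pairs have been found as above; names are illustrative.

```python
def remove_second_of_each_pair(
    sample_pairs: list[tuple[str, str]],           # (Chinese sentence, English sentence)
    duplicate_index_pairs: list[tuple[int, int]],  # e.g. [(0, 1), (2, 3), (4, 5)]
) -> list[tuple[str, str]]:
    """Delete the second source sample text of each duplicate pair together
    with its corresponding target sample text (e.g. Chinese sentence 2 and
    English sentence 2), keeping the first."""
    to_delete = {second for _, second in duplicate_index_pairs}
    return [pair for index, pair in enumerate(sample_pairs) if index not in to_delete]
```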
As another alternative example, besides determining repeated sample texts from the text similarity between source sample texts, repeated sample texts may be determined jointly from a first text similarity between source sample texts and a second text similarity between target sample texts. This specifically includes: dividing the source sample texts in the initial sample text set into F pairs of source sample texts and the target sample texts into F pairs of target sample texts, where F is a positive integer greater than or equal to 2; determining the first text similarity between the two source sample texts in each pair to obtain F first text similarities, one for each of the F pairs of source sample texts; determining the second text similarity between the two target sample texts in each pair to obtain F second text similarities, one for each of the F pairs of target sample texts; and determining the i-th pair of source sample texts as repeated sample text when the first text similarity of the i-th pair of source sample texts is greater than a first preset threshold and the second text similarity of the i-th pair of target sample texts is less than or equal to a second preset threshold, where the i-th pair of target sample texts is the pair obtained by converting the i-th pair of source sample texts from the source language into the target language with a text conversion model, and i is a positive integer greater than or equal to 1 and less than or equal to F.
After repeated sample texts are determined according to this method, one text of each repeated pair is removed from the initial sample text set.
For example, as shown in fig. 7, assume that there are 10 pairs of sample text in the initial sample text set, and each pair of sample text includes one source sample text (e.g., chinese sentence 1) and one target sample text (e.g., english sentence 1).
Using the traversal method, the Chinese sentences among the 10 pairs are paired two by two to obtain 5 pairs of Chinese sentences, and the English sentences are likewise paired two by two to obtain 5 pairs of English sentences; the first text similarity between Chinese sentence 1 and Chinese sentence 2 and the second text similarity between English sentence 1 and English sentence 2 are then judged.
When the first text similarity is greater than the first preset threshold and the second text similarity is less than or equal to the second preset threshold, Chinese sentence 1 and Chinese sentence 2 are determined to be repeated sentences, and Chinese sentence 2 and English sentence 2 are deleted.
By the same method, the text similarity of each remaining pair of sentences (Chinese sentence 3 and Chinese sentence 4, Chinese sentence 5 and Chinese sentence 6, and so on) is determined, repeated sample texts are thereby identified, and the corresponding sample texts are deleted.
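The dual-threshold judgment above can be sketched as follows (the threshold values are illustrative):

```python
from typing import Callable

def is_repeated_sample(
    source_pair: tuple[str, str],
    target_pair: tuple[str, str],
    text_similarity: Callable[[str, str], float],
    first_threshold: float = 0.9,
    second_threshold: float = 0.9,
) -> bool:
    """Repeated sample text: the source texts are nearly identical (first
    similarity above the first preset threshold) while their translations
    diverge (second similarity at or below the second preset threshold)."""
    first_similarity = text_similarity(*source_pair)    # between source sample texts
    second_similarity = text_similarity(*target_pair)   # between target sample texts
    return first_similarity > first_threshold and second_similarity <= second_threshold
```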
De-duplicating the sample data fed to the text conversion model in this way reduces the number of samples, simplifies the complexity of model training, and improves model performance.
As an optional implementation manner, determining the text similarity between the two source sample texts in each pair of source sample texts to obtain F text similarities includes determining a kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts, where the F pairs of source sample texts include the kth pair and k is a positive integer greater than or equal to 1 and less than or equal to F: determining a first word sequence from the first source sample text, where the s-th word in the first word sequence begins with the last character of its preceding adjacent word, the first word sequence includes Q words each of N characters, N is a positive integer greater than or equal to a preset value, and s is a positive integer greater than or equal to 2 and less than or equal to Q; determining a second word sequence from the second source sample text, where the t-th word in the second word sequence begins with the last character of its preceding adjacent word, the second word sequence includes R words each of N characters, Q and R are positive integers greater than or equal to 2, and t is a positive integer greater than or equal to 2 and less than or equal to R; and determining the kth text similarity from the first word sequence and the second word sequence, where the F text similarities include the kth text similarity.
In the embodiment of the present application, repeated sample texts are determined mainly with the N-gram algorithm and cosine similarity. Referring to the example shown in fig. 8, assume the first source sample text is "I like eating apples" (我喜欢吃苹果) and the second source sample text is "I like eating bananas" (我喜欢吃香蕉). With N in the N-gram algorithm set to 2, a 2-gram cut of 我喜欢吃苹果 yields the 5 gram fragments shown in fig. 8 (我喜, 喜欢, 欢吃, 吃苹, 苹果), which form the first word sequence; similarly, a 2-gram cut of 我喜欢吃香蕉 yields the second word sequence shown in fig. 8.
The text similarity between the first source sample text "I like eating apples" and the second source sample text "I like eating bananas" is then determined from the first word sequence and the second word sequence.
As an optional example, determining the kth text similarity according to the first word sequence and the second word sequence includes: determining a target word sequence according to the first word sequence and the second word sequence, wherein the target word sequence is a word sequence obtained by performing de-duplication on words in the first word sequence and the second word sequence and then splicing, the target word sequence comprises W words, W is a positive integer greater than or equal to 2 and less than or equal to the sum of the number of target words, and the sum of the number of target words is the sum of the number of words in the first word sequence and the number of words in the second word sequence; determining a first word frequency vector according to the first word sequence and the target word sequence, wherein elements in the first word frequency vector are used for indicating whether words in the first word sequence appear in the target word sequence; determining a second word frequency vector according to the second word sequence and the target word sequence, wherein elements in the second word frequency vector are used for indicating whether words in the second word sequence appear in the target word sequence; and determining the kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts according to the first word frequency vector and the second word frequency vector.
As shown in fig. 8, when the same words (我喜, 喜欢, 欢吃) are found in both the first word sequence and the second word sequence, only one copy of each is kept, and the first and second word sequences are then spliced to obtain the target word sequence.
As an optional example, determining the first word frequency vector according to the first word sequence and the target word sequence includes: in the case that the W words include a first part of the words in the first word sequence, setting 1 at the positions in the target word sequence corresponding to that first part of words, to obtain a first word frequency vector of dimension 1×W.
For the first word sequence, each gram fragment is taken as one dimension. For each position of the target word sequence, in positional order, it is judged whether the word at that position appears in the first word sequence: if it does, the value of the element at the corresponding position is set to 1; otherwise, it is set to 0.
For example, the first word in the target word sequence, 我喜 ("I like"), also appears in the first word sequence, so the element at the first position is set to 1; the second word, 喜欢 ("like"), also appears in the first word sequence, so the element at the second position is set to 1; and by analogy, the elements at the positions of the 3rd to 5th words in the target word sequence are set to 1.
However, for the word "fragrance" at the position of the 6 th word in the target word sequence, which does not appear in the first word sequence, then the element at the position of the 6 th word in the target word sequence is set to 0; likewise, the element at the position of the 7 th word in the target word sequence will then be set to 0.
The first word frequency vector [1,1,1,1,1,0,0] is then derived from the values of the elements at each position in the target word sequence.
As an optional example, determining the second word frequency vector according to the second word sequence and the target word sequence includes: in the case that the W words include a second part of the words in the second word sequence, setting 1 at the positions in the target word sequence corresponding to that second part of words, to obtain a second word frequency vector of dimension 1×W.
As shown in fig. 8, by the same method as for the first word frequency vector, the values of the elements at the positions of 我喜, 喜欢, 欢吃, 吃香 and 香蕉 in the target word sequence are set to 1 and the values of the elements at the remaining positions are set to 0, giving the second word frequency vector [1,1,1,0,0,1,1].
As an optional implementation manner, determining the kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts according to the first word frequency vector and the second word frequency vector includes: determining the cosine similarity between the first word frequency vector and the second word frequency vector, and determining the cosine similarity as the kth text similarity.
After the word frequency vectors of the two sentences are determined, the cosine similarity between the first word frequency vector and the second word frequency vector is calculated by the following formula (1):

$$\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} \qquad (1)$$

where $A$ and $B$ represent the first word frequency vector and the second word frequency vector respectively, $A \cdot B$ represents the dot product of the two word frequency vectors, and $\|A\|$ and $\|B\|$ represent the norms (i.e., modulus lengths) of vectors $A$ and $B$ respectively. Cosine similarity is typically used to express how similar two vectors are in vector space.
Specifically, the calculation procedure and calculation result of the cosine similarity between the first source sample text (sentence 1) and the second source sample text (sentence 2) are shown in fig. 8.
Wherein, the closer the cosine similarity value is to 1, the more similar sentence 1 and sentence 2 are; the closer the value is to 0, the less similar they are. Empirical data shows that when the cosine similarity is greater than 0.9, the two sentences are almost identical and can be deduplicated, i.e., only one of the two sentences is retained.
For the two data entries mentioned in the above embodiment, the cosine similarity calculated using N-gram word frequency vectorization is 1, which indicates that the two entries are sentences with identical expressions, so one of them is automatically filtered out.
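Continuing the sketch above, formula (1) and the deduplication rule might be applied as follows; the 0.9 threshold comes from the embodiment, while the example vectors and helper are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    # Formula (1): cos(A, B) = (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1, 1, 1, 1, 1, 0, 0]  # first word frequency vector (fig. 8)
v2 = [1, 1, 1, 0, 0, 1, 1]  # second word frequency vector (fig. 8)
sim = cosine_similarity(v1, v2)  # 3 / 5 = 0.6 for this pair
if sim > 0.9:  # empirical deduplication threshold of the embodiment
    print("near-duplicate pair: keep only one sentence")
```

Two sentences with identical expressions yield identical word frequency vectors and hence a cosine similarity of exactly 1, which is why the pair in the embodiment is filtered.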
From the principle of cosine similarity, it can be seen that the above method of calculating text similarity tends to find texts (in the source language) whose whole-sentence wording is nearly identical, both in the words used and in their order. In the application scenario of cleaning data sources for the machine translation model, the data processed by this text similarity calculation method meets our usage expectations; for details, refer to the specific embodiment of filtering repeated data shown in fig. 9.
In connection with the description of the above embodiment, in the process of preprocessing the input data of the machine translation model with the data cleaning method, not only the data containing abnormal symbols but also the repeated data are removed. How data with abnormal symbols is removed from the initial sample text set is described below in connection with a specific embodiment.
As an optional example, the determining, from the initial sample text set, the first set of sample texts that do not satisfy the symbol configuration condition includes: searching sample texts containing abnormal characters from an initial sample text set to obtain first-type sample texts, wherein the initial sample text set comprises first-type sample texts, and each pair of texts in the first-type sample texts comprises a source sample text of a source language and a target sample text of a target language; searching a second type sample text of which the languages of the sample text are other than the source language and the target language from an initial sample text set, wherein the initial sample text set comprises the second type sample text, and each pair of texts in the second type sample text comprises the source sample text of the source language and the target sample text of the target language; searching a third type of sample text with the text content being a null value from the initial sample text set, wherein the first group of sample text comprises a first type of sample text, a second type of sample text and the third type of sample text; wherein the symbol configuration condition includes at least one of: the abnormal symbol does not appear in the sample text, the languages of the sample text comprise source language and target language, and the text content of the sample text does not contain null values.
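A minimal sketch of these three searches follows. The control-character pattern, the language codes, and the detect_lang helper are illustrative assumptions; the embodiment does not specify how languages are detected.

```python
import re

# Assumed abnormal-character set: non-printable control characters.
ABNORMAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def violates_symbol_config(src, tgt, detect_lang):
    """Return the violation type for a (source, target) pair, or None."""
    # First type: sample text containing abnormal characters.
    if ABNORMAL.search(src) or ABNORMAL.search(tgt):
        return "abnormal-character"
    # Second type: languages other than the source and target languages.
    # detect_lang is an assumed helper returning a language code;
    # "zh"/"en" mirror the Chinese-to-English example of the embodiment.
    if detect_lang(src) != "zh" or detect_lang(tgt) != "en":
        return "language-mismatch"
    # Third type: text content that is a null value.
    if not src.strip() or not tgt.strip():
        return "null-content"
    return None  # satisfies the symbol configuration condition
```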
Before describing the preprocessing of sample text that does not meet the symbol configuration condition, a brief description is given of the dirty data preprocessing process and of common types of dirty data.
In embodiments of the present application, the input data may be determined to be dirty data in, but not limited to, one of the following ways.
(1) The input data is redundant data.
For example, repeatedly occurring sentences, sentences having a similarity greater than a preset threshold value, or the like.
(2) The input data is error data.
For example, sentences that are not smooth, sentences that do not conform to grammar rules, or sentences that express incompleteness, etc.
(3) The input data contains an abnormal symbol or an abnormal character.
For example, the input data contains symbols or identifications that do not meet the system requirements, data that is not specified by the default data fields of the system, or brackets that do not appear in pairs, etc.
(4) The data is null.
In an alternative embodiment, the dirty data contained in the input data may be preprocessed, but not limited to, in the manner shown in fig. 12, and the following steps S1202 to S1212 may be referred to.
S1202, judging whether the original text contains test characters.
For example, if the original text contains a project-specified identifier marking non-released content, such as [ex], "DNT", or "test", the data is saved directly to the negative sample pool, because such data will not be used by subsequent training.
The original text may be understood as a source sample text of a source language in the initial sample text set.
S1204, judging whether the language of the original text matches the source language and whether the language of the translation matches the target language.
For example, assuming that the source language of the original text is Chinese and the target language of the translation is English, if the original text does not contain Chinese, or the translation contains content in languages other than English, such data is determined to be dirty data that cannot be used for model training, and is saved to the negative sample pool.
Where translation may be understood, but is not limited to, as target sample text in a target language that is translated from source sample text using a text conversion model (machine translation model).
S1206, judging whether the translation is empty.
If the translation is empty, step S1210 is performed; otherwise, step S1208 is performed.
S1208, judging whether the sentence is a pure symbol.
If yes, directly storing the data into a negative sample pool; otherwise, step S1212 is performed.
Obviously, if the judgment result is that the sentence is text of a pure-symbol type, the text is unusable data and is not used for training.
S1210, calling an AI translation model built in the target application to carry out supplementary translation, and storing the supplementary translation and the original text into a positive sample pool for subsequent training.
S1212, finally, simple paired-symbol matching is carried out: if paired brackets are judged to exist in the original text but do not appear in the translation text, the group of data is determined to be dirty data at the symbol level and enters the negative sample pool.
It should be noted that the data still correct after the above cleaning enters the positive sample pool.
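Steps S1202 to S1212 might be strung together roughly as follows; the marker list, the bracket pairs, and the three helper functions (same_languages, is_pure_symbol, ai_translate) are assumptions introduced for illustration.

```python
TEST_MARKERS = ("[ex]", "DNT", "test")          # project-specific identifiers
BRACKET_PAIRS = (("(", ")"), ("[", "]"), ("{", "}"))

def clean_pair(src, tgt, same_languages, is_pure_symbol, ai_translate):
    """Route one (original text, translation) pair to a sample pool."""
    # S1202: test characters in the original text -> negative pool.
    if any(marker in src for marker in TEST_MARKERS):
        return ("negative", src, tgt)
    # S1204: original/translation language check.
    if not same_languages(src, tgt):
        return ("negative", src, tgt)
    # S1206 / S1210: empty translation -> supplementary AI translation,
    # then the pair enters the positive pool for subsequent training.
    if not tgt.strip():
        return ("positive", src, ai_translate(src))
    # S1208: pure-symbol sentences are unusable for training.
    if is_pure_symbol(src):
        return ("negative", src, tgt)
    # S1212: paired brackets in the original but missing in the translation.
    for left, right in BRACKET_PAIRS:
        if left in src and right in src and (left not in tgt or right not in tgt):
            return ("negative", src, tgt)
    # Data still correct after cleaning enters the positive pool.
    return ("positive", src, tgt)
```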
In a specific embodiment, common dirty data consists essentially of the following.
(1) When some special characters are mixed into the data set, this type of data contributes nothing to the tuning of the machine translation model; refer specifically to (a) of fig. 10.
(2) When some special identifiers are mixed into the data set, this type of data is also defined as dirty data; specifically, as shown in (b) of fig. 10, when the identifier matches a project-specific rule definition such as [ex], [DNT], or a test word, the data is determined to be dirty data.
Wherein [ex] may be understood, but is not limited to, as meaning that the data is not within version control of the target application; in other words, data with the [ex] identifier is data that will not be used for a period of time. DNT may be understood, but is not limited to, as meaning that the text of the data is not translated for the time being, i.e., the target sample in the data sample pair is empty; such data is meaningless to the model training process.
(3) When data of a language other than the source language and the target language is included in the data, it is determined that the language of the data does not meet the specification, and the data is determined as dirty data as in (c) and (d) of fig. 10.
(4) When the data is doped with an abnormal symbol, the data is determined to be dirty data, as shown in (a) of fig. 11.
(5) In the case where a null value is included in the data, it is determined as dirty data, as shown in (b) of fig. 11.
By cleaning the abnormal data that do not meet the symbol configuration condition, the data in the initial sample text set is effectively reduced, for example, from the initial 110,000 raw entries to 60,000 clean entries, removing nearly 50% of the dirty data, which effectively reduces the amount of model training data and reduces resource occupation.
At the same time, training the model with the cleaned data reduces training time; for example, the training time of a base model on the order of 14B parameters can be compressed from one week to 2 days.
As an alternative example, before training with the positive sample set and the negative sample set to obtain a text conversion model for converting the source sample text of the source language into the target sample text of the target language, the method further includes: the target pre-processing model is obtained by inputting the positive sample set and the negative sample set into the pre-processing model, wherein the target pre-processing model is a model for identifying the categories of the sample texts in the positive sample set and the negative sample set.
After the abnormal samples in the initial sample text set are cleaned by the above method, in order to verify the accuracy of the cleaned data, a preprocessing model is also introduced in the embodiment of the application. The preprocessing model can assist in classifying whether text is correct, for example, a BERT model, which is well suited to classification tasks.
As an alternative example, the above-mentioned obtaining the target pretreatment model by inputting the positive sample set and the negative sample set into the pretreatment model includes: converting positive samples in the positive sample set into a first multidimensional vector, wherein the first multidimensional vector comprises first type information and semantic information of the positive samples, and the first type information represents that the types of sample texts are positive samples; converting the negative samples in the negative sample set into a second multidimensional vector, wherein the second multidimensional vector comprises second class information and semantic information of the negative samples, and the second class information represents that the class of the sample text is the negative samples; inputting the first multidimensional vector and the second multidimensional vector into a preprocessing model to obtain a group of prediction probabilities, wherein the group of prediction probabilities comprises the prediction probabilities that the categories of the positive samples of the positive sample set belong to the first category information and the second category information and the prediction probabilities that the categories of the negative samples of the negative sample set belong to the first category information and the second category information; determining a target loss function according to a group of prediction probabilities, first class information of the positive sample and second class information of the negative sample; and under the condition that the value of the target loss function meets the convergence condition, stopping training to obtain the target preprocessing model.
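As a rough sketch of this training procedure, assuming PyTorch and cross-entropy as the target loss function (the embodiment does not name a specific loss), the convergence check might look like:

```python
import torch
import torch.nn as nn

def train_preprocessing_model(model, loader, max_epochs=10, tol=1e-4):
    # model maps a batch of multidimensional vectors to two logits:
    # first class information (positive) and second class information (negative).
    loss_fn = nn.CrossEntropyLoss()          # assumed target loss function
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    previous = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for vectors, labels in loader:        # labels: 0 = positive, 1 = negative
            optimizer.zero_grad()
            loss = loss_fn(model(vectors), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        # Stop training once the value of the loss function converges.
        if abs(previous - total) < tol:
            break
        previous = total
    return model
```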
The process of training the pre-processing model using the positive and negative sample sets obtained by the above-described classification after cleaning is described below with reference to fig. 13.
S1302, data preparation.
After the above cleaning of abnormal data and filtering of repeated data, a positive sample set and a negative sample set are obtained, and the filtered data is collated into a table as shown in fig. 14. The main purpose of the collation is to attach category information to the data source: for example, a problematic sentence has its classification information defined as bad case, and a non-problematic sentence has its classification information defined as good case.
S1304, the words of each data entry are segmented through the tokenizer of the BERT model.
S1306, converting the segmented data into a data format required by the BERT model.
For example, the [CLS] token, which stores semantic vector information for the discriminative task of BERT, and the separator token [SEP] are added before and after the tokenized word sequence.
In a specific embodiment, the sentence 'I will watch Memento tonight' becomes I, will, watch, Memento, tonight after passing through the tokenizer, and [CLS] and the separator [SEP] are added to form '[CLS] I will watch Memento tonight [SEP]', where [CLS] is used to describe the sentence semantics, including the category information of the sentence.
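Using the Hugging Face transformers library as one possible toolkit (an assumption; the embodiment does not name its implementation), steps S1304 and S1306 reduce to a single tokenizer call:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I will watch Memento tonight")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
# tokens starts with '[CLS]' and ends with '[SEP]'; the tokenizer
# may further split rare words such as "Memento" into subword pieces.
print(tokens)
```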
S1308, inputting the data with the converted format into a network layer of the BERT model, and learning to obtain model output.
S1310, filling [ CLS ] with a multidimensional vector describing sentence semantics.
The multidimensional vector comprises two parts, wherein the first part is used for describing the semantic information of the sentence, and the second part is the category information representing the sentence.
S1312, inputting the multidimensional vector into a linear classification layer of the BERT model to obtain the prediction probability of each classification of the sentence.
Specifically, through the code segment of fig. 15, when the preset number of categories is 2, the model output is the predicted probability that the currently input sample is a positive sample or a negative sample. For example, 'I like eat apples' is classified as a bad case with a probability of 0.9; when the probability is greater than the empirical threshold of 0.9, the original data is considered to belong to that classification, and otherwise it is not.
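The code segment of fig. 15 is not reproduced here; a sketch in the same spirit, assuming the transformers and torch libraries and a checkpoint already fine-tuned on the two categories, might be:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# "finetuned-checkpoint" is a placeholder path, not a real model name.
tokenizer = BertTokenizer.from_pretrained("finetuned-checkpoint")
model = BertForSequenceClassification.from_pretrained(
    "finetuned-checkpoint", num_labels=2)   # preset number of categories: 2

inputs = tokenizer("I like eat apples", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]    # label order (good/bad) is assumed
# Empirical rule of the embodiment: a sample is assigned to a class
# only when its predicted probability exceeds 0.9.
is_bad_case = probs[1].item() > 0.9
```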
After the above BERT discriminative-task fine-tuning, a model with the ability to judge whether text is correct is generated. In subsequent operation, this discrimination model is applied to assist the target application in determining whether a translation is dirty data. The specific flow is shown in fig. 16.
S1602, data source collection.
As an alternative example, the manner in which the data sources are collected includes: and determining a historical text set generated by the target application in a historical operation period, a preconfigured original text set and an associated text set as an initial sample text set, wherein the original text set comprises texts predefined in a configuration table of the target application, and the associated text set is a text set generated by other applications with the same functions and categories as the target application in the operation process.
As shown in fig. 4, the data source extraction module extracts the history translation data (data generated by running the target application in the history period) from the intra-game configuration translation table, and extracts the corresponding data information from the intra-game corpus.
The set of historical text includes, but is not limited to, historical translation data extracted from an in-game configuration translation table, typically long text, e.g., text for describing skills; the original text collection includes, but is not limited to, data extracted from a corpus within the game, typically short text, e.g., phrases of predefined numbers, ranks, credits, etc., by the project developer.
The set of associated text includes, but is not limited to, data obtained from other game applications related or similar to the target application using context data crawling logic, e.g., the target application is a shooter-type game application, and data generated during the running of the same type of shooter-type game application is the set of associated text.
S1604, data cleansing.
Generally divided into two parts: (1) cleaning dirty data, such as text containing abnormal symbols, null values, or languages other than the source and target languages; (2) performing deduplication processing on the repeated data.
The processing procedure of the above two data may refer to the description in the above embodiment, and will not be repeated here.
S1606, classifying after cleaning to obtain a positive sample set and a negative sample set, and training the BERT distinguishing model.
The training process may refer to the description in the above embodiments, and will not be repeated here.
S1608, determining the accuracy of the category of the text after the data cleaning according to the training result of the BERT model.
After the data source collection and cleaning processes of this embodiment are performed, all sample texts are divided into positive samples and negative samples, and the two sets are then input into the fine-tuned BERT classification discrimination model to assist in judging whether erroneous sentences appear in the positive samples, whether the negative samples are all problem sentences, and so on.
In other words, in order to improve the performance of the text conversion model, the method of this embodiment first cleans the abnormal text in the initial sample text set and filters the repeated text to obtain a positive sample set and a negative sample set. The positive sample set and the negative sample set are then input into the preprocessing model to assist in judging whether the samples in the two sets are accurate, and the sample set processed by the BERT discrimination model is used as the training data of the text conversion model, which reduces the amount of model training data, lowers the cost of model fine-tuning, and ensures the accuracy of the training data. In short, improving the purity of the data input into the text conversion model improves the performance of the trained text conversion model and thus its accuracy.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
According to still another aspect of the embodiment of the present application, there is also provided a preprocessing apparatus for text conversion as shown in fig. 17, the apparatus including: a first obtaining unit 1702 configured to obtain an initial sample text set from application data of a target application, where each pair of sample texts in the initial sample text set includes a source sample text of a source language and a target sample text of a target language, and the target sample text is obtained after content conversion based on text content of the source sample text; a first processing unit 1704, configured to determine a first set of sample texts that do not meet a symbol configuration condition from the initial sample text set, where the symbol configuration condition is used to indicate a format requirement of a symbol included in text content of the sample text; a second processing unit 1706, configured to determine, from the initial sample text set, a second set of sample texts with a text similarity greater than a preset threshold, where the second set of sample texts includes a first subset of sample texts and a second subset of sample texts, a text similarity between an ith sample text in the first subset of sample texts and a jth sample text in the second subset of sample texts is greater than the preset threshold, the text similarity is used to indicate a content similarity between the ith sample text and the jth sample text, and i and j are positive integers greater than or equal to 1; a third processing unit 1708, configured to reject, from the initial sample text set, a first subset of sample texts in the first set of sample texts and the second set of sample texts, to obtain a third set of sample texts; a fourth processing unit 1710, configured to determine the third set of sample texts as a positive sample set, and determine the first set of sample texts and the first subset of sample texts as a negative sample set; the training unit 1712 is configured to perform training by using the positive sample set and the negative sample set, so as to obtain a text conversion model for converting the source sample text of the source language into the target sample text of the target language.
Optionally, the second processing unit 1706 includes: the first processing module is used for dividing the source sample text in the initial sample text set into F pairs of source sample texts, wherein F is a positive integer greater than or equal to 2; the second processing module is used for determining the text similarity between two source sample texts in each pair of source sample texts to obtain F text similarities, wherein the text similarity comprises the text similarity between each pair of source sample texts in the F pairs of source sample texts, and F is a positive integer greater than or equal to 1; the third processing module is used for determining M pairs of source sample texts with text similarity larger than a preset threshold value from F text similarity, wherein M is a positive integer larger than or equal to 1 and smaller than or equal to F; a fourth processing module, configured to determine a first subset of sample texts according to M pairs of source sample texts; and a fifth processing module, configured to determine remaining sample texts in the M pairs of source sample texts except for the first sub-set of sample texts as a second sub-set of sample texts.
Optionally, the second processing module includes: a first processing sub-module configured to determine a kth text similarity between a first source sample text and a second source sample text in a kth pair of source sample texts, where the F pairs of source sample texts include the kth pair of source sample texts, and k is a positive integer greater than or equal to 1 and less than or equal to F: determining a first word sequence according to the first source sample text, wherein the first character of the s-th word in the first word sequence is the same as the last character of the previous adjacent word, the first word sequence comprises Q words with the number of characters being N, N is a positive integer greater than or equal to a preset value, and s is a positive integer greater than or equal to 2 and less than or equal to Q; determining a second word sequence according to the second source sample text, wherein the first character of the t-th word in the second word sequence is the same as the last character of the previous adjacent word, the second word sequence comprises R words with the number of characters being N, Q, R is a positive integer greater than or equal to 2, and t is a positive integer greater than or equal to 2 and less than or equal to R; and determining the kth text similarity according to the first word sequence and the second word sequence, wherein the F text similarities comprise the kth text similarity.
Optionally, the second processing module includes: the second processing sub-module is used for determining a target word sequence according to the first word sequence and the second word sequence, wherein the target word sequence is a word sequence obtained by performing de-duplication on words in the first word sequence and the second word sequence and then splicing, the target word sequence comprises W words, W is a positive integer greater than or equal to 2 and less than or equal to the sum of the number of the target words, and the sum of the number of the target words is the sum of the number of the words in the first word sequence and the number of the words in the second word sequence; a third processing sub-module, configured to determine a first word frequency vector according to the first word sequence and the target word sequence, where elements in the first word frequency vector are used to represent whether the words in the first word sequence appear in the target word sequence; a fourth processing sub-module, configured to determine a second word frequency vector according to the second word sequence and the target word sequence, where elements in the second word frequency vector are used to represent whether the words in the second word sequence appear in the target word sequence; and a fifth processing sub-module, configured to determine a kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts according to the first word frequency vector and the second word frequency vector.
Optionally, the second processing module includes: and the sixth processing sub-module is used for setting 1 on the position corresponding to the first part of words in the target word sequence under the condition that the W words comprise the first part of words in the first word sequence, so as to obtain a first word frequency vector with the dimension of 1×W.
Optionally, the second processing module includes: and a seventh processing sub-module, configured to, when the W words include a second partial word in the second word sequence, place 1 on a position corresponding to the second partial word in the target word sequence, to obtain a second word frequency vector with a dimension of 1×W.
Optionally, the second processing module includes: and the eighth processing sub-module is used for determining cosine similarity between the first word frequency vector and the second word frequency vector and determining the cosine similarity as the kth text similarity.
Optionally, the first processing unit 1704 includes: the first searching module is used for searching sample texts containing abnormal characters from an initial sample text set to obtain a first type sample text, wherein the initial sample text set comprises the first type sample text, and each pair of texts in the first type sample text comprises a source sample text of a source language and a target sample text of a target language; the second searching module is used for searching the second type of sample texts of which the languages are languages except the source language and the target language from the initial sample text set, wherein the initial sample text set comprises the second type of sample texts, and each pair of texts in the second type of sample texts comprises the source sample text of the source language and the target sample text of the target language; a third searching module, configured to search a third type of sample text whose text content is null from the initial sample text set, where the first set of sample texts includes a first type of sample text, a second type of sample text, and a third type of sample text; wherein the symbol configuration condition includes at least one of: the abnormal symbol does not appear in the sample text, the languages of the sample text comprise source language and target language, and the text content of the sample text does not contain null values.
Optionally, the apparatus further includes: and a fifth processing unit for obtaining a target preprocessing model by inputting the positive sample set and the negative sample set into the preprocessing model, wherein the target preprocessing model is a model for identifying the categories of sample texts in the positive sample set and the negative sample set.
Optionally, the fifth processing unit includes: a sixth processing module, configured to convert the positive samples in the positive sample set into a first multidimensional vector, where the first multidimensional vector includes first class information and semantic information of the positive samples, and the first class information indicates that the class of the sample text is the positive samples; a seventh processing module, configured to convert the negative samples in the negative sample set into a second multidimensional vector, where the second multidimensional vector includes second class information and semantic information of the negative samples, and the second class information indicates that the class of the sample text is the negative samples; an eighth processing module, configured to input the first multidimensional vector and the second multidimensional vector into a preprocessing model to obtain a set of prediction probabilities, where the set of prediction probabilities includes the prediction probabilities that the classes of the positive samples in the positive sample set belong to the first class information and the second class information, and the prediction probabilities that the classes of the negative samples in the negative sample set belong to the first class information and the second class information; a ninth processing module, configured to determine a target loss function according to the set of prediction probabilities, the first class information of the positive samples, and the second class information of the negative samples; and a tenth processing module, configured to stop training to obtain the target preprocessing model when the value of the target loss function meets the convergence condition.
Optionally, the first obtaining unit 1702 includes: an eleventh processing module, configured to determine, as an initial sample text set, a historical text set generated by the target application in a historical running period, a preconfigured original text set, and an associated text set, where the original text set includes text predefined in a configuration table of the target application, and the associated text set is a text set generated by other applications with the same functions and categories as the target application in the running process.
Through the application of the above device, sample texts containing abnormal symbols are removed from the initial sample text set by judging whether the sample texts in the initial sample text set meet the symbol configuration condition; repeated sample texts in the initial sample text set are removed by determining a second group of sample texts whose text similarity is greater than a preset threshold; and the text conversion model is trained with the sample set from which the abnormal sample texts have been removed, so that the accuracy of the result of the text conversion model is improved. In other words, the abnormal sample texts in the initial sample text set are cleaned through preset judgment conditions, which ensures the accuracy of the sample texts input into the text conversion model and achieves the technical effect of improving the accuracy of the preprocessing result of text conversion.
It should be noted that, the embodiment of the text conversion preprocessing device may refer to the embodiment of the text conversion preprocessing method, and will not be described herein.
According to still another aspect of the embodiment of the present application, there is also provided an electronic device for implementing the above-mentioned text conversion preprocessing method, where the electronic device may be the target terminal or the server shown in fig. 1. The present embodiment is described taking the electronic device as a target terminal as an example. As shown in fig. 18, the electronic device comprises a memory 1802 and a processor 1804, the memory 1802 having stored therein a computer program, the processor 1804 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-mentioned processor may be configured to execute the following steps S1 to S5 by a computer program.
S1, determining a first group of sample texts which do not meet symbol configuration conditions from an initial sample text set, wherein the symbol configuration conditions are used for indicating format requirements of symbols contained in text contents of the sample texts.
S2, determining a second group of sample texts with the text similarity larger than a preset threshold value from the initial sample text set, wherein the second group of sample texts comprises a first sub-group of sample texts and a second sub-group of sample texts, the text similarity between an ith sample text in the first sub-group of sample texts and a jth sample text in the second sub-group of sample texts is larger than the preset threshold value, and the text similarity is used for indicating the content similarity between the ith sample text and the jth sample text, and i and j are positive integers larger than or equal to 1.
And S3, eliminating the first group of sample texts from the initial sample text set and the first subgroup of sample texts in the second group of sample texts to obtain a third group of sample texts.
S4, determining the third group of sample texts as positive sample sets, and determining the first group of sample texts and the first subgroup of sample texts as negative sample sets.
And S5, training by using the positive sample set and the negative sample set to obtain a text conversion model for converting the source sample text of the source language into the target sample text of the target language.
Alternatively, it will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 18 is merely illustrative, and fig. 18 does not limit the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 18, or have a different configuration than shown in fig. 18.
The memory 1802 may be used for storing software programs and modules, such as program instructions/modules corresponding to the text conversion preprocessing method and apparatus in the embodiment of the present application, and the processor 1804 executes the software programs and modules stored in the memory 1802, thereby performing various functional applications and data processing, that is, implementing the above-mentioned text conversion preprocessing method. The memory 1802 may include high-speed random access memory, but may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1802 may further include memory that is remotely located relative to the processor 1804, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. Wherein the memory 1802 may be specifically, but not limited to, used for storing an initial sample text set, a first set of sample texts, a second set of sample texts, and the like. As an example, as shown in fig. 18, the memory 1802 may include, but is not limited to, a first acquiring unit 1702, a first processing unit 1704, a second processing unit 1706, a third processing unit 1708, a fourth processing unit 1710, and a training unit 1712 in the preprocessing apparatus including the text conversion. In addition, other module units in the preprocessing device for text conversion may be included, but are not limited to, and are not described in detail in this example.
Optionally, the transmission device 1806 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 1806 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1806 is a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In addition, the electronic device further includes: a display 1808 for displaying the scene and the target object list; and a connection bus 1810 for connecting the various module components in the electronic device described above.
In other embodiments, the target terminal or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. The nodes may form a point-to-point network, and any type of computing device, such as a server, a target terminal, etc., may become a node in the blockchain system by joining the point-to-point network.
According to yet another aspect of the present application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform a method of preprocessing text conversion provided in various alternative implementations of the server verification process described above, wherein the computer program is configured to perform the steps of any one of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the following steps S1 to S5.
S1, determining a first group of sample texts which do not meet symbol configuration conditions from an initial sample text set, wherein the symbol configuration conditions are used for indicating format requirements of symbols contained in text contents of the sample texts.
S2, determining a second group of sample texts with the text similarity larger than a preset threshold value from the initial sample text set, wherein the second group of sample texts comprises a first sub-group of sample texts and a second sub-group of sample texts, the text similarity between an ith sample text in the first sub-group of sample texts and a jth sample text in the second sub-group of sample texts is larger than the preset threshold value, and the text similarity is used for indicating the content similarity between the ith sample text and the jth sample text, and i and j are positive integers larger than or equal to 1.
And S3, eliminating the first group of sample texts from the initial sample text set and the first subgroup of sample texts in the second group of sample texts to obtain a third group of sample texts.
S4, determining the third group of sample texts as positive sample sets, and determining the first group of sample texts and the first subgroup of sample texts as negative sample sets.
And S5, training by using the positive sample set and the negative sample set to obtain a text conversion model for converting the source sample text of the source language into the target sample text of the target language.
Alternatively, in embodiments of the present application, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing the target terminal related hardware, and the program may be stored in a computer readable storage medium, where the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method of the various embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and are merely a logical functional division, and there may be other manners of dividing the apparatus in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims (15)

1. A method for preprocessing text conversion, comprising:
Acquiring an initial sample text set from application data of a target application, wherein each pair of sample texts in the initial sample text set comprises a source sample text of a source language and a target sample text of a target language, and the target sample text is obtained after content conversion based on text content of the source sample text;
Determining a first group of sample texts which do not meet symbol configuration conditions from the initial sample text set, wherein the symbol configuration conditions are used for indicating format requirements of symbols contained in text contents of the sample texts;
Determining a second group of sample texts with text similarity larger than a preset threshold value from the initial sample text set, wherein the second group of sample texts comprises a first subgroup of sample texts and a second subgroup of sample texts, the text similarity between an ith sample text in the first subgroup of sample texts and a jth sample text in the second subgroup of sample texts is larger than the preset threshold value, and the text similarity is used for indicating the content similarity degree between the ith sample text and the jth sample text, and i and j are positive integers larger than or equal to 1;
Removing the first sub-group sample texts in the first group sample text and the second group sample text from the initial sample text set to obtain a third group sample text;
Determining the third set of sample text as a positive sample set and the first set of sample text and the first subset of sample text as a negative sample set;
Training by using the positive sample set and the negative sample set to obtain a text conversion model for converting the source sample text of the source language into the target sample text of the target language.
2. The method of claim 1, wherein determining a second set of sample texts from the initial set of sample texts having text similarity greater than a preset threshold comprises:
dividing the source sample text in the initial sample text set into F pairs of source sample texts, wherein F is a positive integer greater than or equal to 2;
Determining the text similarity between two source sample texts in each pair of source sample texts to obtain F text similarities, wherein the text similarity comprises the text similarity between each pair of source sample texts in the F pairs of source sample texts, and F is a positive integer greater than or equal to 1;
M pairs of source sample texts with the text similarity larger than the preset threshold value are determined from the F text similarities, wherein M is a positive integer larger than or equal to 1 and smaller than or equal to F;
determining the first subgroup of sample texts according to the M pairs of source sample texts;
and determining the rest sample texts except the first sub-group sample text in the M pairs of source sample texts as the second sub-group sample text.
3. The method of claim 2, wherein the determining the text similarity between two source sample texts in each pair of source sample texts, resulting in F text similarities, comprises:
Determining a kth text similarity between a first source sample text and a second source sample text in a kth pair of source sample texts, wherein the F pair of source sample texts includes the kth pair of source sample texts, k being a positive integer greater than or equal to 1 and less than or equal to F:
determining a first word sequence according to the first source sample text, wherein the first character of the s-th word in the first word sequence is the same as the last character of the previous adjacent word, the first word sequence comprises Q words with the number of characters being N, N is a positive integer greater than or equal to a preset value, and s is a positive integer greater than or equal to 2 and less than or equal to Q;
Determining a second word sequence according to the second source sample text, wherein the first character of the t-th word in the second word sequence is the same as the last character of the previous adjacent word, the second word sequence comprises R words with the number of characters being N, Q, R is a positive integer greater than or equal to 2, and t is a positive integer greater than or equal to 2 and less than or equal to R;
And determining the kth text similarity according to the first word sequence and the second word sequence, wherein the F text similarities comprise the kth text similarity.
4. The method of claim 3, wherein said determining said kth text similarity from said first word sequence and said second word sequence comprises:
Determining a target word sequence according to the first word sequence and the second word sequence, wherein the target word sequence is a word sequence obtained by performing de-duplication on words in the first word sequence and the second word sequence and then splicing, the target word sequence comprises W words, W is a positive integer greater than or equal to 2 and less than or equal to the sum of the number of target words, and the sum of the number of target words is the sum of the number of words in the first word sequence and the number of words in the second word sequence;
determining a first word frequency vector according to the first word sequence and the target word sequence, wherein elements in the first word frequency vector are used for representing whether words in the first word sequence appear in the target word sequence;
determining a second word frequency vector according to the second word sequence and the target word sequence, wherein elements in the second word frequency vector are used for representing whether words in the second word sequence appear in the target word sequence;
and determining the kth text similarity between the first source sample text and the second source sample text in the kth pair of source sample texts according to the first word frequency vector and the second word frequency vector.
5. The method of claim 4, wherein said determining a first word frequency vector from said first word sequence and said target word sequence comprises:
And under the condition that the W words comprise a first part of words in the first word sequence, setting 1 on the position corresponding to the first part of words in the target word sequence, and obtaining the first word frequency vector with the dimension of 1×W.
6. The method of claim 4, wherein said determining a second word frequency vector from said second word sequence and said target word sequence comprises:
And under the condition that the W words comprise second partial words in the second word sequence, setting 1 at the position corresponding to the second partial words in the target word sequence, and obtaining the second word frequency vector with the dimension of 1×W.
7. The method of claim 4, wherein determining the kth text similarity between a first source sample text and a second source sample text in a kth pair of source sample texts based on the first word frequency vector and the second word frequency vector comprises:
And determining cosine similarity between the first word frequency vector and the second word frequency vector, and determining the cosine similarity as the kth text similarity.
8. The method of claim 1, wherein determining a first set of sample text from the initial set of sample text that does not satisfy a symbol configuration condition comprises:
searching sample texts containing abnormal characters from the initial sample text set to obtain first-type sample texts, wherein the initial sample text set comprises the first-type sample texts, and each pair of texts in the first-type sample texts comprises a source sample text of the source language and a target sample text of the target language;
searching the initial sample text set for a second type of sample text of which the languages are other than the source language and the target language, wherein the initial sample text set comprises the second type of sample text, and each pair of texts in the second type of sample text comprises the source sample text of the source language and the target sample text of the target language;
Searching a third type of sample text with text content being null value from the initial sample text set, wherein the first group of sample text comprises the first type of sample text, the second type of sample text and the third type of sample text;
wherein the symbol configuration condition includes at least one of: and no abnormal symbol appears in the sample text, the languages of the sample text comprise the source language and the target language, and the text content of the sample text does not contain null values.
9. The method of any one of claims 1 to 8, wherein prior to said training with said positive and negative sample sets to obtain a text conversion model for converting source sample text of said source language to target sample text of said target language, said method further comprises:
And obtaining a target preprocessing model by inputting the positive sample set and the negative sample set into a preprocessing model, wherein the target preprocessing model is a model for identifying the types of sample texts in the positive sample set and the negative sample set.
10. The method of claim 9, wherein the obtaining the target pretreatment model by inputting the positive sample set and the negative sample set into a pretreatment model comprises:
converting positive samples in the positive sample set into a first multidimensional vector, wherein the first multidimensional vector comprises first type information and semantic information of the positive samples, and the first type information indicates that the types of the sample texts are positive samples;
Converting the negative samples in the negative sample set into a second multidimensional vector, wherein the second multidimensional vector comprises second class information and semantic information of the negative samples, and the second class information indicates that the class of the sample text is the negative sample;
Inputting the first multidimensional vector and the second multidimensional vector into the preprocessing model to obtain a set of prediction probabilities, wherein the set of prediction probabilities comprises a prediction probability representing that the class of the positive sample set belongs to the first class information and the second class information, and a prediction probability representing that the class of the negative sample in the negative sample set belongs to the first class information and the second class information;
determining an objective loss function based on the set of prediction probabilities, the first class information of the positive samples, and the second class information of the negative samples;
and stopping training under the condition that the value of the target loss function meets the convergence condition, so as to obtain the target preprocessing model.
11. The method according to any one of claims 1 to 8, wherein the obtaining an initial sample text set from application data of a target application comprises:
And determining a historical text set, a preconfigured original text set and an associated text set generated by the target application in a historical running period as the initial sample text set, wherein the original text set comprises texts predefined in a configuration table of the target application, and the associated text set is a text set generated in a running process of other applications with the same functions and categories as the target application.
12. A text conversion preprocessing device, comprising:
a first acquisition unit, configured to acquire an initial sample text set from application data of a target application, wherein each pair of sample texts in the initial sample text set comprises a source sample text in a source language and a target sample text in a target language, and the target sample text is obtained by content conversion based on the text content of the source sample text;
a first processing unit, configured to determine, from the initial sample text set, a first set of sample texts that do not satisfy a symbol configuration condition, wherein the symbol configuration condition indicates a format requirement for symbols included in the text content of a sample text;
a second processing unit, configured to determine, from the initial sample text set, a second set of sample texts whose text similarity is greater than a preset threshold, wherein the second set of sample texts comprises a first subset of sample texts and a second subset of sample texts, the text similarity between an ith sample text in the first subset and a jth sample text in the second subset is greater than the preset threshold, the text similarity indicates the content similarity between the ith sample text and the jth sample text, and i and j are positive integers greater than or equal to 1;
a third processing unit, configured to remove the first set of sample texts and the first subset of sample texts in the second set of sample texts from the initial sample text set to obtain a third set of sample texts;
a fourth processing unit, configured to determine the third set of sample texts as a positive sample set, and determine the first set of sample texts and the first subset of sample texts as a negative sample set;
and a training unit, configured to perform training by using the positive sample set and the negative sample set to obtain a text conversion model for converting a source sample text in the source language into a target sample text in the target language.
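(Illustrative sketch, not part of the claims.) The units of claim 12 chain into a small pipeline: filter out pairs that violate the symbol configuration condition, mark the later member of each near-duplicate pair, and split the survivors into positive and negative sets. In the sketch below the placeholder regex, the character-level similarity from difflib, and the 0.9 threshold are all assumptions.

```python
# Illustrative sketch of the claim-12 pipeline; not the patented implementation.
# The placeholder regex, the similarity measure, and the 0.9 threshold are
# assumptions chosen for demonstration.
import re
from difflib import SequenceMatcher

PLACEHOLDER = re.compile(r"%\w|\{\w+\}")  # assumed symbol format (e.g. %s, {name})

def satisfies_symbol_condition(src: str, tgt: str) -> bool:
    """Symbol configuration condition: source and target must carry the same
    placeholder symbols (an assumed reading of the format requirement)."""
    return sorted(PLACEHOLDER.findall(src)) == sorted(PLACEHOLDER.findall(tgt))

def similarity(a: str, b: str) -> float:
    """Stand-in text similarity; a real system might use embedding cosine."""
    return SequenceMatcher(None, a, b).ratio()

def split_samples(pairs, threshold=0.9):
    """Split (source, target) pairs into positive and negative sample sets."""
    # First set: pairs violating the symbol configuration condition.
    first_set = [p for p in pairs if not satisfies_symbol_condition(*p)]
    rest = [p for p in pairs if satisfies_symbol_condition(*p)]
    # First subset: near-duplicates of an already-kept pair; the kept
    # counterpart plays the role of the second subset and stays positive.
    first_subset, kept = [], []
    for p in rest:
        if any(similarity(p[0], q[0]) > threshold for q in kept):
            first_subset.append(p)
        else:
            kept.append(p)
    positives = kept                      # the third set of sample texts
    negatives = first_set + first_subset  # negative sample set
    return positives, negatives
```

With pairs such as ("Gain %s gold", "Gagnez %s pieces"), a pair whose target drops the %s placeholder lands in the negative set, as does the later of two nearly identical source texts.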
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program, wherein the program, when executed by a terminal device or a computer, performs the method of any one of claims 1 to 11.
14. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 11.
15. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to execute the method according to any one of claims 1 to 11 by means of the computer program.
CN202410387525.9A 2024-04-01 2024-04-01 Text conversion preprocessing method and device, storage medium and electronic equipment Active CN117973402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410387525.9A CN117973402B (en) 2024-04-01 2024-04-01 Text conversion preprocessing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN117973402A (en) 2024-05-03
CN117973402B (en) 2024-06-11

Family

ID=90865067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410387525.9A Active CN117973402B (en) 2024-04-01 2024-04-01 Text conversion preprocessing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117973402B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143844A1 (en) * 2019-01-10 2020-07-16 深圳Tcl新技术有限公司 Intent analysis method and apparatus, display terminal, and computer readable storage medium
CN115630639A (en) * 2022-11-04 2023-01-20 平安银行股份有限公司 Keyword extraction method and device, computer equipment and storage medium
CN116595991A (en) * 2023-06-12 2023-08-15 中国联合网络通信集团有限公司 Text matching model training method, matching method, equipment and storage medium
CN117252217A (en) * 2023-09-22 2023-12-19 腾讯科技(深圳)有限公司 Verification method and related device for translation text


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimization of expert *** question similarity computation at the word-sense level; Qiao Meng; Liu Huijun; Liang Guanghui; Journal of Information Engineering University; 2018-08-15 (04); pp. 67-72 *

Also Published As

Publication number Publication date
CN117973402A (en) 2024-05-03

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN108287858B (en) Semantic extraction method and device for natural language
CN110543574B (en) Knowledge graph construction method, device, equipment and medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN110334209B (en) Text classification method, device, medium and electronic equipment
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
US11144723B2 (en) Method, device, and program for text classification
CN113961685A (en) Information extraction method and device
CN113806486B (en) Method and device for calculating long text similarity, storage medium and electronic device
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN112699232A (en) Text label extraction method, device, equipment and storage medium
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN111177375A (en) Electronic document classification method and device
CN110866408B (en) Database creation device and search system
TW201335776A (en) Dictionary generation device, dictionary generation method, dictionary generation program, and computer readable recording medium memorizing the program
CN109753646B (en) Article attribute identification method and electronic equipment
CN106294689B (en) A kind of method and apparatus for selecting to carry out dimensionality reduction based on text category feature
CN108475265B (en) Method and device for acquiring unknown words
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
CN117973402B (en) Text conversion preprocessing method and device, storage medium and electronic equipment
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN111368553A (en) Intelligent word cloud picture data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant