CN111368566B - Text processing method, text processing device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN111368566B
CN111368566B
Authority
CN
China
Prior art keywords
parallel corpus
target
translation
parallel
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010198468.1A
Other languages
Chinese (zh)
Other versions
CN111368566A (en)
Inventor
徐晨灿
袁宁
宫晨
石建勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd (ICBC)
Original Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202010198468.1A
Publication of CN111368566A
Application granted
Publication of CN111368566B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The present disclosure provides a text processing method, the method comprising: obtaining a mixed parallel corpus and a target parallel corpus; training a predetermined model with the mixed parallel corpus and the target parallel corpus as training samples to obtain a first translation model; and taking a text to be processed as the input of the first translation model to obtain a translated text for the text to be processed. The target parallel corpus is a parallel corpus for a target domain and includes parallel corpora obtained by screening with a second translation model, where the second translation model is trained using the mixed parallel corpus as training samples. The present disclosure also provides a text processing apparatus, an electronic device, and a computer-readable storage medium.

Description

Text processing method, text processing device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the fields of text translation and system monitoring, and more particularly to a text processing method, apparatus, electronic device, and readable storage medium.
Background
With the development of electronic technology, language processing based on machine learning models has developed rapidly as a way to improve processing efficiency and reduce labor cost. Machine translation is an important branch of such language processing.
In the process of implementing the disclosed concept, the inventors found at least the following technical problem in the prior art: besides everyday spoken-language translation, machine translation can be applied to various specialized fields. When it is so applied, a large amount of parallel corpus is usually required as prior knowledge to train a machine learning model. However, given the specialization of each field and the general-purpose way parallel corpora are collected, the corpus used to train a model is often of uneven quality, with low-quality sentence pairs mixed in among genuine ones, so the accuracy of the trained model is hard to guarantee, which in turn affects the accuracy of the translated text.
Disclosure of Invention
In view of this, the present disclosure provides a text processing method, apparatus, electronic device, and computer-readable storage medium capable of improving translation accuracy.
One aspect of the present disclosure provides a text processing method, the method including: obtaining a mixed parallel corpus and a target parallel corpus; training a predetermined model with the mixed parallel corpus and the target parallel corpus as training samples to obtain a first translation model; and taking a text to be processed as the input of the first translation model to obtain a translated text for the text to be processed. The target parallel corpus is a parallel corpus for a target domain and includes parallel corpora obtained by screening with a second translation model; the second translation model is trained using the mixed parallel corpus as training samples.
According to an embodiment of the present disclosure, obtaining the target parallel corpus includes: obtaining a plurality of parallel corpora for the target domain; determining, with the second translation model, the parallel corpora satisfying a first condition among the plurality of parallel corpora; and obtaining the target parallel corpus from the parallel corpora satisfying the first condition.
According to an embodiment of the present disclosure, determining the parallel corpora satisfying the first condition among the plurality of parallel corpora includes performing the following operations for a first parallel corpus of the plurality of parallel corpora: taking a source sentence included in the first parallel corpus as the input of the second translation model and outputting a predicted translation sentence corresponding to the first parallel corpus; and determining whether the first parallel corpus satisfies the first condition according to the source sentence included in the first parallel corpus, the translation sentence included in the first parallel corpus, and the predicted translation sentence corresponding to the first parallel corpus. The first parallel corpus is any one of the plurality of parallel corpora.
According to an embodiment of the present disclosure, determining whether the first parallel corpus satisfies the first condition includes: determining a plurality of target word strings in the predicted translation sentence corresponding to the first parallel corpus, where each target word string consists of a plurality of consecutive first words in the predicted translation sentence and the proportion of target words among those first words is not less than a predetermined proportion; determining the longest of the plurality of target word strings as the maximum target word string; determining a first proportion, namely the share of target words in the maximum target word string that belong to the translation sentence included in the first parallel corpus; and determining that the first parallel corpus satisfies the first condition if the first proportion is not less than a first predetermined proportion. The target words include words in a predetermined vocabulary and words in the translation sentences included in the plurality of parallel corpora.
According to an embodiment of the present disclosure, obtaining the target parallel corpus from the parallel corpora satisfying the first condition includes performing the following operations for a second parallel corpus among the parallel corpora satisfying the first condition: determining, according to the maximum target word string of the predicted translation sentence corresponding to the second parallel corpus, at least one clause satisfying a second condition in the translation sentence included in the second parallel corpus; determining, in the source sentence included in the second parallel corpus, at least one second word matching the maximum target word string corresponding to the second parallel corpus; and concatenating the at least one second word to obtain a target source sentence and concatenating the at least one clause to obtain a target translation sentence. The target parallel corpus obtained from the second parallel corpus includes the target source sentence and the target translation sentence, and the second parallel corpus is any one of the parallel corpora satisfying the first condition.
According to an embodiment of the present disclosure, determining the at least one clause satisfying the second condition in the translation sentence included in the second parallel corpus includes: performing clause segmentation on the translation sentence included in the second parallel corpus to obtain a plurality of clauses; determining, for each of the plurality of clauses, a second proportion of its target words that occur in the maximum target word string corresponding to the second parallel corpus, obtaining a plurality of second proportions; determining the second proportions that are not smaller than a second predetermined proportion as target proportions; and determining the clauses corresponding to the target proportions as the at least one clause satisfying the second condition. In the case where all the second proportions are smaller than the second predetermined proportion, the clause corresponding to the largest of the plurality of second proportions is determined as the at least one clause satisfying the second condition.
According to an embodiment of the present disclosure, the text processing method further includes, before training the predetermined model to obtain the first translation model, performing the following operations for the source sentence and the translation sentence included in any one parallel corpus of the mixed parallel corpus and the target parallel corpus: performing word segmentation on the source sentence and the translation sentence to obtain a first word sequence for the source sentence and a second word sequence for the translation sentence; converting the first word sequence into a first number sequence and the second word sequence into a second number sequence according to a predetermined vocabulary; and obtaining, by a word embedding technique, a first word vector for the first word sequence from the first number sequence and a second word vector for the second word sequence from the second number sequence. The predetermined vocabulary includes correspondences between a plurality of words and numbers for those words, the words being extracted from the mixed parallel corpus and the target parallel corpus; the first translation model is trained from the first word vectors and the second word vectors.
According to an embodiment of the present disclosure, the first translation model includes a first sub-model and a second sub-model, and obtaining the translated text for the text to be processed includes: taking the text to be processed as the input of the first sub-model to obtain a semantic vector for the text to be processed; and taking the semantic vector as the input of the second sub-model to obtain the translated text for the text to be processed. The second sub-model includes a long short-term memory network model, and the second number sequence includes a sentence-start number and a sentence-end number.
Another aspect of the present disclosure provides a text processing apparatus, the apparatus comprising: a corpus obtaining module for obtaining a mixed parallel corpus and a target parallel corpus; a model training module for training a predetermined model with the mixed parallel corpus and the target parallel corpus as training samples to obtain a first translation model; and a translated text obtaining module for taking a text to be processed as the input of the first translation model to obtain a translated text for the text to be processed. The target parallel corpus is a parallel corpus for a target domain and includes parallel corpora obtained by screening with a second translation model; the second translation model is trained using the mixed parallel corpus as training samples.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the aforementioned text processing method.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, cause a processor to perform the foregoing text processing method.
According to the embodiments of the present disclosure, the problem of low translation accuracy when machine translation is applied to specialized fields can be at least partially solved. By combining the mixed parallel corpus and the target parallel corpus as training samples, and by obtaining the target parallel corpus through screening with the second translation model, the quality and completeness of the training samples can be at least partially improved, thereby at least partially improving the accuracy of the trained first translation model and of the translated text it produces.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a text processing method, apparatus, electronic device, and readable storage medium according to an embodiment of the disclosure;
FIG. 2 schematically illustrates a flowchart of a text processing method according to an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flowchart of a text processing method according to a second exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for obtaining translated text for text to be processed in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart for obtaining a target parallel corpus in accordance with an embodiment of the disclosure;
FIG. 6 schematically illustrates a flowchart of determining whether each of a plurality of parallel corpora satisfies a first condition, according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flowchart of determining whether a first parallel corpus satisfies a first condition, according to an embodiment of the disclosure;
FIG. 8 schematically illustrates a flow chart for obtaining a target parallel corpus from parallel corpora each satisfying a first condition according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of determining at least one clause of a translation sentence included in a second parallel corpus that satisfies a second condition, according to an embodiment of the disclosure;
FIG. 10 schematically shows a block diagram of a text processing apparatus according to an embodiment of the present disclosure; and
FIG. 11 schematically illustrates a block diagram of an electronic device adapted to perform a text processing method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together).
The embodiments of the present disclosure provide a text processing method including the following steps. First, a mixed parallel corpus and a target parallel corpus are obtained. Then, a predetermined model is trained with the mixed parallel corpus and the target parallel corpus as training samples to obtain a first translation model. Finally, a text to be processed is taken as the input of the first translation model to obtain a translated text for the text to be processed. The target parallel corpus is a parallel corpus for a target domain and includes parallel corpora obtained by screening with a second translation model; the second translation model is trained using the mixed parallel corpus as training samples.
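For orientation only, the following Python sketch outlines the flow just described; it is not part of the disclosure, and the helper names train_seq2seq, screen_corpus, and translate are hypothetical placeholders.
```python
# Sketch of the disclosed flow. All helper names (train_seq2seq,
# screen_corpus, translate) are hypothetical, not named in the patent.

def build_first_translation_model(mixed_corpus, raw_target_corpus):
    # Train the second translation model on the mixed corpus only.
    second_model = train_seq2seq(mixed_corpus)
    # Screen the raw in-domain corpus with the second model
    # to obtain the target parallel corpus.
    target_corpus = screen_corpus(second_model, raw_target_corpus)
    # Train the first translation model on mixed + screened corpora.
    return train_seq2seq(mixed_corpus + target_corpus)

def process(first_model, text_to_process):
    # Use the first translation model to translate the input text.
    return translate(first_model, text_to_process)
```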
Fig. 1 schematically illustrates an application scenario 100 of a text processing method, an apparatus, an electronic device, and a readable storage medium according to an embodiment of the present disclosure. It should be noted that fig. 1 illustrates only an example of an application scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but it does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments, or scenarios.
As shown in fig. 1, the application scenario 100 may include, for example, terminal devices 111, 112, 113, a network 120, and a database 130. The network 120 is the medium used to provide communication links between the terminal devices 111, 112, 113 and the database 130. The network 120 may include various connection types, such as wired, wireless communication links, and the like.
The terminal devices 111, 112, 113 may obtain parallel corpora from the database 130 through the network 120, for example, so as to train a predetermined model on those parallel corpora and obtain a translation model. A parallel corpus refers to a pair of sentences in two different languages expressing the same meaning, so each parallel corpus comprises a source sentence and a translation sentence.
The terminal devices 111, 112, 113 may be installed with various client applications, for example, and may include, for example only, a monitoring class application, a web browser application, a search class application, a mailbox client, and the like. The terminal devices 111, 112, 113 may be, for example, various electronic devices with display screens, including but not limited to smartphones, tablet computers, laptop and desktop computers, etc., and the terminal devices 111, 112, 113 may present translation results obtained according to a translation model.
The monitoring class application may be, for example, an application for processing the results obtained when monitoring an information system. Such monitoring may, for example, use alerting. Alert monitoring can generate unknown alerts that the information system cannot identify, and a large number of unknown alerts confuses monitoring personnel: their meaning is often difficult to understand, so their impact on the information system cannot be judged and they cannot be handled, and they are often expressed in a non-Chinese language. To understand the meaning of an unknown alert, the monitoring class application's processing of the monitored results may therefore include, for example, translating the content of the unknown alert into a language familiar to the monitoring personnel. This addresses the technical problem that, as the number of alerts keeps growing, manual translation becomes increasingly infeasible and a large number of unknown alerts are not translated in time, affecting production safety.
According to an embodiment of the present disclosure, as shown in fig. 1, the application scenario 100 may further comprise a server 140, for example, the server 140 may communicate with the terminal devices 111, 112, 113 through the network 120. The server 140 may, for example, have the information system and the tool for monitoring the information system running therein, and the monitoring class application includes a client application matched with the tool for monitoring the information system. The tool for monitoring the information system may feed back alarm information to the monitoring class application via the network 120.
It should be noted that, the text processing method provided in the embodiments of the present disclosure may be generally executed by the terminal devices 111, 112, 113. Accordingly, the text processing apparatus provided by the embodiments of the present disclosure may be generally disposed in the terminal devices 111, 112, 113.
It should be understood that the number and types of terminal devices, networks, databases, and servers in fig. 1 are illustrative only. There may be any number and type of terminal devices, networks, databases, and servers, as desired for implementation.
The text processing method provided by the present disclosure will be described in detail below with reference to the application scenario of fig. 1, and fig. 2 to 9.
Fig. 2 schematically illustrates a flowchart of a text processing method according to an exemplary embodiment of the present disclosure.
As shown in fig. 2, the text processing method of the embodiment of the present disclosure may include, for example, operations S210 to S230.
In operation S210, a mixed parallel corpus and a target parallel corpus are obtained. According to an embodiment of the present disclosure, the operation S210 may be, for example, obtaining, by the terminal device, the mixed parallel corpus and the target parallel corpus from a database.
The mixed parallel corpus may refer to, for example, parallel corpora obtained from multiple fields. It may include existing parallel corpora, for example parallel corpora extracted from news, subtitles, United Nations documents, and Wikipedia.
The target parallel corpus is the parallel corpus for the target domain. The target domain may include, for example, the monitoring field or the materials field described in FIG. 1, the aerospace field, and the like. The source sentences and translation sentences in the target parallel corpus include specialized vocabulary of the target domain.
According to the embodiments of the present disclosure, in order to improve the stability of the trained translation model, after the parallel corpora of the target domain are obtained from the database, they may be screened to retain the parallel corpora whose source text and translated text match each other well. The degree of matching between the source text and the translated text of a parallel corpus may be assessed, for example, with a translation model trained using the mixed parallel corpus as training samples. Accordingly, the target parallel corpus obtained in operation S210 may include parallel corpora obtained by screening with a second translation model, the second translation model being a model trained using the mixed parallel corpus as training samples.
According to the embodiment of the present disclosure, the operation of screening the parallel corpus in the target domain to obtain the target parallel corpus may be detailed in the flow described in the following fig. 5 to 9, which is not described in detail herein.
In operation S220, a predetermined model is trained to obtain a first translation model by using the mixed parallel corpus and the target parallel corpus as training samples.
According to an embodiment of the present disclosure, the predetermined model may include, for example, a sequence-to-sequence model, which may be formed by, for example, two recurrent neural networks in series. In two recurrent neural networks in series: the former recurrent neural network is used for encoding the sentence into a semantic vector, and the latter recurrent neural network is used for decoding the semantic vector into a translation sentence corresponding to the encoded sentence.
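A minimal PyTorch sketch of such a sequence-to-sequence model is given below for illustration only; the use of LSTM cells is consistent with the embodiments described later, but the layer sizes are assumptions, not values fixed by the disclosure.
```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Sketch of the predetermined model: two recurrent networks in
    series, an encoder followed by a decoder. Hyperparameters are
    illustrative assumptions."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a semantic vector (h, c).
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the semantic vector into the translated sentence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)
```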
According to an embodiment of the present disclosure, in the training sample, the ratio of the mixed parallel corpus and the target parallel corpus may be set according to actual requirements. For example, if it is desired to make the first translation model more focused on the specialized representation of the target domain, the proportion of the target parallel corpus may be larger. If it is desired to make the first translation model more focused on the accuracy of the translation, the proportion of the mixed parallel corpus may be larger. In an embodiment, the ratio of the mixed parallel corpus to the target parallel corpus may be equal, for example.
According to an embodiment of the present disclosure, the aforementioned second translation model may be, for example, a translation model trained based on the predetermined model in a similar manner to operation S220. In the training process, for example, a gradient descent algorithm or the like can be adopted to adjust and optimize parameters of the predetermined model, so as to obtain a first translation model and/or a second translation model. The first translation model and the second translation model differ in that: in training, the training sample of the first translation model comprises the target parallel corpus, and the training sample of the second translation model does not comprise the target parallel corpus.
In order to facilitate training of the predetermined model, according to embodiments of the present disclosure, for example, an operation of preprocessing each parallel corpus in the training samples may be further included to convert the training samples into inputs of the predetermined model and facilitate subsequent adjustment of parameters of the predetermined model. The preprocessing operation performed on each parallel corpus may be implemented, for example, by the flow described in fig. 3, which is not described in detail herein.
In operation S230, a translation text for the text to be processed is obtained with the text to be processed as an input of the first translation model.
After training to obtain a first translation model, translating the text to be processed by adopting the first translation model to obtain a translation text. The text to be processed may be, for example, unknown alert text in the application scenario described in fig. 1. Similarly, to facilitate the input of the first translation model, for example, the text to be processed may also be preprocessed, the text to be processed may be converted into a word vector, and then the word vector may be input into the first translation model. The output of the first translation model may be, for example, a word vector for the translation text, and the word vector for the translation text is subjected to a process opposite to the preprocessing to obtain the translation text of the text to be processed.
According to an embodiment of the present disclosure, preprocessing the text to be processed may include: first performing word segmentation on the text to be processed to obtain a word sequence; then mapping each word in the word sequence to a word number using a predefined vocabulary; and finally projecting the resulting word numbers into a word-vector space according to the vocabulary to obtain word vectors.
In order to make the displayed translation result more representative of the alert information in the monitoring field, according to the embodiments of the present disclosure, after the translated text is obtained in operation S230, a significant word string may be selected from the translated text as the translation result. The selection may include, for example: determining a plurality of target word strings in the translated text, and then taking the longest of the plurality of target word strings as the translation result. This selection process is detailed in operations S721 to S722 of FIG. 7 and is not described here.
Fig. 3 schematically illustrates a flowchart of a text processing method according to a second exemplary embodiment of the present disclosure.
In order to facilitate training of the predetermined model, preprocessing is further required for each parallel corpus in the training sample, for example, converting a source sentence and a translation sentence included in each parallel corpus into a word vector. Therefore, as shown in fig. 3, the text processing method according to the embodiment of the present disclosure includes performing operations S340 to S360 for the source sentence and the translation sentence included in any one of the mixed parallel corpus and the target parallel corpus, in addition to operations S210 to S230. The operations S340 to S360 may be performed, for example, before the operation S220.
In operation S340, word segmentation processing is performed on the source sentence and the translated sentence, so as to obtain a first word sequence for the source sentence and a second word sequence for the translated sentence.
According to the embodiments of the present disclosure, if a sentence (source sentence or translation sentence) is expressed in English, it may be segmented directly at punctuation marks; alternatively, a space may first be inserted before each punctuation mark, and the sentence then split on spaces. If the sentence is expressed in Chinese, it may be segmented character by character, so that each Chinese character is one word. Segmenting the source sentence of each parallel corpus yields a plurality of first words, which are combined into a first word sequence; segmenting the translation sentence yields a plurality of second words, which are combined into a second word sequence. It will be appreciated that the foregoing segmentation methods are merely examples to facilitate understanding of the present disclosure, and the disclosure is not limited thereto; the method used can be chosen according to the language of the sentence.
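A minimal sketch of this segmentation rule, for illustration only, assuming plain punctuation-and-space handling for English and per-character splitting for Chinese:
```python
import re

def tokenize(sentence: str, lang: str) -> list[str]:
    """Word segmentation as described above: English split on
    spaces after inserting a space before punctuation; Chinese
    split per character."""
    if lang == "en":
        return re.sub(r"([.,!?;:])", r" \1", sentence).split()
    # Chinese: each character is treated as one word.
    return [ch for ch in sentence if not ch.isspace()]
```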
In operation S350, the first word sequence is converted to a first number sequence and the second word sequence is converted to a second number sequence according to a predetermined vocabulary.
According to an embodiment of the present disclosure, the predetermined vocabulary (the predefined vocabulary mentioned above) includes correspondences between a plurality of words and the numbers for those words. The predetermined vocabulary may include, for example, a first vocabulary for the first language, in which the source sentences are expressed, and a second vocabulary for the second language, in which the translation sentences are expressed. The first vocabulary may include words extracted from the source sentences of the mixed-domain parallel corpora and words extracted from the source sentences of the target-domain parallel corpora. The second vocabulary includes words extracted from the translation sentences of the mixed-domain parallel corpora and from the translation sentences of the target-domain parallel corpora. Operation S350 may include, for example: looking up each first word of the first word sequence in the first vocabulary; if a first word is found, its number becomes one number of the first number sequence; if a first word is not found, a preset number is assigned to it. This yields one number per first word, and these numbers form the first number sequence. By the same method, the second number sequence can be obtained from the second words of the second word sequence and the second vocabulary.
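The lookup with a preset number for unknown words can be sketched as follows; using 0 as the preset unknown-word number is an illustrative assumption, not a value fixed by the disclosure.
```python
UNK_ID = 0  # preset number for out-of-vocabulary words (assumption)

def to_number_sequence(words: list[str], vocab: dict[str, int]) -> list[int]:
    """Look each word up in the vocabulary; words not found get
    the preset number, as described above."""
    return [vocab.get(w, UNK_ID) for w in words]
```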
According to an embodiment of the present disclosure, a method for constructing the first vocabulary and/or the second vocabulary may, for example, comprise: selecting a vocabulary size V, extracting the V most frequent words from the source sentences and/or translation sentences of the plurality of parallel corpora (the mixed parallel corpus and the parallel corpora of the target domain) into the first vocabulary and/or the second vocabulary, and numbering the V words so that each word has a unique number.
According to the embodiments of the present disclosure, when constructing the vocabulary, a general vocabulary may first be constructed from the mixed parallel corpus by the construction method above, and the V' most frequent words of the screened target parallel corpus may then be added to the general vocabulary to obtain the final predetermined vocabulary.
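A sketch of this two-stage construction, for illustration only; `base` plays the role of the general vocabulary, `v` the size V (or V'), and reserving number 0 for unknown words matches the assumption in the earlier sketch.
```python
from collections import Counter

def build_vocab(sentences, v: int,
                base: dict[str, int] | None = None) -> dict[str, int]:
    """Take the v most frequent words and assign unique numbers.
    Passing an existing general vocabulary as `base` extends it
    with in-domain words, per the two-stage construction above."""
    vocab = dict(base) if base else {}
    counts = Counter(w for s in sentences for w in s)
    for word, _ in counts.most_common(v):
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # 0 reserved for unknown words
    return vocab
```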
According to an embodiment of the present disclosure, in order to facilitate the first translation model to be able to identify the start and end of the translation, when the second number sequence is obtained, a number indicating the start of the sentence may be added to the head of the second number sequence, and a number indicating the end of the sentence may be added to the tail of the second number sequence.
In operation S360, a word embedding technique is used to obtain a first word vector for a first word sequence according to a first numbered sequence, and a second word vector for a second word sequence according to a second numbered sequence.
According to an embodiment of the present disclosure, the word embedding technique may be implemented, for example, by the Word2Vec (word to vector) method, which represents each word in a text by a fixed-length vector. Alternatively, word embedding may be implemented with the GloVe (Global Vectors for Word Representation) tool, a word-representation tool based on global word-frequency statistics that expresses a word as a vector of real numbers capturing some semantic features between words.
According to an embodiment of the present disclosure, in operation S360, the first and second number sequences may first be mapped back to the words of the predetermined vocabulary, and the corresponding words may then be converted into word vectors by a word embedding technique.
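For illustration, word vectors could be obtained with gensim's Word2Vec as sketched below; the toy corpus, vector size, and window are assumptions, not values from the disclosure.
```python
from gensim.models import Word2Vec

# Toy corpus of tokenized sentences; in practice this would be the
# mixed and target parallel corpora after word segmentation.
tokenized_corpus = [["server", "thread", "congested"],
                    ["illegal", "request", "detected"]]

# gensim 4.x API; vector_size/window/min_count are illustrative.
w2v = Word2Vec(sentences=tokenized_corpus, vector_size=256,
               window=5, min_count=1)
vector = w2v.wv["server"]  # fixed-length vector for one word
```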
After the first word vectors and the second word vectors are obtained, operation S220 may, for example, train the first translation model from the first word vectors and the second word vectors. Specifically, training the predetermined model may include: sequentially inputting the first word vectors, obtained from the source sentences of the parallel corpora obtained in operation S210, into the predetermined model, and then adjusting and optimizing the parameters of the predetermined model by a gradient descent algorithm according to the output vectors of the predetermined model and the translation sentences corresponding to the source sentences, until the similarity between the output vectors and the translation sentences is not less than a first predetermined similarity (for example, 0.9). The final optimized model is the first translation model.
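A training-loop sketch in PyTorch, reusing the Seq2Seq sketch above, for illustration only; Adam is shown as one common gradient-descent-style optimizer, and the 0.9 similarity stopping rule from the text is approximated here by a fixed epoch budget.
```python
import torch
import torch.nn as nn

def train(model, batches, epochs=10, lr=1e-3):
    """Gradient-descent optimization sketch. `batches` yields pairs
    of (src_ids, tgt_ids) LongTensors; target sequences include the
    sentence-start and sentence-end numbers described above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for src_ids, tgt_ids in batches:
            # Teacher forcing: feed tgt[:-1], predict tgt[1:].
            logits = model(src_ids, tgt_ids[:, :-1])
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tgt_ids[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```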
Fig. 4 schematically illustrates a flow diagram for obtaining translated text for text to be processed according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the first translation model may include, for example, a first sub-model and a second sub-model. As shown in fig. 4, with the text to be processed as an input of the first translation model, operation S230 of obtaining the translated text for the text to be processed may include, for example, operations S431 to S432.
In operation S431, a semantic vector for the text to be processed is obtained with the text to be processed as an input of the first submodel.
According to an embodiment of the present disclosure, operation S431 may include, for example: taking the word vector obtained by preprocessing the text to be processed as the input of the first sub-model, which outputs the semantic vector. The first sub-model may be, for example, a recurrent neural network model. Specifically, operation S431 may include: taking the first vector value of the word vector as the input of the first sub-model's first cycle to obtain a first output vector; then taking the first output vector and the second vector value of the word vector as the input of the second cycle to obtain a second output vector; and so on, until the last vector value of the word vector has been input into the first sub-model, at which point the hidden-layer output is taken as the semantic vector. The first sub-model may be, for example, a Long Short-Term Memory (LSTM) network for encoding word vectors, and the final semantic vector incorporates the inputs from all the cycles.
In operation S432, a translation text for the text to be processed is obtained with the semantic vector as an input to the second sub-model.
According to embodiments of the present disclosure, the second sub-model may be, for example, a long short-term memory network with an attention mechanism. Operation S432 may include, for example: the semantic vector and a sentence-start identifier (e.g., "<s>") are taken as the input of the second sub-model's first cycle, and the second sub-model outputs the word vector of the first word of the translated text. The word vector of the first word is then taken as the input of the second cycle, which outputs the word vector of the second word of the translated text; the word vector of the second word is taken as the input of the third cycle, which outputs the word vector of the third word; and so on, until the second sub-model outputs a sentence-end identifier (e.g., "</s>") or the output sentence reaches the set sentence length. The set sentence length may be, for example, twice the sentence length of the text to be processed, and may be set according to actual requirements.
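A greedy decoding loop matching this description is sketched below against the Seq2Seq sketch above, for illustration only; the attention mechanism mentioned in the text is omitted for brevity, so this is a simplification, not the disclosed second sub-model itself.
```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len):
    """Feed the sentence-start number first, feed each output word
    back in, and stop at the sentence-end number or at the set
    sentence length (max_len)."""
    _, state = model.encoder(model.src_emb(src_ids))
    word = torch.tensor([[bos_id]])
    out_ids = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_emb(word), state)
        word = model.out(dec_out).argmax(-1)
        if word.item() == eos_id:  # sentence-end identifier reached
            break
        out_ids.append(word.item())
    return out_ids
```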
The flow of obtaining the target parallel corpus according to the embodiment of the present disclosure will be described in detail below with reference to fig. 5 to 9.
Fig. 5 schematically illustrates a flowchart of obtaining a target parallel corpus according to an embodiment of the disclosure.
As shown in fig. 5, operation S210 of obtaining the target parallel corpus in the embodiment of the present disclosure may include, for example, operations S511 to S513.
In operation S511, a plurality of parallel corpora for the target domain are obtained, for example by acquiring them from the database 130. To facilitate this, the parallel corpora in the database may carry a domain label or index characterizing the domain to which each parallel corpus belongs; the plurality of parallel corpora of the target domain are then acquired according to the label for the target domain.
In operation S512, a second translation model is used to determine parallel corpora satisfying the first condition among the plurality of parallel corpora. In operation S513, a target parallel corpus is obtained from parallel corpora that satisfy the first condition.
According to an embodiment of the present disclosure, the first condition may be, for example, that the matching degree of the source sentence and the translation sentence is higher than a predetermined matching degree, or that the number of words included in a predetermined vocabulary in the source sentence and the translation sentence is not less than a predetermined number, or the like. Operation S512 may be implemented by, for example, the flow described in fig. 6.
Fig. 6 schematically illustrates a flowchart of determining whether each of a plurality of parallel corpora satisfies a first condition according to an embodiment of the present disclosure.
As shown in fig. 6, the operation S512 of determining the parallel corpus satisfying the first condition among the plurality of parallel corpora may include, for example: operations S621 to S622 are performed with respect to a first parallel corpus among the plurality of parallel corpora. The first parallel corpus is any one of a plurality of parallel corpora.
In operation S621, a source sentence included in the first parallel corpus is used as an input of the second translation model, and a predicted translation sentence corresponding to the first parallel corpus is output. According to the embodiment of the present disclosure, the implementation manner of the operation S621 is similar to the process of obtaining the translation text of the text to be processed in the foregoing operation S230, and will not be repeated herein.
In operation S622, it is determined whether the first parallel corpus satisfies a first condition according to the source sentence included in the first parallel corpus, the translation sentence included in the first parallel corpus, and the predictive translation sentence corresponding to the first parallel corpus.
According to an embodiment of the present disclosure, this operation S622 may include, for example: firstly, determining the similarity between the translation sentence included in the first parallel corpus and the predicted translation sentence obtained in operation S621, and if the similarity is not less than a second predetermined similarity (for example, 0.6), determining that the first parallel corpus to which the translation sentence belongs meets a first condition.
This operation S622 may also be implemented by the flow described in fig. 7, for example, according to an embodiment of the present disclosure, which is not described in detail herein.
Fig. 7 schematically illustrates a flowchart of determining whether a first parallel corpus satisfies a first condition according to an embodiment of the disclosure.
As shown in fig. 7, operation S622 of determining whether the first parallel corpus satisfies the first condition may include, for example, operations S721 to S726.
In operation S721, a plurality of target word strings in a predictive translation sentence corresponding to a first parallel corpus are determined.
Each of the plurality of target word strings consists of a plurality of consecutive first words in the predicted translation sentence, and the proportion of target words among those first words is not less than a predetermined proportion. The target words may include, for example, words in the predetermined vocabulary and words in the translation sentences included in the plurality of parallel corpora.
According to an embodiment of the present disclosure, operation S721 may include, for example: first performing word segmentation on the predicted translation sentence to obtain a word sequence; then scanning the words of the sequence in order for runs of consecutive words in which the proportion of target words is not less than the predetermined proportion, each such run being a target word string; and continuing until the last word of the predicted translation sentence has been scanned, so that a plurality of target word strings are obtained in total. For example, if the predicted translation sentence is "the thread of the first server is congested and the second server has an illegal request", the word sequence "first", "server", "thread", "congestion", "second", "server", "presence", "illegal", "request" is obtained first, and candidate word strings such as "first server", "first server thread congestion", ..., "server thread congestion", ..., "illegal request" are then enumerated. The proportion of target words in each candidate string is determined and compared with the predetermined proportion to obtain the target word strings, which may include, for example, "thread congestion of the first server" and "thread congestion, the second server presence".
In operation S722, the target word string with the largest length among the plurality of target word strings is determined as the maximum target word string. Operation S722 may include, for example: determining the target word string comprising the most characters among the plurality of target word strings as the maximum target word string. Following the example above, the determined maximum target word string may be, for example, "thread congestion, the second server presence".
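Operations S721 and S722 together can be sketched as follows, for illustration only; `min_ratio` stands in for the predetermined proportion, and 0.8 is an illustrative value not fixed by the disclosure.
```python
def max_target_word_string(pred_words, target_words, min_ratio=0.8):
    """Find contiguous spans of the predicted translation whose
    proportion of target words is at least min_ratio, and return
    the longest such span (the maximum target word string)."""
    best = []
    n = len(pred_words)
    for i in range(n):
        hits = 0
        for j in range(i, n):
            hits += pred_words[j] in target_words
            span_len = j - i + 1
            if hits / span_len >= min_ratio and span_len > len(best):
                best = pred_words[i:j + 1]
    return best
```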
In operation S723, the first proportion is determined: the share of the words of the maximum target word string that belong to the translation sentence included in the first parallel corpus. Operation S723 may include, for example: determining whether each word of the maximum target word string appears in the translation sentence included in the first parallel corpus; determining the at least one word that does appear; and dividing the number of those words by the total number of words in the maximum target word string to obtain the first proportion.
In operation S724, it is determined whether the first ratio is smaller than a first predetermined ratio.
According to an embodiment of the present disclosure, the first predetermined proportion may be, for example, 0.6. If the first proportion is not less than the first predetermined proportion, operation S725 is performed: the first parallel corpus is determined to satisfy the first condition. If the first proportion is smaller than the first predetermined proportion, operation S726 is performed: the first parallel corpus is determined not to satisfy the first condition and is removed from the plurality of parallel corpora.
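A sketch of the first-proportion test of operations S723 to S725, for illustration only, using the example threshold 0.6 from the text; `ref_words` is assumed to be the set of words of the translation sentence included in the first parallel corpus.
```python
def satisfies_first_condition(max_string, ref_words, first_ratio=0.6):
    """First proportion: share of words in the maximum target word
    string that also appear in the reference translation sentence."""
    if not max_string:
        return False
    hits = sum(w in ref_words for w in max_string)
    return hits / len(max_string) >= first_ratio
```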
Fig. 8 schematically illustrates a flowchart for obtaining a target parallel corpus from parallel corpora each satisfying a first condition according to an embodiment of the present disclosure.
As shown in fig. 8, the operation S513 of obtaining the target parallel corpus according to the parallel corpus satisfying the first condition may include, for example: operations S831 to S833 are performed with respect to a second parallel corpus among the parallel corpora satisfying the first condition. The second parallel corpus is any one of the parallel corpora meeting the first condition.
In operation S831, at least one clause satisfying the second condition in the translation sentence included in the second parallel corpus is determined according to the maximum target word string of the predicted translation sentence corresponding to the second parallel corpus.
According to an embodiment of the present disclosure, operation S831 may include, for example: first performing clause segmentation on the translation sentence to obtain a plurality of clauses; then determining the similarity between each of the clauses and the maximum target word string, and determining the clauses whose similarity is greater than a third predetermined similarity (for example, 0.5) as the clauses satisfying the second condition. Operation S831 may also be implemented, for example, by the flow described in FIG. 9, which is not detailed here.
In operation S832, at least one second word that matches a maximum target word string corresponding to the second parallel corpus in a source sentence included in the second parallel corpus is determined.
According to an embodiment of the present disclosure, the operation S832 may include, for example: and determining a second word corresponding to the maximum target word string corresponding to the second parallel corpus in the source sentence according to the corresponding relation between the source sentence and the predictive translation sentence of the second parallel corpus, so as to obtain at least one second word. The correspondence may be determined, for example, by a correspondence of an input word vector and an output word vector in the second translation model.
In operation S833, the at least one second word is concatenated to obtain a target source sentence, and the at least one clause is concatenated to obtain a target translation sentence. The target parallel corpus obtained from the second parallel corpus thus includes the target source sentence and the target translation sentence.
Fig. 9 schematically illustrates a flowchart of determining at least one clause of a translated sentence included in a second parallel corpus that satisfies a second condition, according to an embodiment of the present disclosure.
As shown in fig. 9, the operation S831 including determining at least one clause satisfying the second condition among the translation sentences included in the second parallel corpus may include operations S911 to S916, for example.
In operation S911, the translation sentence included in the second parallel corpus is subjected to sentence segmentation processing, so as to obtain a plurality of clauses. According to an embodiment of the present disclosure, the operation S911 may segment the translation sentence according to punctuation in the translation sentence, for example, thereby obtaining a plurality of clauses.
In operation S912, a second proportion of the target word included in each of the plurality of clauses in the maximum target word string corresponding to the second parallel corpus is determined, so as to obtain a plurality of second proportions.
Operation S912 may include, for example: first determining, by comparison with the predetermined vocabulary, a first number, the number of target words included in each of the plurality of clauses; then determining which of the target words in each clause occur in the maximum target word string, giving a second number; and finally dividing the second number by the first number to obtain the second proportion. Performing these operations for each of the plurality of clauses yields the plurality of second proportions.
In operation S913, it is determined whether the plurality of second proportions are all smaller than the second predetermined proportion.
In the case where all the second proportions are smaller than the second predetermined proportion, operation S914 is performed: the clause corresponding to the largest of the plurality of second proportions is determined as the at least one clause satisfying the second condition.
When at least one of the second proportions is not smaller than the second predetermined proportion, operations S915 to S916 are performed. In operation S915, the second proportions that are not smaller than the second predetermined proportion are determined as target proportions. In operation S916, the clauses corresponding to the target proportions are determined as the at least one clause satisfying the second condition.
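Operations S911 to S916 can be sketched as follows, for illustration only, reusing the tokenize sketch above; the punctuation set and the 0.5 threshold are illustrative assumptions, and `vocab` stands for the set of target words (predetermined-vocabulary words plus words from the corpora's translation sentences).
```python
import re

def select_clauses(translation, max_string, vocab, second_ratio=0.5):
    """Split the translation sentence on punctuation, compute each
    clause's second proportion (share of its target words occurring
    in the maximum target word string), keep the clauses at or above
    the threshold, else fall back to the best single clause."""
    clauses = [c for c in re.split(r"[,.;!?，。；！？]", translation)
               if c.strip()]
    max_set = set(max_string)
    ratios = []
    for clause in clauses:
        tgt = [w for w in tokenize(clause, "zh") if w in vocab]
        hits = sum(w in max_set for w in tgt)
        ratios.append(hits / len(tgt) if tgt else 0.0)
    selected = [c for c, r in zip(clauses, ratios) if r >= second_ratio]
    if not selected:  # all below threshold: keep the best clause
        selected = [clauses[ratios.index(max(ratios))]]
    return selected
```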
Fig. 10 schematically shows a block diagram of a text processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 10, the text processing apparatus 1000 of the embodiment of the present disclosure includes a corpus obtaining module 1010, a model training module 1020, and a translated text obtaining module 1030.
The corpus obtaining module 1010 is configured to obtain the mixed parallel corpus and the target parallel corpus. The target parallel corpus is a parallel corpus for the target domain and includes parallel corpora obtained by screening with the second translation model; the second translation model is trained using the mixed parallel corpus as training samples. The corpus obtaining module 1010 may be used, for example, to perform operation S210 described in FIG. 2, which is not repeated here.
The model training module 1020 is configured to train the predetermined model with the mixed parallel corpus and the target parallel corpus as training samples to obtain the first translation model; it may be used, for example, to perform operation S220 described in FIG. 2, which is not repeated here.
The translated text obtaining module 1030 is configured to take the text to be processed as the input of the first translation model to obtain the translated text for the text to be processed; it may be used, for example, to perform operation S230 described in FIG. 2, which is not repeated here.
According to an embodiment of the present disclosure, the corpus obtaining module 1010 may include, for example, a first obtaining submodule 1011, a determining submodule 1012, and a second obtaining submodule 1013. The first obtaining submodule is used to obtain a plurality of parallel corpora for the target domain. The determining submodule is used to determine, among the plurality of parallel corpora, the parallel corpora satisfying the first condition. The second obtaining submodule is used to obtain the target parallel corpus from the parallel corpora satisfying the first condition. The first obtaining submodule 1011, the determining submodule 1012, and the second obtaining submodule 1013 may be used to perform operations S511 to S513 described in FIG. 5, respectively, which are not repeated here.
According to an embodiment of the present disclosure, the determining submodule 1012 is configured to perform the following operations for a first parallel corpus of the plurality of parallel corpora: taking a source sentence included in the first parallel corpus as an input of the second translation model, and outputting a predicted translation sentence corresponding to the first parallel corpus (operation S621); and determining whether the first parallel corpus satisfies the first condition according to the source sentence included in the first parallel corpus, the translation sentence included in the first parallel corpus, and the predicted translation sentence corresponding to the first parallel corpus (operation S622). The first parallel corpus is any one of the plurality of parallel corpora.
According to an embodiment of the present disclosure, determining whether the first parallel corpus satisfies the first condition includes: determining a plurality of target word strings in the predicted translation sentence corresponding to the first parallel corpus (operation S721). Each of the plurality of target word strings consists of a plurality of first words having consecutive positions in the predicted translation sentence, and the proportion of target words among the plurality of first words is not smaller than a predetermined proportion. The target word string having the largest length among the plurality of target word strings is determined as the maximum target word string (operation S722). A first proportion of the target words included in the maximum target word string that belong to the translation sentence included in the first parallel corpus is determined (operation S723). In the case where the first proportion is not smaller than a first predetermined proportion, it is determined that the first parallel corpus satisfies the first condition (operation S725). The target words include words in a predetermined vocabulary and words in translation sentences included in the plurality of parallel corpora.
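A minimal sketch of this first-condition check follows. It assumes, for simplicity, that the predetermined proportion is 1.0 (so a target word string is a maximal run of consecutive target words), that target words are approximated by the vocabulary plus the words of the current reference translation, and that the 0.8 threshold is illustrative.

```python
def satisfies_first_condition(predicted, reference, vocab,
                              first_predetermined_proportion=0.8):
    """Sketch of operations S721-S725 under the simplifications noted above."""
    pred_words = predicted.split()
    ref_words = set(reference.split())
    # Target words: words from the predetermined vocabulary or the reference.
    is_target = [w in vocab or w in ref_words for w in pred_words]

    # S721-S722: scan for the longest run of consecutive target words.
    best, run = [], []
    for word, flag in zip(pred_words, is_target):
        if flag:
            run.append(word)
            if len(run) > len(best):
                best = run[:]
        else:
            run = []
    if not best:
        return False

    # S723: first proportion = share of the maximum string's words that
    # appear in the reference translation sentence.
    first_proportion = sum(1 for w in best if w in ref_words) / len(best)
    # S725: the corpus pair is kept when the proportion reaches the threshold.
    return first_proportion >= first_predetermined_proportion
```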
According to an embodiment of the present disclosure, the second obtaining submodule 1013 is configured to perform the following operations for a second parallel corpus among the parallel corpora satisfying the first condition: determining, according to the maximum target word string of the predicted translation sentence corresponding to the second parallel corpus, at least one clause satisfying a second condition in the translation sentence included in the second parallel corpus (operation S831); determining, in the source sentence included in the second parallel corpus, at least one second word matching the maximum target word string corresponding to the second parallel corpus (operation S832); and splicing the at least one second word to obtain a target source sentence, and splicing the at least one clause to obtain a target translation sentence (operation S833). The target parallel corpus obtained from the second parallel corpus includes the target source sentence and the target translation sentence, and the second parallel corpus is any one of the parallel corpora satisfying the first condition.
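The disclosure does not specify how second words are "matched" with the maximum target word string; the sketch below assumes, purely for illustration, a word-for-word bilingual dictionary lookup. In practice a word-alignment model would be a more plausible choice.

```python
def extract_target_pair(source_sentence, clauses, max_target_string, bilingual_dict):
    """Sketch of operations S831-S833; bilingual_dict is a hypothetical
    word-for-word source->target dictionary assumed for illustration."""
    max_words = set(max_target_string.split())
    # S832: keep source words whose dictionary translation falls in the
    # maximum target word string.
    matched = [w for w in source_sentence.split()
               if bilingual_dict.get(w) in max_words]
    # S833: splice the matched words and the selected clauses.
    target_source_sentence = " ".join(matched)
    target_translation_sentence = " ".join(clauses)
    return target_source_sentence, target_translation_sentence
```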
According to an embodiment of the present disclosure, determining at least one clause satisfying the second condition in the translation sentence included in the second parallel corpus includes: performing clause segmentation on the translation sentence included in the second parallel corpus to obtain a plurality of clauses (operation S911); determining a second proportion of the target words included in each of the plurality of clauses that appear in the maximum target word string corresponding to the second parallel corpus, so as to obtain a plurality of second proportions (operation S912); determining each second proportion that is not smaller than a second predetermined proportion among the plurality of second proportions as a target proportion (operation S915); and determining the clause corresponding to the target proportion as the at least one clause satisfying the second condition (operation S916). In the case where the plurality of second proportions are all smaller than the second predetermined proportion, the clause corresponding to the largest second proportion among the plurality of second proportions is determined as the at least one clause satisfying the second condition (operation S914).
According to an embodiment of the disclosure, as shown in fig. 10, the text processing apparatus 1000 described above may further include a preprocessing module 1040 configured to preprocess any one of the mixed parallel corpus and the target parallel corpus obtained by the corpus obtaining module 1010. As shown in fig. 10, the preprocessing module 1040 may include, for example, a word segmentation submodule 1041, a conversion submodule 1042, and a vector obtaining submodule 1043. The word segmentation submodule 1041 is configured to perform word segmentation on the source sentence and the translation sentence included in any one parallel corpus, so as to obtain a first word sequence for the source sentence and a second word sequence for the translation sentence (operation S340). The conversion submodule 1042 is configured to convert the first word sequence into a first number sequence and convert the second word sequence into a second number sequence according to a predetermined vocabulary (operation S350). The vector obtaining submodule 1043 is configured to obtain, by using a word embedding technique, a first word vector for the first word sequence according to the first number sequence and a second word vector for the second word sequence according to the second number sequence (operation S360). The predetermined vocabulary includes correspondences between a plurality of words and the numbers for those words, the plurality of words being extracted from the mixed parallel corpus and the target parallel corpus; the first translation model is obtained through training according to the first word vector and the second word vector.
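Operations S340 to S360 amount to a vocabulary lookup followed by an embedding-matrix lookup, as in the following sketch. Word segmentation is assumed to have produced the word lists already, and the unknown-word number 0 and start/end numbers 1 and 2 are illustrative assumptions.

```python
import numpy as np

def preprocess(source_words, translation_words, vocab, embedding_matrix,
               sos_id=1, eos_id=2):
    """Sketch of operations S350-S360: vocab maps word -> number;
    embedding_matrix is assumed to be a (vocab_size, dim) array."""
    # S350: convert word sequences to number sequences; 0 stands for
    # out-of-vocabulary words (assumption).
    first_numbers = [vocab.get(w, 0) for w in source_words]
    # The translation side carries the sentence start and end numbers.
    second_numbers = [sos_id] + [vocab.get(w, 0) for w in translation_words] + [eos_id]

    # S360: word embedding -- row lookup in the embedding matrix.
    first_vectors = embedding_matrix[first_numbers]    # shape (len_src, dim)
    second_vectors = embedding_matrix[second_numbers]  # shape (len_tgt + 2, dim)
    return first_vectors, second_vectors
```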
According to an embodiment of the present disclosure, the first translation model includes a first sub-model and a second sub-model, and the translated text obtaining module 1030 described above may include, for example, a semantic vector obtaining submodule 1031 and a translated text obtaining submodule 1032. The semantic vector obtaining submodule 1031 is configured to obtain a semantic vector for the text to be processed by taking the text to be processed as an input of the first sub-model. The translated text obtaining submodule 1032 is configured to obtain the translated text for the text to be processed by taking the semantic vector as an input of the second sub-model. The second sub-model includes a long short-term memory (LSTM) network model, and the second number sequence includes a sentence start number and a sentence end number. The semantic vector obtaining submodule 1031 and the translated text obtaining submodule 1032 may be used to perform operations S431 to S432 described in fig. 4, respectively, which will not be repeated here.
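A minimal encoder-decoder sketch of such a first translation model is given below, using PyTorch for illustration. The single-layer setup, the hidden size, and the use of the encoder's final state as the semantic vector are assumptions; the disclosure fixes only that the second sub-model is an LSTM.

```python
import torch
from torch import nn

class Seq2Seq(nn.Module):
    """Illustrative encoder-decoder: the first sub-model condenses the
    input into a semantic vector, the second sub-model (an LSTM) decodes it."""

    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # first sub-model
        self.decoder = nn.LSTM(dim, dim, batch_first=True)  # second sub-model
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids, tgt_ids: LongTensors of number sequences, shape (batch, length).
        # First sub-model: the final hidden state serves as the semantic vector.
        _, state = self.encoder(self.embed(src_ids))
        # Second sub-model: decode the target sequence from the semantic vector.
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # per-position vocabulary logits
```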
Any number of the modules, submodules, units, and subunits according to embodiments of the present disclosure, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, submodules, units, and subunits according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, submodules, units, and subunits according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system in package, or an application specific integrated circuit (ASIC), or by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, one or more of the modules, submodules, units, and subunits according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, may perform the corresponding functions.
Fig. 11 schematically illustrates a block diagram of an electronic device adapted to perform a text processing method according to an embodiment of the disclosure. The electronic device shown in fig. 11 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 11, an electronic device 1100 according to an embodiment of the present disclosure includes a processor 1101 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. The processor 1101 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1101 may also include on-board memory for caching purposes. The processor 1101 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flow according to embodiments of the present disclosure.
In the RAM 1103, various programs and data necessary for the operation of the electronic device 1100 are stored. The processor 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. The processor 1101 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1102 and/or the RAM 1103. Note that the program may be stored in one or more memories other than the ROM 1102 and the RAM 1103. The processor 1101 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1100 may also include an input/output (I/O) interface 1105, which is also connected to the bus 1104. The electronic device 1100 may also include one or more of the following components connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage portion 1108 including a hard disk and the like; and a communication portion 1109 including a network interface card such as a LAN card or a modem. The communication portion 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read therefrom is installed into the storage portion 1108 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1101. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 1102 and/or RAM 1103 described above and/or one or more memories other than ROM 1102 and RAM 1103.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the foregoing describes embodiments of the present disclosure. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (9)

1. A text processing method, comprising:
obtaining a mixed parallel corpus and a target parallel corpus, wherein the mixed parallel corpus is obtained from a plurality of fields;
training a predetermined model by taking the mixed parallel corpus and the target parallel corpus as training samples to obtain a first translation model; and
taking a text to be processed as an input of the first translation model to obtain a translated text for the text to be processed,
the target parallel corpus is a parallel corpus for the target field, and the target parallel corpus comprises parallel corpus obtained through screening by a second translation model; the second translation model is obtained by training with the mixed parallel corpus as a training sample;
wherein, obtaining the target parallel corpus comprises:
obtaining a plurality of parallel corpora for the target field;
determining, by using the second translation model, parallel corpora satisfying a first condition among the plurality of parallel corpora; and
obtaining the target parallel corpus according to the parallel corpus meeting the first condition;
wherein the determining parallel corpora of the plurality of parallel corpora that satisfy the first condition includes: the following operations are performed for a first parallel corpus of the plurality of parallel corpora:
Taking a source sentence included in the first parallel corpus as an input of the second translation model, and outputting a predicted translation sentence corresponding to the first parallel corpus; and
determining whether the first parallel corpus meets the first condition according to a source sentence included in the first parallel corpus, a translation sentence included in the first parallel corpus, and a predicted translation sentence corresponding to the first parallel corpus,
the first parallel corpus is any one of the plurality of parallel corpora.
2. The method of claim 1, wherein determining whether the first parallel corpus satisfies a first condition comprises:
determining a plurality of target word strings in a predicted translation sentence corresponding to the first parallel corpus, wherein each target word string in the plurality of target word strings consists of a plurality of first words having consecutive positions in the predicted translation sentence, and a proportion of target words among the plurality of first words is not smaller than a predetermined proportion;
determining the target word string with the largest length among the plurality of target word strings as a maximum target word string;
determining a first proportion of the target words included in the maximum target word string that belong to the translation sentence included in the first parallel corpus; and
determining that the first parallel corpus satisfies the first condition in the case that the first proportion is not smaller than a first predetermined proportion,
the target words comprise words in a predetermined vocabulary and words in translation sentences comprised by the plurality of parallel corpora.
3. The method of claim 2, wherein obtaining the target parallel corpus from the parallel corpus satisfying the first condition comprises: the following operations are executed for a second parallel corpus among the parallel corpora meeting the first condition:
determining, according to the maximum target word string of the predicted translation sentence corresponding to the second parallel corpus, at least one clause satisfying a second condition in the translation sentence included in the second parallel corpus;
determining at least one second word matched with a maximum target word string corresponding to the second parallel corpus in a source sentence included in the second parallel corpus; and
splicing the at least one second word to obtain a target source sentence, splicing the at least one clause to obtain a target translation sentence,
the target parallel corpus obtained according to the second parallel corpus comprises the target source sentence and the target translation sentence, and the second parallel corpus is any one of the parallel corpora meeting the first condition.
4. The method of claim 3, wherein determining at least one clause of the translated sentences comprised by the second parallel corpus that satisfies a second condition comprises:
performing clause segmentation on the translation sentence included in the second parallel corpus to obtain a plurality of clauses;
determining a second proportion of the target words included in each of the plurality of clauses that appear in the maximum target word string corresponding to the second parallel corpus, so as to obtain a plurality of second proportions;
determining each second proportion that is not smaller than a second predetermined proportion among the plurality of second proportions as a target proportion;
determining the clause corresponding to the target proportion as the at least one clause satisfying the second condition;
and in the case where the plurality of second proportions are all smaller than the second predetermined proportion, determining the clause corresponding to the largest second proportion among the plurality of second proportions as the at least one clause satisfying the second condition.
5. The method of claim 1, further comprising, prior to training the predetermined model to obtain the first translation model: the following operations are executed for the source sentences and the translation sentences included in any one of the mixed parallel corpus and the target parallel corpus:
performing word segmentation processing on the source sentence and the translation sentence to obtain a first word sequence aiming at the source sentence and a second word sequence aiming at the translation sentence;
converting the first word sequence into a first number sequence and converting the second word sequence into a second number sequence according to a predetermined vocabulary; and
obtaining, by using a word embedding technique, a first word vector for the first word sequence according to the first number sequence and a second word vector for the second word sequence according to the second number sequence,
wherein the predetermined vocabulary comprises correspondences between a plurality of words and numbers for the words, the plurality of words being extracted from the mixed parallel corpus and the target parallel corpus; the first translation model is obtained through training according to the first word vector and the second word vector.
6. The method of claim 5, wherein the first translation model comprises a first sub-model and a second sub-model, and taking the text to be processed as the input of the first translation model to obtain the translated text for the text to be processed comprises:
taking the text to be processed as the input of the first sub-model to obtain a semantic vector for the text to be processed; and
taking the semantic vector as the input of the second sub-model to obtain a translation text aiming at the text to be processed,
wherein the second sub-model comprises a long short-term memory (LSTM) network model, and the second number sequence comprises a sentence start number and a sentence end number.
7. A text processing apparatus, comprising:
the corpus obtaining module is used for obtaining a mixed parallel corpus and a target parallel corpus, wherein the mixed parallel corpus is obtained from a plurality of fields;
the model training module is used for taking the mixed parallel corpus and the target parallel corpus as training samples and training a preset model to obtain a first translation model;
a translation text obtaining module, configured to obtain a translation text for a text to be processed by using the text to be processed as an input of the first translation model,
the target parallel corpus is a parallel corpus for the target field, and the target parallel corpus comprises parallel corpus obtained through screening by a second translation model; the second translation model is obtained by training with the mixed parallel corpus as a training sample,
the corpus obtaining module comprises:
the first obtaining submodule is used for obtaining a plurality of parallel corpora for the target field;
a determining submodule, configured to determine a parallel corpus that satisfies a first condition among the plurality of parallel corpora by using the second translation model; and
The second obtaining sub-module is used for obtaining the target parallel corpus according to the parallel corpus meeting the first condition;
wherein the determining submodule is configured to perform the following operations for a first parallel corpus of the plurality of parallel corpora:
taking a source sentence included in the first parallel corpus as an input of the second translation model, and outputting a predicted translation sentence corresponding to the first parallel corpus; and
determining whether the first parallel corpus meets the first condition according to a source sentence included in the first parallel corpus, a translation sentence included in the first parallel corpus, and a predicted translation sentence corresponding to the first parallel corpus,
the first parallel corpus is any one of the plurality of parallel corpora.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the text processing method of any of claims 1-6.
9. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to perform the text processing method of any of claims 1 to 6.
CN202010198468.1A 2020-03-19 2020-03-19 Text processing method, text processing device, electronic equipment and readable storage medium Active CN111368566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010198468.1A CN111368566B (en) 2020-03-19 2020-03-19 Text processing method, text processing device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111368566A CN111368566A (en) 2020-07-03
CN111368566B true CN111368566B (en) 2023-06-30

Family

ID=71206819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010198468.1A Active CN111368566B (en) 2020-03-19 2020-03-19 Text processing method, text processing device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111368566B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263349A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Corpus assessment models training method, device, storage medium and computer equipment
CN110309516A (en) * 2019-05-30 2019-10-08 清华大学 Training method, device and the electronic equipment of Machine Translation Model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10133739B2 (en) * 2014-10-24 2018-11-20 Google Llc Neural machine translation systems with rare word processing
CN111046677B (en) * 2019-12-09 2021-07-20 北京字节跳动网络技术有限公司 Method, device, equipment and storage medium for obtaining translation model
CN111859994B (en) * 2020-06-08 2024-01-23 北京百度网讯科技有限公司 Machine translation model acquisition and text translation method, device and storage medium
CN111783480B (en) * 2020-06-29 2024-06-25 北京嘀嘀无限科技发展有限公司 Text processing and model training method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111368566A (en) 2020-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant