CN108763222B - Translation missing detection and translation method and device, server and storage medium - Google Patents


Info

Publication number
CN108763222B
Authority
CN
China
Prior art keywords
translation
participle
probability
word segmentation
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810473017.7A
Other languages
Chinese (zh)
Other versions
CN108763222A (en
Inventor
郑吴杰
邓月堂
刘思凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810473017.7A
Publication of CN108763222A
Application granted
Publication of CN108763222B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention disclose a translation missing detection method and device, a translation method and device, a server, and a storage medium. After original content is translated to generate translation content, the original content is analyzed to obtain at least one participle it contains; a target participle is determined from the at least one participle according to the untranslated probability of each participle; candidate translation content corresponding to the target participle is obtained; and translation missing detection is performed on that basis.

Description

Translation missing detection and translation method and device, server and storage medium
Technical Field
The invention relates to the field of translation, in particular to translation missing detection and translation methods and devices, a server and a storage medium.
Background
Text translation converts the same text among multiple languages, which is necessary for communication among users of different languages. Conventional translation modes comprise manual translation and machine translation. Because manual translation requires a large amount of manpower and material resources and is costly to implement, machine translation is the general technology in the translation field. Machine translation includes two directions: SMT (Statistical Machine Translation) and NMT (Neural Machine Translation).
SMT is a statistics-based machine translation method. Its basic idea is to construct a statistical translation model by statistically analyzing a large number of parallel corpora and then to translate with that model. Because the method is based on statistics, its translations are rigid and not fluent, although obvious grammatical errors do not exist; therefore NMT, the neural-model-based machine translation method, is currently the mainstream development direction of machine translation.
NMT also has some problems, such as missed translation, as in the following translation pair:
Original content: I give a red envelope to your mother.
Translation content: I give you a red envelope.
In this example, "mother" in the original content is not rendered in the translation content, so "mother" is a missed translation.
If the translation results of a translation model contain missed translations, the translation model needs to be improved or replaced to improve the user experience. Therefore, to ensure the translation accuracy of the translation model, the translation results need to be accurately checked for problems such as missed translation.
Disclosure of Invention
Embodiments of the invention provide a translation missing detection method and device, a translation method and device, a server, and a storage medium, which can detect whether translation missing exists.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
a translation missing detection method, comprising:
acquiring original content and translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
and when the translation content does not comprise the candidate translation content of the target word segmentation, determining that translation content corresponding to the original content has translation missing.
A method of translation, comprising:
translating the original content by using a machine translation model to obtain translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
when the translation content does not comprise the candidate translation content of the target participle, determining that translation content corresponding to the original content has translation missing;
counting the missed-translation participles in the original content; the missed-translation participles comprise participles whose candidate translation content does not exist in the translation content;
optimizing the machine translation model according to the statistical result;
and translating the original content again by using the optimized machine translation model.
A translation miss detection apparatus comprising:
the first obtaining module is used for obtaining original content and translation content corresponding to the original content;
the first analysis module is used for performing word segmentation processing on the original content to obtain at least one word segmentation;
the second acquisition module is used for acquiring the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
the third obtaining module is used for determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation and obtaining candidate translation contents corresponding to the target word segmentation;
and the first checking module is used for determining that translation content corresponding to the original content has translation missing when the translation content does not comprise candidate translation content of the target word segmentation.
A translation device, comprising:
the first translation module is used for translating the original content by using a machine translation model to obtain a translation content corresponding to the original content;
the second analysis module is used for performing word segmentation processing on the original content to obtain at least one word segmentation;
the fourth acquisition module is used for acquiring the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
a fifth obtaining module, configured to determine a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and obtain candidate translation content corresponding to the target word segmentation;
the second check module is used for determining that translation content corresponding to the original content has missing translation when the translation content does not comprise candidate translation content of the target participle;
the statistical module is used for counting the missed-translation participles in the original content; the missed-translation participles comprise participles whose candidate translation content does not exist in the translation content;
the optimization module is used for optimizing the machine translation model according to the statistical result;
and the second translation module is used for translating the original content again by using the optimized machine translation model.
A server comprising a processor and a memory, said memory storing a plurality of instructions adapted to be loaded by the processor to perform the steps of the translation missing detection method or the translation method described above.
A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the translation missing detection method or the translation method described above.
According to embodiments of the invention, the original content is translated to generate translation content; the original content is then analyzed to obtain at least one participle it contains; a target participle is determined from the at least one participle according to the untranslated probability of each participle; and candidate translation content corresponding to the target participle is obtained to perform translation missing detection. For example, when the translation content does not include the candidate translation content corresponding to the target participle, the translation content is determined to have translation missing. Embodiments of the invention can therefore detect whether translation missing exists and can optimize the translation model based on the translation missing statistics, which ensures the translation accuracy of the translation model and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a first networking diagram of a translation system provided by an embodiment of the invention;
FIG. 2 is a first flowchart of a translation miss detection method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a translation miss detection apparatus according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a translation method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a translation apparatus according to an embodiment of the present invention;
FIG. 6 is a second networking diagram of a translation system provided by an embodiment of the invention;
FIG. 7 is a second flowchart of a translation miss detection method according to an embodiment of the present invention;
FIG. 8 is a third flowchart illustrating a translation miss detection method according to an embodiment of the present invention;
FIG. 9 is a first schematic diagram of a user interface provided by an embodiment of the invention;
FIG. 10 is a second schematic diagram of a user interface provided by an embodiment of the invention;
FIG. 11 is a first schematic diagram of a data interface provided by an embodiment of the present invention;
FIG. 12 is a second schematic diagram of a data interface provided by an embodiment of the invention;
FIG. 13 is a third schematic diagram of a data interface provided by an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of a translation system according to an embodiment of the present invention, where the translation system may include an interface server 11, a translation server 12, a verification server 13, and a data server 14 for providing various data supports; wherein:
the data server 14 is used for providing translation model data, verification data and the like, the translation model data is used for the translation server 12 to translate the original content so as to output corresponding translation content, and the verification data is used for the verification server 13 to verify the translation content so as to judge whether the translation missing problem exists or not;
the interface server 11 is used for providing an access interface for a user, receiving a translation request and the like sent by the user through a terminal, and forwarding the translation request and the like to the translation server 12;
the translation server 12 is configured to extract original content to be translated from the translation request, and translate the original content using a translation model to output corresponding translation content;
the verification server 13 is configured to obtain an original content and a translation content corresponding to the original content, and perform word segmentation processing on the original content to obtain at least one word segmentation; obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set; determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation; and when the translation content does not comprise the candidate translation content of the target word segmentation, determining that translation content corresponding to the original content has translation missing.
A participle (the result of word segmentation) refers to a single word, a phrase, or the like, and is obtained using a word segmentation algorithm.
Candidate translation content refers to a common or general rendering of the participle in the language of the translation content; one participle may correspond to multiple candidate translation contents. For example, when translating English into Chinese, the participle "Peking" may correspond to multiple Chinese candidate translation contents such as "beijing university" and "beijing opera", and when translating Chinese into English, the participle "love" may correspond to multiple English candidate translation contents such as "love" and "like".
It should be noted that the system scenario diagram shown in fig. 1 is only an example, and the server and the scenario described in the embodiment of the present invention are for more clearly illustrating the technical solution of the embodiment of the present invention, and do not form a limitation on the technical solution provided in the embodiment of the present invention.
The translation miss detection method and apparatus will be described in detail below.
Fig. 2 is a first flowchart of a missing translation detection method according to an embodiment of the present invention, please refer to fig. 2, in which the missing translation detection method includes the following steps:
S201: Translate the original content to obtain translation content corresponding to the original content.
The original content may be obtained directly from what the user sends through the terminal, for example "It's a fine day" shown in fig. 9 or "I give your red purse to your mom" shown in fig. 10, or it may be retrieved directly within the translation server.
The translation content is generated by translating the original content with a preset translation model, and it may contain missed translations.
S202: Perform word segmentation processing on the original content to obtain at least one participle.
This step can be implemented with a conventional segmentation algorithm, for example parsing "It's a fine day today" into multiple participles such as "It's", "a", "fine", "day", and "today", or parsing "I give your red purse to your mother" into multiple participles such as "I", "give", "you", "red purse", and "mom".
S203: Acquire the untranslated probability of each participle.
The untranslated probability is the probability that the participle is not translated in a translation sample.
This step can be implemented in the following way:
counting a first numerical value, namely the number of occurrences of the participle in the translation sample; as shown in fig. 11, the translation sample includes multiple translation sentence pairs (each pair consists of one original content and its translation content), and the more data the translation sample contains, the more accurate the calculation can be expected to be.
Counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample;
acquiring a first difference value between the first numerical value and the second numerical value;
and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
For example, the translation sample shown in fig. 11 includes 4 translation sentence pairs (English-to-Chinese). The English participle "I" occurs 3 times in the 4 translation sentence pairs, so its first numerical value is 3; its corresponding candidate translation content "I" also occurs 3 times in the translation sample, so the second numerical value is 3. The first difference value is therefore 0 and the first ratio is 0, i.e. the untranslated probability of the participle "I" is 0. Similarly, the participle "have" occurs 3 times in the translation sample (first numerical value 3), and its corresponding candidate translation content "have" occurs 2 times (second numerical value 2); the first difference value is 1 and the first ratio is 1/3, about 33%, i.e. the untranslated probability of the participle "have" is 33%.
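The calculation in S203 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name, the toy sentence pairs (which are not the contents of fig. 11), and the decision to count the candidate only in pairs where the participle itself occurs (so that the result stays between 0 and 1) are all assumptions.

```python
def untranslated_probability(word, candidates, sentence_pairs):
    """Return (first - second) / first for one participle.

    first  = number of sentence pairs whose source side contains `word`
    second = number of those pairs whose target side contains a candidate
    (counting per sentence pair is an assumption of this sketch)
    """
    first = 0
    second = 0
    for source_tokens, target_tokens in sentence_pairs:
        if word in source_tokens:
            first += 1
            if any(c in target_tokens for c in candidates):
                second += 1
    if first == 0:
        return None  # the word never occurs, so no estimate is possible
    return (first - second) / first


# Hypothetical toy sample of 4 already-segmented sentence pairs.
pairs = [
    (["I", "have", "a", "book"], ["我", "有", "一本", "书"]),
    (["I", "have", "eaten"], ["我", "吃", "了"]),
    (["I", "have", "a", "pen"], ["我", "有", "一支", "笔"]),
    (["you", "are", "here"], ["你", "在", "这里"]),
]

p_i = untranslated_probability("I", {"我"}, pairs)        # 3 of 3 translated
p_have = untranslated_probability("have", {"有"}, pairs)  # 2 of 3 translated
```

With this toy sample the two values reproduce the arithmetic of the example above: 0 for "I" and 1/3 for "have".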
S204: Determine a target participle from the at least one participle according to the untranslated probability of each participle, and acquire candidate translation content corresponding to the target participle.
The step of determining the target participle from the at least one participle according to the untranslated probability of the participle can be realized in one of the following ways:
Mode 1: Acquire an error rate threshold from a preset condition; compare the untranslated probability of the participle with the error rate threshold; if the untranslated probability of the participle is smaller than the error rate threshold, determine the participle as a target participle. For example, with an error rate threshold of 10%, a participle whose untranslated probability is 50% is not taken as a target participle, whereas a participle whose untranslated probability is 0.5% is.
Mode 2: Acquire a check number from a preset condition; sort the untranslated probabilities of the participles in ascending order; select the smallest untranslated probabilities from the sorted result, as many as the check number; if the untranslated probability of a participle is selected, determine that participle as a target participle. For example, with a check number of 200 and 10000 participles in total, the untranslated probabilities of the 10000 participles are sorted in ascending order and the 200 smallest are selected; a participle whose untranslated probability is selected is determined as a target participle.
Mode 3: Acquire a check rate from a preset condition; calculate the total check number from the total number of participles contained in the original content and the check rate; sort the untranslated probabilities of the participles in ascending order; select the smallest untranslated probabilities from the sorted result, as many as the total check number; if the untranslated probability of a participle is selected, determine that participle as a target participle. For example, with a check rate of 2% and 10000 participles in total, 200 participles need to be checked; the untranslated probabilities of the 10000 participles are sorted in ascending order and the 200 smallest are selected; a participle whose untranslated probability is selected is determined as a target participle.
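The three selection modes can be sketched as below. The function names, the dictionary layout (participle mapped to untranslated probability), the toy values, and the use of `round` to turn the check rate into a whole number are assumptions made for illustration only.

```python
def select_by_threshold(probs, error_rate_threshold):
    """Mode 1: keep participles whose untranslated probability is below the threshold."""
    return {w for w, p in probs.items() if p < error_rate_threshold}


def select_by_count(probs, check_number):
    """Mode 2: keep the `check_number` participles with the smallest probabilities."""
    ranked = sorted(probs, key=probs.get)  # ascending by untranslated probability
    return set(ranked[:check_number])


def select_by_rate(probs, check_rate):
    """Mode 3: derive the check number from the total count and the check rate."""
    total_check_number = round(len(probs) * check_rate)  # rounding is an assumption
    return select_by_count(probs, total_check_number)


# Hypothetical untranslated probabilities for four participles.
probs = {"I": 0.0, "have": 0.33, "peking": 0.005, "the": 0.5}

targets_mode1 = select_by_threshold(probs, 0.10)
targets_mode2 = select_by_count(probs, 2)
targets_mode3 = select_by_rate(probs, 0.5)
```

All three modes favour participles that are almost always translated in the sample, so that their absence from a new translation is a strong signal of a miss.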
Candidate translation content of a participle, i.e. the common translations of the word, can be obtained by looking the word up in a translation dictionary, by mining word translations with a word alignment model, or by a recommendation algorithm. For dictionary lookup and word-alignment mining, reference may be made to the prior art and the description is not repeated; obtaining common translations with a recommendation algorithm is described below.
The obtaining of the candidate translation content corresponding to the target word segmentation in this step can be realized by the following method:
taking each translation sentence pair in the translation sample as a user;
taking each participle in the translation sample as a project;
constructing a user-project matrix;
acquiring similar items of each item according to the user-item matrix;
and taking the participles corresponding to the similar items of the items as candidate translation contents of the participles corresponding to the items.
The step of obtaining similar items of each item according to the user-item matrix may be implemented in the following manner: calculating cosine similarity among the items by adopting a similarity calculation method; and (4) adopting a collaborative filtering algorithm to obtain similar items of each item based on cosine similarity among the items.
For example, as shown in fig. 11, the translation sample consists of Chinese-English sentence pairs. Each Chinese-English sentence pair is regarded as a user and each word is regarded as an item, which yields a user-item matrix; an item-based algorithm is then used to obtain the foreign-language similar words of each word, which serve as translation candidates. The item-based algorithm is an item-based collaborative filtering algorithm; for its specific implementation, reference may be made to the prior art, and it is not described in detail.
Specifically, the translation sample is the 4 Chinese-English sentence pairs shown in fig. 11. The Chinese is segmented, each sentence pair is regarded as one user, and each word is regarded as one item, which yields the corresponding user-item matrix. Fig. 12 shows part of the content of the example matrix, from which it can be obtained that I is most similar to "I", and have is most similar to "I" and "has". As the data increases, the recommended related words come closer to everyday translation. For example, fig. 13 gives the most relevant foreign words of "peking" and "university of beijing", respectively.
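A minimal sketch of this recommendation step, assuming binary occurrence vectors and plain cosine similarity, is shown below. The helper names, the `("src", word)` / `("tgt", word)` keys, the restriction of candidates to target-side words, and the toy sentence pairs (not the data of fig. 11) are all hypothetical.

```python
import math
from collections import defaultdict


def build_item_vectors(sentence_pairs):
    """Map each word (item) to the set of sentence-pair indices (users) containing it."""
    items = defaultdict(set)
    for user_id, (source_tokens, target_tokens) in enumerate(sentence_pairs):
        for w in source_tokens:
            items[("src", w)].add(user_id)
        for w in target_tokens:
            items[("tgt", w)].add(user_id)
    return dict(items)


def cosine(users_a, users_b):
    """Cosine similarity of two binary columns of the user-item matrix."""
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def candidate_translations(word, items, top_n=2):
    """Return the target-side words most similar to a source-side word."""
    users = items.get(("src", word))
    if not users:
        return []
    scored = []
    for (side, other), other_users in items.items():
        if side != "tgt":
            continue
        score = cosine(users, other_users)
        if score > 0:
            scored.append((score, other))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [other for _, other in scored[:top_n]]


pairs = [
    (["I", "have", "a", "book"], ["我", "有", "一本", "书"]),
    (["I", "have", "eaten"], ["我", "吃", "了"]),
    (["I", "have", "a", "pen"], ["我", "有", "一支", "笔"]),
    (["you", "are", "here"], ["你", "在", "这里"]),
]
items = build_item_vectors(pairs)
top_for_i = candidate_translations("I", items, top_n=1)
top_for_have = candidate_translations("have", items, top_n=2)
```

On such a small sample "have" co-occurs with "我" in every pair, so "我" ranks above "有"; this mirrors the remark above that the recommendations only approach everyday translation as the data grows.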
S205: When the translation content does not include the candidate translation content of the target participle, determine that the translation content corresponding to the original content has translation missing.
For an original sentence and its translation, if no common translation of an original word exists in the translation and the estimated error rate is low, a missed translation is considered to exist and the specific missed word is indicated. The algorithm is as follows:
a) For each word w in the original text, obtain its common translations.
b) Check the translation; if none of the common translations of w appears in it, determine that w is missed.
An example of the inspection results is given below:
The Chinese Economic Life Survey, sponsored by CCTV, the National Bureau of Statistics, China Post and the National School of Development at Peking University, shows new trends in incomes, expenditures, social security, and life of Chinese people.
The translation is: a Chinese economic life survey, sponsored by the central television station, the national statistical bureau, China Post and the national development institute, shows a new trend in the income, expenditure, social security and quality of life of Chinese people.
None of the candidate translation contents corresponding to the participle "peking" appears in the translation, so translation missing exists, and the missed word is "peking".
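Steps a) and b) can be sketched as a simple set-membership check. The function name, the candidate dictionary, and the tokenized translation below are hypothetical; skipping target participles that have no known candidates is also an assumption of this sketch.

```python
def find_missed_words(target_words, candidates, translation_tokens):
    """Return the target participles none of whose candidates appear in the translation."""
    translation = set(translation_tokens)
    missed = []
    for w in target_words:
        common = candidates.get(w, set())  # step a): common translations of w
        if common and not (common & translation):  # step b): none of them present
            missed.append(w)
    return missed


# Hypothetical candidates and a hypothetical segmented translation.
candidates = {
    "peking": {"北京", "北京大学"},
    "survey": {"调查"},
}
translation_tokens = ["中国", "经济", "生活", "调查", "国家", "发展", "研究院"]

missed = find_missed_words(["peking", "survey"], candidates, translation_tokens)
```

Here "survey" is covered by "调查" while no candidate of "peking" appears, so only "peking" is reported.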
In an embodiment, after step S205, the following steps may be further included:
counting the word segmentation of the candidate translation content which does not exist in the translation content;
optimizing the machine translation model according to the statistical result;
and translating the original content again by using the optimized machine translation model.
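The counting step can be sketched with a frequency table over the per-sentence detection results. How the statistics are then used to optimize the machine translation model is not specified here, so the sketch stops at the counts; the function name and the sample input are hypothetical.

```python
from collections import Counter


def count_missed_words(missed_per_sentence):
    """Aggregate the missed participles reported for each checked sentence."""
    counts = Counter()
    for missed in missed_per_sentence:
        counts.update(missed)
    return counts


# Hypothetical detection results for three translated sentences.
stats = count_missed_words([["peking"], [], ["peking", "mom"]])
most_frequent = stats.most_common(1)
```

The most frequently missed participles are natural starting points when deciding what to improve in the model.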
In an embodiment, after step S204, the following steps may be further included:
acquiring a third numerical value of the occurrence frequency of the word segmentation in the original content;
acquiring a fourth numerical value of the occurrence frequency of the candidate translation content of the participle in the translation content;
obtaining a second difference value between the third numerical value and the fourth numerical value;
acquiring a second ratio of the second difference to the third value;
and if the second ratio is larger than the un-translation probability of the participle, determining that the translation content has translation missing.
For example, the third value, the number of occurrences of the participle "peking" in the original content, is 10, and the fourth value, the number of occurrences of its candidate translation content in the translation content, is 9. The second difference value between the third value and the fourth value is 1, and the second ratio of the second difference value to the third value is 10%, which is greater than the untranslated probability of "peking", 0; the translation content is therefore determined to have translation missing.
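This frequency-based check can be sketched as follows, assuming both texts are already segmented into token lists. The function name, parameter names, and toy tokens are hypothetical.

```python
def exceeds_expected_miss_rate(word, candidates, source_tokens,
                               translation_tokens, untranslated_prob):
    """Compare the observed miss ratio of `word` with its untranslated probability."""
    third = source_tokens.count(word)  # occurrences in the original content
    if third == 0:
        return False  # nothing to check for this participle
    fourth = sum(1 for t in translation_tokens if t in candidates)
    second_ratio = (third - fourth) / third
    return second_ratio > untranslated_prob


# Toy data matching the numbers of the example: 10 occurrences, 9 translated.
source_tokens = ["peking"] * 10 + ["opera"]
translation_tokens = ["北京"] * 9 + ["戏"]

has_miss = exceeds_expected_miss_rate(
    "peking", {"北京"}, source_tokens, translation_tokens, 0.0
)
```

Unlike the presence check in S205, this variant can flag a participle that is translated in most places but dropped in a few.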
Accordingly, fig. 3 is a schematic structural diagram of a translation missing detection apparatus according to an embodiment of the present invention, please refer to fig. 3, where the translation missing detection apparatus includes the following modules:
a first obtaining module 31, configured to obtain an original content and a translation content corresponding to the original content;
a first parsing module 32, configured to perform word segmentation processing on the original content to obtain at least one word segment;
a second obtaining module 33, configured to obtain an untranslated probability of the word segmentation; the untranslated probability is the probability that the participle is not translated in the translation sample set;
a third obtaining module 34, configured to determine a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and obtain candidate translation content corresponding to the target word segmentation;
the first checking module 35 is configured to determine that translation content corresponding to the original content has missing translation when the translation content does not include candidate translation content of the target word segmentation.
In an embodiment, the first checking module 35 may be further specifically configured to: acquiring a third numerical value of the occurrence frequency of the word segmentation in the original content; acquiring a fourth numerical value of the occurrence frequency of the candidate translation content of the participle in the translation content; obtaining a second difference value between the third numerical value and the fourth numerical value; acquiring a second ratio of the second difference to the third value; and if the second ratio is larger than the un-translation probability of the participle, determining that the translation content has translation missing.
In an embodiment, the first checking module 35 may be further specifically configured to: counting the participles whose candidate translation content does not exist in the translation content; optimizing the machine translation model according to the statistical result; and translating the original content again by using the optimized machine translation model.
In an embodiment, the first checking module 35 may be further specifically configured to determine the target segmented word from the at least one segmented word according to the untranslated probability of the segmented word by one of the following manners:
mode 1: acquiring an error rate threshold from a preset condition; comparing the untranslated probability of the participle with the error rate threshold; if the untranslated probability of the participle is smaller than the error rate threshold, determining the participle as a target participle;
mode 2: acquiring a check number from a preset condition; sorting the untranslated probabilities of the participles in ascending order; selecting from the sorted result, starting from the smallest, as many untranslated probabilities as the check number; if the untranslated probability of the participle is selected, determining the participle as a target participle;
mode 3: acquiring a check rate from a preset condition; calculating a total check number according to the total number of participles contained in the original content and the check rate; sorting the untranslated probabilities of the participles in ascending order; selecting from the sorted result, starting from the smallest, as many untranslated probabilities as the total check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
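The three modes above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the untranslated probabilities are passed as a dictionary keyed by participle, and rounding the total check number up in mode 3 is an assumption, since the rounding rule is not specified here.

```python
import math


def select_by_threshold(probs, error_rate_threshold):
    """Mode 1: keep participles whose untranslated probability is below the threshold."""
    return [word for word, p in probs.items() if p < error_rate_threshold]


def select_by_check_number(probs, check_number):
    """Mode 2: keep the check_number participles with the smallest probabilities."""
    ranked = sorted(probs, key=probs.get)  # ascending order of untranslated probability
    return ranked[:check_number]


def select_by_check_rate(probs, check_rate):
    """Mode 3: derive the total check number from the check rate, then apply mode 2."""
    total_check_number = math.ceil(len(probs) * check_rate)  # rounding up is an assumption
    return select_by_check_number(probs, total_check_number)


# Illustrative probabilities only.
probs = {"It's": 0.15, "a": 0.50, "fine": 0.005, "day": 0.25, "today": 0.0}
print(select_by_threshold(probs, 0.10))
print(select_by_check_number(probs, 2))
print(select_by_check_rate(probs, 0.5))
```

Modes 2 and 3 always select a fixed number of participles, whereas mode 1 may select none when every untranslated probability is at or above the threshold.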
In an embodiment, the third obtaining module 34 is specifically configured to: taking each translation sentence pair in the translation sample as a user; taking each participle in the translation sample as an item; constructing a user-item matrix; acquiring similar items of each item according to the user-item matrix; and taking the participles corresponding to the similar items of an item as candidate translation contents of the participle corresponding to that item.
In an embodiment, the third obtaining module 34 is specifically configured to: calculating cosine similarity among the items by adopting a similarity calculation method; and adopting a collaborative filtering algorithm to obtain similar items of each item based on the cosine similarity among the items.
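The item-based approach described above can be sketched as follows. This is an illustrative sketch, not the embodiment's implementation: the function names and the sample sentence pairs are hypothetical, the matrix is binary (a word either occurs in a sentence pair or not), and restricting candidates to target-language items is an assumption.

```python
import math


def build_item_vectors(sentence_pairs):
    """Build the user-item matrix column by column.

    Each sentence pair is a "user" and each word is an "item". An item's
    column is stored sparsely as the set of sentence-pair indices containing it.
    """
    vectors = {}
    for user_id, (source_words, target_words) in enumerate(sentence_pairs):
        for word in source_words:
            vectors.setdefault(("src", word), set()).add(user_id)
        for word in target_words:
            vectors.setdefault(("tgt", word), set()).add(user_id)
    return vectors


def cosine(a, b):
    """Cosine similarity of two binary vectors stored as index sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def candidate_translations(word, vectors, top_k=1):
    """Return the top_k target-language items most similar to a source word."""
    source_vector = vectors.get(("src", word), set())
    scored = [
        (cosine(source_vector, vec), item)
        for (side, item), vec in vectors.items()
        if side == "tgt"
    ]
    scored = [(score, item) for score, item in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by item text
    return [item for _, item in scored[:top_k]]


# Hypothetical, already segmented sentence pairs.
SAMPLE_PAIRS = [
    (["today", "is", "fine"], ["今天", "很", "好"]),
    (["I", "am", "fine"], ["我", "很", "好"]),
    (["I", "work", "today"], ["我", "今天", "工作"]),
]
VECTORS = build_item_vectors(SAMPLE_PAIRS)
print(candidate_translations("today", VECTORS))
print(candidate_translations("I", VECTORS))
```

With so few samples, several target words can tie for a source word (here "fine" co-occurs equally with two target words), which is consistent with the observation that the candidates improve as the sample data grows.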
In an embodiment, the second obtaining module 33 is specifically configured to: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
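The untranslated probability computed by this module can be sketched as follows. The function name is hypothetical, and the sketch applies the ratio exactly as stated, without clamping, since the handling of a candidate count larger than the participle count is not specified here.

```python
def untranslated_probability(participle_count, candidate_count):
    """Ratio of the first difference to the first value.

    participle_count -- first value: occurrences of the participle in the translation sample
    candidate_count  -- second value: occurrences of its candidate translation content
                        in the translation sample
    """
    if participle_count <= 0:
        raise ValueError("the participle must occur in the translation sample")
    first_difference = participle_count - candidate_count
    return first_difference / participle_count


# A participle seen 3 times and translated 3 times is never left untranslated.
print(untranslated_probability(3, 3))
# A participle seen 4 times and translated 3 times is left untranslated a quarter of the time.
print(untranslated_probability(4, 3))
```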
The following describes the translation method and apparatus in detail.
Fig. 4 is a schematic flowchart of a translation method according to an embodiment of the present invention. Referring to fig. 4, the translation method includes the following steps:
S401: translating the original content by using a machine translation model to obtain the translation content corresponding to the original content.
The machine translation model is a model obtained by learning and training a machine model by using a training sample library, and is used for translating original contents into translation contents.
For example, after receiving a translation request, the translation server extracts the original content "The Chinese's Economic Life Survey, sponsored by CCTV, the National Bureau of Statistics, China Post and National School of Development at Peking University, suggests new trends in incomes, expenditures, social security, and life quality of Chinese people", and then uses the machine translation model to translate it into "Chinese economic life survey, sponsored by the Central television station, the National Statistics office, the Chinese postal service, and the National Development institute, showing new trends in income, expense, social security, and quality of life of Chinese people".
S402: performing word segmentation processing on the original content to obtain at least one word segmentation.
S403: obtaining the untranslated probability of the participle; the untranslated probability is a probability that the participle is not translated in a translation sample set.
S404: determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation contents corresponding to the target word segmentation.
S405: when the translation content does not comprise the candidate translation content of the target word segmentation, determining that the translation content corresponding to the original content has translation missing.
The implementation of steps S402 to S405 is the same as that of steps S202 to S205, and is not described again.
S406: counting the missed-translation participles in the original content; the missed-translation participles comprise participles for which no candidate translation content exists in the translation content.
None of the candidate translation contents corresponding to the participle "Peking" appears in the translation content, so the participle "Peking" is taken as a missed-translation participle.
S407: optimizing the machine translation model according to the statistical result.
The missed-translation participle "Peking" and its corresponding candidate translation content "Beijing University" are added to the training sample library, and the model is then retrained with the new training sample library to obtain a new machine translation model, thereby optimizing the machine translation model.
S408: translating the original content again by using the optimized machine translation model.
Using the optimized machine translation model, "The Chinese's Economic Life Survey, sponsored by CCTV, the National Bureau of Statistics, China Post and National School of Development at Peking University, suggests new trends in incomes, expenditures, social security, and life quality of Chinese people" is translated into "Chinese economic life survey, sponsored by the Central TV station, the State Statistics office, the Chinese postal service and the National Development institute of Beijing University, showing the new trends in income, expense, social security, and quality of life of Chinese people".
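Steps S405 to S408 can be sketched as a detect, optimize and retranslate loop. This is only an outline under stated assumptions: `translate` and `retrain` are hypothetical callables standing in for the machine translation model and its retraining procedure, candidate matching is done by simple substring search, and the `max_rounds` limit is an assumption added so that the loop always terminates.

```python
def find_missed_participles(target_candidates, translation):
    """Return target participles none of whose candidates appear in the translation."""
    return [
        participle
        for participle, candidates in target_candidates.items()
        if not any(candidate in translation for candidate in candidates)
    ]


def translate_with_check(original, translate, retrain, target_candidates, max_rounds=3):
    """Translate, detect missed participles, optimize the model, and translate again."""
    translation = translate(original)
    for _ in range(max_rounds):
        missed = find_missed_participles(target_candidates, translation)
        if not missed:
            break
        # Add the missed participles and their candidates to the training samples,
        # then obtain an optimized model (represented here as a new callable).
        translate = retrain({p: target_candidates[p] for p in missed})
        translation = translate(original)
    return translation


# Toy stand-ins for the model and the retraining step (hypothetical).
def toy_translate(text):
    return "sponsored by the National Development institute"


def toy_retrain(new_samples):
    def improved(text):
        return "sponsored by the National Development institute of Beijing University"
    return improved


candidates = {"Peking": ["Beijing University"]}
result = translate_with_check(
    "sponsored by National School of Development at Peking University",
    toy_translate, toy_retrain, candidates,
)
print(result)
```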
Correspondingly, fig. 5 is a schematic structural diagram of a translation apparatus according to an embodiment of the present invention. Referring to fig. 5, the translation apparatus includes the following modules:
the first translation module 51 is configured to translate an original content by using a machine translation model, and obtain a translation content corresponding to the original content;
a second parsing module 52, configured to perform word segmentation processing on the original content to obtain at least one word segment;
a fourth obtaining module 53, configured to obtain an untranslated probability of the segmented word; the untranslated probability is the probability that the participle is not translated in the translation sample set;
a fifth obtaining module 54, configured to determine a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and obtain candidate translation content corresponding to the target word segmentation;
the second check module 55 is configured to determine that translation content corresponding to the original content has missing translation when the translation content does not include candidate translation content of the target participle;
a statistic module 56, configured to count the missed-translation participles in the original content; the missed-translation participles comprise participles for which no candidate translation content exists in the translation content;
the optimization module 57 is configured to perform optimization processing on the machine translation model according to the statistical result;
and the second translation module 58 is used for translating the original content again by using the optimized machine translation model.
In an embodiment, the optimization module 57 is specifically configured to add the missed-translation participle "Peking" and the corresponding candidate translation content "Beijing University" to the training sample library, and then retrain using the new training sample library to obtain a new machine translation model, so as to optimize the machine translation model.
In an embodiment, the second check module 55 may be further specifically configured to: acquiring a third numerical value of the occurrence frequency of the word segmentation in the original content; acquiring a fourth numerical value of the occurrence frequency of the candidate translation content of the participle in the translation content; obtaining a second difference value between the third numerical value and the fourth numerical value; acquiring a second ratio of the second difference to the third value; and if the second ratio is larger than the un-translation probability of the participle, determining that the translation content has translation missing.
In an embodiment, the second check module 55 may be further specifically configured to: counting the participles whose candidate translation content does not exist in the translation content; optimizing the machine translation model according to the statistical result; and translating the original content again by using the optimized machine translation model.
In an embodiment, the second verification module 55 may be further specifically configured to determine the target participle from the at least one participle according to the untranslated probability of the participle by one of the following manners:
mode 1: acquiring an error rate threshold from a preset condition; comparing the untranslated probability of the participle with the error rate threshold; if the untranslated probability of the participle is smaller than the error rate threshold, determining the participle as a target participle;
mode 2: acquiring a check number from a preset condition; sorting the untranslated probabilities of the participles in ascending order; selecting from the sorted result, starting from the smallest, as many untranslated probabilities as the check number; if the untranslated probability of the participle is selected, determining the participle as a target participle;
mode 3: acquiring a check rate from a preset condition; calculating a total check number according to the total number of participles contained in the original content and the check rate; sorting the untranslated probabilities of the participles in ascending order; selecting from the sorted result, starting from the smallest, as many untranslated probabilities as the total check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
In an embodiment, the fifth obtaining module 54 is specifically configured to: taking each translation sentence pair in the translation sample as a user; taking each participle in the translation sample as an item; constructing a user-item matrix; acquiring similar items of each item according to the user-item matrix; and taking the participles corresponding to the similar items of an item as candidate translation contents of the participle corresponding to that item.
In an embodiment, the fifth obtaining module 54 is specifically configured to: calculating cosine similarity among the items by adopting a similarity calculation method; and adopting a collaborative filtering algorithm to obtain similar items of each item based on the cosine similarity among the items.
In an embodiment, the fourth obtaining module 53 is specifically configured to: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
The present invention will now be described by taking a translation system of user social software as an example. Referring to fig. 6, the system includes a user terminal 61 and a social server 62. The social server 62 may have the functions of all the servers in fig. 1. After a user social software client is installed, the user terminal 61 is mainly used to translate English into Chinese, or Chinese into English, when the user interacts with another user.
Scenario 1: the social server determines candidate translation contents of the participles by using a recommendation algorithm.
Specifically, as shown in fig. 7, the translation missing detection method provided in this embodiment includes the following steps:
S701: the social server trains on the translation samples to obtain candidate translation contents of each word.
Recommendation systems are very common intelligent systems, used for example for shopping recommendations, article recommendations, and music recommendations. One commonly used algorithm is collaborative filtering. Collaborative filtering is based on a user-item matrix: the user's preference for unseen items is predicted from existing user rating or purchase information, and items the user may be interested in are recommended. For example, a movie recommendation system may, based on users' movie ratings, find users similar to a given user and use the movie ratings of these similar users to predict the given user's preference for movies he or she has not watched (user-based algorithm). Alternatively, it may first find similar movies for each movie, and then, according to the list of movies the user likes, recommend movies that are similar to them but that the user has not seen (item-based algorithm).
In this embodiment, the social server trains on the translation samples by using a recommendation algorithm to obtain candidate translation contents of each word. Specifically, the translation samples are the 4 Chinese-English sentence pairs shown in fig. 11. The Chinese is segmented, each sentence pair is regarded as a user, and each word is regarded as an item, so as to obtain a corresponding user-item matrix; fig. 12 shows partial contents of an example matrix. From fig. 12 it can be seen that the Chinese participle for "I" is the most similar to the English word "I", and that the Chinese participles for "I" and "yes" are the most similar to "have". As the data increases, the recommended related words come closer to everyday translation.
Assume that this step determines: the candidate translation content "good" of the participle "fine", and the candidate translation content "today" of the participle "today".
S702: the social server obtains the untranslated probability of each word by counting.
The untranslated probability is the probability that a word is left untranslated without causing a translation error or the like; for example, "have" may or may not be translated.
For example, the first value of the number of times that the segmented word "today" appears in the translation sample is 3 times, and the second value of the number of times that the corresponding candidate translation content "today" appears in the translation sample is 3 times; then, a first difference between the first value and the second value is 0, and a first ratio of the first difference to the first value is 0, i.e. the untranslated probability corresponding to the word "today" is 0.
S703: the user terminal sends a translation request.
As shown in fig. 9, the user selects the original content "It's a fine day today" on the interface for translation.
S704: the social server translates the original content to obtain the translation content.
The social server translates the original content "It's a fine day today" into "good day" using a machine translation model.
S705: the social server performs word segmentation processing on the original content to obtain at least one word segmentation.
This step can be implemented by using a conventional word segmentation algorithm, for example, parsing "It's a fine day today" into multiple word segments such as "It's", "a", "fine", "day", and "today".
S706: the social server determines a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation.
Assume that the preset condition includes an error rate threshold of 10%. If the untranslated probability of the participle "It's" is 15%, that of "day" is 25%, and that of "a" is 50%, the participles "It's", "a", and "day" cannot be used as target participles. If the untranslated probability of the participle "fine" is 0.5% and that of the participle "today" is 0%, the participles "fine" and "today" are used as the target participles.
S707: the social server acquires candidate translation contents corresponding to the target word segmentation.
The candidate translation content "good" of the participle "fine" and the candidate translation content "today" of the participle "today" are acquired.
S708: the social server judges whether the translation content has a missing translation.
There is a missing translation, because the translation content is "good day", which does not include the candidate translation content "today" of the participle "today".
S709: the social server re-translates and checks until there is no missing translation.
The social server uses a new machine translation model to translate the original content "It's a fine day today" into "good today". After the missing translation detection, the translation content includes the candidate translation content "today" of the participle "today" and the candidate translation content "good" of the participle "fine", so there is no missing translation.
S710: the social server sends the translation content to the user terminal.
The social server sends the translation content to the user terminal, as shown in fig. 9, the translation content is "good today".
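The detection part of scenario 1 (S705 to S708) can be put together in a short sketch. It reuses the probabilities, threshold and candidates of the example; the whitespace segmentation, the token-level matching and the function name are simplifying assumptions, not the embodiment's implementation.

```python
# Values taken from the example in scenario 1.
UNTRANSLATED_PROB = {"It's": 0.15, "a": 0.50, "fine": 0.005, "day": 0.25, "today": 0.0}
CANDIDATES = {"fine": ["good"], "today": ["today"]}
ERROR_RATE_THRESHOLD = 0.10


def detect_missed(original, translation):
    """Return the target participles whose candidates are absent from the translation."""
    # S705: segment the original content (whitespace split as a stand-in).
    participles = original.split()
    # S706: target participles are those below the error rate threshold;
    # unknown words default to probability 1.0 and are therefore never targets.
    targets = [w for w in participles if UNTRANSLATED_PROB.get(w, 1.0) < ERROR_RATE_THRESHOLD]
    # S707/S708: a target is missed when none of its candidates occurs in the translation.
    translated_tokens = set(translation.split())
    return [
        w for w in targets
        if not any(c in translated_tokens for c in CANDIDATES.get(w, []))
    ]


print(detect_missed("It's a fine day today", "good day"))
print(detect_missed("It's a fine day today", "good today"))
```

The first call reports "today" as missed; the second call reports nothing, matching the outcome after re-translation in S709.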
Scenario 2: the social server determines candidate translation contents of the participles by using a dictionary.
Specifically, as shown in fig. 8, the translation missing detection method provided in this embodiment includes the following steps:
S801: the social server analyzes the translation dictionary to obtain candidate translation contents of each word.
In the embodiment, in order to reduce the system computing load, the social server analyzes the translation dictionary to obtain candidate translation contents of each word.
Assume that this step determines: the candidate translation content "mother" of the participle "mom".
S802: the social server obtains the untranslated probability of each word by counting.
The untranslated probability is the probability that a word is left untranslated without causing a translation error or the like; for example, "have" may or may not be translated.
For example, the first value of the number of times that the participle "mom" appears in the translation sample is 3, and the second value of the number of times that the corresponding candidate translation content "mother" appears in the translation sample is 3; then, the first difference between the first value and the second value is 0, and the first ratio of the first difference to the first value is 0, i.e. the untranslated probability corresponding to the participle "mom" is 0.
S803: the user terminal sends a translation request.
As shown in fig. 10, the user selects the original content "I give your red envelope to your mom" on the interface for translation.
S804: the social server translates the original content to obtain the translation content.
The social server uses a machine translation model to translate the original content "I give your red envelope to your mom" into "I gave you a red envelope".
S805: the social server performs word segmentation processing on the original content to obtain at least one word segmentation.
This step can be implemented by using a conventional word segmentation algorithm, for example, parsing "I give your red envelope to your mom" into a plurality of participles such as "I", "give", "you", "red envelope" and "mom".
S806: the social server determines a target participle from the at least one participle.
Assume that the preset condition includes a check number of 2. If the untranslated probability of the participle "I" is 10%, that of "give" is 10%, that of "you" is 10%, that of "red envelope" is 0%, and that of "mom" is 0%, the participles "red envelope" and "mom" are taken as the target participles.
S807: the social server acquires candidate translation contents corresponding to the target word segmentation.
The candidate translation content "red envelope" of the participle "red envelope" and the candidate translation content "mother" of the participle "mom" are acquired.
S808: the social server judges whether the translation content has a missing translation.
Since the translation content is "I gave your red envelope to you", which does not include the candidate translation content "mother" of the participle "mom", there is a missing translation.
S809: the social server re-translates and checks until there is no missing translation.
The social server uses a new machine translation model to translate the original content "I give your red envelope to your mom" into "I gave your red envelope to your mother". After the missing translation detection, the translation content includes the candidate translation content "red envelope" of the participle "red envelope" and the candidate translation content "mother" of the participle "mom", so there is no missing translation.
S810: the social server sends the translation content to the user terminal.
The social server sends the translation content to the user terminal; as shown in fig. 10, the translation content is "I gave your red envelope to your mother".
Accordingly, an embodiment of the present invention also provides a terminal, as shown in fig. 14, which may include Radio Frequency (RF) circuit 1401, memory 1402 including one or more computer-readable storage media, input unit 1403, display unit 1404, sensor 1405, audio circuit 1406, wireless fidelity (WiFi) module 1407, processor 1408 including one or more processing cores, and power supply 1409. Those skilled in the art will appreciate that the terminal structure shown in fig. 14 is not intended to be limiting and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
The RF circuit 1401 may be used for receiving and transmitting signals during a message or call, and in particular, for receiving downlink information of a base station and then passing the received downlink information to one or more processors 1408 for processing, and further, for transmitting data related to an uplink to the base station. In general, the RF circuit 1401 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM), a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. The RF circuit 1401 may also communicate with other devices through wireless communication, which may use any communication standard or protocol, including, but not limited to, Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Long Term Evolution (LTE), Short Message Service (SMS), and the like.
The memory 1402 may be used to store software programs and modules, and the processor 1408 may execute various functional applications and data processing by operating the software programs and modules stored in the memory 1402. The memory 1402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 1402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 1402 may also include a memory controller to provide access to the memory 1402 by the processor 1408 and the input unit 1403.
The input unit 1403 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in a particular embodiment, input unit 1403 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. In an embodiment, the touch sensitive surface may comprise two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1408, and can receive and execute commands sent from the processor 1408. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 1403 may include other input devices in addition to a touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1404 may include a display panel, which may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. In one embodiment, when a touch operation is detected on or near the touch-sensitive surface, the touch operation is communicated to the processor 1408 to determine the type of touch event, and the processor 1408 then provides a corresponding visual output on the display panel according to the type of touch event.
The terminal may also include at least one sensor 1405, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
Audio circuitry 1406, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 1406 can transmit the electrical signal converted from the received audio data to the speaker, where it is converted into a sound signal and output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 1406 and converted into audio data; the audio data is then output to the processor 1408 for processing and transmitted via the RF circuit 1401 to, for example, another terminal, or output to the memory 1402 for further processing. The audio circuitry 1406 may also include an earbud jack to provide communication between a peripheral headset and the terminal.
WiFi belongs to short-range wireless transmission technology, and the terminal can help the user send and receive e-mail, browse web pages, access streaming media, etc. through the WiFi module 1407, which provides wireless broadband internet access for the user. Although fig. 14 shows the WiFi module 1407, it is understood that it does not belong to the essential constitution of the terminal and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1408 is a control center of the terminal, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 1402 and calling data stored in the memory 1402, thereby performing overall monitoring of the handset. In an embodiment, processor 1408 may include one or more processing cores; preferably, processor 1408 may integrate an application processor that handles primarily operating system, user interface, and applications, etc. and a modem processor that handles primarily wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1408.
The terminal also includes a power supply 1409 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically connected to the processor 1408 via a power management system, so that charging, discharging, and power consumption are managed via the power management system. The power supply 1409 can also include any component such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein.
Specifically, in this embodiment, the processor 1408 in the terminal loads an executable file corresponding to a process of one or more application programs into the memory 1402 according to the following instructions, and the processor 1408 runs the application programs stored in the memory 1402, thereby implementing various functions:
acquiring original content and translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
and when the translation content does not comprise the candidate translation content of the target word segmentation, determining that translation content corresponding to the original content has translation missing.
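The five steps above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: it assumes that the original content and the translation content have already been segmented into lists of words, that the untranslated probabilities and candidate translation contents are available as dictionaries, and that target participles are chosen with an error rate threshold. The function name, the parameter names, and the default threshold of 0.1 are hypothetical.

```python
def detect_missing_translation(source_tokens, translation_tokens,
                               untranslated_prob, candidates,
                               error_rate_threshold=0.1):
    """Return the target participles none of whose candidates were translated.

    untranslated_prob: dict mapping a participle to its untranslated probability.
    candidates: dict mapping a participle to its candidate translation contents.
    error_rate_threshold: hypothetical default, chosen only for illustration.
    """
    translated = set(translation_tokens)
    missing = []
    for token in source_tokens:
        # Participles with no statistics default to 1.0 and are never checked.
        if untranslated_prob.get(token, 1.0) >= error_rate_threshold:
            continue  # not a target participle
        token_candidates = candidates.get(token, [])
        # Translation missing: no candidate translation appears in the translation.
        if token_candidates and not any(c in translated for c in token_candidates):
            missing.append(token)
    return missing
```

A non-empty result corresponds to determining that the translation content has translation missing.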
In one embodiment, the functions are implemented: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
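As a minimal sketch of this ratio, the following assumes that the translation sample is a list of (source words, target words) pairs and that occurrences are counted once per sentence pair; the embodiment does not fix the counting granularity, so that choice, like the function and parameter names, is an assumption made for illustration.

```python
def untranslated_probability(token, candidates, sample_pairs):
    """Compute (first - second) / first for one participle.

    sample_pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns None when the participle never occurs in the sample.
    """
    first = 0   # sentence pairs whose source side contains the participle
    second = 0  # of those, pairs whose target side contains a candidate
    for source_tokens, target_tokens in sample_pairs:
        if token in source_tokens:
            first += 1
            if any(c in target_tokens for c in candidates):
                second += 1
    if first == 0:
        return None
    first_difference = first - second
    return first_difference / first
```

For example, a participle that occurs in three sample pairs and is rendered by one of its candidates in two of them has an untranslated probability of 1/3.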
In one embodiment, the functions are implemented: taking each translation sentence pair in the translation sample as a user; taking each participle in the translation sample as an item; constructing a user-item matrix; acquiring similar items of each item according to the user-item matrix; and taking the participles corresponding to the similar items of an item as candidate translation content of the participle corresponding to that item.
In one embodiment, the functions are implemented: calculating cosine similarity among the items by adopting a similarity calculation method; and adopting a collaborative filtering algorithm to obtain similar items of each item based on the cosine similarity among the items.
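The item-based collaborative filtering described in the two embodiments above might look like the sketch below. It assumes a binary user-item matrix (an entry is 1 when the word occurs in the sentence pair), stores each matrix column as the set of sentence-pair indices containing the word, and only compares source-language items with target-language items. For binary columns the cosine similarity reduces to the size of the intersection divided by the square root of the product of the set sizes. The function name and the top_n parameter are hypothetical.

```python
import math
from collections import defaultdict


def candidate_translations(sample_pairs, top_n=2):
    """Map each source participle to its top_n most similar target words.

    sample_pairs: list of (source_tokens, target_tokens); each pair is a "user".
    """
    source_users = defaultdict(set)  # source item -> indices of pairs containing it
    target_users = defaultdict(set)  # target item -> indices of pairs containing it
    for user, (source_tokens, target_tokens) in enumerate(sample_pairs):
        for token in source_tokens:
            source_users[token].add(user)
        for token in target_tokens:
            target_users[token].add(user)

    result = {}
    for src, src_set in source_users.items():
        scored = []
        for tgt, tgt_set in target_users.items():
            overlap = len(src_set & tgt_set)
            if overlap:
                # Cosine similarity of two binary columns of the matrix.
                cosine = overlap / math.sqrt(len(src_set) * len(tgt_set))
                scored.append((cosine, tgt))
        # Highest similarity first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[src] = [tgt for _, tgt in scored[:top_n]]
    return result
```

Words that co-occur with a participle in most of the sentence pairs where either appears receive the highest similarity and become its candidate translation content.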
In one embodiment, the functions are implemented: acquiring an error rate threshold value in a preset condition; comparing the untranslated probability of the participle with the error rate threshold; and if the untranslated probability of the participle is smaller than the error rate threshold, determining the participle as a target participle.
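A minimal sketch of this threshold comparison, assuming the untranslated probabilities are held in a dictionary keyed by participle (the function and parameter names are hypothetical):

```python
def select_by_threshold(untranslated_prob, error_rate_threshold):
    """Keep the participles whose untranslated probability is below the threshold."""
    return [token for token, prob in untranslated_prob.items()
            if prob < error_rate_threshold]
```

A participle with a low untranslated probability is almost always translated in the sample, so its absence from a translation is a strong signal of translation missing.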
In one embodiment, the functions are implemented: acquiring a check number in a preset condition; sorting the untranslated probabilities of the participles from small to large; selecting, from the sorted result in ascending order, as many untranslated probabilities as the check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
In one embodiment, the functions are implemented: acquiring a check rate in a preset condition; calculating a total check number according to the total number of participles contained in the original content and the check rate; sorting the untranslated probabilities of the participles from small to large; selecting, from the sorted result in ascending order, as many untranslated probabilities as the total check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
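The two selection strategies above, by check number and by check rate, can be sketched together. The sketch assumes that the dictionary holds one entry per participle of the original content and that the total check number is rounded up; the embodiment does not state a rounding rule, so the use of a ceiling is an assumption, as are the function names.

```python
import math


def select_by_check_number(untranslated_prob, check_number):
    """Pick the check_number participles with the smallest untranslated probability."""
    ranked = sorted(untranslated_prob, key=untranslated_prob.get)  # ascending
    return ranked[:check_number]


def select_by_check_rate(untranslated_prob, check_rate):
    """Derive the total check number from the check rate, then select as above."""
    total_check_number = math.ceil(len(untranslated_prob) * check_rate)
    return select_by_check_number(untranslated_prob, total_check_number)
```

Unlike a fixed threshold, both variants bound how many participles are checked per text, which keeps the cost of verification predictable.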
In one embodiment, the functions are implemented: counting the participles whose candidate translation content does not exist in the translation content; optimizing the machine translation model according to the statistical result; and translating the original content again by using the optimized machine translation model.
In one embodiment, the functions are implemented: acquiring a third numerical value of the number of occurrences of the participle in the original content; acquiring a fourth numerical value of the number of occurrences of the candidate translation content of the participle in the translation content; obtaining a second difference value between the third numerical value and the fourth numerical value; acquiring a second ratio of the second difference value to the third numerical value; and if the second ratio is larger than the untranslated probability of the participle, determining that the translation content has translation missing.
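The occurrence-count comparison in the preceding embodiment can be sketched as follows, assuming the original content and the translation content are both available as word lists. The function and parameter names are hypothetical.

```python
def has_missing_by_ratio(token, candidates, source_tokens,
                         translation_tokens, untranslated_prob):
    """Compare the observed miss ratio of one participle with its untranslated probability."""
    third = source_tokens.count(token)  # occurrences in the original content
    if third == 0:
        return False  # the participle does not occur, so nothing can be missing
    # Occurrences of any candidate translation in the translation content.
    fourth = sum(translation_tokens.count(c) for c in candidates)
    second_difference = third - fourth
    second_ratio = second_difference / third
    return second_ratio > untranslated_prob
```

This check catches the case where a participle occurs several times in the original content but its candidate translation appears fewer times in the translation, which a simple presence test would not detect.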
Specifically, in an embodiment, the processor 1408 in the terminal loads an executable file corresponding to one or more processes of the application program into the memory 1402 according to the following instructions, and the processor 1408 runs the application program stored in the memory 1402, thereby implementing various functions:
translating the original content by using a machine translation model to obtain translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
when the translation content does not comprise the candidate translation content of the target participle, determining that translation content corresponding to the original content has translation missing;
counting the translation-missing participles in the original content; the translation-missing participles comprise participles whose candidate translation content does not exist in the translation content;
optimizing the machine translation model according to the statistical result;
and translating the original content again by using the optimized machine translation model.
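The overall flow of the translation method above (translate, detect, count, optimize, translate again) can be outlined as below. The embodiment does not specify here how the machine translation model is optimized, so the translation model, the detector, and the optimizer are all passed in as hypothetical callables; only the control flow is illustrated.

```python
from collections import Counter


def translate_with_missing_check(original, translate, detect_missing, optimize):
    """Translate, detect translation missing, and retranslate if needed.

    translate(original) -> translation content
    detect_missing(original, translation) -> list of translation-missing participles
    optimize(translate, stats) -> an optimized translate callable
    All three callables are placeholders supplied by the caller.
    """
    translation = translate(original)
    missing = detect_missing(original, translation)
    if not missing:
        return translation, Counter()
    stats = Counter(missing)  # statistical result over the missing participles
    optimized_translate = optimize(translate, stats)
    return optimized_translate(original), stats
```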
Accordingly, embodiments of the present invention also provide a server, as shown in FIG. 15, which may include a memory 1501 including one or more computer-readable storage media, a processor 1502 including one or more processing cores, and the like. Those skilled in the art will appreciate that the architecture shown in FIG. 15 does not limit the server, which may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
the memory 1501 may be used to store software programs and modules, and the processor 1502 executes various functional applications and data processing by running the software programs and modules stored in the memory 1501. The memory 1501 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the server. Further, the memory 1501 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 1501 may also include a memory controller to provide the processor 1502 with access to the memory 1501.
Specifically, in this embodiment, the processor 1502 in the server loads the executable file corresponding to the process of one or more application programs into the memory 1501 according to the following instructions, and the processor 1502 runs the application programs stored in the memory 1501, thereby implementing various functions:
acquiring original content and translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
and when the translation content does not comprise the candidate translation content of the target word segmentation, determining that translation content corresponding to the original content has translation missing.
In one embodiment, the functions are implemented: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
In one embodiment, the functions are implemented: taking each translation sentence pair in the translation sample as a user; taking each participle in the translation sample as an item; constructing a user-item matrix; acquiring similar items of each item according to the user-item matrix; and taking the participles corresponding to the similar items of an item as candidate translation content of the participle corresponding to that item.
In one embodiment, the functions are implemented: calculating cosine similarity among the items by adopting a similarity calculation method; and adopting a collaborative filtering algorithm to obtain similar items of each item based on the cosine similarity among the items.
In one embodiment, the functions are implemented: acquiring an error rate threshold value in a preset condition; comparing the untranslated probability of the participle with the error rate threshold; and if the untranslated probability of the participle is smaller than the error rate threshold, determining the participle as a target participle.
In one embodiment, the functions are implemented: acquiring a check number in a preset condition; sorting the untranslated probabilities of the participles from small to large; selecting, from the sorted result in ascending order, as many untranslated probabilities as the check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
In one embodiment, the functions are implemented: acquiring a check rate in a preset condition; calculating a total check number according to the total number of participles contained in the original content and the check rate; sorting the untranslated probabilities of the participles from small to large; selecting, from the sorted result in ascending order, as many untranslated probabilities as the total check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
In one embodiment, the functions are implemented: counting the participles whose candidate translation content does not exist in the translation content; optimizing the machine translation model according to the statistical result; and translating the original content again by using the optimized machine translation model.
In one embodiment, the functions are implemented: acquiring a third numerical value of the number of occurrences of the participle in the original content; acquiring a fourth numerical value of the number of occurrences of the candidate translation content of the participle in the translation content; obtaining a second difference value between the third numerical value and the fourth numerical value; acquiring a second ratio of the second difference value to the third numerical value; and if the second ratio is larger than the untranslated probability of the participle, determining that the translation content has translation missing.
Specifically, in an embodiment, the processor 1502 in the server loads the executable file corresponding to the process of one or more application programs into the memory 1501 according to the following instructions, and the processor 1502 runs the application programs stored in the memory 1501, thereby implementing various functions:
translating the original content by using a machine translation model to obtain translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
when the translation content does not comprise the candidate translation content of the target participle, determining that translation content corresponding to the original content has translation missing;
counting the translation-missing participles in the original content; the translation-missing participles comprise participles whose candidate translation content does not exist in the translation content;
optimizing the machine translation model according to the statistical result;
and translating the original content again by using the optimized machine translation model.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a given embodiment, refer to the detailed description of the translation missing detection method and the translation method above, which is not repeated here.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium having stored therein a plurality of instructions, which can be loaded by a processor to perform the steps of any of the translation missing detection methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring original content and translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
and when the translation content does not comprise the candidate translation content of the target word segmentation, determining that translation content corresponding to the original content has translation missing.
In one embodiment, the functions are implemented: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
In one embodiment, the functions are implemented: taking each translation sentence pair in the translation sample as a user; taking each participle in the translation sample as an item; constructing a user-item matrix; acquiring similar items of each item according to the user-item matrix; and taking the participles corresponding to the similar items of an item as candidate translation content of the participle corresponding to that item.
In one embodiment, the functions are implemented: calculating cosine similarity among the items by adopting a similarity calculation method; and adopting a collaborative filtering algorithm to obtain similar items of each item based on the cosine similarity among the items.
In one embodiment, the functions are implemented: acquiring an error rate threshold value in a preset condition; comparing the untranslated probability of the participle with the error rate threshold; and if the untranslated probability of the participle is smaller than the error rate threshold, determining the participle as a target participle.
In one embodiment, the functions are implemented: acquiring a check number in a preset condition; sorting the untranslated probabilities of the participles from small to large; selecting, from the sorted result in ascending order, as many untranslated probabilities as the check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
In one embodiment, the functions are implemented: acquiring a check rate in a preset condition; calculating a total check number according to the total number of participles contained in the original content and the check rate; sorting the untranslated probabilities of the participles from small to large; selecting, from the sorted result in ascending order, as many untranslated probabilities as the total check number; and if the untranslated probability of the participle is selected, determining the participle as a target participle.
In one embodiment, the functions are implemented: counting the participles whose candidate translation content does not exist in the translation content; optimizing the machine translation model according to the statistical result; and translating the original content again by using the optimized machine translation model.
In one embodiment, the functions are implemented: acquiring a third numerical value of the number of occurrences of the participle in the original content; acquiring a fourth numerical value of the number of occurrences of the candidate translation content of the participle in the translation content; obtaining a second difference value between the third numerical value and the fourth numerical value; acquiring a second ratio of the second difference value to the third numerical value; and if the second ratio is larger than the untranslated probability of the participle, determining that the translation content has translation missing.
To this end, the embodiment of the present invention provides a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in the translation method provided by the embodiment of the present invention. For example, the instructions may perform the steps of:
translating the original content by using a machine translation model to obtain translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation, and acquiring candidate translation content corresponding to the target word segmentation;
when the translation content does not comprise the candidate translation content of the target participle, determining that translation content corresponding to the original content has translation missing;
counting the translation-missing participles in the original content; the translation-missing participles comprise participles whose candidate translation content does not exist in the translation content;
optimizing the machine translation model according to the statistical result;
and translating the original content again by using the optimized machine translation model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the storage medium can execute the steps of any translation missing detection method or translation method provided by the embodiments of the present invention, they can achieve the beneficial effects of those methods, which are detailed in the foregoing embodiments and are not described herein again.
The translation missing detection and translation methods and apparatuses, the server, and the storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and core idea of the present invention. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A translation missing detection method, comprising:
acquiring original content and translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation and a preset condition, and acquiring candidate translation content corresponding to the target word segmentation;
when the translation content does not comprise the candidate translation content of the target participle, determining that translation content corresponding to the original content has translation missing;
wherein the step of obtaining the untranslated probability of the participle comprises the following steps: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
2. The translation missing detection method of claim 1, wherein the step of obtaining candidate translation content corresponding to the target participle comprises:
taking each translation sentence pair in the translation sample as a user;
taking each participle in the translation sample as an item;
constructing a user-item matrix;
acquiring similar items of each item according to the user-item matrix;
and taking the participles corresponding to the similar items of the items as candidate translation contents of the participles corresponding to the items.
3. The translation missing detection method of claim 2, wherein the step of obtaining similar items of each item according to the user-item matrix comprises:
calculating cosine similarity among the items by adopting a similarity calculation method;
and adopting a collaborative filtering algorithm to obtain similar items of each item based on the cosine similarity among the items.
4. The translation missing detection method of claim 1, wherein the step of determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation comprises:
acquiring an error rate threshold value in a preset condition;
comparing the untranslated probability of the participle with the error rate threshold;
and if the untranslated probability of the participle is smaller than the error rate threshold, determining the participle as a target participle.
5. The translation missing detection method of claim 1, wherein the step of determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation comprises:
acquiring a check number in a preset condition;
sorting the untranslated probabilities of the participles from small to large;
selecting, from the sorted result in ascending order, as many untranslated probabilities as the check number;
and if the untranslated probability of the participle is selected, determining the participle as a target participle.
6. The translation missing detection method of claim 1, wherein the step of determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation comprises:
acquiring a check rate in a preset condition;
calculating a total check number according to the total number of participles contained in the original content and the check rate;
sorting the untranslated probabilities of the participles from small to large;
selecting, from the sorted result in ascending order, as many untranslated probabilities as the total check number;
and if the untranslated probability of the participle is selected, determining the participle as a target participle.
7. The translation missing detection method of any of claims 1 to 6, further comprising, after the step of obtaining candidate translation content corresponding to the target participle:
acquiring a third numerical value of the number of occurrences of the participle in the original content;
acquiring a fourth numerical value of the number of occurrences of the candidate translation content of the participle in the translation content;
obtaining a second difference value between the third numerical value and the fourth numerical value;
acquiring a second ratio of the second difference value to the third numerical value;
and if the second ratio is larger than the untranslated probability of the participle, determining that the translation content has translation missing.
8. A method of translation, comprising:
translating the original content by using a machine translation model to obtain translation content corresponding to the original content;
performing word segmentation processing on the original content to obtain at least one word segmentation;
obtaining the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation and a preset condition, and acquiring candidate translation content corresponding to the target word segmentation;
when the translation content does not comprise the candidate translation content of the target participle, determining that translation content corresponding to the original content has translation missing;
counting the translation-missing participles in the original content; the translation-missing participles comprise participles whose candidate translation content does not exist in the translation content;
optimizing the machine translation model according to the statistical result;
translating the original content again by using the optimized machine translation model;
wherein the step of obtaining the untranslated probability of the participle comprises the following steps: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
9. A translation missing detection apparatus, comprising:
the first acquisition module is used for acquiring original content and translation content corresponding to the original content;
the first analysis module is used for performing word segmentation processing on the original content to obtain at least one word segmentation;
the second acquisition module is used for acquiring the untranslated probability of the participle; the untranslated probability is the probability that the participle is not translated in the translation sample set;
the third obtaining module is used for determining a target word segmentation from the at least one word segmentation according to the untranslated probability of the word segmentation and a preset condition and obtaining candidate translation contents corresponding to the target word segmentation;
the first checking module is used for determining that translation content corresponding to the original content has translation missing when the translation content does not comprise candidate translation content of the target word segmentation;
the second obtaining module is specifically configured to: counting a first numerical value of the occurrence times of the participles in the translation sample; counting a second numerical value of the occurrence times of the candidate translation content corresponding to the participle in the translation sample; acquiring a first difference value between the first numerical value and the second numerical value; and acquiring a first ratio of the first difference value to the first numerical value as the corresponding untranslated probability of the participle.
10. A translation apparatus, comprising:
a first translation module, configured to translate the original content by using a machine translation model to obtain translation content corresponding to the original content;
a second analysis module, configured to perform word segmentation on the original content to obtain at least one word segment;
a fourth obtaining module, configured to obtain the untranslated probability of the word segment, the untranslated probability being the probability that the word segment is left untranslated in the translation sample set;
a fifth obtaining module, configured to determine a target word segment from the at least one word segment according to the untranslated probability of the word segment and a preset condition, and to obtain candidate translation content corresponding to the target word segment;
a second checking module, configured to determine that the translation content corresponding to the original content has a missing translation when the translation content does not include the candidate translation content of the target word segment;
a statistics module, configured to count the missing-translation word segments in the original content, the missing-translation word segments including word segments whose candidate translation content is absent from the translation content;
an optimization module, configured to optimize the machine translation model according to the statistical result;
a second translation module, configured to translate the original content again by using the optimized machine translation model;
wherein the fourth obtaining module is specifically configured to: count a first value, being the number of times the word segment occurs in the translation samples; count a second value, being the number of times the candidate translation content corresponding to the word segment occurs in the translation samples; obtain a first difference between the first value and the second value; and obtain a first ratio of the first difference to the first value as the untranslated probability of the word segment.
11. A server comprising a processor and a memory, the memory storing a plurality of instructions adapted to be loaded by the processor to perform the steps of the translation missing detection method according to any one of claims 1 to 7 or to perform the steps of the translation method according to claim 8.
12. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the translation miss detection method according to any one of claims 1 to 7 or to perform the steps of the translation method according to claim 8.
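
For illustration only, the untranslated-probability computation and the missing-translation check recited in the apparatus claims above might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the sample data, the candidate dictionary, and the threshold of 0.2 used as the "preset condition" are all hypothetical, and occurrences are counted once per sample pair, which is one possible reading of the first and second values.

```python
# Hypothetical translation sample set: (segmented source, lower-cased translation).
SAMPLES = [
    (["我", "喜欢", "苹果"], "i like apples"),
    (["苹果", "很", "甜"], "the apple is sweet"),
    (["他", "的", "苹果"], "his apple"),
    (["我", "的", "书"], "my book"),
]

# Hypothetical candidate translation content for each word segment.
CANDIDATES = {"苹果": ["apple"], "的": ["of"]}


def untranslated_probability(word, candidates, samples):
    """Return (first - second) / first, or None if the word never occurs."""
    first_value = 0   # samples whose source contains the word segment
    second_value = 0  # of those, samples whose translation contains a candidate
    for source_tokens, translation in samples:
        if word not in source_tokens:
            continue
        first_value += 1
        if any(c in translation for c in candidates):
            second_value += 1
    if first_value == 0:
        return None
    return (first_value - second_value) / first_value


def find_missing(source_tokens, translation, candidates, samples, threshold=0.2):
    """List target word segments whose candidates are absent from the translation."""
    missing = []
    for word in source_tokens:
        cands = candidates.get(word)
        if not cands:
            continue  # no candidate translation content known for this segment
        p = untranslated_probability(word, cands, samples)
        # Assumed preset condition: only segments that are usually translated
        # (low untranslated probability) are treated as target word segments.
        if p is None or p > threshold:
            continue
        if not any(c in translation for c in cands):
            missing.append(word)
    return missing
```

Under these assumptions, a segment such as "的", which is rarely rendered by a candidate in the samples, receives a high untranslated probability and is excluded from the check, so its absence from a translation is not reported as a missing translation.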

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810473017.7A CN108763222B (en) 2018-05-17 2018-05-17 Translation missing detection and translation method and device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810473017.7A CN108763222B (en) 2018-05-17 2018-05-17 Translation missing detection and translation method and device, server and storage medium

Publications (2)

Publication Number Publication Date
CN108763222A CN108763222A (en) 2018-11-06
CN108763222B CN108763222B (en) 2020-08-04

Family

ID=64008371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810473017.7A Active CN108763222B (en) 2018-05-17 2018-05-17 Translation missing detection and translation method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN108763222B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931519B (en) * 2019-04-28 2023-11-17 阿里巴巴集团控股有限公司 Translation evaluation method and device, storage medium and processor
CN110414013B (en) * 2019-07-31 2024-06-21 腾讯科技(深圳)有限公司 Data processing method and device and electronic equipment
CN114936566A (en) * 2022-04-26 2022-08-23 北京百度网讯科技有限公司 Machine translation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950286A (en) * 2010-09-14 2011-01-19 传神联合(北京)信息技术有限公司 Error correction module and method in software translation system
CN103116578A (en) * 2013-02-07 2013-05-22 北京赛迪翻译技术有限公司 Translation method integrating syntactic tree and statistical machine translation technology and translation device
KR20130102926A (en) * 2012-03-08 2013-09-23 한국전자통신연구원 Method and apparatus of ellipsis component restoration for chinese machine translation, method and apparatus for chinese machine translation for comprising the same
CN108009158A (en) * 2017-11-27 2018-05-08 环宇爱译(北京)信息技术有限责任公司 Interaction prompts interpretation method, device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108763222A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN110472251B (en) Translation model training method, sentence translation equipment and storage medium
US20170091335A1 (en) Search method, server and client
EP3113035A1 (en) Method and apparatus for grouping contacts
CN106294308B (en) Named entity identification method and device
US20170109435A1 (en) Apparatus and method for searching for information
JP2018536920A (en) Text information processing method and device
CN108763222B (en) Translation missing detection and translation method and device, server and storage medium
CN104239535A (en) Method and system for matching pictures with characters, server and terminal
CN111061383B (en) Text detection method and electronic equipment
CN104951432A (en) Information processing method and device
CN109543014B (en) Man-machine conversation method, device, terminal and server
CN107885718B (en) Semantic determination method and device
CN111078986B (en) Data retrieval method, device and computer readable storage medium
US20150310119A1 (en) Systems and Methods for Filtering Microblogs
CN103501487A (en) Method, device, terminal, server and system for updating classifier
WO2015096660A1 (en) Methods and devices for displaying a webpage
CN104391588B (en) A kind of method of input prompt and device
CN113704008A (en) Anomaly detection method, problem diagnosis method and related products
CN110781274A (en) Question-answer pair generation method and device
EP3951622A1 (en) Image-based search method, server, terminal, and medium
CN115730047A (en) Intelligent question-answering method, equipment, device and storage medium
CN108897774B (en) Method, device and storage medium for acquiring news hotspots
CN113505596A (en) Topic switching marking method and device and computer equipment
CN113704447B (en) Text information identification method and related device
CN112988406B (en) Remote calling method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant