CN112347757A - Parallel corpus alignment method, device, equipment and storage medium

Info

Publication number
CN112347757A
Authority
CN
China
Prior art keywords
target
paragraphs
translation
sentence
target original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011087653.XA
Other languages
Chinese (zh)
Inventor
陈秋霖
朱宪超
邓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Lan Bridge Information Technology Co ltd
Original Assignee
Sichuan Lan Bridge Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Lan Bridge Information Technology Co ltd filed Critical Sichuan Lan Bridge Information Technology Co ltd
Priority to CN202011087653.XA priority Critical patent/CN112347757A/en
Publication of CN112347757A publication Critical patent/CN112347757A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a parallel corpus alignment method, device, equipment and storage medium. The method comprises: identifying the languages of target translation paragraphs and target original paragraphs by using ASCII codes; extracting preset specific context sentences; splitting the target translation paragraphs and the target original paragraphs into sentences; and inserting the extracted preset specific context sentences back into the corresponding positions of the target translation sentences and the target original sentences, and aligning the target translation sentences with the target original sentences one by one. Because the languages of the target translation paragraphs and the target original paragraphs are identified through ASCII codes, the preset sentence-breaking rule of the corresponding language is matched, and the preset specific context sentences are extracted and set aside before the sentences are split, sentences can be broken accurately, the accuracy of corpus alignment is greatly improved, and the range of application is wide. The method and the device address the technical problem in the related art that a single sentence-breaking rule cannot cope with the current complex corpus environment, and that obvious sentence-breaking errors directly affect the alignment effect, so that the original text and the translation cannot be aligned or are misaligned.

Description

Parallel corpus alignment method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a parallel corpus alignment method, apparatus, device, and storage medium.
Background
Existing parallel corpus alignment methods generally parse the original text and its translation with a text parsing tool, split each of them into paragraphs according to paragraph identifiers, split each paragraph into sentences at obvious sentence-break symbols, and finally match the sentences of the original text and the translation one by one.
However, existing parallel corpus alignment methods place strict requirements on the supported languages, and typically handle only a single language and a single file format. Accurate sentence splitting is the precondition of corpus alignment, but the single sentence-breaking rule used in existing corpus alignment cannot cope with the current complex corpus environment.
Consider splitting on a single punctuation mark. In English, text containing dotted expressions such as "Mr. Green" or "U.S." is wrongly broken into separate sentences at each period. Chinese text has similar pitfalls, for example with quoted sentences such as "How are you?" and "Today's weather is very good!". Split according to the existing parallel corpus alignment method, these are divided into "How are you?", "?", "Today's weather is very good!", "!" and ".". Such obvious sentence-break errors directly affect the subsequent alignment, so that the original text and the translation cannot be aligned or are misaligned.
No effective solution has yet been proposed for the problem in the related art that a single sentence-breaking rule cannot cope with the current complex corpus environment, and that obvious sentence-breaking errors directly affect the alignment effect, so that the original text and the translation cannot be aligned or are misaligned.
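The failure described above can be reproduced with a minimal sketch. The function name `naive_split` and the choice of terminators are illustrative assumptions, not part of the application; the sketch only shows why a single punctuation-based rule breaks dotted abbreviations.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after every '.', '!' or '?' with no awareness of abbreviations."""
    # Each fragment is a run of non-terminators plus the terminator that follows it.
    parts = re.findall(r"[^.!?]+[.!?]?", text)
    return [p.strip() for p in parts if p.strip()]


# One English sentence is broken into four fragments.
print(naive_split("Mr. Green lives in the U.S. now."))
# ['Mr.', 'Green lives in the U.', 'S.', 'now.']
```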
Disclosure of Invention
The present application mainly aims to provide a parallel corpus alignment method, apparatus, device and storage medium, so as to solve the problem in the related art that a single sentence-breaking rule cannot cope with the current complex corpus environment, and that obvious sentence-breaking errors directly affect the alignment effect, so that the original text and the translation cannot be aligned or are misaligned.
In order to achieve the above object, in a first aspect, the present application provides a parallel corpus alignment method.
The method according to the application comprises the following steps:
acquiring a target translation file and a target original file;
respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identifying the languages of the plurality of target translation paragraphs by using American Standard Code for Information Interchange (ASCII) codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
splitting, according to the preset sentence-breaking rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted into sentences, so as to obtain a plurality of target translation sentences and a plurality of target original sentences;
and inserting the extracted preset specific context sentences into the corresponding positions of the target translation sentences and the target original sentences, and aligning the target translation sentences and the target original sentences, with the preset specific context sentences inserted, one by one to complete the corpus alignment of the target translation file and the target original file.
In a possible implementation manner of the present application, the preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file includes:
respectively carrying out file analysis on the target translation file and the target original file, unifying character codes of the target translation file and the target original file, and obtaining a target translation code and a target original code;
deleting identifiers matched with preset deletion symbols in the target translation codes and the target original text codes;
and according to the paragraph identifier, carrying out paragraph splitting on the target translation code and the target original code from which the identifier is deleted to obtain a plurality of target translation paragraphs and a plurality of target original paragraphs.
In one possible implementation of the present application, the preset deletion symbols include a Chinese space character and a Western space character.
In one possible implementation of the present application, identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain the first language set includes:
identifying the characters of the plurality of target translation paragraphs according to ASCII codes;
if the proportion of characters in a target translation paragraph whose ASCII codes correspond to a given language is greater than a preset threshold, determining that language as the first language of the target translation paragraph, the first languages of all target translation paragraphs forming the first language set.
In one possible implementation of the present application, identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain the second language set includes:
identifying the characters of the plurality of target original paragraphs according to ASCII codes;
and if the proportion of characters in a target original paragraph whose ASCII codes correspond to a given language is greater than a preset threshold, determining that language as the second language of the target original paragraph, the second languages of all target original paragraphs forming the second language set.
In one possible implementation of the present application, the preset specific context sentences include one or more of an abbreviation, a name of a person, a chapter mark, and network literature.
In one possible implementation of the present application, the preset sentence-breaking rules include a quotation-mark-free rule and a rule for sentence-break symbols inside quotation marks. The quotation-mark-free rule includes splitting a paragraph into sentences at the sentence-break symbols. The rule for sentence-break symbols inside quotation marks includes: first segmenting the paragraph at the quotation marks, and then splitting the remaining text according to the quotation-mark-free rule.
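The threshold test described above can be sketched as follows. This is a minimal illustration under stated assumptions: the language labels "en" and "zh", the default threshold of 0.5, and the function names are all hypothetical, and the application does not specify how the ratio is computed beyond comparing it with a preset threshold.

```python
def ascii_ratio(paragraph: str) -> float:
    """Share of non-whitespace characters whose code point is in the ASCII range."""
    chars = [c for c in paragraph if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) < 0x80 for c in chars) / len(chars)


def identify_language(paragraph: str, threshold: float = 0.5) -> str:
    """Label a paragraph 'en' when ASCII characters dominate, otherwise 'zh' (assumed labels)."""
    return "en" if ascii_ratio(paragraph) > threshold else "zh"


def language_set(paragraphs: list[str], threshold: float = 0.5) -> list[str]:
    """One language label per paragraph, in paragraph order."""
    return [identify_language(p, threshold) for p in paragraphs]


print(language_set(["The weather is good today.", "今天天气真好!"]))
```

Keeping one label per paragraph, rather than a deduplicated set, lets the later sentence-breaking step look up the rule for each paragraph by position.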
In a second aspect, the present application further provides a parallel corpus alignment apparatus, comprising:
the acquisition module is used for acquiring the target translation file and the target original file;
the processing module is used for respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
splitting, according to the preset sentence-breaking rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted into sentences, so as to obtain a plurality of target translation sentences and a plurality of target original sentences;
and the output module is used for inserting the extracted preset specific context sentences into the corresponding positions of the target translation sentences and the target original sentences, and aligning the target translation sentences and the target original sentences, with the preset specific context sentences inserted, one by one to complete the corpus alignment of the target translation file and the target original file.
In one possible implementation manner of the present application, the processing module is specifically configured to:
respectively carrying out file analysis on the target translation file and the target original file, unifying character codes of the target translation file and the target original file, and obtaining a target translation code and a target original code;
deleting identifiers matched with preset deletion symbols in the target translation codes and the target original text codes;
and according to the paragraph identifier, carrying out paragraph splitting on the target translation code and the target original code from which the identifier is deleted to obtain a plurality of target translation paragraphs and a plurality of target original paragraphs.
In one possible implementation manner of the present application, the processing module is further specifically configured to:
identifying the characters of the plurality of target translation paragraphs according to ASCII codes;
if the proportion of characters in a target translation paragraph whose ASCII codes correspond to a given language is greater than a preset threshold, determining that language as the first language of the target translation paragraph, the first languages of all target translation paragraphs forming the first language set.
In one possible implementation manner of the present application, the processing module is further specifically configured to:
identifying the characters of the plurality of target original paragraphs according to ASCII codes;
and if the proportion of characters in a target original paragraph whose ASCII codes correspond to a given language is greater than a preset threshold, determining that language as the second language of the target original paragraph, the second languages of all target original paragraphs forming the second language set.
In a third aspect, the present application further provides an electronic device for aligning parallel corpora, where the electronic device includes:
one or more processors;
a memory; and
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the parallel corpus alignment method of any one of the first aspects.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is loaded by a processor to execute the steps in the parallel corpus alignment method according to any one of the first aspect.
In the embodiment of the application, a parallel corpus alignment method is provided. The languages of the target translation paragraphs and the target original paragraphs are identified through ASCII codes, and the preset sentence-breaking rule of the corresponding language is matched according to the identified language. Before the paragraphs are split according to the preset sentence-breaking rule, the preset specific context sentences in the target translation paragraphs and the target original paragraphs are searched for and extracted, so that the special contexts are set aside; after the paragraphs are split, the extracted preset specific context sentences are restored to their corresponding positions. In this way accurate sentence breaking can be ensured, the accuracy of corpus alignment is greatly improved, and the range of application is wide. This in turn solves the technical problem in the related art that a single sentence-breaking rule cannot cope with the current complex corpus environment, and that obvious sentence-breaking errors directly affect the alignment effect, so that the original text and the translation cannot be aligned or are misaligned.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a flowchart illustrating an embodiment of a parallel corpus alignment method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating an embodiment of obtaining a plurality of target translation paragraphs and a plurality of target original paragraphs according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating an embodiment of obtaining a first language set according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram illustrating an embodiment of a parallel corpus alignment apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of an electronic device for aligning parallel corpora according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
In addition, the term "plurality" shall mean two as well as more than two.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
First, an embodiment of the present application provides a parallel corpus alignment method. The execution subject of the parallel corpus alignment method is a parallel corpus alignment device, which is applied to a processor. The parallel corpus alignment method includes: acquiring a target translation file and a target original file; respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file; identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set; searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs; splitting, according to the preset sentence-breaking rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted into sentences, so as to obtain a plurality of target translation sentences and a plurality of target original sentences; and inserting the extracted preset specific context sentences into the corresponding positions of the target translation sentences and the target original sentences, and aligning the target translation sentences and the target original sentences, with the preset specific context sentences inserted, one by one to complete the corpus alignment of the target translation file and the target original file.
Referring to FIG. 1, which is a schematic flow chart of an embodiment of a parallel corpus alignment method according to an embodiment of the present application, the parallel corpus alignment method includes:
101. Acquiring the target translation file and the target original file.
In the embodiment of the present application, the goal is to perform corpus alignment on a target translation file and a target original file, where the target translation file is obtained by translating the target original file. For example, the target original file may be an English file, and the target translation file may be a Chinese file obtained by translating the English file into Chinese.
102. And respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file.
In this embodiment of the present application, the preprocessing the target translation file and the target original file may be to convert the target translation file and the target original file into a uniform character coding format, and then perform paragraph splitting on corresponding codes, so as to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file.
103. Identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; and identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set.
Language identification and encoding identification are basic steps of natural language processing: language identification determines the language in which a text is written, and encoding identification determines the character encoding of a text. Besides ASCII, encodings such as the 8-bit Unicode Transformation Format (UTF-8) and the Chinese character set encoding GBK may also be used. The byte patterns of the encodings differ: ASCII text consists essentially of bytes smaller than 0x80, whereas GBK text consists mostly of two-byte sequences in which the first byte is greater than 0x80 and the second byte is generally 0x40 or greater.
In the embodiment of the application, the language of each target translation paragraph is identified by using ASCII codes to obtain the language corresponding to that paragraph, and the first language set is composed of the languages corresponding to the target translation paragraphs. Similarly, the language of each target original paragraph is identified by using ASCII codes to obtain the language corresponding to that paragraph, and the second language set is composed of the languages corresponding to the target original paragraphs.
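The byte patterns mentioned above can be checked with a small heuristic sketch. The function names are hypothetical, and the checks only encode the ranges stated in the text (ASCII bytes below 0x80; GBK units with a first byte above 0x80, with a second byte below 0x40 treated as invalid). They are not a full encoding detector, and in particular UTF-8 byte sequences can also pass the GBK check.

```python
def looks_ascii(data: bytes) -> bool:
    """True when every byte is below 0x80."""
    return all(b < 0x80 for b in data)


def looks_gbk(data: bytes) -> bool:
    """Heuristic: each byte above 0x80 must start a two-byte unit whose second byte is >= 0x40."""
    i = 0
    saw_double_byte = False
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            i += 1  # single-byte (ASCII-range) character
            continue
        if lead == 0x80 or i + 1 >= len(data) or data[i + 1] < 0x40:
            return False
        saw_double_byte = True
        i += 2
    return saw_double_byte


sample = "今天".encode("gbk")
print(looks_ascii(sample), looks_gbk(sample))
```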
104. And searching and extracting preset specific context sentences in the target translation paragraphs and the target original paragraphs.
In the embodiment of the present application, the preset specific context sentences may include one or more of an abbreviation, a name of a person, a chapter mark, and network literature, such as "U.S.", "Mr. Green", "1.1.2", "IV", and "King: Today's weather is very good!".
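One way to set such expressions aside before sentence breaking is to replace each match with a numbered placeholder, sketched below. The regular expressions, the placeholder format `⟦n⟧`, and the function name `extract_special` are illustrative assumptions; the application does not state how the preset specific context sentences are matched or stored.

```python
import re

# Hypothetical patterns for a few of the categories named in the text.
SPECIAL_PATTERNS = [
    r"(?:[A-Z]\.){2,}",                      # dotted abbreviations such as U.S.
    r"\b(?:Mr|Mrs|Ms|Dr)\.\s?[A-Z][a-z]+",   # a title followed by a name
    r"\b\d+(?:\.\d+)+\b",                    # chapter marks such as 1.1.2
]
SPECIAL_RE = re.compile("|".join(SPECIAL_PATTERNS))


def extract_special(paragraph: str) -> tuple[str, list[str]]:
    """Replace each special expression with a numbered placeholder and return both parts."""
    extracted: list[str] = []

    def stash(match: re.Match) -> str:
        extracted.append(match.group(0))
        return f"⟦{len(extracted) - 1}⟧"

    masked = SPECIAL_RE.sub(stash, paragraph)
    return masked, extracted


masked, extracted = extract_special("Mr. Green visited the U.S. in section 1.1.2 today.")
print(masked)     # ⟦0⟧ visited the ⟦1⟧ in section ⟦2⟧ today.
print(extracted)  # ['Mr. Green', 'U.S.', '1.1.2']
```

Because the placeholders contain no sentence-break symbols, a later punctuation-based split cannot cut through the protected expressions.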
105. And splitting the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences are extracted according to preset sentence break rules corresponding to each language in the first language set and the second language set to obtain a plurality of target translation sentences and a plurality of target original sentences.
In this embodiment, the preset sentence-breaking rules may include a quotation-mark-free rule and a rule for sentence-break symbols inside quotation marks. The quotation-mark-free rule splits a paragraph into sentences at the sentence-break symbols, which may be the period ".", the exclamation mark "!", the question mark "?", the ellipsis "……", and so on. For example, given the paragraph: Today the weather is very good. The sun is facing the brow tip! splitting according to the quotation-mark-free rule yields the sentences: (1) Today the weather is very good. (2) The sun is facing the brow tip!
The rule for sentence-break symbols inside quotation marks may include: first segmenting the paragraph at the quotation marks, and then splitting the remaining text according to the quotation-mark-free rule. For example, given the paragraph: "Today's weather is really good! The sun is facing the brow apex." Xiaoming said happily. splitting according to this rule yields the sentences: (1) "Today's weather is really good! The sun is facing the brow apex." (2) Xiaoming said happily.
From the preset sentence-breaking rules, a corresponding regular expression for splitting paragraphs into sentences can be derived by induction. The expression combines whitespace (\s+) with Western and Chinese punctuation marks and symbols, including ! / , $ % + | [ - . ( ) ? 【 】 " " ! , . @ # …… & ( ). The target translation paragraphs and the target original paragraphs from which the preset specific context sentences have been extracted are each split using this regular expression, yielding a plurality of target translation sentences and a plurality of target original sentences.
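The quotation-mark-free rule can be sketched as a per-language terminator table. The table `TERMINATORS`, the language keys, and the function name `split_plain` are assumptions made for illustration; the application only states that each language has its own preset rule.

```python
import re

# Hypothetical per-language sentence-break symbols.
TERMINATORS = {"zh": "。!?…", "en": ".!?"}


def split_plain(paragraph: str, lang: str) -> list[str]:
    """Split a paragraph after each run of sentence-break symbols for the given language."""
    marks = re.escape(TERMINATORS[lang])
    # A sentence is a run of non-terminators followed by any trailing terminators.
    pattern = re.compile(f"[^{marks}]+[{marks}]*")
    return [s.strip() for s in pattern.findall(paragraph) if s.strip()]


print(split_plain("The weather is good. The sun is out!", "en"))
# ['The weather is good.', 'The sun is out!']
print(split_plain("今天天气真好。太阳当空照!", "zh"))
# ['今天天气真好。', '太阳当空照!']
```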
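A minimal sketch of the quotation-mark rule follows, assuming that each quoted span is kept as one unit and that only the text outside the quotation marks is split at sentence-break symbols. The patterns and the function name `split_with_quotes` are hypothetical.

```python
import re

# Quoted spans in curly or straight double quotation marks (assumed quote styles).
QUOTED = re.compile(r"“[^”]*”|\"[^\"]*\"")
# Quotation-mark-free rule: a run of text followed by its sentence-break symbols.
PLAIN = re.compile(r"[^。!?.!?…]+[。!?.!?…]*")


def split_with_quotes(paragraph: str) -> list[str]:
    """Keep each quoted span whole, and split the remaining text by the plain rule."""
    sentences: list[str] = []
    pos = 0
    for match in QUOTED.finditer(paragraph):
        before = paragraph[pos:match.start()]
        sentences.extend(s.strip() for s in PLAIN.findall(before) if s.strip())
        sentences.append(match.group(0))
        pos = match.end()
    tail = paragraph[pos:]
    sentences.extend(s.strip() for s in PLAIN.findall(tail) if s.strip())
    return sentences


print(split_with_quotes("“今天天气真好!太阳当空照。”小明高兴地说。"))
# ['“今天天气真好!太阳当空照。”', '小明高兴地说。']
```

The exclamation mark and period inside the quotation marks do not produce extra fragments, which matches the two-sentence result given in the example above.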
106. And inserting the extracted preset specific context sentences into corresponding positions of the target translation clauses and the target original sentence clauses, and aligning the target translation clauses inserted with the preset specific context sentences with the target original sentence clauses one by one to complete the corpus alignment of the target translation files and the target original files.
In the embodiment of the present application, after the target translation paragraphs have been split into a plurality of target translation sentences and the target original paragraphs into a plurality of target original sentences in step 105, the preset specific context sentences extracted in step 104 are first inserted back at their original positions: those extracted from a target translation paragraph are restored to the corresponding positions in the corresponding target translation sentences, and those extracted from a target original paragraph are restored to the corresponding positions in the corresponding target original sentences. The target translation sentences and the target original sentences, with the preset specific context sentences restored, are then aligned one by one using information such as context and punctuation, completing the corpus alignment of the target translation file and the target original file.
In the embodiment of the application, the languages of the target translation paragraphs and the target original paragraphs are identified through ASCII codes, and the preset sentence-breaking rule of the corresponding language is matched according to the identified language. Before the paragraphs are split according to the preset sentence-breaking rule, the preset specific context sentences in the target translation paragraphs and the target original paragraphs are searched for and extracted, so that the special contexts are set aside; after the paragraphs are split, the extracted preset specific context sentences are restored to their corresponding positions. In this way accurate sentence breaking can be ensured, the accuracy of corpus alignment is greatly improved, and the range of application is wide.
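The restore-then-align step can be sketched as below. The placeholder format `⟦n⟧` and the function names are hypothetical, and the alignment shown is a purely positional pairing; the use of context and punctuation information mentioned in the text is not modelled here.

```python
import re
from itertools import zip_longest

# Assumed placeholder format left behind by an earlier extraction step.
PLACEHOLDER = re.compile(r"⟦(\d+)⟧")


def restore(sentences: list[str], extracted: list[str]) -> list[str]:
    """Put each extracted expression back where its numbered placeholder stands."""
    return [
        PLACEHOLDER.sub(lambda m: extracted[int(m.group(1))], sentence)
        for sentence in sentences
    ]


def align(originals: list[str], translations: list[str]) -> list[tuple[str, str]]:
    """Pair sentences by position; an unmatched sentence is paired with an empty string."""
    return list(zip_longest(originals, translations, fillvalue=""))


originals = restore(["⟦0⟧ lives in the ⟦1⟧", "He likes it!"], ["Mr. Green", "U.S."])
translations = ["格林先生住在美国。", "他很喜欢!"]
for pair in align(originals, translations):
    print(pair)
```

Pairing with `zip_longest` keeps unmatched sentences visible instead of silently dropping them, which makes misalignment easier to spot in review.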
FIG. 2 is a flowchart illustrating an embodiment of obtaining a plurality of target translation paragraphs and a plurality of target original paragraphs according to the embodiment of the present application. In some embodiments of the present application, respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file may further include:
201. Respectively performing file analysis on the target translation file and the target original file, and unifying the character codes of the target translation file and the target original file to obtain a target translation code and a target original code.
In the embodiment of the application, the file types of the obtained target translation file and the target original file can be identified and analyzed, the character codes of the target translation file and the target original file are unified, the target translation code and the target original code are obtained, and subsequent corpus alignment data processing is facilitated.
202. And deleting the identifier matched with the preset deletion symbol in the target translation code and the target original text code.
In this embodiment of the present application, the preset deleting symbol may include a chinese space character and a western space character, it should be noted that other types of space characters or meaningless characters may also belong to the preset deleting symbol in this embodiment of the present application, and the specific details are not limited herein.
203. And according to the paragraph identifier, carrying out paragraph splitting on the target translation code and the target original code from which the identifier is deleted to obtain a plurality of target translation paragraphs and a plurality of target original paragraphs.
Because the preset deleting symbol has no meaning on the file content, the paragraph splitting is carried out on the target translation code and the target original code from which the identifier matched with the preset deleting symbol is deleted in the embodiment of the application, the accuracy of the paragraph splitting can be improved, and the interference of some meaningless characters on the splitting result is avoided.
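Steps 201 to 203 can be sketched in Python as follows. Several points are assumptions rather than statements of the application: the unified encoding is taken to be Python's Unicode `str`, the paragraph identifier is taken to be a line break, and only spaces at the edges of a line are deleted, since the text does not say whether spaces inside a sentence count as meaningless and removing them would merge English words. The names `preprocess` and `EDGE_SPACES` are hypothetical.

```python
import re

# Assumed deletion symbols: ASCII space and the ideographic (Chinese) space
# U+3000, removed only at the start or end of a line.
EDGE_SPACES = re.compile(r"^[ \u3000]+|[ \u3000]+$", re.MULTILINE)


def preprocess(raw, source_encoding):
    """Decode a file's bytes and return its non-empty paragraphs."""
    # Step 201: unify the character encoding (bytes -> Unicode text).
    text = raw.decode(source_encoding).replace("\r\n", "\n")
    # Step 202: delete characters matching the preset deletion symbols.
    text = EDGE_SPACES.sub("", text)
    # Step 203: split on the paragraph identifier (assumed: line break).
    return [paragraph for paragraph in text.split("\n") if paragraph]
```

For example, a GBK-encoded translation file and a UTF-8-encoded original file would both come out as lists of Unicode paragraphs, so later steps never need to know the source encoding.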
Fig. 3 is a flowchart of an embodiment of obtaining a first language set according to the embodiment of the present application. In some embodiments of the present application, identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain the first language set may further include:
301. Identifying the characters of the plurality of target translation paragraphs according to their ASCII codes.
302. If the proportion of a language corresponding to the ASCII codes in a target translation paragraph is greater than a preset threshold, determining that language as the first language of the target translation paragraph, the first languages of the target translation paragraphs forming the first language set.
Similarly, identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain the second language set may further include:
identifying the characters of the plurality of target original paragraphs according to their ASCII codes;
and if the proportion of a language corresponding to the ASCII codes in a target original paragraph is greater than a preset threshold, determining that language as the second language of the target original paragraph, the second languages of the target original paragraphs forming the second language set.
In the embodiment of the present application, identifying the characters of the plurality of target translation paragraphs and the plurality of target original paragraphs by using ASCII codes may be implemented based on a language model, which determines the language of a paragraph from the input paragraph characters. For example, if the preset threshold is 0.8, then when the proportion of a language in a paragraph exceeds 0.8, that language is determined to be the language of the paragraph.
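The threshold test can be illustrated with a small Python sketch. It replaces the language model mentioned above with a plain code-point heuristic, which is an assumption made only to keep the example short: ASCII letters vote for English and characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) vote for Chinese. The function names and the two-language table are hypothetical; the 0.8 threshold is the example value given in the text.

```python
from collections import Counter


def char_language(ch):
    """Map one character to a language tag, or None if it carries no vote."""
    if ch.isascii() and ch.isalpha():
        return "en"
    if "\u4e00" <= ch <= "\u9fff":
        return "zh"
    return None  # digits, punctuation, spaces, other scripts


def identify_language(paragraph, threshold=0.8):
    """Return the dominant language if its share exceeds the threshold."""
    votes = Counter(lang for lang in map(char_language, paragraph) if lang)
    total = sum(votes.values())
    if total == 0:
        return None
    language, count = votes.most_common(1)[0]
    return language if count / total > threshold else None
```

A paragraph that mixes scripts below the threshold returns `None`, leaving it to the caller to decide how such paragraphs are handled; the application does not specify this case.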
The method has the advantages of high efficiency, a wide range of application, high sentence-breaking accuracy, and accurate alignment. It supports multiple file formats as well as monolingual and bilingual files, requires no language knowledge from the user, and thus saves labor cost. The type of an uploaded file is identified automatically, and the efficient parsing makes the method friendly to large files. After a file is uploaded, its language is identified intelligently and the sentence-break rules are matched quickly, so a wide range of languages is covered. Separating and restoring special contexts, together with word segmentation, greatly increases the accuracy of sentence breaking, and aligning by combining context and punctuation greatly improves the accuracy of corpus alignment. Compared with other technologies, which support only a single file type, the method can automatically identify and split files, among other operations, when collecting corpora in the translation field.
In order to better implement the parallel corpus alignment method in the embodiment of the present application, based on the parallel corpus alignment method, an embodiment of the present application further provides a parallel corpus alignment apparatus, as shown in fig. 4, the parallel corpus alignment apparatus 400 includes:
an obtaining module 401, configured to obtain a target translation file and a target original file;
a processing module 402, configured to pre-process the target translation file and the target original file, respectively, to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identify the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set, and identify the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
search for and extract preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
and split, according to the preset sentence-break rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted, to obtain a plurality of target translation sentences and a plurality of target original sentences;
an output module 403, configured to insert the extracted preset specific context sentences into the corresponding positions of the plurality of target translation sentences and the plurality of target original sentences, and to align the target translation sentences and the target original sentences, into which the preset specific context sentences have been inserted, one by one, to complete the corpus alignment of the target translation file and the target original file.
In some embodiments of the present application, the processing module 402 is specifically configured to:
respectively carrying out file analysis on the target translation file and the target original file, unifying character codes of the target translation file and the target original file, and obtaining a target translation code and a target original code;
deleting identifiers matched with preset deletion symbols in the target translation codes and the target original text codes;
and according to the paragraph identifier, carrying out paragraph splitting on the target translation code and the target original code from which the identifier is deleted to obtain a plurality of target translation paragraphs and a plurality of target original paragraphs.
In some embodiments of the present application, the processing module 402 is further specifically configured to:
identifying the characters of the plurality of target translation paragraphs according to their ASCII codes;
and if the proportion of a language corresponding to the ASCII codes in a target translation paragraph is greater than a preset threshold, determining that language as the first language of the target translation paragraph, the first languages of the target translation paragraphs forming the first language set.
In some embodiments of the present application, the processing module 402 is further specifically configured to:
identifying the characters of the plurality of target original paragraphs according to their ASCII codes;
and if the proportion of a language corresponding to the ASCII codes in a target original paragraph is greater than a preset threshold, determining that language as the second language of the target original paragraph, the second languages of the target original paragraphs forming the second language set.
Specifically, the specific process of each module in the device according to the embodiment of the present application to realize the function thereof may refer to the description of the parallel corpus alignment method in any embodiment corresponding to fig. 1 to fig. 3, and details thereof are not repeated herein.
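The division of work among modules 401 to 403 can be pictured with the following Python skeleton. It is only a sketch under stated assumptions: the class and method names are hypothetical, the "files" are already-loaded strings, language identification and context extraction are omitted, and the one-by-one alignment is reduced to pairing sentences by position, whereas the application also uses context and punctuation.

```python
from itertools import zip_longest


class ParallelCorpusAligner:
    """Hypothetical arrangement of modules 401-403; all names are illustrative."""

    def __init__(self, to_paragraphs, to_sentences):
        self.to_paragraphs = to_paragraphs  # stands in for preprocessing
        self.to_sentences = to_sentences    # stands in for sentence breaking

    def obtain(self, translation_text, original_text):
        # Obtaining module 401: the inputs are passed through unchanged here.
        return translation_text, original_text

    def process(self, text):
        # Processing module 402: paragraphs first, then sentences per paragraph.
        return [s for p in self.to_paragraphs(text) for s in self.to_sentences(p)]

    def output(self, translation_sentences, original_sentences):
        # Output module 403: pair sentences by position; None marks a gap.
        return list(zip_longest(translation_sentences, original_sentences))

    def align(self, translation_text, original_text):
        translation, original = self.obtain(translation_text, original_text)
        return self.output(self.process(translation), self.process(original))
```

Passing the two splitting functions in as parameters keeps the skeleton independent of any particular language rule; a `None` in a pair makes a length mismatch between the two sides visible instead of silently dropping sentences.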
An embodiment of the present application further provides an electronic device for aligning parallel corpora, which integrates any one of the parallel corpus alignment apparatuses provided by the embodiments of the present application. The electronic device includes:
one or more processors;
a memory; and
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to perform the steps of the parallel corpus alignment method in any of the above method embodiments.
The electronic device for aligning parallel corpora according to the embodiment of the present application integrates any one of the parallel corpus alignment apparatuses provided by the embodiments of the present application. Fig. 5 shows a schematic structural diagram of the electronic device according to an embodiment of the present application. Specifically:
the electronic device may include components such as a processor 501 having one or more processing cores, a memory 502 comprising one or more computer-readable storage media, a power supply 503, and an input unit 504. Those skilled in the art will appreciate that the configuration shown in fig. 5 does not limit the electronic device, which may include more or fewer components than those shown, combine some components, or arrange the components differently. Wherein:
the processor 501 is the control center of the electronic device. It connects the various parts of the whole electronic device by using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the electronic device as a whole. Optionally, the processor 501 may include one or more processing cores. The processor 501 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. Preferably, the processor 501 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
The electronic device further comprises a power supply 503 for supplying power to each component. Preferably, the power supply 503 may be logically connected to the processor 501 through a power management system, so that functions such as managing charging, discharging, and power consumption are realized through the power management system. The power supply 503 may also include any of the following components: one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may also include an input unit 504, where the input unit 504 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 501 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and runs the application programs stored in the memory 502, thereby implementing the following functions:
acquiring a target translation file and a target original file;
respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
splitting, according to the preset sentence-break rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted, to obtain a plurality of target translation sentences and a plurality of target original sentences;
and inserting the extracted preset specific context sentences into the corresponding positions of the plurality of target translation sentences and the plurality of target original sentences, and aligning the target translation sentences and the target original sentences, into which the preset specific context sentences have been inserted, one by one, to complete the corpus alignment of the target translation file and the target original file.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the parallel corpus alignment apparatus, the electronic device and the corresponding units thereof described above may refer to the description of the parallel corpus alignment method in any embodiment corresponding to fig. 1 to fig. 3, and are not repeated herein.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by related hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by the processor 501.
To this end, an embodiment of the present application provides a computer-readable storage medium, which may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like. A computer program is stored on the storage medium and is loaded by a processor to execute the steps of any one of the parallel corpus alignment methods provided in the embodiments of the present application. For example, the computer program may be loaded by a processor to perform the steps of:
acquiring a target translation file and a target original file;
respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
splitting, according to the preset sentence-break rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted, to obtain a plurality of target translation sentences and a plurality of target original sentences;
and inserting the extracted preset specific context sentences into the corresponding positions of the plurality of target translation sentences and the plurality of target original sentences, and aligning the target translation sentences and the target original sentences, into which the preset specific context sentences have been inserted, one by one, to complete the corpus alignment of the target translation file and the target original file.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A parallel corpus alignment method is characterized by comprising the following steps:
acquiring a target translation file and a target original file;
respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
splitting, according to the preset sentence-break rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted, to obtain a plurality of target translation sentences and a plurality of target original sentences;
inserting the extracted preset specific context sentences into the corresponding positions of the plurality of target translation sentences and the plurality of target original sentences, and aligning the target translation sentences and the target original sentences, into which the preset specific context sentences have been inserted, one by one, to complete the corpus alignment of the target translation file and the target original file.
2. The method according to claim 1, wherein respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file comprises:
respectively carrying out file analysis on the target translation file and the target original file, unifying character codes of the target translation file and the target original file, and obtaining a target translation code and a target original code;
deleting identifiers matched with preset deletion symbols in the target translation codes and the target original text codes;
and according to the paragraph identifier, carrying out paragraph splitting on the target translation code and the target original text code from which the identifier is deleted to obtain the plurality of target translation paragraphs and the plurality of target original text paragraphs.
3. The method of claim 2, wherein the preset deletion symbols include a Chinese space character and a Western space character.
4. The method of claim 1, wherein identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set comprises:
identifying the characters of the plurality of target translation paragraphs according to their ASCII codes;
if the proportion of a language corresponding to the ASCII codes in a target translation paragraph is greater than a preset threshold, determining that language as the first language of the target translation paragraph, the first languages of the target translation paragraphs forming the first language set.
5. The method of claim 1, wherein identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set comprises:
identifying the characters of the plurality of target original paragraphs according to their ASCII codes;
if the proportion of a language corresponding to the ASCII codes in a target original paragraph is greater than a preset threshold, determining that language as the second language of the target original paragraph, the second languages of the target original paragraphs forming the second language set.
6. The method of claim 1, wherein the preset context-specific sentence comprises one or more of an abbreviation, a person name, a chapter mark, a web literature.
7. The method of claim 1, wherein the preset sentence-break rules include a no-quotation-mark rule and a rule for sentence-break symbols existing within quotation marks; the no-quotation-mark rule comprises splitting a paragraph into sentences according to sentence-break symbols; and the rule for sentence-break symbols existing within quotation marks comprises: first segmenting the paragraph according to the quotation marks, and then segmenting the remaining parts according to the no-quotation-mark rule.
8. A parallel corpus alignment apparatus, comprising:
the acquisition module is used for acquiring the target translation file and the target original file;
the processing module is used for respectively preprocessing the target translation file and the target original file to obtain a plurality of target translation paragraphs corresponding to the target translation file and a plurality of target original paragraphs corresponding to the target original file;
identifying the languages of the plurality of target translation paragraphs by using ASCII codes to obtain a first language set; identifying the languages of the plurality of target original paragraphs by using ASCII codes to obtain a second language set;
searching for and extracting preset specific context sentences in the plurality of target translation paragraphs and the plurality of target original paragraphs;
splitting, according to the preset sentence-break rule corresponding to each language in the first language set and the second language set, the corresponding target translation paragraphs and target original paragraphs from which the preset specific context sentences have been extracted, to obtain a plurality of target translation sentences and a plurality of target original sentences;
and the output module is used for inserting the extracted preset specific context sentences into the corresponding positions of the plurality of target translation sentences and the plurality of target original sentences, and aligning the target translation sentences and the target original sentences, into which the preset specific context sentences have been inserted, one by one, to complete the corpus alignment of the target translation file and the target original file.
9. An electronic device for aligning parallel corpora, comprising:
one or more processors;
a memory; and
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the parallel corpus alignment method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program which is loaded by a processor to perform the steps of the parallel corpus alignment method according to any of the claims 1-7.
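By way of illustration only, the two sentence-break rules recited in claim 7 can be sketched in Python as below. The set of sentence-break symbols, the quotation-mark pairs, and the function names are assumptions; the claim does not say whether a quoted span is later rejoined with its neighbouring text, so this sketch simply keeps each quoted span as one unit.

```python
import re

# Assumed sentence-break symbols (Western and Chinese).
SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]+|[^.!?。！？]+")
# Assumed quotation marks: Chinese curly quotes and straight double quotes.
QUOTED = re.compile(r'(“[^”]*”|"[^"]*")')


def split_no_quotes(text):
    """No-quotation-mark rule: break after each run of break symbols."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def split_with_quotes(paragraph):
    """Quotation rule: cut out quoted spans first, then split the rest."""
    sentences = []
    for index, piece in enumerate(QUOTED.split(paragraph)):
        if index % 2:  # odd positions hold the captured quoted spans
            sentences.append(piece)
        else:
            sentences.extend(split_no_quotes(piece))
    return sentences
```

Handling the quoted spans first means that a period or exclamation mark inside a quotation no longer splits the quotation into fragments.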
CN202011087653.XA 2020-10-12 2020-10-12 Parallel corpus alignment method, device, equipment and storage medium Pending CN112347757A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011087653.XA CN112347757A (en) 2020-10-12 2020-10-12 Parallel corpus alignment method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112347757A true CN112347757A (en) 2021-02-09

Family

ID=74360612

Country Status (1)

Country Link
CN (1) CN112347757A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126506A (en) * 2016-06-22 2016-11-16 上海者信息科技有限公司 A kind of online language material alignment schemes and system
CN110321532A (en) * 2019-06-06 2019-10-11 数译(成都)信息技术有限公司 Language pre-processes punctuate method, computer equipment and computer readable storage medium
CN111160003A (en) * 2018-11-07 2020-05-15 北京猎户星空科技有限公司 Sentence-breaking method and device
CN111259652A (en) * 2020-02-10 2020-06-09 腾讯科技(深圳)有限公司 Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment
CN111652006A (en) * 2020-06-09 2020-09-11 北京中科凡语科技有限公司 Computer-aided translation method and device

Similar Documents

Publication Publication Date Title
JP3696745B2 (en) Document search method, document search system, and computer-readable recording medium storing document search program
CN103123618B (en) Text similarity acquisition methods and device
CN111143556B (en) Automatic counting method and device for software function points, medium and electronic equipment
CN108763176A (en) A kind of document processing method and device
CN114495143B (en) Text object recognition method and device, electronic equipment and storage medium
CN112541070B (en) Mining method and device for slot updating corpus, electronic equipment and storage medium
CN110853625A (en) Speech recognition model word segmentation training method and system, mobile terminal and storage medium
CN113821593A (en) Corpus processing method, related device and equipment
CN107577713B (en) Text handling method based on electric power dictionary
CN112182141A (en) Key information extraction method, device, equipment and readable storage medium
CN116468009A (en) Article generation method, apparatus, electronic device and storage medium
US6754386B1 (en) Method and system of matching ink processor and recognizer word breaks
CN113010593B (en) Event extraction method, system and device for unstructured text
CN103164398A (en) Chinese-Uygur language electronic dictionary and automatic translating Chinese-Uygur language method thereof
CN111597302B (en) Text event acquisition method and device, electronic equipment and storage medium
CN109902299B (en) Text processing method and device
CN109344389B (en) Method and system for constructing Chinese blind comparison bilingual corpus
CN103164396A (en) Chinese-Uygur language-Kazakh-Kirgiz language electronic dictionary and automatic translating Chinese-Uygur language-Kazakh-Kirgiz language method thereof
CN112347757A (en) Parallel corpus alignment method, device, equipment and storage medium
CN109933799B (en) Statement splicing method and device
CN115221266A (en) Raw corpus retrieval method and device, electronic equipment and storage medium
CN109657207B (en) Formatting processing method and processing device for clauses
CN114444503A (en) Target information identification method, device, equipment, readable storage medium and product
CN112955961A (en) Method and system for normalization of gene names in medical texts
CN106598936B (en) Letter word extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination