CN107862045A

CN107862045A - A cross-language plagiarism detection method based on multiple features

Info

Publication number: CN107862045A
Application number: CN201711084337.5A
Authority: CN
Inventors: 刘刚; 胡昱临; 李光曦
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2017-11-07
Filing date: 2017-11-07
Publication date: 2018-03-30
Anticipated expiration: 2037-11-07
Also published as: CN107862045B

Abstract

The present invention provides a cross-language plagiarism detection method based on multiple features. (1) A corpus is built. (2) Construction of translation features: translation features are constructed from the Europeanization and translationese that commonly appear in translated articles; the features are then cleaned by feature selection, which keeps effective features and filters out invalid features or features with no significant effect. (3) Feature selection: a number of effective features are selected from the many candidate features to train a classifier, which then determines whether one or several Chinese articles involve cross-language plagiarism. (4) Plagiarism detection based on feature correspondence: the Chinese features are matched precisely with their English counterparts, plagiarism results are filtered and generated according to the correspondence of translation features and structural features, and the final plagiarism result is confirmed with WordNet. The invention can address cross-language plagiarism using the various features mined from translated text.

Description

A cross-language plagiarism detection method based on multiple features

Technical field

The present invention relates to a method for detecting whether an article involves plagiarism.

Background art

(1) Europeanization and translationese in English-Chinese translation

Contact between English and Chinese has subtly changed both languages in pronunciation, vocabulary, grammar, rhetoric and other respects. Although the influence is mutual, the influence of English on Chinese is far greater than that of Chinese on English. When monolingual plagiarism detection could no longer cope with the academic misconduct it encountered, cross-language plagiarism detection emerged. Monolingual detection techniques, however, do not carry over to the cross-language setting. The two mainstream approaches to cross-language plagiarism detection at present are cross-language information retrieval (CLIR) and cross-language similarity detection (CLSD).

In CLIR, the correspondence between two languages can be established through dictionaries, corpora or translation. Research in this area must deal with word polysemy and ambiguity, the ranking of output results, the segmentation of query terms, dependence on multilingual resources, and other problems. CLIR has become increasingly popular in recent years: at the 2009 Cross-Language Evaluation Forum, 7 of 10 methods directly used machine translation to convert the two languages into one.

CLSD stands for cross-language similarity detection and has much in common with CLIR. CLSD compares the similarity between articles written in different languages. Many CLSD algorithms are in common use, mainly those based on language syntax, machine translation, terminology dictionaries, parallel or comparable corpora, and semantic networks.

Abroad, many studies address cross-language plagiarism for language pairs such as English-Arabic, English-German and English-Czech. Because of the particularities of Chinese, foreign work has not covered English-Chinese plagiarism detection, but work on other language pairs offers considerable reference value for the English-Chinese case.

(2) Use of feature engineering. Feature engineering is the process of using domain knowledge about the data to create features that allow machine learning algorithms to reach their best performance. "Data and features determine the upper limit of machine learning; models and algorithms only approach that limit." Feature engineering therefore holds an extremely important place in machine learning, and in practical applications it is key to success.

With limited data, training a classifier on too many features is computationally expensive and prone to overfitting. Feature selection is therefore the first step after the features are built. Its purpose is to choose, from the many candidate features, the subset with the greatest statistical significance; this both screens out invalid features and reduces the dimensionality of the feature space, lowering model complexity.

Summary of the invention

The object of the invention is to provide a cross-language plagiarism detection method based on multiple features that can address cross-language plagiarism using the various features mined from translated text.

The object of the invention is achieved as follows:

(1) First, a corpus is built

The corpus is divided into a Chinese training set and a Chinese test set, and contains two classes of material. The first class consists of Chinese articles that involve cross-language plagiarism; this material is obtained by automatically translating English documents. The second class consists of original Chinese articles; this material is obtained by downloading authoritative Chinese papers;

The first-class corpus is constructed as follows: a large number of English articles are collected with a web crawler and machine-translated in batches by a program to obtain plagiarized Chinese documents. PDF articles carrying specific numbers are processed in batches: an article numbered n yields one plain-text file per paragraph, named n.m, where m is the paragraph number within the article. The process mainly comprises the following three steps:

1) Convert the PDF document into an XML document in which the text is tagged;

2) Transform the document according to the XML tags, which carry information about each class of text. The content between a pair of paragraph tags is one paragraph. The document is read sequentially; each time a paragraph tag is read, a special marker is added before it, and all other tags, together with the content between them, are removed. What remains is a document in which every paragraph is preceded by the special marker;

3) Read the marked document with a program; each time the special marker is encountered, the content that follows it is written to a plain-text file and the marker is removed;

(2) Construction of translation features

Translation features are constructed from the Europeanization and translationese that commonly appear in translated articles. The features are cleaned by feature selection, which keeps effective features and filters out invalid features or features with no significant effect;

(3) Feature selection

A number of effective features are selected from the many candidate features to train a classifier, which then determines whether one or several Chinese articles involve cross-language plagiarism;

(4) Plagiarism detection based on feature correspondence

The Chinese features are matched precisely with their English counterparts, plagiarism results are filtered and generated according to the correspondence of translation features and structural features, and the final plagiarism result is confirmed with WordNet;

The detection is divided into four stages. Stage 1, preprocessing of the plagiarism candidate set: the Chinese and English candidate sets are divided into paragraphs and tagged with parts of speech. Stage 2, first filtering: precise feature correspondence is performed according to the translation features, and the paragraph distance algorithm is implemented. Stage 3, second filtering: the plagiarism results are filtered again according to structural features. Stage 4, confirmation of the final result: the plagiarism results are confirmed with WordNet to obtain the final plagiarism result.

The main obstacle to detecting cross-language plagiarism is that the articles are written in different languages. The conventional approach is to unify the representations of the two languages, but unifying the forms introduces inaccuracy. Some unifications perform poorly: for example, translating both texts into the same language before comparison depends heavily on translation quality, and different translations affect the outcome to very different degrees, which makes the detection result uncertain. There is also the word alignment problem. During detection, words in cross-language documents must be aligned through a dictionary or a parallel corpus. Such dictionaries or parallel corpora are large and require several probabilistic dictionaries, which consumes a great deal of manpower and material resources and is hard to extend to many languages.

The present invention exploits the particularities of Chinese: it detects whether an article involves plagiarism by looking for a writing style that does not conform to Chinese usage, thereby narrowing the scope of plagiarism detection. This avoids the problems of poor translation and inaccurate disambiguation and detects cross-language plagiarism in a distinctive way. The invention is significant for detecting academic misconduct in Chinese and helps to standardize the academic atmosphere and raise the level of scientific research.

The invention first proposes a method for dividing paragraphs automatically. The plagiarism candidate set is then filtered twice, first by translation-feature correspondence and then by structural-feature correspondence, discarding paragraphs that do not meet the conditions and keeping those that do, to form the plagiarism result; the result is finally confirmed with WordNet. Lastly, the experimental results are verified and compared with similarity computation performed without the two filtering passes, which demonstrates the effectiveness of the two passes.

Analysis of results

After segmentation, each article is divided into several paragraphs. In the test set, a fixed-format number designates a particular paragraph of a particular article and also serves as the document name. For example, C307 denotes paragraph 07 of article 3 in the Chinese candidate set; likewise, the English test set is denoted by E. For convenience of counting, the experiment below uses the first 1000 paragraphs of 142 articles as the test set.

For each Chinese paragraph, the English candidate paragraphs that do not meet the conditions are all filtered out. After the 1000 paragraphs were compared, 749 paragraphs were left with exactly one suspicious paragraph after the two filtering passes; 736 of these matched the paragraph they were plagiarized from, and only 13 were matched incorrectly. Of the remaining 251 paragraphs, 24 had no suspicious paragraph left after the two passes and 227 still had several suspicious paragraphs. For these 227 paragraphs with multiple matches, the specific plagiarized paragraph has to be singled out, and this final confirmation is done with the WordNet dictionary. According to the statistics, 220 of the 227 verified paragraphs were matched to the correct plagiarism source and only 7 were screened incorrectly. The cause is error in the WordNet similarity computation, so that the correct paragraph did not obtain the maximum similarity. The plagiarized source paragraph was nevertheless still among the suspicious paragraphs that survived filtering, which indirectly supports the validity of the results.
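The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given in this section; the variable names are illustrative, and the two ratios at the end are plain quotients of those counts rather than figures stated by the source.

```python
# Counts taken from the result analysis above.
total = 1000
single_candidate = 749            # exactly one suspicious paragraph left after both filters
single_correct, single_wrong = 736, 13
no_candidate, multi_candidate = 24, 227
multi_correct, multi_wrong = 220, 7

# Each breakdown should add up to its parent count.
assert single_correct + single_wrong == single_candidate
assert no_candidate + multi_candidate == total - single_candidate
assert multi_correct + multi_wrong == multi_candidate

# Illustrative ratios derived from the counts (not stated in the source).
share_single_correct = single_correct / total
share_confirmed_by_wordnet = multi_correct / multi_candidate
print(f"single-candidate paragraphs matched correctly: {share_single_correct:.1%}")
print(f"multi-candidate paragraphs confirmed correctly: {share_confirmed_by_wordnet:.1%}")
```

All three breakdowns are internally consistent, which makes the individual counts easier to trust when comparing them with the overall accuracy reported for the experiment.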

Comparison and verification on the 1000 paragraphs above gives an experimental accuracy of 74%. In confirming the final result, the invention is more effective than the method without the two filtering passes.

Brief description of the drawings

Fig. 1 shows the technical route of cross-language plagiarism classification.

Fig. 2 shows the technical route of the cross-language plagiarism detection method.

Fig. 3 shows the overall framework of cross-language plagiarism detection.

Fig. 4 shows the modules of cross-language plagiarism detection.

Fig. 5 is a detailed flowchart of automatic paragraph division.

Detailed description of the embodiments

For cross-language plagiarism, it must first be determined whether an article involves cross-language plagiarism at all; only once such articles have been identified can one determine which paragraphs or parts of them are plagiarized. To this end, the invention finds and selects effective translation features from Chinese articles that involve cross-language plagiarism, assigns them different feature weights, and builds a classification model for cross-language plagiarism. Given Chinese articles can then be classified to detect which of them may involve plagiarism and which do not.

By building a cross-language plagiarism detection technique based on multiple features, the invention aims to address cross-language plagiarism using the various features mined from translated text. The invention first analyzes and summarizes the state of research on monolingual and bilingual plagiarism detection and proposes a cross-language plagiarism detection model based on multiple features; the model comprises cross-language plagiarism classification based on multiple features and cross-language plagiarism detection based on feature correspondence. Feature selection uses the chi-square test; on that basis, the number of occurrences of a feature in the text and its stability within a class are also considered, and the weight of each feature is computed with a new feature-weighting method. The selected features are matched with their English forms of expression, and an algorithm for computing the feature distance between paragraphs is proposed to compare corresponding Chinese and English paragraphs. Structural-feature correspondence means comparing the structure of the Chinese and English paragraphs, keeping paragraphs with similar structure and filtering out those that differ greatly. Finally, the detection results are verified and confirmed with a WordNet-based method, achieving cross-language detection.

The technical means of the invention mainly include:

1. First, a corpus is built and divided into a Chinese training set and a Chinese test set.

Because the corpus is to be used as a training set for supervised classification of articles, it is divided into two classes. One class consists of Chinese articles that involve cross-language plagiarism; this material can be obtained by automatically translating English documents. The other class consists of original Chinese articles; this material is obtained by downloading authoritative Chinese papers.

The second-class corpus is relatively easy to obtain. For the first class, almost no such corpus exists at home or abroad, so the invention first collects a large number of English articles with a web crawler and machine-translates them in batches by program to obtain plagiarized Chinese documents.

The method processes, in batches, PDF articles that carry specific numbers: an article numbered n yields one plain-text file per paragraph, named n.m, where m is the paragraph number within the article.

The key steps are the following three.

(1) Convert the PDF document into an XML document in which the text is tagged. XML, the Extensible Markup Language, is mainly used to store data and describe its structure. Compared with HTML, whose syntax is looser and less strict, XML is strictly structured, so after conversion the tags clearly mark each paragraph, laying a good foundation for the paragraph division that follows. In addition, once the PDF has been converted to XML, the headers and footers of the article are removed automatically, which saves the tedious work of removing them by hand.

(2) Transform the document according to the XML tags, which carry information about each class of text. The content between a pair of paragraph tags is one paragraph. The document is read sequentially; each time a paragraph tag is read, a special marker is added before it, and all other tags, together with the content between them, are removed. What remains is a document in which every paragraph is preceded by the special marker.

(3) Read the marked document with a program. Each time the special marker is encountered, the content that follows it is written to a plain-text file and the marker is removed. Each plain-text file thus corresponds to one paragraph of the article.
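Steps (2) and (3) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the paragraph tag name `para` and the marker string `@@PARA@@` are hypothetical (the text does not give either), and the PDF-to-XML conversion of step (1) is assumed to have been done already by an external tool.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

MARKER = "@@PARA@@"  # hypothetical special marker; the source does not specify one


def mark_paragraphs(xml_text, paragraph_tag="para"):
    """Step (2): keep only paragraph text, each paragraph preceded by MARKER.

    `paragraph_tag` is an assumed tag name; use whatever tag the
    PDF-to-XML converter actually emits for paragraphs.
    """
    root = ET.fromstring(xml_text)
    marked = []
    for elem in root.iter(paragraph_tag):
        # Collapse whitespace left over from the PDF layout.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            marked.append(MARKER + text)
    return "\n".join(marked)


def split_paragraphs(marked_text):
    """Step (3), first half: cut the marked document at every marker."""
    return [p.strip() for p in marked_text.split(MARKER) if p.strip()]


def write_paragraph_files(article_no, paragraphs, out_dir):
    """Step (3), second half: write paragraph m of article n to the file n.m."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for m, paragraph in enumerate(paragraphs, start=1):
        path = out / f"{article_no}.{m}"
        path.write_text(paragraph, encoding="utf-8")
        paths.append(path)
    return paths
```

Because only elements with the paragraph tag are read, content under other tags (tables, pictures, headers) is dropped as a side effect, which matches the intent of step (2).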

2. Construction of translation features. Features are essential in machine learning, and the accuracy of a learner depends largely on the quality of the features built. The invention constructs translation features from the Europeanization and translationese that commonly appear in translated articles, and cleans the features by feature selection, which keeps effective features and filters out invalid features or features with no significant effect. The chi-square test (CHI) is the selection method used.

3. Feature selection. Not all of the constructed features are effective, so a number of effective features must be selected from the many candidates to train the classifier, which then determines whether one or several Chinese articles involve cross-language plagiarism.

4. Plagiarism detection based on feature correspondence. The Chinese features are matched precisely with their English counterparts, plagiarism results are filtered and generated according to the correspondence of translation features and structural features, and the final plagiarism result is confirmed with WordNet. Matching Chinese and English features filters out paragraphs that do not satisfy the feature correspondence and thus greatly narrows the range of plagiarism results. After the distance between paragraphs is computed on the basis of translation-feature correspondence, a large number of paragraphs that do not meet the correspondence conditions are filtered out for each Chinese paragraph in the plagiarism candidate set, while some suspicious English source paragraphs remain. Structural features are then extracted to filter the plagiarism results a second time. Five structural features are chosen: the length of the sentence and the lengths of the nouns, verbs, adjectives and adverbs in the sentence. They are used to screen and filter the plagiarism candidate set further.

The detection is divided into four stages. Stage 1, preprocessing of the plagiarism candidate set: the Chinese and English candidate sets are divided into paragraphs and tagged with parts of speech, which makes it easier to locate plagiarized passages precisely and to process them later. Stage 2, first filtering: precise feature correspondence is performed according to the translation features, and the paragraph distance algorithm is implemented. Stage 3, second filtering: the plagiarism results are filtered again according to structural features, implementing the structural-feature filtering algorithm. Stage 4, confirmation of the final result: the plagiarism results are confirmed with WordNet to obtain the final plagiarism result.
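The fourth stage picks, among the suspicious paragraphs that survive both filters, the one with the greatest similarity. The sketch below illustrates that selection only. It rests on several assumptions that are not in the source: the Chinese paragraph is assumed to be already mapped to English lemmas, paragraph similarity is taken as the average best word-to-word match, and `word_similarity` is a caller-supplied function that in practice could wrap a WordNet-based measure.

```python
def paragraph_similarity(query_words, candidate_words, word_similarity):
    """Average, over the query words, of the best match in the candidate.

    The averaging scheme is an assumption made for illustration.
    """
    if not query_words or not candidate_words:
        return 0.0
    best = [max(word_similarity(q, c) for c in candidate_words) for q in query_words]
    return sum(best) / len(best)


def confirm_final_result(query_words, candidates, word_similarity):
    """Return the id of the candidate paragraph with the maximum similarity.

    `candidates` maps a paragraph id (for example "E102") to its word list.
    Returns None when no suspicious paragraph is left.
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda cid: paragraph_similarity(query_words, candidates[cid], word_similarity),
    )
```

Passing the similarity measure in as a parameter keeps the selection logic independent of any particular WordNet toolkit, and it also makes the error mode described in the result analysis easy to see: if the measure scores a wrong candidate higher, the wrong paragraph is returned even though the correct one is in the candidate set.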

The main process of the invention is described in more detail below with reference to the drawings.

Part I. Text preprocessing

Input: the text to be analyzed

Output: the set of words after word segmentation

The sentence is the unit of processing and the basis for term extraction. At this stage, the data set is extracted by splitting the text into sentences, segmenting words and filtering stop words.
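Sentence splitting, the first of these operations, can be sketched with a regular expression. This is a minimal illustration that assumes sentences end at the Chinese or ASCII full stop, exclamation mark or question mark; the source does not list the delimiters it uses, and word segmentation itself is left to a dedicated segmenter.

```python
import re

# A sentence is a run of non-terminator characters plus an optional terminator.
# The set of terminators is an assumption made for illustration.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
```

For example, `split_sentences("今天天气很好。你来吗？")` yields two sentences, each still carrying its final punctuation mark.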

Part II. Vocabulary filtering (stop-word filtering)

After Chinese word segmentation, the text becomes a sequence of individual word strings. The words that most affect the meaning of a sentence are mostly nouns and verbs, while modifiers have little effect on sentence semantics; the nouns and verbs that influence sentence semantics therefore need to be kept, and they are called retained words. In the segmented sequence, however, some words occur very frequently yet contribute little to text analysis. These are mainly modal particles, prepositions, adverbs and the like; they carry no definite meaning of their own and serve a purpose only as part of a sentence. Such words are called stop words. Filtering these unimportant words out of the domain text saves a great deal of storage and reduces the workload and computation in the later stages of the detection system. The preprocessing design therefore not only selects and suitably improves the word segmentation and tagging algorithms, but also designs the vocabulary filtering process. The process is shown in Fig. 3.

Precondition: word segmentation has been performed

Input: the set of words to be filtered

Output: the set of text words after vocabulary filtering

Step 1: Obtain the set of words to be filtered from Chinese word segmentation.

Step 2: Load the stop-word file, read a word from the set, and look it up in the stop-word file. If it is found, filter the word out; otherwise keep it.
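Steps 1 and 2 amount to a set-membership test. A minimal sketch, assuming the stop-word file holds one word per line and the segmenter has already produced a list of words (both function names are illustrative):

```python
def load_stop_words(lines):
    """Build the stop-word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def filter_stop_words(words, stop_words):
    """Keep every word that is not in the stop-word set, preserving order."""
    return [word for word in words if word not in stop_words]
```

With a file on disk the set could be built as `load_stop_words(open(path, encoding="utf-8"))`; holding the stop words in a set keeps each lookup at constant time regardless of the list size.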

Part III. Feature construction

The invention summarizes the following classes of text features and their representations. These features are Europeanization phenomena that occur frequently in plagiarized text and infrequently in non-plagiarized text.

Precondition: vocabulary filtering has been completed

Output: Europeanized features

Step 3: Select the Europeanized features.

Step 4: Return the selected feature set.

Part IV. Feature selection

Precondition: feature construction has been completed.

Input: feature set

Output: the selected feature set

Step 5: Apply the chi-square test to the features.

Step 6: Score each feature by the test, find a suitable weight parameter for each feature, and rank the features.
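Steps 5 and 6 can be illustrated with the standard chi-square statistic for a 2x2 feature/class contingency table. The sketch below is an assumption-laden illustration: it scores features by document presence only and ranks them by that score, and it does not reproduce the patent's own weighting method (which also takes feature counts and in-class stability into account and is not spelled out in this text).

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one feature and one class.

    a: plagiarized documents containing the feature
    b: original documents containing the feature
    c: plagiarized documents without the feature
    d: original documents without the feature
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # feature or class is constant, so it carries no signal
    return n * (a * d - b * c) ** 2 / denominator


def rank_features(documents):
    """Rank features by chi-square score, highest first.

    `documents` is a list of (set_of_features, is_plagiarized) pairs.
    """
    docs = list(documents)
    n_pos = sum(1 for _, plagiarized in docs if plagiarized)
    n_neg = len(docs) - n_pos
    features = set().union(*(feats for feats, _ in docs))
    scores = {}
    for feature in features:
        a = sum(1 for feats, y in docs if y and feature in feats)
        b = sum(1 for feats, y in docs if not y and feature in feats)
        scores[feature] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A feature that appears equally often in both classes scores 0 and falls to the bottom of the ranking, which is the behaviour wanted when screening out invalid features.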

Part V. SVM model construction

The SVM is one of the most effective and most widely used classifiers. It is based on structural risk minimization and generalizes well; training it is a convex quadratic programming problem, so a locally optimal solution is also globally optimal.

Precondition: feature selection has been completed.

Input: the selected Europeanized features.

Output: an SVM model based on the Europeanized features.

Step 7: Choose the parameter C, replace the inner product with an RBF kernel, and obtain the dual problem of the SVM.

Step 8: Use the SMO algorithm to update the variables that violate the constraints until all variables satisfy the KKT conditions.
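The two ingredients named in steps 7 and 8 can be made concrete with a short sketch: the RBF kernel that replaces the inner product, and the KKT check that SMO uses to decide which multipliers still need updating. This is not a full SMO solver; the `gamma` parameter, the tolerance and the function names are assumptions made for illustration.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def decision_value(x, xs, ys, alphas, b, gamma=1.0):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(xi, x, gamma) for a, y, xi in zip(alphas, ys, xs)) + b


def kkt_violations(xs, ys, alphas, b, c, gamma=1.0, tol=1e-6):
    """Return the indices of training points that violate the KKT conditions.

    alpha_i = 0      requires y_i * f(x_i) >= 1
    0 < alpha_i < C  requires y_i * f(x_i) == 1
    alpha_i = C      requires y_i * f(x_i) <= 1
    """
    violators = []
    for i, (x, y, alpha) in enumerate(zip(xs, ys, alphas)):
        margin = y * decision_value(x, xs, ys, alphas, b, gamma)
        if alpha < tol:
            satisfied = margin >= 1 - tol
        elif alpha > c - tol:
            satisfied = margin <= 1 + tol
        else:
            satisfied = abs(margin - 1) <= tol
        if not satisfied:
            violators.append(i)
    return violators
```

An SMO loop would repeatedly pick a pair of multipliers from the violators, optimize them analytically, and stop once `kkt_violations` returns an empty list.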

Part VI. Automatic paragraph division

Because the articles are stored as PDF, the question is how to convert them accurately into segmented plain-text (TXT) files. Many third-party programs and open-source libraries for converting PDF to TXT are available online, but their results are unsatisfactory and contain many content and formatting errors, which would seriously degrade the quality of the corpus and make later processing inaccurate. The invention therefore takes a special approach to this goal.

Precondition: SVM construction has been completed

Input: text

Output: the divided paragraphs

Step 9: Convert the PDF document into an XML document in which the text is tagged.

Step 10: Transform the document according to the XML tags, which carry information about each class of text.

Step 11: Read the marked document with a program.

Part VII. English part-of-speech tagging

Precondition: paragraph division

Input: the divided paragraphs

Output: part-of-speech tagging result

Step 12: Filtering of tables and pictures: tables and pictures carry special tags in the XML document that differ from the paragraph tag, so they are not read during segmentation.

Step 13: Filtering of titles and merging of paragraphs: on the one hand, titles are sometimes tagged in the same way as paragraphs, so titles must be identified and filtered out. On the other hand, conversion from PDF to XML sometimes breaks a paragraph, splitting one paragraph into two or more, and these must be merged.

Part VIII. First filtering of plagiarism results based on translation-feature correspondence

Precondition: part-of-speech tagging

Input: the tagged text

Output: the text after the first filtering

Step 14: Align the Chinese and English text according to the features.

Step 15: Compute the paragraph distance of the text.
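The source does not give the paragraph distance formula at this point, so the following is only one plausible reading, stated as an assumption: each paragraph is represented by the counts of its translation features after Chinese-English alignment, the distance is the sum of absolute count differences, and candidates farther away than a chosen bound are filtered out. The feature names used in any call are hypothetical.

```python
def paragraph_distance(zh_counts, en_counts):
    """Sum of absolute differences between aligned feature counts (assumed metric)."""
    names = set(zh_counts) | set(en_counts)
    return sum(abs(zh_counts.get(n, 0) - en_counts.get(n, 0)) for n in names)


def first_filter(zh_counts, candidates, max_distance):
    """Keep the ids of English paragraphs within max_distance of the Chinese one.

    `candidates` maps an English paragraph id to its feature-count dict.
    """
    return [
        cid
        for cid, en_counts in candidates.items()
        if paragraph_distance(zh_counts, en_counts) <= max_distance
    ]
```

Any other distance over the aligned feature vectors could be substituted in `paragraph_distance` without changing the filtering step.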

Part IX. Second filtering of the plagiarism results

Step 16: Given a Chinese plagiarized paragraph and the once-filtered plagiarism results screened out in the previous subsection, compare them one by one. If any feature exceeds a specific threshold, the candidate is removed from the plagiarism results; the paragraphs that remain after filtering are the plagiarism results of the second filtering.
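Step 16 can be sketched as a threshold test over the five structural features. The comparison rule is not specified in the text, so this sketch assumes a relative difference per feature and a single shared threshold; both are illustrative choices, as are the dictionary keys.

```python
# The five structural features named in the description.
STRUCTURAL_FEATURES = ("sentence", "noun", "verb", "adjective", "adverb")


def relative_difference(a, b):
    """Difference scaled by the larger value; 0.0 when both values are zero."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


def second_filter(zh_features, candidates, threshold):
    """Drop every candidate for which any structural feature exceeds the threshold.

    `zh_features` and each value of `candidates` map feature name to length.
    """
    kept = []
    for cid, en_features in candidates.items():
        differences = (
            relative_difference(zh_features[name], en_features[name])
            for name in STRUCTURAL_FEATURES
        )
        if all(diff <= threshold for diff in differences):
            kept.append(cid)
    return kept
```

Since Chinese and English lengths are measured in different units, a real system would likely need per-feature thresholds or a normalization step; the single threshold here only keeps the sketch short.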

Claims

1. A cross-language plagiarism detection method based on multiple features, characterized in that:

(1) a corpus is built;

(2) construction of translation features

translation features are constructed from the Europeanization and translationese that commonly appear in translated articles, and the features are cleaned by feature selection, keeping effective features and filtering out invalid features or features with no significant effect;

(3) feature selection

a number of effective features are selected from the many candidate features to train a classifier, which then determines whether one or several Chinese articles involve cross-language plagiarism;

(4) plagiarism detection based on feature correspondence

the Chinese features are matched precisely with their English counterparts, plagiarism results are filtered and generated according to the correspondence of translation features and structural features, and the final plagiarism result is confirmed with WordNet.

2. The cross-language plagiarism detection method based on multiple features according to claim 1, characterized in that building the corpus specifically comprises:

the corpus is divided into a Chinese training set and a Chinese test set and contains two classes of material; the first class consists of Chinese articles that involve cross-language plagiarism and is obtained by automatically translating English documents; the second class consists of original Chinese articles and is obtained by downloading authoritative Chinese papers;

the first-class corpus is constructed as follows: a large number of English articles are collected with a web crawler and machine-translated in batches by a program to obtain plagiarized Chinese documents; PDF articles carrying specific numbers are processed in batches, an article numbered n yielding one plain-text file per paragraph, named n.m, where m is the paragraph number within the article; the process mainly comprises the following three steps:

1) converting the PDF document into an XML document in which the text is tagged;

2) transforming the document according to the XML tags, which carry information about each class of text, the content between a pair of paragraph tags being one paragraph; the document is read sequentially, each time a paragraph tag is read a special marker is added before it, and all other tags, together with the content between them, are removed, so that what remains is a document in which every paragraph is preceded by the special marker;

3) reading the marked document with a program, and each time the special marker is encountered, writing the content that follows it to a plain-text file and removing the marker.

3. The cross-language plagiarism detection method based on multiple features according to claim 1, characterized in that the plagiarism detection based on feature correspondence is divided into four stages: stage 1, preprocessing of the plagiarism candidate set, in which the Chinese and English candidate sets are divided into paragraphs and tagged with parts of speech; stage 2, first filtering, in which precise feature correspondence is performed according to the translation features and the paragraph distance algorithm is implemented; stage 3, second filtering, in which the plagiarism results are filtered again according to structural features; stage 4, confirmation of the final result, in which the plagiarism results are confirmed with WordNet to obtain the final plagiarism result.