CN105550172B - A distributed text detection method and system


Info

Publication number
CN105550172B
Authority
CN
China
Prior art keywords
participle
rwv
amount
document
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610020566.XA
Other languages
Chinese (zh)
Other versions
CN105550172A (en
Inventor
夏峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201610020566.XA
Publication of CN105550172A
Application granted
Publication of CN105550172B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a distributed text detection method and system. A comparison database holds source materials and is stored across different sites in a distributed manner. A word library holds segmented words and their parts of speech. A word segmentation module segments each material; a word feature value generation module generates the part-of-speech feature values of the segmented words; a word free vector dimension determination module determines the word free vector dimension; a word reduced vector dimension generation module generates the word reduced vector dimension; and a word feature vector generation module generates the word feature vector. For the document to be identified, a word segmentation module segments the document to obtain a segmentation result; a word free vector dimension determination module determines its word free vector dimension; a word reduced vector dimension generation module generates its word reduced vector dimension; and a word feature vector generation module generates its word feature vector. A similarity comparison is then carried out.

Description

A distributed text detection method and system
Technical field
The invention belongs to the field of text detection, and in particular relates to a distributed text detection system.
Background
Text detection refers to judging whether the text of a given document is suspected of plagiarizing one or more other documents. Plagiarism is not simply copying, however: a document may plagiarize the text of other documents through semantic transformation, synonym substitution, translation of foreign-language documents, and other means.
At present there are two main text detection techniques: fingerprint-based detection and detection based on paragraph word-frequency statistics. Fingerprint-based detection extracts data feature strings, called fingerprints, from the submitted source text and judges whether a document plagiarizes other documents according to the proportion of identical fingerprints. Paragraph word-frequency detection segments the submitted text into words, counts occurrence frequencies for each paragraph of the text, sets a threshold, compares each array of the text under examination with each array of the reference text, and finally decides from this index whether plagiarism has occurred. These prior-art methods suffer, to some degree, from problems such as a low recognition rate and low efficiency.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a distributed text plagiarism detection method and system.
The text plagiarism detection system includes: a comparison database for holding the materials that serve as comparison objects, the comparison database being stored at different sites in a distributed manner, such that when the comparison database is accessed a particular site can be chosen according to the load of the different sites; a word library for holding segmented words and their parts of speech, in which each word is given a unique number, W_ID denoting the unique number of a word in the word library; a word segmentation module for segmenting each material and saving the segmentation result to the comparison database; a word feature value generation module that counts the number of times each word occurs in the corresponding material and generates the part-of-speech feature value of each word; a word free vector dimension determination module that determines the word free vector dimension WFV from the segmentation result of a material, WFV being equal to the number of distinct words obtained after segmenting that material; a word reduced vector dimension generation module that generates the word reduced vector dimension RWV; a word feature vector generation module that extracts, for each material, the feature values corresponding to the word reduced vector dimension RWV and generates the word feature vector WVE_RWV; a user access mode detection module for prompting the user to upload the document to be identified; and a user detection mode determination module for judging the current user detection mode. When the current user detection mode is judged to be the ordinary plagiarism identification mode, a to-be-identified document word segmentation module segments the document to be identified and obtains a segmentation result; a to-be-identified document word free vector dimension determination module determines the word free vector dimension WFV_TBI; a to-be-identified document word reduced vector dimension generation module generates the word reduced vector dimension RWV_TBI of the document to be identified; a to-be-identified document word feature vector generation module generates the word feature vector WVE_RWV_TBI of the document to be identified; and a similarity comparison is carried out. After the document to be identified has been compared with all materials, all suspected materials are extracted, and the document to be identified is compared further with the suspected materials.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention may be understood more clearly and practiced in accordance with the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 shows a block diagram of a distributed text detection system according to an embodiment of the present invention;
Fig. 2 shows a sliding window detection method according to an embodiment of the present invention.
Detailed description
To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, specific embodiments, features, and effects of the system and method proposed by the invention are described in detail below with reference to the drawings and preferred embodiments. In the following description, different references to "an embodiment" or "one embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or characteristics of one or more embodiments may be combined in any suitable form.
As shown in Fig. 1, the distributed text detection system of the present invention (hereinafter "the system") comprises a material subsystem, a user subsystem, a suspected-material extraction subsystem, and a comparison subsystem. The material subsystem prepares the materials used for plagiarism detection comparison; the user subsystem manages user login information and determines the user's writing style; the suspected-material extraction subsystem extracts from the comparison database the materials suspected of matching the document to be identified; and the comparison subsystem compares the suspected materials with the document to be identified and generates a comparison report.
According to a specific embodiment of the present invention, the material subsystem may further include one or more of: a comparison database; a word library, which includes a synonym and near-synonym library and a Chinese-foreign synonym library; a word segmentation module; a word group module; a Chinese-foreign word group module; a word part-of-speech classification module; a word group part-of-speech classification module; a Chinese-foreign word group part-of-speech classification module; a word feature value generation module; a word group feature value generation module; a Chinese-foreign word group feature value generation module; a word closeness coefficient generation module; a word group closeness coefficient generation module; a Chinese-foreign word group closeness coefficient generation module; a word closeness coefficient feature vector generation module; a word group closeness coefficient feature vector generation module; a Chinese-foreign word group closeness coefficient feature vector generation module; a word free vector dimension determination module; a word group free vector dimension determination module; a Chinese-foreign word group free vector dimension determination module; a word reduced vector dimension generation module; a word group reduced vector dimension generation module; a Chinese-foreign word group reduced vector dimension generation module; a word feature vector generation module; a word group feature vector generation module; and a Chinese-foreign word group feature vector generation module.
According to a specific embodiment of the present invention, the user subsystem may further include one or more of: a user access mode detection module; a user detection mode determination module; a user writing style test module; a test picture text description feature value generation module; a test article text description feature value generation module; a test picture text description feature vector generation module; a test article text description feature vector generation module; a test picture reference feature vector generation module; a test article reference feature vector generation module; a user test picture text description feature value generation module; a user test picture text description feature vector generation module; a user picture writing style feature vector generation module; a user test article text description feature value generation module; a user test article text description feature vector generation module; a user article writing style feature vector generation module; a user writing style feature vector generation module; a pending document feature value generation module; a pending document feature value feature vector generation module; a user writing style similarity calculation module; a user writing style judgment module; and a user writing style structural auxiliary word judgment module.
According to a specific embodiment of the present invention, the suspected-material extraction subsystem may further include one or more of: a to-be-identified document word segmentation module; a to-be-identified document word group module; a to-be-identified document Chinese-foreign word group module; a to-be-identified document word part-of-speech classification module; a to-be-identified document word group part-of-speech classification module; a to-be-identified document Chinese-foreign word group part-of-speech classification module; a to-be-identified document word feature value generation module; a to-be-identified document word group feature value generation module; a to-be-identified document Chinese-foreign word group feature value generation module; a to-be-identified document word closeness coefficient generation module; a to-be-identified document word group closeness coefficient generation module; a to-be-identified document Chinese-foreign word group closeness coefficient generation module; a to-be-identified document word closeness coefficient feature vector generation module; a to-be-identified document word group closeness coefficient feature vector generation module; a to-be-identified document Chinese-foreign word group closeness coefficient feature vector generation module; a to-be-identified document word free vector dimension determination module; a to-be-identified document word group free vector dimension determination module; a to-be-identified document Chinese-foreign word group free vector dimension determination module; a to-be-identified document word reduced vector dimension generation module; a to-be-identified document word group reduced vector dimension generation module; a to-be-identified document Chinese-foreign word group reduced vector dimension generation module; a to-be-identified document word feature vector generation module; a to-be-identified document word group feature vector generation module; a to-be-identified document Chinese-foreign word group feature vector generation module; a to-be-identified document feature vector adjustment module; a material feature vector adjustment module; an ordinary plagiarism identification similarity calculation module; an extended plagiarism identification similarity calculation module; a multilingual plagiarism identification similarity calculation module; a to-be-identified document closeness coefficient statistics module; a material closeness coefficient statistics module; a formula extraction module; a formula decomposition module; and a closeness coefficient suspected-material extraction module.
According to a specific embodiment of the present invention, the comparison subsystem may further include: a sliding window setting module; a sliding window comparison module; and a comparison report generation module.
In a specific embodiment according to the present invention, the system includes a comparison database for holding the materials that serve as comparison objects. The comparison database further includes sub-libraries such as a book library, a paper library, a patent library, a formula library, a proverb and common saying library, a proverb library, a famous quotation library, and a poetry library. The book library holds publicly published books; the paper library holds journal articles, conference papers, dissertations, and the like; the patent library holds patent disclosures and the like. When a material is included, its source must also be saved: for a book, the publication date, publisher, author, book number, and so on; for a journal article, the date of issue, the name of the journal, the issue number, the author, and so on; for a conference paper, the conference name, venue, date, author, and so on; for a dissertation, the school, graduation date, degree level, author, and so on. From the recorded source information, a person skilled in the art can uniquely obtain the material. Preferably, the materials held in the comparison database are not limited to Chinese materials and also include foreign-language materials. Once established, the comparison database must also be maintained, periodically or as needed, to add new books, journal articles, conference papers, dissertations, patent disclosures, and so on. The proverb and common saying library holds sentences, phrases, and similar materials that circulate widely online or among the public. The famous quotation library holds famous quotations, and the poetry library holds poems, ci, songs, fu, and similar materials. The purpose of further establishing the proverb and common saying library, the famous quotation library, the poetry library, and so on within the comparison database is to extend the range of comparison materials beyond traditional books, papers, patent documents, and the like, improving the comprehensiveness of plagiarism detection. A person skilled in the art knows that the comparison database may also include other types of material, which are not described further here.
Preferably, the comparison database classifies materials by field when including them. According to a specific embodiment of the present invention, field identification may use the classes of the Chinese Library Classification. The Chinese Library Classification has 5 basic categories and 22 main classes and uses mixed notation combining Hanyu Pinyin letters with Arabic numerals: one letter denotes one main class, the alphabetical order reflects the order of the main classes, and numerals follow the letter. For example, A1 denotes the works of Marx and Engels, K6 denotes the history of Oceania, and TN denotes electronic technology and communication technology. To accommodate the development of industrial technology, the second-level classes of industrial technology use two letters. A person skilled in the art knows that other classification systems may also be used to identify the field of a material.
Preferably, when including a material, the comparison database indexes it separately by title, author, abstract, and body text. Associations are established between the title, author, abstract, and body text of each material, so that the remaining parts of a material can be obtained from any one of them.
Preferably, when including a material, the comparison database extracts a copy of each formula present in the material and establishes a formula library in which the formulas are saved separately. Each formula in the formula library is associated with the material from which it was extracted, so that the full text of the corresponding material can be obtained from the formula. According to a specific embodiment of the present invention, when a formula is included, its independent variable parameters, dependent variable parameters, and operators are extracted and saved separately. According to a specific embodiment of the present invention, after the independent and dependent variable parameters of a formula are extracted, the specific meaning, dimension, and value range of each parameter are further extracted and saved separately. According to a specific embodiment of the present invention, after the operators of a formula are extracted, each operator is further annotated with Chinese and foreign-language text. For each formula it holds, the formula library saves the symbolic representation of the independent and dependent variable parameters; the Chinese and foreign-language text annotations of the specific meaning, dimension, and value range of each independent and dependent variable; and the operators together with their Chinese and foreign-language descriptions. The purpose of further establishing a formula library within the comparison database is to extend the range of comparison materials to formula comparison, improving the comprehensiveness of plagiarism detection. A person skilled in the art knows that the comparison database may also extract other content from materials, such as chemical formulas and gene sequences, which is not described further here.
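One possible shape for a formula library entry is sketched below in Python. The class names, field names, language codes, and the sample formula are hypothetical and chosen only for illustration; the specification states what is stored for each formula but not how it is laid out.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    symbol: str        # symbolic representation of the parameter
    meaning: dict      # language code -> text annotation of its meaning
    dimension: str     # dimension or unit, kept as free text in this sketch
    value_range: str   # value range, kept as free text in this sketch


@dataclass
class FormulaRecord:
    material_id: str                                # link back to the source material
    independent: list = field(default_factory=list)  # independent variable parameters
    dependent: list = field(default_factory=list)    # dependent variable parameters
    operators: dict = field(default_factory=dict)    # operator -> {language: description}


# Hypothetical entry for F = m * a, extracted from a material labeled "book-0001".
record = FormulaRecord(
    material_id="book-0001",
    independent=[
        Parameter("m", {"en": "mass", "zh": "质量"}, "kg", "m > 0"),
        Parameter("a", {"en": "acceleration", "zh": "加速度"}, "m/s^2", "any real"),
    ],
    dependent=[Parameter("F", {"en": "force", "zh": "力"}, "N", "any real")],
    operators={
        "*": {"en": "multiplied by", "zh": "乘以"},
        "=": {"en": "equals", "zh": "等于"},
    },
)

# The formula library maps a formula key to its record, so the full text of the
# source material can be looked up through material_id.
formula_library = {"F=m*a": record}
```

Keeping `material_id` on every record is what allows the full text of a material to be reached from a matching formula.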
According to a specific embodiment of the present invention, the comparison database is stored at different sites in a distributed manner, and when the comparison database is accessed a particular site can be chosen according to the load of the different sites. Each site counts the quantity of material extracted from the comparison database at that site during the current unit period, where the quantity may be the number of materials or the number of bytes of material, and from this obtains its average load. Each site periodically reports its average load to the suspected-material extraction subsystem. When the suspected-material extraction subsystem needs to extract materials from the comparison database in order to select suspected materials, it accesses the site with the lowest average load according to the most recently reported average loads. The unit period is configured by the system and may be chosen as needed, for example 5, 10, 30, or 60 minutes. According to a specific embodiment of the present invention, different sub-libraries of the comparison database may be stored at different sites in a distributed manner, and the comparison database is accessed at the sites where the respective sub-libraries are stored. When the suspected-material extraction subsystem needs to extract materials from the comparison database in order to select suspected materials, it selects which comparison sub-library to access according to the field or type of the materials to be extracted.
According to a specific embodiment of the present invention, choosing a particular site according to the load of the different sites when accessing the comparison database means obtaining the load of each site before the access and accessing the site with the lowest load.
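The site selection rule can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the `SiteReport` class, the site names, and the choice to compute average load as quantity per minute are assumptions, since the specification does not say how the average is calculated.

```python
from dataclasses import dataclass


@dataclass
class SiteReport:
    site: str               # hypothetical site identifier
    quantity_served: int    # materials (count or bytes) extracted in the last unit period
    period_minutes: int     # unit period configured by the system, e.g. 5, 10, 30 or 60

    @property
    def average_load(self) -> float:
        # Assumption: average load is the quantity served per minute of the period.
        return self.quantity_served / self.period_minutes


def pick_site(latest_reports):
    """Return the site whose most recently reported average load is lowest."""
    if not latest_reports:
        raise ValueError("no site reports available")
    return min(latest_reports, key=lambda r: r.average_load).site


reports = [
    SiteReport("site-a", 1200, 10),
    SiteReport("site-b", 300, 10),
    SiteReport("site-c", 900, 10),
]
chosen = pick_site(reports)
```

Because only the most recent report per site is consulted, the selection is as fresh as the reporting period allows; a shorter unit period tracks load changes more closely at the cost of more reporting traffic.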
According to a specific embodiment of the present invention, the system includes a word library for holding segmented words and their parts of speech. The word library is set in advance by the system and maintained regularly, for example to add new words. Preferably, each word in the word library is given a unique number, and W_ID may be used to denote the unique number of a word in the word library. The word library saves the part of speech of each word, such as noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, or onomatopoeia. According to a specific embodiment of the present invention, segmentation results are divided by part of speech into content words and function words: content words include nouns, verbs, adjectives, numerals, measure words, and pronouns; function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the word library further includes a synonym and near-synonym library, in which words with the same or similar meaning form a group and numbering is by group. Multiple words with the same or similar meaning correspond to one word group number, and WG_ID may be used to denote the unique number of a word group in the word library. Preferably, the word library further includes a Chinese-foreign synonym and near-synonym library, in which Chinese and foreign-language words with the same or similar meaning form a group and numbering is by group. Multiple Chinese and foreign-language words with the same or similar meaning correspond to one Chinese-foreign word group number, and WFG_ID may be used to denote the unique number of a Chinese-foreign word group in the word library.
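The relationship between W_ID and WG_ID can be sketched with a small in-memory structure. The `WordLibrary` class and its method names are hypothetical, and sequential integer numbering is an assumption; the specification only requires that the numbers be unique.

```python
class WordLibrary:
    """Toy word library: one W_ID per word, one WG_ID per synonym group."""

    def __init__(self):
        self._w_id = {}       # word -> W_ID
        self._pos = {}        # word -> part of speech
        self._wg_id = {}      # word -> WG_ID of its synonym group
        self._next_group = 1

    def add_word(self, word, pos):
        # Assign the next sequential number the first time a word is seen.
        if word not in self._w_id:
            self._w_id[word] = len(self._w_id) + 1
        self._pos[word] = pos
        return self._w_id[word]

    def add_synonym_group(self, words):
        # All words with the same or similar meaning share one group number.
        wg_id = self._next_group
        self._next_group += 1
        for word in words:
            self._wg_id[word] = wg_id
        return wg_id

    def w_id(self, word):
        return self._w_id[word]

    def wg_id(self, word):
        return self._wg_id.get(word)  # None if the word belongs to no group

    def pos(self, word):
        return self._pos[word]


library = WordLibrary()
for word, pos in [("quick", "adjective"), ("fast", "adjective"), ("dog", "noun")]:
    library.add_word(word, pos)
library.add_synonym_group(["quick", "fast"])
```

A Chinese-foreign group number (WFG_ID) could be handled the same way, with the group containing words from more than one language.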
According to a specific embodiment of the present invention, the system includes a word segmentation module for segmenting each material and saving the segmentation result to the comparison database. Preferably, the word segmentation module compares the segmentation result against the parts of speech saved in the word library to determine the part of speech of each segmented word. Preferably, the word part-of-speech classification module classifies the segmentation result according to the corresponding parts of speech.
According to a specific embodiment of the present invention, the system includes a word group module for segmenting each material and saving the word group result to the comparison database. Preferably, the word group module compares the segmentation result against the parts of speech saved in the word library to determine the part of speech of the word group result. Preferably, the word group part-of-speech classification module classifies the word group result according to the corresponding parts of speech.
According to a specific embodiment of the present invention, the system includes a Chinese-foreign word group module for segmenting each material and saving the Chinese-foreign word group result to the comparison database. Preferably, the Chinese-foreign word group module compares the Chinese-foreign segmentation result against the parts of speech saved in the word library to determine the part of speech of the Chinese-foreign word group result. Preferably, the Chinese-foreign word group part-of-speech classification module classifies the Chinese-foreign word group result according to the corresponding parts of speech.
According to a specific embodiment of the present invention, the word part-of-speech classification module, the word group part-of-speech classification module, and the Chinese-foreign word group part-of-speech classification module respectively divide the segmentation result, the word group result, and the Chinese-foreign word group result by part of speech into class A content words, class B content words, class C content words, class D content words, and class V function words. Class A content words include nouns; class B content words include verbs and adjectives; class C content words include numerals and measure words; class D content words include pronouns; class V function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the word library further divides nouns into technical terms and common nouns. According to a specific embodiment of the present invention, the segmentation result is divided by part of speech into class A1 content words, class A2 content words, class B content words, class C content words, class D content words, and class V function words. Class A1 content words include technical-term nouns; class A2 content words include common nouns; class B content words include verbs and adjectives; class C content words include numerals and measure words; class D content words include pronouns; class V function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. A person skilled in the art may choose a different classification scheme as needed.
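The two classification schemes amount to a lookup table with an optional split of the noun class. A minimal Python sketch follows; the function name, the English part-of-speech labels, and the two keyword arguments are illustrative assumptions.

```python
# Part of speech -> class, following the A/B/C/D/V scheme described above.
BASE_CLASSES = {
    "noun": "A",
    "verb": "B", "adjective": "B",
    "numeral": "C", "measure word": "C",
    "pronoun": "D",
    "adverb": "V", "preposition": "V", "conjunction": "V",
    "auxiliary word": "V", "interjection": "V", "onomatopoeia": "V",
}


def classify(pos, split_nouns=False, is_technical_term=False):
    """Return the class label for a part of speech.

    With split_nouns=True, nouns are divided into A1 (technical terms)
    and A2 (common nouns); all other classes are unchanged.
    """
    label = BASE_CLASSES[pos]
    if label == "A" and split_nouns:
        return "A1" if is_technical_term else "A2"
    return label
```

For example, `classify("noun")` yields the class A label, while `classify("noun", split_nouns=True, is_technical_term=True)` yields A1.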
According to a specific embodiment of the present invention, the word feature value generation module counts the number of times each word occurs in the corresponding material and generates for each word the word feature value WCV = [W_ID, W_N], where W_ID denotes the unique number of the word in the word library and W_N denotes the total number of times the word occurs in the material. Preferably, taking the part of speech of each word into account, the word feature value generation module generates the word part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR], where W_CHAR denotes the part of speech of the word.
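Given a segmented material, WCV and WCCV can be produced with a simple frequency count. The sketch below assumes the material is already a list of words and that W_ID and part-of-speech lookups are available as dictionaries; the function name and the sample data are hypothetical.

```python
from collections import Counter


def word_feature_values(tokens, w_ids, parts_of_speech):
    """Return (WCV list, WCCV list) for one segmented material.

    tokens           segmented words of the material, in order
    w_ids            word -> W_ID lookup (assumed to come from the word library)
    parts_of_speech  word -> W_CHAR lookup
    """
    counts = Counter(tokens)  # W_N for every distinct word
    wcv = [[w_ids[w], n] for w, n in counts.items()]
    wccv = [[w_ids[w], n, parts_of_speech[w]] for w, n in counts.items()]
    return wcv, wccv


tokens = ["text", "detect", "text", "system"]
w_ids = {"text": 11, "detect": 12, "system": 13}
pos = {"text": "noun", "detect": "verb", "system": "noun"}
wcv, wccv = word_feature_values(tokens, w_ids, pos)
```

The number of entries in `wcv` equals the number of distinct words in the material, which is how the summary defines the word free vector dimension WFV.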
According to a specific embodiment of the present invention, the word group feature value generation module counts the number of times each word group occurs in the corresponding material and generates for each word group the word group feature value WGCV = [WG_ID, WG_N], where WG_ID denotes the unique number of the word group in the word library and WG_N denotes the total number of times the word group occurs in the material. Preferably, taking the part of speech of each word group into account, the word group feature value generation module generates the word group part-of-speech feature value WGCCV = [WG_ID, WG_N, WG_CHAR], where WG_CHAR denotes the part of speech of the word group.
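Counting by group rather than by word means that synonyms contribute to the same total. The following sketch assumes that an occurrence of any member word counts as an occurrence of its group, which the specification implies but does not state outright; the function name and sample data are hypothetical.

```python
from collections import Counter


def group_feature_values(tokens, wg_ids):
    """Return the WGCV list [[WG_ID, WG_N], ...] for one segmented material.

    Assumption: every occurrence of a member word counts as one occurrence
    of its group. Words that belong to no group are skipped.
    """
    counts = Counter(wg_ids[w] for w in tokens if w in wg_ids)
    return [[wg_id, n] for wg_id, n in sorted(counts.items())]


tokens = ["quick", "dog", "fast", "quick", "hound", "the"]
wg_ids = {"quick": 1, "fast": 1, "dog": 2, "hound": 2}
wgcv = group_feature_values(tokens, wg_ids)
```

Here "quick" and "fast" share WG_ID 1, so a text that swaps one synonym for the other produces the same group count, which is the property that makes group features useful against synonym substitution.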
A specific embodiment according to the present invention, middle foreign language participle group characteristic value generation module count each China and foreign countries The quantity that literary participle group occurs in corresponding material, generates the corresponding participle group characteristic value WFGCV of foreign language participle group in each =[WFG_ID, WFG_N], wherein WFG_ID represent unique number of the foreign language participle group in storehouse is segmented in this, and WFG_N is represented should The total degree that middle foreign language participle group occurs in the material.Preferably, it is contemplated that the part of speech of foreign language participle group in each, participle Foreign language participle group part of speech feature value WFGCCV=[WFG_ID, WFG_N, WFG_CHAR] in the generation module generation of group characteristic value, Middle WFG_CHAR represents the part of speech of foreign language participle group in this.
According to a specific embodiment of the present invention, the participle tightening coefficient generation module generates participle tightening coefficients. A participle tightening coefficient is the number of participles separating two adjacent occurrences of the same participle within the whole material. According to a specific embodiment of the present invention, the tightening coefficients of each participle are expressed as WGC = [G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of participles between the first and second occurrences of the participle in the material, G_W_ID_2 is the number of participles between its second and third occurrences, and G_W_ID_(W_N-1) is the number of participles between its (W_N-1)-th and W_N-th occurrences. According to a specific embodiment of the present invention, the participle tightening coefficient feature vector generation module generates the participle tightening coefficient feature vector WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the participle in the participle library, W_N is the total number of times the participle occurs in the material, and W_CHAR is the part of speech of the participle. The participle tightening coefficients reveal the overall distribution of a specific participle within the corresponding material.
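The tightening coefficients can be sketched as gaps between consecutive positions of a participle. The sketch assumes that the "number of participles between" two occurrences excludes both occurrences themselves, which is one reading of the text; the function name `tightening_coefficients` is hypothetical.

```python
def tightening_coefficients(tokens, target):
    """Return [G_W_ID_1, ..., G_W_ID_(W_N-1)] for `target` in `tokens`.

    Each entry is the count of participles lying strictly between two
    adjacent occurrences of `target` (assumed interpretation).
    """
    positions = [i for i, t in enumerate(tokens) if t == target]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]
```

A participle that occurs W_N times yields W_N - 1 coefficients, matching the length of WGC; a participle that occurs once yields an empty list.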
According to a specific embodiment of the present invention, the participle group tightening coefficient generation module generates participle group tightening coefficients. A participle group tightening coefficient is the number of participles separating two adjacent occurrences of the same participle group within the whole material. According to a specific embodiment of the present invention, the tightening coefficients of each participle group are expressed as WGGC = [G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of participles between the first and second occurrences of the participle group in the material, G_WG_ID_2 is the number of participles between its second and third occurrences, and G_WG_ID_(WG_N-1) is the number of participles between its (WG_N-1)-th and WG_N-th occurrences. According to a specific embodiment of the present invention, the participle group tightening coefficient feature vector generation module generates the participle group tightening coefficient feature vector WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the participle group in the participle library, WG_N is the total number of times the participle group occurs in the material, and WG_CHAR is the part of speech of the participle group. The participle group tightening coefficients reveal the overall distribution of a specific participle group within the corresponding material.
According to a specific embodiment of the present invention, the Chinese-foreign participle group tightening coefficient generation module generates Chinese-foreign participle group tightening coefficients. A Chinese-foreign participle group tightening coefficient is the number of participles separating two adjacent occurrences of the same Chinese-foreign participle group within the whole material. According to a specific embodiment of the present invention, the tightening coefficients of each Chinese-foreign participle group are expressed as WFGGC = [G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 is the number of participles between the first and second occurrences of the Chinese-foreign participle group in the material, G_WFG_ID_2 is the number of participles between its second and third occurrences, and G_WFG_ID_(WFG_N-1) is the number of participles between its (WFG_N-1)-th and WFG_N-th occurrences. According to a specific embodiment of the present invention, the Chinese-foreign participle group tightening coefficient feature vector generation module generates the Chinese-foreign participle group tightening coefficient feature vector WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID is the unique number of the Chinese-foreign participle group in the participle library, WFG_N is the total number of times the Chinese-foreign participle group occurs in the material, and WFG_CHAR is the part of speech of the Chinese-foreign participle group. The Chinese-foreign participle group tightening coefficients reveal the overall distribution of a specific Chinese-foreign participle group within the corresponding material.
According to a specific embodiment of the present invention, the participle free vector dimension determining module determines the participle free vector dimension WFV from the word segmentation result of the material. WFV equals the number of distinct participles obtained after segmenting the specific material. When the material is short or yields few segmentation results, WFV is small; when the material is long or yields many segmentation results, WFV is large.
According to a specific embodiment of the present invention, the participle group free vector dimension determining module determines the participle group free vector dimension WGFV from the word segmentation result of the material. WGFV equals the number of distinct participle groups obtained after segmenting the specific material. When the material is short or yields few participle group results, WGFV is small; when the material is long or yields many participle group results, WGFV is large.
According to a specific embodiment of the present invention, the Chinese-foreign participle group free vector dimension determining module determines the Chinese-foreign participle group free vector dimension WFGFV from the word segmentation result of the material. WFGFV equals the number of distinct Chinese-foreign participle groups obtained after segmenting the specific material. When the material is short or yields few Chinese-foreign participle group results, WFGFV is small; when the material is long or yields many Chinese-foreign participle group results, WFGFV is large.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module simplifies the participle free vector dimension WFV of each material and generates the participle simplified vector dimension RWV. RWV is specified by the system. Preferably, the system sets RWV to 500; in other preferred embodiments, the system sets RWV to 800 or to 1000.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module simplifies the participle free vector dimension WFV using an equal-interval extraction method. The process is as follows. It is judged whether WFV is greater than the participle simplified vector dimension RWV. If so, WFV is divided by the system-specified RWV and the quotient is rounded up to obtain the simplification coefficient REDU. One feature value is then extracted after every REDU-1 feature values among the feature values corresponding to WFV. After this pass over all feature values, it is judged whether the number of extracted feature values equals RWV. If it does, the simplification of WFV is complete. If it is less than RWV, the difference between RWV and the number of extracted feature values is calculated, and that many feature values are randomly extracted from the feature values not yet extracted, which completes the simplification of WFV.
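The equal-interval extraction can be sketched as follows. This is an illustrative reading, not the patent's reference implementation: the function name `reduce_equal_interval`, the choice to start at the first feature value, and returning the input unchanged when WFV is not greater than RWV are all assumptions.

```python
import random

def reduce_equal_interval(features, rwv, rng=None):
    """Keep `rwv` feature values out of `features` (len(features) == WFV)."""
    rng = rng or random.Random()
    wfv = len(features)
    if wfv <= rwv:
        # Assumption: such materials are handled elsewhere; return as-is.
        return list(features)
    redu = (wfv + rwv - 1) // rwv          # ceil(WFV / RWV) in integers
    # Take one value, skip REDU-1 values, and repeat.
    picked = list(range(0, wfv, redu))
    shortfall = rwv - len(picked)
    if shortfall > 0:
        taken = set(picked)
        remaining = [i for i in range(wfv) if i not in taken]
        # Top up at random from the values not yet extracted.
        picked += rng.sample(remaining, shortfall)
    return [features[i] for i in sorted(picked)]
```

Because REDU is rounded up, the regular pass never selects more than RWV values, so only the shortfall case needs the random top-up. The same routine applies to WGFV/RWGV and WFGFV/RWFGV.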
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module simplifies the participle free vector dimension WFV using a part-of-speech screening method. The process is as follows. The feature values of the word segmentation result are classified according to the part of speech of the corresponding participle; according to a specific embodiment of the present invention, they are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values and V-class function word feature values. It is generally considered that the feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (number of A1-class notional word feature values), AMOUNT_A2 (number of A2-class notional word feature values), AMOUNT_B (number of B-class notional word feature values), AMOUNT_C (number of C-class notional word feature values), AMOUNT_D (number of D-class notional word feature values) and AMOUNT_V (number of V-class function word feature values).
The value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated. If it is greater than 0, this simplification is exited; if it equals 0, this simplification is complete; if it is less than 0, the value RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated. If RWV_S_D is greater than 0, RWV_S_D feature values are randomly extracted from the feature values corresponding to AMOUNT_V and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated. If RWV_S_C is greater than 0, RWV_S_C feature values are randomly extracted from the feature values corresponding to AMOUNT_D and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated. If RWV_S_B is greater than 0, RWV_S_B feature values are randomly extracted from the feature values corresponding to AMOUNT_C and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2) is further calculated. If RWV_S_A2 is greater than 0, RWV_S_A2 feature values are randomly extracted from the feature values corresponding to AMOUNT_B and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWV_S_A1 = RWV - AMOUNT_A1 is further calculated. If RWV_S_A1 is greater than 0, RWV_S_A1 feature values are randomly extracted from the feature values corresponding to AMOUNT_A2 and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, RWV feature values are randomly extracted from the feature values corresponding to AMOUNT_A1 and this simplification is complete.
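The cascade of comparisons above amounts to filling RWV slots in priority order A1, A2, B, C, D, V, with a random draw from the first class that does not fit entirely. The sketch below follows that reading; it assumes that the higher-priority classes are retained in full whenever a random draw is taken from a lower-priority class, which the text implies but does not state. The names `PRIORITY` and `reduce_by_pos` are hypothetical.

```python
import random

PRIORITY = ["A1", "A2", "B", "C", "D", "V"]  # highest priority first

def reduce_by_pos(features_by_class, rwv, rng=None):
    """Select `rwv` feature values by part-of-speech priority.

    features_by_class -- dict: class name -> list of feature values
    Returns None when the material has fewer than `rwv` feature values
    (the RWV_S_V > 0 case, where this simplification is exited).
    """
    rng = rng or random.Random()
    total = sum(len(features_by_class.get(c, [])) for c in PRIORITY)
    if total < rwv:
        return None
    selected = []
    for cls in PRIORITY:
        group = features_by_class.get(cls, [])
        room = rwv - len(selected)
        if room == 0:
            break
        if len(group) <= room:
            selected.extend(group)                    # whole class fits
        else:
            selected.extend(rng.sample(group, room))  # random partial draw
            break
    return selected
```

Walking the classes upward from A1 gives the same result as the downward cascade in the text, since both keep every class that fits within RWV and sample only from the first class that overflows.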
The case in which the value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0 means that the material is short or carries little information, and is therefore not suitable for comparison using feature values.
When the participle free vector dimension WFV is less than the participle simplified vector dimension RWV, the material's own dimension is small, and the values in the remaining dimensions are treated as 0. Such materials need to be marked directly in the system and recorded and processed separately. For example, folk sayings and famous quotations are used as index entries for lookup. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module simplifies the participle group free vector dimension WGFV of each material and generates the participle group simplified vector dimension RWGV. RWGV is specified by the system. Preferably, the system sets RWGV to 500; in other preferred embodiments, the system sets RWGV to 800 or to 1000.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module simplifies the participle group free vector dimension WGFV using an equal-interval extraction method. The process is as follows. It is judged whether WGFV is greater than the participle group simplified vector dimension RWGV. If so, WGFV is divided by the system-specified RWGV and the quotient is rounded up to obtain the simplification coefficient REDU. One feature value is then extracted after every REDU-1 feature values among the feature values corresponding to WGFV. After this pass over all feature values, it is judged whether the number of extracted feature values equals RWGV. If it does, the simplification of WGFV is complete. If it is less than RWGV, the difference between RWGV and the number of extracted feature values is calculated, and that many feature values are randomly extracted from the feature values not yet extracted, which completes the simplification of WGFV.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module simplifies the participle group free vector dimension WGFV using a part-of-speech screening method. The process is as follows. The feature values are classified according to the corresponding part of speech; according to a specific embodiment of the present invention, they are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values and V-class function word feature values. It is generally considered that the feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (number of A1-class notional word feature values), AMOUNT_A2 (number of A2-class notional word feature values), AMOUNT_B (number of B-class notional word feature values), AMOUNT_C (number of C-class notional word feature values), AMOUNT_D (number of D-class notional word feature values) and AMOUNT_V (number of V-class function word feature values).
The value RWGV_S_V = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated. If it is greater than 0, this simplification is exited; if it equals 0, this simplification is complete; if it is less than 0, the value RWGV_S_D = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated. If RWGV_S_D is greater than 0, RWGV_S_D feature values are randomly extracted from the feature values corresponding to AMOUNT_V and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWGV_S_C = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated. If RWGV_S_C is greater than 0, RWGV_S_C feature values are randomly extracted from the feature values corresponding to AMOUNT_D and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWGV_S_B = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated. If RWGV_S_B is greater than 0, RWGV_S_B feature values are randomly extracted from the feature values corresponding to AMOUNT_C and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWGV_S_A2 = RWGV - (AMOUNT_A1 + AMOUNT_A2) is further calculated. If RWGV_S_A2 is greater than 0, RWGV_S_A2 feature values are randomly extracted from the feature values corresponding to AMOUNT_B and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWGV_S_A1 = RWGV - AMOUNT_A1 is further calculated. If RWGV_S_A1 is greater than 0, RWGV_S_A1 feature values are randomly extracted from the feature values corresponding to AMOUNT_A2 and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, RWGV feature values are randomly extracted from the feature values corresponding to AMOUNT_A1 and this simplification is complete.
The case in which the value RWGV_S_V = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0 means that the material is short or carries little information, and is therefore not suitable for comparison using feature values.
When the participle group free vector dimension WGFV is less than the participle group simplified vector dimension RWGV, the material's own dimension is small, and the values in the remaining dimensions are treated as 0. Such materials need to be marked directly in the system and recorded and processed separately. For example, folk sayings and famous quotations are used as index entries for lookup. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the Chinese-foreign participle group simplified vector dimension generation module simplifies the Chinese-foreign participle group free vector dimension WFGFV of each material and generates the Chinese-foreign participle group simplified vector dimension RWFGV. RWFGV is specified by the system. Preferably, the system sets RWFGV to 500; in other preferred embodiments, the system sets RWFGV to 800 or to 1000.
According to a specific embodiment of the present invention, the Chinese-foreign participle group simplified vector dimension generation module simplifies the Chinese-foreign participle group free vector dimension WFGFV using an equal-interval extraction method. The process is as follows. It is judged whether WFGFV is greater than the Chinese-foreign participle group simplified vector dimension RWFGV. If so, WFGFV is divided by the system-specified RWFGV and the quotient is rounded up to obtain the simplification coefficient REDU. One feature value is then extracted after every REDU-1 feature values among the feature values corresponding to WFGFV. After this pass over all feature values, it is judged whether the number of extracted feature values equals RWFGV. If it does, the simplification of WFGFV is complete. If it is less than RWFGV, the difference between RWFGV and the number of extracted feature values is calculated, and that many feature values are randomly extracted from the feature values not yet extracted, which completes the simplification of WFGFV.
According to a specific embodiment of the present invention, the Chinese-foreign participle group simplified vector dimension generation module simplifies the Chinese-foreign participle group free vector dimension WFGFV using a part-of-speech screening method. The process is as follows. The feature values are classified according to the corresponding part of speech; according to a specific embodiment of the present invention, they are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values and V-class function word feature values. It is generally considered that the feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (number of A1-class notional word feature values), AMOUNT_A2 (number of A2-class notional word feature values), AMOUNT_B (number of B-class notional word feature values), AMOUNT_C (number of C-class notional word feature values), AMOUNT_D (number of D-class notional word feature values) and AMOUNT_V (number of V-class function word feature values).
The value RWFGV_S_V = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated. If it is greater than 0, this simplification is exited; if it equals 0, this simplification is complete; if it is less than 0, the value RWFGV_S_D = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated. If RWFGV_S_D is greater than 0, RWFGV_S_D feature values are randomly extracted from the feature values corresponding to AMOUNT_V and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWFGV_S_C = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated. If RWFGV_S_C is greater than 0, RWFGV_S_C feature values are randomly extracted from the feature values corresponding to AMOUNT_D and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWFGV_S_B = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated. If RWFGV_S_B is greater than 0, RWFGV_S_B feature values are randomly extracted from the feature values corresponding to AMOUNT_C and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWFGV_S_A2 = RWFGV - (AMOUNT_A1 + AMOUNT_A2) is further calculated. If RWFGV_S_A2 is greater than 0, RWFGV_S_A2 feature values are randomly extracted from the feature values corresponding to AMOUNT_B and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, the value RWFGV_S_A1 = RWFGV - AMOUNT_A1 is further calculated. If RWFGV_S_A1 is greater than 0, RWFGV_S_A1 feature values are randomly extracted from the feature values corresponding to AMOUNT_A2 and this simplification is complete; if it equals 0, this simplification is complete; if it is less than 0, RWFGV feature values are randomly extracted from the feature values corresponding to AMOUNT_A1 and this simplification is complete.
The case in which the value RWFGV_S_V = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0 means that the material is short or carries little information, and is therefore not suitable for comparison using feature values.
When the Chinese-foreign participle group free vector dimension WFGFV is less than the Chinese-foreign participle group simplified vector dimension RWFGV, the material's own dimension is small, and the values in the remaining dimensions are treated as 0. Such materials need to be marked directly in the system and recorded and processed separately. For example, folk sayings and famous quotations are used as index entries for lookup. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the participle feature vector generation module extracts, for each material, the feature values corresponding to the participle simplified vector dimension RWV and generates the participle feature vector WVE_RWV:
WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i is the unique number of the i-th participle in the participle library and W_N_i is the total number of times that participle occurs in the material; this count is used as the feature value of the participle.
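Assembling the feature vector from the retained feature values can be sketched as a simple flattening of [ID, count] pairs. The function name `build_feature_vector` is hypothetical, and the input is assumed to be the list of pairs left after simplification.

```python
def build_feature_vector(selected_pairs):
    """Flatten [[W_ID_1, W_N_1], ...] into [W_ID_1, W_N_1, ..., W_ID_RWV, W_N_RWV]."""
    vector = []
    for w_id, w_n in selected_pairs:
        vector.extend([w_id, w_n])
    return vector
```

With RWV retained pairs, the resulting list has 2 × RWV entries; the participle group and Chinese-foreign participle group vectors have the same layout with their own IDs and counts.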
According to a specific embodiment of the present invention, the participle group feature vector generation module extracts, for each material, the feature values corresponding to the participle group simplified vector dimension RWGV and generates the participle group feature vector WVE_RWGV:
WVE_RWGV = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]
where WG_ID_i is the unique number of the i-th participle group in the participle library and WG_N_i is the total number of times that participle group occurs in the material; this count is used as the feature value of the participle group.
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group feature vector generation module extracts, for each material, the feature values corresponding to the reduced Chinese-foreign word-segment group vector dimension RWFGV and generates the Chinese-foreign word-segment group feature vector WVE_RWFGV:
WVE_RWFGV = [WFG_ID1, WFG_N1, ..., WFG_IDi, WFG_Ni, ..., WFG_IDRWFGV, WFG_NRWFGV]
where WFG_IDi is the unique number of the Chinese-foreign word-segment group in the word-segmentation lexicon and WFG_Ni is the total number of times that group occurs in the material; this count is used as the feature value of the Chinese-foreign word-segment group.
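The three feature vectors above share one layout: alternating identifier and occurrence count. The following is a minimal Python sketch of that layout, not the patented implementation. It assumes the material has already been segmented into tokens and that the identifiers belonging to the reduced dimension have been chosen elsewhere; the names build_feature_vector, lexicon_ids and selected_ids are hypothetical.

```python
from collections import Counter

def build_feature_vector(tokens, lexicon_ids, selected_ids):
    """Return a flat [ID1, N1, ..., IDr, Nr] list for one material.

    tokens       -- segments (or segment groups) found in the material
    lexicon_ids  -- hypothetical mapping from segment text to its unique lexicon number
    selected_ids -- identifiers that make up the reduced dimension (assumed given)
    """
    # Count occurrences per lexicon number; tokens missing from the lexicon are skipped.
    counts = Counter(lexicon_ids[t] for t in tokens if t in lexicon_ids)
    vector = []
    for seg_id in selected_ids:
        # A Counter returns 0 for identifiers that never occurred.
        vector.extend([seg_id, counts[seg_id]])
    return vector

tokens = ["text", "detection", "text", "system"]
lexicon_ids = {"text": 7, "detection": 12, "system": 30}
print(build_feature_vector(tokens, lexicon_ids, [7, 12, 99]))
```

The same function would serve for WVE_RWV, WVE_RWGV and WVE_RWFGV; only the tokens and the identifier list differ.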
According to a specific embodiment of the present invention, the system provides users with several access modes. When a user accesses the system, the user access mode detection module detects the current user's access mode.
In a specific embodiment of the present invention, a user can access the system in trial mode; a user who does so is referred to below as a trial user. When the user access mode detection module detects that a user is accessing the system in trial mode, it sends the trial user a prompt stating that the current access mode is trial mode and describing the trial user's access rights. According to a specific embodiment of the present invention, the system offers a trial user detection of only a predetermined number of characters, and this number is set in advance by the system. According to another specific embodiment of the present invention, the system offers a trial user part or all of the database scope for trial detection. According to another specific embodiment of the present invention, the plagiarism detection result provided to a trial user shows only the plagiarism rate; it does not show the specific plagiarized passages or the comparison with the plagiarized documents. According to another specific embodiment of the present invention, the plagiarism detection result provided to a trial user shows the specific plagiarized passages but obfuscates the comparison with the plagiarized documents, so that the trial user learns only which passages of the submitted document are flagged and cannot identify the plagiarized documents.
According to a specific embodiment of the present invention, a user accesses the system in counting mode; such a user is referred to below as a counting user. When the user access mode detection module detects that a user is accessing the system in counting mode, it sends the counting user a prompt stating that the current access mode is counting mode and asking the counting user to upload the document to be checked for plagiarism. According to a specific embodiment of the present invention, the system counts the characters in the document uploaded by the counting user and calculates the fee for this plagiarism detection from that count. According to another specific embodiment of the present invention, the system offers the counting user part or all of the database scope to choose from and calculates the fee for this plagiarism detection according to the database scope the counting user selects.
According to a specific embodiment of the present invention, a user accesses the system in timing mode; such a user is referred to below as a timing user. When the user access mode detection module detects that a user is accessing the system in timing mode, it sends the timing user a prompt stating that the current access mode is timing mode and showing the timing user's remaining usage time. According to another specific embodiment of the present invention, while a timing user is using the system, the system shows a real-time countdown of the remaining usage time on the display interface. According to another specific embodiment of the present invention, the system offers the timing user part or all of the database scope to choose from. According to a specific embodiment of the present invention, the system estimates the detection time the document requires from the database scope the timing user selects and the number of characters in the uploaded document, and tells the timing user whether the remaining usage time is sufficient to complete the current plagiarism detection.
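The counting-mode fee and the timing-mode sufficiency check both depend on the character count and the selected database scope. The sketch below shows one way such calculations could look. The text gives no rates, so the per-1000-character fee and processing time, the scope names "partial" and "full", and the function names are all hypothetical placeholders.

```python
import math

# Hypothetical rates; the text only says that fee and duration depend on
# character count and database scope.
FEE_PER_1000_CHARS = {"partial": 1.0, "full": 2.5}
SECONDS_PER_1000_CHARS = {"partial": 20, "full": 60}

def detection_fee(char_count, scope):
    """Fee for one detection in counting mode, billed per started 1000 characters."""
    units = math.ceil(char_count / 1000)
    return units * FEE_PER_1000_CHARS[scope]

def can_finish(char_count, scope, remaining_seconds):
    """Return (estimated seconds needed, whether the remaining time suffices)."""
    needed = math.ceil(char_count / 1000) * SECONDS_PER_1000_CHARS[scope]
    return needed, needed <= remaining_seconds

print(detection_fee(2500, "full"))
print(can_finish(2500, "partial", 50))
```

Billing per started block is only one possible rule; a linear per-character rate would fit the description equally well.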
According to a specific embodiment of the present invention, after a timing user logs in to the system, the user detection mode determining module determines the plagiarism detection mode. According to a specific embodiment of the present invention, the system offers a self-audit mode, a common plagiarism identification mode, an extended plagiarism identification mode, a multilingual plagiarism identification mode and a formula plagiarism identification mode to choose from.
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user's detection mode is the self-audit mode, the user writing style test module presents one or more test pictures to the user, and the user writes online, within a specified time, a text description of the test pictures of no fewer than a specified number of words. Preferably, the user writing style test module further presents one or more test articles to the user, and the user writes online, within the specified time, a text commentary of no fewer than the specified number of words. The test pictures and test articles are selected at random by the user writing style test module from a test picture library and a test article library. Whether test pictures or test articles are used, the user must write the description or commentary online. Because the specified time cannot be too long, it is usually set to 30 minutes or 60 minutes, and the corresponding specified length of the description or commentary is usually 400 words in 30 minutes or 800 words in 60 minutes. Those skilled in the art may set other specified times or word counts as needed. Experimental data indicate that the specified time should not be too long, so that users are not prevented from completing the test by lack of time or an unstable network connection; likewise, the ratio of the specified word count to the specified time should not be too low, or the test will not faithfully reflect the user's writing habits. Because the specified time is limited, the description or commentary is also limited in length, and feature values and feature vectors extracted only from the online test text may not truly reflect the user's writing habits. It is therefore necessary to further extract a test picture benchmark feature vector and a test article benchmark feature vector, which are used to correct the feature vector deviation caused by an insufficient number of words in the description or commentary.
According to a specific embodiment of the present invention, every test picture in the test picture library has a test picture benchmark feature vector. To obtain it, a predetermined number of benchmark testers are selected at random from groups with different backgrounds; each writes a description of a given test picture of no fewer than the specified number of words; all descriptions are collected; the test picture description feature values for that picture are computed; feature vectors are calculated from those feature values; and the feature vectors are combined by a weighted operation to give the test picture benchmark feature vector of that picture. The weights used in the weighted operation are set by the system. Every test article in the test article library likewise has a test article benchmark feature vector. To obtain it, a predetermined number of benchmark testers are selected at random from groups with different backgrounds; each writes a description of a given test article of no fewer than the specified number of words; all descriptions are collected; the test article description feature values for that article are computed; feature vectors are calculated from those feature values; and the feature vectors are combined by a weighted operation to give the test article benchmark feature vector of that article. The weights used in the weighted operation are set by the system.
According to a specific embodiment of the present invention, when a predetermined number of benchmark testers are selected at random from groups with different backgrounds, they can be selected by age bracket, preferably divided into an under-20 group, a 20-29 group, a 30-39 group, a 40-49 group and a 50-and-over group. This collects how people of each age group describe the same test picture or test article in no fewer than the specified number of words.
According to a specific embodiment of the present invention, when a predetermined number of benchmark testers are selected at random from groups with different backgrounds, they can be selected by educational level, preferably divided into a below-bachelor's group, a bachelor's group, a master's student group and a doctoral student group. This collects how people of each educational group describe the same test picture or test article in no fewer than the specified number of words.
According to a specific embodiment of the present invention, when a predetermined number of benchmark testers are selected at random from groups with different backgrounds, they can be selected by professional field (the fields can be divided according to the required test precision and are not detailed here). This collects how people of each professional field describe the same test picture or test article in no fewer than the specified number of words.
According to a specific embodiment of the present invention, the test picture description feature value generation module obtains the test picture description texts written by the benchmark testers and generates the test picture description feature values. The test picture description feature values include, but are not limited to: Chinese character count, foreign word count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage and part-of-speech usage. According to a specific embodiment of the present invention, the Chinese character count is the number of Chinese characters in each test picture description, excluding punctuation, with each Chinese character counted as one character; the foreign word count is the number of foreign-language units in each test picture description, excluding punctuation, with each foreign word counted as one character; the total word count is the total number of words obtained after segmenting each test picture description, where Chinese text can be segmented with the system's own word-segmentation lexicon and foreign text can be segmented directly on the spaces between words, following foreign-language writing conventions; the content word count is the number of content words in each test picture description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon, and it can be further divided into a Chinese content word count and a foreign content word count whose sum equals the content word count; the function word count is the number of function words in each test picture description, obtained in the same way, and it can be further divided into a Chinese function word count and a foreign function word count whose sum equals the function word count; the paragraph count is the number of paragraphs in each test picture description; the paragraph length distribution is the number of words and the number of sentences in each paragraph of each test picture description; the sentence count is the number of sentences in each test picture description; the sentence length distribution is the number of words in each sentence of each test picture description; synonym and near-synonym expansion is obtained by comparing the segmentation result of each test picture description with a synonym and near-synonym lexicon, grouping segments of identical or similar meaning into a set, and counting the words in each set, which reflects the synonym and near-synonym writing habits of the author of the description: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms, and the fewer words it contains, the less the author tends to do so; function word usage is the statistics of function word use in each test picture description, including but not limited to the usage ranking of function words, the number of words between each pair of different function words, and the number of words between each pair of identical function words; for example, the usage of the three structural particles "的", "地" and "得" can further be counted, which reflects whether the author of the description distinguishes between them; punctuation usage is the statistics of punctuation use in each test picture description, including but not limited to the usage ranking of punctuation marks, the number of words between each pair of different punctuation marks, and the number of words between each pair of identical punctuation marks; part-of-speech usage is the statistics for each part of speech in each test picture description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon, for example the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections and onomatopoeic words, and the ratio of each of these numbers to the total word count of the description.
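Some of the surface counts defined above can be computed without a segmentation lexicon. The following Python sketch covers only those: Chinese character count, foreign word count, paragraph count and sentence count. The regular expressions and the function name basic_counts are assumptions made for illustration; in particular, treating every period as a sentence end is a simplification that miscounts decimals and abbreviations.

```python
import re

# Basic CJK Unified Ideographs block; punctuation such as "。" lies outside it.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# A foreign word is approximated as a run of Latin letters, allowing inner ' or -.
FOREIGN_WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
# Chinese and Western sentence-ending marks.
SENTENCE_END = re.compile(r"[。！？!?.]+")

def basic_counts(text):
    """Return surface counts for one description text."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {
        "chinese_chars": len(CJK_CHAR.findall(text)),   # each character counts once
        "foreign_words": len(FOREIGN_WORD.findall(text)),  # each word counts once
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
    }

sample = "今天天气很好。I like Python!\n第二段开始了。"
print(basic_counts(sample))
```

Content word, function word and part-of-speech counts would additionally need the segmentation lexicon with part-of-speech tags, which this sketch does not model.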
According to a specific embodiment of the present invention, the test picture description feature value generation module generates the test picture description feature vector from the test picture description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test picture description feature vector, the content of each item in the vector and the order of the items. When the dimension of the test picture description feature vector is n, it can be expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first item value of the test picture description feature vector, TPC_m is the m-th item value and TPC_n is the n-th item value.
Preferably, the test picture description feature vector includes one or more of the following items: the ratio of the Chinese character count to the total word count, the ratio of the foreign word count to the total word count, the ratio of the content word count to the total word count, the ratio of the function word count to the total word count, the ratio of the total word count to the paragraph count, the word count of the longest paragraph, the ratio of the synonym and near-synonym expansion count to the total word count, the ratio of the punctuation count to the total word count, and the ratios to the total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections and onomatopoeic words.
According to a specific embodiment of the present invention, the test picture benchmark feature vector generation module collects the test picture description feature vectors written for the same test picture and combines them by a weighted operation to obtain the test picture benchmark feature vector of that picture; the weights used in the weighted operation are set by the system. Preferably, the test picture benchmark feature vector generation module can collect a predetermined number of test picture description feature vectors separately for each age group, each educational group and each professional field group and weight them separately, obtaining a test picture benchmark feature vector of the specific test picture for each age group, each educational group and each professional field group.
The test picture benchmark feature vector of a specific test picture can be expressed as:
TPCVE_ID = [Σ(i=1..k) W1,i·TPC_1i, ..., Σ(i=1..k) Wm,i·TPC_mi, ..., Σ(i=1..k) Wn,i·TPC_ni]
where TPCVE_ID is the test picture benchmark feature vector of the picture numbered ID; k is the number of benchmark testers; TPC_1i is the first item value of the feature vector of the i-th benchmark tester; TPC_mi is the m-th item value of the feature vector of the i-th benchmark tester; TPC_ni is the n-th item value of the feature vector of the i-th benchmark tester; W1,i is the weighting coefficient of TPC_1i; Wm,i is the weighting coefficient of TPC_mi; and Wn,i is the weighting coefficient of TPC_ni.
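As an illustration of the weighted operation, the sketch below assumes that each item of the benchmark vector is a weighted sum, over the k benchmark testers, of that item in each tester's feature vector. Whether the weights are normalized is left to the system, so this sketch applies them exactly as given. The function name benchmark_vector and the sample numbers are hypothetical.

```python
def benchmark_vector(tester_vectors, weights):
    """Combine k tester vectors of dimension n into one benchmark vector.

    tester_vectors[i][m] -- item m of the feature vector of tester i
    weights[i][m]        -- weighting coefficient for that item (set by the system)
    """
    k = len(tester_vectors)
    if k == 0:
        raise ValueError("at least one benchmark tester is required")
    n = len(tester_vectors[0])
    rows_ok = all(len(v) == n for v in tester_vectors)
    weights_ok = len(weights) == k and all(len(w) == n for w in weights)
    if not (rows_ok and weights_ok):
        raise ValueError("all vectors and weight rows must share dimension n")
    # Item m of the result is the sum over testers i of weight * value.
    return [
        sum(weights[i][m] * tester_vectors[i][m] for i in range(k))
        for m in range(n)
    ]

vectors = [[0.5, 0.25], [0.75, 0.5]]   # two testers, two feature items
equal_weights = [[0.5, 0.5], [0.5, 0.5]]
print(benchmark_vector(vectors, equal_weights))
```

With equal weights of 1/k the result is the plain average of the testers' vectors; per-item weights allow individual features or testers to be emphasized.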
According to a specific embodiment of the present invention, the test article description feature value generation module obtains the test article description texts written by the benchmark testers and generates the test article description feature values. The test article description feature values include, but are not limited to: Chinese character count, foreign word count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage and part-of-speech usage. According to a specific embodiment of the present invention, the Chinese character count is the number of Chinese characters in each test article description, excluding punctuation, with each Chinese character counted as one character; the foreign word count is the number of foreign-language units in each test article description, excluding punctuation, with each foreign word counted as one character; the total word count is the total number of words obtained after segmenting each test article description, where Chinese text can be segmented with the system's own word-segmentation lexicon and foreign text can be segmented directly on the spaces between words, following foreign-language writing conventions; the content word count is the number of content words in each test article description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon, and it can be further divided into a Chinese content word count and a foreign content word count whose sum equals the content word count; the function word count is the number of function words in each test article description, obtained in the same way, and it can be further divided into a Chinese function word count and a foreign function word count whose sum equals the function word count; the paragraph count is the number of paragraphs in each test article description; the paragraph length distribution is the number of words and the number of sentences in each paragraph of each test article description; the sentence count is the number of sentences in each test article description; the sentence length distribution is the number of words in each sentence of each test article description; synonym and near-synonym expansion is obtained by comparing the segmentation result of each test article description with a synonym and near-synonym lexicon, grouping segments of identical or similar meaning into a set, and counting the words in each set, which reflects the synonym and near-synonym writing habits of the author of the description: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms, and the fewer words it contains, the less the author tends to do so; function word usage is the statistics of function word use in each test article description, including but not limited to the usage ranking of function words, the number of words between each pair of different function words, and the number of words between each pair of identical function words; for example, the usage of the three structural particles "的", "地" and "得" can further be counted, which reflects whether the author of the description distinguishes between them; punctuation usage is the statistics of punctuation use in each test article description, including but not limited to the usage ranking of punctuation marks, the number of words between each pair of different punctuation marks, and the number of words between each pair of identical punctuation marks; part-of-speech usage is the statistics for each part of speech in each test article description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon, for example the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections and onomatopoeic words, and the ratio of each of these numbers to the total word count of the description.
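Two of the usage statistics described above work on the segmented token list rather than on raw text: the spacing between identical function words (or punctuation marks) and the synonym expansion count. The sketch below illustrates both under stated assumptions: the text is already segmented, the synonym sets are supplied by a lexicon, and the "word quantity in each set" is read as the number of distinct set members the author actually used. The function names and sample data are hypothetical.

```python
from collections import defaultdict

def gaps_between_same(tokens, targets):
    """For each target token, list how many tokens lie between consecutive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok in targets:
            if tok in last_seen:
                gaps[tok].append(i - last_seen[tok] - 1)
            last_seen[tok] = i
    return dict(gaps)

def synonym_spread(tokens, synonym_sets):
    """Count how many distinct members of each synonym set occur in the tokens."""
    present = set(tokens)
    return {name: len(present & members) for name, members in synonym_sets.items()}

tokens = ["我", "的", "书", "和", "你", "的", "笔", "的", "盒"]
print(gaps_between_same(tokens, {"的"}))

words = ["美丽", "漂亮", "高兴", "美丽"]
sets = {"beautiful": {"美丽", "漂亮", "好看"}, "happy": {"高兴", "开心"}}
print(synonym_spread(words, sets))
```

A larger count for a set suggests that the author varies wording within that meaning; a count of 1 suggests a single preferred word.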
According to a specific embodiment of the present invention, the test article description feature value generation module generates the test article description feature vector from the test article description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test article description feature vector, the content of each item in the vector and the order of the items. When the dimension of the test article description feature vector is n, it can be expressed as TTCVE = [TTC_1, ..., TTC_m, ..., TTC_n], where TTC_1 is the first item value of the test article description feature vector, TTC_m is the m-th item value and TTC_n is the n-th item value.
Preferably, the test article description feature vector includes one or more of the following items: the ratio of the Chinese character count to the total word count, the ratio of the foreign word count to the total word count, the ratio of the content word count to the total word count, the ratio of the function word count to the total word count, the ratio of the total word count to the paragraph count, the word count of the longest paragraph, the ratio of the synonym and near-synonym expansion count to the total word count, the ratio of the punctuation count to the total word count, and the ratios to the total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections and onomatopoeic words.
According to a specific embodiment of the present invention, the test article benchmark feature vector generation module collects the test article description feature vectors written for the same test article and combines them by a weighted operation to obtain the test article benchmark feature vector of that article; the weights used in the weighted operation are set by the system. Preferably, the test article benchmark feature vector generation module can collect a predetermined number of test article description feature vectors separately for each age group, each educational group and each professional field group and weight them separately, obtaining a test article benchmark feature vector of the specific test article for each age group, each educational group and each professional field group.
The test article benchmark feature vector of a specific test article can be expressed as:
TTCVE_ID = [Σ(i=1..k) W1,i·TTC_1i, ..., Σ(i=1..k) Wm,i·TTC_mi, ..., Σ(i=1..k) Wn,i·TTC_ni]
where TTCVE_ID is the test article benchmark feature vector of the article numbered ID; k is the number of benchmark testers; TTC_1i is the first item value of the feature vector of the i-th benchmark tester; TTC_mi is the m-th item value of the feature vector of the i-th benchmark tester; TTC_ni is the n-th item value of the feature vector of the i-th benchmark tester; W1,i is the weighting coefficient of TTC_1i; Wm,i is the weighting coefficient of TTC_mi; and Wn,i is the weighting coefficient of TTC_ni.
According to a specific embodiment of the present invention, the test picture description feature vector and the test article description feature vector have the same dimension, the same meaning for each feature value and the same ordering. For example, both vectors can be set so that the 1st feature value is the ratio of the Chinese character count to the total word count, the 2nd is the ratio of the foreign word count to the total word count, the 3rd is the ratio of the content word count to the total word count, the 4th is the ratio of the function word count to the total word count, the 5th is the ratio of the total word count to the paragraph count, the 6th is the word count of the longest paragraph, the 7th is the ratio of the synonym and near-synonym expansion count to the total word count, the 8th is the ratio of the punctuation count to the total word count, the 9th is the ratio of the noun count to the total word count, the 10th is the ratio of the verb count to the total word count, the 11th is the ratio of the adjective count to the total word count, the 12th is the ratio of the numeral count to the total word count, the 13th is the ratio of the measure word count to the total word count, the 14th is the ratio of the pronoun count to the total word count, the 15th is the ratio of the adverb count to the total word count, the 16th is the ratio of the preposition count to the total word count, the 17th is the ratio of the conjunction count to the total word count, the 18th is the ratio of the particle count to the total word count, the 19th is the ratio of the interjection count to the total word count, and the 20th is the ratio of the onomatopoeic word count to the total word count.
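A fixed ordering is what lets a picture vector and an article vector be compared item by item. As a minimal sketch, the function below builds only the first six items of the example ordering from a dictionary of raw counts; the remaining fourteen items follow the same ratio pattern. The dictionary keys and the function name description_vector are hypothetical.

```python
def description_vector(counts):
    """Build items 1-6 of the example ordering from raw counts for one description."""
    total = counts["total_words"]
    if total <= 0 or counts["paragraphs"] <= 0:
        raise ValueError("total_words and paragraphs must be positive")
    return [
        counts["chinese_chars"] / total,     # item 1: Chinese characters / total words
        counts["foreign_words"] / total,     # item 2: foreign words / total words
        counts["content_words"] / total,     # item 3: content words / total words
        counts["function_words"] / total,    # item 4: function words / total words
        total / counts["paragraphs"],        # item 5: total words / paragraphs
        counts["longest_paragraph_words"],   # item 6: longest paragraph word count
    ]

counts = {
    "total_words": 200, "chinese_chars": 150, "foreign_words": 20,
    "content_words": 120, "function_words": 80,
    "paragraphs": 4, "longest_paragraph_words": 70,
}
print(description_vector(counts))
```

Items 5 and 6 are not ratios to the total word count, so their scale differs from the other items; any later comparison of vectors may need to take that into account.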
According to a specific embodiment of the present invention, feature values can be added to or removed from the test picture description feature vector and the test article description feature vector, but after the addition or removal the two vectors must still agree in dimension, in the meaning of each feature value and in ordering.
According to a specific embodiment of the present invention, the user test picture description feature value generation module obtains the user's test picture description text and generates the user test picture description feature values; these cover the same content as the test picture description feature values and are not described again here. The user test picture description feature vector generation module calculates the user test picture description feature vector from the user test picture description feature values. When the dimension of the test picture description feature vector is n, the feature vector of the current user USER's description of the picture numbered ID can be expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER], where TPC_1_USER is the first item value of the current user USER's test picture description feature vector, TPC_m_USER is the m-th item value and TPC_n_USER is the n-th item value.
The user picture writing style feature vector generation module calculates the difference between the user test picture text-description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID corresponding to that test picture, and uses this difference (TPCVE_ID_USER-TPCVE_ID) as the user picture writing style feature vector TPCVE_USER.
According to a specific embodiment of the present invention, the user test article text-description feature value generation module obtains the user's description text for a test article and generates the user test article text-description feature values. These feature values cover the same items as the test article text-description feature values and are not described again here. The user test article text-description feature vector generation module then calculates the user test article text-description feature vector from these feature values. When the dimension of the test article text-description feature vector is n, the feature vector of the current user USER for the test article numbered ID can be expressed as TTCVE_ID_USER=[TTC_1_USER, ..., TTC_m_USER, ..., TTC_n_USER], where TTC_1_USER is the first value, TTC_m_USER is the m-th value, and TTC_n_USER is the n-th value in the user test article text-description feature vector of the current user USER.
The user article writing style feature vector generation module calculates the difference between the user test article text-description feature vector TTCVE_ID_USER and the test article reference feature vector TTCVE_ID corresponding to that test article, and uses this difference (TTCVE_ID_USER-TTCVE_ID) as the user article writing style feature vector TTCVE_USER.
According to a specific embodiment of the present invention, when several test pictures or several test articles are used, or when one or more test pictures and one or more test articles are used together, the user test picture text-description feature value generation module and the user test article text-description feature value generation module generate feature values from the user's description text for each test picture and each test article, respectively. The corresponding feature vector generation modules then generate a user test picture and/or article text-description feature vector from each set of feature values. The user picture writing style feature vector generation module and the user article writing style feature vector generation module calculate the difference between each of these feature vectors and the corresponding test picture and/or article reference feature vector. The differences are weighted to obtain the user's picture writing style feature vector TPCVE_USER and article writing style feature vector TTCVE_USER. The user writing style feature vector generation module then weights TPCVE_USER and TTCVE_USER to obtain the user writing style feature vector TVE_USER. The weights can be chosen according to actual needs.
TVE_USER=TPCVE_USER*W_P+TTCVE_USER*W_T
where W_P is the weighting coefficient of the user picture writing style feature vector TPCVE_USER and W_T is the weighting coefficient of the user article writing style feature vector TTCVE_USER. When the user takes only the picture writing test or only the article writing test, the weighting coefficient of the test taken can be set to 1 and the weighting coefficient of the test not taken set to 0. Preferably, the two weights are chosen to be equal.
The user writing style feature vector can be expressed as TVE_USER=[TVE_1, ..., TVE_m, ..., TVE_n], where TVE_1 is the first value, TVE_m is the m-th value, and TVE_n is the n-th value in the user writing style feature vector.
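The difference and weighting steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the default weights of 0.5 are assumptions (the text only says the weights are preferably equal), and the vectors are plain Python lists of equal dimension n.

```python
def style_vector(user_vec, reference_vec):
    """Element-wise difference, e.g. TPCVE_ID_USER - TPCVE_ID."""
    if len(user_vec) != len(reference_vec):
        raise ValueError("vectors must have the same dimension")
    return [u - r for u, r in zip(user_vec, reference_vec)]


def combine_style_vectors(picture_style, article_style, w_p=0.5, w_t=0.5):
    """TVE_USER = TPCVE_USER * W_P + TTCVE_USER * W_T (weights are assumed values)."""
    if len(picture_style) != len(article_style):
        raise ValueError("vectors must have the same dimension")
    return [p * w_p + t * w_t for p, t in zip(picture_style, article_style)]
```

If the user took only the picture test, calling `combine_style_vectors(picture_style, article_style, w_p=1, w_t=0)` reproduces the 1/0 weighting mentioned in the text.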
According to a specific embodiment of the present invention, the user detection mode determining module further prompts the user to upload a pending document, and the pending document feature value generation module generates the pending document feature values of that document. The pending document feature values include but are not limited to: Chinese character count, foreign-language word count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage. According to a specific embodiment of the present invention, these values are defined as follows.
The Chinese character count is the number of Chinese characters, excluding punctuation marks, in each pending document; each Chinese character is counted as one character. The foreign-language word count is the number of foreign-language words, excluding punctuation marks, in each pending document; each foreign-language word is counted as one character.
The total word count is the total number of words obtained after segmenting each pending document. Chinese text can be segmented with the system's built-in segmentation lexicon, and foreign-language text can be split directly on the spaces between words, following foreign-language writing conventions.
The content word count is the number of content words in each pending document, obtained by comparing the segmentation result with the parts of speech in the segmentation lexicon. It can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is obtained in the same way for function words and can likewise be divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count.
The paragraph count is the number of paragraphs in each pending document. The paragraph length distribution is the word count and sentence count of each paragraph in each pending document. The sentence count is the number of sentences in each pending document. The sentence length distribution is the word count of each sentence in each pending document.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each pending document with a synonym and near-synonym lexicon, grouping segmented words of the same or similar meaning into sets, and counting the words in each set. This reflects the author's habits in using synonyms and near-synonyms: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words it contains, the less the author tends to do so.
Function word usage is the set of statistics on how function words are used in each pending document, including but not limited to the usage ranking of function words, the number of words between each pair of different function words, and the number of words between each pair of identical function words. For example, the usage of the three structural auxiliary words "的", "地", and "得" can also be counted, which reflects whether the author of the pending document distinguishes between them.
Punctuation usage is the set of statistics on how punctuation marks are used in each pending document, including but not limited to the usage ranking of punctuation marks, the number of words between each pair of different punctuation marks, and the number of words between each pair of identical punctuation marks.
Part-of-speech usage is the set of statistics on the segmented words of each part of speech in each pending document, obtained by comparing the segmentation result with the parts of speech in the segmentation lexicon. For example, it gives the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words, and the ratio of each of these numbers to the total word count of the pending document.
According to a specific embodiment of the present invention, the pending document feature vector generation module generates the pending document feature vector from the pending document feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the feature vector of the pending document and the content and order of its entries. These must be consistent with the dimension of the test picture reference feature vector and the test article reference feature vector and with the meaning and order of their feature values. When the dimension of the feature vector of the pending document is n, it can be expressed as TDCVE_USER=[TDC_1, ..., TDC_m, ..., TDC_n], where TDC_1 is the first value, TDC_m is the m-th value, and TDC_n is the n-th value in the feature vector of the pending document.
Preferably, the feature vector of the pending document includes, in order: the ratio of the Chinese character count to the total word count; the ratio of the foreign-language word count to the total word count; the ratio of the content word count to the total word count; the ratio of the function word count to the total word count; the ratio of the total word count to the paragraph count; the word count of the longest paragraph; the ratio of the synonym and near-synonym expansion count to the total word count; the ratio of the punctuation mark count to the total word count; and the ratios to the total word count of the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words.
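As an illustration of the part-of-speech entries of this vector, the sketch below computes the twelve part-of-speech ratios from an already segmented and tagged document. It assumes the input is a list of (word, part-of-speech) pairs produced by some segmenter; the tag names in POS_ORDER and the function name are hypothetical and not taken from the patent.

```python
from collections import Counter

# Assumed tag names, in the order the preferred feature vector lists them.
POS_ORDER = [
    "noun", "verb", "adjective", "numeral", "quantifier", "pronoun",
    "adverb", "preposition", "conjunction", "auxiliary",
    "interjection", "onomatopoeia",
]


def pos_ratio_features(tagged_words):
    """Return the ratio of each part of speech to the total word count."""
    total = len(tagged_words)
    if total == 0:
        # An empty document has no meaningful ratios; return zeros.
        return [0.0] * len(POS_ORDER)
    counts = Counter(pos for _word, pos in tagged_words)
    return [counts[pos] / total for pos in POS_ORDER]
```

Keeping the tag order in a single constant is one way to guarantee that the pending document vector and the reference vectors share the same dimension and entry order, as the text requires.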
The user writing style similarity calculation module calculates the current user's writing style similarity, which can be computed by the following formula:
The user writing style similarity judgment module compares the current user's writing style similarity Sim_T(USER) with the self-audit threshold preset by the system. When Sim_T(USER) is higher than the self-audit threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when Sim_T(USER) is lower than the self-audit threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
The self-audit threshold is preset by the system. If the self-audit threshold is set too high, a pending document that is inconsistent with the user's writing style is easily misjudged; if it is set too low, a pending document that is consistent with the user's writing style is easily misjudged. In general, the system selects and verifies the self-audit threshold in advance through experiments and can adjust it at any time according to operating conditions.
According to a specific embodiment of the present invention, a first self-audit threshold and a second self-audit threshold can be set, with the first self-audit threshold higher than the second. When the user writing style similarity Sim_T(USER) is higher than the first self-audit threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style. When Sim_T(USER) is lower than the second self-audit threshold, the pending document is considered consistent with the user's writing style. When Sim_T(USER) is greater than or equal to the second self-audit threshold and less than or equal to the first self-audit threshold, the user's writing style is verified further.
The first self-audit threshold and the second self-audit threshold are preset by the system. If the first self-audit threshold is set too high, a pending document that is inconsistent with the user's writing style is easily misjudged; if the second self-audit threshold is set too low, a pending document that is consistent with the user's writing style is easily misjudged; if the interval between the two thresholds is too large, the user's writing style has to be verified again too often. In general, the system selects and verifies both thresholds in advance through experiments and can adjust them at any time according to operating conditions.
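The two-threshold decision can be sketched as a small function. The function name and the returned labels are assumptions made for illustration; the patent does not give concrete threshold values, so any numbers passed in are placeholders to be tuned experimentally.

```python
def classify_similarity(sim, first_threshold, second_threshold):
    """Apply the two-threshold rule to the similarity value Sim_T(USER)."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be higher than the second")
    if sim > first_threshold:
        return "inconsistent"
    if sim < second_threshold:
        return "consistent"
    # Between the thresholds (inclusive): the writing style is verified further.
    return "verify_further"
```

For example, with placeholder thresholds of 0.8 and 0.4, a value of 0.6 falls in the middle band and triggers the further verification step.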
According to a specific embodiment of the present invention, further verification of the user's writing style is performed by the user writing style structural auxiliary word judgment module. This module examines how the three structural auxiliary words "的", "地", and "得" are used in the pending document and in the user's test picture description text and/or test article description text, which reflects how well the author of the pending document and the current user each distinguish between the three. For the pending document, the module counts the number of times "的", "地", and "得" are used in the full text, denoted T1, T2, and T3 respectively. It then counts the number of times the segmented word following "的" is a noun, denoted D1; the number of times the segmented word following "地" is a verb, denoted D2; and the number of times the segmented word following "得" is an adjective, denoted D3. It calculates the ratio D1/T1 of the number of times "的" is followed by a noun to the total number of uses of "的", the ratio D2/T2 of the number of times "地" is followed by a verb to the total number of uses of "地", and the ratio D3/T3 of the number of times "得" is followed by an adjective to the total number of uses of "得", and calculates the distinguishing coefficient DC_TD for "的", "地", and "得". The value of DC_TD is greater than or equal to 0 and less than or equal to 3.
For the user's test picture description text and/or test article description text, the usage of the three structural auxiliary words "的", "地", and "得" is determined in the same way. The module counts the number of times "的", "地", and "得" are used in the full description text (if the user was tested on several pictures and/or several articles, all description texts are merged and treated as the full text), denoted T1’, T2’, and T3’ respectively. It then counts, in the full description text, the number of times the segmented word following "的" is a noun, denoted D1’; the number of times the segmented word following "地" is a verb, denoted D2’; and the number of times the segmented word following "得" is an adjective, denoted D3’. It calculates the ratios D1’/T1’, D2’/T2’, and D3’/T3’ of these counts to the total number of uses of "的", "地", and "得" respectively, and calculates the distinguishing coefficient DC_TPT for "的", "地", and "得". The value of DC_TPT is greater than or equal to 0 and less than or equal to 3.
The user writing style structural auxiliary word judgment module calculates the drift rate DC_SC between the distinguishing coefficient DC_TD and the distinguishing coefficient DC_TPT by normalizing the absolute value of the difference between the two.
When the value of DC_SC is less than or equal to the judgment threshold of the drift rate, the user writing style structural auxiliary word judgment module judges that the author of the pending document and the user who wrote the test picture description text and/or test article description text use the three structural auxiliary words "的", "地", and "得" in a consistent style. When the value of DC_SC is greater than the judgment threshold, the module judges that their styles in using the three structural auxiliary words are inconsistent. The judgment threshold of the drift rate DC_SC is set in advance by the system and can be adjusted at any time according to actual needs. Experimental data from early operation of the system show that a threshold of 10% works well: when the value of DC_SC is less than or equal to 10%, the two authors can be considered consistent in their use of "的", "地", and "得"; when the value of DC_SC is greater than 10%, they can be considered inconsistent.
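The structural auxiliary word check can be sketched as follows. Several points are assumptions, because the text does not state them explicitly: the three structural auxiliary words are taken to be "的", "地", and "得"; the distinguishing coefficient is taken to be the sum of the three ratios D/T, which matches the stated range of 0 to 3; and the normalization is taken to be division by that maximum of 3. The input is assumed to be a list of (word, part-of-speech) pairs with hypothetical tag names.

```python
# Assumed mapping from each structural auxiliary word to the part of speech
# expected for the word that follows it.
EXPECTED_NEXT_POS = {"的": "noun", "地": "verb", "得": "adjective"}


def particle_counts(tagged_words):
    """Return [(D1, T1), (D2, T2), (D3, T3)] for the three auxiliary words."""
    totals = {p: 0 for p in EXPECTED_NEXT_POS}
    matches = {p: 0 for p in EXPECTED_NEXT_POS}
    for i, (word, _pos) in enumerate(tagged_words):
        if word in EXPECTED_NEXT_POS:
            totals[word] += 1
            has_next = i + 1 < len(tagged_words)
            if has_next and tagged_words[i + 1][1] == EXPECTED_NEXT_POS[word]:
                matches[word] += 1
    return [(matches[p], totals[p]) for p in EXPECTED_NEXT_POS]


def distinguishing_coefficient(counts):
    """Assumed form: sum of D/T ratios; an unused word contributes 0."""
    return sum(d / t for d, t in counts if t > 0)


def drift_rate(dc_td, dc_tpt, max_value=3.0):
    """Assumed normalization: absolute difference divided by the maximum, 3."""
    return abs(dc_td - dc_tpt) / max_value


def particles_consistent(dc_td, dc_tpt, threshold=0.10):
    """Compare DC_SC with the judgment threshold (10% in the text)."""
    return drift_rate(dc_td, dc_tpt) <= threshold
```

In use, `particle_counts` would be run once on the pending document and once on the merged test description texts, giving DC_TD and DC_TPT respectively.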
When the user writing style similarity Sim_T(USER) is greater than or equal to the second self-audit threshold and less than or equal to the first self-audit threshold, the user writing style judgment module uses the drift rate DC_SC to further judge whether the pending document submitted by the current user is consistent with the user's writing style. When DC_SC is greater than its judgment threshold, the pending document is considered inconsistent with the user's writing style; when DC_SC is less than or equal to its judgment threshold, the pending document is considered consistent with the user's writing style.
According to a specific embodiment of the present invention, the user access mode detection module prompts the user to upload a document to be identified.
When the user detection mode determining module determines that the current user's detection mode is the ordinary plagiarism identification mode, the document-to-be-identified word segmentation module segments the document to be identified and obtains a segmentation result. The document to be identified must be segmented with the same processing flow that is used to segment the material in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified part-of-speech classification module further obtains the part of speech corresponding to each item in the segmentation result. The part-of-speech classification scheme is the same as the one used for the material contained in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word feature value generation module generates the segmented-word feature values of the document to be identified. It counts how many times each segmented word occurs in the document to be identified and obtains for each segmented word the feature value WCV_TBI=[W_ID, W_N], where W_ID is the unique number of the segmented word in the segmentation lexicon and W_N is the total number of times the segmented word occurs in the document to be identified. Preferably, the part of speech of each segmented word is also taken into account, giving the segmented-word part-of-speech feature value WCCV_TBI=[W_ID, W_N, W_CHAR], where W_ID is the unique number of the segmented word in the segmentation lexicon, W_N is the total number of times the segmented word occurs in the document to be identified, and W_CHAR is the part of speech of the segmented word.
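A minimal sketch of the counting step, assuming the document has already been segmented and each word replaced by its lexicon number W_ID (the function name and the sorted output order are illustrative choices, not part of the patent):

```python
from collections import Counter


def segmented_word_feature_values(word_ids):
    """Return one [W_ID, W_N] pair per distinct word, sorted by W_ID."""
    counts = Counter(word_ids)
    return [[w_id, w_n] for w_id, w_n in sorted(counts.items())]
```

Extending each pair to [W_ID, W_N, W_CHAR] would only require a lookup of the part of speech for each W_ID in the segmentation lexicon.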
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word tightening coefficient generation module generates the segmented-word tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients of each segmented word can be expressed as WGC_TBI=[G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of segmented words between the first and second occurrences of the segmented word in the document to be identified, G_W_ID_2 is the number of segmented words between its second and third occurrences, and G_W_ID_(W_N-1) is the number of segmented words between its (W_N-1)-th and W_N-th occurrences. G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightening coefficients of that segmented word. According to a specific embodiment of the present invention, the tightening coefficients of each segmented word can further be expressed in vector form as the tightening coefficient feature vector WGCVE_TBI=[W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the segmented word in the segmentation lexicon, W_N is the total number of times the segmented word occurs in the document to be identified, W_CHAR is the part of speech of the segmented word, and G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightening coefficients defined above. The tightening coefficients reveal the overall distribution of a specific segmented word in the document to be identified. When the document to be identified is very long or its viewpoints are scattered, this avoids omitting key segmented-word feature values, as could happen if feature vectors were screened only by the total count W_N or by (W_N / segmented-word free vector dimension WFV). Preferably, a specific part of a document to be identified can also be extracted for comparison according to the tightening coefficients.
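The tightening coefficients can be illustrated with a short sketch. It assumes that "the number of segmented words between two occurrences" means the words strictly between them, so two adjacent occurrences give a coefficient of 0; the function name and the use of lexicon numbers as input are also assumptions.

```python
from collections import defaultdict


def tightening_coefficients(word_ids):
    """Map each W_ID to [G_W_ID_1, ..., G_W_ID_(W_N-1)]."""
    positions = defaultdict(list)
    for index, w_id in enumerate(word_ids):
        positions[w_id].append(index)
    # Gap = number of words strictly between consecutive occurrences.
    return {
        w_id: [later - earlier - 1 for earlier, later in zip(pos, pos[1:])]
        for w_id, pos in positions.items()
    }
```

A word that occurs W_N times yields W_N-1 coefficients, so a word that occurs only once has an empty list. Small coefficients indicate occurrences clustered in one part of the document, which is the distribution information the text says total counts alone would miss.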
A specific embodiment according to the present invention, document to be identified segment free vector dimension determining module, are used for Participle free vector dimension WFV_TBI is determined according to the word segmentation result of document to be identified.When the length of document to be identified is shorter or When person's word segmentation result therein is less, obtained participle free vector dimension WFV_TBI is less;When the length of document to be identified When word segmentation result longer or therein is more, obtained participle free vector dimension WFV_TBI is more.
When the user detection mode determining module determines that the current user's detection mode is the extended plagiarism identification mode, the document-to-be-identified segmented-word group module segments the document to be identified and obtains a segmented-word group result. Segmented words with the same or similar meaning form one group, and numbering is done per group, so that several segmented words with the same or similar meaning correspond to one segmented-word group number. The document to be identified must be segmented with the same processing flow that is used to segment the material in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word group part-of-speech classification module further obtains the part of speech corresponding to each segmented-word group in the result. The segmented-word group part-of-speech classification scheme is the same as the one used for the material contained in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word group feature value generation module generates the segmented-word group feature values of the document to be identified. It counts how many times each segmented-word group occurs in the document to be identified and obtains for each group the feature value WGCV_TBI=[WG_ID, WG_N], where WG_ID is the unique number of the segmented-word group in the segmentation lexicon and WG_N is the total number of times the group occurs in the document to be identified. Preferably, the part of speech of each segmented-word group is also taken into account, giving the segmented-word group part-of-speech feature value WGCCV_TBI=[WG_ID, WG_N, WG_CHAR], where WG_ID is the unique number of the group in the segmentation lexicon, WG_N is the total number of times the group occurs in the document to be identified, and WG_CHAR is the part of speech of the group.
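Group-level counting differs from word-level counting only in that each word is first mapped to the number of its synonym group. The sketch below assumes a hypothetical dictionary `group_of` from word to WG_ID, standing in for the synonym and near-synonym lexicon; words missing from that dictionary are simply skipped here, which is a simplification.

```python
from collections import Counter


def group_feature_values(words, group_of):
    """Return one [WG_ID, WG_N] pair per group, sorted by WG_ID."""
    counts = Counter(group_of[w] for w in words if w in group_of)
    return [[wg_id, wg_n] for wg_id, wg_n in sorted(counts.items())]
```

Because "高兴", "快乐", and "开心" would share one WG_ID, a passage reworded with synonyms still produces the same group counts, which is the point of the extended mode.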
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word group tightening coefficient generation module generates the segmented-word group tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients of each segmented-word group can be expressed as WGGC_TBI=[G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of segmented words between the first and second occurrences of the group in the document to be identified, G_WG_ID_2 is the number of segmented words between its second and third occurrences, and G_WG_ID_(WG_N-1) is the number of segmented words between its (WG_N-1)-th and WG_N-th occurrences. G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightening coefficients of that group. According to a specific embodiment of the present invention, the tightening coefficients of each segmented-word group can further be expressed in vector form as the group tightening coefficient feature vector WGGCVE_TBI=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the group in the segmentation lexicon, WG_N is the total number of times the group occurs in the document to be identified, WG_CHAR is the part of speech of the group, and G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightening coefficients defined above. The group tightening coefficients reveal the overall distribution of a specific segmented-word group in the document to be identified. When the document to be identified is very long or its viewpoints are scattered, this avoids omitting key feature values, as could happen if feature vectors were screened only by the total count WG_N or by (WG_N / segmented-word group free vector dimension WGFV). Preferably, a specific part of a document to be identified can also be extracted for comparison according to the group tightening coefficients.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group free vector dimension determination module is used to determine the word-segment group free vector dimension WGFV_TBI according to the word segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting WGFV_TBI is smaller; when the document to be identified is long or yields many segmentation results, the resulting WGFV_TBI is larger.
When the user detection mode determination module judges that the current user detection mode is the multilingual plagiarism identification mode, the to-be-identified document Chinese-foreign word-segment grouping module segments the document to be identified to obtain Chinese-foreign word-segment group results. Chinese and foreign-language word segments with the same or similar meaning form one group and are numbered by group, so that multiple Chinese and foreign-language word segments with the same or similar meaning correspond to a single Chinese-foreign word-segment group number. When the document to be identified is segmented, the same processing flow must be used as was used to segment the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group part-of-speech classification module is used to further obtain the part of speech corresponding to each word-segment group result. The part-of-speech classification scheme is consistent with the scheme used to classify the word-segment groups of the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group feature value generation module is used to generate the Chinese-foreign word-segment group feature values of the document to be identified. The number of occurrences of each Chinese-foreign word-segment group in the corresponding document to be identified is counted to obtain the feature value corresponding to each group, WFGCV_TBI=[WFG_ID, WFG_N], where WFG_ID represents the unique number of the Chinese-foreign word-segment group in the word-segmentation lexicon and WFG_N represents the total number of occurrences of that group in the document to be identified. Preferably, the part of speech of each Chinese-foreign word-segment group is also taken into account, giving the Chinese-foreign word-segment group part-of-speech feature value WFGCCV_TBI=[WFG_ID, WFG_N, WFG_CHAR], where WFG_ID represents the unique number of the group in the word-segmentation lexicon, WFG_N represents the total number of occurrences of that group in the document to be identified, and WFG_CHAR represents the part of speech of the group.
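The counting step behind WFGCCV_TBI=[WFG_ID, WFG_N, WFG_CHAR] can be sketched as follows. The lexicon contents, the group numbers 101 and 102, and the function name are hypothetical and chosen only for illustration; the sketch simply assumes that every word segment with the same or similar meaning maps to one group number and one part of speech.

```python
from collections import Counter

# Hypothetical lexicon: word segment -> (group number, part of speech).
# Chinese and foreign-language segments with the same meaning share a number.
GROUP_LEXICON = {
    "计算机": (101, "n"),
    "电脑": (101, "n"),
    "computer": (101, "n"),
    "网络": (102, "n"),
    "network": (102, "n"),
}


def group_feature_values(tokens, lexicon):
    """Return one [WFG_ID, WFG_N, WFG_CHAR] entry per group found in tokens."""
    counts = Counter(lexicon[t][0] for t in tokens if t in lexicon)
    pos_of = {gid: tag for gid, tag in lexicon.values()}
    return [[gid, n, pos_of[gid]] for gid, n in sorted(counts.items())]


if __name__ == "__main__":
    tokens = ["computer", "网络", "电脑", "计算机", "unknown"]
    print(group_feature_values(tokens, GROUP_LEXICON))
```

Segments that are absent from the lexicon are skipped here; how the system treats them is not stated in the description.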
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group tightness coefficient generation module is used to generate the Chinese-foreign word-segment group tightness coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightness coefficients corresponding to each Chinese-foreign word-segment group can be expressed as WFGGC_TBI=[G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 represents the number of word segments between the first and second occurrences of the Chinese-foreign word-segment group in the document to be identified, G_WFG_ID_2 represents the number of word segments between its second and third occurrences, and G_WFG_ID_(WFG_N-1) represents the number of word segments between its (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the Chinese-foreign word-segment group tightness coefficients corresponding to that group. According to a specific embodiment of the present invention, the tightness coefficients corresponding to each Chinese-foreign word-segment group can further be expressed in vector form as the Chinese-foreign word-segment group tightness coefficient feature vector WFGGCVE_TBI=[WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID represents the unique number of the Chinese-foreign word-segment group in the word-segmentation lexicon, WFG_N represents the total number of occurrences of that group in the document to be identified, WFG_CHAR represents the part of speech of the group, and G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are as defined above and serve as the part-of-speech feature vector tightness coefficients corresponding to the group. From the Chinese-foreign word-segment group feature vector tightness coefficients, the overall distribution of a specific Chinese-foreign word-segment group in the corresponding document to be identified can be known.
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group free vector dimension determination module is used to determine the Chinese-foreign word-segment group free vector dimension WFGFV_TBI according to the word segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting WFGFV_TBI is smaller; when the document to be identified is long or yields many segmentation results, the resulting WFGFV_TBI is larger.
According to a specific embodiment of the present invention, the to-be-identified document word-segment reduced vector dimension generation module is used to reduce the word-segment free vector dimension WFV_TBI of the document to be identified and to generate the word-segment reduced vector dimension RWV_TBI of the document to be identified. The word-segment reduced vector dimension RWV_TBI is specified by the system. Preferably, the system specifies RWV_TBI as 500. Preferably, the system specifies RWV_TBI as 800. Preferably, the system specifies RWV_TBI as 1000.
According to a specific embodiment of the present invention, the to-be-identified document word-segment reduced vector dimension generation module uses an equal-interval extraction method to reduce the word-segment free vector dimension WFV_TBI of the document to be identified. The reduction process is as follows. Judge whether WFV_TBI is greater than the word-segment reduced vector dimension RWV_TBI of the document to be identified. If so, divide WFV_TBI by the system-specified RWV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI of the document to be identified. Then, among the feature values corresponding to WFV_TBI, extract one feature value after every REDU_TBI-1 feature values. After all feature values have been traversed, judge whether the number of extracted feature values equals RWV_TBI. When the number of extracted feature values equals RWV_TBI, the reduction of WFV_TBI is complete. When the number of extracted feature values is less than RWV_TBI, calculate the difference between RWV_TBI and the number of extracted feature values, and randomly extract that many feature values from the feature values not yet extracted, which completes the reduction of the word-segment free vector dimension WFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the to-be-identified document word-segment reduced vector dimension generation module uses a part-of-speech screening method to reduce the word-segment free vector dimension WFV_TBI of the document to be identified. The reduction process is as follows. The feature values are classified according to the part of speech of the corresponding word segment; according to a specific embodiment of the present invention, the feature values are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values, and V-class function word feature values. It is generally considered that feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class notional word feature values), AMOUNT_A2 (the number of A2-class notional word feature values), AMOUNT_B (the number of B-class notional word feature values), AMOUNT_C (the number of C-class notional word feature values), AMOUNT_D (the number of D-class notional word feature values), and AMOUNT_V (the number of V-class function word feature values). Calculate the value RWV_TBI_S_V of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V); if it is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWV_TBI_S_D of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWV_TBI_S_D is greater than 0, randomly extract RWV_TBI_S_D feature values from the feature values corresponding to AMOUNT_V, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWV_TBI_S_C of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWV_TBI_S_C is greater than 0, randomly extract RWV_TBI_S_C feature values from the feature values corresponding to AMOUNT_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWV_TBI_S_B of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWV_TBI_S_B is greater than 0, randomly extract RWV_TBI_S_B feature values from the feature values corresponding to AMOUNT_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWV_TBI_S_A2 of RWV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWV_TBI_S_A2 is greater than 0, randomly extract RWV_TBI_S_A2 feature values from the feature values corresponding to AMOUNT_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWV_TBI_S_A1 of RWV_TBI-AMOUNT_A1. If RWV_TBI_S_A1 is greater than 0, randomly extract RWV_TBI_S_A1 feature values from the feature values corresponding to AMOUNT_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract from the feature values corresponding to AMOUNT_A1 a number of feature values equal to RWV_TBI, which completes this reduction.
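The cascade of comparisons above can be restated as a loop that fills the reduced vector in the priority order A1, A2, B, C, D, V. The sketch below is an illustrative restatement that appears to select the same number of feature values from each class as the nested comparisons do; the function name, the dictionary layout, and the use of None to signal the "exit this reduction" case are assumptions.

```python
import random

CLASS_ORDER = ["A1", "A2", "B", "C", "D", "V"]  # highest priority first


def reduce_by_part_of_speech(features_by_class, rwv, rng=None):
    """Keep rwv feature values, preferring higher-priority classes.

    features_by_class: dict mapping a class label to its feature values.
    Returns None when fewer than rwv feature values exist in total
    (the case in which the description exits the reduction).
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    total = sum(len(features_by_class.get(c, [])) for c in CLASS_ORDER)
    if rwv > total:
        return None
    kept = []
    for cls in CLASS_ORDER:
        group = features_by_class.get(cls, [])
        room = rwv - len(kept)
        if room == 0:
            break
        if len(group) <= room:
            kept.extend(group)                    # the whole class fits
        else:
            kept.extend(rng.sample(group, room))  # random subset fills the rest
            break
    return kept


if __name__ == "__main__":
    classes = {"A1": [1, 2], "A2": [3], "B": [4, 5], "C": [], "D": [6], "V": [7, 8, 9]}
    print(reduce_by_part_of_speech(classes, 5))
```

With a target of 7 in this example, all of A1 through D are kept and one V-class value is drawn at random, which corresponds to the RWV_TBI_S_D greater than 0 branch.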
The case in which the value RWV_TBI_S_V of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the document to be identified is short or carries little information, and is therefore not suitable for comparison using feature values.
When the word-segment free vector dimension WFV_TBI of the document to be identified is less than its word-segment reduced vector dimension RWV_TBI, the dimension itself is small; the values in the remaining dimensions are then treated as 0, and the document can be flagged directly in the system and processed separately.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group reduced vector dimension generation module is used to reduce the word-segment group free vector dimension WGFV_TBI of the document to be identified and to generate the word-segment group reduced vector dimension RWGV_TBI of the document to be identified. The word-segment group reduced vector dimension RWGV_TBI is specified by the system. Preferably, the system specifies RWGV_TBI as 500. Preferably, the system specifies RWGV_TBI as 800. Preferably, the system specifies RWGV_TBI as 1000.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group reduced vector dimension generation module uses the equal-interval extraction method to reduce the word-segment group free vector dimension WGFV_TBI of the document to be identified. The reduction process is as follows. Judge whether WGFV_TBI is greater than the word-segment group reduced vector dimension RWGV_TBI of the document to be identified. If so, divide WGFV_TBI by the system-specified RWGV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI. Then, among the feature values corresponding to WGFV_TBI, extract one feature value after every REDU_TBI-1 feature values. After all feature values have been traversed, judge whether the number of extracted feature values equals RWGV_TBI. When the number of extracted feature values equals RWGV_TBI, the reduction of WGFV_TBI is complete. When the number of extracted feature values is less than RWGV_TBI, calculate the difference between RWGV_TBI and the number of extracted feature values, and randomly extract that many feature values from the feature values not yet extracted, which completes the reduction of the word-segment group free vector dimension WGFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group reduced vector dimension generation module uses the part-of-speech screening method to reduce the word-segment group free vector dimension WGFV_TBI of the document to be identified. The reduction process is as follows. The feature values are classified according to the part of speech of the corresponding word-segment group; according to a specific embodiment of the present invention, the feature values are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values, and V-class function word feature values. It is generally considered that feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class notional word feature values), AMOUNT_A2 (the number of A2-class notional word feature values), AMOUNT_B (the number of B-class notional word feature values), AMOUNT_C (the number of C-class notional word feature values), AMOUNT_D (the number of D-class notional word feature values), and AMOUNT_V (the number of V-class function word feature values). Calculate the value RWGV_TBI_S_V of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V); if it is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWGV_TBI_S_D of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWGV_TBI_S_D is greater than 0, randomly extract RWGV_TBI_S_D feature values from the feature values corresponding to AMOUNT_V, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWGV_TBI_S_C of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWGV_TBI_S_C is greater than 0, randomly extract RWGV_TBI_S_C feature values from the feature values corresponding to AMOUNT_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWGV_TBI_S_B of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWGV_TBI_S_B is greater than 0, randomly extract RWGV_TBI_S_B feature values from the feature values corresponding to AMOUNT_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWGV_TBI_S_A2 of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWGV_TBI_S_A2 is greater than 0, randomly extract RWGV_TBI_S_A2 feature values from the feature values corresponding to AMOUNT_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWGV_TBI_S_A1 of RWGV_TBI-AMOUNT_A1. If RWGV_TBI_S_A1 is greater than 0, randomly extract RWGV_TBI_S_A1 feature values from the feature values corresponding to AMOUNT_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract from the feature values corresponding to AMOUNT_A1 a number of feature values equal to RWGV_TBI, which completes this reduction.
The case in which the value RWGV_TBI_S_V of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the document to be identified is short or carries little information, and is therefore not suitable for comparison using feature values.
When the word-segment group free vector dimension WGFV_TBI of the document to be identified is less than its word-segment group reduced vector dimension RWGV_TBI, the dimension itself is small; the values in the remaining dimensions are then treated as 0, and the document can be flagged directly in the system and processed separately.
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group reduced vector dimension generation module is used to reduce the Chinese-foreign word-segment group free vector dimension WFGFV_TBI of the document to be identified and to generate the Chinese-foreign word-segment group reduced vector dimension RWFGV_TBI of the document to be identified. The Chinese-foreign word-segment group reduced vector dimension RWFGV_TBI is specified by the system. Preferably, the system specifies RWFGV_TBI as 500. Preferably, the system specifies RWFGV_TBI as 800. Preferably, the system specifies RWFGV_TBI as 1000.
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group reduced vector dimension generation module uses the equal-interval extraction method to reduce the Chinese-foreign word-segment group free vector dimension WFGFV_TBI of the document to be identified. The reduction process is as follows. Judge whether WFGFV_TBI is greater than the Chinese-foreign word-segment group reduced vector dimension RWFGV_TBI of the document to be identified. If so, divide WFGFV_TBI by the system-specified RWFGV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI. Then, among the feature values corresponding to WFGFV_TBI, extract one feature value after every REDU_TBI-1 feature values. After all feature values have been traversed, judge whether the number of extracted feature values equals RWFGV_TBI. When the number of extracted feature values equals RWFGV_TBI, the reduction of WFGFV_TBI is complete. When the number of extracted feature values is less than RWFGV_TBI, calculate the difference between RWFGV_TBI and the number of extracted feature values, and randomly extract that many feature values from the feature values not yet extracted, which completes the reduction of the Chinese-foreign word-segment group free vector dimension WFGFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group reduced vector dimension generation module uses the part-of-speech screening method to reduce the Chinese-foreign word-segment group free vector dimension WFGFV_TBI of the document to be identified. The reduction process is as follows. The feature values are classified according to the part of speech of the corresponding Chinese-foreign word-segment group; according to a specific embodiment of the present invention, the feature values are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values, and V-class function word feature values. It is generally considered that feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class notional word feature values), AMOUNT_A2 (the number of A2-class notional word feature values), AMOUNT_B (the number of B-class notional word feature values), AMOUNT_C (the number of C-class notional word feature values), AMOUNT_D (the number of D-class notional word feature values), and AMOUNT_V (the number of V-class function word feature values). Calculate the value RWFGV_TBI_S_V of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V); if it is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_D of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWFGV_TBI_S_D is greater than 0, randomly extract RWFGV_TBI_S_D feature values from the feature values corresponding to AMOUNT_V, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_C of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWFGV_TBI_S_C is greater than 0, randomly extract RWFGV_TBI_S_C feature values from the feature values corresponding to AMOUNT_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_B of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWFGV_TBI_S_B is greater than 0, randomly extract RWFGV_TBI_S_B feature values from the feature values corresponding to AMOUNT_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_A2 of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWFGV_TBI_S_A2 is greater than 0, randomly extract RWFGV_TBI_S_A2 feature values from the feature values corresponding to AMOUNT_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_A1 of RWFGV_TBI-AMOUNT_A1. If RWFGV_TBI_S_A1 is greater than 0, randomly extract RWFGV_TBI_S_A1 feature values from the feature values corresponding to AMOUNT_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract from the feature values corresponding to AMOUNT_A1 a number of feature values equal to RWFGV_TBI, which completes this reduction.
The case in which the value RWFGV_TBI_S_V of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the document to be identified is short or carries little information, and is therefore not suitable for comparison using feature values.
When the Chinese-foreign word-segment group free vector dimension WFGFV_TBI of the document to be identified is less than its Chinese-foreign word-segment group reduced vector dimension RWFGV_TBI, the dimension itself is small; the values in the remaining dimensions are then treated as 0, and the document can be flagged directly in the system and processed separately.
Preferably, to facilitate similarity comparison, the material word-segment reduced vector dimension RWV selected in the system should equal the word-segment reduced vector dimension RWV_TBI of the document to be identified; the material word-segment group reduced vector dimension RWGV should equal the word-segment group reduced vector dimension RWGV_TBI of the document to be identified; and the material Chinese-foreign word-segment group reduced vector dimension RWFGV should equal the Chinese-foreign word-segment group reduced vector dimension RWFGV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the participle feature vector generation module for the document to be identified extracts from each document to be identified, according to the participle simplified vector dimension RWV_TBI, the characteristic values corresponding to RWV_TBI and generates the participle feature vector WVE_RWV_TBI of the document to be identified, where
WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
Here W_ID_i is the unique number of the participle in the participle library, and W_N_i is the total number of times the participle occurs in the document to be identified; this count serves as the characteristic value of the participle.
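The construction of WVE_RWV_TBI described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name build_feature_vector and the argument selected_ids (the RWV_TBI participle IDs retained after simplification) are hypothetical, and the document is assumed to be already segmented into participle IDs.

```python
from collections import Counter

def build_feature_vector(token_ids, selected_ids):
    """Return the flat vector [W_ID_1, W_N_1, ..., W_ID_k, W_N_k].

    token_ids:    participle IDs of the segmented document, in reading order.
    selected_ids: the participle IDs kept after simplification (assumed given).
    """
    counts = Counter(token_ids)          # W_N for every participle in the document
    vector = []
    for w_id in selected_ids:
        vector.extend([w_id, counts[w_id]])  # ID followed by its occurrence count
    return vector

vec = build_feature_vector([5, 3, 5, 8, 5, 3], selected_ids=[5, 3])
```

With the input shown, participle 5 occurs three times and participle 3 twice, so the vector pairs each ID with its count.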
According to a specific embodiment of the present invention, when the user detection mode determination module determines that the current user detection mode is the ordinary plagiarism identification mode, similarity comparison proceeds as follows. The participle feature vector generation module for the document to be identified generates the participle feature vector WVE_RWV_TBI of the document to be identified, WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI], whose dimension is RWV_TBI. The participle feature vector generation module generates the participle feature vector WVE_RWV of a material in the comparison database, WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]. The dimension RWV_TBI of the participle feature vector of the document to be identified equals the dimension RWV of the participle feature vector of the material.
It should be noted that, although both WVE_RWV_TBI and WVE_RWV use W_ID_i to denote the unique number of a participle in the participle library and W_N_i to denote the total number of times that participle occurs in the respective document, which serves as its characteristic value, the W_ID_i in WVE_RWV_TBI is very likely to differ from the W_ID_i in WVE_RWV. Therefore, before similarity comparison, the dimensions of the two participle feature vectors must be made consistent.
According to a specific embodiment of the present invention, the feature vector adjustment module for the document to be identified arranges the W_ID_i values corresponding to all characteristic values in the participle feature vector WVE_RWV_TBI in ascending or descending order of their numbers in the participle library and inserts the missing W_ID_i values; the characteristic value corresponding to each inserted participle number W_ID_i is 0. Assuming that the total number of participle numbers in the participle library is W, the number of participle numbers to be inserted is W - RWV_TBI, which yields the extended participle feature vector of the document to be identified, WVE_RWV_TBI_EXT = [W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W].
According to a specific embodiment of the present invention, the material feature vector adjustment module arranges the W_ID_i values corresponding to all characteristic values in the participle feature vector WVE_RWV in ascending or descending order of their numbers in the participle library and inserts the missing W_ID_i values; the characteristic value corresponding to each inserted participle number W_ID_i is 0. Assuming that the total number of participle numbers in the participle library is W, the number of participle numbers to be inserted is W - RWV, which yields the extended participle feature vector WVE_RWV_EXT = [W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W].
In this way, the participle feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the participle library, so that the characteristic values at corresponding positions of the two vectors refer to the same participle.
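The extension step can be illustrated with a short sketch. It assumes, purely for illustration, that the participle library numbers its entries 1 to W and that ascending order is chosen; the function name extend_to_library is hypothetical.

```python
def extend_to_library(flat_vector, library_size):
    """Extend [W_ID, W_N, ...] to all IDs 1..W in ascending order.

    IDs missing from flat_vector are inserted with characteristic value 0.
    Assumption: participle IDs are the integers 1..library_size.
    """
    counts = dict(zip(flat_vector[0::2], flat_vector[1::2]))  # ID -> count
    extended = []
    for w_id in range(1, library_size + 1):
        extended.extend([w_id, counts.get(w_id, 0)])
    return extended

doc_ext = extend_to_library([3, 2, 1, 5], library_size=4)
mat_ext = extend_to_library([4, 7, 3, 1], library_size=4)
```

After extension both vectors list the same IDs in the same order, so position k of one vector can be compared directly with position k of the other.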
The ordinary plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database, using the following formula:
According to a specific embodiment of the present invention, when the user detection mode determination module determines that the current user detection mode is the extended plagiarism identification mode, similarity comparison proceeds as follows. The participle group feature vector generation module for the document to be identified generates the participle group feature vector WVE_RWGV_TBI of the document to be identified, WVE_RWGV_TBI = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV_TBI, WG_N_RWGV_TBI], whose dimension is RWGV_TBI. The participle group feature vector generation module generates the participle group feature vector WVE_RWGV of a material in the comparison database, WVE_RWGV = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]. Here WG_ID_i is the unique number of the participle group in the participle library, and WG_N_i is the total number of times the participle group occurs in the document to be identified or in the material, respectively; this count serves as the characteristic value of the participle group. The dimension RWGV_TBI of the participle group feature vector of the document to be identified equals the dimension RWGV of the participle group feature vector of the material.
Similarly to the processing in the ordinary plagiarism identification mode, according to a specific embodiment of the present invention, in the extended plagiarism identification mode the feature vector adjustment module for the document to be identified adjusts the vector to obtain the extended participle group feature vector of the document to be identified, WVE_RWGV_TBI_EXT = [WG_ID_TBI_EXT_1, WG_N_TBI_EXT_1, ..., WG_ID_TBI_EXT_i, WG_N_TBI_EXT_i, ..., WG_ID_TBI_EXT_RWGV_TBI, WG_N_TBI_EXT_RWGV_TBI, ..., WG_ID_W, WG_N_W]; the material feature vector adjustment module adjusts the vector to obtain the extended participle group feature vector WVE_RWGV_EXT = [WG_ID_EXT_1, WG_N_EXT_1, ..., WG_ID_EXT_i, WG_N_EXT_i, ..., WG_ID_EXT_RWGV, WG_N_EXT_RWGV, ..., WG_ID_W, WG_N_W].
In this way, the participle group feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the participle library, so that the characteristic values at corresponding positions of the two vectors refer to the same participle group.
The extended plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database, using the following formula:
According to a specific embodiment of the present invention, when the user detection mode determination module determines that the current user detection mode is the multilingual plagiarism identification mode, similarity comparison proceeds as follows. The Chinese-foreign-language participle group feature vector generation module for the document to be identified generates the Chinese-foreign-language participle group feature vector WVE_RWFGV_TBI of the document to be identified, WVE_RWFGV_TBI = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV_TBI, WFG_N_RWFGV_TBI], whose dimension is RWFGV_TBI. The participle group feature vector generation module generates the Chinese-foreign-language participle group feature vector WVE_RWFGV of a material in the comparison database, WVE_RWFGV = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]. Here WFG_ID_i is the unique number of the Chinese-foreign-language participle group in the participle library, and WFG_N_i is the total number of times the Chinese-foreign-language participle group occurs in the document to be identified or in the material, respectively; this count serves as the characteristic value of the Chinese-foreign-language participle group. The dimension RWFGV_TBI of the Chinese-foreign-language participle group feature vector of the document to be identified equals the dimension RWFGV of the Chinese-foreign-language participle group feature vector of the material.
Similarly to the processing in the ordinary plagiarism identification mode, according to a specific embodiment of the present invention, in the multilingual plagiarism identification mode the feature vector adjustment module for the document to be identified adjusts the vector to obtain the extended Chinese-foreign-language participle group feature vector of the document to be identified, WVE_RWFGV_TBI_EXT = [WFG_ID_TBI_EXT_1, WFG_N_TBI_EXT_1, ..., WFG_ID_TBI_EXT_i, WFG_N_TBI_EXT_i, ..., WFG_ID_TBI_EXT_RWFGV_TBI, WFG_N_TBI_EXT_RWFGV_TBI, ..., WFG_ID_W, WFG_N_W]; the material feature vector adjustment module adjusts the vector to obtain the extended participle group feature vector WVE_RWFGV_EXT = [WFG_ID_EXT_1, WFG_N_EXT_1, ..., WFG_ID_EXT_i, WFG_N_EXT_i, ..., WFG_ID_EXT_RWFGV, WFG_N_EXT_RWFGV, ..., WFG_ID_W, WFG_N_W].
In this way, the Chinese-foreign-language participle group feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the participle library, so that the characteristic values at corresponding positions of the two vectors refer to the same Chinese-foreign-language participle group.
The multilingual plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database, using the following formula:
According to a specific embodiment of the present invention, to avoid an excessively large dimension after extension, all participle IDs in the participle feature vector WVE_RWV_TBI may instead be taken as one set and the participle IDs in WVE_RWV as another set; or all participle group IDs in the participle group feature vector WVE_RWGV_TBI as one set and those in WVE_RWGV as another set; or all Chinese-foreign-language participle group IDs in the Chinese-foreign-language participle group feature vector WVE_RWFGV_TBI as one set and those in WVE_RWFGV as another set. The union of the two sets gives the total ID set. The feature vectors of the document to be identified and of the material in the comparison database are then extended according to the total ID set: the IDs corresponding to all characteristic values are arranged in ascending or descending order of their numbers in the participle library, and each vector is supplemented with the W_ID_i values that are contained in the total participle ID set but not in its own original set, the characteristic value corresponding to each inserted participle number W_ID_i being 0; or with the WG_ID_i values that are contained in the total participle group ID set but not in its own original set, the characteristic value corresponding to each inserted WG_ID_i being 0; or with the WFG_ID_i values that are contained in the total Chinese-foreign-language participle group ID set but not in its own original set, the characteristic value corresponding to each inserted WFG_ID_i being 0.
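The union-based variant can be sketched in a few lines. The function name align_on_union is hypothetical, ascending order is chosen for illustration, and the same code applies unchanged to participle IDs, participle group IDs, or Chinese-foreign-language participle group IDs.

```python
def align_on_union(doc_vector, mat_vector):
    """Align two flat [ID, count, ...] vectors on the union of their IDs.

    Returns (ids, doc_counts, mat_counts); an ID absent from one vector
    gets characteristic value 0 in that vector.
    """
    doc = dict(zip(doc_vector[0::2], doc_vector[1::2]))
    mat = dict(zip(mat_vector[0::2], mat_vector[1::2]))
    ids = sorted(doc.keys() | mat.keys())        # total ID set, ascending
    doc_counts = [doc.get(i, 0) for i in ids]
    mat_counts = [mat.get(i, 0) for i in ids]
    return ids, doc_counts, mat_counts

ids, d, m = align_on_union([7, 2, 3, 1], [3, 4, 9, 1])
```

The aligned vectors have at most RWV_TBI + RWV entries instead of W, which is the saving this embodiment aims for.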
According to the user's access mode, the materials of different sub-libraries of the comparison database are made available for similarity comparison. Comparison is performed by traversal: the feature vectors of all materials within the selected scope are extracted and compared for similarity with the document to be identified, and each calculated similarity value is compared with a predetermined threshold. When the similarity value is higher than the predetermined threshold, the corresponding material is recorded as a doubtful material for later use.
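The traversal and threshold test can be sketched as below. Because the similarity formula itself is not reproduced in this text, the sketch takes the similarity measure as a caller-supplied function; the names select_doubtful and similarity are hypothetical.

```python
def select_doubtful(doc_vector, materials, similarity, threshold):
    """Traverse all materials and keep those whose similarity exceeds threshold.

    materials:  dict mapping a material ID to its aligned feature vector.
    similarity: caller-supplied function (doc_vector, mat_vector) -> float;
                the patent's own formula is not assumed here.
    """
    doubtful = []
    for material_id, mat_vector in materials.items():
        if similarity(doc_vector, mat_vector) > threshold:  # strictly higher
            doubtful.append(material_id)
    return doubtful
```

Any measure defined on aligned vectors can be plugged in, for example a plain dot product when experimenting.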
After the document to be identified has been compared with all materials, all doubtful materials are extracted, and the document to be identified is compared further with the doubtful materials.
According to a preferred embodiment of the present invention, all materials in the proverb and common saying library, the famous quotation library and the poem library may be selected as doubtful materials.
According to a preferred embodiment of the present invention, materials whose participle free vector dimension WFV is less than the participle simplified vector dimension RWV may be selected as doubtful materials.
According to a preferred embodiment of the present invention, materials whose participle group free vector dimension WGFV is less than the participle group simplified vector dimension RWGV may be selected as doubtful materials.
According to a preferred embodiment of the present invention, materials whose Chinese-foreign-language participle group free vector dimension WFGFV is less than the Chinese-foreign-language participle group simplified vector dimension RWFGV may be selected as doubtful materials.
According to a preferred embodiment of the present invention, doubtful materials may be further selected by means of the participle tightening coefficient.
According to a specific embodiment of the present invention, in the ordinary plagiarism identification mode, doubtful materials may be screened according to the participle tightening coefficient of the document to be identified and the participle tightening coefficient of the materials. The tightening coefficient statistics module for the document to be identified extracts high-density participles and their positions according to the participle tightening coefficient feature vector corresponding to each participle in the document to be identified, WGCVE_TBI = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)]. According to the participle part of speech W_CHAR in the participle tightening coefficient feature vector, the module selects the participles whose part of speech is notional word and counts the total number of interval participles over a predetermined number of adjacent participles, where n is the predetermined adjacent number. When the total number of interval participles over the predetermined number of adjacent participles is less than the predetermined tightness threshold TH_G, the ID of the participle and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the extended plagiarism identification mode, doubtful materials may be screened according to the participle group tightening coefficient of the document to be identified and the participle group tightening coefficient of the materials. The tightening coefficient statistics module for the document to be identified extracts high-density participle groups and their positions according to the participle group tightening coefficient feature vector corresponding to each participle group in the document to be identified, WGGCVE_TBI = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)]. According to the participle group part of speech WG_CHAR in the participle group tightening coefficient feature vector, the module selects the participle groups whose part of speech is notional word and counts the total number of interval participles over a predetermined number of adjacent participle groups, where n is the predetermined adjacent number. When the total number of interval participles over the predetermined number of adjacent participle groups is less than the predetermined tightness threshold TH_G, the ID of the participle group and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the multilingual plagiarism identification mode, doubtful materials may be screened according to the Chinese-foreign-language participle group tightening coefficient of the document to be identified and the Chinese-foreign-language participle group tightening coefficient of the materials. The tightening coefficient statistics module for the document to be identified extracts high-density participle groups and their positions according to the tightening coefficient feature vector corresponding to each Chinese-foreign-language participle group in the document to be identified, WFGGCVE_TBI = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)]. According to the participle group part of speech WFG_CHAR in the Chinese-foreign-language participle group tightening coefficient feature vector, the module selects the participle groups whose part of speech is notional word and counts the total number of interval participles over a predetermined number of adjacent participle groups, where n is the predetermined adjacent number. When the total number of interval participles over the predetermined number of adjacent participle groups is less than the predetermined tightness threshold TH_G, the ID of the Chinese-foreign-language participle group and the corresponding position are recorded.
The predetermined adjacent number n and the tightness threshold TH_G are preset by the system and can be adjusted as actually needed. When the total number of interval participles over the predetermined number of adjacent participles is less than the predetermined tightness threshold TH_G, the notional-word participle can be considered to occur densely at the corresponding position; the passage may be a concentrated exposition of a particular viewpoint and deserves special attention.
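The density test can be sketched as follows. The summation formula is not reproduced in this text, so the sketch rests on two labeled assumptions: the entries G_W_ID_1 to G_W_ID_(W_N-1) are the numbers of participles lying between successive occurrences of one participle, and "n adjacent participles" means n consecutive occurrences, which span n - 1 gaps. The function name dense_positions is hypothetical.

```python
def dense_positions(gaps, n, th_g):
    """Return the indices of occurrences that start a dense run.

    gaps: [G_1, ..., G_(W_N-1)], the interval counts between successive
          occurrences of one notional-word participle (assumed meaning).
    n:    predetermined adjacent number (n >= 2); n occurrences span n - 1 gaps.
    th_g: tightness threshold TH_G.
    """
    span = n - 1
    hits = []
    for start in range(len(gaps) - span + 1):
        if sum(gaps[start:start + span]) < th_g:   # total interval participles
            hits.append(start)
    return hits

hits = dense_positions([10, 1, 2, 0, 30], n=3, th_g=5)
```

In the example, occurrences 1 to 3 and 2 to 4 lie close together (interval totals 3 and 2), while the first and last gaps are too large.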
In the ordinary plagiarism identification mode, the tightening coefficient doubtful material extraction module takes the participle IDs recorded when the total number of interval participles over the predetermined number of adjacent participles was less than the predetermined tightness threshold TH_G and extracts all materials in the comparison database that contain such a participle ID. For each material it calculates the participle tightening coefficient feature vector corresponding to that participle ID, WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)], and counts the total number of interval participles over the predetermined number of adjacent participles, where n is the predetermined adjacent number. When this total is less than the predetermined tightness threshold TH_G, the material is selected as a doubtful material. There may be one or more such participle IDs, and accordingly one or more materials containing them may be extracted.
In the extended plagiarism identification mode, the tightening coefficient doubtful material extraction module takes the participle group IDs recorded when the total number of interval participles over the predetermined number of adjacent participle groups was less than the predetermined tightness threshold TH_G and extracts all materials in the comparison database that contain such a participle group ID. For each material it calculates the participle group tightening coefficient feature vector corresponding to that participle group ID, WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)], and counts the total number of interval participles over the predetermined number of adjacent participle groups, where n is the predetermined adjacent number. When this total is less than the predetermined tightness threshold TH_G, the material is selected as a doubtful material. There may be one or more such participle group IDs, and accordingly one or more materials containing them may be extracted.
In the multilingual plagiarism identification mode, the tightening coefficient doubtful material extraction module takes the Chinese-foreign-language participle group IDs recorded when the total number of interval participles over the predetermined number of adjacent Chinese-foreign-language participle groups was less than the predetermined tightness threshold TH_G and extracts all materials in the comparison database that contain such a Chinese-foreign-language participle group ID. For each material it calculates the Chinese-foreign-language participle group tightening coefficient feature vector corresponding to that ID, WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)], and counts the total number of interval participles over the predetermined number of adjacent Chinese-foreign-language participle groups, where n is the predetermined adjacent number. When this total is less than the predetermined tightness threshold TH_G, the material is selected as a doubtful material. There may be one or more such Chinese-foreign-language participle group IDs, and accordingly one or more materials containing them may be extracted.
With this extraction method, notional-word participles that do not occur many times in the document to be identified overall, but are concentrated at certain positions, can be extracted together with their positions for further comparison.
According to a specific embodiment of the present invention, in the formula plagiarism identification mode, the formula extraction module extracts the formulas in the document to be identified; the formula decomposition module extracts, for each formula, its independent variable parameters, dependent variable parameters, operators, and the specific meaning, dimension and value range of each parameter; and the formula comparison module compares these items of each formula extracted from the document to be identified, one by one, with the independent variable parameters, dependent variable parameters, operators, and the specific meaning, dimension and value range of each parameter of the formulas stored in the formula library. When the degree of coincidence between the independent variable parameters, dependent variable parameters, operators, dimensions and value ranges of a formula in the document to be identified and those of a formula stored in the formula library exceeds the formula comparison threshold TH_MATH, the material associated in the formula library with the formula currently being compared is taken as a doubtful material. The degree of coincidence is the ratio of the number of independent variable parameters, dependent variable parameters, operators and dimensions that are identical between the formula in the document to be identified and the formula in the formula library to the total number of independent variable parameters, dependent variable parameters, operators and dimensions of the current formula in the document to be identified.
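The degree of coincidence defined above can be sketched as a simple ratio. How the patent represents a decomposed formula is not stated, so the dictionary of string sets used here, the key names, and the function name formula_coincidence are all assumptions made for illustration.

```python
KEYS = ("independent", "dependent", "operators", "dimensions")

def formula_coincidence(doc_formula, lib_formula):
    """Ratio of identical items to all items of the document's formula.

    Each formula is a dict mapping every key in KEYS to a set of strings
    (an assumed representation of the decomposed formula).
    """
    total = sum(len(doc_formula[k]) for k in KEYS)
    same = sum(len(doc_formula[k] & lib_formula[k]) for k in KEYS)
    return same / total if total else 0.0

doc_f = {"independent": {"m", "a"}, "dependent": {"F"},
         "operators": {"=", "*"}, "dimensions": {"kg", "m/s^2", "N"}}
lib_f = {"independent": {"m", "v"}, "dependent": {"p"},
         "operators": {"=", "*"}, "dimensions": {"kg", "m/s"}}
ratio = formula_coincidence(doc_f, lib_f)
```

Here 4 of the 8 items of the document's formula also appear in the library formula; the material would be flagged only if this ratio exceeded TH_MATH.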
According to a specific embodiment of the present invention, a sliding window may be used to compare the full text of the document to be identified with a doubtful material. The size of the sliding window is configurable in the system and directly affects the comparison result: a window that is too small easily causes misjudgments, and a window that is too large easily causes missed detections. The sliding step of the window is likewise preset by the system. As shown in Fig. 2:
Step S0: Start.
Step S1: The sliding window setting module initializes the similar-window counter CT1 = 0 and the sliding-step counter CT2 = 0.
Step S2: The sliding window setting module places the sliding windows of the document to be identified and of the doubtful material at their respective document start positions.
Step S3: The sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the doubtful material and counts the identical notional-word participles in them.
Step S4: The sliding window comparison module judges whether the number of identical notional-word participles is greater than or equal to the threshold TH_W. If it is, the counter is incremented by one, i.e. CT1 = CT1 + 1, and the current positions of the sliding window of the document to be identified and of the sliding window of the doubtful material, together with the contents of the windows, are recorded.
Step S5: The sliding window setting module slides the sliding window of the doubtful material by one sliding step.
Step S6: The sliding window setting module judges whether the sliding window of the doubtful material is at the document end position. If not, return to step S3; if so, go to step S11.
Step S11: The sliding window setting module judges whether the sliding window of the document to be identified is at the document end position. If not, go to step S12; if so, go to step S13.
Step S12: The sliding window setting module returns the sliding window of the doubtful material to the document start position; the sliding window of the document to be identified slides by one sliding step, CT2 = CT2 + 1; go to step S3.
Step S13: The sliding window comparison module calculates the ratio M of the value of the similar-window counter CT1 to the value of the sliding-step counter CT2.
Step S14: The sliding window comparison module judges whether the ratio M is greater than or equal to the predetermined threshold TH_M. When M >= TH_M, the document to be identified is considered similar to the doubtful material; when M < TH_M, the document to be identified is considered dissimilar to the doubtful material.
Step S15: The sliding window comparison module judges whether any further doubtful material remains to be compared. If so, return to step S1; if not, go to step S16.
Step S16: The comparison report generation module generates and outputs a comparison report. For the document to be identified and every similar doubtful material, the report contains the value of the similar-window counter CT1, the value of the sliding-step counter CT2 and the ratio of the two, as well as the specific positions and contents of the similar portions of the document to be identified and the similar doubtful material.
Step S17: The comparison ends.
According to a specific embodiment of the present invention, in step S3 the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the doubtful material and counts the identical notional-word participles in them. In the ordinary plagiarism identification mode, identical notional-word participles are notional-word participles with the same ID in the participle library; in the extended plagiarism identification mode, they are notional-word participle groups with the same ID in the participle library; in the multilingual plagiarism identification mode, they are notional-word Chinese-foreign-language participle groups with the same ID in the participle library.
According to a specific embodiment of the present invention, in step S16 the content of the comparison report output by the comparison report generation module differs according to the identification mode. In the ordinary plagiarism identification mode, the comparison report contains the specific positions and contents of the similar portions of the document to be identified and the similar doubtful material; the document to be identified uses the same form of expression as the similar portion of the doubtful material, and its wording is also identical, possibly with only the order of individual words adjusted. If the document to be identified has rewritten the document it plagiarizes and the degree of rewriting is large, the ordinary plagiarism identification mode may fail to find the plagiarized document. In the extended plagiarism identification mode, the comparison report contains the specific positions and contents of the similar portions of the document to be identified and the similar doubtful material; if the document to be identified has rewritten the plagiarized document using synonyms or near-synonyms and the document structure has changed little, the extended plagiarism identification mode may still find the plagiarized document. In the multilingual plagiarism identification mode, the comparison report contains the specific positions and contents of the similar portions of the document to be identified and the similar doubtful material; if the document to be identified has rewritten the plagiarized document by translation and the document structure has changed little, the multilingual plagiarism identification mode may still find the plagiarized document.
According to a specific embodiment of the present invention, the sliding window being at the document start position means that the leftmost side of the sliding window coincides with the document start position; the sliding window being at the document end position means that the rightmost side of the sliding window coincides with the document end position.
According to prior operating tests of the system, a sliding window size of four notional-word participles is suitable, although other sizes may be selected as needed. During comparison, the sliding window advances by a step of one notional-word participle each time. When three or more notional-word participles in the sliding windows are identical (the order of the notional-word participles is not considered), the current positions and contents of the sliding windows in the document to be identified and in the doubtful material are recorded.
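The sliding-window comparison of steps S1 to S14, with the window size of four and the match threshold of three mentioned above, can be sketched as follows. This is an illustrative reading of the flow, not the patented implementation. Assumptions: both inputs are lists of notional-word participle IDs, every window position up to and including the end position is compared, identical participles are counted as a multiset intersection, and M is set to 0 or 1 when CT2 is 0 (a case the text does not address). All function names are hypothetical.

```python
from collections import Counter

def window_overlap(a, b):
    """Number of participle IDs two windows share, ignoring order."""
    return sum((Counter(a) & Counter(b)).values())

def sliding_window_compare(doc_ids, mat_ids, size=4, step=1, th_w=3):
    ct1 = 0   # similar-window counter CT1
    ct2 = 0   # sliding-step counter CT2 (slides of the document window)
    hits = [] # recorded (document position, material position) pairs
    doc_starts = range(0, max(len(doc_ids) - size, 0) + 1, step)
    mat_starts = range(0, max(len(mat_ids) - size, 0) + 1, step)
    for k, i in enumerate(doc_starts):
        if k > 0:
            ct2 += 1                      # step S12: document window slid once
        for j in mat_starts:              # steps S3 to S6: sweep the material
            if window_overlap(doc_ids[i:i + size], mat_ids[j:j + size]) >= th_w:
                ct1 += 1                  # step S4
                hits.append((i, j))
    m = ct1 / ct2 if ct2 else float(ct1 > 0)  # step S13, guarded for CT2 = 0
    return ct1, ct2, m, hits

ct1, ct2, m, hits = sliding_window_compare([1, 2, 3, 4, 5], [9, 1, 2, 3, 4, 8])
```

Because CT1 counts matching window pairs while CT2 counts only slides of the document window, M can exceed 1 when one document window matches several material windows; the threshold TH_M would need to be chosen with that in mind.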
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments; any simple modification, equivalent change or refinement made to the above embodiments in accordance with the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

Claims (10)

1. A distributed text detection system, characterized by comprising:
a comparison database, for holding the materials used as comparison objects; the comparison database is stored in a distributed manner at different sites; when the comparison database is accessed, a particular site is selected for access according to the load of the different sites;
the comparison database further comprises the following sub-libraries: a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library;
a word-segment library, for holding word segments and their corresponding parts of speech; each word segment in the word-segment library is given a unique number, and W_ID denotes the unique number of a given word segment in the word-segment library;
a word segmentation module, for segmenting each material and saving the segmentation result in the comparison database; the word segmentation module compares the segmentation result with the parts of speech held in the word-segment library to determine the part of speech of each resulting word segment;
a word-segment feature value generation module, which counts the number of times each word segment occurs in the corresponding material and generates, for each word segment, the part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR] and the feature value WCV = [W_ID, W_N], where W_ID denotes the unique number of the word segment in the word-segment library, W_N denotes the total number of times the word segment occurs in the material, and W_CHAR denotes the part of speech of the word segment;
a word-segment free vector dimension determination module, which determines the word-segment free vector dimension WFV from the segmentation result of a material; the free vector dimension WFV equals the number of distinct word segments obtained after segmenting the given material;
a word-segment reduced vector dimension generation module, for reducing the free vector dimension WFV of each material to generate the word-segment reduced vector dimension RWV;
a word-segment feature vector generation module, for extracting from each material, according to the reduced vector dimension RWV, the corresponding feature values and generating the word-segment feature vector WVE_RWV:
WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i denotes the unique number of the word segment in the word-segment library and W_N_i denotes the total number of times the word segment occurs in the material, this count being used as the feature value of the word segment;
a user access mode detection module, for prompting the user to upload the document to be identified;
a user detection mode determination module; when it determines that the current user detection mode is the ordinary plagiarism identification mode, a word segmentation module for the document to be identified segments the document to be identified to obtain a segmentation result;
a free vector dimension determination module for the document to be identified, for determining the word-segment free vector dimension WFV_TBI from the segmentation result of the document to be identified;
a reduced vector dimension generation module for the document to be identified, for reducing the free vector dimension WFV_TBI of the document to be identified to generate the reduced vector dimension RWV_TBI of the document to be identified;
a word-segment feature vector generation module for the document to be identified, which extracts from each document to be identified, according to the reduced vector dimension RWV_TBI, the corresponding feature values and generates the word-segment feature vector WVE_RWV_TBI of the document to be identified, where
WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
where W_ID_i denotes the unique number of the word segment in the word-segment library and W_N_i denotes the total number of times the word segment occurs in the document to be identified, this count being used as the feature value of the word segment;
when the user detection mode determination module determines that the current user detection mode is the ordinary plagiarism identification mode and similarity comparison is performed, the word-segment feature vector generation module for the document to be identified generates the feature vector WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI] of the document to be identified, whose dimension is RWV_TBI; the word-segment feature vector generation module generates the feature vector WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV] of a material in the comparison database; wherein the feature vector dimension RWV_TBI of the document to be identified equals the feature vector dimension RWV of the material;
a feature vector adjustment module for the document to be identified, for arranging the W_ID_i values corresponding to all feature values in the feature vector WVE_RWV_TBI in ascending or descending order according to the numbering in the word-segment library and inserting the missing W_ID_i values, the feature value corresponding to each inserted number W_ID_i being 0, to obtain the expanded feature vector of the document to be identified WVE_RWV_TBI_EXT = [W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W];
a material feature vector adjustment module, for arranging the W_ID_i values corresponding to all feature values in the feature vector WVE_RWV in ascending or descending order according to the numbering in the word-segment library and inserting the missing W_ID_i values, the feature value corresponding to each inserted number W_ID_i being 0, to obtain the expanded feature vector WVE_RWV_EXT = [W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W];
an ordinary plagiarism identification similarity calculation module, which calculates the similarity between the document to be identified and any material in the comparison database according to a formula (the formula is not reproduced in this text);
after the document to be identified has been compared with all materials, all suspected materials are extracted, and the document to be identified is further compared with the suspected materials.
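The feature value and vector adjustment steps of claim 1 can be illustrated with a short sketch. It is hedged in several ways: the function names `feature_vector` and `expand` are hypothetical, word segmentation is assumed to have already mapped each word to its W_ID, and the set of IDs over which both vectors are expanded (`all_ids`) is assumed to be the full numbering of the word-segment library. The similarity formula itself is not reproduced in this text, so the sketch stops at producing two position-aligned vectors.

```python
from collections import Counter


def feature_vector(token_ids):
    """Build [(W_ID, W_N), ...]: each distinct W_ID with its occurrence count,
    sorted in ascending W_ID order."""
    return sorted(Counter(token_ids).items())


def expand(vector, all_ids):
    """Insert every W_ID missing from `vector` with a feature value of 0,
    so that two expanded vectors line up position by position.

    `all_ids` is assumed to cover every W_ID present in `vector`.
    """
    counts = dict(vector)
    return [(w_id, counts.get(w_id, 0)) for w_id in sorted(all_ids)]


if __name__ == "__main__":
    all_ids = range(1, 8)                      # hypothetical library of 7 IDs
    doc_vec = feature_vector([3, 1, 3, 7])     # document to be identified
    mat_vec = feature_vector([2, 3, 3, 3, 5])  # one material
    print(expand(doc_vec, all_ids))
    print(expand(mat_vec, all_ids))
```

After expansion both vectors have the same length and the same W_ID at each position, which is what any component-wise similarity measure needs.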
2. The distributed text detection system according to claim 1, wherein the parts of speech held in the word-segment library are classified as noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, and onomatopoeia.
3. The distributed text detection system according to claim 1 or 2, wherein the document to be identified and the suspected material are compared in full text.
4. The distributed text detection system according to claim 3, wherein the word-segment reduced vector dimension generation module reduces the word-segment free vector dimension WFV by part-of-speech screening, the reduction proceeding as follows:
the feature values of the segmentation result are classified by the part of speech of the corresponding word segment into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values, and V-class function word feature values;
the number of feature values in each class is counted: AMOUNT_A1, AMOUNT_A2, AMOUNT_B, AMOUNT_C, and AMOUNT_D denote the numbers of A1-, A2-, B-, C-, and D-class notional word feature values respectively, and AMOUNT_V denotes the number of V-class function word feature values;
the value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated, where RWV is the reduced vector dimension; if it is greater than 0, the reduction is exited; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated;
if RWV_S_D is greater than 0, RWV_S_D feature values are randomly extracted from the feature values counted by AMOUNT_V and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated;
if RWV_S_C is greater than 0, RWV_S_C feature values are randomly extracted from the feature values counted by AMOUNT_D and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated;
if RWV_S_B is greater than 0, RWV_S_B feature values are randomly extracted from the feature values counted by AMOUNT_C and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2) is further calculated;
if RWV_S_A2 is greater than 0, RWV_S_A2 feature values are randomly extracted from the feature values counted by AMOUNT_B and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV - AMOUNT_A1 is further calculated;
if RWV_S_A1 is greater than 0, RWV_S_A1 feature values are randomly extracted from the feature values counted by AMOUNT_A2 and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, a number of feature values equal to the reduced vector dimension RWV is randomly extracted from the feature values counted by AMOUNT_A1 and the reduction is complete.
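The cascade in claim 4 amounts to filling RWV slots class by class in the priority order A1, A2, B, C, D, V: whole classes are kept while they fit, and the first class that does not fit is sampled at random to fill the remaining slots. The sketch below expresses that reading as a loop instead of nested comparisons. The function name `reduce_features`, the dictionary layout, and the use of `None` to signal the "exit" case (fewer feature values than RWV) are assumptions made for illustration.

```python
import random

# Priority order of the part-of-speech classes named in the claim.
PRIORITY = ["A1", "A2", "B", "C", "D", "V"]


def reduce_features(features_by_class, rwv, rng=None):
    """Select exactly `rwv` feature values following the class priority.

    `features_by_class` maps a class name to a list of (W_ID, W_N) pairs.
    Returns None when the material has fewer than `rwv` feature values in
    total (the RWV_S_V > 0 case, where the claim exits the reduction).
    """
    if rng is None:
        rng = random.Random()
    total = sum(len(features_by_class.get(c, [])) for c in PRIORITY)
    if total < rwv:
        return None
    selected = []
    for cls in PRIORITY:
        remaining = rwv - len(selected)
        if remaining == 0:
            break
        feats = features_by_class.get(cls, [])
        if len(feats) <= remaining:
            selected.extend(feats)          # the whole class fits
        else:
            # Class is too large: draw the remaining slots at random.
            selected.extend(rng.sample(feats, remaining))
    return selected


if __name__ == "__main__":
    feats = {
        "A1": [(1, 5), (2, 3)],
        "A2": [(3, 1)],
        "B": [(4, 2), (5, 2), (6, 1)],
        "V": [(7, 9)],
    }
    print(reduce_features(feats, 4, random.Random(0)))
```

With RWV = 4 in this example, both A1 values and the single A2 value are always kept, and one of the three B values is drawn at random; the V-class function word is never reached.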
5. The distributed text detection system according to claim 4, wherein, when the calculated value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the corresponding material is taken as suspected material.
6. A distributed text detection method, characterized by comprising:
holding in a comparison database the materials used as comparison objects; the comparison database is stored in a distributed manner at different sites; when the comparison database is accessed, a particular site is selected for access according to the load of the different sites; the comparison database further comprises the following sub-libraries: a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library;
holding word segments and their corresponding parts of speech in a word-segment library; each word segment in the word-segment library is given a unique number, and W_ID denotes the unique number of a given word segment in the word-segment library;
a word segmentation module segments each material and saves the segmentation result in the comparison database; the word segmentation module compares the segmentation result with the parts of speech held in the word-segment library to determine the part of speech of each resulting word segment;
a word-segment feature value generation module counts the number of times each word segment occurs in the corresponding material and generates, for each word segment, the part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR] and the feature value WCV = [W_ID, W_N], where W_ID denotes the unique number of the word segment in the word-segment library, W_N denotes the total number of times the word segment occurs in the material, and W_CHAR denotes the part of speech of the word segment;
a word-segment free vector dimension determination module determines the word-segment free vector dimension WFV from the segmentation result of a material; the free vector dimension WFV equals the number of distinct word segments obtained after segmenting the given material;
a word-segment reduced vector dimension generation module reduces the free vector dimension WFV of each material to generate the word-segment reduced vector dimension RWV;
a word-segment feature vector generation module extracts from each material, according to the reduced vector dimension RWV, the corresponding feature values and generates the word-segment feature vector WVE_RWV:
WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i denotes the unique number of the word segment in the word-segment library and W_N_i denotes the total number of times the word segment occurs in the material, this count being used as the feature value of the word segment;
a user access mode detection module prompts the user to upload the document to be identified;
when a user detection mode determination module determines that the current user detection mode is the ordinary plagiarism identification mode, a word segmentation module for the document to be identified segments the document to be identified to obtain a segmentation result;
a free vector dimension determination module for the document to be identified determines the word-segment free vector dimension WFV_TBI from the segmentation result of the document to be identified;
a reduced vector dimension generation module for the document to be identified reduces the free vector dimension WFV_TBI of the document to be identified to generate the reduced vector dimension RWV_TBI of the document to be identified;
a word-segment feature vector generation module for the document to be identified extracts from each document to be identified, according to the reduced vector dimension RWV_TBI, the corresponding feature values and generates the word-segment feature vector WVE_RWV_TBI of the document to be identified, where
WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
where W_ID_i denotes the unique number of the word segment in the word-segment library and W_N_i denotes the total number of times the word segment occurs in the document to be identified, this count being used as the feature value of the word segment;
when the user detection mode determination module determines that the current user detection mode is the ordinary plagiarism identification mode and similarity comparison is performed, the word-segment feature vector generation module for the document to be identified generates the feature vector WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI] of the document to be identified, whose dimension is RWV_TBI; the word-segment feature vector generation module generates the feature vector WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV] of a material in the comparison database; wherein the feature vector dimension RWV_TBI of the document to be identified equals the feature vector dimension RWV of the material;
a feature vector adjustment module for the document to be identified arranges the W_ID_i values corresponding to all feature values in the feature vector WVE_RWV_TBI in ascending or descending order according to the numbering in the word-segment library and inserts the missing W_ID_i values, the feature value corresponding to each inserted number W_ID_i being 0, to obtain the expanded feature vector of the document to be identified WVE_RWV_TBI_EXT = [W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W];
a material feature vector adjustment module arranges the W_ID_i values corresponding to all feature values in the feature vector WVE_RWV in ascending or descending order according to the numbering in the word-segment library and inserts the missing W_ID_i values, the feature value corresponding to each inserted number W_ID_i being 0, to obtain the expanded feature vector WVE_RWV_EXT = [W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W];
an ordinary plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database according to a formula (the formula is not reproduced in this text);
after the document to be identified has been compared with all materials, all suspected materials are extracted, and the document to be identified is further compared with the suspected materials.
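Claim 6 states that the site used to access the distributed comparison database is chosen according to the load of the different sites, without saying how load is measured or compared. As a minimal sketch under assumptions, the example below treats load as a single number per site and picks the least-loaded one; the function name `choose_site`, the site names, and the load metric are all hypothetical.

```python
def choose_site(site_loads):
    """Return the name of the least-loaded site.

    `site_loads` maps a site name to a numeric load figure (for example a
    utilisation ratio). Both the metric and the "pick the minimum" policy
    are assumptions of this sketch; the claim only says the choice depends
    on load.
    """
    if not site_loads:
        raise ValueError("no sites available")
    return min(site_loads, key=site_loads.get)


if __name__ == "__main__":
    loads = {"site-a": 0.8, "site-b": 0.2, "site-c": 0.5}
    print(choose_site(loads))
```

A real deployment would likely also account for network distance and for which sub-libraries each site holds; those factors are outside what the claim specifies.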
7. The distributed text detection method according to claim 6, wherein the parts of speech held in the word-segment library are classified as noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, and onomatopoeia.
8. The distributed text detection method according to claim 6 or 7, wherein the document to be identified and the suspected material are compared in full text.
9. The distributed text detection method according to claim 8, wherein the word-segment reduced vector dimension generation module reduces the word-segment free vector dimension WFV by part-of-speech screening, the reduction proceeding as follows:
the feature values of the segmentation result are classified by the part of speech of the corresponding word segment into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values, and V-class function word feature values;
the number of feature values in each class is counted: AMOUNT_A1, AMOUNT_A2, AMOUNT_B, AMOUNT_C, and AMOUNT_D denote the numbers of A1-, A2-, B-, C-, and D-class notional word feature values respectively, and AMOUNT_V denotes the number of V-class function word feature values;
the value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated, where RWV is the reduced vector dimension; if it is greater than 0, the reduction is exited; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated;
if RWV_S_D is greater than 0, RWV_S_D feature values are randomly extracted from the feature values counted by AMOUNT_V and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated;
if RWV_S_C is greater than 0, RWV_S_C feature values are randomly extracted from the feature values counted by AMOUNT_D and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated;
if RWV_S_B is greater than 0, RWV_S_B feature values are randomly extracted from the feature values counted by AMOUNT_C and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2) is further calculated;
if RWV_S_A2 is greater than 0, RWV_S_A2 feature values are randomly extracted from the feature values counted by AMOUNT_B and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV - AMOUNT_A1 is further calculated;
if RWV_S_A1 is greater than 0, RWV_S_A1 feature values are randomly extracted from the feature values counted by AMOUNT_A2 and the reduction is complete; if it is equal to 0, the reduction is complete; if it is less than 0, a number of feature values equal to the reduced vector dimension RWV is randomly extracted from the feature values counted by AMOUNT_A1 and the reduction is complete.
10. The distributed text detection method according to claim 9, wherein, when the calculated value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the corresponding material is taken as suspected material.
CN201610020566.XA 2016-01-13 2016-01-13 A kind of distributed text detection method and system Active CN105550172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610020566.XA CN105550172B (en) 2016-01-13 2016-01-13 A kind of distributed text detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610020566.XA CN105550172B (en) 2016-01-13 2016-01-13 A kind of distributed text detection method and system

Publications (2)

Publication Number Publication Date
CN105550172A CN105550172A (en) 2016-05-04
CN105550172B true CN105550172B (en) 2018-06-01

Family

ID=55829361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610020566.XA Active CN105550172B (en) 2016-01-13 2016-01-13 A kind of distributed text detection method and system

Country Status (1)

Country Link
CN (1) CN105550172B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334325A (en) * 2019-07-16 2019-10-15 同方知网数字出版技术股份有限公司 A kind of full text similarity analysis method compared towards publishing house's strange land resource joint
CN111159337A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Chemical expression extraction method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226546A (en) * 2013-04-15 2013-07-31 北京邮电大学 Suffix tree clustering method on basis of word segmentation and part-of-speech analysis
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014026455A (en) * 2012-07-26 2014-02-06 Nippon Telegr & Teleph Corp <Ntt> Media data analysis device, method and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
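Near-duplicate text detection in a cloud environment (云环境中的近似复制文本检测); 许君 et al.; Journal of Computer Research and Development (《计算机研究与发展》); 2012-12-31; pp. 329-335 *
Chinese text plagiarism detection algorithm based on a left-normalized word-frequency vector space model (基于左归词频向量空间模型的中文文本抄袭检测算法); 谢松山 et al.; Journal of Southwest University, Natural Science Edition (《西南大学学报(自然科学版)》); 2015-05-31; Vol. 37, No. 5; pp. 158-161 *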

Also Published As

Publication number Publication date
CN105550172A (en) 2016-05-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant