CN110033236A - Project duplicate checking method and system based on concurrent tasks

Info

Publication number: CN110033236A
Application number: CN201910287630.4A
Authority: CN
Inventors: 李�荣; 白万建; 李冬; 李勇; 李庆文; 何召慧; 于展鹏; 邢宏伟; 王刚; 戚鲁凤; 王宗光; 夏光明
Original assignee: State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Shandong Electric Power Co Ltd; Shandong Luneng Software Technology Co Ltd
Priority date: 2019-04-11
Filing date: 2019-04-11
Publication date: 2019-07-19

Abstract

The present invention discloses a project duplicate checking method and system based on concurrent tasks, comprising four steps. Relying on Internet technology, Internet hot words and common words are dynamically parsed to form a cloud dictionary. The text in the application materials is matched against the cloud dictionary by a character matching method, the materials are cut into semantically meaningful segmentation factors, and the best segmentation scheme is obtained by weighted calculation; word frequencies are then counted and high-frequency single-character words are excluded. The segment subset of the project currently being checked and the segment subset of a historical project are passed to the cosine similarity algorithm CosineSimilar, which returns the similarity value between the two projects. For large-data computation, large-capacity high-speed memory and sound memory management reduce frequent hard-disk reads and writes, and concurrent multi-threaded tasks are started to make full use of system resources and run the CPU at its maximum frequency, thereby improving duplicate checking efficiency.

Description

A project duplicate checking method and system based on concurrent tasks

Technical field

The present invention relates to the technical field of computational methods for judging, during project approval, whether application materials duplicate or resemble those of other projects, and specifically to a project duplicate checking method and system based on concurrent tasks.

Background art

When applying for projects, achievements and awards, applicants usually have to fill in a large amount of written application material. Such material may involve repeated submissions, plagiarism of other people's achievements and similar problems, which wastes manpower and material resources. In the past, checking text for duplication was done entirely by manual reading. As project information accumulates over the years, the demands on reviewers keep rising: the staff concerned must read a large amount of project information and need an exceptional memory to master the work, and the comparison workload is heavy and inefficient. Manual checking therefore becomes more and more difficult, and repeated submissions and plagiarism are hard to rule out during review. Although detection systems are available online, their duplicate checking results vary widely and are of uneven quality; they are slow and expensive, and sometimes no useful result is obtained even after paying.

Summary of the invention

(1) Technical problem to be solved

In view of the deficiencies of the prior art, the present invention provides a project duplicate checking method and system based on concurrent tasks, which has advantages such as reducing frequent hard-disk reads and writes and making full use of system resources, and which solves the problems that review work is becoming more and more difficult and that repeated submissions and plagiarism of other people's achievements are hard to rule out during review.

(2) Technical solution

To achieve the above object, the present invention provides the following technical solution: a project duplicate checking method and system based on concurrent tasks, comprising the following steps:

Step 1: Processing is carried out in a distributed manner. Borrowing the "electron cloud" (Electron Cloud) technique from quantum physics and exploiting characteristics of the electron cloud such as diffusivity and simultaneity, common words on the Internet and their popularity are collected and transmitted to a cloud server for dynamic parsing, and the parsed words are arranged by popularity and saved as the cloud dictionary.

Step 2: Concurrent multi-threaded tasks are started. From the processor details, the CPU utilization and the memory usage, combined with a concurrency parameter (default=2), the number of concurrent threads that can be started, Num_Threads, is calculated, while core threads are reserved to keep the system running normally. Whenever a large batch of data has to be computed in the subsequent steps, the system automatically uses concurrent multi-threaded tasks, making full use of system resources and running the CPU at its maximum frequency, thereby improving duplicate checking efficiency.

Step 3: The application material currently being checked is split into a paragraph set Cur_Sen = {Sen_1, Sen_2, ..., Sen_n}, where Cur_Sen is the paragraph set of the application material and Sen_1, Sen_2, ..., Sen_n are the split paragraphs. Using the forward matching method together with the cloud dictionary, each paragraph is parsed into a set of semantically meaningful segments Cur_Sen_i_F = {Word_1, Word_2, ..., Word_n}, where Cur_Sen_i_F is the segment set of the paragraph under forward matching, Word_1, Word_2, ..., Word_n are the segments of the split paragraph, and i = 1, 2, ..., n is the paragraph index. The matching weight score is calculated from the popularity in the dictionary as Cur_FScore = sum{hot(Word_i)^2}, where Cur_FScore is the weighted total score of the paragraph under forward matching; the hot function gives the weighted score of each segment, the sum function adds these up, and i = 1, 2, ..., n is the segment index. For a word that does not exist in the dictionary, the matching weight score is set to 0. At the same time, using the reverse matching method together with the cloud dictionary, each paragraph is parsed into a set of semantically meaningful segments Cur_Sen_i_R = {Word_1, Word_2, ..., Word_n}, where Cur_Sen_i_R is the segment set of the paragraph under reverse matching, and its score Cur_RScore = sum{hot(Word_i)^2} is calculated in the same way, again with a score of 0 for words not in the dictionary. Finally, the segmentation scheme with the higher score is taken, Max_Score = max{Cur_FScore, Cur_RScore}; if the two scores are equal, the former (forward matching) scheme is taken. The calculation loops until all paragraphs have been processed, and the segment sets are saved to the database for later reuse. Similarly, if a historical project has no stored segmentation result for its application material, that material is also parsed using the method of Step 3, and the best segmentation scheme is calculated and stored in the database.

Step 4: Using a statistical word-segmentation algorithm, the segmentation factors of the project currently being checked and the segmentation factor indices of the historical project are collected into sets, word frequencies are counted, and high-frequency single-character words (such as " ", " ", " ", etc.) are excluded. Cur_Word_Index is the segment word-frequency set of the project to be checked, where W_ID_1, W_ID_2, ..., W_ID_n are the segmentation factor indices and Num_1, Num_2, ..., Num_n are the word frequencies of the segments. His_Word_Index is the segment word-frequency set of the historical project, with W_ID_1, W_ID_2, ..., W_ID_n and Num_1, Num_2, ..., Num_n defined in the same way. Through the Map interface of a hash table, the word-frequency vector of the current project, c0 = [Num_1, Num_2, ..., Num_n], and the word-frequency vector of the historical project, c1 = [Num_1, Num_2, ..., Num_n], are calculated, and the word-frequency vectors are built over the union of the two sets, where Index is the index number of each segmentation factor. The cosine similarity algorithm CosineSimilar then returns the similarity value between the current project and the historical project; the closer the value is to 1, the higher the similarity.
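The word-frequency counting and cosine comparison of Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names word_frequencies and cosine_similar, the stop_words parameter (standing in for the excluded high-frequency single-character words) and the sample data are all hypothetical.

```python
import math
from collections import Counter


def word_frequencies(words, stop_words=frozenset()):
    """Count segment frequencies, skipping the excluded stop words (assumed input)."""
    return Counter(w for w in words if w not in stop_words)


def cosine_similar(cur, his):
    """Cosine similarity of two word-frequency maps over the union of their keys."""
    index = sorted(set(cur) | set(his))      # union of segmentation factors
    c0 = [cur.get(w, 0) for w in index]      # vector of the project being checked
    c1 = [his.get(w, 0) for w in index]      # vector of the historical project
    dot = sum(a * b for a, b in zip(c0, c1))
    norm = math.sqrt(sum(a * a for a in c0)) * math.sqrt(sum(b * b for b in c1))
    return dot / norm if norm else 0.0       # empty input is treated as no similarity


# Hypothetical sample data: "a" plays the role of an excluded single-character word.
cur = word_frequencies(["grid", "load", "grid", "a"], stop_words={"a"})
his = word_frequencies(["grid", "load", "load"])
similarity = cosine_similar(cur, his)
```

Building both vectors over the same sorted union keeps their positions aligned, which is what makes the dot product meaningful when the two projects do not share every segment.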

Preferably, in Step 1 the "electron cloud" (Electron Cloud) is connected to the cloud server via Ethernet.

Preferably, in Step 2 the CPU has two or more cores.

Preferably, in Step 3 Max_Score is the maximum score, and max{Cur_FScore, Cur_RScore} returns the maximum value by means of max.

Preferably, in Step 4 c0 is the word-frequency vector of the project currently being checked, and c1 is the word-frequency vector of the historical project.
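The forward and reverse matching of Step 3, together with the Max_Score selection, can be sketched as below. This is an illustrative sketch only: the patent does not specify the matching details, so greedy longest-match with a window of max_len characters is an assumption, as are the function names and the toy dictionary hot (a mapping from word to popularity).

```python
def forward_match(text, hot, max_len=4):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in hot:    # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def reverse_match(text, hot, max_len=4):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in hot:
                words.append(piece)
                j -= size
                break
    return words[::-1]                       # restore reading order


def score(words, hot):
    """sum{hot(Word_i)^2}; words missing from the dictionary score 0."""
    return sum(hot.get(w, 0) ** 2 for w in words)


def best_segmentation(text, hot, max_len=4):
    """Keep the higher-scoring scheme; on a tie, keep the forward result."""
    fwd = forward_match(text, hot, max_len)
    rev = reverse_match(text, hot, max_len)
    return fwd if score(fwd, hot) >= score(rev, hot) else rev


# Toy dictionary: forward matching yields ["abc", "d"], reverse yields ["ab", "cd"].
hot = {"abc": 2, "ab": 1, "cd": 3}
best = best_segmentation("abcd", hot)
```

In this toy case the forward scheme scores 4 and the reverse scheme scores 10, so the reverse segmentation is kept. Latin letters are used only to keep the example readable; the same functions operate on any string.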

(3) Beneficial effects

Compared with the prior art, the present invention provides a project duplicate checking method and system based on concurrent tasks, with the following beneficial effects:

1. The project duplicate checking method and system based on concurrent tasks relies on Internet technology to dynamically parse Internet hot words and common words and form a cloud dictionary. The text in the application materials is matched against the cloud dictionary by a character matching method, the materials are cut into semantically meaningful segmentation factors, and the best segmentation scheme is obtained by weighted calculation; word frequencies are counted and high-frequency single-character words are excluded. The segment subset of the project currently being checked and that of a historical project are passed to the cosine similarity algorithm CosineSimilar, which returns the similarity value between the two projects. For large-data computation, large-capacity high-speed memory and sound memory management reduce frequent hard-disk reads and writes, and concurrent multi-threaded tasks are started to make full use of system resources and run the CPU at its maximum frequency, thereby improving duplicate checking efficiency.

2. The project duplicate checking method and system based on concurrent tasks makes full use of system resources to achieve efficient duplicate checking. In addition, the hot words and common words collected through cloud technology provide strong support for the character matching method and improve the segmentation accuracy for application materials. The method is also highly scalable and supports selecting several articles for duplicate checking at the same time.
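The concurrent checking of one project against several historical projects can be sketched as follows. The patent does not give the formula for Num_Threads, so compute_num_threads below is a hypothetical heuristic; the names check_many and shared_words, the reserved parameter and the sample data are likewise assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def compute_num_threads(core_count, cpu_usage, mem_usage, concurrency=2, reserved=1):
    """Hypothetical heuristic: scale cores by the concurrency parameter and the
    idle fraction of the busier resource, then keep `reserved` threads free."""
    idle = 1.0 - max(cpu_usage, mem_usage)   # usage values are fractions in [0, 1]
    budget = int(core_count * concurrency * idle)
    return max(1, budget - reserved)         # always allow at least one worker


def check_many(current, history_items, compare, num_threads):
    """Compare `current` with every historical item on a thread pool.
    Results come back in the same order as `history_items`."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda item: compare(current, item), history_items))


def shared_words(a, b):
    """Stand-in comparison: number of distinct words the two lists share."""
    return len(set(a) & set(b))


current = ["grid", "load", "plan"]
history = [["grid", "load"], ["cloud"], ["plan", "grid", "load"]]
workers = compute_num_threads(os.cpu_count() or 2, cpu_usage=0.25, mem_usage=0.5)
results = check_many(current, history, shared_words, workers)
```

In practice the comparison function would be the similarity calculation of Step 4. Note that in CPython, threads mainly help when the per-item work waits on I/O such as database reads; whether that matches the system described here is not stated in the source.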

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Specific embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, the present invention provides a technical solution: Step 1: Processing is carried out in a distributed manner. Borrowing the "electron cloud" (Electron Cloud) technique from quantum physics and exploiting characteristics of the electron cloud such as diffusivity and simultaneity, common words on the Internet and their popularity are collected and transmitted to a cloud server for dynamic parsing, and the parsed words are arranged by popularity and saved as the cloud dictionary.
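The last part of Step 1, arranging parsed words by popularity and saving them as a dictionary, can be sketched as below. How words are collected and how popularity is measured are not described in the source, so the (word, popularity) pair format, the summing of popularity across batches and the name build_cloud_dictionary are all assumptions made for illustration.

```python
from collections import Counter


def build_cloud_dictionary(collected_batches):
    """Merge (word, popularity) observations from several collectors and
    return the entries sorted by popularity, highest first."""
    popularity = Counter()
    for batch in collected_batches:
        for word, heat in batch:
            popularity[word] += heat         # assumed: popularity adds up across batches
    # Sort by descending popularity; break ties alphabetically for a stable order.
    return sorted(popularity.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical observations from two distributed collectors.
batches = [
    [("project", 5), ("power grid", 3)],
    [("project", 2), ("cloud", 4)],
]
cloud_dictionary = build_cloud_dictionary(batches)
```

A mapping built from this list, for example dict(cloud_dictionary), can then serve as the popularity lookup used when scoring segments.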

As a further improvement, in Step 1 the "electron cloud" (Electron Cloud) is connected to the cloud server via Ethernet.

As a further improvement, in Step 2 the CPU has two or more cores.

As a further improvement, in Step 3 Max_Score is the maximum score, and max{Cur_FScore, Cur_RScore} returns the maximum value by means of max.

As a further improvement, in Step 4 c0 is the word-frequency vector of the project currently being checked, and c1 is the word-frequency vector of the historical project.
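For reference, the source names the algorithm CosineSimilar without reproducing its formula. The standard definition of cosine similarity for the vectors c0 and c1, both of length n, is given below; the subscript notation is introduced here for illustration.

```latex
\mathrm{CosineSimilar}(c_0, c_1)
  = \frac{\sum_{i=1}^{n} c_{0,i}\, c_{1,i}}
         {\sqrt{\sum_{i=1}^{n} c_{0,i}^{2}}\;\sqrt{\sum_{i=1}^{n} c_{1,i}^{2}}}
```

Because word frequencies are never negative, the value lies between 0 and 1, which is why a result closer to 1 indicates higher similarity.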

The electrical components referred to herein are electrically connected to an external main controller and to 220 V mains power; the main controller may be a conventional, known control device such as a computer.

In conclusion project duplicate checking method and system of this kind based on concurrent tasks, rely on Internet technology by internet Hot word, everyday expressions carry out dynamic analysis, form cloud dictionary.By characters matching method in declaration material text information with Cloud dictionary is matched, and is to obtain best participle by weighted calculation with the semantic participle factor by declaration material cutting Scheme counts word frequency and excludes high-frequency " monosyllabic word ".By the participle of the participle subset of current duplicate checking project and history item Subset returns to the similar value of current duplicate checking project and history item by cosine similarity algorithm CosineSimilar.It is counting greatly When according to calculating, using high-capacity and high-speed memory, reasonable employment memory management reduces the frequent read and write access of hard disk, opens concurrent more Thread task makes full use of system resource, CPU maximum frequency is played, to improve duplicate checking efficiency.Take full advantage of system resource reality Efficient duplicate checking function is showed, in addition to this, the hot word collected by cloud technology and everyday expressions are that characters matching method mentions Powerful support has been supplied, has improved the participle accuracy of declaration material, while its scalability is strong, has supported that choose plurality of articles looks into simultaneously Weight.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device.

Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the appended claims.

Claims

1. A project duplicate checking method and system based on concurrent tasks, characterized by comprising the following steps:

Step 1: Processing is carried out in a distributed manner. Borrowing the "electron cloud" (Electron Cloud) technique from quantum physics and exploiting characteristics of the electron cloud such as diffusivity and simultaneity, common words on the Internet and their popularity are collected and transmitted to a cloud server for dynamic parsing, and the parsed words are arranged by popularity and saved as the cloud dictionary.

Step 2: Concurrent multi-threaded tasks are started. From the processor details, the CPU utilization and the memory usage, combined with a concurrency parameter (default=2), the number of concurrent threads that can be started, Num_Threads, is calculated, while core threads are reserved to keep the system running normally. Whenever a large batch of data has to be computed in the subsequent steps, the system automatically uses concurrent multi-threaded tasks, making full use of system resources and running the CPU at its maximum frequency, thereby improving duplicate checking efficiency.

Step 3: The application material currently being checked is split into a paragraph set Cur_Sen = {Sen_1, Sen_2, ..., Sen_n}, where Cur_Sen is the paragraph set of the application material and Sen_1, Sen_2, ..., Sen_n are the split paragraphs. Using the forward matching method together with the cloud dictionary, each paragraph is parsed into a set of semantically meaningful segments Cur_Sen_i_F = {Word_1, Word_2, ..., Word_n}, where Cur_Sen_i_F is the segment set of the paragraph under forward matching, Word_1, Word_2, ..., Word_n are the segments of the split paragraph, and i = 1, 2, ..., n is the paragraph index. The matching weight score is calculated from the popularity in the dictionary as Cur_FScore = sum{hot(Word_i)^2}, where Cur_FScore is the weighted total score of the paragraph under forward matching; the hot function gives the weighted score of each segment, the sum function adds these up, and i = 1, 2, ..., n is the segment index. For a word that does not exist in the dictionary, the matching weight score is set to 0. Using the reverse matching method together with the cloud dictionary, each paragraph is likewise parsed into a set of semantically meaningful segments Cur_Sen_i_R = {Word_1, Word_2, ..., Word_n}, where Cur_Sen_i_R is the segment set of the paragraph under reverse matching, and its score Cur_RScore = sum{hot(Word_i)^2} is calculated in the same way, again with a score of 0 for words not in the dictionary. Finally, the segmentation scheme with the higher score is taken, Max_Score = max{Cur_FScore, Cur_RScore}; if the two scores are equal, the former (forward matching) scheme is taken. The calculation loops until all paragraphs have been processed, and the segment sets are saved to the database for later reuse. Similarly, if a historical project has no stored segmentation result for its application material, that material is also parsed using the method of Step 3, and the best segmentation scheme is calculated and stored in the database.

Step 4: Using a statistical word-segmentation algorithm, the segmentation factors of the project currently being checked and the segmentation factor indices of the historical project are collected into sets, word frequencies are counted, and high-frequency single-character words (such as " ", " ", " ", etc.) are excluded. Cur_Word_Index is the segment word-frequency set of the project to be checked, where W_ID_1, W_ID_2, ..., W_ID_n are the segmentation factor indices and Num_1, Num_2, ..., Num_n are the word frequencies of the segments. His_Word_Index is the segment word-frequency set of the historical project, with W_ID_1, W_ID_2, ..., W_ID_n and Num_1, Num_2, ..., Num_n defined in the same way. Through the Map interface of a hash table, the word-frequency vector of the current project, c0 = [Num_1, Num_2, ..., Num_n], and the word-frequency vector of the historical project, c1 = [Num_1, Num_2, ..., Num_n], are calculated, and the word-frequency vectors are built over the union of the two sets, where Index is the index number of each segmentation factor. The cosine similarity algorithm CosineSimilar then returns the similarity value between the current project and the historical project; the closer the value is to 1, the higher the similarity.

2. The project duplicate checking method and system based on concurrent tasks according to claim 1, characterized in that: in Step 1 the "electron cloud" (Electron Cloud) is connected to the cloud server via Ethernet.

3. The project duplicate checking method and system based on concurrent tasks according to claim 1, characterized in that: in Step 2 the CPU has two or more cores.

4. The project duplicate checking method and system based on concurrent tasks according to claim 1, characterized in that: in Step 3 Max_Score is the maximum score, and max{Cur_FScore, Cur_RScore} returns the maximum value by means of max.

5. The project duplicate checking method and system based on concurrent tasks according to claim 1, characterized in that: in Step 4 c0 is the word-frequency vector of the project currently being checked, and c1 is the word-frequency vector of the historical project.