CN112199938B

CN112199938B - Science and technology project similarity analysis method, computer equipment and storage medium

Info

Publication number: CN112199938B
Application number: CN202011258083.6A
Authority: CN
Inventors: 汪桢子; 章彬; 何维; 汪伟
Original assignee: Shenzhen Power Supply Co ltd
Current assignee: Shenzhen Power Supply Co ltd
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2023-11-14
Anticipated expiration: 2040-11-12
Also published as: CN112199938A

Abstract

The invention relates to a science and technology project similarity analysis method, computer equipment and a storage medium. The method comprises the following steps: acquiring the electronic document of the reporting material of a project to be reviewed, and extracting text from it to obtain the title information to be reviewed; acquiring the electronic document of the reporting material of a historical review project, and extracting text from it to obtain the historical title information; performing short text similarity analysis on the title information to be reviewed and the historical title information, and preliminarily judging from the analysis result whether the two are similar; if so, extracting text from the electronic documents of the project to be reviewed and of the historical review project to obtain the long text information to be reviewed and the historical long text information, performing long text similarity analysis and making the final similarity judgment; if not, moving on to the next historical review project or ending. The invention is suited to text similarity analysis of science and technology project reporting materials in the professional fields of electric power, and helps to realize intelligent assisted project approval review and to avoid duplicate project approvals.

Description

Science and technology project similarity analysis method, computer equipment and storage medium

Technical Field

The invention relates to the technical field of software information, in particular to a science and technology project similarity analysis method, computer equipment and a storage medium.

Background

With the deepening of electric power reform and the continuing development of science and technology, power grid companies review more and more scientific and technical research projects in each professional field. To avoid repeated reporting of similar projects, the reporting materials of these projects need to be examined for similarity. Science and technology project reporting materials are generally large texts, and similarity is currently judged by professionals who read, screen and compare the materials manually. Each reporting material therefore has to be compared by hand against the mass of prior reporting materials in a database, which consumes a large amount of manpower and time and makes it difficult to ensure that the judgment is efficient and accurate. With growing environmental awareness, power grid companies now work paperlessly: declaration materials are submitted and reviewed as electronic documents, which provides a basis for informatizing the review work. Whether a project has been declared repeatedly can be judged by analyzing the text similarity between the project to be reviewed and the historical review projects. Current text similarity analysis mainly consists of word segmentation followed by calculation of the distance between the segmented words, from which a similarity result is finally derived.

However, the current text similarity analysis method is not suitable for the scientific and technical research project review in each professional field of the power grid company, and the main reasons are as follows:

(1) Titles contain many professional terms, and these appear as long compound terms rather than as simple terms that can be segmented. For example, in 'research and application of a device visual monitoring model based on big data acceleration analysis and three-dimensional digitization', if 'big data acceleration analysis' and 'device visual monitoring model' are simply segmented into 'big data', 'acceleration', 'analysis', 'device', 'visualization', 'monitoring' and 'model', the meaning changes;

(2) Semantic understanding of professional names is poor. For example, semantic understanding assigns a relatively high similarity to 'source-terminal integrated energy system key technology and development model research' and 'integrated energy system multi-energy conversion simulation and comprehensive energy efficiency evaluation technology research', but in practice the two science and technology projects are quite different;

(3) Titles of science and technology projects are relatively short: about 30 characters at the longest and only about 10 characters at the shortest. Because the titles contain many technical names, and these names are often combined into longer terms that carry meaning, two projects are very likely to be similar if many such terms are repeated in both titles. If the edit distance is calculated directly, however, the resulting similarity may be very low;

(4) The title of a science and technology project is short text, whereas parts of the declaration materials such as the project abstract, main research content, technical route and expected targets are long text. These parts contain many sentences, and each sentence is related to the sentences before and after it, so the declaration materials of a science and technology project cannot be compared with a single simple text comparison method, and the prior art does not take this into account.

Disclosure of Invention

The invention aims to provide a science and technology project similarity analysis method, computer equipment and a computer readable storage medium which are suited to text similarity analysis of science and technology project reporting materials in the professional fields of electric power, and which help to realize intelligent assisted project approval review, avoid duplicate project approvals, and improve the quality and efficiency of project approval management.

To achieve the above object, according to a first aspect, an embodiment of the present invention provides a method for analyzing similarity of scientific and technological projects, including:

S1, acquiring the electronic document of the reporting material of a project to be reviewed, and extracting text from the electronic document to obtain the title information to be reviewed of the project to be reviewed;

S2, acquiring the electronic document of the declaration material of the i-th historical review project, and extracting text from the electronic document to obtain the historical title information of the i-th historical review project;

S3, performing short text similarity analysis on the title information to be reviewed and the historical title information of the i-th historical review project, and preliminarily judging from the analysis result whether the two are similar; if yes, executing steps S4 to S5 in turn, and if not, executing step S6; wherein the initial value of i is 1;

S4, extracting text from the electronic document of the reporting material of the project to be reviewed to obtain the long text information to be reviewed, and extracting text from the electronic document of the reporting material of the i-th historical review project to obtain the historical long text information;

S5, performing long text similarity analysis on the long text information to be reviewed and the historical long text information of the i-th historical review project, and finally judging from the analysis result whether the two are similar;

S6, judging whether i is smaller than N; if yes, letting i=i+1 and returning to step S2; if not, outputting the similarity judgment results between the project to be reviewed and all the historical review projects to a display unit for display, and ending the analysis flow; wherein M is a preset number and N is the total number of historical review projects.
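The S1 to S6 flow above can be read as a two-stage filter applied in a loop over the N historical projects. The following Python sketch illustrates that control flow under stated assumptions and is not the claimed implementation: the names `screen_project`, `title` and `body` are hypothetical, both similarity functions are passed in as parameters, and a simple word-overlap measure stands in for them in the usage lines.

```python
from typing import Callable, Dict, List, Optional


def screen_project(
    pending: Dict[str, str],
    history: List[Dict[str, str]],
    title_sim: Callable[[str, str], float],
    long_sim: Callable[[str, str], float],
    first_threshold: float,
    second_threshold: float,
) -> List[dict]:
    """Loop over the N historical projects (S2/S6) and record one verdict each."""
    results = []
    for i, past in enumerate(history, start=1):
        t = title_sim(pending["title"], past["title"])  # S3: short-text stage
        long_score: Optional[float] = None
        similar = False
        if t > first_threshold:  # preliminary hit, so run S4 and S5
            long_score = long_sim(pending["body"], past["body"])
            similar = long_score > second_threshold
        results.append(
            {"i": i, "title_sim": t, "long_sim": long_score, "similar": similar}
        )
    return results


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard over words); NOT the measure used in the text."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0


# Hypothetical toy data, for illustration only.
pending = {"title": "grid fault detection", "body": "detect faults in the grid"}
history = [
    {"title": "grid fault detection study", "body": "detect faults in the grid"},
    {"title": "solar output forecasting", "body": "forecast solar output"},
]
verdicts = screen_project(pending, history, word_overlap, word_overlap, 0.5, 0.8)
```

The long-text comparison runs only for the first historical project here, which is the point of the design: the cheap title comparison decides whether the expensive comparison is needed at all.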

Optionally, the step S3 includes:

step S31, obtaining the longest continuous common substring between the title information to be reviewed and the historical title information of the i-th historical review item, and removing it from the title information to be reviewed and from the historical title information of the i-th historical review item respectively to obtain a first character string and a second character string;

step S32, calculating the edit distance between the first character string and the second character string;

step S33, calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the edit distance;

and step S34, judging whether the title information to be reviewed is similar to the historical title information of the i-th historical review item by comparing the similarity with a first similarity threshold.

Optionally, the step S31 includes:

step S311, letting the title information to be reviewed be a character string $s_1$ and the historical title information of the i-th historical review item be a character string $s_i$;

step S312, finding the longest continuous common substring $s_z$ of the character strings $s_1$ and $s_i$;

step S313, if the length of the longest continuous common substring $s_z$ is greater than 2, removing $s_z$ from $s_1$ and from $s_i$ respectively to obtain two new character strings $s_{10}$ and $s_{i0}$, letting $s_1 = s_{10}$ and $s_i = s_{i0}$, and returning to step S312; if the length of the longest continuous common substring $s_z$ is less than or equal to 2, outputting $s_{10}$ as the first character string and $s_{i0}$ as the second character string.
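Steps S311 to S313 describe a loop that strips shared substrings until none longer than two characters remains. A minimal Python sketch of that loop follows. The function names are hypothetical, and two details are assumptions because the text does not specify them: only the first occurrence of the common substring is removed from each string, and ties between equally long substrings are broken by taking the first one found.

```python
from typing import Tuple


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def strip_common_substrings(s1: str, si: str, min_len: int = 2) -> Tuple[str, str]:
    """Repeat S312 and S313 until the longest common substring has length <= min_len."""
    while True:
        sz = longest_common_substring(s1, si)
        if len(sz) <= min_len:
            return s1, si  # the first and second character strings
        # Assumption: remove only the first occurrence in each string.
        s1 = s1.replace(sz, "", 1)
        si = si.replace(sz, "", 1)


first, second = strip_common_substrings("abcdeXYZ", "QabcdeW")
```

Each pass removes at least three characters from both strings, so the loop always terminates.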

Optionally, the calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the editing distance includes:

where $s_{10}$ denotes the first character string, $s_{i0}$ denotes the second character string, $sim(s_{10}, s_{i0})$ denotes the similarity between the title information to be reviewed and the historical title information of the i-th historical review item calculated from the edit distance, $ED$ denotes the edit distance between the first character string and the second character string, $len(s_{10})$ denotes the length of the first character string, and $len(s_{i0})$ denotes the length of the second character string.
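The similarity formula itself is not reproduced in this text, so the following Python sketch is an assumption rather than the patented formula. It uses one common normalization, 1 minus ED divided by the longer of the two lengths, which depends only on the quantities defined above. The function names and the treatment of two empty strings as fully similar are likewise hypothetical.

```python
def similarity_from_edit_distance(ed: int, len_first: int, len_second: int) -> float:
    """Map an edit distance to a similarity in [0, 1].

    ASSUMPTION: normalizes by max(len_first, len_second); the source formula
    is not available in this text and may differ.
    """
    longest = max(len_first, len_second)
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - ed / longest


def is_preliminarily_similar(sim: float, first_threshold: float) -> bool:
    """S34: similar when the similarity exceeds the first similarity threshold."""
    return sim > first_threshold


example = similarity_from_edit_distance(ed=4, len_first=5, len_second=5)
```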

Optionally, the title information to be reviewed includes a main title of the project to be reviewed and a subtitle in the research content; the history title information of the ith history review project comprises a project main title of the ith history review project and a subtitle in the research content;

the step S31 specifically includes: acquiring the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the historical title information of the i-th historical review item, and removing it from each to obtain a first character string $s_{jk1}$ and a second character string $s_{jk2}$; where $s_{jk1}$ denotes the first character string obtained from the j-th piece of title information in the title information to be reviewed by removing its longest continuous common substring with the k-th piece of title information in the historical title information, and $s_{jk2}$ denotes the second character string obtained from the k-th piece of title information in the historical title information by removing its longest continuous common substring with the j-th piece of title information in the title information to be reviewed;

the step S32 specifically includes: calculating the edit distance between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$ to obtain an edit distance set; each piece of title information in the title information to be reviewed has k corresponding edit distances;

the step S33 specifically includes: calculating, according to the edit distance set, the similarity between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$, and calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item from all the similarity calculation results; each piece of title information to be reviewed has k corresponding similarity calculation results.

Optionally, the outputting, to a display unit, the result of the similarity judgment between the item to be reviewed and all the history review items includes:

if at least one history review item is similar to the to-be-reviewed item, outputting the declaration material electronic document of the at least one history review item to a display unit;

if no history review item is similar to the to-be-reviewed item, sorting the history review items by their similarity to the to-be-reviewed item, and then selecting the declaration material electronic documents of the M history review items with the highest similarity and outputting them to a display unit for display; M is a preset number.
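The output rule can be sketched as a small selection function. This is an illustrative Python sketch only: the function name and the `similar` and `similarity` keys are hypothetical, and it reads the second branch as applying when no historical item was judged similar.

```python
from typing import Dict, List


def select_for_display(results: List[Dict], m: int) -> List[Dict]:
    """Return the similar items if any exist, otherwise the M closest items."""
    similar = [r for r in results if r["similar"]]
    if similar:
        return similar
    # No item was judged similar: rank all items and keep the top M.
    ranked = sorted(results, key=lambda r: r["similarity"], reverse=True)
    return ranked[:m]


# Hypothetical verdicts, for illustration only.
none_similar = [
    {"id": 1, "similarity": 0.2, "similar": False},
    {"id": 2, "similarity": 0.6, "similar": False},
    {"id": 3, "similarity": 0.4, "similar": False},
]
shown = select_for_display(none_similar, m=2)
```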

Optionally, the step S5 includes:

step S51, inputting the long text information to be reviewed and the historical long text information of the i-th historical review project respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vector to be reviewed and the historical paragraph vector of the i-th historical review project;

step S52, calculating the similarity between the paragraph vector to be reviewed and the history paragraph vector of the i-th history review item;

and step S53, judging whether the paragraph vector to be reviewed is similar to the historical paragraph vector of the i-th historical review project by comparing their similarity with a second similarity threshold.

Optionally, the step S1 further includes:

text extraction is carried out on the electronic document of the reporting material of the project to be reviewed to obtain project technical field information of the project to be reviewed;

the step S2 is to acquire the electronic document of the ith historical review project declaration material, and specifically comprises the following steps:

acquiring an ith historical review project declaration material electronic document in a database corresponding to the technical field of the project according to the technical field information of the project to be reviewed;

wherein, in the step S6, all the history review items are all the history review items in the database of the corresponding item technical field.

According to a third aspect, an embodiment of the present invention further proposes a computer device comprising: the science and technology project similarity analysis system; or a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the science and technology project similarity analysis method.

According to a fourth aspect, an embodiment of the present invention further proposes a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method for similarity analysis of scientific and technological projects.

The embodiments of the invention provide a science and technology project similarity analysis method and system, computer equipment and a computer readable storage medium. Title information is extracted from the declaration material electronic documents of the project to be reviewed and of the historical review projects, and similarity is judged on the extracted title information. Because title information is short text, the amount of calculation, the computing resources required and the computing time consumed are all small, which makes it practical to traverse all historical review projects, to judge the similarity between the project to be reviewed and every historical review project preliminarily and quickly, and so to realize a fast primary screening of similar projects. According to the preliminary similarity judgment result, long text information of the project to be reviewed and of the historical review project is then extracted, similarity analysis is carried out on the long text information, and whether the two projects are similar is finally determined from the analysis result. Based on the text characteristics of science and technology project reporting materials, the invention thus judges whether two projects are similar by combining short text similarity analysis with long text similarity analysis. This helps reviewers judge quickly whether a project has been reported repeatedly, ensures that the similarity judgment is efficient and accurate, realizes intelligent assisted project approval review, avoids duplicate project approvals, and improves the quality and efficiency of project approval management.

Additional features and advantages of the invention will be set forth in the detailed description which follows.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a similarity analysis method for technical projects according to an embodiment of the invention.

FIG. 2 is a schematic diagram of a Doc2vec PV-DM in accordance with an embodiment of the invention.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail in order to not obscure the present invention.

Referring to fig. 1, an embodiment of the present invention provides a technology project similarity analysis method, which includes:

For example, the title information to be reviewed may be "source-terminal integrated energy system key technology and development model research".

For example, the historical title information may be "integrated energy system multi-energy conversion simulation and comprehensive energy efficiency evaluation technology research".

S6, judging whether i is smaller than N; if yes, letting i=i+1 and returning to step S2; if not, outputting the similarity judgment results between the project to be reviewed and all the historical review projects to a display unit for display, and ending the analysis flow; wherein M is a preset number and N is the total number of historical review projects. M and N are integers.

In the method of this embodiment, title information is extracted from the declaration material electronic documents of the project to be reviewed and of the current historical review project, and similarity is judged on the extracted title information. Because title information is short text, the amount of calculation, the computing resources required and the computing time consumed are all small, which makes it practical to traverse all historical review projects, to judge the similarity between the project to be reviewed and every historical review project preliminarily and quickly, and so to realize a fast primary screening of similar projects. According to the preliminary similarity judgment result, long text information of the project to be reviewed and of the historical review project is then extracted, similarity analysis is carried out on the long text information, and whether the project to be reviewed is similar to the current historical review project is finally determined from the analysis result. Based on the text characteristics of science and technology project reporting materials, this embodiment thus judges whether two projects are similar by combining short text similarity analysis with long text similarity analysis.

Optionally, the step S3 includes:

Illustratively, the longest continuous common substring of "source-terminal integrated energy system key technology and development model research" and "integrated energy system multi-energy conversion simulation and comprehensive energy efficiency evaluation technology research" is "integrated energy system".

Specifically, this embodiment selects continuous common substrings rather than the longest common subsequence (LCS) because the longest common subsequence may split a noun that carries meaning into single characters, whereas a continuous substring that occurs in both strings is very likely to be a complete noun. The longest continuous common substring problem is to find the longest substring shared by two or more known strings; it differs from the longest common subsequence problem in that a subsequence does not have to be continuous, whereas a substring does.

Specifically, the edit distance between two character strings is the minimum number of editing operations required to convert one string into the other, where the editing operations include deletion, insertion, replacement, and the like.

The edit distance may be expressed as:

where $D(str1, str2, i, j)$ denotes the edit distance between the first i characters of the character string str1 and the first j characters of the character string str2, and $str1_i$ denotes the i-th character of the character string str1. The initial value $D(str1, str2, 0, 0)$ is 0.
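The displayed expression is not reproduced in this text. As an assumption, the standard recursive form of the edit distance with unit costs for deletion, insertion and replacement is written out below; it uses only the symbols defined above and agrees with the values in Table 1, but the exact form of the original expression may differ. The cost term $c_{ij}$ is introduced here for illustration.

```latex
D(str1, str2, i, j) =
\begin{cases}
j, & i = 0 \\
i, & j = 0 \\
\min\bigl\{\, D(str1, str2, i-1, j) + 1,\;
             D(str1, str2, i, j-1) + 1,\;
             D(str1, str2, i-1, j-1) + c_{ij} \,\bigr\}, & i \ge 1,\ j \ge 1
\end{cases}
\qquad
c_{ij} =
\begin{cases}
0, & str1_i = str2_j \\
1, & str1_i \ne str2_j
\end{cases}
```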

The above formula is a recursive definition. Given character strings s1 and s2 of lengths m and n respectively, a matching relation matrix of size (m+1)×(n+1) is generally used to calculate the edit distance. The element values in the matrix are:

where $d_{i,j}$ denotes the value in row i and column j of the matrix. An example of a matching relation matrix is given below: the edit distance between the character strings "相似度计算" ("similarity calculation") and "计算相似度" ("calculated similarity") is calculated, and the resulting edit distance is 4, as shown in Table 1:

TABLE 1 edit distance calculation matrix

(columns: the characters of "相似度计算", "similarity calculation"; rows: the characters of "计算相似度", "calculated similarity")

0	相	似	度	计	算
计	1	2	3	3	4
算	2	2	3	4	3
相	2	3	3	4	4
似	3	2	3	4	5
度	4	3	2	3	4
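The matrix calculation can be sketched in Python as follows. This is a minimal illustration using the standard dynamic-programming recurrence with unit costs, not code from the source, and the function names are hypothetical. It assumes the two strings of Table 1 are "计算相似度" (rows) and "相似度计算" (columns), the Chinese for "calculated similarity" and "similarity calculation".

```python
from typing import List


def edit_distance_matrix(s1: str, s2: str) -> List[List[int]]:
    """Build the (m+1) x (n+1) matrix whose entry d[i][j] is the edit
    distance between the first i characters of s1 and the first j of s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete i characters
    for j in range(n + 1):
        d[0][j] = j  # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement or match
            )
    return d


def edit_distance(s1: str, s2: str) -> int:
    """Edit distance is the bottom-right entry of the matrix."""
    return edit_distance_matrix(s1, s2)[len(s1)][len(s2)]


matrix = edit_distance_matrix("计算相似度", "相似度计算")
distance = edit_distance("计算相似度", "相似度计算")
```

Dropping row 0 and column 0 of `matrix` gives the 5×5 body of Table 1, and `distance` is its bottom-right value.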

Specifically, in this embodiment a set of science and technology projects was randomly selected, and the project title similarity was calculated on it with both the existing method and the method of this embodiment; the comparison results are shown in Table 2. The edit distances calculated by the method of this embodiment are relatively smaller, and the similarity results are closer to the actual similarity. In addition, the existing method and the method of this embodiment yield the same results when there is no common substring.

TABLE 2 comparison of name similarity under different algorithms

It should be noted that calculating and comparing project titles with the method of this embodiment gives a good result. For example, if the project title or a main-content subtitle of item A is similar to that of item B, the two items are likely to be related to some degree, which serves as the preliminary basis for judging repeated reporting. In addition, the comparison requires little computation. The electronic document of a science and technology project reporting material is usually a large text; comparing the full text against every history project in the conventional manner would inevitably consume a great deal of time and computing resources, whereas in the method of this embodiment the second similarity judgment on the long text is carried out only when the preliminary judgment finds similarity, which effectively solves this technical problem.

Step S34, judging whether the title information to be reviewed is similar to the historical title information of the i-th historical review item by comparing the similarity with a first similarity threshold;

specifically, when the similarity is greater than the first similarity threshold, it is determined that the title information to be reviewed is similar to that of the i-th historical review item, and steps S4 to S5 are then performed.

Optionally, the step S31 includes:

Specifically, the reporting material (project report) of a science and technology project generally requires the main title of the project, that is, the subject name, to be filled in, together with a description of the main research content; this content is generally described under several aspects, each with its own subtitle.

The step S31 specifically includes: acquiring the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the historical title information of the i-th historical review item, and removing it from each to obtain a first character string $s_{jk1}$ and a second character string $s_{jk2}$; where $s_{jk1}$ denotes the first character string obtained from the j-th piece of title information in the title information to be reviewed by removing its longest continuous common substring with the k-th piece of title information in the historical title information, and $s_{jk2}$ denotes the second character string obtained from the k-th piece of title information in the historical title information by removing its longest continuous common substring with the j-th piece of title information in the title information to be reviewed;

It should be noted that the main title of the project and each subtitle in the research content are each regarded as one piece of title information.

Specifically, assuming that the title information to be reviewed contains j pieces of title information and the historical title information contains k pieces, the title information to be reviewed corresponds to j×k edit distances in total.

The step S33 specifically includes: calculating, according to the edit distance set, the similarity between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$, and calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item from all the similarity calculation results; each piece of title information to be reviewed has k corresponding similarity calculation results.

Correspondingly, the title information to be reviewed corresponds to j×k similarity values, and the average of these j×k values is output as the similarity between the title information to be reviewed and the historical title information of the i-th historical review item.
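The j×k averaging can be sketched in a few lines of Python. The function name is hypothetical, and the pairwise similarity is passed in as a parameter; the exact-match function in the usage lines is a stand-in for illustration, not the edit-distance-based similarity of steps S31 to S33.

```python
from statistics import mean
from typing import Callable, List


def title_set_similarity(
    pending_titles: List[str],
    history_titles: List[str],
    pair_sim: Callable[[str, str], float],
) -> float:
    """Average the pairwise similarities over all j x k title pairs."""
    if not pending_titles or not history_titles:
        raise ValueError("both projects need at least one title")
    scores = [pair_sim(p, h) for p in pending_titles for h in history_titles]
    return mean(scores)


def exact_match(a: str, b: str) -> float:
    """Stand-in pairwise similarity used only for this illustration."""
    return 1.0 if a == b else 0.0


# Main title plus subtitles, j = 2 and k = 3 (hypothetical values).
score = title_set_similarity(["a", "b"], ["a", "c", "b"], exact_match)
```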

Specifically, after the similarity determination by the method of the present embodiment, M most similar history review items are output for further confirmation by the reviewer.

Optionally, the step S5 includes:

Illustratively, the similarity between two paragraph vectors may be determined from the distance between them: the smaller the distance, the greater the similarity.

It may be appreciated that the long text information in this embodiment may cover multiple aspects, such as the project summary and the main research content, each of which contains multiple paragraphs. The similarity may be calculated separately for each aspect, and the results then combined into the long text similarity analysis result: for example, by taking the average of the per-aspect similarities, or by multiplying each per-aspect similarity by a corresponding preset weight and summing the products. For the similarity of a single aspect E, suppose that aspect E of the item to be reviewed has n paragraphs and aspect E of the current historical review item has m paragraphs. Comparing each paragraph of the item to be reviewed with each paragraph of the current historical review item gives m similarity values per paragraph of the item to be reviewed and n×m similarity values in total, and the average of these n×m values is taken as the similarity between the item to be reviewed and the current historical review item in aspect E.
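The per-aspect averaging and the weighted combination can be sketched as follows. This Python sketch rests on assumptions: cosine similarity is used as the vector similarity (the text only says that closer vectors are more similar), the paragraph vectors are given as plain lists of floats rather than produced by a Doc2vec model, and all function and aspect names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; an ASSUMED choice of vector similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aspect_similarity(
    pending_vecs: List[Sequence[float]], history_vecs: List[Sequence[float]]
) -> float:
    """Average over the n x m paragraph pairs of one aspect."""
    scores = [cosine(p, h) for p in pending_vecs for h in history_vecs]
    return sum(scores) / len(scores)


def long_text_similarity(
    aspect_scores: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Combine per-aspect scores by plain average or by preset weights."""
    if weights is None:
        return sum(aspect_scores.values()) / len(aspect_scores)
    return sum(weights[name] * score for name, score in aspect_scores.items())


# Hypothetical 2-dimensional paragraph vectors: n = 2 and m = 1.
summary_score = aspect_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
overall = long_text_similarity({"summary": summary_score, "content": 1.0})
```

Passing `weights={"summary": 0.2, "content": 0.8}` switches from the plain average to the weighted sum described above.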

Specifically, this embodiment trains the Doc2vec model with the PV-DM (Distributed Memory Model of Paragraph Vectors) method. Fig. 2 shows the framework of the Doc2vec PV-DM of this embodiment: in addition to the word-level vectors, a vector representation is added for each paragraph/sentence. For example, for the sentence 'the cat sat on', if the word 'on' is to be predicted, the prediction may use features generated not only from the other words but also from the other words together with the sentence. Each paragraph/sentence is mapped into a vector space and can be represented by a column of a matrix; each word is likewise mapped into a vector space and can be represented by a column of a matrix. The paragraph vector and the word vectors are then concatenated or averaged to obtain a feature for predicting the next word in the sentence. The paragraph/sentence vector can also be regarded as a word that acts as a memory unit for the context, or as the topic of the paragraph. The context length is fixed during training, and the training set is generated with a sliding window; the paragraph/sentence vector is shared across the contexts generated from the same paragraph/sentence. The training process of the Doc2vec model in this embodiment mainly includes the following (1) and (2):

(1) Training the model to obtain the word vectors, the softmax parameters, and the paragraph/sentence vectors of the known training data.

(2) In the inference stage, obtaining the vector representation of a new paragraph. Specifically, more columns are added to the matrix D (the paragraph vector matrix), training is performed by the above method with the parameters obtained in (1) held fixed, and the new D is obtained by gradient descent, which gives the vector representation of the new paragraph.
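The inference stage in (2) can be illustrated with a small dependency-free sketch. It assumes the averaging variant of PV-DM, omits the softmax bias, and stores one vector per row rather than per column. The matrices `W` (word vectors) and `U` (softmax weights) in the demo are random stand-ins for the parameters that step (1) would have learned, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(d, windows, W, U):
    """Mean cross-entropy over all windows and its gradient w.r.t. d only."""
    dim = len(d)
    grad = [0.0] * dim
    loss = 0.0
    for context, target in windows:
        scale = 1.0 / (len(context) + 1)
        # Feature: average of the paragraph vector and the context word vectors.
        h = [scale * (d[j] + sum(W[c][j] for c in context)) for j in range(dim)]
        logits = [sum(U[v][j] * h[j] for j in range(dim)) for v in range(len(U))]
        probs = softmax(logits)
        loss -= math.log(probs[target])
        for v, p in enumerate(probs):
            err = p - (1.0 if v == target else 0.0)
            for j in range(dim):
                grad[j] += err * U[v][j] * scale
    n = len(windows)
    return loss / n, [g / n for g in grad]


def infer_paragraph_vector(word_ids, W, U, window=3, steps=100, lr=0.5, seed=0):
    """Gradient descent on a new paragraph vector; W and U are never updated.

    Assumes the paragraph has more than `window` words.
    """
    rng = random.Random(seed)
    d = [rng.uniform(-0.1, 0.1) for _ in range(len(W[0]))]
    # Sliding window: `window` context words predict the word that follows them.
    windows = [(word_ids[i:i + window], word_ids[i + window])
               for i in range(len(word_ids) - window)]
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(d, windows, W, U)
        losses.append(loss)
        d = [dj - lr * gj for dj, gj in zip(d, grad)]
    losses.append(loss_and_grad(d, windows, W, U)[0])
    return d, losses


# Random stand-ins for parameters that step (1) would have learned.
_rng = random.Random(42)
VOCAB, DIM = 5, 4
W = [[_rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
U = [[_rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB)]
# "the cat sat on the mat" encoded with hypothetical word ids.
vector, losses = infer_paragraph_vector([0, 1, 2, 3, 0, 4], W, U)
```

Only the new paragraph vector changes during the loop, which mirrors adding a column to D and descending on it alone. A production system would more likely rely on an existing Doc2vec implementation than on hand-written gradients.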

Optionally, the step S1 further includes:

Specifically, because a large number of historical science and technology projects have been reviewed, this embodiment further introduces a preliminary classification: the declaration material electronic documents of different types of historical science and technology projects are stored in separate databases. During similarity analysis, the project to be reviewed is compared only with the historical science and technology projects in the database corresponding to its technical field, which effectively reduces the computational workload.
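A minimal sketch of this field-based partitioning is given below. An in-memory dictionary stands in for the separate databases, and the class name, method names and field labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class FieldPartitionedStore:
    """Keeps historical project ids in one bucket per technical field."""

    def __init__(self) -> None:
        self._by_field: Dict[str, List[str]] = defaultdict(list)

    def add(self, field: str, project_id: str) -> None:
        """Register a historical project under its technical field."""
        self._by_field[field].append(project_id)

    def candidates(self, field: str) -> List[str]:
        """Return only the historical projects filed under the given field."""
        return list(self._by_field.get(field, []))


store = FieldPartitionedStore()
store.add("power transmission", "H1")
store.add("power transmission", "H2")
store.add("information technology", "H3")
```

A project to be reviewed in the hypothetical field "power transmission" would then be compared with H1 and H2 only, so H3 never enters the similarity calculation.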

In summary, to address the large data volume of science and technology projects, this embodiment provides three targeted measures: first, database classification screening; second, preliminary similarity screening on short text; third, secondary similarity screening on long text. Screening layer by layer allows the whole process to perform similarity analysis accurately while reducing the workload and increasing the processing speed.
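The three screening layers can be sketched as a single loop. This is an illustration under stated assumptions: each project is reduced to one title and one body string, "similar" is taken to mean a score at or above the threshold, and the Jaccard word overlap is only a placeholder for the short text and long text similarity measures of the embodiment. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Project = Dict[str, str]
Result = Tuple[str, float, float, bool]


def jaccard(a: str, b: str) -> float:
    """Placeholder similarity: word-set overlap, not the embodiment's measure."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def layered_screening(
    candidate: Project,
    history_by_field: Dict[str, List[Project]],
    title_sim: Callable[[str, str], float],
    long_sim: Callable[[str, str], float],
    title_threshold: float,
    long_threshold: float,
) -> List[Result]:
    """Return (id, title score, long-text score, final verdict) per survivor."""
    results: List[Result] = []
    # Layer 1: only historical projects from the same technical field.
    for hist in history_by_field.get(candidate["field"], []):
        # Layer 2: cheap short text comparison on titles.
        title_score = title_sim(candidate["title"], hist["title"])
        if title_score < title_threshold:
            continue
        # Layer 3: long text comparison, run only for title matches.
        long_score = long_sim(candidate["body"], hist["body"])
        results.append((hist["id"], title_score, long_score,
                        long_score >= long_threshold))
    return results
```

Because the long text comparison runs only for historical projects that pass the first two layers, its cost scales with the number of title matches rather than with the size of the whole archive.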

Another embodiment of the present invention further provides a computer device comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the science and technology project similarity analysis method according to the above embodiment.

Of course, the computer device may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.

For example, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specified functions, the instruction segments describing the execution of the computer program in the computer device.

The processor may be a central processing unit (Central Processing Unit, CPU), but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The processor is the control center of the computer device and connects the various parts of the computer device through various interfaces and lines.

The memory may be used to store the computer program and/or units, and the processor implements the various functions of the computer device by running or executing the computer program and/or units stored in the memory and by invoking data stored in the memory. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one disk storage device, a flash memory device, or other volatile solid-state storage device.

Another embodiment of the present invention also proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for analyzing similarity of scientific and technological projects described in the above embodiment.

In particular, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.

In summary, the embodiments of the invention provide a science and technology project similarity analysis method and system, computer equipment and a computer readable storage medium, which extract the title information from the declaration material electronic documents of the project to be reviewed and of the historical review projects, and judge the similarity of the extracted title information.
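The title comparison referred to above, in which common substrings are removed repeatedly (steps S311 to S313 of claim 1) and an edit distance is then computed, can be sketched as follows. Two points are assumptions rather than statements of the method: only the first occurrence of the common substring is removed from each string, and the similarity is normalized as 1 - ED / max(len, len), because the formula itself is not reproduced in this text. All function names are hypothetical.

```python
from typing import Tuple


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def strip_common_substrings(s1: str, s2: str) -> Tuple[str, str]:
    """Remove the longest common substring repeatedly while its length exceeds 2."""
    while True:
        common = longest_common_substring(s1, s2)
        if len(common) <= 2:
            return s1, s2
        # Assumption: remove the first occurrence only.
        s1 = s1.replace(common, "", 1)
        s2 = s2.replace(common, "", 1)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                   # delete ch_a
                           cur[j - 1] + 1,                # insert ch_b
                           prev[j - 1] + (ch_a != ch_b)))  # substitute
        prev = cur
    return prev[-1]


def residual_similarity(first: str, second: str) -> float:
    """Assumed normalization: 1 - ED / max(len(first), len(second))."""
    longest = max(len(first), len(second))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(first, second) / longest


def title_similarity(title_a: str, title_b: str) -> float:
    """Strip shared substrings, then score what is left."""
    first, second = strip_common_substrings(title_a, title_b)
    return residual_similarity(first, second)
```

Under the assumed normalization, two identical titles reduce to empty residual strings and score 1.0, while titles whose residues share nothing score 0.0.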

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for similarity analysis of scientific and technological projects, comprising:

step S1, acquiring an electronic document of the declaration material of the item to be reviewed, and extracting text from the electronic document to obtain the title information to be reviewed of the item to be reviewed; the title information to be reviewed comprises the project main title of the item to be reviewed and the sub-titles in the research content; the history title information of the i-th history review item comprises the project main title of the i-th history review item and the sub-titles in the research content;

step S2, acquiring the electronic document of the declaration material of the i-th history review item, and extracting text from the electronic document to obtain the history title information of the i-th history review item;

step S3, performing short text similarity analysis on the title information to be reviewed and the history title information of the i-th history review item, and preliminarily judging whether the two are similar according to the analysis result; if yes, executing the steps S4-S5 in sequence, and if not, executing the step S6; wherein the initial value of i is 1;

step S4, extracting text from the electronic document of the declaration material of the item to be reviewed to obtain the long text information to be reviewed, and extracting text from the electronic document of the declaration material of the i-th history review item to obtain the history long text information of the i-th history review item;

step S5, performing long text similarity analysis on the long text information to be reviewed and the history long text information of the i-th history review item, and finally judging whether the two are similar according to the analysis result;

step S6, judging whether i is smaller than N; if yes, letting i = i + 1 and returning to the step S2; if not, outputting the similarity judgment results between the item to be reviewed and all the history review items to a display unit for display, and ending the analysis flow; wherein M is a preset number, and N is the total number of the history review items;

wherein, the step S3 includes:

step S31, acquiring the longest continuous common substring between the title information to be reviewed and the history title information of the i-th history review item, and removing the longest continuous common substring from the title information to be reviewed and from the history title information of the i-th history review item respectively, to obtain a first character string and a second character string;

step S33, calculating, according to the edit distance, the similarity between the title information to be reviewed and the history title information of the i-th history review item;

step S34, judging whether the two are similar according to the result of comparing the similarity between the title information to be reviewed and the history title information of the i-th history review item with a first similarity threshold;

the step S31 specifically includes: acquiring the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the history title information of the i-th history review item, and removing it to obtain a first character string s_jk1 and a second character string s_jk2; wherein s_jk1 denotes the first character string obtained by removing, from the j-th piece of title information in the title information to be reviewed, its longest continuous common substring with the k-th piece of title information in the history title information, and s_jk2 denotes the second character string obtained by removing, from the k-th piece of title information in the history title information, its longest continuous common substring with the j-th piece of title information in the title information to be reviewed;

the step S32 specifically includes: calculating the edit distance between every first character string s_jk1 and its corresponding second character string s_jk2 to obtain an edit distance set; wherein each piece of title information in the title information to be reviewed has k corresponding edit distances;

the step S33 specifically includes: calculating, according to the edit distance set, the similarity between every first character string s_jk1 and its corresponding second character string s_jk2, and calculating, according to all the similarity calculation results, the similarity between the title information to be reviewed and the history title information of the i-th history review item; wherein each piece of title information in the title information to be reviewed has k corresponding similarity calculation results;

the step S31 includes:

step S311, setting the title information to be reviewed as a character string s₁, and the history title information of the i-th history review item as a character string s_i;

step S312, acquiring the longest continuous common substring s_z of the character strings s₁ and s_i;

step S313, if the length of the longest continuous common substring s_z is greater than 2, removing s_z from the character strings s₁ and s_i respectively to obtain two new character strings s₁₀ and s_i0, letting s₁ = s₁₀ and s_i = s_i0, and then returning to step S312; if the length of the longest continuous common substring s_z is less than or equal to 2, outputting s₁₀ as the first character string and s_i0 as the second character string.

2. The method according to claim 1, wherein the calculating, according to the edit distance, of the similarity between the title information to be reviewed and the history title information of the i-th history review item comprises:

wherein s₁₀ denotes the first character string, s_i0 denotes the second character string, sim(s₁₀, s_i0) denotes the similarity, calculated from the edit distance, between the title information to be reviewed and the history title information of the i-th history review item, ED denotes the edit distance between the first character string and the second character string, len(s₁₀) denotes the length of the first character string, and len(s_i0) denotes the length of the second character string.

3. The technological project similarity analysis method according to claim 1, wherein the outputting the similarity judgment result between the project to be reviewed and all the history review projects to a display unit for display includes:

if at least one history review item is not similar to the item to be reviewed, sorting the similarities between the item to be reviewed and all the history review items, and then outputting the declaration material electronic documents of the M history review items with the highest similarity to a display unit for display; wherein M is a preset number.

4. The method of analyzing similarity of scientific and technological projects according to claim 1, wherein said step S5 includes:

step S51, inputting the long text information to be reviewed and the history long text information of the i-th history review item respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vector to be reviewed and the history paragraph vector of the i-th history review item;

step S52, calculating the similarity between the paragraph vector to be reviewed and the history paragraph vector of the i-th history review item;

step S53, judging whether the two are similar according to the result of comparing the similarity between the paragraph vector to be reviewed and the history paragraph vector of the i-th history review item with a second similarity threshold.

5. The method for analyzing similarity of scientific and technological projects according to claim 1,

the step S1 further includes:

in the step S2, the acquiring of the electronic document of the declaration material of the i-th history review item specifically comprises:

acquiring, according to the technical field information of the item to be reviewed, the electronic document of the declaration material of the i-th history review item from the database corresponding to the technical field of the item to be reviewed.

6. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the science and technology project similarity analysis method according to any one of claims 1-5.

7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method for similarity analysis of scientific and technological projects according to any one of claims 1 to 5.