CN112199938B - Science and technology project similarity analysis method, computer equipment and storage medium - Google Patents

Info

Publication number
CN112199938B
Authority
CN
China
Prior art keywords
reviewed
title information
historical
similarity
project
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011258083.6A
Other languages
Chinese (zh)
Other versions
CN112199938A (en)
Inventor
汪桢子
章彬
何维
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Co ltd
Original Assignee
Shenzhen Power Supply Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Power Supply Co ltd
Priority to CN202011258083.6A
Publication of CN112199938A
Application granted
Publication of CN112199938B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a science and technology project similarity analysis method, computer equipment and a storage medium. The method comprises the following steps: acquiring the electronic document of the declaration material of a project to be reviewed, and extracting text from it to obtain the title information to be reviewed; acquiring the electronic document of the declaration material of a historical review project, and extracting text from it to obtain the historical title information; performing short text similarity analysis on the title information to be reviewed and the historical title information, and preliminarily judging from the analysis result whether the two are similar; if so, extracting text from the electronic documents of the project to be reviewed and of the historical review project to obtain the long text information to be reviewed and the historical long text information, performing long text similarity analysis and making the final similarity judgment; if not, moving on to the next historical review project or ending. The invention is suitable for text similarity analysis of science and technology project declaration materials in the professional fields of the electric power industry, and helps to realize intelligent assisted project-approval review and to avoid approving duplicate projects.

Description

Science and technology project similarity analysis method, computer equipment and storage medium
Technical Field
The invention relates to the technical field of software information, in particular to a science and technology project similarity analysis method, computer equipment and a storage medium.
Background
With the deepening of electric power reform and the continuing development of science and technology, power grid companies review more and more scientific research projects in each professional field. To avoid repeated declaration of similar projects, the declaration materials of these projects must be examined for similarity. Declaration materials are generally large texts, and at present the similarity of science and technology projects is judged by professionals who read, screen and compare the materials manually. Each declaration must therefore be compared by hand against a massive number of earlier declarations in a database, which consumes a great deal of labor and time and makes it difficult to guarantee efficient and accurate similarity judgments. With growing environmental awareness, power grid companies now work paperlessly, and declaration materials are submitted and reviewed as electronic documents. Electronic documents provide a basis for informatizing the review work: whether a project has been declared repeatedly can be judged by analyzing the text similarity between the project to be reviewed and historical review projects. Current text similarity analysis mainly consists of segmenting the text into words, calculating distances between the segmented words, and combining these distances into a similarity result.
However, current text similarity analysis methods are not suitable for reviewing scientific research projects in the professional fields of a power grid company, for the following main reasons:
(1) Titles contain many professional terms, and these appear as long compound terms rather than as simple terms that can be segmented. For example, in the title 'research and application of a device visual monitoring model based on big data acceleration analysis and three-dimensional digitization', the terms 'big data acceleration analysis' and 'device visual monitoring model' would simply be segmented into 'big data', 'acceleration', 'analysis', 'device', 'visualization', 'monitoring' and 'model', which changes their meaning;
(2) Semantic understanding of professional names is poor. For example, semantic methods rate 'source-terminal comprehensive energy system key technology and development mode research' and 'comprehensive energy system multi-energy conversion simulation and comprehensive energy efficiency evaluation technology research' as highly similar, but in practice the two projects are quite different;
(3) Project titles are relatively short: about 30 characters at the longest and only about 10 at the shortest. Titles contain many nouns, and these nouns are often combined into longer terms that carry the meaning, so if many such terms are repeated in two titles, the two projects are very likely to be similar. A direct edit distance calculation, however, may yield a very low similarity.
(4) The title of a project is short text, whereas parts of the declaration material such as the project abstract, main research content, technical route and expected targets are long text. These parts contain many sentences, and adjacent sentences are related to one another, so the declaration materials cannot be compared with a single simple text comparison method. The prior art does not take this difference in text handling into account.
Disclosure of Invention
The invention aims to provide a science and technology project similarity analysis method, computer equipment and a computer readable storage medium that are suitable for text similarity analysis of science and technology project declaration materials in the professional fields of the electric power industry, and that help to realize intelligent assisted project-approval review, avoid approving duplicate projects, and improve the quality and efficiency of project-approval management.
To achieve the above object, according to a first aspect, an embodiment of the present invention provides a method for analyzing similarity of scientific and technological projects, including:
S1, acquiring the electronic document of the declaration material of the project to be reviewed, and extracting text from it to obtain the title information to be reviewed;
S2, acquiring the electronic document of the declaration material of the i-th historical review project, and extracting text from it to obtain the historical title information of the i-th historical review project;
S3, performing short text similarity analysis on the title information to be reviewed and the historical title information of the i-th historical review project, and preliminarily judging from the analysis result whether the two are similar; if so, executing steps S4 to S5 in turn; if not, executing step S6; wherein the initial value of i is 1;
S4, extracting text from the electronic document of the declaration material of the project to be reviewed to obtain the long text information to be reviewed, and extracting text from the electronic document of the declaration material of the i-th historical review project to obtain the historical long text information;
S5, performing long text similarity analysis on the long text information to be reviewed and the historical long text information of the i-th historical review project, and finally judging from the analysis result whether the two are similar;
S6, judging whether i is smaller than N; if so, letting i = i + 1 and returning to step S2; if not, outputting the similarity judgment results between the project to be reviewed and all the historical review projects to a display unit for display, and ending the analysis flow; wherein M is a preset number and N is the total number of historical review projects.
Optionally, the step S3 includes:
step S31, obtaining the longest continuous common substring between the title information to be reviewed and the historical title information of the i-th historical review project, and removing it from each of the two to obtain a first character string and a second character string;
step S32, calculating the edit distance between the first character string and the second character string;
step S33, calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review project according to the edit distance;
and step S34, judging whether the title information to be reviewed is similar to the historical title information of the i-th historical review project according to the result of comparing their similarity with a first similarity threshold.
Optionally, the step S31 includes:
step S311, setting the title information to be reviewed as a character string $s_1$ and the historical title information of the i-th historical review project as a character string $s_i$;
step S312, finding the longest continuous common substring $s_z$ of the character strings $s_1$ and $s_i$;
step S313, if the length of the longest continuous common substring $s_z$ is greater than 2, removing $s_z$ from the character strings $s_1$ and $s_i$ respectively to obtain two new character strings $s_{10}$ and $s_{i0}$, letting $s_1 = s_{10}$ and $s_i = s_{i0}$, and returning to step S312; if the length of $s_z$ is less than or equal to 2, outputting $s_{10}$ as the first character string and $s_{i0}$ as the second character string.
Optionally, the calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the editing distance includes:
wherein $s_{10}$ denotes the first character string, $s_{i0}$ denotes the second character string, $sim(s_{10}, s_{i0})$ denotes the similarity between the title information to be reviewed and the historical title information of the i-th historical review project as calculated from the edit distance, $ED$ denotes the edit distance between the first character string and the second character string, $len(s_{10})$ denotes the length of the first character string, and $len(s_{i0})$ denotes the length of the second character string.
Optionally, the title information to be reviewed includes the main title of the project to be reviewed and the subtitles in its research content; the historical title information of the i-th historical review project includes the main title of the i-th historical review project and the subtitles in its research content;
the step S31 specifically includes: acquiring the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the historical title information of the i-th historical review project, and removing it from each of the two to obtain a first character string $s_{jk1}$ and a second character string $s_{jk2}$; wherein $s_{jk1}$ denotes the first character string obtained from the j-th piece of title information to be reviewed after removing its longest continuous common substring with the k-th piece of historical title information, and $s_{jk2}$ denotes the second character string obtained from the k-th piece of historical title information after removing its longest continuous common substring with the j-th piece of title information to be reviewed;
the step S32 specifically includes: calculating the edit distance between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$ to obtain an edit distance set; each piece of title information to be reviewed has k corresponding edit distances;
the step S33 specifically includes: calculating, from the edit distance set, the similarity between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$, and calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review project from all the similarity calculation results; each piece of title information to be reviewed has k corresponding similarity calculation results.
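By way of illustration only, the pairwise comparison of main titles and subtitles might be sketched as follows. The text states that the overall title similarity is calculated "according to all similarity calculation results" but does not say how; taking the maximum over all pairs is an assumption made here, and the function names and the `sim` callback are hypothetical.

```python
from typing import Callable, List, Sequence


def pairwise_title_similarities(
    new_titles: Sequence[str],
    old_titles: Sequence[str],
    sim: Callable[[str, str], float],
) -> List[List[float]]:
    """Grid of similarities: row j is the j-th title to be reviewed,
    column k is the k-th historical title."""
    return [[sim(a, b) for b in old_titles] for a in new_titles]


def aggregate_title_similarity(grid: List[List[float]]) -> float:
    """Combine all pairwise results into one score.

    ASSUMPTION: the maximum is used; the source does not specify the rule.
    """
    return max((value for row in grid for value in row), default=0.0)
```

Any short-text similarity function can be passed as `sim`, for example one built from the substring removal and edit distance steps S31 to S33.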
Optionally, the outputting of the similarity judgment results between the project to be reviewed and all the historical review projects to a display unit includes:
if at least one historical review project is similar to the project to be reviewed, outputting the electronic document of the declaration material of the at least one historical review project to the display unit;
if no historical review project is similar to the project to be reviewed, sorting the similarities between the project to be reviewed and all the historical review projects, and then outputting the electronic documents of the declaration materials of the M historical review projects with the highest similarity to the display unit for display; M is a preset number.
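The display rule can be sketched as follows, purely as an illustration. The tuple layout `(project_id, similarity, is_similar)` and the function name are hypothetical and are not taken from the text.

```python
from typing import List, Tuple

Result = Tuple[str, float, bool]  # (project_id, similarity, is_similar), hypothetical layout


def select_for_display(results: List[Result], m: int) -> List[str]:
    """Return the ids of the historical projects whose documents are shown."""
    similar = [r for r in results if r[2]]
    if similar:
        # At least one historical project was judged similar: show those.
        return [r[0] for r in similar]
    # Otherwise fall back to the M projects with the highest similarity.
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [r[0] for r in ranked[:m]]
```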
Optionally, the step S5 includes:
step S51, inputting the long text information to be reviewed and the historical long text information of the i-th historical review project respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vector to be reviewed and the historical paragraph vector of the i-th historical review project;
step S52, calculating the similarity between the paragraph vector to be reviewed and the historical paragraph vector of the i-th historical review project;
and step S53, judging whether the long text information to be reviewed and the historical long text information of the i-th historical review project are similar according to the result of comparing the similarity of the two paragraph vectors with a second similarity threshold.
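As a minimal sketch of steps S52 and S53, the comparison of two paragraph vectors might look like the following. The vectors are assumed to have already been produced by the Doc2vec model of step S51, which is not shown. Cosine similarity is an assumption made here, since this passage does not name the vector similarity measure, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def long_texts_similar(u: Sequence[float], v: Sequence[float],
                       second_threshold: float) -> bool:
    """Step S53: compare the vector similarity with the second threshold."""
    return cosine_similarity(u, v) > second_threshold
```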
Optionally, the step S1 further includes:
text extraction is carried out on the electronic document of the reporting material of the project to be reviewed to obtain project technical field information of the project to be reviewed;
the step S2 is to acquire the electronic document of the ith historical review project declaration material, and specifically comprises the following steps:
acquiring an ith historical review project declaration material electronic document in a database corresponding to the technical field of the project according to the technical field information of the project to be reviewed;
wherein, in the step S6, all the history review items are all the history review items in the database of the corresponding item technical field.
According to a third aspect, an embodiment of the present invention further provides computer equipment comprising: the science and technology project similarity analysis system; or a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the science and technology project similarity analysis method.
According to a fourth aspect, an embodiment of the present invention further proposes a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method for similarity analysis of scientific and technological projects.
The embodiments of the invention provide a science and technology project similarity analysis method and system, computer equipment and a computer readable storage medium. Title information is extracted from the electronic documents of the declaration materials of the project to be reviewed and of the historical review projects, and similarity is judged on the extracted title information. Because title information is short text, the amount of calculation, the computing resources required and the computing time consumed are all small, which makes it practical to traverse all the historical review projects, judge preliminarily and quickly how similar the project to be reviewed is to each of them, and thereby achieve a fast preliminary screening of similar projects. Long text information of the project to be reviewed and of the historical review project is then extracted according to the preliminary similarity judgment, similarity analysis is performed on the long text information, and whether the two projects are similar is finally determined from the analysis result. Based on the text characteristics of science and technology project declaration materials, the invention thus judges whether two projects are similar by combining short text similarity analysis with long text similarity analysis. This helps reviewers judge quickly whether a project has been declared repeatedly, ensures efficient and accurate similarity judgment, realizes intelligent assisted project-approval review, avoids approving duplicate projects, and improves the quality and efficiency of project-approval management.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a similarity analysis method for technical projects according to an embodiment of the invention.
FIG. 2 is a schematic diagram of a Doc2vec PV-DM in accordance with an embodiment of the invention.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail in order to not obscure the present invention.
Referring to fig. 1, an embodiment of the present invention provides a technology project similarity analysis method, which includes:
S1, acquiring the electronic document of the declaration material of the project to be reviewed, and extracting text from it to obtain the title information to be reviewed;
for example, "source-terminal comprehensive energy system key technology and development mode research".
S2, acquiring the electronic document of the declaration material of the i-th historical review project, and extracting text from it to obtain the historical title information of the i-th historical review project;
for example, "comprehensive energy system multi-energy conversion simulation and comprehensive energy efficiency evaluation technology research".
S3, carrying out short text similarity analysis according to the title information to be reviewed and the history title information of the ith history review project, and preliminarily judging whether the two are similar according to an analysis result; if yes, executing the steps S4 to S5 in turn, and if not, executing the step S6; wherein the initial value of i is 1;
S4, extracting text from the electronic document of the declaration material of the project to be reviewed to obtain the long text information to be reviewed, and extracting text from the electronic document of the declaration material of the i-th historical review project to obtain the historical long text information;
S5, performing long text similarity analysis on the long text information to be reviewed and the historical long text information of the i-th historical review project, and finally judging from the analysis result whether the two are similar;
S6, judging whether i is smaller than N; if so, letting i = i + 1 and returning to step S2; if not, outputting the similarity judgment results between the project to be reviewed and all the historical review projects to a display unit for display, and ending the analysis flow; wherein M is a preset number and N is the total number of historical review projects; M and N are integers.
According to the method, the title information of the declaration material electronic document of the project to be reviewed and the current history review project is extracted, and similarity judgment is carried out on the extracted title information, and as the title information is short text, the calculation amount is small, the required calculation resources are small, and the consumed calculation time is also very small, the method is beneficial to traversing all the history review projects, preliminarily and rapidly judging the similarity between the project to be reviewed and all the history review projects, and realizing quick similarity project primary screening; and further extracting long text information of the to-be-reviewed item and the historical review item according to the preliminary similarity judgment result, carrying out similarity analysis according to the long text information, and finally determining whether the to-be-reviewed item is similar to the current historical review item or not according to the analysis result. Based on the text characteristics of the science and technology project reporting material, the embodiment provides a mode of combining short text similarity analysis and long text similarity analysis to judge whether two projects are similar.
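The two-stage flow of steps S1 to S6 can be sketched as follows, as an illustration only. The function name, the callback parameters and the result layout are hypothetical. For brevity the sketch receives the long texts already extracted, whereas the method described above extracts them in step S4 only after the titles are found similar.

```python
from typing import Callable, List, Tuple


def screen_project(
    new_title: str,
    new_body: str,
    history: List[Tuple[str, str]],          # (title, long text) of each historical project
    title_sim: Callable[[str, str], float],  # short text similarity (hypothetical callback)
    body_sim: Callable[[str, str], float],   # long text similarity (hypothetical callback)
    first_threshold: float,
    second_threshold: float,
) -> List[Tuple[int, float, bool]]:
    """Return (index, title similarity, final verdict) for every historical project."""
    results = []
    for i, (old_title, old_body) in enumerate(history):
        score = title_sim(new_title, old_title)        # S3: cheap title comparison
        similar = False
        if score > first_threshold:                    # S4-S5 run only after a title hit
            similar = body_sim(new_body, old_body) > second_threshold
        results.append((i, score, similar))
    return results                                     # S6: all N projects visited
```

The expensive long text comparison is invoked only for the historical projects that pass the title screen, which is the source of the savings described above.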
Optionally, the step S3 includes:
step S31, obtaining the longest continuous common substring between the title information to be reviewed and the historical title information of the i-th historical review project, and removing it from each of the two to obtain a first character string and a second character string;
illustratively, the longest continuous common substring of "source-terminal comprehensive energy system key technology and development mode research" and "comprehensive energy system multi-energy conversion simulation and comprehensive energy efficiency evaluation technology research" is "comprehensive energy system".
Specifically, this embodiment selects continuous common substrings rather than the longest common subsequence (LCS) because the longest common subsequence may split a noun that carries meaning into single characters, whereas a continuous substring that occurs in both strings is very likely to be a complete noun. The longest continuous common substring problem is to find the longest substring shared by two or more known strings; it differs from the longest common subsequence problem in that a subsequence does not have to be continuous, whereas a substring does.
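A minimal sketch of finding the longest continuous common substring of two strings with dynamic programming is given below. The function name is hypothetical, and when several common substrings share the maximum length, the sketch returns the one that ends earliest in the first string.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of characters that appears contiguously in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)   # prev[j]: common suffix length of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the contiguous match that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

For "ABCBDAB" and "BDCABA" the function returns a substring of length 2, whereas the longest common subsequence of the same pair ("BCBA") has length 4, because a subsequence may skip characters.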
Step S32, calculating an editing distance between the first character string and the second character string;
specifically, the edit distance between two character strings is the minimum number of edit operations required to convert one string into the other; the edit operations include deletion, insertion, replacement and the like.
The edit distance may be expressed as:
where $D(str1, str2, i, j)$ denotes the edit distance between the first i characters of the character string str1 and the first j characters of the character string str2, and $str1_i$ denotes the i-th character of the character string str1. The initial value $D(str1, str2, 0, 0)$ is 0.
The above formula is a recursive definition. Given character strings s1 and s2 of lengths m and n respectively, an (m+1) × (n+1) matching relation matrix is generally used to calculate the edit distance. The element values in the matrix are:
where $d_{i,j}$ denotes the value in row i and column j of the matrix. An example of a matching relation matrix is given below for the character strings "相似度计算" ("similarity calculation") and "计算相似度" ("calculating similarity"); the resulting edit distance is 4, as shown in Table 1:
TABLE 1 Edit distance calculation matrix
0    相   似   度   计   算
计   1    2    3    3    4
算   2    2    3    4    3
相   2    3    3    4    4
似   3    2    3    4    5
度   4    3    2    3    4
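The matrix calculation can be sketched as follows. The recurrence used is the standard Levenshtein recurrence with unit costs for insertion, deletion and replacement, which is assumed here to match the formula that is not reproduced in the text; the function names are hypothetical.

```python
from typing import List


def edit_distance_matrix(s1: str, s2: str) -> List[List[int]]:
    """Fill the (m+1) x (n+1) matrix d, where d[i][j] is the edit distance
    between the first i characters of s1 and the first j characters of s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete i characters
    for j in range(n + 1):
        d[0][j] = j                      # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement or match
            )
    return d


def edit_distance(s1: str, s2: str) -> int:
    """Bottom-right element of the matrix."""
    return edit_distance_matrix(s1, s2)[-1][-1]
```

Calling `edit_distance_matrix("计算相似度", "相似度计算")` and dropping the first row and first column gives the interior values listed in Table 1, with the edit distance 4 in the bottom-right corner.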
Step S33, calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the editing distance;
specifically, in this embodiment a set of science and technology projects was selected at random, and project title similarity was calculated on it with both the existing method and the method of this embodiment; the comparison results are shown in Table 2 below. The edit distances calculated by the method of this embodiment are relatively smaller, and the resulting similarity values are closer to the actual similarity. In addition, the existing method and the method of this embodiment yield the same result when the two titles have no common substring.
TABLE 2 comparison of name similarity under different algorithms
It should be noted that using the method of this embodiment to calculate and compare project titles gives a fairly good result. For example, if the project title or a main-content subtitle of project A is similar to that of project B, then projects A and B are likely to be related to some degree, and this serves as a preliminary basis for judging repeated declaration. In addition, the comparison requires little computation. The electronic document of a declaration material is usually a large text, and comparing the full text against every historical project in the conventional way would inevitably consume a great deal of time and computing resources; in the method of this embodiment, the second similarity judgment on the long text is carried out only when the preliminary judgment finds the titles similar, which effectively solves this problem.
Step S34, judging whether the title information to be reviewed is similar to the historical title information of the i-th historical review project according to the result of comparing their similarity with a first similarity threshold;
specifically, when the similarity is greater than the first similarity threshold, the title information to be reviewed is judged to be similar to that of the i-th historical review project, and steps S4 to S5 are then executed.
Optionally, the step S31 includes:
step S311, setting the title information to be reviewed as a character string s_1, and the historical title information of the i-th historical review item as a character string s_i;
Step S312, finding the longest continuous common substring s_z of the character strings s_1 and s_i;
Step S313, if the length of the longest continuous common substring s_z is greater than 2, removing s_z from the character strings s_1 and s_i respectively to obtain two new character strings s_10 and s_i0, letting s_1 = s_10 and s_i = s_i0, and returning to step S312; if the length of the longest continuous common substring s_z is less than or equal to 2, outputting s_10 as the first character string and s_i0 as the second character string.
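The loop of steps S311 to S313 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and two details the text leaves open are assumed here, namely that only the first occurrence of the common substring is removed from each string and that ties between equally long substrings are broken by taking the first one found.

```python
from typing import Tuple


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def strip_common_substrings(s1: str, si: str, min_len: int = 2) -> Tuple[str, str]:
    """Repeat steps S312 and S313 until the common substring has length <= min_len."""
    while True:
        sz = longest_common_substring(s1, si)
        if len(sz) <= min_len:
            return s1, si  # first and second character strings
        # Assumption: remove only the first occurrence from each string.
        s1 = s1.replace(sz, "", 1)
        si = si.replace(sz, "", 1)
```

Each pass removes more than two characters from both strings, so the loop always terminates. For instance, `strip_common_substrings("grid fault detection", "grid fault diagnosis")` strips the shared prefix and leaves the two differing tails for the edit distance step.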
Optionally, the calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the editing distance includes:
wherein s_10 represents the first character string, s_i0 represents the second character string, sim(s_10, s_i0) represents the similarity between the title information to be reviewed and the historical title information of the i-th historical review item calculated from the edit distance, ED represents the edit distance between the first character string and the second character string, len(s_10) represents the length of the first character string, and len(s_i0) represents the length of the second character string.
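As an illustration of the quantities named above, the sketch below computes the edit distance ED with the standard Levenshtein recurrence and turns it into a similarity. The normalization used, 1 - ED / max(len(s_10), len(s_i0)), is an assumption chosen because it is a common way to combine these three quantities; the patent's own expression for sim(s_10, s_i0) is not reproduced in this text and may differ. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def title_similarity(s10: str, si0: str) -> float:
    """Assumed normalization: 1 - ED / max(len(s10), len(si0))."""
    longest = max(len(s10), len(si0))
    if longest == 0:
        return 1.0  # both residues empty: nothing left that differs
    return 1.0 - edit_distance(s10, si0) / longest
```

Under this assumed normalization the value lies in [0, 1], and two titles whose residues are both empty after common-substring removal score 1.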
Optionally, the title information to be reviewed includes a main title of the project to be reviewed and a subtitle in the research content; the history title information of the ith history review project comprises a project main title of the ith history review project and a subtitle in the research content;
specifically, the reporting material (project report) of a science and technology project generally requires the main title of the project, that is, the subject name, to be filled in, and requires the main research content to be described; the research content is usually described in several aspects, each with its own subtitle.
The step S31 specifically includes: acquiring the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the historical title information of the i-th historical review item, and removing that longest continuous common substring from each to obtain a first character string s_jk1 and a second character string s_jk2; wherein s_jk1 denotes the first character string obtained by removing, from the j-th piece of title information to be reviewed, the longest continuous common substring it shares with the k-th piece of historical title information, and s_jk2 denotes the second character string obtained by removing, from the k-th piece of historical title information, the longest continuous common substring it shares with the j-th piece of title information to be reviewed;
it should be noted that the project main title and each subtitle in the research content are each regarded as one piece of title information.
The step S32 specifically includes: calculating the edit distance between every first character string s_jk1 and its corresponding second character string s_jk2 to obtain an edit distance set; wherein each piece of title information in the title information to be reviewed has k corresponding edit distances;
specifically, assuming that the title information to be reviewed contains j pieces of title information and the historical title information contains k pieces, there are j×k edit distance values in total.
The step S33 specifically includes: calculating, from the edit distance set, the similarity between every first character string s_jk1 and its corresponding second character string s_jk2, and calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item from all of the similarity calculation results; wherein each piece of title information in the title information to be reviewed has k corresponding similarity calculation results.
Specifically, the title information to be reviewed correspondingly has j×k similarity values, and the average of these j×k values is output as the similarity between the title information to be reviewed and the historical title information of the i-th historical review item.
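The j×k averaging described above can be sketched as follows. The pairwise scoring function is passed in as a parameter so that the sketch stays independent of how a single pair of titles is scored; `average_title_similarity` and `pair_similarity` are hypothetical names, and returning 0.0 for an empty title list is an assumption the text does not address.

```python
from itertools import product
from statistics import mean
from typing import Callable, Sequence


def average_title_similarity(
    titles_to_review: Sequence[str],
    historical_titles: Sequence[str],
    pair_similarity: Callable[[str, str], float],
) -> float:
    """Average the similarity over all j x k title pairs."""
    if not titles_to_review or not historical_titles:
        return 0.0  # assumption: no titles means no evidence of similarity
    return mean(
        pair_similarity(a, b)
        for a, b in product(titles_to_review, historical_titles)
    )
```

With two titles to review and three historical titles, the function averages six pairwise scores, matching the j×k count in the text.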
Optionally, the outputting, to a display unit, the result of the similarity judgment between the item to be reviewed and all the history review items includes:
if at least one history review item is similar to the to-be-reviewed item, outputting the declaration material electronic document of the at least one history review item to a display unit;
if no history review item is similar to the to-be-reviewed item, sorting the similarities between the to-be-reviewed item and all the history review items, and then selecting the reporting material electronic documents of the M history review items with the highest similarity and outputting them to a display unit for display; M is a preset number.
Specifically, after the similarity determination by the method of the present embodiment, M most similar history review items are output for further confirmation by the reviewer.
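A minimal sketch of this output rule, read as: show the items judged similar if there are any, otherwise fall back to the M highest-scoring items. The function name, the identifier type, and the choice to order the similar items by score are assumptions made for illustration.

```python
from typing import Dict, List, Set


def select_for_display(scores: Dict[str, float], similar_ids: Set[str], m: int) -> List[str]:
    """Return the identifiers of the historical items whose documents are shown."""
    if similar_ids:
        # At least one item was judged similar: show those, best first.
        return sorted(similar_ids, key=lambda pid: scores[pid], reverse=True)
    # Nothing crossed the thresholds: show the M highest-scoring items instead.
    ranked = sorted(scores, key=lambda pid: scores[pid], reverse=True)
    return ranked[:m]
```

Either way the reviewer receives a short list to confirm manually, which is the purpose stated in the text.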
Optionally, the step S5 includes:
step S51, inputting the long text information to be reviewed and the history long text information of the i-th history review project respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vector to be reviewed and the history paragraph vector of the i-th history review project;
step S52, calculating the similarity between the paragraph vector to be reviewed and the history paragraph vector of the i-th history review item;
illustratively, the similarity between two paragraph vectors may be determined according to the distance between them, wherein the closer the distance is, the greater the similarity.
It may be appreciated that the long text information in this embodiment may cover multiple aspects, such as the project summary and the main research content, each of which contains multiple paragraphs, and similarity may be calculated separately for each aspect. A comprehensive result is then calculated from the per-aspect similarities: for example, the average of the per-aspect similarities is taken as the long text similarity analysis result; or, as another example, each per-aspect similarity is multiplied by a corresponding preset weight and the products are summed to give the long text similarity analysis result. For the similarity calculation of a single aspect, suppose aspect E of the item to be reviewed has n paragraphs and aspect E of the current historical review item has m paragraphs. After each paragraph of aspect E of the item to be reviewed is compared with each paragraph of aspect E of the current historical review item, every paragraph of the item to be reviewed has m similarity values, so the n paragraphs have n×m similarity values in total, and the average of these n×m values is taken as the similarity between the item to be reviewed and the current historical review item in aspect E.
Step S53, judging whether the paragraph vector to be reviewed is similar to the history paragraph vector of the i-th history review project according to the result of comparing their similarity with a second similarity threshold.
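Steps S52 and S53, together with the per-aspect aggregation, can be sketched as below. The text only says that closer vectors are more similar; cosine similarity is used here as one common choice and is an assumption, as are all function names. The paragraph vectors are taken as plain lists of floats, however they were produced.

```python
import math
from statistics import mean
from typing import Dict, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def aspect_similarity(
    review_vectors: Sequence[Sequence[float]],
    history_vectors: Sequence[Sequence[float]],
) -> float:
    """Average of the n x m paragraph-pair similarities within one aspect."""
    if not review_vectors or not history_vectors:
        return 0.0  # assumption: an empty aspect contributes no similarity
    return mean(
        cosine_similarity(p, q) for p in review_vectors for q in history_vectors
    )


def combine_aspects(
    aspect_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Plain average of the aspects, or a weighted sum when weights are given."""
    if weights is None:
        return mean(aspect_scores.values())
    return sum(score * weights[name] for name, score in aspect_scores.items())
```

The combined score would then be compared with the second similarity threshold, for example `combine_aspects(scores) > second_threshold`, where the direction of the comparison is likewise an assumption.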
Specifically, this embodiment trains the Doc2vec model with the PV-DM (Distributed Memory Model of Paragraph Vectors) training method. Fig. 2 shows the framework of the Doc2vec PV-DM of this embodiment; as can be seen from Fig. 2, in addition to the word-level vectors there is a vector representation of each paragraph/sentence. For example, for the sentence 'the cat sat on', to predict the word 'on', features may be generated not only from the other words but from the other words together with the sentence. Each paragraph/sentence is mapped into a vector space and can be represented by a column of a matrix. Each word is likewise mapped into a vector space and can be represented by a column of a matrix. The paragraph vector and the word vectors are then concatenated or averaged to obtain a feature used to predict the next word in the sentence. The paragraph/sentence vector can also be regarded as an additional word that acts as a memory of the context or as the topic of the paragraph. The context length is fixed during training, and the training samples are generated with a sliding window; the paragraph/sentence vector is shared across the contexts generated from the same paragraph. The training process of the Doc2vec model in this embodiment mainly includes the following (1) and (2):
(1) Training the model on known training data to obtain the word vectors, the softmax parameters, and the paragraph/sentence vectors.
(2) The inference stage, in which the vector representation of a new paragraph is obtained. Specifically, additional columns are added to the matrix D (the paragraph vector matrix), the remaining parameters (the word vectors and the softmax parameters) are held fixed, and training is performed by the above method with gradient descent to obtain a new D, and thereby the vector representation of the new paragraph.
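For reference, the two stages above correspond to the usual formulation of PV-DM in the paragraph-vector literature, summarized below. The patent text itself gives no equations, so the notation is an assumption: w_t is the word to predict, c_t its context words inside the sliding window, d the paragraph, W the word vector matrix, D the paragraph vector matrix, U and b the softmax parameters, and h the concatenation or average of the selected columns of W and D.

```latex
% Training objective: average log-probability of each word
% given its context words and its paragraph
\frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid c_t, d\right)

% Softmax over the vocabulary
p\left(w_t \mid c_t, d\right) = \frac{\exp\left(y_{w_t}\right)}{\sum_{v} \exp\left(y_v\right)}

% Unnormalized scores computed from the concatenated or averaged feature h
y = b + U \, h\left(c_t, d; W, D\right)
```

In stage (1) all of W, D, U and b are learned. In stage (2) only the newly added columns of D are updated by gradient descent on the same objective.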
Optionally, the step S1 further includes:
text extraction is carried out on the electronic document of the reporting material of the project to be reviewed to obtain project technical field information of the project to be reviewed;
in the step S2, acquiring the reporting material electronic document of the i-th historical review project specifically comprises:
acquiring the reporting material electronic document of the i-th historical review project from the database corresponding to the project technical field, according to the project technical field information of the project to be reviewed;
wherein, in the step S6, all the history review items are all the history review items in the database of the corresponding item technical field.
Specifically, because a large number of historical science and technology projects have been reviewed, this embodiment further introduces a preliminary classification: the declaration material electronic documents of historical science and technology projects of different types are stored in different databases, so that during similarity analysis the project to be reviewed is compared only with the historical science and technology projects of the corresponding technical field, according to its own technical field, which effectively reduces the computational workload.
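The classification idea can be illustrated with an in-memory stand-in for the per-field databases. The class name, the field labels, and the use of a dictionary instead of real databases are all assumptions made only to show how the candidate set shrinks before any similarity is computed.

```python
from collections import defaultdict
from typing import DefaultDict, List


class FieldPartitionedArchive:
    """Hypothetical stand-in for one database per project technical field."""

    def __init__(self) -> None:
        self._by_field: DefaultDict[str, List[str]] = defaultdict(list)

    def add(self, field: str, document_id: str) -> None:
        """File a historical reporting document under its technical field."""
        self._by_field[field].append(document_id)

    def candidates(self, field: str) -> List[str]:
        """Return only the historical documents from the same technical field."""
        return list(self._by_field.get(field, []))
```

Only the documents returned by `candidates` for the field of the project to be reviewed would enter the title and long text comparisons, so N in step S6 becomes the size of that subset.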
To sum up, this embodiment addresses the large data volume of science and technology projects with three targeted measures: first, database classification screening; second, preliminary similarity screening on short text; third, secondary similarity screening on long text. Screening layer by layer allows the whole process to perform similarity analysis accurately while keeping the workload low and the processing speed high.
Another embodiment of the present invention also proposes a computer device comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the technology project similarity analysis method according to the above embodiment.
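The layer-by-layer screening can be sketched as a single loop over the candidates already narrowed down by technical field. The two scoring callables are hypothetical placeholders for the title comparison and the long text comparison; treating "greater than the threshold" as similar follows the first threshold as described in the text and is assumed for the second.

```python
from typing import Callable, Dict, Iterable


def screen_candidates(
    candidate_ids: Iterable[str],
    title_score: Callable[[str], float],
    long_text_score: Callable[[str], float],
    first_threshold: float,
    second_threshold: float,
) -> Dict[str, bool]:
    """Return, per historical item, whether it is finally judged similar."""
    verdicts: Dict[str, bool] = {}
    for project_id in candidate_ids:
        if title_score(project_id) > first_threshold:
            # Only items passing the cheap title check reach the costly long text check.
            verdicts[project_id] = long_text_score(project_id) > second_threshold
        else:
            verdicts[project_id] = False
    return verdicts
```

Because `long_text_score` is invoked only for items that pass the title check, the expensive paragraph-vector comparison runs on a small fraction of the archive.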
Of course, the computer device may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
For example, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specified functions, and the instruction segments describe the execution of the computer program in the computer device.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the whole computer device through various interfaces and lines.
The memory may be used to store the computer program and/or units, and the processor implements the various functions of the computer device by running or executing the computer program and/or units stored in the memory and by invoking data stored in the memory. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
Another embodiment of the present invention also proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for analyzing similarity of scientific and technological projects described in the above embodiment.
In particular, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
In summary, the embodiment of the invention provides a technology project similarity analysis method and system, computer equipment and a computer readable storage medium, which are used for extracting the title information of a declaration material electronic document of a project to be reviewed and a history review project and judging the similarity of the extracted title information.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (7)

1. A method for similarity analysis of scientific and technological projects, comprising:
step S1, acquiring an electronic document of the reporting material of a project to be reviewed, and performing text extraction on the electronic document to obtain title information to be reviewed of the project to be reviewed; the title information to be reviewed comprises a project main title of the project to be reviewed and subtitles in the research content; the historical title information of the i-th historical review item comprises a project main title of the i-th historical review item and subtitles in the research content;
step S2, acquiring an electronic document of the reporting material of the i-th historical review item, and performing text extraction on the electronic document to obtain historical title information of the i-th historical review item;
step S3, performing short text similarity analysis on the title information to be reviewed and the historical title information of the i-th historical review item, and preliminarily judging from the analysis result whether the two are similar; if yes, executing the steps S4 to S5 in sequence, and if not, executing the step S6; wherein the initial value of i is 1;
step S4, performing text extraction on the electronic document of the reporting material of the project to be reviewed to obtain long text information to be reviewed, and performing text extraction on the electronic document of the reporting material of the i-th historical review item to obtain historical long text information of the i-th historical review item;
step S5, performing long text similarity analysis on the long text information to be reviewed and the historical long text information of the i-th historical review item, and finally judging from the analysis result whether the two are similar;
step S6, judging whether i is smaller than N; if yes, letting i = i+1 and returning to the step S2; if not, outputting the similarity judgment results between the project to be reviewed and all the historical review items to a display unit for display, and ending the analysis flow; wherein M is a preset number; wherein N is the total number of the historical review items;
wherein, the step S3 includes:
step S31, acquiring the longest continuous common substring between the title information to be reviewed and the historical title information of the i-th historical review item, and removing the longest continuous common substring from the title information to be reviewed and from the historical title information of the i-th historical review item respectively to obtain a first character string and a second character string;
step S32, calculating an edit distance between the first character string and the second character string;
step S33, calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the edit distance;
step S34, judging whether the title information to be reviewed is similar to the historical title information of the i-th historical review item according to the result of comparing their similarity with a first similarity threshold;
the step S31 specifically includes: acquiring the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the historical title information of the i-th historical review item, and removing the longest continuous common substring from each to obtain a first character string s_jk1 and a second character string s_jk2; wherein s_jk1 denotes the first character string obtained by removing, from the j-th piece of title information to be reviewed, the longest continuous common substring it shares with the k-th piece of historical title information, and s_jk2 denotes the second character string obtained by removing, from the k-th piece of historical title information, the longest continuous common substring it shares with the j-th piece of title information to be reviewed;
the step S32 specifically includes: calculating the edit distance between every first character string s_jk1 and its corresponding second character string s_jk2 to obtain an edit distance set; wherein each piece of title information in the title information to be reviewed has k corresponding edit distances;
the step S33 specifically includes: calculating, from the edit distance set, the similarity between every first character string s_jk1 and its corresponding second character string s_jk2, and calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item from all of the similarity calculation results; wherein each piece of title information in the title information to be reviewed has k corresponding similarity calculation results;
the step S31 includes:
step S311, setting the title information to be reviewed as a character string s_1, and the historical title information of the i-th historical review item as a character string s_i;
step S312, obtaining the longest continuous common substring s_z of the character strings s_1 and s_i;
step S313, if the length of the longest continuous common substring s_z is greater than 2, removing s_z from the character strings s_1 and s_i respectively to obtain two new character strings s_10 and s_i0, letting s_1 = s_10 and s_i = s_i0, and then returning to step S312; if the length of the longest continuous common substring s_z is less than or equal to 2, outputting s_10 as the first character string and s_i0 as the second character string.
2. The technological project similarity analysis method according to claim 1, wherein the calculating the similarity between the title information to be reviewed and the historical title information of the i-th historical review item according to the edit distance comprises:
wherein s_10 represents the first character string, s_i0 represents the second character string, sim(s_10, s_i0) represents the similarity between the title information to be reviewed and the historical title information of the i-th historical review item calculated from the edit distance, ED represents the edit distance between the first character string and the second character string, len(s_10) represents the length of the first character string, and len(s_i0) represents the length of the second character string.
3. The technological project similarity analysis method according to claim 1, wherein the outputting the similarity judgment result between the project to be reviewed and all the history review projects to a display unit for display includes:
if at least one history review item is similar to the to-be-reviewed item, outputting the declaration material electronic document of the at least one history review item to a display unit;
if no history review item is similar to the to-be-reviewed item, sorting the similarities between the to-be-reviewed item and all the history review items, and then selecting the declaration material electronic documents of the M history review items with the highest similarity and outputting them to a display unit for display; M is a preset number.
4. The method of analyzing similarity of scientific and technological projects according to claim 1, wherein said step S5 includes:
step S51, inputting the long text information to be reviewed and the history long text information of the i-th history review item respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vector to be reviewed and the history paragraph vector of the i-th history review item;
step S52, calculating the similarity between the paragraph vector to be reviewed and the history paragraph vector of the i-th history review item;
step S53, judging whether the paragraph vector to be reviewed is similar to the history paragraph vector of the i-th history review item according to the result of comparing their similarity with a second similarity threshold.
5. The method for analyzing similarity of scientific and technological projects according to claim 1,
the step S1 further includes:
text extraction is carried out on the electronic document of the reporting material of the project to be reviewed to obtain project technical field information of the project to be reviewed;
in the step S2, acquiring the electronic document of the reporting material of the i-th historical review item specifically comprises:
acquiring the electronic document of the reporting material of the i-th historical review item from the database corresponding to the project technical field, according to the project technical field information of the project to be reviewed.
6. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions which, when executed by the processor, cause the processor to execute the steps of the technology project similarity analysis method according to any one of claims 1-5.
7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method for similarity analysis of scientific and technological projects according to any one of claims 1 to 5.
CN202011258083.6A 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium Active CN112199938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011258083.6A CN112199938B (en) 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112199938A CN112199938A (en) 2021-01-08
CN112199938B true CN112199938B (en) 2023-11-14

Family

ID=74033475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011258083.6A Active CN112199938B (en) 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112199938B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant