CN110807449A

CN110807449A - Science and technology project application on-line service terminal

Info

Publication number: CN110807449A
Application number: CN202010015896.6A
Authority: CN
Inventors: 江峰; 李缙航
Original assignee: Hangzhou Haozhi Tiancheng Information Technology Co Ltd
Current assignee: Hangzhou Haozhi Tiancheng Information Technology Co Ltd
Priority date: 2020-01-08
Filing date: 2020-01-08
Publication date: 2020-02-18

Abstract

The invention relates to the technical field of service terminals, and in particular to a science and technology project application online service terminal. The terminal comprises a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit collects and classifies the reported science and technology project data, and the data pre-inspection unit performs a preprocessing inspection on the reported science and technology project data. In this terminal, the name information of an entered science and technology project is extracted with an edge character detection algorithm, keywords are extracted from that name information, and the projects are classified according to keyword similarity. This completes the entry of the projects, facilitates later classification and improves processing efficiency. The data pre-inspection unit performs a preprocessing inspection on the reported project data, which improves the integrity of that data.

Description

Science and technology project application on-line service terminal

Technical Field

The invention relates to the technical field of service terminals, in particular to a science and technology project application on-line service terminal.

Background

Project declaration refers to the process in which government bodies issue a series of preferential policies for enterprises or other research units, and the enterprises or research units prepare declaration documents in accordance with those policies and then submit them according to the relevant declaration requirements and procedures. As awareness of intellectual property protection has grown, the number of science and technology project declarations has increased steadily. A conventional science and technology project declaration terminal can only collect the declaration information. That information is of many different types and contains a large amount of invalid data, which makes later processing difficult and inefficient.

Disclosure of Invention

The invention aims to provide a scientific and technological project declaration on-line service terminal to solve the problems in the background technology.

In order to achieve the above object, the present invention provides a science and technology project declaration online service terminal, which includes a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit is configured to collect and classify declared science and technology project data, the data pre-inspection unit is configured to perform pre-processing inspection on the declared science and technology project data, and the information query unit is configured to query the declared science and technology project data processing flow tracing information.

Preferably, the data collection unit performs the following steps:

S1.1, entering data: entering the science and technology project data;

S1.2, extracting the name: extracting the name data of the entered science and technology project;

S1.3, extracting keywords: extracting the keywords from the science and technology project name data;

S1.4, classifying data: classifying the entered science and technology project data according to the similarity of the extracted keywords.

Preferably, in S1.2, the name is extracted with an edge character detection algorithm, whose flow is as follows:

S1.2.1, detecting the edge features of the name characters with an edge detection operator;

S1.2.2, filtering the edge features;

S1.2.3, merging the edges into regions by morphological operations;

S1.2.4, extracting the character region with the horizontal projection algorithm.

Preferably, the edge detection operator is a Sobel operator used to detect the character edge features, and the operator formula is as follows:

where K is the neighborhood point mark matrix template, a 3 × 3 neighborhood matrix centered on (i, j), and a is a control factor in the condition with a value range of 0 to 1; the width of the edge is controlled by the value of a;

matrices (1), (2) and (3) are respectively the operator's x-direction convolution template, y-direction convolution template and neighborhood point mark matrix of the point to be processed.
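The formula images are not reproduced in this text, so the following is only a minimal sketch of edge detection with the commonly used 3 × 3 Sobel templates. The kernel values, the function name `sobel_magnitude` and the list-of-lists image format are assumptions for illustration; the control factor a and the neighborhood point mark matrix described above are not modeled.

```python
import math

# Standard Sobel templates (assumed; the patent's own matrices are not shown in the text).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the gradient magnitude of a grayscale image given as a list of rows.

    Border pixels are left at 0.0 because the 3x3 neighborhood is incomplete there.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = gy = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    p = img[i + di][j + dj]
                    gx += SOBEL_X[di + 1][dj + 1] * p
                    gy += SOBEL_Y[di + 1][dj + 1] * p
            out[i][j] = math.hypot(gx, gy)  # sqrt(gx^2 + gy^2)
    return out


# A 5x5 image with a vertical step edge between columns 1 and 2.
step = [[1 if j >= 2 else 0 for j in range(5)] for _ in range(5)]
edges = sobel_magnitude(step)
```

Pixels next to the step receive a large magnitude, while pixels inside the flat regions stay at zero, which is the property the later filtering and region-merging steps rely on.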

Preferably, the edge features are filtered with Gaussian filtering, and the formula is as follows:

where $\sigma$ is the width of the Gaussian filter, which determines the degree of smoothing and controls the shape of the Gaussian kernel, and x is the coordinate.
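As a rough illustration of the filtering step, the sketch below builds a discretized one-dimensional Gaussian kernel and applies it to a signal. It assumes the standard Gaussian shape exp(-x²/(2σ²)); the names `gaussian_kernel_1d` and `smooth_1d`, the `radius` parameter and the border clamping are hypothetical choices, not taken from the patent.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Sample exp(-x^2 / (2 sigma^2)) at x = -radius..radius and normalize to sum 1.

    Normalizing makes the constant factor in front of the Gaussian irrelevant.
    """
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(signal, kernel):
    """Weighted moving average of `signal`; indices are clamped at the borders."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - r, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


kernel = gaussian_kernel_1d(sigma=1.0, radius=2)
flat = smooth_1d([3.0] * 6, kernel)
```

A two-dimensional image can be smoothed by running `smooth_1d` over every row and then over every column, which is one way to realize a two-pass one-dimensional filter.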

Preferably, the formula of the horizontal projection algorithm is as follows:

where E represents the edge map of the text region, indexed by the coordinates of each pixel in the image; h is the height of the image; and the result is the horizontal projection at the given abscissa.
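The projection formula itself is missing from this text. One plausible reading, assumed here, is that the projection at each abscissa is the sum of the edge map over the image height, and that character regions are the runs of abscissas whose projection exceeds a threshold. The function names and the `threshold` parameter are hypothetical.

```python
def projection(edge_map):
    """Sum the edge map over the image height for every abscissa (column)."""
    h = len(edge_map)
    w = len(edge_map[0])
    return [sum(edge_map[y][x] for y in range(h)) for x in range(w)]


def text_runs(profile, threshold=0):
    """Return (start, end) abscissa pairs where the profile stays above threshold."""
    runs = []
    start = None
    for x, v in enumerate(profile):
        active = v > threshold
        if active and start is None:
            start = x
        elif not active and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


edge_map = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
]
profile = projection(edge_map)
runs = text_runs(profile)
```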

Preferably, in S1.3, the keywords are extracted with the TF-IDF algorithm, whose flow is as follows:

S1.3.1, performing word segmentation on all documents in the cluster, and storing the number of occurrences of each word in a dictionary;

S1.3.2, traversing each word, obtaining its IDF value over all documents, and multiplying it by its number of occurrences TF in the cluster;

S1.3.3, storing the information for all words in a dictionary, sorting the dictionary by value, and taking the top-ranked words as the keywords.
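The three steps above can be sketched as follows. The sketch assumes the documents are already segmented into word lists, and it uses a smoothed IDF, log((1 + N) / (1 + df)), which is one common variant rather than a formula stated in the patent. The function name `top_keywords` and the sample data are hypothetical.

```python
import math
from collections import Counter


def top_keywords(cluster_docs, all_docs, top_n=3):
    """Rank the words of a cluster by TF (count in cluster) times IDF (over all docs)."""
    # S1.3.1: count how often each word occurs in the cluster.
    tf = Counter(word for doc in cluster_docs for word in doc)
    n_docs = len(all_docs)
    scores = {}
    # S1.3.2: multiply each word's IDF over all documents by its TF in the cluster.
    for word, count in tf.items():
        df = sum(1 for doc in all_docs if word in doc)
        idf = math.log((1 + n_docs) / (1 + df))  # smoothed IDF (assumption)
        scores[word] = count * idf
    # S1.3.3: sort by score (ties broken alphabetically) and keep the top words.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


all_docs = [
    ["solar", "panel", "project"],
    ["solar", "battery", "project"],
    ["wind", "turbine", "project"],
    ["water", "filter", "project"],
]
cluster = all_docs[:2]
keywords = top_keywords(cluster, all_docs, top_n=2)
```

A word such as "project" that appears in every document gets an IDF of zero and therefore ranks last, even though it is frequent in the cluster.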

Preferably, the similarity of the keywords is calculated with a text similarity method based on the Hamming distance, and the formula is as follows:

where $\oplus$ denotes the modulo-2 addition operation; the Hamming distance is the number of positions at which the two codewords carry different code symbols; n is the length of the two codewords; and k is the number of codewords.
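A minimal sketch of the Hamming distance between two equal-length binary codewords follows. Turning the distance into a similarity in the range 0 to 1 by computing 1 - d/n is an assumption made for illustration, since the patent's similarity formula is not reproduced in the text; the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    # x ^ y is modulo-2 addition for bits: 1 when the bits differ, 0 otherwise.
    return sum(x ^ y for x, y in zip(a, b))


def hamming_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical codewords (assumed normalization)."""
    return 1 - hamming_distance(a, b) / len(a)


d = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
s = hamming_similarity([1, 0, 1, 1], [1, 1, 0, 1])
```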

Preferably, the data classification uses the K-means clustering algorithm, with the following steps:

S1.4.1, determining the number k of clusters to be generated for the text set D awaiting clustering;

S1.4.2, generating k cluster centers as the initial center points of the clusters;

S1.4.3, for each text in D, calculating in turn its similarity to each center point;

S1.4.4, selecting the center point with the greatest similarity and assigning the text to the cluster centered on that point, thereby obtaining a clustering of D;

S1.4.5, re-determining the center point of each cluster;

S1.4.6, repeating S1.4.3 to S1.4.5 until the center points no longer change and no text is reassigned.
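The loop S1.4.1 to S1.4.6 can be sketched as follows for texts represented as binary codewords. Several points are assumptions for illustration: the initial centers are passed in by the caller, similarity is the fraction of matching bits, and a cluster's new center is the bitwise majority of its members. The patent does not specify how centers are re-determined.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length codewords agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def kmeans(texts, centers, max_iter=100):
    """Cluster binary codewords; returns (assignment list, final centers)."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # S1.4.3 / S1.4.4: assign each text to its most similar center.
        new_assignment = [
            max(range(len(centers)), key=lambda j: similarity(t, centers[j]))
            for t in texts
        ]
        # S1.4.5: recompute each center as the bitwise majority of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [t for t, a in zip(texts, new_assignment) if a == j]
            if not members:
                new_centers.append(old)  # keep the center of an empty cluster
                continue
            n = len(members)
            new_centers.append(tuple(
                1 if 2 * sum(m[i] for m in members) >= n else 0
                for i in range(len(old))
            ))
        # S1.4.6: stop when neither the assignment nor the centers change.
        if new_assignment == assignment and new_centers == centers:
            break
        assignment, centers = new_assignment, new_centers
    return assignment, centers


texts = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 1, 1), (0, 0, 0, 1)]
labels, final_centers = kmeans(texts, centers=[(1, 1, 0, 0), (0, 0, 1, 1)])
```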

Compared with the prior art, the invention has the beneficial effects that:

1. In this science and technology project application online service terminal, the name information of an entered science and technology project is extracted with an edge character detection algorithm, keywords are extracted from that name information, and the projects are classified according to keyword similarity. This completes the entry of the projects, facilitates later classification and improves processing efficiency.

2. In this science and technology project application online service terminal, the data pre-inspection unit performs a preprocessing inspection on the reported science and technology project data, which improves the integrity of that data.

Drawings

FIG. 1 is an overall flow diagram of the present invention;

FIG. 2 is a block diagram of an edge text detection algorithm of the present invention;

FIG. 3 is a block diagram of a keyword extraction process according to the present invention;

FIG. 4 is a block diagram of a data sorting process of the present invention;

FIG. 5 is a schematic diagram of the dilation (expansion) unit of the present invention;

FIG. 6 is a schematic diagram of the erosion unit of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-6, the present invention provides a technical solution:

The invention provides a science and technology project declaration online service terminal, which comprises a data collection unit, a data pre-inspection unit and an information query unit. The data collection unit collects and classifies the declared science and technology project data, the data pre-inspection unit performs a preprocessing inspection on the declared data, and the information query unit queries the traceability information of the processing flow of the declared data.

In this embodiment, the service terminal is implemented on J2EE, and interaction with the user's mobile terminal is implemented mainly with servlet technology. J2EE is used because it offers platform independence, portability, high performance and ease of deployment.

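Further, the data collection unit performs the following steps:

S1.1, entering data: entering the science and technology project data;

S1.2, extracting the name: extracting the name data of the entered science and technology project;

S1.3, extracting keywords: extracting the keywords from the science and technology project name data;

S1.4, classifying data: classifying the entered science and technology project data according to the similarity of the extracted keywords.

In S1.2, the name is extracted with an edge character detection algorithm, whose flow is as follows:

S1.2.1, detecting the edge features of the name characters with an edge detection operator;

S1.2.2, filtering the edge features;

S1.2.3, merging the edges into regions by morphological operations;

S1.2.4, extracting the character region with the horizontal projection algorithm.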

Furthermore, the edge detection operator is a Sobel operator used to detect the character edge features, and the operator formula is as follows:

Matrices (1), (2) and (3) are respectively the operator's x-direction convolution template, y-direction convolution template and neighborhood point mark matrix of the point to be processed. Accordingly, the gradient magnitude of each point is expressed by the following formula:

Specifically, the edge features are filtered with Gaussian filtering. Gaussian filtering can be implemented by applying two one-dimensional Gaussian kernels in two successive weighting passes. The Gaussian kernel has the following expression:

where $\sigma$ is the width of the Gaussian filter, which determines the degree of smoothing and controls the shape of the Gaussian kernel, and x is the coordinate.

Formula (4) is a discretized one-dimensional Gaussian function; once its parameters are determined, a one-dimensional kernel vector is obtained, with the following formula:

Formula (4.1) is a discretized two-dimensional Gaussian function; once its parameters are determined, a two-dimensional kernel is obtained.

It is worth mentioning that the formula of the horizontal projection algorithm is as follows:

where E represents the edge map of the text region, indexed by the coordinates of each pixel in the image; h is the height of the image; and the result is the horizontal projection at the given abscissa.

It is worth mentioning that the morphological operations include a dilation (expansion) unit, an erosion unit, an opening operation unit and a closing operation unit.

The dilation unit is defined as follows: the structural element B is translated by a to obtain $B_a$; if $B_a$ hits X, the point a is recorded. The set of all points a satisfying this condition is called the result of the dilation of X by B, expressed by the formula $D(X) = \{a \mid B_a \uparrow X\} = X \oplus B$, where $B_a \uparrow X$ denotes that $B_a$ hits X and $\oplus$ denotes the exclusive-OR operation, computed as $a \oplus b = (\lnot a \land b) \lor (a \land \lnot b)$. As shown in FIG. 5, X is the object being processed and B is the structural element; for any point a in the shaded area, $B_a$ hits X, so the result of the dilation of X by B is the shaded area in FIG. 5.

The erosion unit is defined as follows: the structural element B is translated by a to obtain $B_a$; if $B_a$ is contained in X, the point a is recorded. The set of all points a satisfying this condition is called the result of the erosion of X by B, expressed by the formula $E(X) = \{a \mid B_a \subseteq X\} = X \ominus B$, where $X \ominus B$ denotes the result of the erosion of X by B. As shown in FIG. 6, X is the object being processed and B is the structural element; for any point a in the shaded area, $B_a$ is contained in X, so the result of the erosion of X by B is the shaded area in FIG. 6.

The opening operation of the structural element B on the input image A is denoted A ○ B and is defined as $A \circ B = (A \ominus B) \oplus B = \bigcup \{B + x : B + x \subseteq A\}$. The opening can be obtained by computing the union of all translations of the structural element that fit inside the image; that is, it is the result of eroding A and then dilating it. The opening operation has a smoothing function and can remove small connections, edge burrs and isolated spots from the image.

The closing operation of the structural element B on the input image A is denoted A ● B and is defined as $A \bullet B = (A \oplus B) \ominus B$. The closing operation is the dual of the opening operation; that is, it is the result of dilating A and then eroding it. It has a filtering function and can fill small gaps, holes and cracks in the image so that broken lines are connected.
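The four morphological operations can be sketched on sets of integer pixel coordinates. The sketch follows the hit and containment definitions given above; it assumes a non-empty structural element that is symmetric about the origin, so that the hit-based dilation coincides with the union-of-translations form used in the opening formula. All function names and sample shapes are hypothetical.

```python
def translate(B, a):
    """Translate structural element B by the vector a."""
    return {(bx + a[0], by + a[1]) for bx, by in B}


def dilate(X, B):
    """All points a such that B translated by a hits (intersects) X."""
    return {(x0 - bx, x1 - by) for (x0, x1) in X for (bx, by) in B}


def erode(X, B):
    """All points a such that B translated by a is contained in X."""
    b0 = next(iter(B))  # any valid a must satisfy a + b0 in X
    candidates = {(x0 - b0[0], x1 - b0[1]) for (x0, x1) in X}
    return {a for a in candidates if translate(B, a).issubset(X)}


def opening(A, B):
    """Erosion followed by dilation: removes burrs and isolated spots."""
    return dilate(erode(A, B), B)


def closing(A, B):
    """Dilation followed by erosion: fills small gaps and holes."""
    return erode(dilate(A, B), B)


cross = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)}
square_with_spot = {(i, j) for i in range(3) for j in range(3)} | {(5, 5)}
opened = opening(square_with_spot, cross)

line = {(-1, 0), (0, 0), (1, 0)}
closed = closing({(0, 0), (2, 0)}, line)
```

In the first example the isolated point at (5, 5) disappears after opening; in the second, closing fills the one-pixel gap between (0, 0) and (2, 0).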

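Further, in S1.3, the keywords are extracted with the TF-IDF algorithm, which includes the following steps:

S1.3.1, performing word segmentation on all documents in the cluster, and storing the number of occurrences of each word in a dictionary;

S1.3.2, traversing each word, obtaining its IDF value over all documents, and multiplying it by its number of occurrences TF in the cluster;

S1.3.3, storing the information for all words in a dictionary, sorting the dictionary by value, and taking the top-ranked words as the keywords.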

Specifically, the similarity of the keywords is calculated with a text similarity method based on the Hamming distance, and the formula is as follows:

where n is the length of the two codewords, k is the number of codewords, and $\oplus$ denotes the modulo-2 addition operation. The Hamming distance, which is the number of positions at which the two codewords carry different code symbols, reflects the difference between the two codewords and therefore provides an objective basis for the degree of similarity between them. In this method, the keywords, abstract and other information of a text are arranged into a codeword consisting of an n-bit sequence, and the text information is represented by that codeword, so that texts and codewords are in one-to-one correspondence.

In particular, consider the codeword corresponding to a text and the codeword corresponding to the query. The distance between the two lies between 0 and n: when the n-bit codewords of the text and the query differ in every bit, the distance is n; when the two codewords are identical, the distance is 0. To calculate the similarity, the set of codewords corresponding to the text set is first determined; the similarity between different texts, or between a text and the query, is then computed from the Hamming distance as shown in the formula:

where the two operands respectively denote the bit component, either 0 or 1, at the same position of the codeword corresponding to the text and of the codeword corresponding to the query, and $\oplus$ denotes the modulo-2 addition operation.

It is worth noting that the data classification uses the K-means clustering algorithm, with the following steps:

S1.4.1, determining the number k of clusters to be generated for the text set D awaiting clustering;

S1.4.2, generating k cluster centers as the initial center points of the clusters;

S1.4.3, for each text in D, calculating in turn its similarity to each center point;

S1.4.4, selecting the center point with the greatest similarity and assigning the text to the cluster centered on that point, thereby obtaining a clustering of D;

S1.4.5, re-determining the center point of each cluster;

S1.4.6, repeating S1.4.3 to S1.4.5 until the center points no longer change and no text is reassigned.

It is worth noting that the data pre-inspection unit includes an error correcting module, a duplicate item deleting module, a unified specification module, a correction logic module, a conversion construction module, a data compression module, a data supplementing module and a data discarding module.

In this embodiment, the error correcting module corrects erroneous data, including data value errors, data type errors, data encoding errors, data format errors, data anomalies, dependency conflicts and multi-value errors.

Further, for various reasons the data may contain repeated records or repeated fields (columns), and these repeated items (rows and columns) need to be deleted by the duplicate item deleting module. The basic idea for identifying repeated items is "sort and merge": the records in the database are first sorted according to a given rule, and adjacent records are then compared for similarity to detect whether they are duplicates.
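A minimal sketch of the "sort and merge" idea follows. It assumes that two records count as similar when they are equal after trimming whitespace and lowercasing every field; the patent does not define the similarity rule, and the function names and sample records are hypothetical.

```python
def normalize(record):
    """Comparison key: every field as a trimmed, lowercased string (assumed rule)."""
    return tuple(str(field).strip().lower() for field in record)


def remove_duplicates(records):
    """Sort records by their key, then drop each record that matches its predecessor."""
    ordered = sorted(records, key=normalize)
    kept = []
    for rec in ordered:
        if kept and normalize(kept[-1]) == normalize(rec):
            continue  # adjacent record is a duplicate, skip it
        kept.append(rec)
    return kept


records = [
    ("Solar Roof", "2020"),
    ("wind farm", "2019"),
    (" solar roof ", "2020"),
]
unique = remove_duplicates(records)
```

Sorting first means that only adjacent records need to be compared, so detection costs one pass after the sort instead of a comparison of every pair.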

Specifically, the data source systems are spread across different business lines, and each business line has its own requirements, understanding and specifications for data, so the description conventions for the same data object can differ completely. During cleaning, the unified specification module is therefore needed to unify the data specifications and to abstract out the consistent content.

In addition, the correction logic module is used for determining the logic, conditions and metric definitions of each source system and for correcting the collection logic of any abnormal source system.

In addition, the conversion construction module is used for standardizing the data, and covers data type conversion, data semantic conversion, data granularity conversion, table/data splitting, row-column conversion, data discretization, data standardization, new field refinement and attribute construction.

Data type conversion: when data comes from different data sources, incompatible data types among the sources may cause system errors, so the data types of the different sources need to be converted uniformly into one compatible data type.

Data semantic conversion: a conventional data warehouse may contain dimension tables, fact tables and the like based on the third normal form, in which case many fields in the fact table can be interpreted semantically only in combination with the dimension tables.

Data granularity conversion: the data is aggregated according to the different granularity requirements of the data warehouse.

Table/data splitting: some fields store several pieces of information at once. A timestamp, for example, contains the year, month, day, hour, minute and second, and some rules require part or all of these time attributes to be split out to support data aggregation at multiple granularities.

Row-column conversion: the row and column data in a table are converted into each other.

Data discretization: a continuous attribute is divided into a number of intervals, which reduces the number of distinct values it takes.

Data standardization: different fields have different business meanings, so differences between values that are caused only by different orders of magnitude among variables need to be eliminated.

New field refinement: in many cases new fields, also called compound fields, need to be derived from business rules.

Attribute construction: during modeling, new attributes are constructed from the existing attribute set.
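Three of the conversions listed above lend themselves to a short sketch: splitting a timestamp into its time attributes, discretizing a continuous attribute, and standardizing values to a common scale. The timestamp format, the equal-width binning and the min-max scaling are assumptions chosen for illustration; the patent does not name specific methods, and all function names are hypothetical.

```python
from datetime import datetime


def split_timestamp(ts):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into separate time attributes."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {"year": t.year, "month": t.month, "day": t.day,
            "hour": t.hour, "minute": t.minute, "second": t.second}


def discretize(values, n_bins):
    """Equal-width binning: map each value to a bin index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def min_max_scale(values):
    """Rescale values to [0, 1] to remove order-of-magnitude differences."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


parts = split_timestamp("2020-01-08 09:30:00")
bins = discretize([0, 5, 10, 15, 20], n_bins=4)
scaled = min_max_scale([10, 20, 30])
```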

Furthermore, the data compression module reorganizes the data according to a given algorithm and scheme while maintaining the integrity and accuracy of the original data set and without losing useful information. Complex analysis and computation on large-scale data usually take a great deal of time, so the data needs to be reduced and compressed beforehand. Reducing the data scale makes interactive data mining feasible, and the data before and after mining can be compared to provide feedback. Data mining on the reduced data set is clearly more efficient, and its results are essentially the same as those obtained from the original data set.

In addition, the data supplementing module supplements incomplete data. Data supplementation covers missing values and null values: a missing value is data that should necessarily exist but is actually absent, whereas a null value is data that is permitted to be empty.
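The distinction between missing values and permitted null values can be sketched as follows. The sketch assumes records are dictionaries, that `None` marks an absent value, and that the caller supplies a default for each required field; fields without a default are treated as nullable and left untouched. The function name and defaults are hypothetical.

```python
def fill_missing(rows, required_defaults):
    """Fill absent values of required fields; nullable fields keep their None."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for field, default in required_defaults.items():
            if new_row.get(field) is None:
                new_row[field] = default
        filled.append(new_row)
    return filled


rows = [
    {"name": "A", "budget": None},
    {"name": None, "budget": 5},
]
completed = fill_missing(rows, {"budget": 0})
```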

In addition, the data discarding module deletes abnormal data. Two types of discarding are used: whole-case deletion and variable deletion. Whole-case deletion removes the samples that contain missing values. Variable deletion may be considered when a variable has many invalid and missing values and is not particularly important to the problem under study; it reduces the number of variables available for analysis but does not change the sample size.
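The two discarding strategies can be sketched as follows, assuming records are dictionaries with identical keys and that `None` marks a missing or invalid value. The `max_missing_ratio` threshold is a hypothetical parameter; the patent only says that a variable with "many" missing values may be dropped.

```python
def drop_incomplete_rows(rows):
    """Whole-case deletion: keep only samples without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_sparse_variables(rows, max_missing_ratio=0.5):
    """Variable deletion: remove fields whose share of missing values is too high."""
    n = len(rows)
    keep = []
    for field in rows[0].keys():
        missing = sum(1 for r in rows if r[field] is None)
        if not missing / n > max_missing_ratio:
            keep.append(field)
    return [{f: r[f] for f in keep} for r in rows]


rows = [
    {"name": "A", "budget": 10, "note": None},
    {"name": "B", "budget": None, "note": None},
    {"name": "C", "budget": 30, "note": "x"},
]
complete_only = drop_incomplete_rows(rows)
fewer_variables = drop_sparse_variables(rows)
```

Whole-case deletion shrinks the sample from three records to one here, while variable deletion keeps all three records and removes only the mostly empty `note` field.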

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A science and technology project application online service terminal, characterized in that: the terminal comprises a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit is used for collecting and classifying reported science and technology project data, the data pre-inspection unit is used for performing a preprocessing inspection on the reported science and technology project data, and the information query unit is used for querying the traceability information of the processing flow of the reported science and technology project data.

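2. A scientific and technological project declaration on-line service terminal according to claim 1, wherein: the data collection unit performs the following steps:

S1.1, entering data: entering the science and technology project data;

S1.2, extracting the name: extracting the name data of the entered science and technology project;

S1.3, extracting keywords: extracting the keywords from the science and technology project name data;

S1.4, classifying data: classifying the entered science and technology project data according to the similarity of the extracted keywords.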

3. A scientific and technological project declaration on-line service terminal according to claim 2, wherein: in S1.2, the name is extracted with an edge character detection algorithm, whose flow is as follows:

S1.2.1, detecting the edge features of the name characters with an edge detection operator;

S1.2.2, filtering the edge features;

S1.2.3, merging the edges into regions by morphological operations;

S1.2.4, extracting the character region with the horizontal projection algorithm.

4. A scientific and technological project declaration on-line service terminal according to claim 3, characterized in that: the edge detection operator adopts a Sobel operator to detect character edge characteristics, and the operator formula is as follows:

（1）

（2）

5. A scientific and technological project declaration on-line service terminal according to claim 3, characterized in that: the edge features are filtered with Gaussian filtering, and the formula is as follows:

where $\sigma$ is the width of the Gaussian filter, which determines the degree of smoothing and controls the shape of the Gaussian kernel, and x is the coordinate.

6. A scientific and technological project declaration on-line service terminal according to claim 3, characterized in that: the formula of the horizontal projection algorithm is as follows:

where E represents the edge map of the text region, indexed by the coordinates of each pixel in the image; h is the height of the image; and the result is the horizontal projection at the given abscissa.

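7. A scientific and technological project declaration on-line service terminal according to claim 2, wherein: in step S1.3, the keywords are extracted with the TF-IDF algorithm, whose flow is as follows:

S1.3.1, performing word segmentation on all documents in the cluster, and storing the number of occurrences of each word in a dictionary;

S1.3.2, traversing each word, obtaining its IDF value over all documents, and multiplying it by its number of occurrences TF in the cluster;

S1.3.3, storing the information for all words in a dictionary, sorting the dictionary by value, and taking the top-ranked words as the keywords.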

8. A scientific and technological project declaration on-line service terminal according to claim 2, wherein: the similarity of the keywords is calculated with a text similarity method based on the Hamming distance, and the formula is as follows:

where $\oplus$ denotes the modulo-2 addition operation.

9. A scientific and technological project declaration on-line service terminal according to claim 2, wherein: the data classification uses the K-means clustering algorithm, with the following steps:

S1.4.1, determining the number k of clusters to be generated for the text set D awaiting clustering;

S1.4.2, generating k cluster centers as the initial center points of the clusters;

S1.4.3, for each text in D, calculating in turn its similarity to each center point;

S1.4.4, selecting the center point with the greatest similarity and assigning the text to the cluster centered on that point, thereby obtaining a clustering of D;

S1.4.5, re-determining the center point of each cluster;

S1.4.6, repeating S1.4.3 to S1.4.5 until the center points no longer change and no text is reassigned.