CN110807449A - Science and technology project application on-line service terminal - Google Patents


Info

Publication number
CN110807449A
Authority
CN
China
Prior art keywords
data
scientific
technological project
service terminal
line service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010015896.6A
Other languages
Chinese (zh)
Inventor
江峰
李缙航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Haozhi Tiancheng Information Technology Co Ltd
Original Assignee
Hangzhou Haozhi Tiancheng Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Haozhi Tiancheng Information Technology Co Ltd
Priority to CN202010015896.6A
Publication of CN110807449A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/155 Segmentation; Edge detection involving morphological operators

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of service terminals, in particular to an on-line service terminal for science and technology project applications. The terminal comprises a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit collects and classifies the submitted science and technology project data, and the data pre-inspection unit performs a preprocessing check on the submitted data. In this terminal, the name of each entered project is extracted with an edge-based text detection algorithm, keywords are extracted from the project name, and the projects are classified according to keyword similarity. This completes the entry of the project, facilitates later classified processing and improves processing efficiency. The data pre-inspection unit performs a preprocessing check on the submitted project data, which improves the completeness of that data.

Description

Science and technology project application on-line service terminal
Technical Field
The invention relates to the technical field of service terminals, in particular to a science and technology project application on-line service terminal.
Background
Project declaration refers to the following process: government bodies issue a series of preferential policies for enterprises or other research units, and the enterprises or research units prepare declaration documents in accordance with those policies and submit them following the relevant requirements and procedures. As awareness of intellectual property protection grows, the number of science and technology project declarations keeps increasing. Conventional declaration terminals can only collect the declaration information. Because that information is of many types and contains a large amount of invalid data, later processing is difficult and inefficient.
Disclosure of Invention
The invention aims to provide a scientific and technological project declaration on-line service terminal to solve the problems in the background technology.
In order to achieve the above object, the present invention provides a science and technology project declaration online service terminal, which includes a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit is configured to collect and classify declared science and technology project data, the data pre-inspection unit is configured to perform pre-processing inspection on the declared science and technology project data, and the information query unit is configured to query the declared science and technology project data processing flow tracing information.
Preferably, the data collection unit performs the following steps:
S1.1, data entry: entering the science and technology project data;
S1.2, name extraction: extracting the project name from the entered data;
S1.3, keyword extraction: extracting the keywords from the project name;
S1.4, data classification: classifying the entered project data according to the similarity of the extracted keywords.
Preferably, in S1.2 the name is extracted with an edge-based text detection algorithm, which proceeds as follows:
S1.2.1, detecting the edge features of the name characters with an edge detection operator;
S1.2.2, filtering the edge features;
S1.2.3, merging the edges into regions by morphological operations;
S1.2.4, extracting the character region with the horizontal projection algorithm.
Preferably, the edge detection operator is a Sobel operator, which detects the edge features of the characters. The operator is given by:
[formula image: matrices (1), (2) and (3), not reproduced]
where K denotes the neighborhood point mark matrix template, a 3 × 3 neighborhood matrix centered on (i, j); a is the control factor in the condition, with a value range of 0 to 1, and the edge width is controlled by the value of a;
the matrices (1), (2) and (3) are, respectively, the x-direction convolution template, the y-direction convolution template and the neighborhood point mark matrix of the point to be processed.
Preferably, the edge features are filtered with a Gaussian filter, whose kernel is:
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^{2}}{2\sigma^{2}}}$$
where $\sigma$ is the width of the Gaussian filter, which determines the degree of smoothing and controls the shape of the Gaussian kernel, and x is the coordinate.
Preferably, the horizontal projection is computed as:
$$H(x) = \sum_{y=1}^{h} E(x, y)$$
where E is the edge map of the text region, (x, y) are the coordinates of a pixel in the image, h is the height of the image, and H(x) is the horizontal projection at abscissa x.
Preferably, in S1.3 the keywords are extracted with the TF-IDF algorithm, which proceeds as follows:
S1.3.1, segmenting all documents in the cluster into words, and storing the number of occurrences of each word in a dictionary;
S1.3.2, traversing each word, obtaining its IDF value over all documents, and multiplying that value by TF, the number of times the word appears in the cluster;
S1.3.3, storing the scores of all words in a dictionary, sorting the dictionary by value, and taking the top-ranked words as keywords.
Preferably, the keyword similarity is computed with a text similarity method based on the Hamming distance:
$$D(x, y) = \sum_{k=1}^{n} x_k \oplus y_k$$
where $\oplus$ denotes addition modulo 2, $D(x, y)$ is the total number of positions at which the two codewords carry different symbols, n is the length of the codewords, and k indexes the bit positions.
Preferably, the data are classified with the K-means clustering algorithm, which comprises the following steps:
S1.4.1, determining the number k of clusters to be generated for the text set D to be clustered;
S1.4.2, generating k cluster centers as the initial center points of the clusters;
S1.4.3, for each text $d_i$ in D, computing in turn its similarity $\mathrm{sim}(d_i, m_j)$ to each center point $m_j$;
S1.4.4, selecting the center point $m_j$ with the largest similarity and assigning $d_i$ to the cluster $C_j$ whose center is $m_j$, thereby obtaining a clustering of D;
S1.4.5, re-determining the center point of each cluster;
S1.4.6, repeating S1.4.3 to S1.4.5 until the center points no longer change and no text is reassigned.
Compared with the prior art, the invention has the beneficial effects that:
1. In this on-line service terminal for science and technology project declaration, the name of each entered project is extracted with an edge-based text detection algorithm, keywords are extracted from the project name, and the projects are classified according to keyword similarity. This completes the entry of the project, facilitates later classified processing and improves processing efficiency.
2. In this on-line service terminal, a data pre-inspection unit performs a preprocessing check on the declared project data, which improves the completeness of that data.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a block diagram of an edge text detection algorithm of the present invention;
FIG. 3 is a block diagram of a keyword extraction process according to the present invention;
FIG. 4 is a block diagram of the data classification process of the present invention;
FIG. 5 is a schematic view of the dilation unit of the present invention;
FIG. 6 is a schematic view of the erosion unit of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-6, the present invention provides a technical solution:
the invention provides a scientific and technological project declaration on-line service terminal which comprises a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit is used for collecting and classifying declared scientific and technological project data, the data pre-inspection unit is used for pre-processing and inspecting the declared scientific and technological project data, and the information query unit is used for querying declared scientific and technological project data processing flow traceability information.
In this embodiment, the service terminal is implemented on J2EE, and interaction with the user's mobile terminal is implemented mainly with servlet technology. J2EE is chosen because it is platform independent, easy to port, high performing and easy to deploy.
Further, the data collection unit performs the following steps:
S1.1, data entry: entering the science and technology project data;
S1.2, name extraction: extracting the project name from the entered data;
S1.3, keyword extraction: extracting the keywords from the project name;
S1.4, data classification: classifying the entered project data according to the similarity of the extracted keywords.
In S1.2, the name is extracted with an edge-based text detection algorithm, which proceeds as follows:
S1.2.1, detecting the edge features of the name characters with an edge detection operator;
S1.2.2, filtering the edge features;
S1.2.3, merging the edges into regions by morphological operations;
S1.2.4, extracting the character region with the horizontal projection algorithm.
Furthermore, the edge detection operator is a Sobel operator, which detects the edge features of the characters. The operator is given by:
[formula image: matrices (1), (2) and (3), not reproduced]
where K denotes the neighborhood point mark matrix template, a 3 × 3 neighborhood matrix centered on (i, j); a is the control factor in the condition, with a value range of 0 to 1, and the edge width is controlled by the value of a;
the matrices (1), (2) and (3) are, respectively, the x-direction convolution template, the y-direction convolution template and the neighborhood point mark matrix of the point to be processed. Accordingly, the gradient magnitude at each point is expressed as:
[formula image: gradient magnitude, not reproduced]
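The edge detection step can be illustrated with a minimal sketch. Because the patent's own matrices and its neighborhood mark matrix K are not legible in this text, the sketch below assumes the textbook 3 × 3 Sobel templates and the usual magnitude sqrt(Gx² + Gy²); the function name `sobel_magnitude` is hypothetical.

```python
import math

# Textbook Sobel templates (an assumption; the patent's matrices are not legible here).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude of a 2-D grayscale image given as a list of rows.

    Border pixels are left at 0.0 because their 3x3 neighborhood is incomplete.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = gy = 0
            for di in range(3):
                for dj in range(3):
                    pixel = image[i + di - 1][j + dj - 1]
                    gx += SOBEL_X[di][dj] * pixel
                    gy += SOBEL_Y[di][dj] * pixel
            out[i][j] = math.hypot(gx, gy)
    return out


# A vertical step edge: the response is concentrated along the step.
step = [[0, 0, 1, 1] for _ in range(4)]
magnitude = sobel_magnitude(step)
```

A thresholding step on `magnitude` would normally follow to obtain a binary edge map; the threshold is application specific and is not given in the source.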
Specifically, the edge features are filtered with a Gaussian filter. Gaussian filtering can be implemented as two weighted passes with a one-dimensional Gaussian kernel, one along each axis. The Gaussian kernel is:
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^{2}}{2\sigma^{2}}} \qquad (4)$$
where $\sigma$ is the width of the Gaussian filter, which determines the degree of smoothing and controls the shape of the Gaussian kernel, and x is the coordinate.
Formula (4) is the discretized one-dimensional Gaussian function; once the parameters are fixed, a one-dimensional kernel vector is obtained. The two-dimensional form is:
$$G(x, y) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} \qquad (4.1)$$
Formula (4.1) is the discretized two-dimensional Gaussian function; once the parameters are fixed, a two-dimensional kernel is obtained.
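As a rough illustration of the two-pass filtering described above, the following sketch builds a discretized one-dimensional Gaussian kernel and applies it along rows and then along columns. The kernel radius, the renormalization of the weights so that they sum to 1, and the edge-clamping border rule are assumptions made for the example, not details taken from the patent.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Discretized 1-D Gaussian sampled at x = -radius..radius, normalized to sum to 1."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    r = len(kernel) // 2
    out = []
    for row in image:
        w = len(row)
        out.append([
            sum(kernel[t + r] * row[min(max(j + t, 0), w - 1)]
                for t in range(-r, r + 1))
            for j in range(w)
        ])
    return out


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: one horizontal pass, then one vertical pass."""
    kernel = gaussian_kernel_1d(sigma, radius)
    horizontal = _blur_rows(image, kernel)
    transposed = [list(col) for col in zip(*horizontal)]
    vertical = _blur_rows(transposed, kernel)
    return [list(col) for col in zip(*vertical)]


kernel = gaussian_kernel_1d(1.0, 2)
blurred = gaussian_blur([[3, 3, 3, 3] for _ in range(4)])
```

Two one-dimensional passes give the same result as one pass with the two-dimensional kernel, but need 2(2r + 1) multiplications per pixel instead of (2r + 1)².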
It is worth mentioning that the horizontal projection is computed as:
$$H(x) = \sum_{y=1}^{h} E(x, y)$$
where E is the edge map of the text region, (x, y) are the coordinates of a pixel in the image, h is the height of the image, and H(x) is the horizontal projection at abscissa x.
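A projection profile of this kind can be sketched as follows. The sketch assumes that the profile at abscissa x sums the edge map over the image height, and that character regions are the runs where the profile exceeds a threshold; the function names and the threshold parameter are hypothetical.

```python
def projection_profile(edge_map):
    """For each abscissa x, sum the edge pixels E(x, y) over the image height."""
    height = len(edge_map)
    width = len(edge_map[0])
    return [sum(edge_map[y][x] for y in range(height)) for x in range(width)]


def text_runs(profile, threshold=0):
    """Return (start, end) index pairs of consecutive positions above the threshold."""
    runs = []
    start = None
    for x, value in enumerate(profile):
        if value > threshold and start is None:
            start = x
        elif value <= threshold and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


edge_map = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
]
profile = projection_profile(edge_map)
runs = text_runs(profile)
```

Here `profile` is `[0, 2, 1, 0, 0, 2]` and `runs` is `[(1, 2), (5, 5)]`, that is, two candidate character regions separated by empty columns.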
It is worth mentioning that the morphological operations include a dilation unit, an erosion unit, an opening unit and a closing unit.
The dilation unit is defined as follows: translate the structural element B by a to obtain $B_a$; if $B_a$ hits X, record the point a. The set of all points a satisfying this condition is called the result of dilating X by B, expressed as $D(X) = \{a \mid B_a \uparrow X\} = X \oplus B$, where $B_a \uparrow X$ denotes that $B_a$ hits X, and $\oplus$ represents an exclusive-OR operation computed as $a \oplus b = (\lnot a \land b) \lor (a \land \lnot b)$. As shown in FIG. 5, X is the object being processed and B is the structural element; for any point a in the shaded area, $B_a$ hits X, so the result of dilating X by B is the shaded area in FIG. 5.
The erosion unit is defined as follows: translate the structural element B by a to obtain $B_a$; if $B_a$ is contained in X, record the point a. The set of all points a satisfying this condition is called the result of eroding X by B, expressed as $E(X) = \{a \mid B_a \subset X\} = X \ominus B$, where $X \ominus B$ denotes the result of eroding X by B. As shown in FIG. 6, X is the object being processed and B is the structural element; for any point a in the shaded area, $B_a$ is contained in X, so the result of eroding X by B is the shaded area in FIG. 6.
The opening of the input image A by the structural element B is denoted A ○ B and is defined as $A \circ B = (A \ominus B) \oplus B = \bigcup \{B + x : B + x \subset A\}$. The opening can be obtained as the union of all translates of the structural element that fit inside the image; that is, it is the result of eroding A and then dilating. The opening has a smoothing effect and removes thin connections, edge burrs and isolated spots from the image.
The closing of the input image A by the structural element B is denoted A ● B and is defined as $A \bullet B = (A \oplus B) \ominus B$. The closing is the dual of the opening, that is, the result of dilating A and then eroding. It has a filtering effect and fills small gaps, holes and cracks in the image so that broken lines are reconnected.
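The four morphological operations can be sketched on binary images represented as sets of (row, column) points. This is an illustrative sketch, not the patent's implementation: dilation is written as the Minkowski sum, which coincides with the "B translated by a hits X" definition when the structural element is symmetric about the origin (an assumption made here), and all function names are hypothetical.

```python
def dilate(X, B):
    """Dilation as the Minkowski sum {x + b : x in X, b in B}."""
    return {(x0 + b0, x1 + b1) for (x0, x1) in X for (b0, b1) in B}


def erode(X, B):
    """Erosion: all points a such that B translated by a lies entirely inside X."""
    candidates = {(x0 - b0, x1 - b1) for (x0, x1) in X for (b0, b1) in B}
    return {a for a in candidates
            if all((a[0] + b0, a[1] + b1) in X for (b0, b1) in B)}


def opening(X, B):
    """Erode, then dilate: removes specks and burrs smaller than B."""
    return dilate(erode(X, B), B)


def closing(X, B):
    """Dilate, then erode: fills gaps and holes smaller than B."""
    return erode(dilate(X, B), B)


# A horizontal stroke with a one-pixel gap followed by an isolated pixel.
X = {(0, 0), (0, 1), (0, 2), (0, 4)}
B = {(0, -1), (0, 0), (0, 1)}  # symmetric 1x3 structural element

opened = opening(X, B)
closed = closing(X, B)
```

With these inputs the opening drops the isolated pixel at column 4, while the closing fills the gap at column 3, which is the behavior the text attributes to the two operations.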
Further, in S1.3 the keywords are extracted with the TF-IDF algorithm, which proceeds as follows:
S1.3.1, segmenting all documents in the cluster into words, and storing the number of occurrences of each word in a dictionary;
S1.3.2, traversing each word, obtaining its IDF value over all documents, and multiplying that value by TF, the number of times the word appears in the cluster;
S1.3.3, storing the scores of all words in a dictionary, sorting the dictionary by value, and taking the top-ranked words as keywords.
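Steps S1.3.1 to S1.3.3 can be sketched as follows. The sketch assumes the documents are already segmented into word lists (Chinese text would need a word segmenter first) and uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, because the source does not state the exact IDF formula; `top_keywords` and its parameters are hypothetical names.

```python
import math
from collections import Counter


def top_keywords(cluster_docs, all_docs, top_n=3):
    """Rank the words of a cluster by TF (count in the cluster) times IDF (over all docs)."""
    # S1.3.1: count how often each word occurs in the cluster.
    tf = Counter(word for doc in cluster_docs for word in doc)
    n_docs = len(all_docs)
    # S1.3.2: multiply each count by the word's IDF over all documents.
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in all_docs if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed variant (assumed)
        scores[word] = count * idf
    # S1.3.3: sort by score and keep the top-ranked words.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]


all_docs = [
    ["solar", "battery", "project"],
    ["wind", "turbine", "project"],
    ["solar", "panel", "project"],
]
cluster = [all_docs[0], all_docs[2]]
keywords = top_keywords(cluster, all_docs, top_n=2)
```

In this toy corpus "solar" outranks "project" even though both occur twice in the cluster, because "project" appears in every document and therefore has the lower IDF.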
Specifically, the keyword similarity is computed with a text similarity method based on the Hamming distance:
$$D(x, y) = \sum_{k=1}^{n} x_k \oplus y_k$$
where n is the length of the codewords, k indexes the bit positions, and $\oplus$ denotes addition modulo 2. $D(x, y)$ is the total number of positions at which the two codewords carry different symbols; it reflects the difference between the two codewords and so provides an objective basis for their degree of similarity. The method encodes information such as the keywords and abstract of a text as a codeword of n bits, so that each text is represented by a codeword and texts and codewords are in one-to-one correspondence.
In particular, if the codeword corresponding to text $d_i$ is $x_i$ and the codeword corresponding to the query is $y$, the distance $D(x_i, y)$ lies between 0 and n: it is n when the n-bit codewords of the text and the query differ in every position, and 0 when they are identical. To compute the similarity, the set of codewords corresponding to the text set is determined first. For two different texts, or for a text and a query, the similarity based on the Hamming distance is:
[formula image: Hamming-distance similarity, not reproduced]
where $x_{ik}$ and $y_k$ are the k-th bit components, each 0 or 1, of the codeword $x_i$ of text $d_i$ and of the codeword $y$ of the query, and $\oplus$ denotes addition modulo 2.
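The Hamming-distance comparison of two codewords can be sketched as below. The distance follows the description directly (a modulo-2 sum over bit positions); the normalized similarity 1 - D/n is an assumption chosen so that identical codewords score 1 and completely different ones score 0, since the patent's own similarity formula is not legible in this text.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length bit sequences differ."""
    if len(x) != len(y):
        raise ValueError("codewords must have the same length")
    return sum(a ^ b for a, b in zip(x, y))  # ^ is addition modulo 2 on bits


def hamming_similarity(x, y):
    """Assumed normalization: 1 for identical codewords, 0 for fully different ones."""
    return 1 - hamming_distance(x, y) / len(x)


text_code = [1, 0, 1, 1]
query_code = [1, 1, 0, 1]
distance = hamming_distance(text_code, query_code)
similarity = hamming_similarity(text_code, query_code)
```

For these two 4-bit codewords the distance is 2 and the similarity is 0.5.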
It is worth noting that the data are classified with the K-means clustering algorithm, which comprises the following steps:
S1.4.1, determining the number k of clusters to be generated for the text set D to be clustered;
S1.4.2, generating k cluster centers $m_1, m_2, \ldots, m_k$ as the initial center points of the clusters;
S1.4.3, for each text $d_i$ in D, computing in turn its similarity $\mathrm{sim}(d_i, m_j)$ to each center point $m_j$;
S1.4.4, selecting the center point $m_j$ with the largest similarity and assigning $d_i$ to the cluster $C_j$ whose center is $m_j$, thereby obtaining a clustering $C = \{C_1, \ldots, C_k\}$ of D;
S1.4.5, re-determining the center point of each cluster;
S1.4.6, repeating S1.4.3 to S1.4.5 until the center points no longer change and no text is reassigned.
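Steps S1.4.1 to S1.4.6 can be sketched on numeric feature vectors. Two choices in the sketch are assumptions made for the example: the first k points serve as the initial centers, and "largest similarity" is taken to mean smallest squared Euclidean distance. The function name `kmeans` and its parameters are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Cluster numeric vectors; returns (labels, centers)."""
    # S1.4.2: initial centers (assumed: simply the first k points).
    centers = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # S1.4.3 and S1.4.4: assign each point to its most similar (nearest) center.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # S1.4.5: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, new_labels) if label == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        # S1.4.6: stop when neither assignments nor centers change.
        if new_labels == labels and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers


points = [[0, 0], [10, 10], [0, 1], [10, 11]]
labels, centers = kmeans(points, k=2)
```

For texts represented as binary codewords, the distance inside the `min` call could be swapped for a Hamming distance, although the mean-based center update would then need a different rule (for example a per-bit majority vote).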
It is worth to be noted that the data pre-checking unit includes an error correcting module, a duplicate item deleting module, a unified specification module, a correction logic module, a conversion construction module, a data compression module, a data supplementing module and a data discarding module.
In this embodiment, the error correcting module corrects erroneous data, including data value errors, data type errors, data encoding errors, data format errors, abnormal data, dependency conflicts and multi-value errors.
Further, for various reasons the data may contain repeated records or repeated fields (columns). Such repeated items (rows and columns) are deleted by the repeated item deleting module. To identify repeated items, the basic idea is "sort and merge": the records in the database are sorted according to a certain rule, and adjacent records are then compared for similarity to detect whether they are duplicates.
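The "sort and merge" idea can be sketched as follows: sort the records by a key, then compare each record only with the last one kept. The sort key and the similarity test are supplied by the caller; the normalization used in the example (trimming whitespace and lower-casing) is an assumption, and `remove_duplicates` is a hypothetical name.

```python
def remove_duplicates(records, key, is_similar):
    """Sort records by key, then drop every record similar to the one kept before it."""
    kept = []
    for record in sorted(records, key=key):
        if kept and is_similar(kept[-1], record):
            continue  # adjacent and similar: treat as a duplicate
        kept.append(record)
    return kept


def normalize(name):
    """Example normalization rule (assumed): ignore case and surrounding spaces."""
    return name.strip().lower()


names = ["Solar Cell ", "wind farm", "solar cell", "Wind Farm"]
unique = remove_duplicates(
    names,
    key=normalize,
    is_similar=lambda a, b: normalize(a) == normalize(b),
)
```

Sorting first means that only adjacent records need to be compared, so the cost is dominated by the O(n log n) sort rather than by O(n²) pairwise comparisons.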
Specifically, the source systems are scattered across business lines, and different business lines have different requirements, understandings and specifications for data, so the same data object may be described in completely different ways. During cleaning, the unified specification module therefore unifies the data specifications and abstracts out the consistent content.
In addition, the correction logic module is used for determining the logic, conditions and caliber of each source system and correcting the acquisition logic of the abnormal source system.
In addition, the conversion construction module is used for carrying out standardization processing on the data, and comprises data type conversion, data semantic conversion, data granularity conversion, table/data splitting, row-column conversion, data discretization, data standardization, new field refinement and attribute construction.
Data type conversion: when data come from different data sources, incompatible data types may cause the system to report errors; the data types of the different sources then need to be converted uniformly into one compatible type.
Data semantic conversion: a conventional data warehouse built on third normal form may contain dimension tables, fact tables and the like; in that case many fields in the fact table must be combined with the dimension tables to resolve their meaning.
Data granularity conversion: aggregating the data according to the different granularity requirements in the data warehouse.
Table/data splitting: some fields may store several pieces of information, for example a timestamp contains the year, month, day, hour, minute and second; some rules require part or all of these time attributes to be split out to meet the requirements of data aggregation at several granularities.
Row-column conversion: converting between row data and column data in a table.
Data discretization: dividing a continuous attribute into several intervals to reduce the number of values it takes.
Data standardization: different fields have different business meanings, so differences between values caused by the different orders of magnitude of the variables need to be eliminated.
New field refinement: in many cases, new fields, also called compound fields, need to be extracted based on business rules.
Attribute construction: during modeling, new attributes are constructed from the existing attribute set.
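Three of the conversions listed above (splitting a timestamp, discretizing a continuous attribute and standardizing values) can be sketched with the Python standard library. The timestamp format, the bin edges and the use of z-score standardization are assumptions made for the example; the source does not prescribe any of them.

```python
from bisect import bisect_right
from datetime import datetime
from statistics import mean, pstdev


def split_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Split one timestamp field into separate year..second fields."""
    t = datetime.strptime(text, fmt)
    return {"year": t.year, "month": t.month, "day": t.day,
            "hour": t.hour, "minute": t.minute, "second": t.second}


def discretize(value, edges):
    """Map a continuous value to the index of the interval it falls in.

    With edges [18, 30, 60] the intervals are (-inf, 18), [18, 30), [30, 60), [60, inf).
    """
    return bisect_right(edges, value)


def standardize(values):
    """Z-score standardization: remove differences in order of magnitude."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


parts = split_timestamp("2020-01-08 09:30:15")
age_bin = discretize(35, [18, 30, 60])
scaled = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

After standardization every field has mean 0 and unit standard deviation, so fields measured in very different units can be compared or clustered together.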
Furthermore, the data compression module reorganizes the data according to a given algorithm and scheme while preserving the integrity and accuracy of the original data set and without losing useful information. Complex analysis and computation on large-scale data usually take a great deal of time, so the data are reduced and compressed beforehand to shrink the data scale. This also supports interactive data mining, in which the data before and after mining are compared and fed back. Mining the reduced data set is clearly more efficient, and the results are essentially the same as those obtained from the original data set.
In addition, the data supplementing module supplements incomplete data, which includes filling in missing values and filling in null values. A missing value is a value that should exist but is actually absent, whereas a null value is a value that is legitimately allowed to be empty.
In addition, the data discarding module deletes abnormal data. Two types of discarding are used: whole-case deletion and variable deletion. Whole-case deletion removes the samples that contain missing values. Variable deletion may be considered when a variable has many invalid or missing values and is not particularly important to the problem under study; it reduces the number of variables available for analysis but does not change the sample size.
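The supplementing and discarding behaviors can be sketched on records stored as dictionaries, with `None` standing for a missing value. The 50% missing-ratio cutoff, the default values and all function names are assumptions made for the illustration.

```python
def fill_missing(rows, defaults):
    """Data supplementing: replace None with a default for the listed fields."""
    return [
        {k: (defaults[k] if v is None and k in defaults else v) for k, v in row.items()}
        for row in rows
    ]


def drop_incomplete_rows(rows):
    """Whole-case deletion: remove every sample that contains a missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Variable deletion: remove fields that are missing in too many samples."""
    if not rows:
        return []
    keep = [
        col for col in rows[0]
        if sum(row.get(col) is None for row in rows) / len(rows) <= max_missing_ratio
    ]
    return [{col: row.get(col) for col in keep} for row in rows]


rows = [
    {"name": "A", "budget": 10, "fax": None},
    {"name": "B", "budget": None, "fax": None},
    {"name": "C", "budget": 30, "fax": None},
]
without_fax = drop_sparse_columns(rows)          # drops "fax", keeps all 3 samples
complete = drop_incomplete_rows(without_fax)     # drops sample "B"
filled = fill_missing(rows, {"budget": 0})       # sample "B" gets budget 0
```

Applying variable deletion before whole-case deletion matters here: deleting incomplete rows first would discard every sample, because the "fax" field is empty in all of them.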
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1. A science and technology project application on-line service terminal comprises a data collection unit, a data pre-inspection unit and an information query unit, and is characterized in that: the system comprises a data collection unit, a data pre-inspection unit and an information query unit, wherein the data collection unit is used for collecting and classifying reported scientific and technological project data, the data pre-inspection unit is used for pre-processing and inspecting the reported scientific and technological project data, and the information query unit is used for querying the source tracing information of the reported scientific and technological project data processing flow.
2. A scientific and technological project declaration on-line service terminal according to claim 1, wherein: the data collection unit comprises the following steps:
s1.1, recording data: inputting scientific and technological project data;
s1.2, extracting a name: extracting recorded scientific and technological project name data;
s1.3, extracting keywords: extracting key words in the scientific and technological project name data;
s1.4, data classification: and classifying the entered scientific and technological project data according to the similarity of the extracted keywords.
3. The science and technology project application on-line service terminal according to claim 2, wherein in S1.2 the name is extracted using an edge-based character detection algorithm, whose flow is as follows:
S1.2.1, detecting the edge features of the name characters with an edge detection operator;
S1.2.2, filtering the edge features;
S1.2.3, merging the edges into regions by morphological operations;
S1.2.4, extracting the character regions with the horizontal projection algorithm.
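Step S1.2.3 (merging edges into regions by morphological operations) can be illustrated with a minimal sketch. The claim does not say which morphological operation or structuring element is used, so the binary dilation with a rectangular element below, and the names `dilate`, `kw` and `kh`, are assumptions chosen for illustration only.

```python
def dilate(binary, kw=3, kh=1):
    """Binary dilation with a kw x kh rectangular structuring element.

    `binary` is a list of rows of 0/1 values. A wide, flat element
    (kw > kh) tends to join neighbouring character strokes on a line.
    """
    h, w = len(binary), len(binary[0])
    rx, ry = kw // 2, kh // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                # Spread every edge pixel over its rectangular neighbourhood.
                for yy in range(max(0, y - ry), min(h, y + ry + 1)):
                    for xx in range(max(0, x - rx), min(w, x + rx + 1)):
                        out[yy][xx] = 1
    return out


# Two isolated edge pixels on the same row, one pixel apart.
edges = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
merged = dilate(edges, kw=3, kh=1)
```

With a 3 × 1 element the two pixels grow into one connected horizontal run, which is the kind of merged region that a later projection step could pick out.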
4. The science and technology project application on-line service terminal according to claim 3, characterized in that: the edge detection operator is a Sobel operator used to detect character edge features, with the following operator formulas:
[formula image (1) not reproduced]
[formula image (2) not reproduced]
[formula image (3) not reproduced]
where k denotes the neighborhood point mark matrix template, a 3 × 3 neighborhood matrix centered on (i, j), and a is the control factor in the condition, with a value range of 0 to 1; the width of the edge is controlled through the value chosen for a;
matrices (1), (2) and (3) are, respectively, the X-direction convolution template of the operator, its Y-direction convolution template, and the neighborhood point mark matrix of the point to be processed.
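The Sobel step can be sketched as follows. The patent's own templates and the exact rule for the control factor a are contained in formula images that are not reproduced in this text, so the sketch uses the textbook 3 × 3 Sobel kernels, and `mark_edges` is a hypothetical stand-in that treats a as a fraction of the peak gradient magnitude. Both are assumptions, not the claimed formulas.

```python
# Textbook 3x3 Sobel kernels (assumed; the patent's templates are not reproduced).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude of a grayscale image given as a list of rows.

    Border pixels are left at 0 because their 3x3 neighbourhood is incomplete.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = gy = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    p = img[i + di][j + dj]
                    gx += SOBEL_X[di + 1][dj + 1] * p
                    gy += SOBEL_Y[di + 1][dj + 1] * p
            out[i][j] = (gx * gx + gy * gy) ** 0.5
    return out


def mark_edges(mag, a):
    """Hypothetical use of the control factor a (0..1): mark pixels whose
    magnitude is at least a times the peak magnitude. A larger a keeps
    fewer pixels and therefore yields thinner edges."""
    peak = max(max(row) for row in mag)
    if peak == 0:
        return [[0] * len(row) for row in mag]
    return [[1 if v >= a * peak else 0 for v in row] for row in mag]


# A 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 10, 10] for _ in range(4)]
magnitude = sobel_magnitude(image)
edges = mark_edges(magnitude, a=0.5)
```

On this step image only the horizontal gradient responds, so the interior pixels on both sides of the step are marked as edge pixels.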
5. The science and technology project application on-line service terminal according to claim 3, characterized in that: the edge features are filtered with a Gaussian filter, whose formula is as follows:
[formula image not reproduced]
where [symbol image not reproduced] is the width of the Gaussian filter, which determines the degree of smoothing and controls the shape of the Gaussian kernel, and x is the coordinate.
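As a rough illustration of the Gaussian filtering step, the sketch below builds a one-dimensional Gaussian kernel and applies it to a row of edge responses. The claimed formula is an image that is not reproduced here, so the sketch assumes the usual form exp(-x² / (2·sigma²)) with the weights normalised to sum to 1; the names `sigma`, `radius`, `gaussian_kernel` and `smooth` are illustrative.

```python
import math


def gaussian_kernel(sigma, radius):
    """Discrete 1-D Gaussian weights for x = -radius..radius, normalised to sum to 1.

    `sigma` plays the role of the filter width: a larger sigma gives a
    flatter kernel and stronger smoothing.
    """
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, kernel):
    """Convolve a 1-D signal with the kernel, clamping indices at the borders."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - r, 0), n - 1)  # replicate border samples
            acc += w * signal[j]
        out.append(acc)
    return out


kernel = gaussian_kernel(sigma=1.0, radius=2)
smoothed = smooth([0, 0, 10, 0, 0], kernel)
```

The isolated spike is spread over its neighbours and its peak is lowered, which is the noise-suppressing effect the filtering step relies on. A two-dimensional filter can be obtained by applying the same kernel along rows and then along columns.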
6. The science and technology project application on-line service terminal according to claim 3, characterized in that: the formula of the horizontal projection algorithm is as follows:
[formula image not reproduced]
where E is the edge map of the text region, [symbol image not reproduced] is the coordinate of a pixel in the image, h is the height of the image, and [formula image not reproduced] is the horizontal projection at the abscissa [symbol image not reproduced].
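The projection step can be sketched under one reading of the claim: for each abscissa, the edge map E is summed over the image height h, and runs of non-empty columns are taken as candidate character regions. The formula itself is an image that is not reproduced here, so this reading, the `threshold` parameter and the helper `runs_above` are assumptions.

```python
def horizontal_projection(edge_map):
    """For each abscissa x, sum the edge map over the image height."""
    h, w = len(edge_map), len(edge_map[0])
    return [sum(edge_map[y][x] for y in range(h)) for x in range(w)]


def runs_above(profile, threshold=0):
    """Return (start, end) index pairs, end exclusive, of consecutive
    positions whose projection exceeds the threshold."""
    runs, start = [], None
    for x, v in enumerate(profile):
        if v > threshold and start is None:
            start = x
        elif v <= threshold and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


edge_map = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
]
profile = horizontal_projection(edge_map)
regions = runs_above(profile)
```

Here the profile has two separate runs of non-zero values, so two candidate regions are reported, separated by the empty columns between them.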
7. The science and technology project application on-line service terminal according to claim 2, wherein in step S1.3 the keywords are extracted with the TF-IDF algorithm, whose flow is as follows:
S1.3.1, performing word segmentation on all documents in the cluster and storing the number of occurrences of each word in a dictionary;
S1.3.2, traversing each word, obtaining its IDF value over all documents, and multiplying that value by TF, the number of times the word appears in the cluster;
S1.3.3, storing the information for all words in a dictionary, sorting the dictionary by value, and finally taking the top-ranked words as the keywords.
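Steps S1.3.1 to S1.3.3 can be sketched in a few lines. The sketch assumes the documents have already been segmented into word lists (the claim does not name a segmenter) and uses a smoothed IDF, log((1 + N) / (1 + df)), which is one common variant and not necessarily the one intended in the patent. The function name `top_keywords` and the sample data are illustrative.

```python
import math
from collections import Counter


def top_keywords(cluster_docs, all_docs, top_n=3):
    """Rank the words of a cluster by TF (count within the cluster)
    multiplied by IDF (computed over all documents)."""
    # S1.3.1: count how often each word occurs in the cluster.
    tf = Counter(word for doc in cluster_docs for word in doc)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in all_docs for word in set(doc))
    n_docs = len(all_docs)
    # S1.3.2: TF * IDF for every word in the cluster.
    scores = {
        word: count * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    }
    # S1.3.3: sort by score and keep the top-ranked words.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


all_docs = [
    ["solar", "cell", "efficiency"],
    ["solar", "panel", "coating"],
    ["rice", "genome", "editing"],
    ["rice", "yield", "model"],
]
cluster = all_docs[:2]
keywords = top_keywords(cluster, all_docs, top_n=1)
```

In this toy data "solar" occurs twice in the cluster, which outweighs its lower IDF, so it ranks first.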
8. The science and technology project application on-line service terminal according to claim 2, wherein the similarity of the keywords is computed with a Hamming-distance text similarity method, whose formula is as follows:
[formula image not reproduced]
where [symbol image not reproduced] denotes addition modulo 2, [formula image not reproduced] is the number of positions at which the two codewords carry different code symbols, n is the length of the two codewords, and k is the position index of the code symbols.
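The Hamming distance described above can be sketched as follows, assuming that each keyword set has already been encoded as a binary codeword of equal length (the claim does not specify the encoding). Modulo-2 addition of two bits is the XOR operation. The `hamming_similarity` helper, which maps the distance to a score between 0 and 1, is a hypothetical convenience and not part of the claim.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length binary codewords differ."""
    if len(x) != len(y):
        raise ValueError("codewords must have the same length")
    # a ^ b is addition modulo 2 for single bits.
    return sum(a ^ b for a, b in zip(x, y))


def hamming_similarity(x, y):
    """Hypothetical similarity score: 1 minus the normalised distance."""
    return 1 - hamming_distance(x, y) / len(x)


code_a = [1, 0, 1, 1, 0]
code_b = [1, 1, 0, 1, 0]
distance = hamming_distance(code_a, code_b)
similarity = hamming_similarity(code_a, code_b)
```

The two codewords differ in the second and third positions, so the distance is 2 out of 5 positions.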
9. The science and technology project application on-line service terminal according to claim 2, wherein the data classification uses a K-means clustering algorithm comprising the following steps:
S1.4.1, determining the number k of clusters to be generated for the text set D awaiting clustering;
S1.4.2, generating k cluster centers as the initial center points of the clusters: [formula image not reproduced];
S1.4.3, for each text [symbol image not reproduced] in D, calculating in turn its similarity [formula image not reproduced] to each center point [symbol image not reproduced];
S1.4.4, selecting the center point [symbol image not reproduced] with the greatest similarity and assigning [symbol image not reproduced] to the cluster [symbol image not reproduced] whose center is [symbol image not reproduced], thereby obtaining a clustering [formula image not reproduced] of D;
S1.4.5, re-determining the center point of each cluster;
S1.4.6, repeating S1.4.3 to S1.4.5 until the center points no longer change and no text is reassigned.
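Steps S1.4.1 to S1.4.6 can be sketched as a short loop. The sketch assumes that each text has already been turned into a numeric feature vector, takes the first k vectors as initial centers, and uses the negative squared Euclidean distance as the similarity; none of these choices is stated in the claim, and the names `similarity` and `kmeans` are illustrative.

```python
def similarity(p, c):
    """Assumed similarity: negative squared Euclidean distance (higher is closer)."""
    return -sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans(points, k, max_iter=100):
    # S1.4.2: initial centers (here simply the first k points, an assumption).
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # S1.4.3 / S1.4.4: assign each point to its most similar center.
        new_assignment = [
            max(range(k), key=lambda c: similarity(p, centers[c]))
            for p in points
        ]
        # S1.4.6: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # S1.4.5: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centers = kmeans(points, k=2)
```

When the assignment is unchanged between two passes the centers are unchanged as well, so checking the assignment alone is enough to detect the stopping condition in S1.4.6.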
CN202010015896.6A 2020-01-08 2020-01-08 Science and technology project application on-line service terminal Pending CN110807449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010015896.6A CN110807449A (en) 2020-01-08 2020-01-08 Science and technology project application on-line service terminal

Publications (1)

Publication Number Publication Date
CN110807449A true CN110807449A (en) 2020-02-18

Family

ID=69493425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010015896.6A Pending CN110807449A (en) 2020-01-08 2020-01-08 Science and technology project application on-line service terminal

Country Status (1)

Country Link
CN (1) CN110807449A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111677658A (en) * 2020-05-25 2020-09-18 阿勒泰正元国际矿业有限公司 Automatic control system and method for mine water pump

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825046A (en) * 2016-03-13 2016-08-03 冯贵良 Medical data collecting and processing method and device
US20170213101A1 (en) * 2016-01-25 2017-07-27 Koninklijke Philips N.V. Image data pre-processing
CN110310083A (en) * 2019-06-04 2019-10-08 南方电网科学研究院有限责任公司 A kind of submission system of science and technology item data report
CN110389950A (en) * 2019-07-31 2019-10-29 南京安夏电子科技有限公司 A kind of big data cleaning method quickly run
CN110618978A (en) * 2019-09-20 2019-12-27 南京信同诚信息技术有限公司 Cloud system integration and storage system and method
CN110659276A (en) * 2019-09-25 2020-01-07 江苏医健大数据保护与开发有限公司 Computer data statistical system and statistical classification method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
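杨纪成 (ed.): "Internet Software Application and Development", 31 August 2006, Economic Science Press *
汪波: "Research on Text Extraction Algorithms for Images with Complex Backgrounds", China Master's Theses Full-text Database, Information Science and Technology *
沈焕生: "Research on Keyword Extraction Based on Information Content", Proceedings of the 15th Annual Conference on Information Theory of the Chinese Institute of Electronics and the 1st National Conference on Network Coding *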

Similar Documents

Publication Publication Date Title
Paliwal et al. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images
US8407236B2 (en) Mining new words from a query log for input method editors
US11651150B2 (en) Deep learning based table detection and associated data extraction from scanned image documents
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
Wei et al. A keyword retrieval system for historical Mongolian document images
US20210366055A1 (en) Systems and methods for generating accurate transaction data and manipulation
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN111078979A (en) Method and system for identifying network credit website based on OCR and text processing technology
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN110781333A (en) Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
Pengcheng et al. Fast Chinese calligraphic character recognition with large-scale data
CN112016294B (en) Text-based news importance evaluation method and device and electronic equipment
CN109753581A (en) Image processing method, device, electronic equipment and storage medium
CN110807449A (en) Science and technology project application on-line service terminal
US11361565B2 (en) Natural language processing (NLP) pipeline for automated attribute extraction
Tseng et al. Document image retrieval techniques for Chinese
WO2007070010A1 (en) Improvements in electronic document analysis
CN115828854A (en) Efficient table entity linking method based on context disambiguation
Dhankhar et al. Support Vector Machine Based Handwritten Hindi Character Recognition and Summarization.
US20230138491A1 (en) Continuous learning for document processing and analysis
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN115062147A (en) Chapter-level text event classification method fusing frequent pattern features of named entities
CN115329173A (en) Method and device for determining enterprise credit based on public opinion monitoring
CN110874398B (en) Forbidden word processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200218
