CN114722160A - Text data comparison method and device - Google Patents

Publication number
CN114722160A
Authority
CN
China
Prior art keywords
text data
data item
item sets
similarity
similarity measurement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210631816.9A
Other languages
Chinese (zh)
Other versions
CN114722160B (en)
Inventor
张万鹏
张虎
谷学强
胡丽
项凤涛
王超
杨景照
张煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210631816.9A
Publication of CN114722160A
Application granted
Publication of CN114722160B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text data comparison method and device in the technical field of information processing. The method comprises: obtaining the text data item sets in two data dictionary tables; performing word segmentation on the two text data item sets to obtain a Chinese word set for each element of the two sets; calculating the similarity measure between the elements of the two text data item sets; preprocessing the similarity measure with a preset similarity ratio threshold to obtain a similarity measurement matrix; abstracting and modeling the dictionary table comparison problem so that the comparison of the two text data item sets is converted into the problem of finding an optimal matching scheme; and solving that problem with the KM algorithm. The method realizes semantics-based automatic comparison and analysis of dictionary table data, effectively relieves the workload of manual comparison during data reorganization, and provides a new approach to the automatic processing of data comparison.

Description

Text data comparison method and device
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a text data comparison method and apparatus.
Background
As the cost of data acquisition and storage falls, the volume of data is growing explosively. At the same time, the demands on data association and fusion keep increasing, and data association and fusion face ever greater challenges. Data integration and compilation is a key bridge between raw data and high-value data; it plays an increasingly important role in data-based statistical analysis and has become an increasingly fundamental and heavy task in data processing.
The data dictionary table is the basic data that defines meta information, such as data items, for the data in a database system, and it is key information for applying and understanding the whole database system. Comparing, associating, and aligning data dictionary tables is therefore of great significance in the data integration process.
In achieving unified data aggregation and fusion, comparing and associating data from heterogeneous databases or from different points in time is a key step in integrating and updating compiled data, and it is particularly important for data dictionaries, which describe metadata such as the data items and data structures in a database. At present, the industry generally uses the Extract-Transform-Load (ETL) technology of data warehouses to extract, transform, and fuse heterogeneous data. Existing research uses Python as an intermediary to compare table records across heterogeneous databases, which solves access to data held in different storage databases and comparison at the table-record level. However, the automatic comparison of the same data ontology expressed in different ways remains unsolved, and a semantics-based means of automatic processing is lacking.
Disclosure of Invention
In view of the above, it is necessary to provide a text data comparing method and apparatus.
A method of text data comparison, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
And calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
And according to the similarity measurement matrix and the two text data item sets, converting a comparison analysis problem of the two text data item sets into a matching problem of the weighted bipartite graph.
And solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relation between the two text data item sets.
In one embodiment, acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a chinese word set of each element in the two text data item sets, includes:
a set of text data items in two data dictionary tables is obtained.
And performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively.
Calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, wherein the similarity measurement matrix comprises:
and calculating similarity measurement between the elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets.
When the similarity measure between the elements of the two text data item sets is larger than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measure matrix are equal to the similarity measure.
When the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
In one embodiment, in the step of calculating the similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold to obtain a similarity measurement matrix, the similarity measure is calculated as:
$$r_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$$
where $r_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words included in the $i$-th element of the first set of text data items, $B_j$ is the set of Chinese words included in the $j$-th element of the second set of text data items, and $|\cdot|$ is the element count operation.
A method of text data comparison, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
And calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
And according to the characteristic of similarity measurement matrix sparsification, the similarity measurement matrix is divided into a plurality of sub-similarity measurement matrixes which are irrelevant to each other.
And solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively.
Calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, wherein the similarity measurement matrix comprises:
and calculating similarity measurement between the elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets.
When the similarity measure between the elements of the two text data item sets is larger than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measure matrix are equal to the similarity measure.
When the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
In one embodiment, in the step of solving each sub-similarity measurement matrix with the KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the KM algorithm uses a depth-first search algorithm to search for augmenting paths.
In one embodiment, in the step of calculating the similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold to obtain a similarity measurement matrix, the similarity measure is calculated as:
$$r_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$$
where $r_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words included in the $i$-th element of the first set of text data items, $B_j$ is the set of Chinese words included in the $j$-th element of the second set of text data items, and $|\cdot|$ is the element count operation.
In one embodiment, the data dictionary table includes a data number field; solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, wherein the steps comprise: and comparing the data dictionary tables updated before and after a period of time according to the data number field.
A text data comparison apparatus, the apparatus comprising:
and the comparison data acquisition module is used for acquiring text data item sets in the two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
And the similarity measurement matrix determining module is used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
The text data comparison result determining module is used for converting a comparison analysis problem of the two text data item sets into a matching problem of the weighted bipartite graph according to the similarity measurement matrix and the two text data item sets; and solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relation between the two text data item sets.
The text data comparison method and the text data comparison device are characterized in that text data item sets in two data dictionary tables are obtained, word segmentation processing is carried out on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets, similarity measurement between the elements of the two text data item sets is calculated, the similarity measurement is preprocessed through a preset similarity ratio threshold value to obtain a similarity measurement matrix, the two text data item sets are converted into a bipartite graph to seek the problem of an optimal matching scheme through abstraction and modeling of a dictionary table comparison analysis problem, and the problem is solved through a KM algorithm. The method realizes the automatic comparison and analysis of the dictionary table data based on the semantics, effectively relieves the working pressure of manually comparing in the data reorganization process, and provides a new idea for the automatic processing of data comparison.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for comparing textual data in one embodiment;
FIG. 2 is a diagram illustrating a comparison of data dictionary tables in one embodiment;
FIG. 3 is a diagram illustrating the Chinese phrase segmentation results of the data items in another embodiment;
FIG. 4 is a flowchart illustrating a text data comparison method according to an embodiment;
FIG. 5 is a block diagram showing the configuration of a text data comparison device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text data comparison method provided by the application can be used to compare and analyze text data items in two data dictionary tables. In essence, for two data sets that differ to some degree, the method uses the similarity features between data to construct a group of mapping relations from one set to the other, and marks and explains the data items that have changed. Because the data items in the two data sets are not completely consistent, different data values of the same data ontology are usually associated through semantic similarity (for example, "Guangxi Zhuang autonomous region Nanning City" is semantically equivalent to "Guangxi Nanning City", although the data values differ). However, a data item may also show some similarity to data items of other, different data ontologies (for example, there is some similarity between "training set" and "training set"), so constructing a globally optimal mapping relationship is the key to the comparison and analysis of data dictionary tables. The invention constructs, for the data items in the two data sets, a measure that represents the semantic similarity of data items. Based on this similarity measure, it performs an optimal comparison of the data items in the two sets through semantic association, so that the overall comparison result has the greatest global similarity, which can provide a new approach to the automatic comparison and analysis of data dictionaries.
In one embodiment, as shown in fig. 1, there is provided a text data comparison method, including the steps of:
step 100: and acquiring text data item sets in the two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
Specifically, the two data dictionary tables are heterogeneous.
The data dictionary table is basic data for defining meta information such as data items of data in the current database system, and is key information for application and understanding of the whole database system.
The data dictionary table can be a unit sequence dictionary table, and the unit sequence dictionary table mainly comprises fields such as unit numbers, unit names, levels and the like (wherein the unit name field reflects semantic relevance of the data body before and after adjustment); but also department relation dictionary tables, product type dictionary tables, etc.
To compare text data items such as names in the data dictionary table on the basis of semantic similarity, the data items are first processed with natural language processing methods. The word is the smallest unit that expresses semantics and is the basic operating unit for constructing the similarity measure between data items, so effective word segmentation of Chinese phrase data items or English phrases is a fundamental part of dictionary data processing. The invention mainly takes Chinese phrase data items in a data dictionary as an example and introduces the proposed data dictionary comparison method using an open Wikipedia corpus and a service dictionary as the Chinese word segmentation dictionary. The word segmentation method can be a dictionary-based word segmentation method or a statistics-based word segmentation method.
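As an illustration of the dictionary-based option mentioned above, the following is a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique. It is not taken from the application: the function name `segment`, the `max_len` parameter, and the toy dictionary are all hypothetical, and the Chinese strings are assumed renderings of the place-name example used in this description.

```python
def segment(phrase, dictionary, max_len=4):
    """Split `phrase` greedily, left to right, into the longest dictionary words."""
    words = []
    i = 0
    while i < len(phrase):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(phrase) - i), 0, -1):
            candidate = phrase[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real one would be built from a corpus
# and a domain-specific word list.
TOY_DICTIONARY = {"广西", "壮族", "自治区", "南宁市"}

full_name = segment("广西壮族自治区南宁市", TOY_DICTIONARY)
short_name = segment("广西南宁市", TOY_DICTIONARY)
```

With this toy dictionary, the long form yields four words and the short form yields two, and both share the words for the region and the city. That shared vocabulary is what the later similarity measure relies on.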
Step 102: and calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
Wherein the number of rows of the similarity metric matrix is equal to the number of elements in a first one of the two sets of text data items, and the number of columns of the similarity metric matrix is equal to the number of elements in a second one of the two sets of text data items.
The element value $s_{ij}$ of the similarity metric matrix is the result obtained by processing, with the preset similarity ratio threshold, the similarity measure between the $i$-th element of the first set of text data items and the $j$-th element of the second set of text data items.
Step 104: and according to the similarity measurement matrix and the two text data item sets, converting the comparison analysis problem of the two text data item sets into a matching problem of the weighted bipartite graph.
Specifically, as shown in FIG. 2, the data dictionary table comparison problem can be described as follows: there are two different data item sequences $X = \{x_1, x_2, \dots, x_n\}$ and $Y = \{y_1, y_2, \dots, y_m\}$, where data item $x_i$ and data item $y_j$ are similar with a similarity ratio of $r_{ij}$. The task is to construct a set of one-to-one mapping relationships such that each data item in $X$ corresponds to at most one data item in $Y$, each data item in $Y$ corresponds to at most one data item in $X$, and the matching between the two sequences has the greatest overall similarity. From the above description, dictionary table data comparison is a typical matching problem of weighted bipartite graphs.
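The problem statement above can also be written as an optimization problem. The formulation below is a sketch under assumed notation, not a formula from the application: $r_{ij}$ denotes the similarity ratio between the $i$-th item of the first sequence ($n$ items) and the $j$-th item of the second sequence ($m$ items), and $z_{ij}$ is a hypothetical binary variable that equals 1 when those two items are matched.

```latex
$$
\begin{aligned}
\max_{z} \quad & \sum_{i=1}^{n} \sum_{j=1}^{m} r_{ij}\, z_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{m} z_{ij} \le 1, \qquad i = 1, \dots, n, \\
& \sum_{i=1}^{n} z_{ij} \le 1, \qquad j = 1, \dots, m, \\
& z_{ij} \in \{0, 1\}.
\end{aligned}
$$
```

The two inequality constraints express the "at most one" conditions on each sequence, and the objective expresses the requirement of the greatest overall similarity.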
Step 106: and solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between two text data item sets.
Specifically, the KM (Kuhn-Munkres) algorithm is a classical algorithm for solving the optimal matching of a bipartite graph. Its core idea is to adjust the vertex labels $l(x_i)$ and $l(y_j)$ of the vertices so that the matching of the bipartite graph is finally maximized and a group of globally optimal matching relations is obtained.
Let the two dictionary tables form a bipartite graph $G = (V, E)$, in which the vertex set $V$ is composed of the data item sets $X$ and $Y$ of the two dictionary tables, i.e. $V = X \cup Y$. The edge set $E$ is formed by the similarity relationships between set $X$ and set $Y$, and the corresponding weight $w_{ij}$ represents the similarity between vertex $x_i$ and vertex $y_j$. According to this definition, the data dictionary comparison method using the KM algorithm specifically comprises the following steps:
Step 1: Initialize the vertex labels of the elements in set $X$ and set $Y$. The label of each element $y_j$ in $Y$ is set to $l(y_j) = 0$, and the label of each $x_i$ is set to the maximum similarity value associated with it, i.e.
$$l(x_i) = \max_{1 \le j \le m} w_{ij},$$
where $m$ is the number of elements in set $Y$.
Step 2: Starting from element $x_i$ of set $X$, search bipartite graph $G$ for an augmenting path $P$ along edges that satisfy $l(x_i) + l(y_j) = w_{ij}$. If an augmenting path $P$ exists, jump to step 4; otherwise, jump to step 3.
Step 3: When the search for an augmenting path $P$ fails, adjust the vertex labels of the vertices on the alternating paths visited during the search so that additional edges satisfy the condition $l(x_i) + l(y_j) = w_{ij}$, and then jump to step 2.
Step 4: For the edges on the augmenting path $P$, invert the original matching relation (matched edges become unmatched and unmatched edges become matched) to form a new matching.
[Formula images 054 and 055 of the original publication]
Step 5: Check whether all elements of set $X$ have been traversed. If the traversal is finished, end the operation; otherwise, take the next element $x_{i+1}$ of the set and jump to step 2.
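The five steps above can be sketched in Python as follows. This is a minimal textbook-style KM implementation with a depth-first search for augmenting paths, written under several assumptions that are not stated in the application: the function name `km_match` is hypothetical, a rectangular matrix is padded with zero-weight dummy vertices to make it square, labels are compared with a small floating-point tolerance, and pairs whose weight is 0 are reported as unmatched.

```python
def km_match(weights):
    """Return (pairs, total) for a maximum-weight one-to-one matching."""
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0
    size = max(n_rows, n_cols)
    # Assumption: pad to a square matrix with zero-weight dummy edges.
    w = [[weights[i][j] if i < n_rows and j < n_cols else 0.0
          for j in range(size)] for i in range(size)]
    label_x = [max(row) for row in w]   # step 1: largest incident weight
    label_y = [0.0] * size              # step 1: zero labels on the other side
    match_y = [-1] * size               # match_y[j] = row matched to column j
    eps = 1e-9                          # tolerance for the equality test

    def dfs(i, seen_x, seen_y, slack):
        """Step 2: look for an augmenting path from row i along tight edges."""
        seen_x[i] = True
        for j in range(size):
            if seen_y[j]:
                continue
            gap = label_x[i] + label_y[j] - w[i][j]
            if gap < eps:               # edge is tight: l(x) + l(y) == w
                seen_y[j] = True
                if match_y[j] == -1 or dfs(match_y[j], seen_x, seen_y, slack):
                    match_y[j] = i      # step 4: flip edges along the path
                    return True
            else:
                slack[j] = min(slack[j], gap)
        return False

    for i in range(size):               # step 5: process every row in turn
        while True:
            seen_x = [False] * size
            seen_y = [False] * size
            slack = [float("inf")] * size
            if dfs(i, seen_x, seen_y, slack):
                break
            # Step 3: shift labels so that at least one new edge becomes tight.
            delta = min(slack[j] for j in range(size) if not seen_y[j])
            for k in range(size):
                if seen_x[k]:
                    label_x[k] -= delta
                if seen_y[k]:
                    label_y[k] += delta

    # Assumption: dummy vertices and zero-weight pairs count as "no match".
    pairs = sorted((match_y[j], j) for j in range(size)
                   if match_y[j] != -1 and match_y[j] < n_rows
                   and j < n_cols and w[match_y[j]][j] > 0)
    total = sum(weights[i][j] for i, j in pairs)
    return pairs, total
```

For example, with `km_match([[0.9, 0.8], [0.8, 0.0]])` a greedy choice of the single largest entry would leave the second row without a useful partner, whereas the label adjustment in step 3 lets the search pair row 0 with column 1 and row 1 with column 0, which gives a larger total.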
The text data comparison method comprises the steps of obtaining text data item sets in two data dictionary tables, carrying out word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets, calculating similarity measurement between the elements of the two text data item sets, preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, converting a comparison analysis problem of the two text data item sets into a problem of seeking an optimal matching scheme through a bipartite graph by abstracting and modeling the dictionary table comparison analysis problem, and solving the problem by using a KM algorithm. The method realizes the automatic comparison and analysis of the dictionary table data based on the semantics, effectively relieves the working pressure of manually comparing in the data reorganization process, and provides a new idea for the automatic processing of data comparison.
In one embodiment, step 100 comprises: acquiring text data item sets in two data dictionary tables; and performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.
Specifically, the HanLP (Han Language Processing) word segmentation toolkit is used. On the basis of an established service dictionary, prior learning on the Chinese corpus and word segmentation of the text data items are realized through the second-order hidden Markov model built into the toolkit. As shown in FIG. 3, the Chinese phrases in two data items are each segmented into a word set (the two resulting sets are given as images in the original publication).
in one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively; step 102 comprises: calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements at the corresponding positions of the similarity measurement matrix are equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
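The thresholding rule in this embodiment can be sketched as follows. This is an illustrative sketch only: the function name `threshold_matrix`, the sample values, and the threshold of 0.3 are assumptions chosen for the example and do not come from the application.

```python
def threshold_matrix(raw, threshold):
    """Keep entries that reach the threshold; set all others to 0."""
    return [[value if value >= threshold else 0.0 for value in row]
            for row in raw]


# Hypothetical similarity measures between two items of the first set (rows)
# and two items of the second set (columns).
raw = [[0.50, 0.20],
       [0.10, 0.75]]
matrix = threshold_matrix(raw, threshold=0.3)
```

Entries below the threshold become 0, so weak similarities no longer compete in the later matching step and the matrix becomes sparser.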
In one embodiment, the similarity measure in step 102 is calculated by the following formula:
$$r_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|} \tag{1}$$
where $r_{ij}$ is the similarity ratio; $A_i$ is the set of Chinese words included in the $i$-th element of the first set of text data items, $i = 1, 2, \dots, n$, and $n$ is the number of elements in the first set of text data items; $B_j$ is the set of Chinese words included in the $j$-th element of the second set of text data items, $j = 1, 2, \dots, m$, and $m$ is the number of elements in the second set of text data items; and $|\cdot|$ is the element count operation.
Specifically, corresponding word vectors can be constructed for the words in the two word sets, similar words can be screened using the cosine measure between word vectors, and the similarity of Chinese phrases can then be measured with the Jaccard algorithm. However, because text data items in a data dictionary are usually short, refined professional phrases, and in order to reduce computation, the similarity measure is simplified to a word-granularity Jaccard similarity between text data items, which represents the semantic similarity between data items: the more words the two sets have in common, the more similar the semantics expressed by the two Chinese phrases; conversely, the fewer words they share, the lower the similarity of the content expressed by the two Chinese phrases. Based on this, the Chinese phrase similarity measure is defined as formula (1) and is called the similarity ratio.
According to formula (1), the more the words of the two word sets overlap, the closer the similarity ratio $r_{ij}$ is to 1; correspondingly, the more the words of the two word sets differ, the closer the similarity ratio $r_{ij}$ is to 0. Thus, the similarity ratio $r_{ij}$ is strongly positively correlated with the semantic similarity of the two Chinese phrases.
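The word-granularity Jaccard measure described above can be sketched as follows. The function name `similarity_ratio` is hypothetical, the convention of returning 0 for two empty word sets is an assumption, and the Chinese word lists are an assumed segmentation of the place-name example used earlier in this description.

```python
def similarity_ratio(words_a, words_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty word sets are treated as not similar.
        return 0.0
    return len(set_a & set_b) / len(union)


ratio = similarity_ratio(["广西", "壮族", "自治区", "南宁市"],
                         ["广西", "南宁市"])
```

Here the two phrases share two words out of four distinct words, so the ratio is 2/4. Identical word sets give 1 and disjoint word sets give 0, which matches the behavior described for formula (1).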
In one embodiment, as shown in fig. 4, there is provided a text data comparison method, including the steps of:
Step 400: acquiring the text data item sets in the two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
Step 402: calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
Step 404: according to the sparsity of the similarity measurement matrix, dividing the similarity measurement matrix into a plurality of mutually independent sub-similarity measurement matrixes.
Dividing the original matrix into a plurality of mutually independent sub-similarity measurement matrixes according to the sparsity of the similarity measurement matrix specifically comprises the following steps: 1) traversing the similarity matrix data to be compared; 2) based on the sparsity characteristic, taking the regions of the similarity matrix whose entries are all zero as matrix partition lines; 3) saving the result of the division to generate the sub-similarity measurement matrixes.
Each sub-matrix is then solved separately with the KM algorithm, which finally realizes the comparative analysis of the whole dictionary table.
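One possible reading of the division step is sketched below in Python. It assumes that "mutually independent" sub-matrixes correspond to groups of rows and columns that are linked, directly or through a chain, by non-zero entries (the connected components of the underlying bipartite graph). The function name `split_into_blocks` and the sample matrix are hypothetical, and the source does not state that this is the exact procedure used.

```python
def split_into_blocks(matrix):
    """Group row and column indices into independent blocks.

    Rows and columns fall into the same block when non-zero entries link
    them, directly or through a chain of other rows and columns.
    All-zero rows and columns belong to no block.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    seen_rows, seen_cols = set(), set()
    blocks = []
    for start in range(n_rows):
        if start in seen_rows or not any(matrix[start]):
            continue
        rows, cols = [], []
        stack = [("r", start)]
        seen_rows.add(start)
        while stack:
            kind, idx = stack.pop()
            if kind == "r":
                rows.append(idx)
                for j in range(n_cols):
                    if matrix[idx][j] != 0 and j not in seen_cols:
                        seen_cols.add(j)
                        stack.append(("c", j))
            else:
                cols.append(idx)
                for i in range(n_rows):
                    if matrix[i][idx] != 0 and i not in seen_rows:
                        seen_rows.add(i)
                        stack.append(("r", i))
        blocks.append((sorted(rows), sorted(cols)))
    return blocks


# Hypothetical thresholded similarity measurement matrix.
m = [
    [0.5, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
blocks = split_into_blocks(m)
print(blocks)  # [([0, 1], [0]), ([2], [2, 3])]
```

Each returned pair of index lists identifies one sub-matrix that can be solved on its own. Row 3 and column 1 contain only zeros, so under this reading they have no candidate match and are left out of every block.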
Step 406: solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively; step 402 comprises: calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
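The thresholding rule of this embodiment can be sketched in Python as follows. The sketch assumes the Jaccard form of the similarity measurement; the names `jaccard` and `similarity_matrix`, the sample word sets and the threshold value 0.4 are hypothetical.

```python
def jaccard(a, b):
    """Jaccard ratio of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(first_sets, second_sets, threshold):
    """Rows follow first_sets, columns follow second_sets.

    An entry keeps its similarity value when it reaches the threshold
    and is set to 0 otherwise.
    """
    matrix = []
    for a in first_sets:
        row = []
        for b in second_sets:
            r = jaccard(a, b)
            row.append(r if r >= threshold else 0.0)
        matrix.append(row)
    return matrix


# Hypothetical word sets of two small dictionary tables.
first = [{"user", "login", "time"}, {"order", "amount"}]
second = [{"user", "login", "date"}, {"order", "total", "amount"}, {"user", "name"}]

m = similarity_matrix(first, second, threshold=0.4)
print(m[0])  # [0.5, 0.0, 0.0]
```

In the first row, the third entry has a raw similarity of 0.25 (one shared word out of four), which falls below the threshold and is therefore stored as 0. This is what makes the matrix sparse.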
In one embodiment, the KM algorithm employs a depth-first search algorithm for the search of the augmented path in step 406.
Specifically, preprocessing the association relations among the data items with the similarity ratio threshold effectively reduces the association depth among the data items. As a result, when the KM algorithm searches for an augmented path, a depth-first search algorithm can be used without incurring excessive recursive calls, which further reduces computation and storage consumption.
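For illustration, a textbook form of the KM (Kuhn-Munkres) algorithm with a depth-first search for the augmented (augmenting) path is sketched below in Python. This is a generic version under stated assumptions, not the patented implementation: the weight matrix is assumed to be square (a rectangular block would first need zero padding), and the name `km_match` and the sample weights are hypothetical.

```python
def km_match(weights):
    """Maximum-weight perfect matching on a square weight matrix.

    Returns match_col, where match_col[j] is the row assigned to column j.
    """
    n = len(weights)
    label_row = [max(row) for row in weights]  # feasible vertex labels
    label_col = [0.0] * n
    match_col = [-1] * n
    eps = 1e-9  # tolerance for floating-point tightness checks

    def dfs(i, visited_rows, visited_cols, slack):
        """Depth-first search for an augmenting path starting at row i."""
        visited_rows[i] = True
        for j in range(n):
            if visited_cols[j]:
                continue
            gap = label_row[i] + label_col[j] - weights[i][j]
            if gap < eps:  # edge (i, j) is tight
                visited_cols[j] = True
                if match_col[j] == -1 or dfs(match_col[j], visited_rows, visited_cols, slack):
                    match_col[j] = i
                    return True
            elif gap < slack[j]:
                slack[j] = gap
        return False

    for i in range(n):
        slack = [float("inf")] * n
        while True:
            visited_rows = [False] * n
            visited_cols = [False] * n
            if dfs(i, visited_rows, visited_cols, slack):
                break
            # No augmenting path yet: relax the labels by the smallest slack.
            delta = min(slack[j] for j in range(n) if not visited_cols[j])
            for r in range(n):
                if visited_rows[r]:
                    label_row[r] -= delta
            for j in range(n):
                if visited_cols[j]:
                    label_col[j] += delta
                else:
                    slack[j] -= delta
    return match_col


# Hypothetical 2x2 block: a greedy choice of 0.8 would force a 0.0 pairing.
w = [
    [0.8, 0.7],
    [0.7, 0.0],
]
match = km_match(w)
total = sum(w[match[j]][j] for j in range(len(w)))
print(match)  # [1, 0]
```

The result pairs row 1 with column 0 and row 0 with column 1, for a total weight of about 1.4 instead of the 0.8 a greedy choice would reach. Because entries below the threshold are stored as 0, a caller would presumably discard any returned pair whose weight is 0 rather than report it as a match.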
In one embodiment, the similarity measure in step 402 is calculated as shown in equation (1).
In one embodiment, the data dictionary table includes a data number field; before step 406, the method further includes: comparing the data dictionary tables updated before and after a period of time according to the data number field.
Specifically, a number field, that is, an identifier (ID) field, is usually provided in the design of actual dictionary table data. Because the number field of a data item is relatively fixed, if the data dictionary tables updated before and after a period of time can be matched on this field, matching on the number field first further reduces the computational cost of the algorithm and improves the efficiency of the comparative analysis.
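A minimal Python sketch of matching on the number field first is given below. It assumes that each dictionary table can be represented as a mapping from ID to text data item and that an ID present in both tables identifies the same item; the function name `prematch_by_id` and the sample tables are hypothetical.

```python
def prematch_by_id(old_table, new_table):
    """Pair items that share an ID; return the remainder of each table.

    old_table and new_table map an ID to its text data item.
    """
    shared = old_table.keys() & new_table.keys()
    matched = {i: (old_table[i], new_table[i]) for i in sorted(shared)}
    old_rest = {i: t for i, t in old_table.items() if i not in shared}
    new_rest = {i: t for i, t in new_table.items() if i not in shared}
    return matched, old_rest, new_rest


# Hypothetical versions of one dictionary table before and after an update.
old = {101: "用户登录时间", 102: "订单金额", 103: "商品名称"}
new = {101: "用户登录日期", 102: "订单金额", 204: "商品全称"}

matched, old_rest, new_rest = prematch_by_id(old, new)
print(sorted(matched))  # [101, 102]
```

Only the leftover items (ID 103 in the old table and ID 204 in the new one) would then need the similarity measurement matrix and the KM algorithm, which is where the saving in computation comes from.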
It should be understood that although the steps in the flowcharts of fig. 1 and fig. 4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 and fig. 4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a text data comparison apparatus including a comparison data acquisition module, a similarity measurement matrix determination module and a text data comparison result determination module, wherein:
The comparison data acquisition module is used for acquiring the text data item sets in the two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
The similarity measurement matrix determination module is used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
The text data comparison result determination module is used for converting the comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph according to the similarity measurement matrix and the two text data item sets, and for solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, the comparison data obtaining module is further configured to obtain text data item sets in two data dictionary tables; and performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively; the similarity measurement matrix determining module is also used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
In one embodiment, the calculation formula of the similarity metric in the similarity metric matrix determination module is shown as formula (1).
For the specific definition of the text data comparison apparatus, reference may be made to the above definition of the text data comparison method, which is not repeated here. The modules in the text data comparison apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of the present specification.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of comparing text data, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
according to the similarity measurement matrix and the two text data item sets, converting a comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph;
and solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relation between the two text data item sets.
2. The method of claim 1, wherein acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets comprises:
acquiring text data item sets in two data dictionary tables;
and performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.
3. The method of claim 1, wherein rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, wherein the similarity measurement matrix comprises:
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;
when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement;
when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
4. The method according to claim 1, wherein a similarity measure between elements of two sets of text data items is calculated according to a chinese word set of each element in the two sets of text data items, and the similarity measure is preprocessed by a preset similarity ratio threshold to obtain a similarity measure matrix, wherein the similarity measure is calculated according to the following formula:
[Formula image not reproduced]
wherein the symbols of the formula denote, respectively: the similarity ratio; the set of Chinese words included in a given element of the first text data item set; the set of Chinese words included in a given element of the second text data item set; and an operation that counts the number of elements.
5. A method of comparing text data, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
according to the characteristic of similarity measurement matrix sparsification, the similarity measurement matrix is divided into a plurality of sub-similarity measurement matrixes which are irrelevant to each other;
and solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
6. The method of claim 5, wherein rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, wherein the similarity measurement matrix comprises:
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;
when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements at the corresponding positions of the similarity measurement matrix are equal to the similarity measurement;
when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
7. The method as recited in claim 6, wherein each sub-similarity metric matrix is solved using a KM algorithm to obtain a set of globally optimal matching relationships between two sets of text data items, and wherein the KM algorithm uses a depth-first search algorithm for searching for an augmented path.
8. The method according to claim 5, wherein a similarity measure between elements of two sets of text data items is calculated according to a Chinese word set of each element in the two sets of text data items, and the similarity measure is preprocessed by a preset similarity ratio threshold to obtain a similarity measure matrix, wherein the similarity measure is calculated according to a formula:
[Formula image not reproduced]
wherein the symbols of the formula denote, respectively: the similarity ratio; the set of Chinese words included in a given element of the first text data item set; the set of Chinese words included in a given element of the second text data item set; and an operation that counts the number of elements.
9. The method of claim 5, wherein the data dictionary table includes a data number field;
before solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the method further comprises:
comparing the data dictionary tables updated before and after a period of time according to the data number field.
10. A text data comparison apparatus, characterized in that the apparatus comprises:
the comparison data acquisition module is used for acquiring text data item sets in two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
the similarity measurement matrix determining module is used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
the text data comparison result determining module is used for converting a comparison analysis problem of the two text data item sets into a matching problem of the weighted bipartite graph according to the similarity measurement matrix and the two text data item sets; and solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relation between the two text data item sets.
CN202210631816.9A 2022-06-07 2022-06-07 Text data comparison method and device Active CN114722160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210631816.9A CN114722160B (en) 2022-06-07 2022-06-07 Text data comparison method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210631816.9A CN114722160B (en) 2022-06-07 2022-06-07 Text data comparison method and device

Publications (2)

Publication Number Publication Date
CN114722160A true CN114722160A (en) 2022-07-08
CN114722160B CN114722160B (en) 2022-09-02

Family

ID=82232868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210631816.9A Active CN114722160B (en) 2022-06-07 2022-06-07 Text data comparison method and device

Country Status (1)

Country Link
CN (1) CN114722160B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1959671A (en) * 2005-10-31 2007-05-09 北大方正集团有限公司 Measure of similarity of documentation based on document structure
CN106547739A (en) * 2016-11-03 2017-03-29 同济大学 A kind of text semantic similarity analysis method
CN113407767A (en) * 2021-06-29 2021-09-17 北京字节跳动网络技术有限公司 Method and device for determining text relevance, readable medium and electronic equipment
CN113934842A (en) * 2020-06-29 2022-01-14 数网金融有限公司 Text clustering method and device and readable storage medium

Also Published As

Publication number Publication date
CN114722160B (en) 2022-09-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant