CN114722160A - Text data comparison method and device - Google Patents
Text data comparison method and device
- Publication number
- CN114722160A
- Authority
- CN
- China
- Prior art keywords
- text data
- data item
- item sets
- similarity
- similarity measurement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to a text data comparison method and device in the technical field of information processing. The method comprises the steps of obtaining text data item sets in two data dictionary tables, carrying out word segmentation on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets, calculating similarity measurement between the elements of the two text data item sets, preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, converting a comparison analysis problem of the two text data item sets into a problem of seeking an optimal matching scheme through abstraction and modeling of a dictionary table comparison analysis problem, and solving the problem by utilizing a KM algorithm. The method realizes the automatic comparison and analysis of the dictionary table data based on the semantics, effectively relieves the working pressure of manually comparing in the data reorganization process, and provides a new idea for the automatic processing of data comparison.
Description
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a text data comparison method and apparatus.
Background
As the cost of data acquisition and storage falls, the volume of data is growing explosively. At the same time, the demands on data association and fusion keep increasing, and so do the challenges they face. Data compilation and integration, as a key bridge between raw data and high-value data, plays an increasingly important role in data-based statistical analysis and has become an increasingly fundamental and burdensome task in data processing.
The data dictionary table is the basic data that defines meta information, such as data items, for the data in a database system, and is key information for applying and understanding the whole database system. Comparing, associating and aligning data dictionary tables is therefore of great significance in the data integration process.
In the process of data aggregation, fusion and unification, comparing and associating data from heterogeneous databases, or from different points in time, is a key step in data compilation, integration and updating, and is particularly important for data dictionaries that describe metadata such as the data items and data structures in a database. At present, the industry generally uses the Extract-Transform-Load (ETL) technology of data warehouses to extract, transform and fuse heterogeneous data. Existing research uses Python as an intermediary to compare table records in heterogeneous databases, which solves data access across different storage databases and comparison at the table-record level. However, the problem of automatically comparing the same data ontology under different forms of expression remains unsolved, and a semantics-based automatic processing means is lacking.
Disclosure of Invention
In view of the above, it is necessary to provide a text data comparing method and apparatus.
A method of text data comparison, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
converting, according to the similarity measurement matrix and the two text data item sets, the comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph; and
solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets, includes:
acquiring the text data item sets in the two data dictionary tables; and
performing word segmentation processing on the elements in the two text data item sets by adopting a statistics-based word segmentation method to obtain a Chinese word set of each element in the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively.
Calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, includes:
calculating the similarity measurement between the elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;
when the similarity measure between the elements of the two text data item sets is greater than or equal to the preset similarity ratio threshold value, setting the element at the corresponding position of the similarity measure matrix equal to the similarity measure; and
when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, setting the element at the corresponding position of the similarity measure matrix equal to 0.
In one embodiment, in the step of calculating the similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold to obtain a similarity measure matrix, the similarity measure is calculated by the following formula:
$$ s_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|} $$
where $s_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words included in the $i$-th element of the first text data item set, $B_j$ is the set of Chinese words included in the $j$-th element of the second text data item set, and $|\cdot|$ is the element count operation.
A method of text data comparison, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
dividing, according to the sparsity of the similarity measurement matrix, the similarity measurement matrix into a plurality of mutually independent sub-similarity measurement matrices; and
solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively.
Calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, includes:
calculating the similarity measurement between the elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;
when the similarity measure between the elements of the two text data item sets is greater than or equal to the preset similarity ratio threshold value, setting the element at the corresponding position of the similarity measure matrix equal to the similarity measure; and
when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, setting the element at the corresponding position of the similarity measure matrix equal to 0.
In one embodiment, in the step of solving each sub-similarity metric matrix with the KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the KM algorithm uses a depth-first search algorithm to search for the augmenting path.
In one embodiment, in the step of calculating the similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold to obtain a similarity measure matrix, the similarity measure is calculated by the following formula:
$$ s_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|} $$
where $s_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words included in the $i$-th element of the first text data item set, $B_j$ is the set of Chinese words included in the $j$-th element of the second text data item set, and $|\cdot|$ is the element count operation.
In one embodiment, the data dictionary table includes a data number field; before the step of solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the method further includes: comparing the data dictionary tables from before and after an update period according to the data number field.
A text data comparison apparatus, the apparatus comprising:
a comparison data acquisition module, used for acquiring text data item sets in two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
a similarity measurement matrix determining module, used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix; and
a text data comparison result determining module, used for converting the comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph according to the similarity measurement matrix and the two text data item sets, and for solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In the text data comparison method and device, text data item sets in two data dictionary tables are obtained; word segmentation is performed on the two text data item sets to obtain a Chinese word set of each element; the similarity measure between the elements of the two text data item sets is calculated and preprocessed with a preset similarity ratio threshold to obtain a similarity measure matrix; by abstracting and modeling the dictionary table comparison analysis problem, the comparison of the two text data item sets is converted into the problem of finding an optimal matching scheme in a bipartite graph; and that problem is solved with the KM algorithm. The method realizes semantics-based automatic comparison and analysis of dictionary table data, effectively relieves the workload of manual comparison during data compilation, and provides a new approach to the automatic processing of data comparison.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for comparing textual data in one embodiment;
FIG. 2 is a diagram illustrating a comparison of data dictionary tables in one embodiment;
FIG. 3 is a diagram illustrating the Chinese phrase segmentation results of the data items in another embodiment;
FIG. 4 is a flowchart illustrating a text data comparison method according to an embodiment;
FIG. 5 is a block diagram showing a configuration of a text data comparison device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text data comparison method provided by the application can be used for the comparison and analysis of text data items in two data dictionary tables. In essence, for two data sets that differ to some degree, the method uses the similarity between data items to construct a set of mapping relations from one set to the other, and marks and explains the data items that have changed. Because the data items in the two data sets are not completely consistent, different data values of the same data ontology are usually associated through semantic similarity (for example, "Guangxi Zhuang autonomous region Nanning City" is semantically equivalent to "Guangxi Nanning City", although the data values differ). However, a data item may also show some similarity to data items of other, different data ontologies (for example, there is some similarity between "training set" and "training set"), so constructing a globally optimal mapping relationship is the key to the comparison analysis of data dictionary tables. For the data items in the two data sets, the invention constructs a measure that characterizes the semantic similarity of data items and, based on this similarity measure, compares the data items in the two sets by semantic association so that the overall comparison result has the best global similarity, which offers a new approach to the automatic comparison analysis of data dictionaries.
In one embodiment, as shown in fig. 1, there is provided a text data comparison method, including the steps of:
Step 100: acquire text data item sets in the two data dictionary tables, and perform word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
Specifically, the two data dictionary tables are heterogeneous.
The data dictionary table is basic data for defining meta information such as data items of data in the current database system, and is key information for application and understanding of the whole database system.
The data dictionary table may be a unit sequence dictionary table, which mainly includes fields such as unit number, unit name and level (the unit name field reflects the semantic relevance of the data ontology before and after adjustment); it may also be a department relation dictionary table, a product type dictionary table, and the like.
To perform association and comparison based on semantic similarity on text data items such as names in the data dictionary table, the data items are first processed with natural language processing methods. The word is the smallest unit that expresses semantics and is the basic operating unit for constructing the data item similarity measure, so effective word segmentation of Chinese phrase data items or English phrases is a basic part of dictionary data processing. The invention mainly takes the Chinese phrase data items in a data dictionary as an example, and introduces the proposed data dictionary comparison method using an open Wikipedia corpus and business terms as the Chinese word segmentation dictionary. The word segmentation method can be dictionary-based or statistics-based.
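As an illustration of the dictionary-based option mentioned above, the following is a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique. It is not the segmenter used in the embodiments; the function name `forward_max_match`, the toy dictionary and the Chinese renderings of the example place names are assumptions made only for this sketch.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy dictionary-based segmentation: at each position, take the
    longest dictionary word; fall back to a single character."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"广西", "壮族", "自治区", "南宁", "南宁市"}
words_a = forward_max_match("广西壮族自治区南宁市", toy_dictionary)
words_b = forward_max_match("广西南宁市", toy_dictionary)
print(words_a)
print(words_b)
```

A statistics-based segmenter would replace the greedy dictionary lookup with a learned model, but would still yield one word set per data item.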
Step 102: calculate the similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocess the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
Wherein the number of rows of the similarity metric matrix is equal to the number of elements in a first one of the two sets of text data items, and the number of columns of the similarity metric matrix is equal to the number of elements in a second one of the two sets of text data items.
The element value $w_{ij}$ of the similarity metric matrix is the result obtained by processing, with the preset similarity ratio threshold, the similarity measure between the $i$-th element of the first text data item set and the $j$-th element of the second text data item set.
Step 104: according to the similarity measurement matrix and the two text data item sets, convert the comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph.
Specifically, as shown in FIG. 2, the data dictionary table comparison problem can be described as follows: there are two different data item sequences $X = \{x_1, x_2, \dots, x_n\}$ and $Y = \{y_1, y_2, \dots, y_m\}$, where data item $x_i$ and data item $y_j$ are similar with a similarity ratio of $s_{ij}$. The problem is how to construct a set of one-to-one mapping relationships such that each data item in $X$ corresponds to at most one data item in $Y$, each data item in $Y$ corresponds to at most one data item in $X$, and the two sequences achieve the greatest overall similarity. From this description, dictionary table data comparison analysis is a typical matching problem on a weighted bipartite graph.
Step 106: solve the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
Specifically, the KM algorithm is a classical algorithm for solving the optimal matching of bipartite graphs. Its core idea is to adjust the label values $l(x_i)$ and $l(y_j)$ of the vertices so that the number of matched pairs in the bipartite graph is finally maximized and a group of globally optimal matching relations is obtained.
Let the two dictionary tables form a bipartite graph $G = (V, E)$, in which the vertex set $V$ is composed of the data item sets $X$ and $Y$ of the two dictionary tables, the edge set $E$ is formed by the similarity relationships between set $X$ and set $Y$, and the corresponding weight $w_{ij}$ represents the similarity of vertex $x_i$ and vertex $y_j$. With these definitions, the data dictionary comparison method using the KM algorithm specifically comprises the following steps:
Step 1: initialize the label value of each element in set $X$ and set $Y$: let the label value of each $y_j$ be $l(y_j) = 0$, and let the label value of each $x_i$ be the maximum similarity value associated with it, i.e. $l(x_i) = \max_{1 \le j \le m} w_{ij}$, where $m$ is the number of elements in set $Y$.
Step 2: starting from element $x_i$ in set $X$, search the bipartite graph $G$ for an augmenting path $P$ along edges that satisfy $l(x_i) + l(y_j) = w_{ij}$. If an augmenting path $P$ exists, jump to step 4; otherwise, jump to step 3.
Step 3: when the search for an augmenting path $P$ fails, adjust the label values of the vertices on the set of alternating paths (paths composed of alternating matched and unmatched edges) visited during the search so that further edges satisfy the condition $l(x_i) + l(y_j) = w_{ij}$, and then jump to step 2.
Step 4: for the edges on the augmenting path $P$, invert their matching status (matched edges become unmatched and unmatched edges become matched), and update to form a new matching.
Step 5: check whether all elements in set $X$ have been traversed. If the traversal is finished, end the operation; otherwise, take the next element $x_{i+1}$ of the set and jump to step 2.
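The five steps above can be sketched in Python as follows. This is a minimal sketch of a textbook KM (Kuhn-Munkres) procedure with a depth-first augmenting-path search, not the applicant's implementation. The function name `km_match`, the zero-padding to a square matrix, the `slack` bookkeeping, the floating-point tolerance `eps` and the final filtering of zero-weight pairs are all assumptions introduced for this illustration.

```python
def km_match(weights):
    """Return (row, col) pairs of a maximum-weight matching for a dense
    similarity matrix, using the KM method with depth-first search."""
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0
    size = max(n_rows, n_cols)
    # Assumption: pad with zero-weight edges so a perfect matching exists.
    w = [[weights[i][j] if i < n_rows and j < n_cols else 0.0
          for j in range(size)] for i in range(size)]
    label_x = [max(row) for row in w]   # step 1: row labels = row maximum
    label_y = [0.0] * size              # step 1: column labels = 0
    match_y = [-1] * size               # match_y[j] = row matched to column j
    eps = 1e-9                          # tolerance for float comparison

    def dfs(i, seen_x, seen_y, slack):
        # Step 2: look for an augmenting path along "tight" edges.
        seen_x[i] = True
        for j in range(size):
            if seen_y[j]:
                continue
            gap = label_x[i] + label_y[j] - w[i][j]
            if gap < eps:
                seen_y[j] = True
                if match_y[j] == -1 or dfs(match_y[j], seen_x, seen_y, slack):
                    match_y[j] = i      # step 4: flip edges along the path
                    return True
            elif gap < slack[j]:
                slack[j] = gap
        return False

    for i in range(size):               # step 5: process every row in turn
        slack = [float("inf")] * size
        while True:
            seen_x = [False] * size
            seen_y = [False] * size
            if dfs(i, seen_x, seen_y, slack):
                break
            # Step 3: adjust labels so that at least one new edge is tight.
            delta = min(slack[j] for j in range(size) if not seen_y[j])
            for k in range(size):
                if seen_x[k]:
                    label_x[k] -= delta
                if seen_y[k]:
                    label_y[k] += delta
                else:
                    slack[k] -= delta

    # Keep only real pairs that carry a positive similarity.
    return [(match_y[j], j) for j in range(size)
            if match_y[j] != -1 and match_y[j] < n_rows and j < n_cols
            and weights[match_y[j]][j] > 0]


example = [[0.9, 0.8, 0.0],
           [0.8, 0.0, 0.0]]
pairs = sorted(km_match(example))
print(pairs)
```

In this example a greedy choice of the largest entry (0.9) would leave the second row unmatched, whereas the global optimum pairs row 0 with column 1 and row 1 with column 0 for a total similarity of 1.6.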
In the text data comparison method, text data item sets in two data dictionary tables are obtained; word segmentation is performed on the two text data item sets to obtain a Chinese word set of each element; the similarity measure between the elements of the two text data item sets is calculated and preprocessed with a preset similarity ratio threshold to obtain a similarity measure matrix; by abstracting and modeling the dictionary table comparison analysis problem, the comparison of the two text data item sets is converted into the problem of finding an optimal matching scheme in a bipartite graph; and that problem is solved with the KM algorithm. The method realizes semantics-based automatic comparison and analysis of dictionary table data, effectively relieves the workload of manual comparison during data compilation, and provides a new approach to the automatic processing of data comparison.
In one embodiment, step 100 comprises: acquiring text data item sets in two data dictionary tables; and performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.
Specifically, using the HanLP (Han Language Processing) word segmentation toolkit, and on the basis of an established business dictionary, prior learning on a Chinese corpus and word segmentation of the text data items are carried out through an embedded second-order hidden Markov chain model. As shown in FIG. 3, the Chinese phrases in two data items are each segmented into a word set.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively; step 102 comprises: calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the element at the corresponding position of the similarity measurement matrix is equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element at the corresponding position of the similarity measure matrix is equal to 0.
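The threshold preprocessing described above can be sketched as follows. This is a minimal illustration assuming that each data item has already been segmented into a set of words and that the similarity measure is the word-level Jaccard ratio; the names `build_similarity_matrix`, `first`, `second` and the threshold value 0.5 are hypothetical.

```python
def build_similarity_matrix(first_sets, second_sets, threshold):
    """Rows follow the first item set, columns the second. Entries below
    the threshold are set to 0, all others keep their similarity."""
    def ratio(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    matrix = []
    for a in first_sets:
        row = []
        for b in second_sets:
            value = ratio(a, b)
            row.append(value if value >= threshold else 0.0)
        matrix.append(row)
    return matrix


# Hypothetical word sets (English glosses used for readability).
first = [{"guangxi", "zhuang", "nanning"}, {"sales", "dept"}]
second = [{"guangxi", "nanning"}, {"sales", "office"}, {"finance"}]
matrix = build_similarity_matrix(first, second, threshold=0.5)
print(matrix)
```

Here the pair with two shared words out of three survives the threshold, while the pair sharing one word out of three is zeroed, which is what makes the resulting matrix sparse.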
In one embodiment, the similarity measure in step 102 is calculated by the following formula:
$$ s_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|} \tag{1} $$
where $s_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words included in the $i$-th element of the first text data item set, $i = 1, 2, \dots, n$, $n$ is the number of elements in the first text data item set, $B_j$ is the set of Chinese words included in the $j$-th element of the second text data item set, $j = 1, 2, \dots, m$, $m$ is the number of elements in the second text data item set, and $|\cdot|$ is the element count operation.
Specifically, corresponding word vectors can be constructed for the words in the two word sets, similar words can be screened using the cosine measure between the word vectors, and the similarity of Chinese phrases can then be measured with the Jaccard algorithm. However, because text data items in a data dictionary are usually short, concise professional phrases, and in order to reduce computational cost, the similarity measure is simplified to the word-level Jaccard similarity between text data items, which characterizes the semantic similarity between data items: the more words the two sets have in common, the more similar the semantics expressed by the two Chinese phrases; conversely, the fewer words they share, the less similar the content expressed by the two Chinese phrases. On this basis, the Chinese phrase similarity measure is defined as formula (1) and is called the similarity ratio.
According to formula (1), the more the words of the two word sets overlap, the closer the similarity ratio $s_{ij}$ is to 1; correspondingly, the more the words of the two word sets differ, the closer $s_{ij}$ is to 0. The similarity ratio $s_{ij}$ is therefore strongly positively correlated with the semantic similarity of the two Chinese phrases.
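The behaviour of the similarity ratio at its two extremes can be checked with a short sketch. The function name `similarity_ratio` and the sample word lists are assumptions for illustration; returning 0 for two empty sets is also an assumption, since the text does not define that case.

```python
def similarity_ratio(words_a, words_b):
    """Word-level Jaccard ratio: shared words divided by all distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty phrases are treated as dissimilar
    return len(set_a & set_b) / len(union)


# Hypothetical segmentation results for two place-name data items.
long_form = ["广西", "壮族", "自治区", "南宁市"]
short_form = ["广西", "南宁市"]
print(similarity_ratio(long_form, short_form))   # 2 shared words of 4 distinct
print(similarity_ratio(long_form, long_form))    # identical sets
print(similarity_ratio(long_form, ["财务", "部"]))  # no shared words
```

The three calls illustrate partial overlap, full overlap and no overlap, matching the monotonic behaviour described above.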
In one embodiment, as shown in fig. 4, there is provided a text data comparison method, including the steps of:
Step 400: acquire text data item sets in the two data dictionary tables, and perform word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
Step 402: calculate the similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocess the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
Step 404: according to the sparsity of the similarity measurement matrix, divide the similarity measurement matrix into a plurality of mutually independent sub-similarity measurement matrices.
Dividing the original matrix into a plurality of mutually independent sub-similarity measurement matrices according to the sparsity of the similarity measurement matrix specifically comprises: 1) traversing the similarity matrix data to be compared; 2) exploiting the sparsity of the matrix, taking the regions of the similarity matrix whose entries are all zero as the matrix partition lines; 3) keeping the blocks that remain after partitioning as the sub-similarity metric matrices.
The sub-matrices are then solved separately with the KM algorithm, which finally achieves the comparison analysis of the whole dictionary table.
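One way to realize such a split is to treat each nonzero entry as an edge between a row and a column and to collect the connected groups of rows and columns; rows and columns in different groups share only zero entries, so the groups can be solved independently. The sketch below follows this interpretation, which is an assumption rather than the procedure stated in the text; the name `split_blocks` and the sample matrix are hypothetical.

```python
from collections import deque


def split_blocks(matrix):
    """Group rows and columns linked by nonzero entries. Each returned
    (rows, cols) pair indexes one independent sub-similarity matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    seen_rows, seen_cols = set(), set()
    blocks = []
    for start in range(n_rows):
        # Skip rows already assigned and rows with no nonzero entry.
        if start in seen_rows or not any(matrix[start]):
            continue
        rows, cols = [], []
        queue = deque([("row", start)])
        seen_rows.add(start)
        while queue:
            kind, index = queue.popleft()
            if kind == "row":
                rows.append(index)
                for j in range(n_cols):
                    if matrix[index][j] and j not in seen_cols:
                        seen_cols.add(j)
                        queue.append(("col", j))
            else:
                cols.append(index)
                for i in range(n_rows):
                    if matrix[i][index] and i not in seen_rows:
                        seen_rows.add(i)
                        queue.append(("row", i))
        blocks.append((sorted(rows), sorted(cols)))
    return blocks


sparse = [[0.7, 0.0, 0.0, 0.0],
          [0.0, 0.6, 0.5, 0.0],
          [0.0, 0.9, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
blocks = split_blocks(sparse)
print(blocks)
```

Each block can then be extracted as its own small matrix and passed to the KM solver; all-zero rows and columns are left unmatched.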
Step 406: solve each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively; step 402 comprises: calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
In one embodiment, the KM algorithm employs a depth-first search algorithm for the search of the augmented path in step 406.
Specifically, preprocessing the association relations among the data items with the similarity ratio threshold effectively reduces the association depth among the data items. As a result, when the KM algorithm searches for an augmenting path, a depth-first search algorithm can be used without excessive recursive calls, which further reduces computation and storage consumption.
In one embodiment, the similarity measure in step 402 is calculated as shown in equation (1).
In one embodiment, the data dictionary table includes a data number field, and before step 406 the method further includes: comparing the data dictionary tables from before and after an update period according to the data number field.
Specifically, a number field, that is, an identifier (ID) field, is usually provided in the design of actual dictionary table data. Because the number field of a data item is relatively stable, data dictionary tables from before and after an update period can first be matched on the number field; matching the number field preferentially further reduces the computational cost of the algorithm and improves the efficiency of the comparison analysis.
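A minimal sketch of this number-field pre-matching is given below, assuming each dictionary table is represented as a mapping from ID to item name. The function name `prematch_by_id` and the sample tables are hypothetical; only the items left over after ID matching would go on to the similarity-based comparison.

```python
def prematch_by_id(old_table, new_table):
    """Pair entries that share an ID, and return the leftovers of both
    tables for the subsequent semantic comparison."""
    shared = old_table.keys() & new_table.keys()
    matched = {key: (old_table[key], new_table[key]) for key in sorted(shared)}
    old_rest = {k: v for k, v in old_table.items() if k not in shared}
    new_rest = {k: v for k, v in new_table.items() if k not in shared}
    return matched, old_rest, new_rest


# Hypothetical tables before and after an update.
old = {"001": "Sales Dept", "002": "Finance Dept", "003": "Training Center"}
new = {"001": "Sales Department", "002": "Finance Dept", "104": "Training Centre"}
matched, old_rest, new_rest = prematch_by_id(old, new)
print(matched)
print(old_rest, new_rest)
```

Removing the ID-matched rows shrinks the similarity matrix before the KM step, which is where the saving in computation comes from.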
It should be understood that although the steps in the flowcharts of fig. 1 and fig. 4 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 and fig. 4 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a text data comparison device including a comparison data acquisition module, a similarity measurement matrix determining module, and a text data comparison result determining module, wherein:
The comparison data acquisition module is used for acquiring the text data item sets in the two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets.
The similarity measurement matrix determining module is used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix.
The text data comparison result determining module is used for converting the comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph according to the similarity measurement matrix and the two text data item sets, and for solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
In one embodiment, the comparison data acquisition module is further used for acquiring the text data item sets in the two data dictionary tables, and for performing word segmentation processing on the elements in the two text data item sets by adopting a statistics-based word segmentation method to obtain a Chinese word set of each element in the two text data item sets.
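The text does not specify which statistics-based word segmentation method is used. As one possible illustration, the sketch below segments a string by maximizing the sum of unigram log-probabilities with dynamic programming; the word frequency table, the penalty for unseen characters, and the function name `segment` are assumptions made for this example.

```python
import math
from typing import Dict, List


def segment(text: str, freq: Dict[str, int], max_len: int = 4) -> List[str]:
    """Split `text` into the word sequence with the highest unigram score."""
    total = sum(freq.values())
    # Assumed penalty for a single character that is absent from `freq`.
    unknown = math.log(1 / (total * 10))
    best = [0.0] + [-math.inf] * len(text)  # best[k]: best score of text[:k]
    back = [0] * (len(text) + 1)            # back[k]: start of the last word
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                score = math.log(freq[word] / total)
            elif end - start == 1:
                score = unknown
            else:
                continue  # unseen multi-character strings are not candidates
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words = []
    end = len(text)
    while end > 0:  # follow the back-pointers from the end of the text
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical word frequencies; a real system would estimate them from a corpus.
freq = {"客户": 50, "编号": 30, "客": 5, "户": 5, "编": 3, "号": 4,
        "订单": 40, "日期": 35}
word_set = set(segment("客户编号", freq))
```

The resulting `word_set` plays the role of the Chinese word set of one element in a text data item set.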
In one embodiment, the rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively; the similarity measurement matrix determining module is also used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
In one embodiment, the calculation formula of the similarity metric in the similarity metric matrix determination module is shown as formula (1).
For specific limitations of the text data comparison device, reference may be made to the limitations of the text data comparison method above, which are not repeated here. The modules in the text data comparison device may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A method of comparing text data, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
according to the similarity measurement matrix and the two text data item sets, converting a comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph;
and solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relation between the two text data item sets.
2. The method of claim 1, wherein obtaining a set of text data items in two data dictionary tables, and performing word segmentation on the two sets of text data items to obtain a set of chinese words for each element in the two sets of text data items comprises:
acquiring text data item sets in two data dictionary tables;
and performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.
3. The method of claim 1, wherein rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, wherein the similarity measurement matrix comprises:
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;
when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement;
when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
4. The method according to claim 1, wherein a similarity measure between elements of two sets of text data items is calculated according to a chinese word set of each element in the two sets of text data items, and the similarity measure is preprocessed by a preset similarity ratio threshold to obtain a similarity measure matrix, wherein the similarity measure is calculated according to the following formula:
wherein the terms of the formula denote, respectively: the similarity ratio; the set of Chinese words included in the corresponding element of the first set of text data items; the set of Chinese words included in the corresponding element of the second set of text data items; and the element-count operation.
5. A method of comparing text data, the method comprising:
acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
according to the characteristic of similarity measurement matrix sparsification, the similarity measurement matrix is divided into a plurality of sub-similarity measurement matrixes which are irrelevant to each other;
and solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets.
6. The method of claim 5, wherein rows and columns of the similarity metric matrix correspond to elements in the first set of text data items and elements in the second set of text data items, respectively;
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix, wherein the similarity measurement matrix comprises:
calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;
when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements at the corresponding positions of the similarity measurement matrix are equal to the similarity measurement;
when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.
7. The method as recited in claim 6, wherein, in solving each sub-similarity metric matrix using a KM algorithm to obtain a set of globally optimal matching relationships between the two sets of text data items, the KM algorithm uses a depth-first search algorithm to search for augmenting paths.
8. The method according to claim 5, wherein a similarity measure between elements of two sets of text data items is calculated according to a Chinese word set of each element in the two sets of text data items, and the similarity measure is preprocessed by a preset similarity ratio threshold to obtain a similarity measure matrix, wherein the similarity measure is calculated according to a formula:
wherein the terms of the formula denote, respectively: the similarity ratio; the set of Chinese words included in the corresponding element of the first set of text data items; the set of Chinese words included in the corresponding element of the second set of text data items; and the element-count operation.
9. The method of claim 5, wherein the data dictionary table includes a data number field;
before solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the method further comprises:
comparing the data dictionary tables updated before and after a period of time according to the data number field.
10. A text data comparison apparatus, characterized in that the apparatus comprises:
the comparison data acquisition module is used for acquiring text data item sets in two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;
the similarity measurement matrix determining module is used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;
the text data comparison result determining module is used for converting a comparison analysis problem of the two text data item sets into a matching problem of the weighted bipartite graph according to the similarity measurement matrix and the two text data item sets; and solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relation between the two text data item sets.
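The division of a sparse similarity measurement matrix into mutually unrelated sub-matrices, as recited in claim 5, can be illustrated by grouping rows and columns into connected components of the bipartite graph formed by the nonzero entries. The sketch below is one possible reading under that assumption; the function name `split_blocks` and the sample matrix are hypothetical.

```python
from typing import List, Tuple


def split_blocks(sim: List[List[float]]) -> List[Tuple[List[int], List[int]]]:
    """Group rows and columns linked by nonzero entries into independent blocks."""
    rows = len(sim)
    cols = len(sim[0]) if sim else 0
    seen_r = [False] * rows
    seen_c = [False] * cols
    blocks = []
    for r0 in range(rows):
        if seen_r[r0]:
            continue
        seen_r[r0] = True
        block_r, block_c = [], []
        stack = [("r", r0)]
        while stack:  # iterative traversal of one connected component
            kind, idx = stack.pop()
            if kind == "r":
                block_r.append(idx)
                for c in range(cols):
                    if sim[idx][c] > 0 and not seen_c[c]:
                        seen_c[c] = True
                        stack.append(("c", c))
            else:
                block_c.append(idx)
                for r in range(rows):
                    if sim[r][idx] > 0 and not seen_r[r]:
                        seen_r[r] = True
                        stack.append(("r", r))
        if block_c:  # a row with no nonzero entry has no candidate match
            blocks.append((sorted(block_r), sorted(block_c)))
    return blocks


sim = [[0.8, 0.5, 0.0, 0.0],
       [0.5, 0.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 0.0]]
blocks = split_blocks(sim)
```

Each returned pair of row and column index lists identifies one sub-matrix that can be solved by the KM algorithm on its own, so the cost depends on the block sizes rather than on the size of the full matrix.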
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210631816.9A CN114722160B (en) | 2022-06-07 | 2022-06-07 | Text data comparison method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722160A (en) | 2022-07-08 |
CN114722160B (en) | 2022-09-02 |
Family
ID=82232868
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210631816.9A Active CN114722160B (en) | 2022-06-07 | 2022-06-07 | Text data comparison method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722160B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1959671A (en) * | 2005-10-31 | 2007-05-09 | 北大方正集团有限公司 | Measure of similarity of documentation based on document structure |
CN106547739A (en) * | 2016-11-03 | 2017-03-29 | 同济大学 | A kind of text semantic similarity analysis method |
CN113407767A (en) * | 2021-06-29 | 2021-09-17 | 北京字节跳动网络技术有限公司 | Method and device for determining text relevance, readable medium and electronic equipment |
CN113934842A (en) * | 2020-06-29 | 2022-01-14 | 数网金融有限公司 | Text clustering method and device and readable storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1959671A (en) * | 2005-10-31 | 2007-05-09 | 北大方正集团有限公司 | Measure of similarity of documentation based on document structure |
CN106547739A (en) * | 2016-11-03 | 2017-03-29 | 同济大学 | A kind of text semantic similarity analysis method |
CN113934842A (en) * | 2020-06-29 | 2022-01-14 | 数网金融有限公司 | Text clustering method and device and readable storage medium |
CN113407767A (en) * | 2021-06-29 | 2021-09-17 | 北京字节跳动网络技术有限公司 | Method and device for determining text relevance, readable medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114722160B (en) | 2022-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108280114B (en) | Deep learning-based user literature reading interest analysis method | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN108319583B (en) | Method and system for extracting knowledge from Chinese language material library | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN113282729B (en) | Knowledge graph-based question and answer method and device | |
KR102091633B1 (en) | Searching Method for Related Law | |
CN108846033B (en) | Method and device for discovering specific domain vocabulary and training classifier | |
CN116628173B (en) | Intelligent customer service information generation system and method based on keyword extraction | |
CN112036178A (en) | Distribution network entity related semantic search method | |
CN114491079A (en) | Knowledge graph construction and query method, device, equipment and medium | |
Bender et al. | Unsupervised estimation of subjective content descriptions | |
CN114328800A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
CN116629258B (en) | Structured analysis method and system for judicial document based on complex information item data | |
CN114722160B (en) | Text data comparison method and device | |
CN116127097A (en) | Structured text relation extraction method, device and equipment | |
CN113468311B (en) | Knowledge graph-based complex question and answer method, device and storage medium | |
CN115203206A (en) | Data content searching method and device, computer equipment and readable storage medium | |
Angrosh et al. | Ontology-based modelling of related work sections in research articles: Using crfs for developing semantic data based information retrieval systems | |
CN114911826A (en) | Associated data retrieval method and system | |
CN113157892A (en) | User intention processing method and device, computer equipment and storage medium | |
CN111930880A (en) | Text code retrieval method, device and medium | |
Wei et al. | An index construction and similarity retrieval method based on sentence-bert | |
Arivarasan et al. | Data mining K-means document clustering using tfidf and word frequency count | |
CN116126893B (en) | Data association retrieval method and device and related equipment | |
Zhu et al. | Doc2Vec on similar document suggestion for pharmaceutical collections |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||