CN103294780A - Directory mapping relationship mining method and directory mapping relationship mining device - Google Patents

Publication number
CN103294780A
Authority
CN
China
Prior art keywords
title
catalogue
user catalog
standard directories
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101755697A
Other languages
Chinese (zh)
Other versions
CN103294780B (en
Inventor
刘埔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310175569.7A priority Critical patent/CN103294780B/en
Publication of CN103294780A publication Critical patent/CN103294780A/en
Application granted granted Critical
Publication of CN103294780B publication Critical patent/CN103294780B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a directory mapping relationship mining method. The method includes the following steps: taking the full set of entries under a single category in an entry system as the entries to be mapped, and taking annotation data and a synonym list as a mapping dictionary; performing user directory name mapping and directory content mapping separately to determine the standard directories that preliminarily correspond to a user directory name; and determining the standard directory to which the user directory name is finally mapped by means of weighted voting. Correspondingly, the invention further provides a directory mapping relationship mining device. Directory mapping relationship mining can effectively improve the overall readability of the entry system.

Description

Directory mapping relationship mining method and device
Technical field
The present invention relates to information processing technology, and in particular to a directory mapping relationship mining method and device.
Background art
In an entry system (such as encyclopedia entries or search entries), the directories under an entry category are usually divided into standard directories and user directories. Standard directories are formulated manually (for example, by a product manager) for each entry category; for the person category, for example, they correspond to content such as the person's introduction, the person's experience, and the person's award record. User directories are created by users on their own initiative. A user-created directory may be identical to a manually formulated standard directory, or it may differ in wording while having a similar meaning. For example, if the standard directory is "profile", a user may follow the standard and also name the directory "profile", or may name it more freely, for example "biographical information".
Because essentially all entries in existing entry systems are created by users, user-created entries commonly suffer from non-standard directory names, confused directory logic, unreasonable hierarchical relationships, improperly detailed or abbreviated content under a directory, and statements that are weakly related or unrelated to the directory. For example, in encyclopedia entries, most user-added directories are colloquial in wording, non-standard in name, or unreasonably arranged in their hierarchy.
It is therefore desirable to provide a directory mapping relationship mining method and device that address the above problems.
Summary of the invention
The purpose of the present invention is to provide a directory mapping relationship mining method and device that can effectively solve problems common in entry systems, such as non-standard directory names, confused directory logic, and unreasonable hierarchical relationships.
According to one aspect of the present invention, a directory mapping relationship mining method is provided, comprising the following steps:
taking the full set of entries under a single category in the entry system as the entries to be mapped, and taking annotation data and a synonym table as the mapping dictionary;
performing user directory name mapping and directory content mapping separately to determine the standard directories that preliminarily correspond to a user directory name;
determining, by weighted voting, the standard directory to which the user directory name is finally mapped.
According to another aspect of the present invention, a directory mapping relationship mining device is also provided, comprising:
a mapping data establishing module, configured to take the full set of entries under a single category in the entry system as the entries to be mapped, and to take annotation data and a synonym table as the mapping dictionary;
a directory and content mapping module, configured to perform user directory name mapping and directory content mapping separately and to determine the standard directories that preliminarily correspond to a user directory name;
a final mapped directory determining module, configured to determine, by weighted voting, the standard directory to which the user directory name is finally mapped.
Compared with the prior art, the present invention has the following advantages:
1) mining directory mapping relationships helps improve the overall readability, credibility, and comprehensiveness of the entry system;
2) by mining directory association relationships, other directory wordings under an encyclopedia category that map to a standard directory are identified and corrected, which effectively improves the overall quality of the encyclopedia.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 is a flowchart of a directory mapping relationship mining method according to a preferred embodiment of the present invention;
Fig. 2 is a flowchart of a method for preliminarily mining the mapping relationship between user directories and standard directories based on directory content mapping, according to a preferred embodiment of the present invention;
Fig. 3 is a flowchart of standard directory keyword extraction using the TF/IDF algorithm, according to a preferred embodiment of the present invention;
Fig. 4 is a schematic block diagram of a directory mapping relationship mining device according to a preferred embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
According to one aspect of the present invention, a directory mapping relationship mining method is provided.
Please refer to Fig. 1, which is a flowchart of a directory mapping relationship mining method according to a preferred embodiment of the present invention.
As shown in Fig. 1, the method provided by the present invention comprises the following steps:
Step S101: take the full set of entries under a single category in the entry system as the entries to be mapped, and take annotation data and a synonym table as the mapping dictionary.
Specifically, the entry system contains the full set of entries under each of a plurality of single categories. For example, an encyclopedia entry system contains the full set of entries under categories such as entertainment figures and cartoon characters, and each entry contains a plurality of directory names together with the content under each directory name. Taking the single category of entertainment figures as an example, its entries contain directory names such as profile, performing career, main works, and honors, together with the content under each of them.
As stated in the background art, the directories under an entry category are divided into standard directories and user directories. Annotation data records mapping relationships between standard directories and user directories and takes the form of "user directory-standard directory" mapping pairs. Annotation data is usually produced by manual annotation, and each encyclopedia entry category has about 100 annotated pairs, for example: singer's personal information-profile.
The synonym table is a data table of synonym sets. It is not specific to any entry category and is usually expressed as synonym pairs, such as "introduction-brief introduction" and "describe-depict".
The purpose of this embodiment is to mine the mapping relationships between all user directories under an entry category and the standard directories. Therefore, the full set of entries under a single category in the entry system is taken as the entries to be mapped, annotation data and the synonym table are taken as the mapping dictionary, and the subsequent computations determine the specific standard directory in the mapping dictionary to which each user directory under the entry category corresponds.
As noted above, user directories include both standard directories and non-standard directories; this embodiment mainly processes the non-standard directories created by users.
Step S102: perform user directory name mapping and directory content mapping separately to determine the standard directories that preliminarily correspond to the user directory name.
Specifically, user directory name mapping and directory content mapping comprise computing the standard directory names to which the user directory name maps, based respectively on (a) the user directory name and the standard directory names, and (b) the content under the user directory name and the content under the standard directory names.
Further, before the similarity between the user directory name and a standard directory name is computed, the user directory name is preprocessed by word segmentation and part-of-speech filtering. The preprocessing comprises segmenting the user directory name into words and filtering out meaningless tokens according to part of speech, such as punctuation marks, conjunctions, and interjections. The remaining tokens are spliced together, and the result replaces the original directory name. For example, for the user directory name "1. about person introduction", segmentation yields "1 / . / about / person / [function word] / introduction"; part-of-speech filtering leaves "person / introduction"; and splicing "person" and "introduction" gives "person introduction". The directory name "person introduction" therefore replaces the original name "1. about person introduction" and is the name used when the directory name similarity is computed.
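The preprocessing step can be sketched as follows. This is a minimal illustration that assumes the name has already been segmented and tagged by some external tokenizer; the tag labels (`PUNCT`, `NUM`, `PREP`, and so on), the function name `preprocess_name`, and the token "of" standing in for the function word are all hypothetical. Numerals, prepositions, and particles are added to the filter only so that the worked example reduces to "person introduction".

```python
# Hypothetical part-of-speech labels; a real tagger's tag set will differ.
FILTERED_POS = {"PUNCT", "CONJ", "INTJ", "NUM", "PREP", "PART"}


def preprocess_name(tagged_tokens, sep=""):
    """Drop tokens whose POS is filtered and splice the rest back together.

    tagged_tokens: list of (token, pos) pairs from an external segmenter.
    sep: separator used when splicing ("" suits Chinese, " " suits English).
    """
    kept = [token for token, pos in tagged_tokens if pos not in FILTERED_POS]
    return sep.join(kept)


# Illustrative tokens for the name "1. about person introduction".
tagged = [("1", "NUM"), (".", "PUNCT"), ("about", "PREP"),
          ("person", "NOUN"), ("of", "PART"), ("introduction", "NOUN")]
print(preprocess_name(tagged, sep=" "))  # person introduction
```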
In this embodiment, the longest common subsequence (LCS) algorithm is used as the basic algorithm for computing the similarity between a user directory name and a standard directory name. The longest common subsequence is the longest of all common subsequences of two character strings. For example, given the two strings "abac" and "caba", their longest common subsequence is "aba". This embodiment does not restrict which algorithm is used to solve for the longest common subsequence; various algorithms may be used, including dynamic programming and suffix tree algorithms.
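As an illustration of the dynamic programming option, the following sketch computes one longest common subsequence of two strings. The function name `lcs` is chosen here for illustration; the source does not prescribe an implementation.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (dynamic programming)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Trace back from the bottom-right corner to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("abac", "caba"))  # aba
```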
Further, for a user directory name whose meaning is unchanged when its segmented parts are swapped, the similarity between the user directory name and the standard directory name is computed by running the longest common subsequence algorithm twice, once in each direction. For example, in the two names "pathology etiology" and "etiology pathology", the positions of "pathology" and "etiology" are swapped, but the meaning is the same before and after the swap. For such cases, two LCS computations, forward and reverse, are performed on top of the basic LCS algorithm. If the results of the two computations do not overlap in their positions in the original input, the LCS length between the two directory names is adjusted to twice the original LCS length.
Specifically, for the two names "pathology etiology" and "etiology pathology", the first LCS computation yields the longest common subsequence "pathology", whose length is 4 bytes (byte lengths refer to the original Chinese text). A second LCS computation with the order of the two names inverted ("etiology pathology" and "pathology etiology") yields "etiology", whose length is also 4 bytes. Because the two results, "pathology" and "etiology", do not overlap in their positions in the original input (in "pathology etiology", "pathology" occupies bytes 1-4 and "etiology" occupies bytes 5-8), the longest common subsequence length of "pathology etiology" and "etiology pathology" is taken to be 8 bytes.
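One possible reading of the forward and reverse computation is sketched below. Several points are assumptions rather than the source's stated method: the "reverse" pass is taken to mean swapping the two arguments, overlap is checked on character positions in the user directory name, lengths are counted in characters rather than bytes, and the outcome depends on how the traceback breaks ties, which the source does not specify. The Chinese strings are assumed originals of the translated example.

```python
def lcs_pairs(a, b):
    """Return the matched index pairs (i, j) of one LCS of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:  # tie-break: drop a character of a first
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def bidirectional_lcs_length(user_name, std_name):
    """LCS length, doubled when forward and reverse matches do not overlap."""
    forward = lcs_pairs(user_name, std_name)  # (user index, standard index)
    reverse = lcs_pairs(std_name, user_name)  # (standard index, user index)
    fwd_pos = {i for i, _ in forward}         # positions used in user_name
    rev_pos = {j for _, j in reverse}         # positions used in user_name
    if forward and reverse and fwd_pos.isdisjoint(rev_pos):
        return 2 * len(forward)
    return len(forward)


# Assumed originals of "pathology etiology" and "etiology pathology".
print(bidirectional_lcs_length("病理病因", "病因病理"))  # 4 characters
```

With two-byte characters, the 4 characters returned here correspond to the 8 bytes in the worked example.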
Because directory name similarity can be generalized to similarity between short texts, the similarity of two directory names is directly proportional to the length of the content they share and inversely proportional to the length of the content in which they differ.
On the basis of the basic LCS algorithm, the similarity between the user directory name and the standard directory name is computed jointly in the following two modes:
Mode 1: directly compute the similarity between the user directory and a standard directory in the classification system by the following formula:
SimA = (LCS length of the user directory name and the standard directory name × 2) / (length of the user directory name + length of the standard directory name);
Mode 2: indirectly compute the similarity between the user directory name and the standard directory name based on annotation data:
SimB = (LCS length of the user directory name and the annotated directory name × 2) / (MAX(length of the user directory name, length of the annotated directory name, length of the standard directory name to which the annotated directory name maps) × 2);
Here, the annotated directory name is the user directory in an annotation pair; for example, in the annotation pair "biographical information-profile" it is "biographical information".
Preferably, after the similarity between the user directory name and the standard directory names has been computed in the two modes above, the results are ranked by similarity, and the two highest-ranked standard directory names from each mode are taken as the standard directory names that preliminarily correspond to the user directory name. For example, suppose the annotation data contains "people information-profile", the user directory name is "personal information", the standard directory name is "profile", and the annotated directory is "people information". When the similarity between the user directory "personal information" and the standard directory is computed, Mode 1 directly computes the name similarity between "personal information" and "profile", and Mode 2 computes it indirectly through the annotated directory "people information". In this example (lengths in bytes of the original Chinese names), the direct computation gives 2×2/(8+8) = 0.25, and the indirect computation gives 6×2/(8×2) = 0.75.
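The two formulas can be checked with a short sketch. The function names `sim_a` and `sim_b` are illustrative, and the Chinese strings are hypothetical stand-ins chosen so that the LCS lengths reproduce the arithmetic of the worked example; the source gives only the translated names. Lengths are counted in characters, which yields the same ratios as bytes when every character has the same byte width.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sim_a(user, standard):
    """Mode 1: direct similarity between user and standard directory names."""
    return 2 * lcs_length(user, standard) / (len(user) + len(standard))


def sim_b(user, annotated, standard):
    """Mode 2: indirect similarity through the annotated directory name."""
    longest = max(len(user), len(annotated), len(standard))
    return 2 * lcs_length(user, annotated) / (2 * longest)


# Hypothetical stand-ins for "personal information", "people information", "profile".
user, annotated, standard = "个人信息", "人物信息", "人物简介"
print(sim_a(user, standard))             # 0.25
print(sim_b(user, annotated, standard))  # 0.75
```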
Further, a similarity threshold is set, and each similarity value obtained in the two modes above is compared with the threshold. If a similarity value is less than the threshold, the segmented words in the user directory name are replaced with their corresponding synonyms from the synonym table; for example, the word "introduction" in "personal introduction" is replaced with "brief introduction", so that "personal introduction" becomes "personal brief introduction". The similarity between the replaced user directory name and the standard directory name is then computed, and this value replaces the previously obtained similarity value. If the similarity value is greater than or equal to the threshold, it is kept.
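The threshold and synonym fallback might look like the following sketch, shown for Mode 1 only. The function name `similarity_with_synonyms`, the threshold value 0.8, the standard directory name, and the Chinese strings (assumed originals of "personal introduction" and "brief introduction") are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sim_a(user, standard):
    return 2 * lcs_length(user, standard) / (len(user) + len(standard))


def similarity_with_synonyms(tokens, standard, synonyms, threshold):
    """tokens: segmented user directory name; synonyms: token -> synonym."""
    score = sim_a("".join(tokens), standard)
    if score >= threshold:
        return score  # at or above the threshold: keep the value
    # Below the threshold: swap in synonyms and recompute; the new value
    # replaces the old one, as the text describes.
    replaced = [synonyms.get(token, token) for token in tokens]
    return sim_a("".join(replaced), standard)


tokens = ["个人", "介绍"]      # assumed segmentation of "personal introduction"
synonyms = {"介绍": "简介"}    # "introduction" -> "brief introduction"
print(similarity_with_synonyms(tokens, "个人简介", synonyms, threshold=0.8))  # 1.0
```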
Further, the process of computing the standard directory names to which the user directory name preliminarily maps, based on directory content mapping, is shown in Fig. 2 and comprises:
Step S201: extract the final keyword set from the content under each standard directory name and the content under the annotated directory names that correspond to that standard directory name;
Step S202: take the final keyword set as the keyword set of both the user directories and the standard directories, compute the weights of the user directory keywords and the standard directory keywords, and form keyword weight vectors;
Step S203: based on the keyword weight vectors, compute the similarity between the user directory name and the standard directory names to obtain the standard directory names to which the user directory name preliminarily maps.
Specifically, in step S201, keywords are extracted using the TF/IDF (Term Frequency-Inverse Document Frequency) algorithm. This comprises: taking the directories that share a standard directory name, the annotated directory names that correspond to it, and the content under them as one directory set, named after the standard directory name; and combining the directory sets of all standard directory names under the encyclopedia category into one overall file set. For example, the "biographical information" directories of all person entries under the person category, together with their annotated directories and content, form one directory set. In more detail, suppose the entry "Liu Dehua" has a directory named "profile" whose content is abc, and the entry "Xu Song" also has a directory named "profile" whose content is efg; then the directory name "profile" and the contents abc and efg form one directory set, and "profile" is the name of this set.
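The grouping into directory sets can be illustrated as below. The data layout (nested dictionaries), the function name `build_directory_sets`, and the extra directories "main works" and "biographical information" with contents "xyz" and "hij" are hypothetical; only the two "profile" directories with contents abc and efg come from the example above.

```python
from collections import defaultdict


def build_directory_sets(entries, annotations, standard_names):
    """Group directory contents by the standard directory they belong to.

    entries: {entry title: {directory name: content}}
    annotations: {annotated user directory name: standard directory name}
    standard_names: set of standard directory names for the category
    """
    sets = defaultdict(list)
    for directories in entries.values():
        for name, content in directories.items():
            if name in standard_names:
                sets[name].append(content)
            elif name in annotations:
                # An annotated user directory joins its standard directory's set.
                sets[annotations[name]].append(content)
    return dict(sets)


entries = {
    "Liu Dehua": {"profile": "abc", "main works": "xyz"},
    "Xu Song": {"profile": "efg", "biographical information": "hij"},
}
annotations = {"biographical information": "profile"}
sets = build_directory_sets(entries, annotations, {"profile"})
print(sets)  # {'profile': ['abc', 'efg', 'hij']}
```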
Keyword extraction with the TF/IDF algorithm is shown in Fig. 3 and comprises:
S301: extract all keywords in the directory set of each standard directory, combine the directory sets of all standard directories into one file set, and use the TF/IDF algorithm to compute the weight of each keyword in each standard directory set;
S302: set a threshold, and take the keywords in each standard directory set whose TF/IDF value is higher than the threshold as the final standard directory keywords.
Preferably, keyword extraction for the directory sets uses as its data set only encyclopedia entries with more than 3 directories, and cases where a directory name and the content under it are inconsistent are filtered out as appropriate to reduce noise. According to statistics, many entries have only one directory, named "brief introduction", whose content covers personal information as well as personal experience, honors, and so on. Such directories should not be used as keyword extraction data, so that directory names correspond to the content under them as closely as possible. Subsequently, the content is processed by word segmentation and part-of-speech tagging, and the keywords of each directory set and the overall keywords are obtained through stop-word filtering, part-of-speech screening, and word-frequency screening.
Specifically, in step S202, the keyword weight vectors are also computed using the TF/IDF algorithm, as follows:
User directories and standard directories share the same keyword set. TF/IDF is used to compute the weights of all keywords for the user directories and the standard directories respectively, forming the user directory and standard directory keyword weight vectors. For example, the keyword vector of standard directory A is A = (x1, x2, x3, ..., xn), where xn is the weight of the n-th keyword in standard directory A. The dimension is the number of keywords, which is determined by the TF/IDF threshold: the higher the threshold, the fewer the keywords and the lower the dimension, and vice versa.
Taking the entertainment figure category as an example, the profile directories under all entries and their content form one directory set named "profile". The keyword vector under "profile" is computed first, for example (height, age). The final keywords of all standard directories are then computed; for example, the overall keywords of the "profile" and "honors" directories are (height, age, obtain, prize). Finally, the weights of the (height, age, obtain, prize) keyword vector are computed for all user directories. The weight is computed as follows:
Weight = (frequency of the keyword in the directory set / total word count of the directory set) × ln(total number of directories / number of directories in which the keyword appears) × sqrt(number of directories in the directory set in which the keyword appears / total number of directories in the directory set);
For example, suppose the total number of directories under the entertainment figure category is 50000, the profile directory appears in entries 300 times in total, and the content under the profile directory set contains 10000 words in total, of which 500 are the word "height". Suppose further that "height" appears in the directory content of 200 different entries, but in only 150 directories named "profile". The weight of "height" under the profile directory set is then w = (500/10000) × ln(50000/200) × sqrt(150/300) ≈ 0.195;
Here, sqrt(number of directories in the directory set in which the keyword appears / total number of directories in the directory set) is the weight adjustment factor applied to TF/IDF. It ensures two properties of keywords: 1) the more directories in a directory set a keyword appears in, the more representative it is (the best case being that the content under every same-named directory in the set contains the keyword); 2) the weights of a keyword under different directory sets become more distinguishable.
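The weight formula and the worked example can be reproduced with a few lines. The function and parameter names are illustrative only.

```python
import math


def keyword_weight(kw_count, total_words, total_dirs, dirs_with_kw,
                   set_dirs_with_kw, set_dirs):
    """TF/IDF weight with the square-root adjustment factor described above."""
    tf = kw_count / total_words                      # term frequency in the set
    idf = math.log(total_dirs / dirs_with_kw)        # natural log, as in "ln"
    adjust = math.sqrt(set_dirs_with_kw / set_dirs)  # share of the set's directories
    return tf * idf * adjust


# Numbers from the "height" example under the "profile" directory set.
w = keyword_weight(500, 10000, 50000, 200, 150, 300)
print(round(w, 3))  # 0.195
```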
Specifically, in step S203, the similarity between the standard directory keyword weight vector and the user directory keyword weight vector is finally computed; for example, the similarity between the keyword vectors of the standard directory name "profile" and the non-standard directory name "personal information" among the user directories. The formula is as follows:
$$\mathrm{sim}(A,B)=\frac{\sum_{k=1}^{n}A_k\times B_k}{\sqrt{\left(\sum_{k=1}^{n}A_k^{2}\right)\left(\sum_{k=1}^{n}B_k^{2}\right)}}$$
Here, A is the keyword weight vector of the standard directory name, B is the keyword weight vector of the non-standard directory name, and n is the number of keywords.
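Assuming the formula is the usual cosine similarity (with a square root over the product of the two sums of squares in the denominator), it can be sketched as follows. The example vectors are made-up weights, not data from the source, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length keyword weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same keyword set")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0  # all-zero vector: treat as dissimilar


standard = [0.20, 0.10, 0.00, 0.00]  # hypothetical weights for a standard directory
user = [0.10, 0.05, 0.00, 0.00]      # hypothetical weights for a user directory
print(cosine_similarity(standard, user))
```

Because the two example vectors point in the same direction, the result is 1 up to floating-point error.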
Preferably, the results are ranked by the computed similarity between the standard directory keyword weight vectors and the user directory keyword weight vector and, as above, the two highest-ranked standard directory names are taken as the standard directory names that preliminarily correspond to the user directory name.
Step S103: determine, by weighted voting, the standard directory to which the user directory name is finally mapped.
Specifically, after step S102 has obtained the standard directory names that preliminarily correspond to the user directory name, based respectively on the user directory name and the standard directory names and on the content under them, a weighted voting scheme is chosen according to the specific application.
The specific applications include the following cases, depending on the quality distribution of the encyclopedia entries. If user directory names are not very consistent with the content under them, then when weights are assigned, the directory content mapping is given a high weight and the directory name mapping a low weight. If the quality of both the user directory names and the directory content is poor, then when voting, a standard directory is taken as the final mapping only if the highest-similarity results of the directory name mapping and the directory content mapping are identical; otherwise, to preserve accuracy, the directory is considered not to map to any standard directory.
Weighting means scaling the two similarity results, namely the similarity between the user directory name and the standard directory name and the similarity between the content under the user directory name and the content under the standard directory, according to the relative importance of the directory name and the content under it. For example, if the directory name is considered more important than the content under it, the result of directory name mapping may be multiplied by 1 and the result of directory content mapping by 0.8.
Voting means determining the final similar standard directory from the preliminarily obtained similar standard directories. For example, if user directory name mapping finds that user directory name x corresponds to standard directories a and b, and user directory content mapping finds that x corresponds to standard directories c and d, then the standard directory finally obtained by voting is the most similar one among a, b, c, and d. As another example, if name mapping again yields a and b while content mapping yields a and c, then the standard directory finally mapped by voting is a.
More specifically, the voting mode is determined by the overall quality of the user directory names and the content under them. If the overall quality is high, the recall-expanding voting mode is used; if it is low, the precision-preserving voting mode is used. In the precision-preserving mode, a mapping result is taken as the final mapped standard directory only when user directory name mapping and directory content mapping yield the same result. For example, for the user directory name "person introduction" and the standard directory name "profile", the result is taken as final only when both name mapping and content mapping consider the two directory names similar; otherwise the "person introduction" directory is considered not to map to any standard directory. In the recall-expanding mode, when name mapping and content mapping yield no identical result, the preliminarily mapped standard directory with the higher similarity, provided it exceeds the set threshold, is taken as the final mapped standard directory.
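The weighting and voting described above can be sketched as follows. This is one interpretation, not the source's definitive procedure: the function name `weighted_vote`, the mode labels, the threshold of 0.5, and the rule of summing the two weighted scores to choose among shared candidates are assumptions. Agreement is checked on the candidate sets from each mapping rather than on the single top result. The weights 1 and 0.8 come from the example in the text.

```python
def weighted_vote(name_candidates, content_candidates,
                  name_weight=1.0, content_weight=0.8,
                  mode="precision", threshold=0.5):
    """Pick the final standard directory, or None if nothing qualifies.

    *_candidates: {standard directory: similarity} from each mapping
    (the top two of each in the text). mode: "precision" or "recall".
    """
    name = {d: s * name_weight for d, s in name_candidates.items()}
    content = {d: s * content_weight for d, s in content_candidates.items()}
    shared = set(name) & set(content)
    if shared:
        # Both mappings agree on at least one directory: take the best of those.
        return max(shared, key=lambda d: name[d] + content[d])
    if mode == "precision":
        return None  # no agreement: map to no standard directory
    # Recall mode: fall back to the single best candidate above the threshold.
    merged = {**name, **content}
    best = max(merged, key=merged.get, default=None)
    if best is not None and merged[best] > threshold:
        return best
    return None


print(weighted_vote({"a": 0.9, "b": 0.6}, {"a": 0.7, "c": 0.8}))                 # a
print(weighted_vote({"a": 0.9, "b": 0.6}, {"c": 0.7, "d": 0.5}))                 # None
print(weighted_vote({"a": 0.9, "b": 0.6}, {"c": 0.7, "d": 0.5}, mode="recall"))  # a
```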
Compared with the prior art, the directory mapping relationship mining method provided by the present invention brings the following technical effects:
In this embodiment, 6 key monitored categories of the encyclopedia were chosen as data sources. Because the degree of directory standardization differs from category to category, the technical effect also differs, as shown in the following table:
[Table: results for each of the 6 categories; provided only as images (BDA00003183449500091, BDA00003183449500101) in the source and not reproduced here]
Here, the algorithm recall rate in the table = number of occurrences of the directories recalled by the algorithm / total number of occurrences of directories in the category;
As the table above shows, the voting scheme described above achieves automatic correspondence between directories of the same type. Moreover, by constructing a directory mapping scheme and computing similarity along two dimensions, directory name similarity and the similarity of the content under the directory, the method effectively mines the overall directory mapping relationships.
According to another aspect of the present invention, a directory mapping relationship mining device is also provided. Please refer to Fig. 4, which is a schematic block diagram of a directory mapping relationship mining device according to a preferred embodiment of the present invention. As shown in the figure, the device comprises:
a mapping data establishing module 401, configured to take the full set of entries under a single category in the entry system as the entries to be mapped, and to take annotation data and a synonym table as the mapping dictionary;
a directory and content mapping module 402, configured to perform user directory name mapping and directory content mapping separately and to determine the standard directories that preliminarily correspond to a user directory name;
a final mapped directory determining module 403, configured to determine, by weighted voting, the standard directory to which the user directory name is finally mapped.
The specific operation of each of the above modules is described in detail below.
Specifically, the mapping data establishing module establishes the basic entries to be mapped in the entry system and the mapping dictionary used for computing mapping relationships. The entry system contains the full set of entries under each of a plurality of single categories. For example, an encyclopedia entry system contains the full set of entries under categories such as entertainment figures and cartoon characters, and each entry contains a plurality of directory names together with the content under each directory name. Taking the single category of entertainment figures as an example, its entries contain directory names such as profile, performing career, main works, and honors, together with the content under each of them.
As stated in the background art, the directories under an entry category are divided into standard directories and user directories. Annotation data records mapping relationships between standard directories and user directories and takes the form of "user directory-standard directory" mapping pairs. Annotation data is usually produced by manual annotation, and each encyclopedia entry category has about 100 annotated pairs, for example: singer's personal information-profile.
The synonym table is a data table of synonym sets. It is not specific to any entry category and is usually expressed as synonym pairs, such as "introduction-brief introduction" and "describe-depict".
The purpose of this embodiment is to mine the mapping relationships between all user directories under an entry category and the standard directories. Therefore, the full set of entries under a single category in the entry system is taken as the entries to be mapped, annotation data and the synonym table are taken as the mapping dictionary, and the subsequent computations determine the specific standard directory in the mapping dictionary to which each user directory under the entry category corresponds.
As noted above, user directories include both standard directories and non-standard directories; this embodiment mainly processes the non-standard directories created by users.
Wherein, catalogue and content map module, mainly based on User Catalog title and standard directories, and content two aspects under content and the standard directories title under the User Catalog title, calculate the standard directories title of the final mapping of User Catalog title.
Further, the device also comprises a preprocessing module, configured to preprocess the user directory name by word segmentation and part-of-speech filtering before the similarity between the user directory name and the standard directory is computed. The preprocessing module segments the user directory name into words and filters out meaningless tokens according to part of speech, such as punctuation marks, conjunctions and interjections. The remaining tokens are spliced together, and the result replaces the original directory name. For example, for the user directory name "1. about personage introduction", segmentation yields: 1/./about/personage//introduction; part-of-speech filtering then leaves: personage/introduction. The filtered results "personage" and "introduction" are spliced into "personage introduction", which replaces the original directory name "1. about personage introduction" and is the name finally used to compute directory name similarity.
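The preprocessing step can be sketched as follows. This is a minimal illustration that assumes segmentation and part-of-speech tagging have already been done by some external tool; the function name `preprocess_directory_name`, the tag labels, and the set of filtered tags are hypothetical and are not specified in the source.

```python
# Part-of-speech tags treated as meaningless. The source names punctuation,
# conjunctions and interjections "etc."; the remaining tags are assumptions.
DROPPED_POS = {"number", "punctuation", "preposition", "particle",
               "conjunction", "interjection"}


def preprocess_directory_name(tagged_tokens):
    """Filter tokens by part of speech and splice the survivors together.

    tagged_tokens: list of (word, pos) pairs produced by a segmenter/tagger.
    """
    kept = [word for word, pos in tagged_tokens
            if word and pos not in DROPPED_POS]
    # English tokens are joined with a space; for Chinese text the tokens
    # would be concatenated directly.
    return " ".join(kept)


tagged = [("1", "number"), (".", "punctuation"), ("about", "preposition"),
          ("personage", "noun"), ("introduction", "noun")]
print(preprocess_directory_name(tagged))  # personage introduction
```

The cleaned name, not the raw one, would then be passed to the name similarity computation.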
In this embodiment, the longest common subsequence (LCS) algorithm is used as the basic algorithm for computing the similarity between a user directory name and a standard directory name. The longest common subsequence is the longest among all common subsequences of two strings. For example, given the strings "abac" and "caba", their longest common subsequence is "aba". This embodiment does not restrict how the longest common subsequence is solved; a variety of algorithms, including dynamic programming and suffix-tree algorithms, may be used.
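As one possible realization, the dynamic programming solution mentioned above can be sketched as follows. The function name is illustrative, and the source does not prescribe this particular implementation.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    chars = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


print(longest_common_subsequence("abac", "caba"))  # aba
```

The table costs O(n*m) time and memory, which is negligible for short directory names.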
Further, for user directory names whose meaning is unchanged when their segmented parts are reordered, the similarity between the user directory name and the standard directory is computed with two LCS passes, one forward and one reverse. For example, in the two names "pathology etiology" and "etiology pathology", the positions of "pathology" and "etiology" are swapped, but the meaning is the same before and after the swap. For such cases, forward and reverse LCS computations are performed on top of the basic LCS algorithm. If the results of the two passes do not overlap at their positions in the original input, the LCS length between the two directory names is adjusted to twice the original LCS length.
Specifically, for the two names "pathology etiology" and "etiology pathology", a first LCS pass yields the longest common subsequence "pathology", which is 4 bytes long; a second LCS pass over "etiology pathology" and "pathology etiology", with the order of the two names reversed, yields "etiology", which is also 4 bytes long. Because the two results "pathology" and "etiology" do not overlap at their original input positions (in "pathology etiology", "pathology" occupies bytes 1-4 and "etiology" occupies bytes 5-8), the longest common subsequence length of "pathology etiology" and "etiology pathology" is taken to be 8 bytes.
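The two-pass computation can be sketched as follows. The source does not spell out how the reverse pass is carried out; this sketch assumes it means running the same LCS routine with the two names swapped, and it assumes "no overlap" is checked on matched character positions in both names. Lengths are counted in characters rather than bytes, and the two-letter codes "PA" and "ET" are stand-ins for the two-character words "pathology" and "etiology".

```python
def lcs_alignment(a, b):
    """Return matched index pairs (i, j) of one LCS of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def two_pass_lcs_length(a, b):
    """LCS length, doubled when forward and reverse passes do not overlap."""
    forward = lcs_alignment(a, b)                       # (index in a, index in b)
    reverse = [(i, j) for j, i in lcs_alignment(b, a)]  # swapped back to (a, b)
    length = len(forward)
    a_disjoint = {i for i, _ in forward}.isdisjoint(i for i, _ in reverse)
    b_disjoint = {j for _, j in forward}.isdisjoint(j for _, j in reverse)
    if length > 0 and a_disjoint and b_disjoint:
        return 2 * length
    return length


print(two_pass_lcs_length("PAET", "ETPA"))  # 4
print(two_pass_lcs_length("abc", "abc"))    # 3
```

Requiring disjoint positions in both names keeps the adjusted length from exceeding the length of either name.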
Because directory name similarity can be generalized to similarity between short texts, the similarity of two directory names is proportional to the length of the content they share and inversely proportional to the length of the content they do not share.
On the basis of the basic LCS algorithm, the similarity between the user directory name and the standard directory name is computed in the following two approaches:
Approach one: directly compute the similarity between the user directory and a standard directory under the classification system by the following formula:
simA = (LCS length of the user directory name and the standard directory name * 2) / (sum of the lengths of the user directory name and the standard directory name);
Approach two: indirectly compute the similarity between the user directory and the standard directory name based on the labeled data:
simB = (LCS length of the user directory name and the labeled directory name * 2) / (MAX(length of the user directory name, length of the labeled directory name, length of the standard directory to which the labeled directory name is mapped) * 2);
Here, the labeled directory name is the user directory in a labeled pair; for example, in the labeled pair "biographical information - profile", it is "biographical information".
Preferably, after the similarities between the user directory name and the standard directory names are computed with the two approaches above, the results of each approach are ranked by similarity, and the standard directory names with the two highest similarities from each approach are taken as the standard directory names to which the user directory name preliminarily corresponds. For example, suppose the labeled data contains "people information - profile", the user directory name is "personal information", the standard directory name is "profile", and the labeled directory is "people information". When the similarity between the user directory "personal information" and the standard directory is computed, approach one directly computes the name similarity of "personal information" and "profile", and approach two indirectly computes the name similarity of "personal information" and "profile" via the labeled directory "people information". In this example, the direct computation gives 2*2/(8+8) = 0.25, and the indirect computation gives 6*2/(8*2) = 0.75.
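The two similarity formulas and the top-two selection can be sketched as follows. The function names are illustrative, lengths are counted in characters rather than bytes, and the short letter strings are placeholders rather than the directory names of the example above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sim_a(user_name, standard_name):
    """Direct similarity between a user name and a standard name."""
    total = len(user_name) + len(standard_name)
    return 2 * lcs_length(user_name, standard_name) / total if total else 0.0


def sim_b(user_name, labeled_name, mapped_standard_name):
    """Indirect similarity via a labeled name mapped to a standard name."""
    longest = max(len(user_name), len(labeled_name), len(mapped_standard_name))
    return 2 * lcs_length(user_name, labeled_name) / (2 * longest) if longest else 0.0


def top_two(scores):
    """Standard directory names with the two highest similarity scores."""
    return sorted(scores, key=scores.get, reverse=True)[:2]


print(sim_a("abcd", "abef"))          # 0.5
print(sim_b("abcd", "xbcd", "abef"))  # 0.75
print(top_two({"profile": 0.75, "honors": 0.1, "works": 0.3}))  # ['profile', 'works']
```

In `sim_b` the score is credited to the standard directory that the labeled name maps to, so a user name that resembles a known labeled name inherits that name's mapping.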
Further, a similarity threshold is set, and the similarity values obtained with the two approaches are each compared with the threshold. If an obtained similarity value is less than the threshold, the segmented words in the user directory name are replaced with their synonyms in the synonym table; for example, the word "introduction" in "personal introduction" is replaced with "brief introduction", so that "personal introduction" becomes "personal brief introduction". The similarity between the replaced user directory name and the standard directory is then computed, and this value replaces the previously obtained similarity value. If the obtained similarity value is greater than or equal to the threshold, it is kept.
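The threshold check with synonym replacement can be sketched as follows, using the direct similarity formula as the scoring function. The function names, the threshold values and the sample synonym pair are assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def name_similarity(a, b):
    """2 * LCS length divided by the sum of the two name lengths."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 0.0


def similarity_with_synonyms(tokens, standard_name, synonyms, threshold):
    """Score the name; below the threshold, retry with synonyms substituted."""
    score = name_similarity("".join(tokens), standard_name)
    if score >= threshold:
        return score  # good enough: keep the original value
    replaced = [synonyms.get(token, token) for token in tokens]
    # The recomputed value replaces the original one, as described above.
    return name_similarity("".join(replaced), standard_name)


synonyms = {"introduction": "profile"}  # hypothetical synonym pair
tokens = ["person", "introduction"]
print(similarity_with_synonyms(tokens, "personprofile", synonyms, 0.8))  # 1.0
```

With a lower threshold the first score is accepted and the synonym table is never consulted.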
Further, computing the standard directory name to which the user directory name is preliminarily mapped based on directory content mapping specifically comprises:
i) extracting a final keyword set from the content under the standard directory names and the content under the labeled directory names corresponding to the standard directory names;
ii) using the final keyword set as the keyword set of both the user directory and the standard directory, computing the weights of the user directory keywords and the standard directory keywords, and forming keyword weight vectors;
iii) computing, based on the keyword weight vectors, the similarity between the user directory name and the standard directory name, to obtain the standard directory name to which the user directory name is preliminarily mapped.
Specifically, in step i), the keywords are extracted using the TF/IDF (Term Frequency-Inverse Document Frequency) algorithm, as follows: directories with the same standard directory name, the labeled directory names corresponding to it, and the content under them are taken as one directory set, which is named after the standard directory name; the directory sets corresponding to all standard directory names under an encyclopedia category together form the total document set. For example, the profile directories in all person entries under the person category, together with the corresponding labeled directories and their content, form one directory set. In more detail, if the entry "Liu Dehua" has a "profile" directory whose content is abc, and the entry "Xu Song" also has a "profile" directory whose content is efg, then the directory name "profile" and the contents abc and efg form one directory set, and "profile" is the name of that directory set.
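The grouping of directory content into directory sets can be sketched as follows. The data layout (a dictionary of entries, each mapping directory names to content) and the function name are assumptions made for illustration; the extra "biographical information" directory is a hypothetical labeled directory added to show how labeled names are folded into a set.

```python
from collections import defaultdict


def build_directory_sets(entries, standard_names, labeled_pairs):
    """Group directory content by standard directory name.

    entries:        {entry name: {directory name: content}}
    standard_names: set of standard directory names
    labeled_pairs:  {labeled user directory name: standard directory name}
    """
    directory_sets = defaultdict(list)
    for directories in entries.values():
        for name, content in directories.items():
            if name in standard_names:
                directory_sets[name].append(content)
            elif name in labeled_pairs:
                # A labeled user directory joins the set of its standard name.
                directory_sets[labeled_pairs[name]].append(content)
            # Unlabeled, non-standard directories are the ones still to be mapped.
    return dict(directory_sets)


entries = {
    "Liu Dehua": {"profile": "abc"},
    "Xu Song": {"profile": "efg", "biographical information": "hij"},
}
sets = build_directory_sets(entries, {"profile"},
                            {"biographical information": "profile"})
print(sets)  # {'profile': ['abc', 'efg', 'hij']}
```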
The keyword extraction with the TF/IDF algorithm specifically comprises:
a) extracting all keywords under the directory set corresponding to each standard directory, forming a document set from the directory sets corresponding to all standard directories, and using the TF/IDF algorithm to compute the weight of each keyword under each standard directory set;
b) setting a threshold, and extracting, from the keywords of a standard directory set, those whose TF/IDF value is higher than the threshold as the final standard directory keywords.
Preferably, keyword extraction for the directory sets uses encyclopedia entries with more than 3 directories as the data set, and cases where a directory does not match the content under it are filtered out as appropriate to reduce noise. According to statistics, many entries have only one directory, named "brief introduction", whose content covers personal information as well as personal experience, honors and so on. Such directories should not be used as keyword extraction data, so that directory names correspond to the content under them as far as possible. In the subsequent process, the content is processed by word segmentation and part-of-speech tagging, and the keywords of each directory set and the overall keywords are obtained through stop-word filtering, part-of-speech screening and word-frequency screening.
Specifically, in step ii), the keyword weight vectors are also computed using the TF/IDF algorithm, as follows:
The user directories and the standard directories share the same keyword set. TF/IDF is used to compute the weights of all user directory keywords and standard directory keywords respectively, forming the user directory and standard directory keyword weight vectors. For example, the keyword vector of standard directory A is A = (x1, x2, x3, ..., xn), where xn is the weight of the n-th keyword in standard directory A. The dimension is the number of keywords, which is determined by the TF/IDF threshold: the higher the threshold, the fewer the keywords and the lower the dimension, and vice versa.
Taking the entertainment figure category as an example, the profile directories under all entries and their content form one directory set named "profile". First the keyword vector under profile is computed, for example (height, age). Then the final keywords of all standard directories are computed; for example, the overall keywords of the profile and honors directories are (height, age, award). Finally, the weights of the keyword vector (height, age, award) are computed for every user directory. The weight is computed as follows:
Weight = (frequency of the keyword in the directory set / total word count of the directory set) * ln(total number of directories / number of directories in which the keyword appears) * sqrt(number of directories in the directory set in which the keyword appears / total number of directories in the directory set);
For example, suppose the total number of directories under the entertainment figure category is 50000, the profile directory appears in entries 300 times in total, the content under the profile directory set contains 10000 words in total, of which 500 are the word "height", and "height" appears in the directory content of 200 different entries but in only 150 directories named profile. Then the weight of "height" under the profile directory set is: w = (500/10000) * ln(50000/200) * sqrt(150/300) ≈ 0.195;
Here, sqrt(number of directories in the directory set in which the keyword appears / total number of directories in the directory set) is a weight adjustment factor for TF/IDF. It ensures two properties of a keyword: 1) the more directories in the directory set the keyword appears in, the more representative it is (in the best case, the content under every same-named directory in a directory set contains the keyword); 2) the keyword's weights under different directory sets become more distinguishable.
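The weight formula and the threshold-based keyword selection can be sketched as follows. The function and parameter names are illustrative, the logarithm is taken to be the natural logarithm, and the 0.1 threshold and the second keyword's weight are made-up values.

```python
import math


def keyword_weight(term_count, total_words, total_directories,
                   directories_with_term, set_directories_with_term,
                   set_directories):
    """TF * IDF * adjustment factor for one keyword in one directory set."""
    tf = term_count / total_words
    idf = math.log(total_directories / directories_with_term)
    adjust = math.sqrt(set_directories_with_term / set_directories)
    return tf * idf * adjust


def select_keywords(weights, threshold):
    """Keep only the keywords whose weight is above the threshold."""
    return {word: w for word, w in weights.items() if w > threshold}


# Figures from the "height" example above.
w = keyword_weight(500, 10000, 50000, 200, 150, 300)
print(round(w, 3))  # 0.195
print(select_keywords({"height": w, "the": 0.01}, 0.1))
```

Raising the threshold passed to `select_keywords` shrinks the keyword set and therefore the dimension of the weight vectors.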
Specifically, in step iii), the similarity between the standard directory keyword weight vector and the user directory keyword weight vector is finally computed; for example, the similarity between the keyword vectors of the standard directory name "profile" and the non-standard directory name "personal information" among the user directories. The formula is as follows:
$$\mathrm{sim}(A, B) = \frac{\sum_{k=1}^{n} A_k \times B_k}{\sqrt{\left(\sum_{k=1}^{n} A_k^2\right)\left(\sum_{k=1}^{n} B_k^2\right)}}$$
Here, A is the keyword weight vector of the standard directory name, and B is the keyword weight vector of the non-standard directory name.
Preferably, the results are ranked by the final similarity between the standard directory keyword weight vectors and the user directory keyword weight vector and, as above, the standard directory names with the two highest similarities are taken as the standard directory names to which the user directory name preliminarily corresponds.
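The vector similarity can be sketched as follows, on the assumption that the formula above is the standard cosine similarity between two weight vectors of equal dimension. The function name and the sample weights are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length keyword weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same keyword set")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # A directory containing none of the keywords has a zero vector.
    return dot / norm if norm else 0.0


standard_profile = [0.19, 0.05, 0.0]  # weights for (height, age, award)
user_directory = [0.15, 0.04, 0.01]   # hypothetical user directory weights
print(cosine_similarity(standard_profile, user_directory))
```

Because both vectors are built over the same final keyword set, position k always refers to the same keyword in A and in B.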
Further, depending on the specific application, the final mapped directory determining module uses different weighted voting schemes to determine the standard directory to which the user directory name is finally mapped. The specific applications include the following cases, according to the quality distribution of the encyclopedia entries: if user directory names are often inconsistent with the content under them, then when weights are assigned the directory content mapping is given a high weight and the directory name mapping a low weight; if the quality of both the user directory names and the directory content is poor, then in voting a standard directory is taken as the final mapping only when the highest-similarity results of the directory name mapping and the directory content mapping are the same; otherwise, to preserve accuracy, the directory is considered not to map to any standard directory.
Here, weighting means scaling the similarity results for the user directory name versus the standard directory, and for the content under the user directory name versus the content under the standard directory, according to the relative importance of the directory name and the content under it. For example, if the directory name is considered more important than the content under it, the result of the directory name mapping may be multiplied by 1 and the result of the directory content mapping by 0.8.
Here, voting means determining the final similar standard directory from the similar standard directories obtained preliminarily. For example, if user directory name mapping finds that user directory name x corresponds to standard directories a and b, and user directory content mapping finds that x corresponds to standard directories c and d, then the standard directory finally obtained by voting is the most similar one among a, b, c and d. As another example, if user directory name mapping again finds that x corresponds to a and b, and user directory content mapping finds that x corresponds to a and c, then the standard directory finally obtained by voting is a.
More specifically, the voting scheme is determined according to the overall quality of the user directory names and the content under them. If the overall quality is high, a recall-expanding voting scheme is used; if the overall quality is low, a precision-preserving voting scheme is used. Under the precision-preserving scheme, a mapping result is taken as the final mapped standard directory only when user directory name mapping and directory content mapping yield the same result. For example, for the user directory name "personage introduction" and the standard directory name "profile", the result is taken as final only when both the name mapping and the content mapping consider the two directory names similar; otherwise the "personage introduction" directory is considered not to map to any standard directory. Under the recall-expanding scheme, when the results of the name mapping and the content mapping do not coincide, the preliminarily mapped standard directory whose similarity is the higher of the two and exceeds the set threshold is taken as the final mapped standard directory.
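The weighted voting can be sketched as follows. The source leaves several details open, so this sketch makes assumptions: each mapping supplies its top-two candidates as (standard directory, similarity) pairs, the precision-preserving scheme compares the best candidate of each mapping, and the recall-expanding scheme prefers a candidate proposed by both mappings before falling back to the single best weighted score. The function name, the mode labels and the default threshold are hypothetical.

```python
def vote(name_candidates, content_candidates, name_weight=1.0,
         content_weight=0.8, mode="precision", threshold=0.5):
    """Pick the final standard directory, or None if nothing qualifies."""
    name_scores = {d: s * name_weight for d, s in name_candidates}
    content_scores = {d: s * content_weight for d, s in content_candidates}
    if not name_scores or not content_scores:
        return None
    best_name = max(name_scores, key=name_scores.get)
    best_content = max(content_scores, key=content_scores.get)

    if mode == "precision":
        # Accept only when both mappings agree on their best candidate.
        return best_name if best_name == best_content else None

    # Recall-expanding mode: a candidate proposed by both mappings wins.
    shared = sorted(set(name_scores) & set(content_scores))
    if shared:
        return max(shared, key=lambda d: name_scores[d] + content_scores[d])
    # Otherwise take the highest weighted score if it clears the threshold.
    combined = {**name_scores, **content_scores}
    best = max(combined, key=combined.get)
    return best if combined[best] > threshold else None


by_name = [("a", 0.9), ("b", 0.6)]
print(vote(by_name, [("a", 0.7), ("c", 0.8)], mode="recall"))      # a
print(vote(by_name, [("c", 0.95), ("d", 0.5)], mode="recall"))     # a
print(vote(by_name, [("c", 0.95), ("d", 0.5)], mode="precision"))  # None
```

In the second call the content candidate c scores 0.95 * 0.8 = 0.76 after weighting, so the name candidate a (0.9) is chosen.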
The directory mapping relationship mining device provided by the present invention has the following advantage: by mining directory association relationships, other directories under an entry category that map to standard directories are identified and corrected, which effectively improves the overall quality of entries.
What is disclosed above is only a preferred embodiment of the present invention and certainly cannot be used to limit the scope of rights of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (16)

1. A directory mapping relationship mining method, comprising the following steps:
a) taking the full set of entries under a single category in the entry system as the entries to be mapped, and using the labeled data and the synonym table as the mapping dictionary;
b) performing user directory name mapping and directory content mapping separately, and determining the standard directories to which a user directory name preliminarily corresponds;
c) determining, by weighted voting, the standard directory to which the user directory name is finally mapped.
2. The method according to claim 1, wherein step b) further comprises: before the similarity between the user directory name and the standard directory is computed, preprocessing the user directory name by word segmentation and part-of-speech filtering.
3. The method according to claim 1, wherein step b) specifically comprises: for a user directory name whose meaning is unchanged when its segmented parts are reordered, computing the similarity between the user directory name and the standard directory with two longest common subsequence passes, one forward and one reverse.
4. The method according to claim 1, wherein step b) specifically comprises: computing the similarity between the user directory name and the standard directory name in the following two approaches:
Approach one: directly compute the similarity between the user directory and a standard directory under the classification system by the following formula:
simA = (LCS length of the user directory name and the standard directory name * 2) / (sum of the lengths of the user directory name and the standard directory name);
Approach two: indirectly compute the similarity between the user directory and the standard directory name based on the labeled data:
simB = (LCS length of the user directory name and the labeled directory name * 2) / (MAX(length of the user directory name, length of the labeled directory name, length of the standard directory to which the labeled directory name is mapped) * 2);
wherein simA and simB denote the similarities computed with approach one and approach two respectively, and the labeled directory name is the user directory in a labeled pair.
5. The method according to claim 1, wherein step b) specifically comprises:
i) extracting a final keyword set from the content under the standard directory names and the content under the labeled directory names corresponding to the standard directory names;
ii) using the final keyword set as the keyword set of both the user directory and the standard directory, computing the weights of the user directory keywords and the standard directory keywords, and forming keyword weight vectors;
iii) computing, based on the keyword weight vectors, the similarity between the user directory name and the standard directory name, to obtain the standard directory name to which the user directory name is preliminarily mapped.
6. The method according to claim 5, wherein the extraction of the keywords specifically comprises:
a) extracting all keywords under the directory set corresponding to each standard directory, forming a document set from the directory sets corresponding to all standard directories, and using the TF/IDF algorithm to compute the weight of each keyword under each standard directory set;
b) setting a threshold, and extracting, from the keywords of a standard directory set, those whose TF/IDF value is higher than the threshold as the final standard directory keywords.
7. The method according to claim 6, wherein the weight is computed by the following formula:
Weight = (frequency of the keyword in the directory set / total word count of the directory set) * ln(total number of directories / number of directories in which the keyword appears) * sqrt(number of directories in the directory set in which the keyword appears / total number of directories in the directory set);
wherein Weight denotes the weight of a keyword under each standard directory set.
8. The method according to claim 1, wherein step c) further comprises:
determining the voting scheme according to the overall quality of the user directory names and the content under them;
if the overall quality is high, using a recall-expanding voting scheme; if the overall quality is low, using a precision-preserving voting scheme.
9. A directory mapping relationship mining device, comprising:
a mapping data establishing module, configured to take the full set of entries under a single category in the entry system as the entries to be mapped, and to use the labeled data and the synonym table as the mapping dictionary;
a directory and content mapping module, configured to perform user directory name mapping and directory content mapping separately, and to determine the standard directories to which a user directory name preliminarily corresponds;
a final mapped directory determining module, configured to determine, by weighted voting, the standard directory to which the user directory name is finally mapped.
10. The device according to claim 9, wherein the device further comprises a preprocessing module, configured to preprocess the user directory name by word segmentation and part-of-speech filtering before the similarity between the user directory name and the standard directory is computed.
11. The device according to claim 9, wherein the working process of the directory and content mapping module specifically comprises: for a user directory name whose meaning is unchanged when its segmented parts are reordered, computing the similarity between the user directory name and the standard directory with two longest common subsequence passes, one forward and one reverse.
12. The device according to claim 9, wherein the working process of the directory and content mapping module specifically comprises: computing the similarity between the user directory name and the standard directory name in the following two approaches:
Approach one: directly compute the similarity between the user directory and a standard directory under the classification system by the following formula:
simA = (LCS length of the user directory name and the standard directory name * 2) / (sum of the lengths of the user directory name and the standard directory name);
Approach two: indirectly compute the similarity between the user directory and the standard directory name based on the labeled data:
simB = (LCS length of the user directory name and the labeled directory name * 2) / (MAX(length of the user directory name, length of the labeled directory name, length of the standard directory to which the labeled directory name is mapped) * 2);
wherein simA and simB denote the similarities computed with approach one and approach two respectively, and the labeled directory name is the user directory in a labeled pair.
13. The device according to claim 9, wherein the working process of the directory and content mapping module specifically comprises:
i) extracting a final keyword set from the content under the standard directory names and the content under the labeled directory names corresponding to the standard directory names;
ii) using the final keyword set as the keyword set of both the user directory and the standard directory, computing the weights of the user directory keywords and the standard directory keywords, and forming keyword weight vectors;
iii) computing, based on the keyword weight vectors, the similarity between the user directory name and the standard directory name, to obtain the standard directory name to which the user directory name is preliminarily mapped.
14. The device according to claim 13, wherein the extraction of the keywords specifically comprises:
a) extracting all keywords under the directory set corresponding to each standard directory, forming a document set from the directory sets corresponding to all standard directories, and using the TF/IDF algorithm to compute the weight of each keyword under each standard directory set;
b) setting a threshold, and extracting, from the keywords of a standard directory set, those whose TF/IDF value is higher than the threshold as the final standard directory keywords.
15. The device according to claim 14, wherein the weight is computed by the following formula:
Weight = (frequency of the keyword in the directory set / total word count of the directory set) * ln(total number of directories / number of directories in which the keyword appears) * sqrt(number of directories in the directory set in which the keyword appears / total number of directories in the directory set);
wherein Weight denotes the weight of a keyword under each standard directory set.
16. The device according to claim 9, wherein the working process of the final mapped directory determining module further comprises:
determining the voting scheme according to the overall quality of the user directory names and the content under them;
if the overall quality is high, using a recall-expanding voting scheme; if the overall quality is low, using a precision-preserving voting scheme.
CN201310175569.7A 2013-05-13 2013-05-13 Directory mapping relationship mining device and directory mapping relationship mining device Active CN103294780B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310175569.7A CN103294780B (en) 2013-05-13 2013-05-13 Directory mapping relationship mining device and directory mapping relationship mining device

Publications (2)

Publication Number Publication Date
CN103294780A true CN103294780A (en) 2013-09-11
CN103294780B CN103294780B (en) 2017-02-08

Family

ID=49095642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310175569.7A Active CN103294780B (en) 2013-05-13 2013-05-13 Directory mapping relationship mining device and directory mapping relationship mining device

Country Status (1)

Country Link
CN (1) CN103294780B (en)


Also Published As

Publication number Publication date
CN103294780B (en) 2017-02-08

Similar Documents

Publication Publication Date Title
CN110543574B (en) Knowledge graph construction method, device, equipment and medium
US11151179B2 (en) Method, apparatus and electronic device for determining knowledge sample data set
CN105426514B (en) Personalized mobile application APP recommended method
US10248715B2 (en) Media content recommendation method and apparatus
CN107832229A (en) Automatic system test case generation method based on NLP
CN105468605A (en) Entity information map generation method and device
US20060200464A1 (en) Method and system for generating a document summary
US20110106805A1 (en) Method and system for searching multilingual documents
CN102915299A (en) Word segmentation method and device
CN102043843A (en) Method and obtaining device for obtaining target entry based on target application
CN111259271A (en) Comment information display method and device, electronic equipment and computer readable medium
Wang et al. Keyword extraction from online product reviews based on bi-directional LSTM recurrent neural network
CN102760142A (en) Method and device for extracting subject label in search result aiming at searching query
CN110413738A (en) Information processing method, device, server and storage medium
CN103927309A (en) Method and device for marking information labels for business objects
CN103106287A (en) Processing method and processing system for retrieving sentences by user
CN105677725A (en) Preset parsing method for tourism vertical search engine
CN108875743B (en) Text recognition method and device
US9870433B2 (en) Data processing method and system of establishing input recommendation
CN110209781B (en) Text processing method and device and related equipment
CN111538830B (en) French searching method, device, computer equipment and storage medium
CN103294780A (en) Directory mapping relationship mining device and directory mapping relationship mining device
CN105404677A (en) Tree structure based retrieval method
WO2021139242A1 (en) Presentation file generation method, apparatus, and device and storage medium
CN103853763B (en) Method and apparatus for obtaining information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant