CN102169495A - Industry dictionary generating method and device - Google Patents

Info

Publication number
CN102169495A
Authority
CN
China
Prior art keywords
industry
term
candidate
document
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100896985A
Other languages
Chinese (zh)
Other versions
CN102169495B (en)
Inventor
何伟平
王名悠
吴永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qunar Cayman Islands Co Ltd
Original Assignee
Qunar Cayman Islands Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qunar Cayman Islands Co Ltd
Priority to CN201110089698.5A
Publication of CN102169495A
Application granted
Publication of CN102169495B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for generating an industry dictionary. The method comprises: obtaining, according to initial industry terms, a document collection corresponding to those terms; obtaining candidate terms from the document collection; performing industry relevance analysis on the candidate terms to obtain relevant candidate terms; performing co-occurrence analysis and association mining on the relevant candidate terms to generate industry vocabulary; and adding the industry vocabulary to the industry dictionary. With this technical solution an industry dictionary can be generated automatically, which addresses the high cost and low efficiency of manually collecting industry vocabulary in the prior art.

Description

Industry dictionary generating method and device
Technical field
The present invention relates to data mining technology, and in particular to a method and a device for generating an industry dictionary.
Background
An industry dictionary is a collection of the terms and idioms of a particular industry, expressed in minimal linguistic units; examples include a machinery industry dictionary and a tourism industry dictionary. In the prior art, the techniques closest to industry dictionary generation are feature selection for text classification and domain ontology construction.
Feature selection is one of the most important ways of reducing the dimensionality of the feature space in a text classification system. The text in the training set is first segmented into words, the frequency of each word in the training set is counted, and a feature selection algorithm then picks some of the words as the features used to train the classifier. Common feature selection algorithms include mutual information, document frequency, the chi-square test and information gain. The features selected for training the classifier resemble the vocabulary of an industry dictionary. However, feature selection serves classification: its main goals are to cope with excessively high data dimensionality and to improve the generalization ability of the classification model. The words it selects are therefore not precise enough for an industry dictionary, which demands large capacity and high precision, so feature selection for text classification cannot be used directly to generate an industry dictionary.
An ontology is a representation of domain knowledge that describes the entities of the objective world systematically and facilitates the reuse and exchange of knowledge. Domain ontology construction focuses on discovering the concepts of a domain and the relations among them. Ontologies are usually created by domain experts. Current automated construction of a domain ontology generally comprises: data processing, in which natural language processing such as word segmentation and part-of-speech tagging is applied to the text; concept extraction, in which concepts are extracted by linguistic rules (for example part-of-speech patterns) or by statistical algorithms; and semantic relation extraction, in which the relations between concepts are determined by grammar rules and similar means. It follows that domain ontology construction relies mainly on manually defined rules or on training over a large corpus. Manually defined rules are fixed and give low recall, while corpus training requires preparing a large amount of corpus data, which is time-consuming and laborious. In addition, domain ontology construction must establish the relations among the concepts of the ontology, which makes automation considerably harder. For these reasons, existing domain ontology construction techniques cannot be used directly to generate an industry dictionary either.
In the prior art, industry dictionaries are therefore compiled mainly by manual collection, which is costly and inefficient. A technical solution that generates an industry dictionary automatically is urgently needed to overcome this defect.
Summary of the invention
The invention provides a method and a device for generating an industry dictionary, so as to generate the industry dictionary, improve the efficiency of generating it and reduce its production cost.
The invention provides an industry dictionary generating method, comprising:
obtaining, according to initial industry terms, a document collection corresponding to the initial industry terms;
obtaining candidate terms according to the document collection;
performing industry relevance analysis on the candidate terms to obtain relevant candidate terms;
performing co-occurrence analysis and association mining on the relevant candidate terms to generate industry vocabulary; and
adding the industry vocabulary to the industry dictionary.
The invention provides an industry dictionary generating device, comprising:
a first acquisition module, configured to obtain, according to initial industry terms, a document collection corresponding to the initial industry terms;
a second acquisition module, configured to obtain candidate terms according to the document collection;
a third acquisition module, configured to perform industry relevance analysis on the candidate terms to obtain relevant candidate terms;
a generation module, configured to perform co-occurrence analysis and association mining on the relevant candidate terms to generate industry vocabulary; and
an adding module, configured to add the industry vocabulary to the industry dictionary.
With the method and device provided by the invention, a corresponding document collection is obtained according to initial industry terms, candidate terms are obtained from the document collection, and the candidate terms are processed by industry relevance analysis, co-occurrence analysis and association mining to generate industry vocabulary, which is added to the industry dictionary. The technical solution generates the industry dictionary from the initial industry terms and the corresponding document collection. Compared with the prior art, it generates industry vocabulary automatically, without manual searching, which improves the efficiency of generating the industry dictionary and reduces production cost.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the industry dictionary generating method provided by Embodiment 1 of the invention;
Fig. 2 is a flowchart of the industry dictionary generating method provided by Embodiment 2 of the invention;
Fig. 3 is a schematic structural diagram of the industry dictionary generating device provided by Embodiment 3 of the invention.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a flowchart of the industry dictionary generating method provided by Embodiment 1 of the invention. As shown in Fig. 1, the method of this embodiment comprises:
Step 11: according to initial industry terms, obtain a document collection corresponding to the initial industry terms.
Specifically, the user supplies initial industry terms to the industry dictionary generating device, which uses them as query words to obtain the corresponding document collection from a search engine; the search engine indexes a massive number of documents related to the industry. The initial industry terms may consist of single words or phrases, organized by industry category. For example, the initial industry terms that a user supplies for the tourism industry may be organized as follows:
Eat: fine food, snacks, local specialties, dining;
Stay: guesthouse, accommodation, hotel;
Travel: transportation, self-driving, travel.
One way in which the device obtains the document collection from the search engine is as follows. The device combines the initial industry terms belonging to the same industry category in different ways to obtain initial industry term combinations. It then uses each combination as a query word and queries through the query interface provided by the search engine, obtaining the several documents (for example 10) most relevant to the query word. Once every combination has been used as a query word, a specified number of documents has been obtained, and these documents form the document collection. In the document collection of this embodiment, documents are likewise organized by industry category.
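The query-building step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say which combinations are formed, so the sketch uses all combinations of one or two terms, and `fake_search` is a hypothetical stand-in for a real search-engine query interface.

```python
from itertools import combinations

def build_queries(terms_by_category, max_terms=2):
    """Combine the initial terms of each category into query strings."""
    queries = {}
    for category, terms in terms_by_category.items():
        combos = []
        # Assumption: combinations of 1..max_terms terms from the same category.
        for size in range(1, max_terms + 1):
            for combo in combinations(terms, size):
                combos.append(" ".join(combo))
        queries[category] = combos
    return queries

def collect_documents(terms_by_category, search, top_n=10, max_terms=2):
    """Run every query through `search` and group results by category."""
    collection = {}
    for category, queries in build_queries(terms_by_category, max_terms).items():
        seen = []
        for query in queries:
            for doc in search(query, top_n):
                if doc not in seen:  # keep each document once per category
                    seen.append(doc)
        collection[category] = seen
    return collection

def fake_search(query, top_n):
    # Hypothetical stand-in for a search engine: returns top_n dummy documents.
    return [f"doc about {query} #{i}" for i in range(top_n)]

seed_terms = {"Eat": ["fine food", "snacks"], "Stay": ["hotel", "accommodation"]}
documents = collect_documents(seed_terms, fake_search, top_n=2)
```

Keeping the results keyed by category mirrors the statement that the document collection is organized by industry category.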
Step 12: obtain candidate terms according to the document collection.
Specifically, one implementation of step 12 comprises:
Step 121: preprocess the document collection to obtain a word sequence set.
Preprocessing mainly means word segmentation: each document in the collection is split into a series of words. Unlike English, in which spaces act as natural delimiters between the words of each line, Chinese has no explicit delimiter between words. For the device to process Chinese documents automatically, each document must therefore be segmented into a series of words. Segmentation may use a dictionary-based method or a statistics-based method. Because segmentation accuracy affects the quality of the final industry dictionary, a suitable segmentation method should be chosen according to the characteristics of the industry.
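As an illustration of what a dictionary-based segmentation method can look like, the sketch below uses forward maximum matching. This particular algorithm is an assumption chosen for brevity; the source names no specific segmenter, and `toy_dictionary` is a made-up word list.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` greedily, preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

toy_dictionary = {"成都", "小吃", "美食", "酒店"}  # hypothetical word list
segmented = forward_max_match("成都小吃美食", toy_dictionary)
```

A production system would more likely rely on an established segmenter and a much larger dictionary.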
Besides word segmentation, preprocessing may also include part-of-speech tagging, stop-word removal and synonym handling. Part-of-speech tagging assigns a specific part of speech to each word in a document; common parts of speech include nouns, verbs, adjectives, adverbs, prepositions and conjunctions. Because the industry vocabulary in an industry dictionary generally has a fairly definite meaning, words of certain parts of speech (for example prepositions) are unlikely to be industry vocabulary, so part-of-speech tagging allows some words to be filtered out first. After these operations, the document collection becomes a relatively concise word sequence set annotated with parts of speech.
Step 123: filter the word sequence set to obtain candidate terms.
The device obtains candidate terms as follows. First, it obtains phrases from the word sequence set. Preferably, the device uses a suffix tree data structure together with a corresponding repeated-string extraction algorithm to extract repeated substrings as phrases: the word sequence set is represented as a suffix array, and the problem of finding repeated substrings is then converted into the problem of finding common prefixes of suffixes. Next, the device selects the words or phrases whose term frequency exceeds a term-frequency threshold as candidate words; the term frequency is the frequency with which a word or phrase occurs, and the threshold is preset. Finally, the device filters the candidate words according to preset filtering rules to obtain the candidate terms. Because industry vocabulary has distinct industry characteristics, the word sequence set must be screened layer by layer to narrow progressively the range of words or phrases to be processed.
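The idea of finding repeated phrases through common prefixes of sorted suffixes can be sketched as below. This is a simplified illustration, not the device's actual algorithm: suffixes are sorted naively and truncated to `max_len` words, and the parameter names and example sentence are assumptions.

```python
from collections import Counter

def repeated_phrases(words, min_len=2, max_len=4, min_freq=2):
    """Find repeated word n-grams via a (naively built) suffix array."""
    # Suffix array: start positions sorted by the truncated suffix.
    suffixes = sorted(range(len(words)), key=lambda i: words[i:i + max_len])
    pair_counts = Counter()
    for prev, cur in zip(suffixes, suffixes[1:]):
        a = words[prev:prev + max_len]
        b = words[cur:cur + max_len]
        # Length of the common prefix of two adjacent suffixes.
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        # Every prefix of the shared part is a repeated n-gram.
        for n in range(min_len, lcp + 1):
            pair_counts[tuple(a[:n])] += 1
    # k adjacent pairs sharing an n-gram correspond to k + 1 occurrences.
    return {" ".join(p): c + 1
            for p, c in pair_counts.items() if c + 1 >= min_freq}

sequence = "chengdu snack is good chengdu snack is cheap".split()
phrases = repeated_phrases(sequence)
```

The `min_freq` argument plays the role of the term-frequency threshold; a real suffix array construction would be considerably more efficient than sorting slices.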
To make the word sequence set easier to filter, another implementation of step 12 further includes, before step 123, step 122: perform topic-word extraction on the word sequence set to generate a topic-word control list. Topic-word extraction mainly means extracting from the word sequence set the core words that represent the subject content of the documents; the core words of all documents constitute the topic-word control list. Several methods of topic-word extraction exist, for example algorithms based on statistical classification and algorithms based on co-occurrence analysis.
Based on the above, the filtering rules of this embodiment may include:
(1) Words or phrases that appear in the initial industry terms or in a blocklist dictionary cannot be candidate terms; the blocklist dictionary is a dictionary of non-industry vocabulary.
(2) Every word contained in a candidate term must be a word in the topic-word control list.
(3) Length limit: a candidate whose length is less than 2 or greater than 4 cannot be a candidate term; that is, only a word longer than 1 character, or a phrase containing 2 to 4 words, can be a candidate term.
(4) A phrase that is merely a prefix or suffix of another phrase (an incomplete phrase) cannot be a candidate term.
Depending on the dictionary quality required for the industry category, the device may filter by any single rule above or by any combination of them, producing industry dictionaries of different quality. Filtering by all of the rules yields the highest-quality industry dictionary, so the combination of all rules is preferred as the filtering rule of this embodiment.
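The four filtering rules can be sketched as a single predicate. The sketch rests on assumptions that the source does not settle: candidates are represented as tuples of words, and rule (4) is read in its simplest form, dropping any candidate that is a strict prefix or suffix of a longer candidate. All names and sample data are hypothetical.

```python
def passes_filters(phrase, seed_terms, blocked, topic_words, all_phrases):
    """Apply the four filtering rules to one candidate (a tuple of words)."""
    text = " ".join(phrase)
    # Rule 1: not an initial industry term and not in the blocklist dictionary.
    if text in seed_terms or text in blocked:
        return False
    # Rule 2: every word must appear in the topic-word control list.
    if any(word not in topic_words for word in phrase):
        return False
    # Rule 3: single words need at least 2 characters; phrases at most 4 words.
    if len(phrase) == 1 and len(phrase[0]) < 2:
        return False
    if len(phrase) > 4:
        return False
    # Rule 4 (simplified reading): drop prefixes or suffixes of longer phrases.
    for other in all_phrases:
        if len(other) > len(phrase) and (
            other[:len(phrase)] == phrase or other[-len(phrase):] == phrase
        ):
            return False
    return True

seed_terms = {"snack"}
blocked = {"click here"}
topic_words = {"chengdu", "snack", "hotpot"}
candidates = [
    ("chengdu", "snack"),
    ("chengdu", "snack", "hotpot"),
    ("snack",),
    ("chengdu", "weather"),
]
kept = [p for p in candidates
        if passes_filters(p, seed_terms, blocked, topic_words, candidates)]
```

Because the rules are independent checks, any subset of them can be switched off to trade dictionary quality against coverage, as the text allows.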
Step 13: perform industry relevance analysis on the candidate terms to obtain relevant candidate terms.
The candidate terms obtained by the preceding steps are still numerous, and even some high-frequency candidate terms are not necessarily related to the industry category. This embodiment therefore removes the unrelated candidate terms by industry relevance analysis, which mainly means computing the relevance between each candidate term and the industry category. Having computed these relevance values, the device selects the candidate terms with the highest relevance as relevant candidate terms and passes them to the next step, further narrowing the range of words or phrases from which industry vocabulary is generated. The number of relevant candidate terms may be specified in advance.
Step 14: perform co-occurrence analysis and association mining on the relevant candidate terms to generate industry vocabulary.
This step uses the whole document collection or word sequence set to mine the relevant candidate terms further. The co-occurrence of each relevant candidate term with the industry category is analyzed and co-occurrence data are collected; the co-occurrence data are then processed by association rule mining, and the candidate terms whose association with the industry category exceeds a preset threshold are taken as industry vocabulary.
Step 15: add the industry vocabulary to the industry dictionary.
Specifically, the device adds the generated industry vocabulary to the industry dictionary of the corresponding industry category, thereby forming the industry dictionary for that category.
In the method of this embodiment, a document collection is obtained according to initial industry terms; candidate terms are obtained by data mining operations on the document collection such as word segmentation, part-of-speech tagging and filtering; relevant candidate terms are obtained by industry relevance analysis of the candidate terms; and co-occurrence analysis and association mining are then applied to find the relevant candidate terms whose association with the industry category exceeds the threshold. These are taken as industry vocabulary and added to the industry dictionary, which is thereby generated. By combining several kinds of data analysis and mining, this embodiment both extracts industry vocabulary from massive information and does so automatically, removing the need for manual searching, improving the efficiency of generating the industry dictionary and reducing production cost.
Further, this embodiment provides an implementation of step 122 in which the device generates the topic-word control list with a statistical classification algorithm. The implementation has two stages: a training stage and a recognition stage. In the training stage, a training corpus is prepared in advance, comprising training documents and the topic words corresponding to them (the training topic words). The device applies word segmentation, part-of-speech tagging and similar processing to the corpus and generates, for each word in the corpus, a feature set and a topic-word label indicating whether the word is a topic word. A classification algorithm, for example a support vector machine (SVM) or naive Bayes, is then trained on the feature sets and labels to produce a classifier. In the recognition stage, the device first obtains the feature set of each word in the word sequence set and then uses the classifier and the feature sets to judge whether each word is a topic word; the words judged to be topic words are collected to generate the topic-word control list. The feature set mainly includes term frequency-inverse document frequency (TF-IDF), part of speech, whether the word appears in the title, the position of the word's first occurrence, and the length of the word.
TF-IDF is a weighting technique commonly used in information retrieval and text mining. TF is the term frequency, the total number of times a word occurs in a document; IDF is the inverse document frequency, defined by formula (1):
$$\mathrm{IDF}_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{1}$$
where |D| is the total number of documents in the document collection and |{d : t_i ∈ d}| is the number of documents containing the word t_i.
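A minimal sketch of formula (1) and the resulting TF-IDF weight follows. The logarithm base is not specified in the source, so the natural logarithm is assumed here; the function names and sample documents are illustrative only.

```python
import math

def inverse_document_frequency(documents):
    """Return IDF for every word: log(|D| / number of documents containing it)."""
    total = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):  # count each document at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Assumption: natural logarithm; the source does not give the base.
    return {word: math.log(total / df) for word, df in doc_freq.items()}

def tf_idf(words, idf):
    """Weight each word by its raw count in one document times its IDF."""
    scores = {}
    for word in words:
        # Adding the IDF once per occurrence equals TF * IDF.
        scores[word] = scores.get(word, 0.0) + idf.get(word, 0.0)
    return scores

docs = [
    ["chengdu", "snack", "snack"],
    ["chengdu", "hotel"],
    ["chengdu", "flight"],
]
idf = inverse_document_frequency(docs)
weights = tf_idf(docs[0], idf)
```

A word that occurs in every document, such as "chengdu" in this toy data, gets an IDF of zero and therefore carries no weight.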
Topic words could also be obtained with the TF-IDF algorithm alone. Because the precision of topic-word extraction affects the quality of the industry dictionary generated later, this embodiment extracts topic words using TF-IDF together with several other features, so as to ensure extraction precision and improve the quality of the industry dictionary.
Further, one implementation of step 13 for obtaining the relevant candidate terms comprises:
Step 131: the device uses a statistical algorithm such as the chi-square test or information gain to compute the relevance between each candidate term and the industry category it belongs to; the chi-square test is preferred.
The principle of the chi-square test is as follows. Two variables are first assumed to be independent (the null hypothesis), and the deviation between the observed values and the theoretical values is then examined to decide whether the hypothesis holds. If the deviation is small, it is attributed to sampling error and the null hypothesis is accepted, that is, the two variables are considered independent; otherwise the null hypothesis is rejected and the two variables are considered related. For the problem of computing the relevance between a candidate term and an industry category, the question is whether the candidate term and the category are independent of each other. If they are, the candidate term is unrelated to the category and does not belong to it. The null hypothesis is therefore that the candidate term and the industry category are independent, and four observed values are available, as shown in Table 1 (taking the candidate term "Chengdu snack" and the industry category "Eat" as an example).
Table 1
| | Belongs to "Eat" | Does not belong to "Eat" | Total |
|---|---|---|---|
| Contains "Chengdu snack" | A | B | A + B |
| Does not contain "Chengdu snack" | C | D | C + D |
| Total | A + C | B + D | N |
Here A is the number of documents under the industry category "Eat" in which "Chengdu snack" occurs; B is the number of documents under the other, non-"Eat" industry categories in which "Chengdu snack" occurs; C is the number of documents under "Eat" in which "Chengdu snack" does not occur; and D is the number of documents under the other industry categories in which "Chengdu snack" does not occur. The chi-square value is computed by formula (2):
$$\chi^2(t, c) = \frac{(AD - BC)^2}{(A + B)(C + D)} \tag{2}$$
The larger the chi-square value, the greater the relevance between the candidate term "Chengdu snack" and the industry category "Eat".
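Formula (2) translates directly into code, as in the sketch below. Note that it is a simplified score: the textbook chi-square statistic also contains the factors N, A + C and B + D, but these are the same for every term within one category, so omitting them does not change the ranking of terms inside that category. The sample counts are invented for illustration.

```python
def chi_square(a, b, c, d):
    """Simplified chi-square score of formula (2) from the four cell counts."""
    denominator = (a + b) * (c + d)
    if denominator == 0:
        # The term occurs in every document or in none: no evidence either way.
        return 0.0
    return (a * d - b * c) ** 2 / denominator

# Hypothetical counts: the term is concentrated in the category.
related = chi_square(a=30, b=10, c=70, d=890)
# Hypothetical counts: the term is spread evenly, so AD equals BC.
unrelated = chi_square(a=10, b=90, c=10, d=90)
```

When the term is distributed across categories in the same proportion as the documents themselves, AD equals BC and the score is zero, matching the independence assumption of the null hypothesis.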
Step 132: the device obtains a specified number of relevant candidate terms from the candidate terms according to their relevance.
Specifically, for each industry category, the device computes the chi-square value of every candidate term under that category by formula (2), sorts the chi-square values in descending order, and takes the top k candidate terms as the relevant candidate terms for the next step. Here k is the preset number of relevant candidate terms and is a natural number greater than or equal to 1.
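Selecting the top k candidate terms by score can be sketched with the standard library's `heapq.nlargest`. The scores below are made-up values, and `top_k_terms` is a hypothetical helper name.

```python
import heapq

def top_k_terms(scores, k):
    """Return the k terms with the largest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical chi-square scores for one industry category.
scores = {"chengdu snack": 17604.2, "weather": 3.1, "hotpot": 950.0}
relevant = top_k_terms(scores, 2)
```

If k exceeds the number of candidates, all candidates are returned in descending order of score.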
Building on the above, one implementation of step 14 for generating industry vocabulary comprises:
Step 141: the device counts the occurrences of the relevant candidate terms and of their industry category in a document database to obtain co-occurrence data. The co-occurrence data comprise the number of documents, a first value, which is the number of times each relevant candidate term and the industry category occur together, and a second value, which is the number of times the industry category occurs on its own.
Note that the document database here differs from the document collection obtained from the search engine according to the initial industry terms: that collection is a subset of the document database. The document database contains many more industry-related documents, usually millions or more.
Co-occurrence analysis is a common technique in data mining. Its main idea is that if two words frequently occur in the same context, the two words are closely connected. This embodiment relies on this principle to discover more industry terms automatically in the course of searching. The context for co-occurrence analysis may be a whole document, a paragraph or a sentence; this embodiment takes the document as an example.
For example, for a relevant candidate term t and an industry term d contained in industry category c, if both occur in the same document, a co-occurrence count of 1 is recorded, written count(t, c) -> 1. The separate occurrences of t and c are counted at the same time, once per document, written count(t) -> 1 and count(c) -> 1 respectively. Every document in the document database is processed in this way for t and c, which yields the co-occurrence data: the number of documents, the number of documents in which each relevant candidate term and the industry category occur together (the first value), the number in which the industry category occurs (the second value), and the number in which each relevant candidate term occurs. For example, a set of co-occurrence data may comprise: count(t, c) -> 100, meaning that t and c occur together in 100 documents; count(t) -> 2000, meaning that t occurs in 2000 documents; count(c) -> 20000, meaning that c occurs in 20000 documents; and N -> 100000, meaning that the document database contains 100,000 documents in total.
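Document-level co-occurrence counting can be sketched as follows. The sketch assumes that documents are plain strings, that a term "occurs" when it appears as a substring, and that a category occurs when any of its industry terms occurs; the sample documents are invented.

```python
def cooccurrence_counts(documents, term, category_terms):
    """Count documents containing the term, the category, and both."""
    counts = {"N": len(documents), "t": 0, "c": 0, "tc": 0}
    for text in documents:
        has_term = term in text
        # Assumption: the category occurs if any of its industry terms occurs.
        has_category = any(word in text for word in category_terms)
        # Each document contributes at most 1 to each counter.
        counts["t"] += int(has_term)
        counts["c"] += int(has_category)
        counts["tc"] += int(has_term and has_category)
    return counts

database = [
    "chengdu snack and fine food guide",
    "fine food in beijing",
    "chengdu snack prices",
    "train schedule",
]
counts = cooccurrence_counts(database, "chengdu snack", ["fine food", "dining"])
```

Here `counts["tc"]`, `counts["t"]`, `counts["c"]` and `counts["N"]` correspond to count(t, c), count(t), count(c) and N in the text.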
Step 142: apply association rule mining to the co-occurrence data to obtain the strength of association between each relevant candidate term and the industry category.
Once the co-occurrence data have been obtained, they are processed by association rule mining to compute the support and the confidence, given by formula (3) and formula (4) respectively.
Support(A→B) = P(A∪B) (3)
Confidence(A→B) = P(B|A) (4)
Applying the co-occurrence data to these formulas gives formula (5) for the support and formula (6) for the confidence:
Support(c -> t) = count(t, c) / N (5)
Confidence(c -> t) = count(t, c) / count(c) (6)
Formula (5) computes the ratio of the number of times a relevant candidate term and the industry category occur together to the number of documents; this ratio is the support. Formula (6) computes the ratio of the number of times they occur together to the number of times the industry category occurs; this ratio is the confidence. Together, the support and the confidence express the strength of association between candidate term t and industry category c. In this embodiment a support threshold and a confidence threshold are set in advance as benchmarks for judging the strength of association. The device compares the computed support and confidence with the respective thresholds: an association whose support and confidence both exceed their thresholds is called a strong association; otherwise it is called a weak association.
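Formulas (5) and (6) and the threshold comparison can be sketched as below, using the example counts from the text (count(t, c) = 100, count(c) = 20000, N = 100000). The threshold values are hypothetical; the source does not state any.

```python
def association_strength(count_tc, count_c, total, min_support, min_confidence):
    """Return (support, confidence, is_strong) for a term and a category."""
    support = count_tc / total                           # formula (5)
    confidence = count_tc / count_c if count_c else 0.0  # formula (6)
    # Strong only if both values exceed their thresholds.
    is_strong = support > min_support and confidence > min_confidence
    return support, confidence, is_strong

# Counts from the example in the text; thresholds are assumed values.
support, confidence, strong = association_strength(
    count_tc=100, count_c=20000, total=100000,
    min_support=0.0005, min_confidence=0.004,
)
```

With these counts the support is 0.001 and the confidence is 0.005, so the pair counts as a strong association under the assumed thresholds.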
Besides calculating the association strength between a relevant candidate term and the industry category from the support and the confidence, other approaches are possible. For example, a degree of association that places more emphasis on exclusivity can be used in place of the confidence. The degree of association can be calculated according to formula (7):
$$R = \frac{P(C) - P(A)\,P(B)}{\sqrt{P(A)\,P(\overline{A})\,P(B)\,P(\overline{B})}} \qquad (7)$$
Here, R is the degree of association. P(A) is the probability that the relevant candidate term occurs in the document database, i.e. the ratio of the number of times the relevant candidate term occurs in the document database (counting both occurrences on its own and occurrences together with the industry category) to the number of documents. P(B) is the probability that the industry category occurs in the document database, i.e. the ratio of the number of times the industry category occurs in the document database (counting both occurrences on its own and occurrences together with the relevant candidate term) to the number of documents. P(C) is the probability that the relevant candidate term and the industry category occur in the document database together, i.e. the ratio of the number of times they occur together to the number of documents. Further, by the rules of probability, $P(C) = P(AB)$, $P(\overline{A}) = 1 - P(A)$ and $P(\overline{B}) = 1 - P(B)$.
On this basis, once the degree of association between each relevant candidate term and the industry category has been calculated, the support and the degree of association can be used together to express the association strength between the relevant candidate term and the industry category. Likewise, a degree-of-association threshold can be set in advance and the degree of association compared with it. An association whose support and degree of association both exceed the support threshold and the degree-of-association threshold is called a strong association; otherwise it is called a weak association.
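The degree of association can be computed from the same kind of counts. The sketch below assumes that formula (7) has the form of the correlation coefficient of two binary events, with a square root over the denominator; the function name `association_degree` and its parameters are hypothetical, and each count is taken to be a number of documents.

```python
import math


def association_degree(n_docs: int, count_t: int, count_c: int, count_tc: int) -> float:
    """Degree of association R between a term and a category.

    n_docs   -- number of documents in the document database
    count_t  -- documents containing the relevant candidate term
    count_c  -- documents belonging to the industry category
    count_tc -- documents where both occur together
    """
    p_a = count_t / n_docs   # P(A)
    p_b = count_c / n_docs   # P(B)
    p_c = count_tc / n_docs  # P(C) = P(AB)
    # P(not A) = 1 - P(A) and P(not B) = 1 - P(B)
    denom = math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        # Undefined when the term or the category occurs in no document
        # or in every document.
        raise ValueError("degree of association is undefined for these counts")
    return (p_c - p_a * p_b) / denom
```

Under this assumed form, R is 0 when the term and the category occur independently and grows toward 1 as the term occurs only together with the category, which matches the stated emphasis on exclusivity.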
Step 143: select the relevant candidate terms whose association strength is greater than the degree-of-association threshold as industry vocabulary.
Once the association strengths are known, the relevant candidate terms with a strong association are selected as industry vocabulary. That is, either the relevant candidate terms whose support and confidence both exceed the support threshold and the confidence threshold are selected, or the relevant candidate terms whose support and degree of association both exceed the support threshold and the degree-of-association threshold are selected.
The industry dictionary generating method of this embodiment obtains the document collection by running a search engine on the initial industry terms. This ensures that the document collection contains a certain number of documents relevant to the industry, which in turn safeguards the accuracy of industry vocabulary extraction. The document collection then undergoes word segmentation, part-of-speech tagging, industry relevance analysis, co-occurrence analysis and association rule analysis to obtain the industry vocabulary. This improves the precision and recall of the extracted industry vocabulary, ensures the quality of the final industry dictionary, and solves the high cost and low efficiency of searching for industry vocabulary manually.
Fig. 2 is a flowchart of the industry dictionary generating method provided by Embodiment 2 of the present invention. This embodiment builds on Embodiment 1, and the parts they have in common are not repeated. It differs from Embodiment 1 in that the following step is performed after step 15:
Step 16: take the industry vocabulary in the industry dictionary as the new initial industry terms and return to step 11.
In the industry dictionary generating method of this embodiment, after the industry dictionary is generated, the industry vocabulary in the newly generated dictionary is used as the initial industry terms and the generation process is repeated. Each pass can produce new industry vocabulary, so the industry dictionary is progressively enriched.
In addition, when the document collection corresponding to the industry category changes, the flow of this embodiment can also be triggered to update or further enrich the industry dictionary.
Building on the above embodiments, the relevant candidate terms that were not selected as industry vocabulary can also be added to the shielding dictionary before step 16 is performed. According to the filtering rules, words or phrases in the shielding dictionary cannot be candidate terms. Adding the unselected relevant candidate terms to the shielding dictionary therefore keeps them out of the calculations in the next round of dictionary generation, which improves the overall efficiency of generating the industry dictionary.
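The repeated generation with a shielding dictionary can be sketched as a loop. This is an illustration under stated assumptions: `run_round` is a hypothetical callable standing in for steps 11 to 14 of one pass, and the `max_rounds` limit and the stop-when-nothing-new condition are additions of this sketch, since the text does not say when the repetition ends.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical signature for one generation pass: given the current initial
# terms and the shielding dictionary, return
# (industry vocabulary selected, all relevant candidate terms considered).
RoundFn = Callable[[Set[str], Set[str]], Tuple[Set[str], Set[str]]]


def enrich_dictionary(
    initial_terms: Iterable[str],
    run_round: RoundFn,
    max_rounds: int = 5,  # assumed safety limit, not from the source
) -> Tuple[Set[str], Set[str]]:
    dictionary: Set[str] = set()
    shield: Set[str] = set()
    seeds = set(initial_terms)
    for _ in range(max_rounds):
        vocabulary, relevant = run_round(seeds, shield)
        new_words = vocabulary - dictionary
        dictionary |= vocabulary         # step 15: add to the industry dictionary
        shield |= relevant - vocabulary  # unselected candidates are shielded
        if not new_words:                # assumed stop condition
            break
        seeds = set(dictionary)          # step 16: dictionary words become seeds
    return dictionary, shield
```

Because shielded terms are excluded by the filtering rules, each later pass evaluates fewer candidates than it would without the shielding dictionary.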
Fig. 3 is a schematic structural diagram of the industry dictionary generating device provided by Embodiment 3 of the present invention. As shown in Fig. 3, the device of this embodiment comprises a first acquisition module 31, a second acquisition module 32, a third acquisition module 33, a generation module 34 and an adding module 35.
The first acquisition module 31 obtains, according to the initial industry terms, the document collection corresponding to the initial industry terms. The second acquisition module 32 obtains candidate terms according to the document collection. The third acquisition module 33 performs industry relevance analysis on the candidate terms to obtain relevant candidate terms. The generation module 34 performs co-occurrence analysis and association relationship mining on the relevant candidate terms to generate industry vocabulary. The adding module 35 adds the industry vocabulary to the industry dictionary.
Each of the above functional modules can carry out the detailed process of the above method embodiments in order to generate the industry dictionary. For the working principle of each functional module, see the corresponding description in the method embodiments; it is not repeated here.
The industry dictionary generating device of this embodiment obtains the corresponding document collection according to the initial industry terms, obtains candidate terms from the document collection, performs industry relevance analysis, co-occurrence analysis and association relationship mining on the candidate terms to generate industry vocabulary, and adds the industry vocabulary to the industry dictionary. With this device, the industry dictionary is generated automatically from the initial industry terms and the corresponding document collection, without manual searching, which improves the efficiency of generating the industry dictionary and reduces its cost.
Further, the industry dictionary generating device of this embodiment also comprises a trigger module 36. After the adding module 35 has added the industry vocabulary to the industry dictionary, the trigger module 36 takes the industry vocabulary in the industry dictionary as the new initial industry terms and triggers the first acquisition module 31 to perform the operation of obtaining, according to the initial industry terms, the document collection corresponding to the initial industry terms.
Through the trigger module, the industry dictionary generating device of this embodiment can repeat the generation process and thereby continually enrich the industry dictionary.
It should be noted here that the industry dictionary generating device of this embodiment can be used to carry out the flow of the industry dictionary generating method provided by the above method embodiments. Since the flow of the method embodiments can be implemented by a computer software program, the industry dictionary generating device may be a computer, but is not limited to this.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (17)

1. An industry dictionary generating method, characterized by comprising:
obtaining, according to initial industry terms, a document collection corresponding to the initial industry terms;
obtaining candidate terms according to the document collection;
performing industry relevance analysis on the candidate terms to obtain relevant candidate terms;
performing co-occurrence analysis and association relationship mining on the relevant candidate terms to generate industry vocabulary;
adding the industry vocabulary to an industry dictionary.
2. The industry dictionary generating method according to claim 1, characterized in that, after adding the industry vocabulary to the industry dictionary, the method further comprises:
taking the industry vocabulary in the industry dictionary as the new initial industry terms, and returning to the operation of obtaining, according to the initial industry terms, the document collection corresponding to the initial industry terms.
3. The industry dictionary generating method according to claim 2, characterized in that, before taking the industry vocabulary in the industry dictionary as the new initial industry terms and returning to the operation of obtaining, according to the initial industry terms, the document collection corresponding to the initial industry terms, the method further comprises:
adding the relevant candidate terms other than the industry vocabulary to a shielding dictionary.
4. The industry dictionary generating method according to claim 1, characterized in that obtaining candidate terms according to the document collection comprises:
preprocessing the document collection to obtain a word sequence set;
filtering the word sequence set to obtain the candidate terms.
5. The industry dictionary generating method according to claim 4, characterized in that, before filtering the word sequence set to obtain the candidate terms, the method further comprises:
performing topic word extraction on the word sequence set to generate a topic word controlled vocabulary.
6. The industry dictionary generating method according to any one of claims 1-5, characterized in that obtaining, according to the initial industry terms, the document collection corresponding to the initial industry terms comprises:
combining the initial industry terms in different ways to obtain initial industry term combinations;
using the initial industry term combinations as query words, and obtaining a specified number of documents with a search engine.
7. The industry dictionary generating method according to claim 4 or 5, characterized in that preprocessing the document collection to obtain the word sequence set comprises:
performing word segmentation on each document in the document collection to obtain the word sequence set.
8. The industry dictionary generating method according to claim 7, characterized in that preprocessing the document collection further comprises:
performing part-of-speech tagging, stop word processing or synonym processing on each document in the document collection.
9. The industry dictionary generating method according to claim 5, characterized in that performing topic word extraction on the word sequence set to generate the topic word controlled vocabulary comprises:
performing word segmentation and part-of-speech tagging on a preset corpus to generate a feature set and a topic word judgment result for each word in the corpus, the corpus comprising training documents and the training topic words corresponding to the training documents;
training on the feature set and the topic word judgment result of each word in the corpus with a classification algorithm to generate a classifier;
obtaining the feature set of each word in the word sequence set;
judging, according to the classifier and the feature set of each word, whether each word is a topic word;
generating the topic word controlled vocabulary according to the judgment results.
10. The industry dictionary generating method according to claim 5 or 9, characterized in that filtering the word sequence set to obtain the candidate terms comprises:
extracting repeated substrings from the word sequence set as phrases using a suffix tree data structure;
selecting words or phrases whose frequency is greater than a frequency threshold as candidate words;
filtering the candidate words according to filtering rules to obtain the candidate terms.
11. The industry dictionary generating method according to claim 10, characterized in that the filtering rules comprise any one or a combination of the following:
words or phrases in the initial industry terms or in the shielding dictionary cannot be candidate terms;
the words contained in a candidate term must be words in the topic word controlled vocabulary;
a word whose length is less than 1, or a phrase containing fewer than 2 or more than 4 words, cannot be a candidate term; or
a phrase that is a prefix or suffix of another phrase cannot be a candidate term.
12. The industry dictionary generating method according to any one of claims 1-5, characterized in that performing industry relevance analysis on the candidate terms to obtain the relevant candidate terms comprises:
calculating the relevance between each candidate term and the industry category to which it belongs using the chi-square test or an information gain algorithm;
obtaining a specified number of the relevant candidate terms from the candidate terms according to the magnitude of the relevance.
13. The industry dictionary generating method according to any one of claims 1-5, characterized in that performing co-occurrence analysis and association relationship mining on the relevant candidate terms to generate the industry vocabulary comprises:
counting the occurrences of the relevant candidate terms and the industry category to which they belong in a document database to obtain co-occurrence data, the co-occurrence data comprising the number of documents, a first count of the times each relevant candidate term and the industry category occur together, and a second count of the times the industry category occurs on its own;
performing association rule mining on the co-occurrence data to obtain the association strength between the relevant candidate terms and the industry category;
selecting the relevant candidate terms whose association strength is greater than a degree-of-association threshold as the industry vocabulary.
14. The industry dictionary generating method according to claim 13, characterized in that performing association rule mining on the co-occurrence data to obtain the association strength between the relevant candidate terms and the industry category comprises:
calculating the ratio of each first count to the number of documents to obtain the support corresponding to each relevant candidate term;
calculating the ratio of the first count to the second count to obtain the confidence.
15. The industry dictionary generating method according to claim 13, characterized in that performing association rule mining on the co-occurrence data to obtain the association strength between the relevant candidate terms and the industry category comprises:
calculating the ratio of each first count to the number of documents to obtain the support corresponding to each relevant candidate term;
obtaining the degree of association between each relevant candidate term and the industry category according to the formula
$$R = \frac{P(C) - P(A)\,P(B)}{\sqrt{P(A)\,P(\overline{A})\,P(B)\,P(\overline{B})}}$$
wherein R represents the degree of association;
P(A) represents the probability that the relevant candidate term occurs in the document database;
P(B) represents the probability that the industry category occurs in the document database;
P(C) represents the probability that the relevant candidate term and the industry category occur in the document database together.
16. An industry dictionary generating device, characterized by comprising:
a first acquisition module, configured to obtain, according to initial industry terms, a document collection corresponding to the initial industry terms;
a second acquisition module, configured to obtain candidate terms according to the document collection;
a third acquisition module, configured to perform industry relevance analysis on the candidate terms to obtain relevant candidate terms;
a generation module, configured to perform co-occurrence analysis and association relationship mining on the relevant candidate terms to generate industry vocabulary;
an adding module, configured to add the industry vocabulary to an industry dictionary.
17. The industry dictionary generating device according to claim 16, characterized by further comprising:
a trigger module, configured to take the industry vocabulary in the industry dictionary as the new initial industry terms and to trigger the first acquisition module to perform the operation of obtaining, according to the initial industry terms, the document collection corresponding to the initial industry terms.
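As an illustration of the relevance calculation named in claim 12, the chi-square statistic for one candidate term and one industry category can be computed from a 2x2 table of document counts using the standard closed form. The sketch below is not taken from the patent: the function names `chi_square` and `top_relevant` and the table layout are hypothetical, and ranking by the raw statistic is one possible reading of "according to the magnitude of the relevance".

```python
from typing import Dict, List, Tuple

# (n11, n10, n01, n00) document counts for one term:
#   n11: in the category and containing the term
#   n10: outside the category and containing the term
#   n01: in the category and not containing the term
#   n00: outside the category and not containing the term
Table = Tuple[int, int, int, int]


def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A row or column of the table is empty: no evidence either way.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_relevant(tables: Dict[str, Table], k: int) -> List[str]:
    """Return the k candidate terms with the largest chi-square statistic."""
    ranked = sorted(tables, key=lambda term: chi_square(*tables[term]), reverse=True)
    return ranked[:k]
```

Note that the chi-square statistic is large for both positive and negative dependence, so a practical system would likely also check that the term is over-represented inside the category before keeping it.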
CN201110089698.5A 2011-04-11 2011-04-11 Industry dictionary generating method and device Active CN102169495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110089698.5A CN102169495B (en) 2011-04-11 2011-04-11 Industry dictionary generating method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110089698.5A CN102169495B (en) 2011-04-11 2011-04-11 Industry dictionary generating method and device

Publications (2)

Publication Number Publication Date
CN102169495A true CN102169495A (en) 2011-08-31
CN102169495B CN102169495B (en) 2014-04-02

Family

ID=44490657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110089698.5A Active CN102169495B (en) 2011-04-11 2011-04-11 Industry dictionary generating method and device

Country Status (1)

Country Link
CN (1) CN102169495B (en)

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049532A (en) * 2012-12-21 2013-04-17 东莞中国科学院云计算产业技术创新与育成中心 Method for creating knowledge base engine on basis of sudden event emergency management and method for inquiring knowledge base engine
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device
CN103309857A (en) * 2012-03-06 2013-09-18 腾讯科技(深圳)有限公司 Method and equipment for determining classified linguistic data
CN104063422A (en) * 2014-05-20 2014-09-24 微梦创科网络科技(中国)有限公司 Iteration updating method and device of feature word banks of fields in SNS (Social Networking Service)
CN104361033A (en) * 2014-10-27 2015-02-18 深圳职业技术学院 Automatic cancer-related information collection method and system
CN104391852A (en) * 2014-09-15 2015-03-04 国家电网公司 Method and device for establishing keyword word bank
CN104903802A (en) * 2013-02-28 2015-09-09 发纮电机株式会社 Screen creation editor device and program
CN105159884A (en) * 2015-09-23 2015-12-16 百度在线网络技术(北京)有限公司 Method and device for establishing industry dictionary and industry identification method and device
CN105512191A (en) * 2015-11-25 2016-04-20 南京莱斯信息技术股份有限公司 Industry characteristics analyzer with artificial behavior learning capability
CN105528404A (en) * 2015-12-03 2016-04-27 北京锐安科技有限公司 Establishment method and apparatus of seed keyword dictionary, and extraction method and apparatus of keywords
CN105608130A (en) * 2015-12-16 2016-05-25 小米科技有限责任公司 Method and device for obtaining sentiment word knowledge base as well as terminal
CN105608083A (en) * 2014-11-13 2016-05-25 北京搜狗科技发展有限公司 Method and device for obtaining input library, and electronic equipment
CN105631007A (en) * 2015-12-29 2016-06-01 云南电网有限责任公司电力科学研究院 Industry technical information collecting method and system
CN105653519A (en) * 2015-12-30 2016-06-08 贺惠新 Mining method of field specific word
CN105677640A (en) * 2016-01-08 2016-06-15 中国科学院计算技术研究所 Domain concept extraction method for open texts
CN105760366A (en) * 2015-03-16 2016-07-13 国家计算机网络与信息安全管理中心 New word finding method aiming at specific field
CN105869056A (en) * 2016-03-31 2016-08-17 比美特医护在线(北京)科技有限公司 Information processing method and apparatus
CN105930509A (en) * 2016-05-11 2016-09-07 华东师范大学 Method and system for automatic extraction and refinement of domain concept based on statistics and template matching
CN106445907A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Domain lexicon generation method and apparatus
CN106445906A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Generation method and apparatus for medium-and-long phrase in domain lexicon
CN103678371B (en) * 2012-09-14 2017-10-10 富士通株式会社 Word library updating device, data integration device and method and electronic equipment
CN107423362A (en) * 2017-06-20 2017-12-01 阿里巴巴集团控股有限公司 Industry determines method, Method of Get Remote Object and device, client, server
CN107958014A (en) * 2016-10-18 2018-04-24 谷歌公司 Search engine
CN108038204A (en) * 2017-12-15 2018-05-15 福州大学 For the viewpoint searching system and method for social media
CN108647322A (en) * 2018-05-11 2018-10-12 四川师范大学 The method that word-based net identifies a large amount of Web text messages similarities
CN108694229A (en) * 2017-04-10 2018-10-23 富士通株式会社 String data analytical equipment and string data analysis method
CN105243129B (en) * 2015-09-30 2018-10-30 清华大学深圳研究生院 Item property Feature words clustering method
CN109408828A (en) * 2018-11-08 2019-03-01 四川长虹电器股份有限公司 Words partition system for television field semantic analysis
CN109684463A (en) * 2018-12-30 2019-04-26 广西财经学院 Compared based on weight and translates rear former piece extended method across language with what is excavated
CN109783649A (en) * 2019-01-02 2019-05-21 腾讯科技(深圳)有限公司 A kind of domain lexicon generation method and device
CN109885831A (en) * 2019-01-30 2019-06-14 广州杰赛科技股份有限公司 Key Term abstracting method, device, equipment and computer readable storage medium
CN110309175A (en) * 2018-03-02 2019-10-08 北大方正集团有限公司 Reference book method of calibration and reference book calibration equipment
CN110362803A (en) * 2019-07-19 2019-10-22 北京邮电大学 A kind of text template generation method based on the combination of domain features morphology
CN110619067A (en) * 2019-08-27 2019-12-27 深圳证券交易所 Industry classification-based retrieval method and retrieval device and readable storage medium
CN110619073A (en) * 2019-08-30 2019-12-27 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
CN111079428A (en) * 2019-12-27 2020-04-28 出门问问信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium
WO2020124856A1 (en) * 2018-12-18 2020-06-25 众安信息技术服务有限公司 Diagnosis standardization method and device based on word vectors
CN111444326A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN112632969A (en) * 2020-12-13 2021-04-09 复旦大学 Incremental industry dictionary updating method and system
CN112687403A (en) * 2021-01-08 2021-04-20 拉扎斯网络科技(上海)有限公司 Medicine dictionary generation and medicine search method and device
CN113743107A (en) * 2021-08-30 2021-12-03 北京字跳网络技术有限公司 Entity word extraction method and device and electronic equipment
CN114138945A (en) * 2022-01-19 2022-03-04 支付宝(杭州)信息技术有限公司 Entity identification method and device in data analysis
CN114238634A (en) * 2021-12-13 2022-03-25 北京智齿众服技术咨询有限公司 Regular expression generation method, application, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005015434A1 (en) * 2003-07-23 2005-02-17 International Business Machines Corporation Method and system for categorizing arabic text
JP2008117351A (en) * 2006-11-08 2008-05-22 Nomura Research Institute Ltd Search system
CN101251854A (en) * 2008-03-19 2008-08-27 深圳先进技术研究院 Method for creating index lexical item as well as data retrieval method and system
CN101963989A (en) * 2010-09-30 2011-02-02 大连理工大学 Word elimination process for extracting domain ontology concept

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐建民 等: "一种基于术语簇和关联规则的文档聚类方法", 《计算机工程与应用》, 11 February 2007 (2007-02-11) *
陈霞 等: "基于本体论的关联规则的挖掘", 《计算机与数字工程》, vol. 35, no. 2, 20 February 2007 (2007-02-20), pages 32 - 34 *

Cited By (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309857A (en) * 2012-03-06 2013-09-18 腾讯科技(深圳)有限公司 Method and equipment for determining classified linguistic data
CN103309857B (en) * 2012-03-06 2018-11-09 深圳市世纪光速信息技术有限公司 A kind of taxonomy determines method and apparatus
CN103678371B (en) * 2012-09-14 2017-10-10 富士通株式会社 Word library updating device, data integration device and method and electronic equipment
CN103049532A (en) * 2012-12-21 2013-04-17 东莞中国科学院云计算产业技术创新与育成中心 Method for creating knowledge base engine on basis of sudden event emergency management and method for inquiring knowledge base engine
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device
CN104903802A (en) * 2013-02-28 2015-09-09 发纮电机株式会社 Screen creation editor device and program
CN104063422A (en) * 2014-05-20 2014-09-24 微梦创科网络科技(中国)有限公司 Iteration updating method and device of feature word banks of fields in SNS (Social Networking Service)
CN104063422B (en) * 2014-05-20 2018-02-27 微梦创科网络科技(中国)有限公司 The feature dictionary iteration update method and device in field in social networks
CN104391852A (en) * 2014-09-15 2015-03-04 国家电网公司 Method and device for establishing keyword word bank
CN104391852B (en) * 2014-09-15 2017-12-29 国家电网公司 A kind of method and apparatus for establishing keyword dictionary
CN104361033A (en) * 2014-10-27 2015-02-18 深圳职业技术学院 Automatic cancer-related information collection method and system
CN104361033B (en) * 2014-10-27 2017-06-09 深圳职业技术学院 A kind of automatic collection method of cancer relevant information and system
CN105608083A (en) * 2014-11-13 2016-05-25 北京搜狗科技发展有限公司 Method and device for obtaining input library, and electronic equipment
CN105608083B (en) * 2014-11-13 2019-09-03 北京搜狗科技发展有限公司 Obtain the method, apparatus and electronic equipment of input magazine
CN105760366B (en) * 2015-03-16 2018-06-29 国家计算机网络与信息安全管理中心 For the new word discovery method of specific area
CN105760366A (en) * 2015-03-16 2016-07-13 国家计算机网络与信息安全管理中心 New word finding method aiming at specific field
CN106445907A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Domain lexicon generation method and apparatus
CN106445906A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Generation method and apparatus for medium-and-long phrase in domain lexicon
CN105159884A (en) * 2015-09-23 2015-12-16 百度在线网络技术(北京)有限公司 Method and device for establishing industry dictionary and industry identification method and device
CN105159884B (en) * 2015-09-23 2018-06-29 百度在线网络技术(北京)有限公司 The method for building up and device of industry dictionary and industry recognition methods and device
CN105243129B (en) * 2015-09-30 2018-10-30 清华大学深圳研究生院 Item property Feature words clustering method
CN105512191A (en) * 2015-11-25 2016-04-20 南京莱斯信息技术股份有限公司 Industry characteristics analyzer with artificial behavior learning capability
CN105528404A (en) * 2015-12-03 2016-04-27 北京锐安科技有限公司 Establishment method and apparatus of seed keyword dictionary, and extraction method and apparatus of keywords
CN105608130A (en) * 2015-12-16 2016-05-25 小米科技有限责任公司 Method and device for obtaining sentiment word knowledge base as well as terminal
CN105631007A (en) * 2015-12-29 2016-06-01 云南电网有限责任公司电力科学研究院 Industry technical information collecting method and system
CN105653519A (en) * 2015-12-30 2016-06-08 贺惠新 Mining method of field specific word
CN105677640A (en) * 2016-01-08 2016-06-15 中国科学院计算技术研究所 Domain concept extraction method for open texts
CN105869056A (en) * 2016-03-31 2016-08-17 比美特医护在线(北京)科技有限公司 Information processing method and apparatus
CN105930509B (en) * 2016-05-11 2019-05-17 华东师范大学 Field concept based on statistics and template matching extracts refined method and system automatically
CN105930509A (en) * 2016-05-11 2016-09-07 华东师范大学 Method and system for automatic extraction and refinement of domain concept based on statistics and template matching
CN107958014B (en) * 2016-10-18 2021-11-09 谷歌公司 Search engine
CN107958014A (en) * 2016-10-18 2018-04-24 谷歌公司 Search engine
CN108694229B (en) * 2017-04-10 2022-06-03 富士通株式会社 String data analysis device and string data analysis method
CN108694229A (en) * 2017-04-10 2018-10-23 富士通株式会社 String data analysis device and string data analysis method
CN107423362A (en) * 2017-06-20 2017-12-01 阿里巴巴集团控股有限公司 Industry determination method, remote object acquisition method and device, client, and server
CN108038204A (en) * 2017-12-15 2018-05-15 福州大学 Viewpoint retrieval system and method for social media
CN110309175B (en) * 2018-03-02 2021-12-03 北大方正集团有限公司 Tool book checking method and tool book checking device
CN110309175A (en) * 2018-03-02 2019-10-08 北大方正集团有限公司 Reference book checking method and reference book checking device
CN108647322B (en) * 2018-05-11 2021-12-17 四川师范大学 Method for identifying similarity of mass Web text information based on word network
CN108647322A (en) * 2018-05-11 2018-10-12 四川师范大学 Method for identifying similarity of mass Web text information based on word network
CN109408828A (en) * 2018-11-08 2019-03-01 四川长虹电器股份有限公司 Word segmentation system for television-domain semantic analysis
WO2020124856A1 (en) * 2018-12-18 2020-06-25 众安信息技术服务有限公司 Diagnosis standardization method and device based on word vectors
CN109684463A (en) * 2018-12-30 2019-04-26 广西财经学院 Cross-language post-translation and front-part extension method based on weight comparison and mining
CN109684463B (en) * 2018-12-30 2022-11-22 广西财经学院 Cross-language post-translation and front-part extension method based on weight comparison and mining
CN109783649B (en) * 2019-01-02 2023-01-24 腾讯科技(深圳)有限公司 Domain dictionary generating method and device
CN109783649A (en) * 2019-01-02 2019-05-21 腾讯科技(深圳)有限公司 Domain dictionary generating method and device
CN109885831A (en) * 2019-01-30 2019-06-14 广州杰赛科技股份有限公司 Key term extraction method, device, equipment and computer-readable storage medium
CN110362803A (en) * 2019-07-19 2019-10-22 北京邮电大学 Text template generation method based on domain feature morphology combination
CN110619067A (en) * 2019-08-27 2019-12-27 深圳证券交易所 Industry classification-based retrieval method and retrieval device and readable storage medium
CN110619073A (en) * 2019-08-30 2019-12-27 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110619073B (en) * 2019-08-30 2022-04-22 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
WO2021051864A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Dictionary expansion method and apparatus, electronic device and storage medium
CN111079428A (en) * 2019-12-27 2020-04-28 出门问问信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium
CN111079428B (en) * 2019-12-27 2023-09-19 北京羽扇智信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium
CN111444326A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN111444326B (en) * 2020-03-30 2023-10-20 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN112632969B (en) * 2020-12-13 2022-06-21 复旦大学 Incremental industry dictionary updating method and system
CN112632969A (en) * 2020-12-13 2021-04-09 复旦大学 Incremental industry dictionary updating method and system
CN112687403B (en) * 2021-01-08 2022-12-02 拉扎斯网络科技(上海)有限公司 Medicine dictionary generation and medicine search method and device
CN112687403A (en) * 2021-01-08 2021-04-20 拉扎斯网络科技(上海)有限公司 Medicine dictionary generation and medicine search method and device
CN113743107A (en) * 2021-08-30 2021-12-03 北京字跳网络技术有限公司 Entity word extraction method and device and electronic equipment
CN114238634A (en) * 2021-12-13 2022-03-25 北京智齿众服技术咨询有限公司 Regular expression generation method, application, device, equipment and storage medium
CN114238634B (en) * 2021-12-13 2022-08-02 北京智齿众服技术咨询有限公司 Regular expression generation method, application, device, equipment and storage medium
CN114138945A (en) * 2022-01-19 2022-03-04 支付宝(杭州)信息技术有限公司 Entity identification method and device in data analysis

Also Published As

Publication number Publication date
CN102169495B (en) 2014-04-02

Similar Documents

Publication Publication Date Title
CN102169495B (en) Industry dictionary generating method and device
CN109299480B (en) Context-based term translation method and device
CN103399901B (en) Keyword extraction method
US20150347385A1 (en) Systems and Methods for Determining Lexical Associations Among Words in a Corpus
US20160155058A1 (en) Non-factoid question-answering system and method
US8443008B2 (en) Cooccurrence dictionary creating system, scoring system, cooccurrence dictionary creating method, scoring method, and program thereof
CN102622338A (en) Computer-assisted computing method of semantic distance between short texts
CN106095753A (en) Financial-domain term recognition method based on information entropy and term credibility
CN102722518A (en) Information processing apparatus, information processing method, and program
EP3579120A1 (en) Extraction of tokens and relationship between tokens from documents to form an entity relationship map
Hengchen et al. A data-driven approach to studying changing vocabularies in historical newspaper collections
Patel et al. Extractive Based Automatic Text Summarization.
Dasgupta et al. A framework of customer review analysis using the aspect-based opinion mining approach
Erjavec et al. The slWaC corpus of the Slovene web
Dahir et al. Utilizing machine learning for sentiment analysis of IMDB movie review data
US11361565B2 (en) Natural language processing (NLP) pipeline for automated attribute extraction
Ousirimaneechai et al. Extraction of trend keywords and stop words from Thai Facebook pages using character n-grams
CN102982063A (en) Control method based on tuple elaboration of relation keywords extension
Baniata et al. Sentence representation network for Arabic sentiment analysis
CN110532551A (en) Method, equipment and storage medium for automatic text keyword extraction
Aumiller et al. UniHD@ CL-SciSumm 2020: Citation extraction as search
Khritankov et al. Discovering text reuse in large collections of documents: A study of theses in history sciences
Shrawankar et al. Construction of news headline from detailed news article
Shams et al. Intent Detection in Urdu Queries Using Fine-Tuned BERT Models
Garcia et al. Exploring the effectiveness of linguistic knowledge for biographical relation extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant