CN106126494A - Synonym discovery method and device, data processing method and device - Google Patents

Publication number
CN106126494A
Authority
CN
China
Prior art keywords
word
synonym
pending
phrase set
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610429937.XA
Other languages
Chinese (zh)
Other versions
CN106126494B (en)
Inventor
张昊
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Original Assignee
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority to CN201610429937.XA (patent CN106126494B)
Publication of CN106126494A
Application granted
Publication of CN106126494B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A synonym discovery method and device, and a data processing method and device. The synonym discovery method includes: obtaining a word set to be processed, the word set including multiple words; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and each corresponding target word to be a synonym pair. The minimum edit distance is calculated by an edit distance method that includes a deletion operation; the edit distance corresponding to the deletion operation is less than the edit distance corresponding to each of the other operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the other operations is greater than or equal to the preset threshold. This scheme can improve the accuracy of abbreviation discovery.

Description

Synonym discovery method and device, data processing method and device
Technical field
The present invention relates to the field of data processing, and in particular to a synonym discovery method and device and a data processing method and device.
Background
Synonymy is an important semantic relation and is often used in natural language processing tasks such as information retrieval and text classification. Specifically, before a task such as information retrieval or text classification is carried out, synonyms need to be acquired and identified. For example, in an information retrieval scenario, multiple words that are synonyms of one another can be grouped into one class; when the input text contains a keyword that has synonyms, a synonym can replace the original keyword in the search, so that the retrieval system can offer the user more candidate texts to confirm.
Written and everyday Chinese often uses shortened forms of proper names. These shortened forms are called abbreviations of the proper names; an abbreviation is a part of the original proper name, and abbreviations are also a kind of synonym. For example, the Chinese short form of "National People's Congress" is an abbreviation of its full name, the short form "China" is an abbreviation of "People's Republic of China", and the short form of "Real Madrid" is an abbreviation of the club's full name.
However, synonym discovery methods in the prior art cannot identify abbreviations well, so the accuracy of semantic understanding is low.
Summary of the invention
The technical problem solved by the present invention is to provide a synonym discovery method and device that improve the accuracy of abbreviation discovery.
To solve the above technical problem, an embodiment of the present invention provides a synonym discovery method, the method including: obtaining a word set to be processed, the word set including multiple words; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and each corresponding target word to be a synonym pair; wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to each of the other operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the other operations is greater than or equal to the preset threshold.
Optionally, the method further includes: separately calculating the semantic similarity between the word to be processed and each of the other words in the word set, and selecting as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
The target word is determined as follows: separately calculating the minimum edit distance between the word to be processed and each candidate word, and taking as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.
Optionally, separately calculating the semantic similarity between the word to be processed and each of the other words in the word set includes:
Vectorizing each word in the word set; and, based on the vectorization result, calculating the cosine similarity between the word to be processed and each of the other words, the cosine similarity serving as the semantic similarity.
Optionally, vectorizing each word in the word set includes:
Vectorizing each word in the word set using the word2vec method.
Optionally, obtaining the word set in which synonyms are to be found includes:
Performing word segmentation on an input corpus to obtain the word set.
Optionally, word segmentation is performed on the input corpus using a word segmentation dictionary, and the word segmentation dictionary is obtained as follows:
Preprocessing the input corpus to obtain text data; splitting the text data into lines to obtain sentence data; performing word segmentation on the sentence data according to the individual words contained in a basic dictionary, to obtain segmented word data; combining adjacent segmented word data to generate candidate data strings; performing judgment processing on the candidate data strings to find new words; and adding the new words to the word segmentation dictionary.
Optionally, the other operations include an insertion operation and a replacement operation; the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
An embodiment of the present invention also provides a data processing method, which includes the above synonym discovery method.
An embodiment of the present invention also provides a synonym discovery device, the device including:
An acquiring unit, adapted to obtain a word set to be processed, the word set including multiple words;
A synonym determining unit, adapted to, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and each corresponding target word to be a synonym pair;
Wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to each of the other operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the other operations is greater than or equal to the preset threshold.
Optionally, the synonym discovery device further includes:
A candidate word selecting unit, adapted to separately calculate the semantic similarity between the word to be processed and each of the other words in the word set, and to select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
A target word determining unit, adapted to separately calculate the minimum edit distance between the word to be processed and each candidate word, and to take as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.
Optionally, the candidate word selecting unit includes:
A vectorization subunit, adapted to vectorize each word in the word set;
A cosine similarity calculation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each of the other words, the cosine similarity serving as the semantic similarity.
Optionally, the vectorization subunit vectorizes each word in the word set using the word2vec method.
Optionally, the acquiring unit includes:
A word segmentation subunit, adapted to perform word segmentation on an input corpus to obtain the word set.
Optionally, the word segmentation subunit performs word segmentation on the input corpus using a word segmentation dictionary, the word segmentation dictionary is obtained by a word segmentation dictionary acquiring unit, and the word segmentation dictionary acquiring unit is adapted to:
Preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; perform word segmentation on the sentence data according to the individual words contained in a basic dictionary, to obtain segmented word data; combine adjacent segmented word data to generate candidate data strings; perform judgment processing on the candidate data strings to find new words; and add the new words to the word segmentation dictionary.
Optionally, the other operations include an insertion operation and a replacement operation; the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
An embodiment of the present invention also provides a data processing device, which includes the above synonym discovery device.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, a word set to be processed is obtained; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and each corresponding target word are determined to be a synonym pair. The minimum edit distance is calculated by an edit distance method that includes a deletion operation; the edit distance corresponding to the deletion operation is less than that corresponding to each of the other operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the other operations is greater than or equal to the preset threshold. On the one hand, because the edit distance of the deletion operation is limited to be less than that of the other operations, the minimum edit distance is obtained by using the deletion operation preferentially. On the other hand, because the edit distance of the deletion operation is less than the preset threshold while the edit distance of any single other operation is greater than or equal to the preset threshold, whenever the minimum edit distance from the word to be processed to a target word is less than the preset threshold, that target word can only have been obtained from the word to be processed through deletion operations. This ensures that a synonym obtained by the edit distance method is a part of the literal expression of the word to be processed, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
Further, multiple candidate words are selected by calculating the semantic similarity between the word to be processed and the other words in the word set, so that target words can be determined within the smaller range formed by the candidate words. Because the candidate words are a subset of the word set to be processed, determining target words from the candidate words improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as an additional criterion for synonymy further improves the accuracy of finding synonym pairs, and therefore the accuracy of abbreviation discovery.
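The candidate selection step can be sketched as follows. This is a minimal illustration, assuming word vectors are already available as plain Python lists (for example from a word2vec model trained elsewhere). The names `cosine_similarity`, `select_candidates`, `top_n` and `min_similarity` are hypothetical and do not come from the patent.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def select_candidates(word, vectors, top_n=None, min_similarity=None):
    """Return candidate words for `word`, most similar first.

    Either keep the top N most similar words, or keep every word whose
    similarity is greater than `min_similarity` (the two options named
    in the text). With neither option set, all other words are returned.
    """
    scored = [
        (cosine_similarity(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if top_n is not None:
        return [other for _, other in scored[:top_n]]
    if min_similarity is not None:
        return [other for score, other in scored if score > min_similarity]
    return [other for _, other in scored]
```

Only the candidates returned here would then be passed to the edit distance check, which is what keeps the search range small.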
Brief description of the drawings
Fig. 1 is a flow chart of a synonym discovery method in an embodiment of the present invention;
Fig. 2 is a flow chart of a method for obtaining a word segmentation dictionary in an embodiment of the present invention;
Fig. 3 is a flow chart of another synonym discovery method in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another synonym discovery device in an embodiment of the present invention.
Detailed description
Written and everyday Chinese often uses shortened forms of proper names. These shortened forms are called abbreviations of the proper names; an abbreviation is a part of the original proper name, and abbreviations are also a kind of synonym. For example, the Chinese short form of "National People's Congress" is an abbreviation of its full name, the short form "China" is an abbreviation of "People's Republic of China", and the short form of "Real Madrid" is an abbreviation of the club's full name. However, synonym discovery methods in the prior art cannot identify abbreviations well, so the accuracy of semantic understanding is low.
In the embodiments of the present invention, a word set to be processed is obtained; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and each corresponding target word are determined to be a synonym pair. The minimum edit distance is calculated by an edit distance method that includes a deletion operation; the edit distance corresponding to the deletion operation is less than that corresponding to each of the other operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the other operations is greater than or equal to the preset threshold. On the one hand, because the edit distance of the deletion operation is limited to be less than that of the other operations, the minimum edit distance is obtained by using the deletion operation preferentially. On the other hand, because the edit distance of the deletion operation is less than the preset threshold while the edit distance of any single other operation is greater than or equal to the preset threshold, whenever the minimum edit distance from the word to be processed to a target word is less than the preset threshold, that target word can only have been obtained from the word to be processed through deletion operations. This ensures that a synonym obtained by the edit distance method is a part of the literal expression of the word to be processed, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
To make the above objects, features and beneficial effects of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a synonym discovery method in an embodiment of the present invention. The method is described below with reference to the steps shown in Fig. 1.
Step S101: Obtain a word set to be processed, the word set including multiple words.
The word set to be processed is the word set in which synonym pairs are to be found.
In a specific implementation, the word set to be processed is obtained by performing word segmentation on an input corpus. The input corpus may be non-text data, such as speech data, or text data. When the input corpus is non-text data, it must first be converted into text data; that is, all subsequent processing operates on text data. The input corpus may be obtained from the conversation records between a question answering system and its users, or from manually compiled knowledge point data.
In a specific implementation, the input corpus may come from one specific field. The words in the word set to be processed are then semantic expressions about that field, and may include words that have the same meaning but different forms of expression, that is, synonyms. The specific field may be banking, education, sports and so on.
For example, if the input corpus comes from the banking field, some sentences may refer to a bank by its full name, such as "China Merchants Bank", while other sentences use the bank's short form; the short form and the full name are a synonym pair. Similarly, the corpus may contain both the full name "Industrial and Commercial Bank" and its short form, which are also a synonym pair. In both pairs, the short form is an abbreviation of the full name. Of course, the corpus may also contain synonyms that are not abbreviations; for example, two different expressions meaning "to remit money" are synonyms, but neither is an abbreviation of the other. This embodiment aims to find abbreviations through steps S101 and S102.
In a specific implementation, word segmentation of the input corpus is performed using a word segmentation dictionary. For the segmentation result to contain abbreviations, in other words for an abbreviation to be segmented out of a sentence as a word, the word segmentation dictionary needs to contain abbreviations. An abbreviation, as a kind of new word, may not exist in a general basic dictionary, so the basic dictionary needs to be updated through new word discovery, so that abbreviations are added to the updated basic dictionary as one kind of new word. The updated basic dictionary is then used as the word segmentation dictionary to segment the input corpus.
To make the word segmentation dictionary include abbreviations, the dictionary is obtained as follows; refer to the steps shown in Fig. 2.
S11: Preprocess the input corpus to obtain text data.
The input corpus may contain many format types. To make subsequent processing easier, the input corpus needs to be preprocessed to obtain text data.
In a specific implementation, the preprocessing may unify the format of the input corpus into text format and filter out one or more of profanity, sensitive words and stop words. When the format of the input corpus is unified into text format, content that current technology cannot yet convert into text format may be filtered out.
S12: Split the text data into lines to obtain sentence data.
Line splitting may split the input corpus at punctuation, for example at full stops, commas, exclamation marks and question marks. The sentence data obtained here is a preliminary segmentation of the corpus, used to determine the scope of the subsequent word segmentation.
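Splitting at punctuation can be sketched in a few lines. This is an illustrative sketch only; the function name `split_into_lines` and the exact punctuation set (ASCII and full-width full stop, comma, exclamation mark and question mark) are assumptions, not details given in the patent.

```python
import re

# Assumed punctuation set: full-width and ASCII full stop, comma,
# exclamation mark and question mark.
_SPLIT_PATTERN = re.compile(r"[。，！？.,!?]+")


def split_into_lines(text):
    """Split text at punctuation and drop empty pieces."""
    pieces = _SPLIT_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Note that treating the ASCII full stop as a boundary would also split decimal numbers; a real preprocessing step would need to decide how to handle such cases.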
S13: Perform word segmentation on the sentence data according to the individual words contained in the basic dictionary, to obtain segmented word data.
The basic dictionary is so named to distinguish it from the word segmentation dictionary; it may not contain abbreviations. The basic dictionary contains multiple individual words, which may differ in length. In a specific implementation, segmentation based on the basic dictionary may use one or more of dictionary-based bidirectional maximum matching, the HMM method and the CRF method.
The word segmentation is performed on the sentence data within one line, and the resulting word data are all individual words contained in the basic dictionary.
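As a rough illustration of dictionary-based segmentation, the sketch below implements forward maximum matching, which is one half of the bidirectional maximum matching named above (the bidirectional variant also runs a backward pass and chooses between the two results). The function name `forward_max_match` and the fallback of emitting a single character when nothing in the dictionary matches are assumptions made for this sketch.

```python
def forward_max_match(line, dictionary, max_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(line):
        # Try the longest possible piece first, then shorter ones.
        for size in range(min(max_len, len(line) - i), 0, -1):
            piece = line[i:i + size]
            # Assumption: an unknown single character becomes its own token.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For example, with a dictionary containing "platinum" and "card", the unspaced string "platinumcard" is split into those two words, which is exactly the situation where a domain term ends up as two pieces and new word discovery is needed.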
S14: Combine adjacent segmented word data to generate candidate data strings.
Because segmentation follows the basic dictionary, data that should be a single word in a given field may be split into several pieces of word data, so new word discovery is needed. Candidate data strings are subsequently screened against set conditions, and those that pass are taken as new words. Generating candidate data strings is the prerequisite for this screening and can be done in several ways.
In a specific implementation, a bigram model may be used to take two adjacent words in the sentence data of one line as a candidate data string.
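Generating bigram candidates from segmented lines can be sketched as below. The function name `bigram_candidates` and the choice to return occurrence counts are assumptions for illustration; the patent only states that two adjacent words within one line form a candidate data string.

```python
from collections import Counter


def bigram_candidates(segmented_lines):
    """Count each pair of adjacent words; pairs never span two lines."""
    counts = Counter()
    for tokens in segmented_lines:
        for left, right in zip(tokens, tokens[1:]):
            counts[(left, right)] += 1
    return counts
```

Keeping the counts is convenient because the later screening steps rely on how often each pair and each neighbour occurs.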
Assume that a sentence S can be expressed as the sequence S = w1 w2 ... wn. The language model then computes the probability p(S) of the sentence S:
$$p(S) = p(w_1, w_2, \ldots, w_n) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1}) \tag{1}$$
The probability in formula (1) is based on the full n-gram model; computing it is too expensive to be used in practical applications. Under the Markov assumption, the appearance of the next word depends only on the one or few words before it. Assuming that the next word depends only on the one word before it:
$$p(S) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1}) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1}) \tag{2}$$
Assuming that the next word depends on the two words before it:
$$p(S) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1}) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_{n-2}, w_{n-1}) \tag{3}$$
Formula (2) is the formula for the bigram probability, and formula (3) is the formula for the trigram probability. A larger n imposes more constraints on the appearance of the next word and so has greater discriminative power; a smaller n means each candidate data string occurs more often during new word discovery, which provides more reliable statistics and so higher reliability.
In theory, the larger n is, the better, and among existing processing methods trigrams are used most often. Bigrams, however, require less computation and give higher system efficiency.
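The bigram factorisation in formula (2) can be made concrete with a small sketch. It assumes the conditional probabilities are estimated by simple relative frequencies from a segmented corpus, with no smoothing; that estimator, and the name `bigram_sentence_probability`, are assumptions of this sketch rather than anything specified in the patent.

```python
from collections import Counter


def bigram_sentence_probability(sentence, corpus):
    """Estimate p(S) = p(w1) * p(w2|w1) * ... * p(wn|wn-1) from raw counts."""
    unigrams = Counter(word for line in corpus for word in line)
    bigrams = Counter(pair for line in corpus for pair in zip(line, line[1:]))
    total = sum(unigrams.values())
    if not sentence or total == 0:
        return 0.0
    probability = unigrams[sentence[0]] / total
    for previous, current in zip(sentence, sentence[1:]):
        if unigrams[previous] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        # Relative-frequency estimate of p(current | previous).
        probability *= bigrams[(previous, current)] / unigrams[previous]
    return probability
```

Each factor needs only pair counts and single-word counts, which is why the bigram model is cheaper to compute than higher-order models.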
S15: Judge whether the candidate data string is a specific candidate data string. A specific candidate data string includes a basic noun, and the word at a specific relative position of the basic noun is a noun or an adjective.
According to the inventors' research, if the word at a specific relative position of a basic noun is a noun or an adjective, the basic noun very likely needs to be combined with that word as a new word. For example, for the basic noun "card", when the word to its left is a noun, new words such as "Dragon Card", "elite school card", "platinum card" and "business card" can be formed. Therefore, to judge whether a candidate data string is a specific candidate data string, it can be judged whether the string contains a basic noun and whether the word at the specific relative position of that basic noun is a noun or an adjective.
The specific relative position of a basic noun can be set according to the basic noun and the corpus. For example, when the corpus contains many kinds of "card" and the names of the various cards all need to be treated as new words, the position to the left of the basic noun can be set to be a noun or an adjective.
In a specific implementation, the specific relative position may be the left side, the right side or both, and can be configured as needed.
In a specific implementation, the basic noun may be determined with reference to word frequency: because a basic noun appears repeatedly in the corpus, its frequency can be used to identify it. It is understood that basic nouns may also be selected and set by manual reading.
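The check in step S15 can be sketched as below. The sketch assumes part-of-speech tags are supplied by some external tagger as a dictionary, with `"n"` for nouns and `"a"` for adjectives; the tag names, the function name `is_specific_candidate` and the `side` parameter are all hypothetical. Only a single side is handled here, although the text also allows both sides.

```python
NOUN_OR_ADJECTIVE = {"n", "a"}  # assumed tag names for noun and adjective


def is_specific_candidate(candidate, pos_tags, basic_nouns, side="left"):
    """Check whether a two-word candidate is a 'specific' candidate string.

    side="left":  the basic noun is the right word, its left neighbour is tested.
    side="right": the basic noun is the left word, its right neighbour is tested.
    """
    left, right = candidate
    if side == "left":
        base, neighbour = right, left
    elif side == "right":
        base, neighbour = left, right
    else:
        raise ValueError("side must be 'left' or 'right'")
    return base in basic_nouns and pos_tags.get(neighbour) in NOUN_OR_ADJECTIVE
```

With "card" as the only basic noun and the left side configured, ("platinum", "card") qualifies when "platinum" is tagged as a noun, while ("apply", "card") does not when "apply" is tagged as a verb.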
S16: Perform judgment processing on the candidate data string to find new words. The judgment processing includes:
When the candidate data string is not a specific candidate data string, calculating the information entropy between each word in the candidate data string and the word on its inner side, and removing candidate data strings whose information entropy is outside a preset range;
When the candidate data string is a specific candidate data string, calculating only the information entropy between the outer word of the specific candidate data string, that is, the word other than the basic noun, and the word on its inner side, and removing candidate data strings whose information entropy is outside the preset range.
Because a candidate data string includes two pieces of word data, judging it requires judging the inner information entropy of each of the two. Information entropy measures the uncertainty of a random variable and is computed as follows:
$$H(X) = -\sum_{i} p(x_i)\,\log p(x_i)$$
The larger the information entropy, the greater the uncertainty of the variable, that is, the more evenly the probability is spread over the possible values. If one value of the variable occurs with probability 1, the entropy is 0, which indicates that the variable takes only that value and its occurrence is a certain event.
The left information entropy and the right information entropy of a word W are calculated as follows:
$$H_1(W) = -\sum_{x \in X,\ \#(xW) > 0} P(x \mid W)\,\log P(x \mid W)$$
where X is the set of all word data appearing to the left of W, and H_1(W) is the left information entropy of the word data W.
$$H_2(W) = -\sum_{y \in Y,\ \#(Wy) > 0} P(y \mid W)\,\log P(y \mid W)$$
where Y is the set of all word data appearing to the right of W, and H_2(W) is the right information entropy of the word data W.
The inner information entropy is obtained by fixing each individual piece of word data in the candidate data string in turn and calculating the information entropy of the other word appearing given that this word data appears. If the candidate data string is (W1 W2), the right information entropy of W1 and the left information entropy of W2 are calculated.
Calculating the entropy between a piece of word data in the candidate data string and the word data on its inner side reflects how disordered the word data on that inner side is. For example, by calculating the right information entropy of the left word W1 and the left information entropy of the right word W2 in the candidate data string W1 W2, the degree of disorder on the inner sides of W1 and W2 can be judged. Candidates can then be screened against a preset range, excluding candidate data strings for which the probability characteristic value of a word forming a new word with the word on its inner side falls outside that range.
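The left and right information entropies can be computed from neighbour counts, as in the sketch below. It assumes the natural logarithm and relative-frequency estimates of P(x | W) and P(y | W); the function name `neighbour_entropies` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def neighbour_entropies(segmented_lines):
    """Return (left_entropy, right_entropy) dictionaries keyed by word."""
    left_neighbours = defaultdict(Counter)
    right_neighbours = defaultdict(Counter)
    for tokens in segmented_lines:
        for previous, current in zip(tokens, tokens[1:]):
            right_neighbours[previous][current] += 1
            left_neighbours[current][previous] += 1

    def entropy(counter):
        total = sum(counter.values())
        # H = -sum p * log p over the neighbours that actually occur.
        return -sum((c / total) * math.log(c / total) for c in counter.values())

    left = {word: entropy(c) for word, c in left_neighbours.items()}
    right = {word: entropy(c) for word, c in right_neighbours.items()}
    return left, right
```

A word that is always followed by the same neighbour has a right entropy of zero, while a word such as "card" that is preceded by many different words has a comparatively high left entropy.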
In particular candidate serial data, the inner side comentropy of basis noun perhaps can be because, outside preset range, causing making Particular candidate serial data for neologisms is excluded, and such as, particular candidate serial data is " platinum card ", " elite school's card ", " Long Card " etc. Comprise basis noun " block " candidate data string time, word " platinum ", " name ", " imperial " right side comentropy in preset range, But the left side word " blocked " due to word is more chaotic, on the left of it, comentropy may be outside preset range, consequently, it is possible to cause candidate Candidate's serial data such as serial data " platinum card ", " elite school's card ", " Long Card " is by the eliminating of mistake.
Therefore when described candidate data string is particular candidate serial data, only calculate the word outside described particular candidate serial data Language with its inside the comentropy of word, remove described comentropy candidate data string outside preset range, no longer to basis noun Inner side comentropy calculate, it is to avoid the inner side comentropy of gene basis noun is outer in preset range and the false exclusion that causes.
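The left-side, right-side and inner information entropy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `side_entropy` and `passes_inner_entropy`, the corpus format (a list of token lists) and the `low`/`high` bounds standing in for the preset range are all hypothetical, and the sketch uses the standard Shannon form with a leading minus sign and the natural logarithm.

```python
import math
from collections import Counter


def side_entropy(sentences, word, side):
    """Entropy of the tokens seen immediately to the `side` ("left"/"right") of `word`."""
    neighbours = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            j = i - 1 if side == "left" else i + 1
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    # Standard Shannon entropy (leading minus sign): an assumption of this sketch.
    return -sum((c / total) * math.log(c / total) for c in neighbours.values())


def passes_inner_entropy(sentences, w1, w2, low, high, basic_nouns=frozenset()):
    """Keep candidate string (w1 w2) if every checked inner entropy lies in [low, high].

    Words listed in `basic_nouns` (hypothetical parameter) are skipped, mirroring
    the rule that the inner entropy of the basic noun is not calculated.
    """
    checks = []
    if w1 not in basic_nouns:
        checks.append(side_entropy(sentences, w1, "right"))
    if w2 not in basic_nouns:
        checks.append(side_entropy(sentences, w2, "left"))
    return all(low <= h <= high for h in checks)
```

With a toy corpus in which "card" follows many different words, the candidate ("platinum", "card") fails a narrow range when "card" is checked, and passes once "card" is treated as a basic noun.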
S17: Add the new words to the word segmentation dictionary.
Since an abbreviation is also a kind of new word, the set of new words obtained from the input corpus also includes abbreviations. Adding the new words to the word segmentation dictionary therefore means that the dictionary contains abbreviations, and the input corpus can then be segmented with the word segmentation dictionary to obtain the phrase set of this embodiment.
The steps after the pending phrase set is obtained are described below.
Step S102: For any pending word in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than a preset threshold, the pending word and the corresponding target word are determined to be a synonym pair.
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a deletion operation; the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the remaining operations; the edit distance corresponding to the deletion operation is less than the preset threshold; and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
In a specific implementation, the remaining operations may include a replacement operation and an insertion operation. The edit distance referred to in this embodiment is the editing cost required to convert one word into another by edit operations, that is, the number of operations multiplied by the cost of each operation, summed over the operations used. The minimum edit distance is the edit distance with the lowest editing cost. Each operation step acts on only one character.
Edit distance and minimum edit distance are illustrated below using the conversion of "工商银行" (Industrial and Commercial Bank) into its abbreviation "工行" as an example. The conversion can be achieved by different combinations of edit operations. Assume that the edit distance of a single replacement operation is 1000, that of a single deletion operation is 1, and that of a single insertion operation is 1000.
The first way of converting "工商银行" into "工行" is: delete "工", "商" and "银" from "工商银行" in three deletion operations, then perform one insertion operation to insert "工", giving "工行". The edit distance from "工商银行" to "工行" is then 1003;
The second way of converting "工商银行" into "工行" is: delete "工" and "商" from "工商银行" in two deletion operations, then perform one replacement operation to replace "银" with "工", giving "工行". The edit distance from "工商银行" to "工行" is then 1002;
The third way of converting "工商银行" into "工行" is: delete "商" and "银" from "工商银行" in two deletion operations, giving "工行". The edit distance from "工商银行" to "工行" is then 2.
Note that the ways of converting the pending word "工商银行" into "工行" are not limited to the operation combinations listed above, and different ways correspond to different edit distances. Among all the ways, however, the minimum edit distance is unique. The minimum edit distance from "工商银行" to "工行" is 2, obtained by the third way above.
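The minimum edit distance in this example can be checked with a standard dynamic-programming sketch. The function name `min_edit_distance` and its parameter names are assumptions made for illustration; the default costs (deletion 1, insertion 1000, replacement 1000) are the ones assumed in the example above, and the text does not prescribe this particular algorithm.

```python
def min_edit_distance(source, target, delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Cheapest cost of turning `source` into `target`, one character per operation."""
    m, n = len(source), len(target)
    # dist[i][j] = cheapest cost of turning source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * delete_cost      # delete every remaining source character
    for j in range(1, n + 1):
        dist[0][j] = j * insert_cost      # insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,                            # delete source[i-1]
                dist[i][j - 1] + insert_cost,                            # insert target[j-1]
                dist[i - 1][j - 1] + (0 if same else replace_cost),      # keep or replace
            )
    return dist[m][n]
```

Under these costs `min_edit_distance("工商银行", "工行")` evaluates to 2, which corresponds to the third way of conversion. The distance is directional: converting "工行" back into "工商银行" needs two insertions and costs 2000.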
Therefore, for any pending word in the phrase set, its minimum edit distance to another word is determinate. The minimum edit distance between a pending word and every other word in the phrase set is calculated; when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than the preset threshold, the pending word and the corresponding target word are determined to be a synonym pair. For example, let the phrase set be L (A, B, C, D, E, F, G and H). For pending word A, the target word comes from the subset M (B, C, D, E, F, G and H); when a word B exists in (B, C, D, E, F, G and H) such that the minimum edit distance from pending word A to word B is less than the preset threshold, A and B are a synonym pair.
To ensure that the target word of a synonym pair found in this embodiment is an abbreviation of the pending word, that is, that the abbreviation is a part of the pending word, the edit distance method requires that the edit distance corresponding to a single remaining operation be greater than or equal to the preset threshold and that the edit distance corresponding to the deletion operation be less than that of the remaining operations. Moreover, not only is the edit distance corresponding to a single deletion operation less than the preset threshold, but the edit distance corresponding to multiple deletion operations is also less than the preset threshold. The number of deletion operations can be determined from the maximum number of characters deleted between a full name and its abbreviation; for example, if at most 5 characters are deleted, the number is 5.
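The constraints on the operation costs stated above can be written as a small sanity check. The function `check_cost_settings` and its parameter names are hypothetical; the sketch only restates the conditions from the text and is not taken from the patent.

```python
def check_cost_settings(delete_cost, insert_cost, replace_cost, threshold, max_deletions):
    """Return True if the cost settings satisfy the constraints described in the text."""
    return (
        delete_cost < min(insert_cost, replace_cost)   # deletion is the cheapest operation
        and max_deletions * delete_cost < threshold    # even repeated deletions stay below the threshold
        and insert_cost >= threshold                   # a single insertion reaches the threshold
        and replace_cost >= threshold                  # a single replacement reaches the threshold
    )
```

For instance, deletion cost 1, insertion and replacement cost 1000, threshold 10 and at most 5 deleted characters satisfy all four conditions, whereas allowing 10 deleted characters with the same threshold does not.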
In a specific implementation, one or more abbreviations may be found by the above method. Note that the words found by the method of this embodiment to form synonym pairs with the same pending word are not necessarily abbreviations of one another. For example, in phrase set L, the method of this embodiment finds that word B is one abbreviation of pending word A and that word C is another, so the minimum edit distance from A to B and the minimum edit distance from A to C are both less than the preset threshold. Word B and word C are not necessarily in an abbreviation relationship, that is, it cannot be guaranteed that B is an abbreviation of C or that C is an abbreviation of B, but B and C are synonyms.
Note also that multiple pending words that correspond to the same abbreviation obtained by the method of this embodiment are not necessarily synonyms. For example, in phrase set L, the method finds that B is the abbreviation of pending word A and also that B is the abbreviation of pending word D, but word A and word D are not necessarily synonyms.
Because this embodiment restricts the edit distance corresponding to the deletion operation to be less than the edit distance corresponding to the remaining operations, the deletion operation is preferred when the edit distance method calculates the minimum edit distance for converting the pending word into another word. In addition, the edit distance corresponding to the deletion operations used in calculating the minimum edit distance is less than the preset threshold, while the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold. Thus, when the minimum edit distance from the pending word to the target word is less than the preset threshold, the target word can only have been obtained from the pending word by deletion operations. This ensures that a synonym obtained by the edit distance method is a part of the literal expression of the pending word, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
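The consequence drawn above, that a below-threshold distance can only arise from deletions, suggests an equivalent shortcut: test whether the target is an in-order subsequence of the pending word and whether the deletions alone stay below the threshold. The sketch below uses that shortcut instead of the full edit distance calculation; it is a swapped-in technique for illustration, not the method stated in the text, and the names `is_deletion_only` and `find_synonym_pairs` are assumptions.

```python
def is_deletion_only(pending, target, delete_cost=1, threshold=10):
    """True if `target` results from `pending` by deletions costing less than `threshold`."""
    remaining = iter(pending)
    # `ch in remaining` consumes the iterator, so characters must appear in order.
    is_subsequence = all(ch in remaining for ch in target)
    deletions = len(pending) - len(target)
    return is_subsequence and deletions * delete_cost < threshold


def find_synonym_pairs(words, delete_cost=1, threshold=10):
    """Return (pending word, target word) pairs found in `words`."""
    pairs = []
    for pending in words:
        for target in words:
            if target != pending and is_deletion_only(pending, target, delete_cost, threshold):
                pairs.append((pending, target))
    return pairs
```

The shortcut is valid only while every single insertion or replacement costs at least the threshold, as the text requires. Each pair links a full form to one abbreviation; as noted above, two abbreviations of the same word, or two words sharing an abbreviation, are not thereby related to each other by abbreviation.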
Fig. 3 is a flowchart of a synonym discovery method in an embodiment of the present invention. It is described below with reference to the steps shown in Fig. 3.
Step S301: Obtain a pending phrase set, the phrase set including multiple words.
For the implementation of this step, refer to step S101 shown in Fig. 1; it is not repeated here.
Step S302: For any pending word in the phrase set, calculate the semantic similarity between the pending word and each of the remaining words in the phrase set, and select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values.
In one implementation, the semantic similarity value between each remaining word and the pending word is compared with the similarity threshold, and the words whose semantic similarity value is greater than the similarity threshold are taken as candidate words. Note that the similarity threshold can be preset to different values and is not limited in any way; the number of candidate words then varies with the similarity threshold.
In another implementation, the candidate words with higher semantic similarity values are obtained by limiting the number N of candidate words. Specifically, the semantic similarity values are sorted from high to low, and the top N words are taken as candidate words.
This step selects candidate words from the phrase set so that the target word can subsequently be determined from the candidate words. On the one hand, this narrows the range within which the target word forming a synonym pair with the pending word is determined, which reduces computational complexity and improves the efficiency of finding abbreviations. On the other hand, using semantic similarity as a further criterion for judging whether words are synonyms further improves the accuracy of finding synonym pairs, and thus the accuracy of finding abbreviations.
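The two ways of selecting candidate words can be sketched as a single helper. The function `select_candidates` and the dictionary of precomputed similarity scores are assumptions for illustration; the scores in the usage note are made up and do not come from the text.

```python
def select_candidates(similarities, threshold=None, top_n=None):
    """Pick candidate words from a {word: similarity} mapping.

    If `threshold` is given, keep words whose similarity exceeds it;
    otherwise keep the `top_n` words with the highest similarity.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        return [word for word, score in ranked if score > threshold]
    return [word for word, _ in ranked[:top_n]]
```

With hypothetical scores `{"B": 0.2, "C": 0.6, "D": 0.8}`, a threshold of 0.5 and a top-N of 2 both select D and C, in that order.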
In a specific implementation, the semantic similarity between the pending word and each remaining word in the phrase set can be calculated through the following steps:
First, vectorize each word in the phrase set;
Second, based on the vectorization result, calculate the cosine similarity between the pending word and each remaining word, the cosine similarity serving as the semantic similarity. After the cosine similarity is calculated, the words whose cosine similarity value is greater than the similarity threshold, or the top N words with the highest cosine similarity values, can be selected as candidate words.
In a specific implementation, the word2vec method can be used to vectorize each word in the phrase set. Other existing methods can also be used to vectorize each word in the phrase set.
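The cosine similarity step can be sketched in plain Python. The sketch assumes that each word has already been mapped to a vector of floats by some vectorization method such as word2vec; training the vectors is outside its scope, and the function name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction give a similarity of 1, orthogonal vectors give 0, and the resulting values can be compared with the similarity threshold or ranked to take the top N words.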
Step S303: When one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than the preset threshold, the pending word and the corresponding target word are determined to be a synonym pair.
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a deletion operation; the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the remaining operations, and the edit distance corresponding to the deletion operation is less than the preset threshold.
The target word is determined as follows: the minimum edit distance between the pending word and each candidate word is calculated, and the candidate words whose minimum edit distance from the pending word is less than the preset threshold are taken as target words.
In this embodiment, the remaining operations include an insertion operation and a replacement operation. The edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
Steps S301 to S303 are illustrated below with an example in which each step uses one implementation; this should not be construed as limiting the present invention.
Performing step S301, the pending phrase set Q (A, B, C and D) is obtained, where A, B, C and D can each be a pending word. Assume that A is "招商银行" (China Merchants Bank), B is "工行" (the abbreviation of Industrial and Commercial Bank), C is "招行" (the abbreviation of China Merchants Bank) and D is "工商银行" (Industrial and Commercial Bank).
The following steps take pending word A, "招商银行", as an example.
Performing step S302, the word2vec method is used to vectorize each word (A, B, C and D) in phrase set Q. Based on the vectorization result, the cosine similarity between pending word A and each of the remaining words B, C and D is calculated; in descending order of cosine similarity value the words are D, C and B. The top 2 words with the highest cosine similarity values are selected as candidate words, that is, word D "工商银行" and word C "招行".
Performing step S303, for pending word A "招商银行", the edit distance method is used to calculate the minimum edit distance from pending word A "招商银行" to candidate word D "工商银行" and the minimum edit distance from pending word A "招商银行" to candidate word C "招行".
In the edit distance method of this example, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the insertion and replacement operations, the edit distance corresponding to the deletion operation is less than the preset threshold, the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold. Assume that the edit distance of a single deletion operation is 1, that of a single insertion operation is 1000, that of a single replacement operation is 1000, and the preset threshold is 10. Then:
Among all combinations of edit operations that convert the pending word "招商银行" into candidate word D "工商银行", the edit distance obtained by one replacement operation is the smallest, specifically replacing "招" with "工", so the minimum edit distance is 1000;
Among all combinations of edit operations that convert the pending word "招商银行" into candidate word C "招行", the edit distance obtained by two deletion operations is the smallest, specifically deleting "商" and "银", so the minimum edit distance is 2;
Of the minimum edit distances calculated above, the one less than the preset threshold 10 is 2, so the corresponding target word is candidate word C "招行". Pending word A "招商银行" and candidate word C "招行" are determined to be a synonym pair, and "招行" is the abbreviation of "招商银行".
As another example, assume that the pending phrase set is P ("招商银行", "工行" and "工商银行"). For the pending word "招商银行", the semantic similarity between "招商银行" and "工行" and between "招商银行" and "工商银行" is calculated, and both semantic similarity values are greater than the similarity threshold. The minimum edit distance is then calculated; since the edit distance of the deletion operation is less than that of the replacement operation, the deletion operation is preferred at each step:
"招商银行" can be converted into "工行" at minimum cost by 2 deletion operations and 1 replacement operation. Specifically, the cheapest conversion deletes "招" and "商" and replaces "银" with "工". Since the single-step edit distance of the deletion operation is 1 and that of the replacement operation is 1000, the minimum edit distance from "招商银行" to "工行" is 1002;
Similarly, "招商银行" is converted into "工商银行" at minimum cost by 1 replacement operation, specifically replacing "招" with "工". Since the single-step edit distance of the replacement operation is 1000, the minimum edit distance from "招商银行" to "工商银行" is 1000.
It can be seen that the minimum edit distance from "招商银行" to "工行" and the minimum edit distance from "招商银行" to "工商银行" are both greater than the preset threshold 10, so neither candidate word "工行" nor candidate word "工商银行" is a target word. That is, the pending phrase set contains no word that forms a synonym pair with the pending word "招商银行".
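The two worked examples above can be reproduced with a short sketch. Here the minimum edit distance is computed by memoized recursion; the function names `min_edit_distance` and `find_targets` are assumptions, the costs and the threshold of 10 are the values assumed in the examples, and the candidate lists are passed in directly rather than derived from real word vectors.

```python
from functools import lru_cache


def min_edit_distance(source, target, delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Cheapest cost of turning `source` into `target`, one character per operation."""

    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(source):                      # only insertions remain
            return (len(target) - j) * insert_cost
        if j == len(target):                      # only deletions remain
            return (len(source) - i) * delete_cost
        keep_or_replace = go(i + 1, j + 1) + (0 if source[i] == target[j] else replace_cost)
        return min(keep_or_replace, go(i + 1, j) + delete_cost, go(i, j + 1) + insert_cost)

    return go(0, 0)


def find_targets(pending, candidates, threshold=10):
    """Candidates whose minimum edit distance from `pending` is below `threshold`."""
    return [c for c in candidates if min_edit_distance(pending, c) < threshold]


# Phrase set Q: the candidates for "招商银行" are D "工商银行" and C "招行".
targets_q = find_targets("招商银行", ["工商银行", "招行"])
# Phrase set P: the candidates for "招商银行" are "工行" and "工商银行".
targets_p = find_targets("招商银行", ["工行", "工商银行"])
```

In the first case only "招行" falls below the threshold; in the second case the distances are 1002 and 1000, so `targets_p` is empty and no synonym pair is formed.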
Because this embodiment restricts the edit distance corresponding to the deletion operation to be less than the edit distance corresponding to the remaining operations, the deletion operation is preferred when the edit distance method calculates the minimum edit distance for converting the pending word into another word. On this basis, the edit distance corresponding to the deletion operations used in calculating the minimum edit distance is less than the preset threshold, while the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold. Thus, when the minimum edit distance from the pending word to the target word is less than the preset threshold, the target word can only have been obtained from the pending word by deletion operations. This ensures that a synonym obtained by the edit distance method is a part of the literal expression of the pending word, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
Further, this embodiment selects multiple candidate words by calculating the semantic similarity between the pending word and the remaining words in the phrase set, so that the target word can be determined within the smaller range formed by the candidate words. Since the candidate words are a subset of the pending phrase set, determining the target word from the candidate words improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as a further criterion for judging whether words are synonyms further improves the accuracy of finding synonym pairs.
An embodiment of the present invention further provides a data processing method based on the above synonym discovery method. The data processing method judges synonyms by means of a thesaurus, and the thesaurus includes the abbreviations obtained by the above synonym discovery method. The data processing method is described below.
The data processing method includes: obtaining a knowledge point, the knowledge point including a question and a corresponding answer; for any keyword obtained by segmenting the question, judging according to the thesaurus whether the keyword has a synonym; when the keyword has a synonym, replacing the corresponding keyword with the synonym found; and storing the question obtained after the replacement and adding it to the knowledge point.
For example, the above synonym discovery method finds that "招行" is the abbreviation of "招商银行" (China Merchants Bank), and the two form a synonym pair in the thesaurus. The data processing method is then performed as follows:
Obtain a knowledge point in which the question is "招商银行信用卡如何开通" ("How do I activate a China Merchants Bank credit card") and the corresponding answer is S;
Segmenting the question "招商银行信用卡如何开通" yields, among others, the keyword "招商银行". Whether this keyword has a synonym is judged according to the thesaurus. Since the thesaurus contains its abbreviation "招行" as a synonym of "招商银行", "招行" replaces the keyword "招商银行" in the question; the question obtained after the replacement, "招行信用卡如何开通", is stored and added to the knowledge point. The original knowledge point is thus expanded to the questions "招商银行信用卡如何开通" and "招行信用卡如何开通" with the corresponding answer S. The synonym "招行" is obtained by the above synonym discovery method, which is not described again here.
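The expansion of a knowledge point described above can be sketched as follows. The function `expand_knowledge_point`, the dictionary-based thesaurus and the result layout are hypothetical, and the question is assumed to be already segmented into keywords, since word segmentation itself is outside the scope of this sketch.

```python
def expand_knowledge_point(question_tokens, answer, thesaurus):
    """Add question variants in which one keyword is replaced by a synonym.

    `question_tokens` is the segmented question; `thesaurus` maps a keyword
    to a list of its synonyms (for example, its abbreviations).
    """
    questions = ["".join(question_tokens)]
    for i, token in enumerate(question_tokens):
        for synonym in thesaurus.get(token, []):
            replaced = question_tokens[:i] + [synonym] + question_tokens[i + 1:]
            variant = "".join(replaced)
            if variant not in questions:          # store each variant only once
                questions.append(variant)
    return {"questions": questions, "answer": answer}


# Assumed segmentation of the example question; the answer is the placeholder S.
knowledge_point = expand_knowledge_point(
    ["招商银行", "信用卡", "如何", "开通"], "S", {"招商银行": ["招行"]}
)
```

The resulting knowledge point holds both the original question and the variant that uses "招行", each associated with the same answer S.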
It can thus be seen that the above synonym discovery method can be used to expand the questions in a knowledge point and thereby expand the knowledge base, so that when a question is phrased differently by using an abbreviation, the corresponding answer can still be returned. This improves the semantic understanding ability of the intelligent question answering system and the accuracy of its answers. Note that the above synonym discovery method can be applied not only to expanding a knowledge base but also to information search. When applied to information search, a search can retrieve not only the information related to the keyword but also the information related to the abbreviation or full name of the keyword.
Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device may include an acquiring unit 401 and a synonym determining unit 402;
The acquiring unit 401 is adapted to obtain a pending phrase set, the phrase set including multiple words;
The synonym determining unit 402 is adapted to, for any pending word in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than a preset threshold, determine the pending word and the corresponding target word to be a synonym pair;
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a deletion operation; the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the remaining operations; the edit distance corresponding to the deletion operation is less than the preset threshold; and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
In a specific implementation, the remaining operations include an insertion operation and a replacement operation. The edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
In a specific implementation, the acquiring unit 401 includes a word segmentation subunit adapted to segment an input corpus to obtain the phrase set. In a specific implementation, the word segmentation subunit segments the input corpus using a word segmentation dictionary. The word segmentation dictionary is obtained by a word segmentation dictionary acquiring unit, which is adapted to:
Preprocess the input corpus to obtain text data; split the text data into lines to obtain phrase data; segment the phrase data according to the independent words contained in a basic dictionary to obtain segmented term data; combine adjacent segmented term data to generate candidate data strings; perform judgment processing on the candidate data strings to discover new words; and add the new words to the word segmentation dictionary.
For the structure and beneficial effects of the synonym discovery device of this embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 1; they are not repeated here.
Fig. 5 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device shown in Fig. 5 may include an acquiring unit 501, a candidate word selecting unit 502, a target word determining unit 503 and a synonym determining unit 504.
The acquiring unit 501 is adapted to obtain a pending phrase set, the phrase set including multiple words.
The synonym determining unit 504 is adapted to, for any pending word in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than a preset threshold, determine the pending word and the corresponding target word to be a synonym pair. The minimum edit distance is calculated by an edit distance method in which the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the remaining operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
In a specific implementation, the remaining operations include an insertion operation and a replacement operation. The edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
In a specific implementation, the acquiring unit 501 includes a word segmentation subunit 5011 adapted to segment an input corpus to obtain the phrase set.
In a specific implementation, the word segmentation subunit 5011 segments the input corpus using a word segmentation dictionary. The word segmentation dictionary is obtained by a word segmentation dictionary acquiring unit, which is adapted to:
Preprocess the input corpus to obtain text data; split the text data into lines to obtain phrase data; segment the phrase data according to the independent words contained in a basic dictionary to obtain segmented term data; combine adjacent segmented term data to generate candidate data strings; perform judgment processing on the candidate data strings to discover new words; and add the new words to the word segmentation dictionary.
In a specific implementation, the synonym discovery device may further include:
A candidate word selecting unit 502, adapted to calculate the semantic similarity between the pending word and each of the remaining words in the phrase set, and to select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
A target word determining unit 503, adapted to calculate the minimum edit distance between the pending word and each candidate word, and to take as target words the candidate words whose minimum edit distance from the pending word is less than the preset threshold.
In a specific implementation, the candidate word selecting unit 502 may include:
A vectorization subunit 5021, adapted to vectorize each word in the phrase set;
A cosine similarity calculation subunit 5022, adapted to calculate, based on the vectorization result, the cosine similarity between the pending word and each remaining word, the cosine similarity serving as the semantic similarity.
In a specific implementation, the word2vec method can be used to vectorize each word in the phrase set.
For the structure and beneficial effects of the synonym discovery device of this embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 3; they are not repeated here.
An embodiment of the present invention also provides a data processing device that uses the synonym discovery device shown in Fig. 4 or Fig. 5. The data processing device may include:
a knowledge point acquisition unit, adapted to obtain a knowledge point, the knowledge point including a question and a corresponding answer;
a synonym lookup unit, adapted to determine, for any keyword obtained by segmenting the question, whether the keyword has a synonym according to a synonym dictionary;
a replacement unit, adapted to replace the keyword with the synonym found when the keyword has a synonym;
a knowledge point expansion unit, adapted to store the question obtained after replacement and add it to the knowledge point.
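The four units above can be sketched as one small function. The data layout (a dict with `questions` and `answer`), the function name, and the sample thesaurus are assumptions for illustration; the question is assumed to be segmented already.

```python
def expand_knowledge_point(question_tokens, answer, thesaurus):
    """Add question variants obtained by swapping each keyword for its synonyms."""
    questions = ["".join(question_tokens)]
    for i, keyword in enumerate(question_tokens):
        # Synonym lookup: an empty list means the keyword has no synonym.
        for synonym in thesaurus.get(keyword, []):
            replaced = question_tokens[:i] + [synonym] + question_tokens[i + 1:]
            text = "".join(replaced)
            if text not in questions:   # store each rewritten question once
                questions.append(text)
    return {"questions": questions, "answer": answer}


thesaurus = {"北京大学": ["北大"]}
knowledge_point = expand_knowledge_point(
    ["北京大学", "在", "哪里"], "北京市海淀区", thesaurus
)
```

Here the expanded knowledge point holds both "北京大学在哪里" and "北大在哪里" with the same answer, so either phrasing of the question can be matched.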
For the structure and beneficial effects of the data processing device, refer to the description of the data processing method above; they are not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc, and the like.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the invention, and the scope of protection of the invention shall therefore be as defined by the claims.

Claims (16)

1. A synonym discovery method, characterized by comprising:
obtaining a word set to be processed, the word set including a plurality of words;
for any word to be processed in the word set, when one or more target words exist in the word set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and the corresponding target word as a synonym pair;
wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the remaining operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
2. The synonym discovery method according to claim 1, characterized in that the method further comprises:
calculating the semantic similarity between the word to be processed and each of the remaining words in the word set, and selecting as candidate words either the words whose semantic similarity value exceeds a similarity threshold or the top N words with the highest semantic similarity values;
wherein the target word is determined as follows: calculating the minimum edit distance between the word to be processed and each candidate word, and taking as target words the candidate words whose minimum edit distance to the word to be processed is less than the preset threshold.
3. The synonym discovery method according to claim 2, characterized in that calculating the semantic similarity between the word to be processed and each of the remaining words in the word set comprises:
vectorizing each word in the word set;
calculating, based on the vectorization result, the cosine similarity between the word to be processed and each of the remaining words, the cosine similarity serving as the semantic similarity.
4. The synonym discovery method according to claim 3, characterized in that vectorizing each word in the word set comprises:
using the word2vec method to vectorize each word in the word set.
5. The synonym discovery method according to claim 1, characterized in that obtaining the word set in which synonyms are to be discovered comprises:
segmenting an input corpus to obtain the word set.
6. The synonym discovery method according to claim 5, characterized in that the input corpus is segmented using a segmentation dictionary, the segmentation dictionary being obtained as follows:
preprocessing the input corpus to obtain text data;
splitting the text data into lines to obtain sentence data;
segmenting the sentence data according to the individual words contained in a base dictionary to obtain segmented word data;
combining adjacent segmented words to generate candidate data strings;
evaluating the candidate data strings to discover new words;
adding the new words to the segmentation dictionary.
7. The synonym discovery method according to claim 1, characterized in that the remaining operations include an insertion operation and a replacement operation, the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
8. A data processing method, characterized by including the synonym discovery method according to any one of claims 1-7.
9. A synonym discovery device, characterized by comprising:
an acquisition unit, adapted to obtain a word set to be processed, the word set including a plurality of words;
a synonym determination unit, adapted to, for any word to be processed in the word set, when one or more target words exist in the word set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and the corresponding target word as a synonym pair;
wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to the remaining operations, the edit distance corresponding to the deletion operation is less than the preset threshold, and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
10. The synonym discovery device according to claim 9, characterized in that the device further comprises:
a candidate word selection unit, adapted to calculate the semantic similarity between the word to be processed and each of the remaining words in the word set, and to select as candidate words either the words whose semantic similarity value exceeds a similarity threshold or the top N words with the highest semantic similarity values;
a target word determination unit, adapted to calculate the minimum edit distance between the word to be processed and each candidate word, and to take as target words the candidate words whose minimum edit distance to the word to be processed is less than the preset threshold.
11. The synonym discovery device according to claim 10, characterized in that the candidate word selection unit comprises:
a vectorization subunit, adapted to vectorize each word in the word set;
a cosine similarity calculation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each of the remaining words, the cosine similarity serving as the semantic similarity.
12. The synonym discovery device according to claim 11, characterized in that the vectorization subunit uses the word2vec method to vectorize each word in the word set.
13. The synonym discovery device according to claim 9, characterized in that the acquisition unit comprises:
a segmentation subunit, adapted to segment an input corpus to obtain the word set.
14. The synonym discovery device according to claim 13, characterized in that the segmentation subunit segments the input corpus using a segmentation dictionary, the segmentation dictionary is obtained by a segmentation dictionary acquisition unit, and the segmentation dictionary acquisition unit is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; segment the sentence data according to the individual words contained in a base dictionary to obtain segmented word data; combine adjacent segmented words to generate candidate data strings; evaluate the candidate data strings to discover new words; and add the new words to the segmentation dictionary.
15. The synonym discovery device according to claim 9, characterized in that the remaining operations include an insertion operation and a replacement operation, the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
16. A data processing device, characterized by including the synonym discovery device according to any one of claims 9-15.
CN201610429937.XA 2016-06-16 2016-06-16 Synonym finds method and device, data processing method and device Active CN106126494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610429937.XA CN106126494B (en) 2016-06-16 2016-06-16 Synonym finds method and device, data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610429937.XA CN106126494B (en) 2016-06-16 2016-06-16 Synonym finds method and device, data processing method and device

Publications (2)

Publication Number Publication Date
CN106126494A true CN106126494A (en) 2016-11-16
CN106126494B CN106126494B (en) 2018-12-28

Family

ID=57470670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610429937.XA Active CN106126494B (en) 2016-06-16 2016-06-16 Synonym finds method and device, data processing method and device

Country Status (1)

Country Link
CN (1) CN106126494B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561813A (en) * 2009-05-27 2009-10-21 东北大学 Method for analyzing similarity of character string under Web environment
US20110060712A1 (en) * 2009-09-09 2011-03-10 Ichiro Harashima Method and system for design check knowledge construction
CN102750282A (en) * 2011-04-19 2012-10-24 北京百度网讯科技有限公司 Synonym template mining method and device as well as synonym mining method and device
CN102760134A (en) * 2011-04-28 2012-10-31 北京百度网讯科技有限公司 Method and device for mining synonyms
CN104978356A (en) * 2014-04-10 2015-10-14 阿里巴巴集团控股有限公司 Synonym identification method and device
CN105095204A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Method and device for obtaining synonym
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李华 等: "基于动态规划的缩写发现算法", 《武汉大学学报(工学版)》 *
王宝勋 等: "一种基于无监督学习的词变体识别方法", 《中文信息学报》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776543B (en) * 2016-11-23 2019-09-06 上海智臻智能网络科技股份有限公司 New word discovery method, apparatus, terminal and server
CN106776543A (en) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, device, terminal and server
CN110516235A (en) * 2016-11-23 2019-11-29 上海智臻智能网络科技股份有限公司 New word discovery method, apparatus, terminal and server
CN106649783A (en) * 2016-12-28 2017-05-10 上海智臻智能网络科技股份有限公司 Synonym mining method and apparatus
CN106649816A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Synonym filtering method and device
CN106777283A (en) * 2016-12-29 2017-05-31 北京奇虎科技有限公司 The method for digging and device of a kind of synonym
CN106777283B (en) * 2016-12-29 2021-02-26 北京奇虎科技有限公司 Synonym mining method and synonym mining device
CN106649816B (en) * 2016-12-29 2020-06-09 北京奇虎科技有限公司 Synonym filtering method and device
CN106933806A (en) * 2017-03-15 2017-07-07 北京大数医达科技有限公司 The determination method and apparatus of medical synonym
CN107180026A (en) * 2017-05-02 2017-09-19 苏州大学 The event phrase learning method and device of a kind of word-based embedded Semantic mapping
CN107577668A (en) * 2017-09-15 2018-01-12 电子科技大学 Social media non-standard word correcting method based on semanteme
CN107621892A (en) * 2017-10-18 2018-01-23 北京百度网讯科技有限公司 For obtaining the method and device of information
CN108170806A (en) * 2017-12-28 2018-06-15 东软集团股份有限公司 Sensitive word detection filter method, device and computer equipment
CN108170806B (en) * 2017-12-28 2020-11-20 东软集团股份有限公司 Sensitive word detection and filtering method and device and computer equipment
CN108255810A (en) * 2018-01-10 2018-07-06 北京神州泰岳软件股份有限公司 Near synonym method for digging, device and electronic equipment
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
WO2020061910A1 (en) * 2018-09-27 2020-04-02 北京字节跳动网络技术有限公司 Method and apparatus used for generating information
CN109741190A (en) * 2018-12-27 2019-05-10 清华大学 A kind of method, system and the equipment of the classification of personal share bulletin
CN113689923A (en) * 2020-05-19 2021-11-23 北京平安联想智慧医疗信息技术有限公司 Medical data processing apparatus, system and method
CN113689923B (en) * 2020-05-19 2024-06-18 北京平安联想智慧医疗信息技术有限公司 Medical data processing device, system and method
CN113761905A (en) * 2020-07-01 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for constructing domain modeling vocabulary

Also Published As

Publication number Publication date
CN106126494B (en) 2018-12-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant