CN106126494A - Synonym discovery method and device, data processing method and device - Google Patents
Synonym discovery method and device, data processing method and device
- Publication number
- CN106126494A CN201610429937.XA
- Authority
- CN
- China
- Prior art keywords
- word
- synonym
- pending
- phrase set
- threshold value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
A synonym discovery method and device, and a data processing method and device. The synonym discovery method includes: obtaining a word set to be processed, the word set including multiple words; and, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and each corresponding target word to be a synonym pair. The minimum edit distance is computed by an edit distance algorithm that includes a deletion operation; the edit distance of the deletion operation is smaller than the edit distance of the other operations, the edit distance of the deletion operation is less than the preset threshold, and the edit distance of a single one of the other operations is greater than or equal to the preset threshold. This scheme improves the accuracy of abbreviation discovery.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a synonym discovery method and device and a data processing method and device.
Background
Synonymy is an important semantic relation and is widely used in natural language processing tasks such as information retrieval and text classification. Specifically, before a task such as information retrieval or text classification is performed, synonyms need to be acquired and identified. For example, in an information retrieval scenario, multiple words that are synonyms of one another can be grouped into one class; when the input text contains a keyword that has synonyms, the synonyms can replace the original keyword in the search, so that the retrieval system can offer the user more texts to be confirmed.
Written and everyday Chinese often uses shortened forms of proper names. These shortened words are called abbreviations of the proper names; an abbreviation is a part of the original proper name, and abbreviations are also a kind of synonym. For example, "National People's Congress", "People's Republic of China", and "Real Madrid" each have a commonly used Chinese short form that is an abbreviation of the full name.
However, synonym discovery methods in the prior art cannot identify abbreviations well, so the accuracy of semantic understanding is relatively low.
Summary of the invention
The technical problem solved by the present invention is to provide a synonym discovery method and device that improve the accuracy of abbreviation discovery.
To solve the above technical problem, an embodiment of the present invention provides a synonym discovery method, the method including: obtaining a word set to be processed, the word set including multiple words; and, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and each corresponding target word to be a synonym pair; wherein the minimum edit distance is computed by an edit distance algorithm; the edit distance algorithm includes a deletion operation; the edit distance of the deletion operation is smaller than the edit distance of the other operations; the edit distance of the deletion operation is less than the preset threshold; and the edit distance of a single one of the other operations is greater than or equal to the preset threshold.
Optionally, the method further includes: computing the semantic similarity between the word to be processed and each of the other words in the word set, and selecting as candidate words either the words whose semantic similarity exceeds a similarity threshold or the N words with the highest semantic similarity;
the target word is determined as follows: computing the minimum edit distance between the word to be processed and each candidate word, and taking as target words the candidate words whose minimum edit distance to the word to be processed is less than the preset threshold.
Optionally, computing the semantic similarity between the word to be processed and each of the other words in the word set includes:
vectorizing each word in the word set; and, based on the vectorization result, computing the cosine similarity between the word to be processed and each of the other words, the cosine similarity serving as the semantic similarity.
Optionally, vectorizing each word in the word set includes:
vectorizing each word in the word set using the word2vec method.
Optionally, obtaining the word set in which synonyms are to be discovered includes:
segmenting an input corpus to obtain the word set.
Optionally, the input corpus is segmented using a segmentation dictionary, and the segmentation dictionary is obtained as follows:
preprocessing the input corpus to obtain text data; splitting the text data into lines to obtain phrase data; segmenting the phrase data according to the individual words contained in a base dictionary to obtain segmented word data; combining adjacent segmented word data to generate candidate data strings; performing judgment processing on the candidate data strings to discover new words; and adding the new words to the segmentation dictionary.
Optionally, the other operations include an insertion operation and a replacement operation; the edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold.
An embodiment of the present invention further provides a data processing method that includes the above synonym discovery method.
An embodiment of the present invention further provides a synonym discovery device, the device including:
an acquiring unit, adapted to obtain a word set to be processed, the word set including multiple words;
a synonym determining unit, adapted to, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and each corresponding target word to be a synonym pair;
wherein the minimum edit distance is computed by an edit distance algorithm; the edit distance algorithm includes a deletion operation; the edit distance of the deletion operation is smaller than the edit distance of the other operations; the edit distance of the deletion operation is less than the preset threshold; and the edit distance of a single one of the other operations is greater than or equal to the preset threshold.
Optionally, the synonym discovery device further includes:
a candidate word selecting unit, adapted to compute the semantic similarity between the word to be processed and each of the other words in the word set, and to select as candidate words either the words whose semantic similarity exceeds a similarity threshold or the N words with the highest semantic similarity;
a target word determining unit, adapted to compute the minimum edit distance between the word to be processed and each candidate word, and to take as target words the candidate words whose minimum edit distance to the word to be processed is less than the preset threshold.
Optionally, the candidate word selecting unit includes:
a vectorization subunit, adapted to vectorize each word in the word set;
a cosine similarity computing subunit, adapted to compute, based on the vectorization result, the cosine similarity between the word to be processed and each of the other words, the cosine similarity serving as the semantic similarity.
Optionally, the vectorization subunit vectorizes each word in the word set using the word2vec method.
Optionally, the acquiring unit includes:
a segmentation subunit, adapted to segment an input corpus to obtain the word set.
Optionally, the segmentation subunit segments the input corpus using a segmentation dictionary; the segmentation dictionary is obtained by a segmentation dictionary acquiring unit, which is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain phrase data; segment the phrase data according to the individual words contained in a base dictionary to obtain segmented word data; combine adjacent segmented word data to generate candidate data strings; perform judgment processing on the candidate data strings to discover new words; and add the new words to the segmentation dictionary.
Optionally, the other operations include an insertion operation and a replacement operation; the edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold.
An embodiment of the present invention further provides a data processing device that includes the above synonym discovery device.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
An embodiment of the present invention obtains a word set to be processed; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and each corresponding target word are determined to be a synonym pair; wherein the minimum edit distance is computed by an edit distance algorithm; the edit distance algorithm includes a deletion operation; the edit distance of the deletion operation is smaller than the edit distance of the other operations; the edit distance of the deletion operation is less than the preset threshold; and the edit distance of a single one of the other operations is greater than or equal to the preset threshold. On the one hand, this scheme restricts the edit distance of the deletion operation to be smaller than that of the other operations, so that the minimum edit distance is obtained by preferring deletion operations. On the other hand, when the minimum edit distance is computed, the edit distance of the deletion operation is less than the preset threshold while the edit distance of a single one of the other operations is greater than or equal to the preset threshold. Therefore, when the minimum edit distance from the word to be processed to a target word is less than the preset threshold, the target word is obtained from the word to be processed by deletion operations only. This ensures that a synonym obtained by the edit distance algorithm is a part of the literal form of the word to be processed, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
Further, by computing the semantic similarity between the word to be processed and the other words in the word set, multiple candidate words are selected, and the target word can then be determined within the smaller range formed by these candidate words. Because the candidate words are a subset of the word set to be processed, determining the target word from the candidate words improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as a second criterion for synonymy further improves the accuracy of synonym pair discovery, and therefore the accuracy of abbreviation discovery.
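The candidate selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cosine_similarity` and `select_candidates` and the two-dimensional toy vectors are assumptions made for the example, and in practice the vectors would come from a trained word2vec model, which is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(word, vectors, sim_threshold=None, top_n=None):
    """Return the other words in `vectors`, most similar to `word` first.

    If `sim_threshold` is given, keep only words whose similarity exceeds it;
    if `top_n` is given, keep only the first `top_n` words.
    """
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if sim_threshold is not None:
        scored = [pair for pair in scored if pair[1] > sim_threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [other for other, _ in scored]


# Toy vectors invented for illustration only.
toy_vectors = {
    "bank": [1.0, 0.0],
    "banking": [0.9, 0.1],
    "football": [0.0, 1.0],
}
candidates = select_candidates("bank", toy_vectors, sim_threshold=0.5)
```

Only the candidates returned here would then be passed to the edit distance check, which is what keeps the second stage small.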
Brief description of the drawings
Fig. 1 is a flowchart of a synonym discovery method in an embodiment of the present invention;
Fig. 2 is a flowchart of a method for obtaining a segmentation dictionary in an embodiment of the present invention;
Fig. 3 is a flowchart of another synonym discovery method in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another synonym discovery device in an embodiment of the present invention.
Detailed description
Written and everyday Chinese often uses shortened forms of proper names. These shortened words are called abbreviations of the proper names; an abbreviation is a part of the original proper name, and abbreviations are also a kind of synonym. For example, "National People's Congress", "People's Republic of China", and "Real Madrid" each have a commonly used Chinese short form that is an abbreviation of the full name. However, synonym discovery methods in the prior art cannot identify abbreviations well, so the accuracy of semantic understanding is relatively low.
An embodiment of the present invention obtains a word set to be processed; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and each corresponding target word are determined to be a synonym pair; wherein the minimum edit distance is computed by an edit distance algorithm; the edit distance algorithm includes a deletion operation; the edit distance of the deletion operation is smaller than the edit distance of the other operations; the edit distance of the deletion operation is less than the preset threshold; and the edit distance of a single one of the other operations is greater than or equal to the preset threshold. On the one hand, this scheme restricts the edit distance of the deletion operation to be smaller than that of the other operations, so that the minimum edit distance is obtained by preferring deletion operations. On the other hand, when the minimum edit distance is computed, the edit distance of the deletion operation is less than the preset threshold while the edit distance of a single one of the other operations is greater than or equal to the preset threshold. Therefore, when the minimum edit distance from the word to be processed to a target word is less than the preset threshold, the target word is obtained from the word to be processed by deletion operations only. This ensures that a synonym obtained by the edit distance algorithm is a part of the literal form of the word to be processed, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
To make the above objects, features, and beneficial effects of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a synonym discovery method in an embodiment of the present invention. The method is described below with reference to the steps shown in Fig. 1.
Step S101: obtain a word set to be processed, the word set including multiple words.
The word set to be processed is the word set in which synonym pairs are to be discovered.
In a specific implementation, the word set to be processed is obtained by segmenting an input corpus. The input corpus may be non-text data such as speech data, or it may be text data. When the input corpus is non-text data, it must first be converted to text data; that is, all subsequent processing operates on text data. The input corpus may be obtained from conversation records between a question answering system and its users, or from manually curated knowledge point data.
In a specific implementation, the input corpus may come from one specific domain, so the words in the word set to be processed are semantic expressions about that domain and may include words that have the same meaning but different surface forms, that is, synonyms. The specific domain may be banking, education, sports, and so on.
For example, when the input corpus comes from the banking domain, some sentences may refer to a bank by its full name, "China Merchants Bank", while others use its short form; the full name and the short form are a synonym pair. Similarly, "Industrial and Commercial Bank" and its short form are a synonym pair. In both pairs, the short form is an abbreviation of the full name. Of course, the corpus may also contain synonyms that are not abbreviations; for example, "remit money" and "remittance money" are synonyms, but neither is an abbreviation of the other. The present embodiment aims to discover abbreviations through steps S101 to S102.
In a specific implementation, the input corpus is segmented using a segmentation dictionary. For the segmentation result to contain abbreviations, in other words for an abbreviation to be segmented out of a sentence as a word, the segmentation dictionary must contain abbreviations. As a kind of new word, an abbreviation may be absent from a general base dictionary, so the base dictionary needs to be updated through new word discovery, which adds abbreviations, as one kind of new word, to the updated base dictionary. The updated base dictionary is then used as the segmentation dictionary to segment the input corpus.
So that the segmentation dictionary includes abbreviations, it is obtained through the steps shown in Fig. 2.
S11: preprocess the input corpus to obtain text data.
The input corpus may contain many format types. To facilitate subsequent processing, the input corpus is preprocessed to obtain text data.
In a specific implementation, the preprocessing may unify the format of the input corpus into a text format and filter out one or more of profanity, sensitive words, and stop words. When the format of the input corpus is unified into a text format, content that current technology cannot yet convert to text may be filtered out.
S12: split the text data into lines to obtain phrase data.
Line splitting may split the input corpus at punctuation, for example at periods, commas, exclamation marks, and question marks. The phrase data obtained here is a preliminary segmentation of the corpus that determines the scope of the subsequent word segmentation.
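Step S12 can be sketched with a regular expression split. The delimiter set below (ASCII and full-width periods, commas, exclamation marks, and question marks) and the function name `split_into_lines` are assumptions for illustration; a real pipeline would likely need a broader set and special handling for cases such as decimal numbers.

```python
import re

# Assumed delimiter set: ASCII and full-width period, comma,
# exclamation mark, and question mark.
_PUNCTUATION = re.compile(r"[.,!?。，！？]+")


def split_into_lines(text):
    """Split text at punctuation and drop empty or whitespace-only pieces."""
    pieces = _PUNCTUATION.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


lines = split_into_lines("Hello, world! How are you?")
```

Each returned piece plays the role of one line of phrase data, and later steps never combine words across two different pieces.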
S13: segment the phrase data according to the individual words contained in the base dictionary to obtain segmented word data.
The base dictionary is so named to distinguish it from the segmentation dictionary; the base dictionary may not contain abbreviations. The base dictionary contains multiple individual words, and different individual words may have different lengths. In a specific implementation, segmentation based on the base dictionary may use one or more of dictionary-based bidirectional maximum matching, the HMM method, and the CRF method.
The word segmentation is performed on the phrase data of a single line, and every resulting piece of word data is an individual word contained in the base dictionary.
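Of the three segmentation options named above, dictionary-based bidirectional maximum matching is the simplest to sketch. The version below is one common formulation and is an assumption about the details: it falls back to single characters when nothing in the dictionary matches, and it breaks ties by preferring the result with fewer tokens, then fewer single-character tokens, then the backward result. The function names and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy longest match scanning from left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Greedy longest match scanning from right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, dictionary):
    """Run both scans and keep the segmentation that looks better."""
    max_len = max((len(word) for word in dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles_fwd = sum(1 for token in fwd if len(token) == 1)
    singles_bwd = sum(1 for token in bwd if len(token) == 1)
    return fwd if singles_fwd < singles_bwd else bwd


toy_dictionary = {"ab", "abc", "cd", "d"}
segmented = bidirectional_max_match("abcd", toy_dictionary)
```

In the toy example the forward scan yields `abc` plus `d` and the backward scan yields `ab` plus `cd`; both have two tokens, so the backward result wins because it contains no single-character token.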
S14: combine adjacent segmented word data to generate candidate data strings.
When segmentation is performed with the base dictionary, word data that should be a single word in a certain domain may be split into multiple pieces of word data, so new word discovery is needed. Candidate data strings are subsequently screened against set conditions, and those that pass are taken as new words. Generating candidate data strings is the prerequisite for this screening and can be done in several ways.
In a specific implementation, a bigram model may be used to take two adjacent words in the phrase data of a single line as a candidate data string.
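Generating bigram candidates from segmented lines amounts to pairing each word with its right neighbour within the same line. The sketch below also counts how often each pair occurs, which is an added convenience rather than something the text requires; the function name `bigram_candidates` and the sample data are hypothetical.

```python
from collections import Counter


def bigram_candidates(segmented_lines):
    """Count adjacent word pairs; pairs never span two different lines."""
    counts = Counter()
    for words in segmented_lines:
        for left, right in zip(words, words[1:]):
            counts[(left, right)] += 1
    return counts


sample_lines = [
    ["platinum", "card", "fee"],
    ["platinum", "card"],
]
pair_counts = bigram_candidates(sample_lines)
```

Here the pair `("platinum", "card")` is counted twice and `("card", "fee")` once, while `("fee", "platinum")` never appears because those two words sit on different lines.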
Suppose a sentence $S$ can be represented as a sequence $S = w_1 w_2 \cdots w_n$. A language model computes the probability $P(S)$ of the sentence $S$:
$$P(S) = p(w_1, w_2, \ldots, w_n) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1}) \tag{1}$$
In formula (1) the probabilities are estimated with a full n-gram model, whose computational cost is too high for practical use. The Markov assumption states that the occurrence of the next word depends only on the one or few words before it. If the next word is assumed to depend on the one word before it, then:
$$P(S) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1}) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1}) \tag{2}$$
If the next word is assumed to depend on the two words before it, then:
$$P(S) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1}) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_{n-2}, w_{n-1}) \tag{3}$$
Formula (2) is the bigram probability formula, and formula (3) is the trigram probability formula. A larger n places more constraints on the occurrence of the next word and gives greater discriminative power; a smaller n means that each candidate data string occurs more often during new word discovery, which provides more reliable statistics and therefore higher reliability.
In theory, a larger n gives a more discriminative model, and trigrams are the most widely used in existing processing methods; however, bigrams require less computation and give higher system efficiency.
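The bigram factorization of a sentence probability can be illustrated with unsmoothed relative-frequency estimates. This is a sketch under stated assumptions: the text does not say how the probabilities are estimated, so the counting scheme, the lack of smoothing, and the function names `train_bigram` and `sentence_probability` are all choices made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect unigram counts, bigram counts, and the total token count."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)
    return unigrams, bigrams, total


def sentence_probability(words, unigrams, bigrams, total):
    """P(S) = p(w1) * p(w2 | w1) * ... * p(wn | wn-1), estimated from counts."""
    if not words:
        return 1.0
    if total == 0:
        return 0.0
    prob = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        # p(cur | prev) is estimated as count(prev, cur) / count(prev).
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


unigrams, bigrams, total = train_bigram([["a", "b"], ["a", "c"]])
p_ab = sentence_probability(["a", "b"], unigrams, bigrams, total)
```

With this toy corpus, `p("a")` is 2/4 and `p("b" | "a")` is 1/2, so the sentence `a b` gets probability 0.25. Any unseen pair yields zero, which is why practical systems add smoothing.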
S15: judge whether the candidate data string is a specific candidate data string. A specific candidate data string includes a base noun, and the word at a specific relative position to the base noun is a noun or an adjective.
The inventors found that if the word at a specific relative position to a base noun is a noun or an adjective, the combination very likely needs to be treated as a new word. For example, for the base noun "card", when the word to its left is a noun, combinations such as "Long Card", "elite school card", "platinum card", and "business card" can be formed. Therefore, to judge whether a candidate data string is a specific candidate data string, one can check whether it contains a base noun and whether the word at the specific relative position to that base noun is a noun or an adjective.
The specific relative position of a base noun can be set according to the particular base noun and corpus. For example, when the corpus contains many occurrences of "card" and the names of the various cards should all be treated as new words, the requirement can be that the word to the left of the base noun is a noun or an adjective.
In a specific implementation, the specific relative position may be the left side, the right side, or both, and can be configured as needed.
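The check for a specific candidate data string can be sketched as below. The text does not name a part-of-speech tagger, so the example assumes a hypothetical lookup table `pos_of` that maps a word to a tag, with `"n"` for nouns and `"a"` for adjectives; the function name and the `side` parameter are likewise illustrative.

```python
def is_specific_candidate(left, right, pos_of, base_nouns, side="left"):
    """Check whether the two-word candidate (left, right) is a specific candidate.

    side="left":  the base noun is the right word and its left neighbour must be
                  a noun or an adjective.
    side="right": the base noun is the left word and its right neighbour must be
                  a noun or an adjective.
    side="both":  either arrangement qualifies.
    """
    allowed_tags = {"n", "a"}  # assumed tag names for noun and adjective
    if side in ("left", "both"):
        if right in base_nouns and pos_of.get(left) in allowed_tags:
            return True
    if side in ("right", "both"):
        if left in base_nouns and pos_of.get(right) in allowed_tags:
            return True
    return False


# Hypothetical part-of-speech table and base noun set.
toy_pos = {"platinum": "n", "card": "n", "apply": "v"}
toy_base_nouns = {"card"}
```

For instance, `is_specific_candidate("platinum", "card", toy_pos, toy_base_nouns)` holds because a noun sits to the left of the base noun, while the pair `("apply", "card")` does not qualify because the left word is a verb.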
In a specific implementation, the base noun may be determined with reference to frequency: because a base noun occurs repeatedly in the corpus, frequency can be used to identify it. It is understood that base nouns may also be selected and set by manual reading.
S16: perform judgment processing on the candidate data string to discover new words. The judgment processing includes:
when the candidate data string is not a specific candidate data string, computing, for each word in the candidate data string, the information entropy between that word and the word on its inner side, and removing candidate data strings whose information entropy is outside a preset range;
when the candidate data string is a specific candidate data string, computing only the information entropy between the word other than the base noun and the word on its inner side, and removing candidate data strings whose information entropy is outside the preset range.
Because a candidate data string contains two pieces of word data, judging a candidate data string requires evaluating the inner information entropy of each of the two. Information entropy measures the uncertainty of a random variable and is computed as:
$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$
The larger the entropy, the greater the uncertainty of the variable, that is, the more evenly the probability is spread over the possible values. If one value of the variable occurs with probability 1, the entropy is 0, which means the variable takes only that value and the outcome is certain.
The left information entropy and right information entropy of a word $W$ are computed as follows:
$$H_1(W) = -\sum_{x \in X,\ \#(xW) > 0} P(x \mid W) \log P(x \mid W)$$
where $X$ is the set of all word data appearing to the left of $W$, and $H_1(W)$ is the left information entropy of the word data $W$;
$$H_2(W) = -\sum_{y \in Y,\ \#(Wy) > 0} P(y \mid W) \log P(y \mid W)$$
where $Y$ is the set of all word data appearing to the right of $W$, and $H_2(W)$ is the right information entropy of the word data $W$.
The inner information entropy is obtained by fixing each individual piece of word data in the candidate data string in turn and computing the entropy of the other word occurring given that this word data occurs. If the candidate data string is $(W_1 W_2)$, the right information entropy of the word data $W_1$ and the left information entropy of the word data $W_2$ are computed.
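Left and right information entropy can be estimated directly from neighbour counts in the segmented lines. The sketch below uses the standard entropy definition with a leading minus sign and the natural logarithm; the base of the logarithm is not specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def side_entropies(word, segmented_lines):
    """Return (left entropy, right entropy) of `word` over its neighbours."""
    left, right = Counter(), Counter()
    for words in segmented_lines:
        for i, current in enumerate(words):
            if current != word:
                continue
            if i > 0:
                left[words[i - 1]] += 1
            if i + 1 < len(words):
                right[words[i + 1]] += 1
    return entropy(left), entropy(right)


toy_lines = [["gold", "card", "fee"], ["long", "card", "fee"]]
left_h, right_h = side_entropies("card", toy_lines)
```

In the toy data, "card" has two different left neighbours seen once each, so its left entropy is log 2, and a single right neighbour, so its right entropy is 0.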
The entropy between a piece of word data in a candidate data string and the word data on its inner side reflects how disordered the word data on that inner side is. For example, computing the right information entropy of the left word data W1 and the left information entropy of the right word data W2 in the candidate data string W1W2 shows how disordered the inner sides of W1 and W2 are. Screening against a preset range then excludes candidate data strings in which the probability feature value of a word forming a new word with its inner-side word falls outside the preset range.
In a specific candidate data string, the inner information entropy of the base noun may fall outside the preset range, causing a specific candidate data string that should be a new word to be excluded. For example, when the specific candidate data strings are strings containing the base noun "card", such as "platinum card", "elite school card", and "Long Card", the right information entropy of the words "platinum", "elite school", and "Long" is within the preset range, but because the words to the left of "card" are varied, its left information entropy may fall outside the preset range, which could cause candidate data strings such as "platinum card", "elite school card", and "Long Card" to be wrongly excluded.
Therefore, when the candidate data string is a specific candidate data string, only the information entropy between the word other than the base noun and the word on its inner side is computed, and candidate data strings whose information entropy is outside the preset range are removed. The inner information entropy of the base noun is no longer computed, which avoids the false exclusion caused when the inner information entropy of the base noun falls outside the preset range.
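The two branches of the judgment process can be sketched as a single filter. The assumptions here are that the preset range is a closed interval `[low, high]`, that a specific candidate is signalled by passing its base noun, and that entropy uses the natural logarithm; none of these details is fixed by the text, and all names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def inner_entropies(pair, segmented_lines):
    """Right entropy of the left word and left entropy of the right word."""
    w1, w2 = pair
    right_of_w1, left_of_w2 = Counter(), Counter()
    for words in segmented_lines:
        for a, b in zip(words, words[1:]):
            if a == w1:
                right_of_w1[b] += 1
            if b == w2:
                left_of_w2[a] += 1
    return entropy(right_of_w1), entropy(left_of_w2)


def keep_candidate(pair, segmented_lines, low, high, base_noun=None):
    """Keep the pair only if every entropy that must be checked lies in [low, high].

    For a specific candidate, pass its base noun: the inner entropy of the
    base noun itself is then skipped.
    """
    h_right_w1, h_left_w2 = inner_entropies(pair, segmented_lines)
    w1, w2 = pair
    to_check = []
    if w1 != base_noun:
        to_check.append(h_right_w1)
    if w2 != base_noun:
        to_check.append(h_left_w2)
    return all(low <= h <= high for h in to_check)


toy_lines = [
    ["platinum", "card"],
    ["long", "card"],
    ["school", "card"],
    ["platinum", "card"],
]
pair = ("platinum", "card")
```

With the range `[0.0, 0.5]`, the pair is dropped when treated as an ordinary candidate, because the left entropy of "card" is about 1.04, but it is kept when "card" is passed as the base noun, because only the right entropy of "platinum", which is 0, is checked.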
S17: described neologisms are added described dictionary for word segmentation.
Also be a kind of neologisms due to initialism, then the new set of words obtained from input language material also includes initialism, from
And neologisms are added dictionary for word segmentation and is also achieved that in dictionary for word segmentation and comprises initialism, and then can be with dictionary for word segmentation to described defeated
Enter language material and carry out the described phrase set that participle obtains in the present embodiment.
The steps after obtaining the pending phrase set are described below.
Step S102: for any pending word in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than a preset threshold, the pending word and the corresponding target word are determined to be a synonym pair.
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a deletion operation; the edit distance corresponding to the deletion operation is smaller than that corresponding to the remaining operations; the edit distance corresponding to the deletion operation is less than the preset threshold; and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
In specific implementations, the remaining operations may include a replacement operation and an insertion operation. The edit distance referred to in this embodiment is the edit cost required to convert one word into another word through edit operations, that is, the number of operations multiplied by the cost of each operation (summed over the operation types when their costs differ). The minimum edit distance is the edit distance with the lowest edit cost. Each operation acts on only one character.
Edit distance and minimum edit distance are described below, taking the conversion of "工商银行" (Industrial and Commercial Bank) to its abbreviation "工行" as an example. "工商银行" can be converted to "工行" through different combinations of edit operations. Assume that the edit distance of a single replacement operation is 1000, that of a single deletion operation is 1, and that of a single insertion operation is 1000.
The first way of converting "工商银行" to "工行" is: three deletion operations delete "工", "商" and "银" from "工商银行", and an insertion operation then inserts "工" to obtain "工行"; the edit distance from "工商银行" to "工行" is then 1003;
The second way of converting "工商银行" to "工行" is: two deletion operations delete "工" and "商" from "工商银行", and a replacement operation then replaces "银" with "工" to obtain "工行"; the edit distance from "工商银行" to "工行" is then 1002;
The third way of converting "工商银行" to "工行" is: two deletion operations delete "商" and "银" from "工商银行" to obtain "工行"; the edit distance from "工商银行" to "工行" is then 2.
It should be noted that the ways of converting the pending word "工商银行" to "工行" are not limited to the operation combinations listed above, and different conversion ways may correspond to different edit distances. Among all conversion ways, however, the minimum edit distance is unique. It can be seen that the minimum edit distance from "工商银行" to "工行" is 2, obtained by the third conversion way above.
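The minimum over all conversion ways can be computed with a standard dynamic-programming edit distance in which each operation carries its own cost. The following is a minimal sketch under that assumption, not the implementation of the embodiment; the function name `weighted_edit_distance` and its parameter names are illustrative, and the default costs are the example values assumed above (deletion 1, insertion 1000, replacement 1000).

```python
def weighted_edit_distance(source, target,
                           delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Minimum total cost of converting `source` into `target`.

    Each operation acts on a single character. The cost values are the
    example values from the text; the names are illustrative assumptions.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = minimum cost of converting source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost      # delete every remaining character
    for j in range(1, cols):
        dist[0][j] = j * insert_cost      # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,                       # deletion
                dist[i][j - 1] + insert_cost,                       # insertion
                dist[i - 1][j - 1] + (0 if same else replace_cost)  # keep or replace
            )
    return dist[-1][-1]


print(weighted_edit_distance("工商银行", "工行"))  # 2
```

Because matching characters cost nothing and every deletion costs 1, the third conversion way (two deletions) is the cheapest path through the table.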
Therefore, for any pending word in the phrase set, its minimum edit distance to another word is determinate. The minimum edit distance between any pending word and each other word in the phrase set is calculated; when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than the preset threshold, the pending word and the corresponding target word are determined to be a synonym pair. For example, the phrase set is L (A, B, C, D, E, F, G and H). For pending word A, assume that the target word comes from subset M (B, C, D, E, F, G and H). When a word B exists in (B, C, D, E, F, G and H) such that the minimum edit distance from pending word A to word B is less than the preset threshold, A and B are a synonym pair.
In this embodiment, to ensure that the target word of a discovered synonym pair is an abbreviation of the pending word, that is, that the abbreviation is a part of the pending word, the edit distance method limits the edit distance of a single remaining operation to be greater than or equal to the preset threshold and limits the edit distance of the deletion operation to be smaller than that of the remaining operations. Moreover, not only is the edit distance of a single deletion operation less than the preset threshold, but the edit distance of multiple deletion operations is also less than the preset threshold. The number of deletion operations can be determined from the maximum number of characters deleted between a full-name word and its abbreviation; for example, if at most 5 characters are deleted, the number is 5.
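The constraints above can be checked mechanically before any distance is computed. The sketch below is illustrative only; the function `costs_are_valid` and its parameter names are assumptions introduced here, not names from the embodiment.

```python
def costs_are_valid(delete_cost, insert_cost, replace_cost,
                    threshold, max_deletions):
    """Check the cost constraints described in the text.

    `max_deletions` is the assumed maximum number of characters deleted
    between a full-name word and its abbreviation.
    """
    return (
        delete_cost < insert_cost               # deletion is cheaper than insertion
        and delete_cost < replace_cost          # deletion is cheaper than replacement
        and insert_cost >= threshold            # one insertion already reaches the threshold
        and replace_cost >= threshold           # one replacement already reaches the threshold
        and max_deletions * delete_cost < threshold  # repeated deletions stay below it
    )


print(costs_are_valid(1, 1000, 1000, threshold=10, max_deletions=5))   # True
print(costs_are_valid(1, 1000, 1000, threshold=10, max_deletions=10))  # False
```

With these constraints, a minimum edit distance below the threshold can only arise from a conversion that uses deletions alone.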
In specific implementations, the method above may find one abbreviation or several. It should be noted that the words that the method of this embodiment finds to form synonym pairs with the same pending word are not necessarily abbreviations of one another. For example, in phrase set L, the method finds that word B is one abbreviation of pending word A and that word C is another abbreviation of pending word A; that is, the minimum edit distance from A to B and that from A to C are both less than the preset threshold. Words B and C are not necessarily in an abbreviation relationship, that is, it cannot be guaranteed that B is an abbreviation of C or that C is an abbreviation of B, but B and C are synonyms.
It should also be noted that the multiple pending words for which the method of this embodiment obtains the same abbreviation are not necessarily synonyms of one another. For example, in phrase set L, the method finds that the abbreviation of pending word A is B and that the abbreviation of pending word D is also B, but words A and D are not necessarily synonyms.
In this embodiment, because the edit distance of the deletion operation is limited to be smaller than that of the remaining operations, the deletion operation is preferred when the edit distance method calculates the minimum edit distance for converting the pending word into another word. In addition, the edit distance of the deletion operation is less than the preset threshold, while the edit distance of a single remaining operation is greater than or equal to the preset threshold. Consequently, when the minimum edit distance from the pending word to a target word is less than the preset threshold, the target word can only have been obtained from the pending word by deletion operations. This ensures that the synonym obtained by the edit distance method is a part of the literal expression of the pending word, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
Fig. 3 is a flow chart of a synonym discovery method in an embodiment of the present invention. It is described below with reference to the steps shown in Fig. 3.
Step S301: obtain a pending phrase set, the phrase set including multiple words.
For the implementation of this step, refer to step S101 shown in Fig. 1; details are not repeated here.
Step S302: for any pending word in the phrase set, calculate the semantic similarity between the pending word and each remaining word in the phrase set, and select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values.
In one specific implementation, the semantic similarity value between each remaining word and the pending word can be compared with the similarity threshold, and the words whose semantic similarity value is greater than the similarity threshold are taken as candidate words. It should be noted that the similarity threshold can be preset to different values and is not limited in any way; in this case the number of candidate words changes as the similarity threshold changes.
In another specific implementation, the candidate words with higher semantic similarity values can be obtained by limiting the number N of candidate words. Specifically, the semantic similarity values are sorted from high to low, and the top N words with the highest semantic similarity values are taken as candidate words.
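The two selection strategies can be sketched as follows. This is an illustration under stated assumptions: the function name `select_candidates`, its parameters, and the similarity values in the usage lines are hypothetical and are not taken from the embodiment.

```python
def select_candidates(similarities, threshold=None, top_n=None):
    """Select candidate words from a {word: semantic similarity} mapping.

    Exactly one strategy is applied: a similarity threshold if given,
    otherwise the top N words by similarity.
    """
    if threshold is not None:
        # Strategy 1: keep words whose similarity exceeds the threshold.
        return [word for word, score in similarities.items() if score > threshold]
    if top_n is not None:
        # Strategy 2: sort from high to low and keep the first N words.
        ranked = sorted(similarities, key=similarities.get, reverse=True)
        return ranked[:top_n]
    raise ValueError("either threshold or top_n must be given")


# Hypothetical similarity values of words B, C and D to a pending word A.
scores = {"B": 0.2, "C": 0.8, "D": 0.9}
print(select_candidates(scores, threshold=0.5))  # ['C', 'D']
print(select_candidates(scores, top_n=2))        # ['D', 'C']
```

With a threshold the number of candidates varies with the data, whereas with top N it is fixed in advance.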
This step selects candidate words from the phrase set so that the target word can subsequently be determined from the candidate words. On the one hand, this narrows the range from which the target word forming a synonym pair with the pending word is determined, which reduces computational complexity and improves the efficiency of abbreviation discovery. On the other hand, using semantic similarity as a further criterion for deciding whether words are synonyms improves the accuracy of synonym-pair discovery, that is, the accuracy of abbreviation discovery.
In specific implementations, the semantic similarity between the pending word and each remaining word in the phrase set can be calculated through the following steps:
First, vectorize each word in the phrase set;
Second, based on the vectorization result, calculate the cosine similarity between the pending word and each remaining word, the cosine similarity serving as the semantic similarity. It can be understood that, after the cosine similarity is calculated, either the words whose cosine similarity value is greater than the similarity threshold or the top N words with the highest cosine similarity values can be selected as candidate words.
In specific implementations, the word2vec method can be used to vectorize each word in the phrase set. It should be pointed out that other existing methods can also be used to vectorize each word in the phrase set.
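Once each word has a vector, the cosine similarity step is a short computation. The sketch below assumes the vectors are already available as plain lists of floats; how they are produced (word2vec or another method) is outside the sketch, and the function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Vectors pointing in the same direction score 1 regardless of length, so the measure depends only on direction.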
Step S303: when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than the preset threshold, the pending word and the corresponding target word are determined to be a synonym pair.
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a deletion operation; the edit distance corresponding to the deletion operation is smaller than that corresponding to the remaining operations; and the edit distance corresponding to the deletion operation is less than the preset threshold.
The target word is determined as follows: the minimum edit distance between the pending word and each candidate word is calculated, and a candidate word whose minimum edit distance to the pending word is less than the preset threshold is taken as the target word.
In this embodiment, the remaining operations include an insertion operation and a replacement operation. The edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
The implementation of steps S301 to S303 is illustrated below with an example. Each step is illustrated with one specific implementation, which should not be construed as limiting the present invention.
Implementing step S301, the pending phrase set obtained is Q (A, B, C and D), where A, B, C and D can each be a pending word. Assume that A is "招商银行" (China Merchants Bank), B is "工行", C is "招行", and D is "工商银行" (Industrial and Commercial Bank).
The following steps take pending word A, "招商银行", as the example.
Implementing step S302, the word2vec method is used to vectorize each word (A, B, C and D) in phrase set Q. Based on the vectorization result, the cosine similarity between pending word A and each remaining word B, C and D is calculated. The cosine similarity values, from high to low, are those of D, C and B. The top 2 words with the highest cosine similarity values are selected as candidate words, that is, word D "工商银行" and word C "招行".
Implementing step S303, for pending word A "招商银行", the edit distance method is used to calculate the minimum edit distance between pending word A "招商银行" and candidate word D "工商银行", and the minimum edit distance between pending word A "招商银行" and candidate word C "招行".
In the edit distance method of this example, the edit distance corresponding to the deletion operation is smaller than that corresponding to the insertion and replacement operations; the edit distance corresponding to the deletion operation is less than the preset threshold; the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold; and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold. Assume that the edit distance of a single deletion operation is 1, that of a single insertion operation is 1000, that of a single replacement operation is 1000, and the preset threshold is 10. Then:
Among all combinations of edit operations that convert the pending word "招商银行" to candidate word D "工商银行", the edit distance obtained by one replacement operation is the smallest, namely replacing "招" with "工", so the minimum edit distance is 1000;
Among all combinations of edit operations that convert the pending word "招商银行" to candidate word C "招行", the edit distance obtained by two deletion operations is the smallest, namely deleting "商" and "银" respectively, so the minimum edit distance is 2;
Of the minimum edit distances calculated above, the one less than the preset threshold 10 is 2, so the corresponding target word is candidate word C "招行". Pending word A "招商银行" and candidate word C "招行" are determined to be a synonym pair, and "招行" is the abbreviation of "招商银行".
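The target-word determination of step S303 can be reproduced with a short script. This is a sketch using the example costs and threshold assumed in the text (deletion 1, insertion 1000, replacement 1000, threshold 10); the function names `min_edit_distance` and `find_targets` are illustrative, and the candidate list is taken as given rather than computed from word vectors.

```python
def min_edit_distance(source, target,
                      delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Weighted minimum edit distance, computed one table row at a time."""
    # prev[j] = cost of converting the processed prefix of source into target[:j]
    prev = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        curr = [i * delete_cost]
        for j, t_char in enumerate(target, start=1):
            curr.append(min(
                prev[j] + delete_cost,                                  # deletion
                curr[j - 1] + insert_cost,                              # insertion
                prev[j - 1] + (0 if s_char == t_char else replace_cost)  # keep or replace
            ))
        prev = curr
    return prev[-1]


def find_targets(pending, candidates, threshold=10):
    """Candidates whose minimum edit distance from `pending` is below the threshold."""
    return [c for c in candidates if min_edit_distance(pending, c) < threshold]


pending = "招商银行"
candidates = ["工商银行", "招行"]  # candidate words D and C from step S302
for word in candidates:
    print(word, min_edit_distance(pending, word))  # 工商银行 1000, then 招行 2
print(find_targets(pending, candidates))  # ['招行']
```

Only "招行" passes the threshold, because reaching "工商银行" requires a replacement whose cost alone exceeds it.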
As another example, assume that the pending phrase set is P ("招商银行", "工行" and "工商银行"). For the pending word "招商银行", the semantic similarity between "招商银行" and "工行" and that between "招商银行" and "工商银行" are calculated, and both semantic similarity values are greater than the similarity threshold. The minimum edit distance is then calculated. Because the edit distance of the deletion operation is smaller than that of the replacement operation, the deletion operation is preferred at each step:
Converting "招商银行" to "工行" requires at minimum two deletion operations and one replacement operation. Specifically, the minimum-cost conversion deletes "招" and "商" and replaces "银" with "工". The single-step edit distance of the deletion operation is 1 and that of the replacement operation is 1000, so the minimum edit distance from "招商银行" to "工行" is 1002;
Similarly, converting "招商银行" to "工商银行" requires at minimum one replacement operation, namely replacing "招" with "工". The single-step edit distance of the replacement operation is 1000, so the minimum edit distance from "招商银行" to "工商银行" is 1000.
It can be seen that the minimum edit distance from "招商银行" to "工行" and that from "招商银行" to "工商银行" are both greater than the preset threshold 10, so neither candidate word "工行" nor candidate word "工商银行" is a target word. In other words, the pending phrase set contains no word that forms a synonym pair with the pending word "招商银行".
In this embodiment, because the edit distance of the deletion operation is limited to be smaller than that of the remaining operations, the deletion operation is preferred when the edit distance method calculates the minimum edit distance for converting the pending word into other words. On this basis, the edit distance of the deletion operation is less than the preset threshold, while the edit distance of a single remaining operation is greater than or equal to the preset threshold. Consequently, when the minimum edit distance from the pending word to a target word is less than the preset threshold, the target word can only have been obtained from the pending word by deletion operations. This ensures that the synonym obtained by the edit distance method is a part of the literal expression of the pending word, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
Further, this embodiment selects multiple candidate words by calculating the semantic similarity between the pending word and the remaining words in the phrase set, so that the target word can be determined within the smaller range formed by the candidate words. Because the candidate words are a subset of the pending phrase set, determining the target word from the candidate words improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as a further criterion for deciding whether words are synonyms improves the accuracy of synonym-pair discovery.
An embodiment of the present invention further provides a data processing method based on the synonym discovery method above. The data processing method judges synonyms by means of a synonym dictionary, and the synonym dictionary includes abbreviations obtained by the synonym discovery method above. The data processing method is described below.
The data processing method includes: obtaining a knowledge point, the knowledge point including a question and a corresponding answer; for any keyword obtained after segmenting the question, judging according to the synonym dictionary whether the keyword has a synonym; when the keyword has a synonym, replacing the corresponding keyword with the synonym found; and storing the question obtained after replacement and adding it to the knowledge point.
For example, the synonym discovery method above finds that "招行" is the abbreviation of "招商银行" (China Merchants Bank), and the two form a synonym pair in the synonym dictionary. The data processing method is implemented as follows:
A knowledge point is obtained in which the question is "招商银行信用卡如何开通" ("How do I activate a China Merchants Bank credit card?") and the corresponding answer is S;
For "招商银行", one of the keywords obtained by segmenting the question "招商银行信用卡如何开通", the synonym dictionary is used to judge whether the keyword has a synonym. Because "招商银行" has a synonym, namely its abbreviation "招行", "招行" replaces the keyword "招商银行" in the question "招商银行信用卡如何开通". The question obtained after replacement, "招行信用卡如何开通", is stored and added to the knowledge point. The original knowledge point is thereby expanded to contain the questions "招商银行信用卡如何开通" and "招行信用卡如何开通" with the corresponding answer S. The synonym "招行" is obtained by the synonym discovery method above, which is not described again.
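The expansion of a knowledge point can be sketched as follows. This is an illustration under stated assumptions: the question is taken as already segmented into keywords, the segmentation shown is hypothetical, and the function name `expand_questions` and the dictionary layout are introduced here and are not part of the embodiment.

```python
def expand_questions(tokens, synonym_dict):
    """Return the original question plus variants with one keyword replaced.

    `tokens` is the segmented question; `synonym_dict` maps a keyword to a
    list of its synonyms (for example, its abbreviations).
    """
    questions = ["".join(tokens)]
    for i, token in enumerate(tokens):
        for synonym in synonym_dict.get(token, []):
            variant = "".join(tokens[:i] + [synonym] + tokens[i + 1:])
            if variant not in questions:
                questions.append(variant)
    return questions


synonym_dict = {"招商银行": ["招行"]}            # synonym pair found earlier
tokens = ["招商银行", "信用卡", "如何", "开通"]  # assumed segmentation result
knowledge_point = {
    "questions": expand_questions(tokens, synonym_dict),
    "answer": "S",
}
print(knowledge_point["questions"])
# ['招商银行信用卡如何开通', '招行信用卡如何开通']
```

Both questions remain attached to the same answer S, so a query phrased with the abbreviation can be matched to the same knowledge point.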
It can thus be seen that the synonym discovery method above can be used to expand the questions in a knowledge point and thereby expand the knowledge base, so that the corresponding answer can still be returned when a question is phrased differently using an abbreviation. This improves the semantic understanding capability of an intelligent question answering system and the accuracy of the answers it returns. It should be noted that the synonym discovery method above can be applied not only to expanding a knowledge base but also to information search. When applied to information search, a search can return not only information related to the keyword but also information related to the keyword's abbreviation or full-name word.
Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device may include an acquiring unit 401 and a synonym determining unit 402;
The acquiring unit 401 is adapted to obtain a pending phrase set, the phrase set including multiple words;
The synonym determining unit 402 is adapted to, for any pending word in the phrase set, determine the pending word and the corresponding target word to be a synonym pair when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than a preset threshold;
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a deletion operation; the edit distance corresponding to the deletion operation is smaller than that corresponding to the remaining operations; the edit distance corresponding to the deletion operation is less than the preset threshold; and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
In specific implementations, the remaining operations include an insertion operation and a replacement operation. The edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
In specific implementations, the acquiring unit 401 includes a segmentation subunit adapted to segment an input corpus to obtain the phrase set. In specific implementations, the segmentation subunit segments the input corpus using a word segmentation dictionary, the word segmentation dictionary is obtained by a word segmentation dictionary acquiring unit, and the word segmentation dictionary acquiring unit is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; segment the sentence data according to the independent words contained in a basic dictionary to obtain segmented term data; combine adjacent segmented term data to generate candidate data strings; perform judgment processing on the candidate data strings to discover new words; and add the new words to the word segmentation dictionary.
For the structure and beneficial effects of the synonym discovery device described in this embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 1; details are not repeated.
Fig. 5 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device shown in Fig. 5 may include an acquiring unit 501, a candidate word selecting unit 502, a target word determining unit 503 and a synonym determining unit 504.
The acquiring unit 501 is adapted to obtain a pending phrase set, the phrase set including multiple words.
The synonym determining unit 504 is adapted to, for any pending word in the phrase set, determine the pending word and the corresponding target word to be a synonym pair when one or more target words exist in the phrase set such that the minimum edit distance from the pending word to the target word is less than a preset threshold. The minimum edit distance is calculated by an edit distance method. In the edit distance method, the edit distance corresponding to the deletion operation is smaller than that corresponding to the remaining operations; the edit distance corresponding to the deletion operation is less than the preset threshold; and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
In specific implementations, the remaining operations include an insertion operation and a replacement operation. The edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
In specific implementations, the acquiring unit 501 includes a segmentation subunit 5011 adapted to segment an input corpus to obtain the phrase set.
In specific implementations, the segmentation subunit 5011 segments the input corpus using a word segmentation dictionary, the word segmentation dictionary is obtained by a word segmentation dictionary acquiring unit, and the word segmentation dictionary acquiring unit is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; segment the sentence data according to the independent words contained in a basic dictionary to obtain segmented term data; combine adjacent segmented term data to generate candidate data strings; perform judgment processing on the candidate data strings to discover new words; and add the new words to the word segmentation dictionary.
In specific implementations, the synonym discovery device may further include:
a candidate word selecting unit 502, adapted to calculate the semantic similarity between the pending word and each remaining word in the phrase set, and to select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
a target word determining unit 503, adapted to calculate the minimum edit distance between the pending word and each candidate word, and to take a candidate word whose minimum edit distance to the pending word is less than the preset threshold as the target word.
In specific implementations, the candidate word selecting unit 502 may include:
a vectorization subunit 5021, adapted to vectorize each word in the phrase set;
a cosine similarity calculating subunit 5022, adapted to calculate, based on the vectorization result, the cosine similarity between the pending word and each remaining word, the cosine similarity serving as the semantic similarity.
In specific implementations, the word2vec method can be used to vectorize each word in the phrase set.
For the structure and beneficial effects of the synonym discovery device described in this embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 3; details are not repeated.
An embodiment of the present invention further provides a data processing device. The data processing device uses the synonym discovery device shown in Fig. 4 or Fig. 5 and may include:
a knowledge point acquiring unit, adapted to obtain a knowledge point, the knowledge point including a question and a corresponding answer;
a synonym lookup unit, adapted to judge according to a synonym dictionary, for any keyword obtained after segmenting the question, whether the keyword has a synonym;
a replacement unit, adapted to replace the corresponding keyword with the synonym found when the keyword has a synonym;
a knowledge point expansion unit, adapted to store the question obtained after replacement and to add the question obtained after replacement to the knowledge point.
For the structure and beneficial effects of the data processing device, refer to the description of the data processing method above; details are not repeated.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the embodiments above can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention, so the protection scope of the present invention shall be subject to the scope defined by the claims.
Claims (16)
1. A synonym discovery method, characterized by comprising:
Obtaining a pending phrase set, the phrase set including a plurality of words;
For any pending word in the phrase set, when one or more target words exist in the phrase set such that the smallest edit distance from the pending word to the target word is less than a predetermined threshold, determining the pending word and the corresponding target word to be a synonym pair;
Wherein the smallest edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the deletion operation is less than the predetermined threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the predetermined threshold.
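The cost constraints recited in claim 1 can be illustrated with a small dynamic-programming sketch. The concrete numbers are assumptions made only for this example (deletion cost 0.1, insertion and replacement cost 1.0, threshold 1.0); the claim itself only requires that a deletion costs less than the threshold and that any other single operation costs at least the threshold. The function names are hypothetical.

```python
def weighted_edit_distance(source, target, delete_cost=0.1,
                           insert_cost=1.0, replace_cost=1.0):
    """Minimum cost of turning `source` into `target`, character by character."""
    m, n = len(source), len(target)
    # dist[i][j] is the cheapest way to turn source[:i] into target[:j].
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + delete_cost
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,      # drop a source character
                dist[i][j - 1] + insert_cost,      # add a target character
                dist[i - 1][j - 1] + (0.0 if same else replace_cost),
            )
    return dist[m][n]


def find_synonym_pairs(words, threshold=1.0, **costs):
    """Ordered (pending word, target word) pairs whose distance is below the threshold."""
    pairs = []
    for pending in words:
        for target in words:
            if pending == target:
                continue
            if weighted_edit_distance(pending, target, **costs) < threshold:
                pairs.append((pending, target))
    return pairs


pairs = find_synonym_pairs(["telephone", "phone", "phony"])
```

Under these assumed costs, "telephone" reaches "phone" by four deletions and stays below the threshold, while "phone" to "phony" needs one replacement and is rejected. The distance is directional, and the ratio of threshold to deletion cost roughly bounds how many characters an abbreviation may drop.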
2. The synonym discovery method according to claim 1, characterized in that the method further comprises:
Calculating the semantic similarity between the pending word and each of the remaining words in the phrase set, and selecting therefrom, as candidate words, the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
The target word is determined as follows: calculating the smallest edit distance between the pending word and each candidate word, and taking the candidate words whose smallest edit distance to the pending word is less than the predetermined threshold as target words.
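The two-stage selection of claim 2 (semantic filtering first, edit-distance filtering second) might look like the sketch below. The similarity and distance functions are passed in as parameters so that the block stays self-contained; the function names, the lookup tables, and all numeric values are hypothetical.

```python
def select_candidates(pending, words, similarity, sim_threshold=None, top_n=None):
    """Pick candidate words by a similarity threshold or by top-N ranking."""
    scored = [(w, similarity(pending, w)) for w in words if w != pending]
    if sim_threshold is not None:
        return [w for w, score in scored if score > sim_threshold]
    if top_n is None:
        raise ValueError("provide either sim_threshold or top_n")
    scored.sort(key=lambda item: item[1], reverse=True)
    return [w for w, _ in scored[:top_n]]


def select_targets(pending, candidates, distance, dist_threshold):
    """Keep the candidates whose edit distance from the pending word is below the threshold."""
    return [c for c in candidates if distance(pending, c) < dist_threshold]


# Hypothetical precomputed scores standing in for real similarity and
# edit-distance computations.
toy_similarity = {"phone": 0.9, "phony": 0.2, "weather": 0.1}
toy_distance = {"phone": 0.4, "phony": 1.4, "weather": 7.0}

words = ["telephone", "phone", "phony", "weather"]
candidates = select_candidates(
    "telephone", words, lambda p, w: toy_similarity[w], top_n=2)
targets = select_targets(
    "telephone", candidates, lambda p, c: toy_distance[c], dist_threshold=1.0)
```

Filtering by semantic similarity first keeps the edit-distance computation to a short candidate list and prevents unrelated words that merely share characters from being paired.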
3. The synonym discovery method according to claim 2, characterized in that calculating the semantic similarity between the pending word and each of the remaining words in the phrase set comprises:
Vectorizing each word in the phrase set;
Calculating, based on the vectorization result, the cosine similarity between the pending word and each of the remaining words, the cosine similarity serving as the semantic similarity.
4. The synonym discovery method according to claim 3, characterized in that vectorizing each word in the phrase set comprises:
Vectorizing each word in the phrase set by using the word2vec method.
5. The synonym discovery method according to claim 1, characterized in that obtaining the phrase set in which synonyms are to be discovered comprises:
Performing word segmentation on an input corpus to obtain the phrase set.
6. The synonym discovery method according to claim 5, characterized in that the input corpus is segmented by using a segmentation dictionary, the segmentation dictionary being obtained as follows:
Preprocessing the input corpus to obtain text data;
Splitting the text data into lines to obtain sentence data;
Segmenting the sentence data according to the individual words contained in a basic dictionary, to obtain segmented word data;
Combining adjacent segmented word data to generate candidate data strings;
Performing judgment processing on the candidate data strings to discover new words;
Adding the new words to the segmentation dictionary.
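The dictionary-building steps of claim 6 can be walked through on a toy corpus. The sketch below makes several assumptions that the claim does not specify: preprocessing only strips whitespace, sentences are split at punctuation, segmentation uses forward maximum matching against the basic dictionary, candidate strings are pairs of adjacent tokens, and the judgment step is a plain frequency threshold standing in for whatever test is actually used. All function names and sample data are hypothetical.

```python
import re
from collections import Counter


def preprocess(raw):
    """Assumed preprocessing: remove whitespace so only text characters remain."""
    return re.sub(r"\s+", "", raw)


def split_sentences(text):
    """Split the text into sentence strings at common punctuation marks."""
    return [s for s in re.split(r"[。！？；，!?;,.]", text) if s]


def segment(sentence, dictionary, max_len=4):
    """Forward maximum matching; unknown characters become single-character tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def discover_new_words(raw, base_dictionary, min_count=2):
    """Combine adjacent tokens and keep frequent combinations as new words."""
    counts = Counter()
    for sentence in split_sentences(preprocess(raw)):
        tokens = segment(sentence, base_dictionary)
        for left, right in zip(tokens, tokens[1:]):
            counts[left + right] += 1
    # Placeholder judgment: frequency only (an assumption of this sketch).
    return {c for c, n in counts.items()
            if n >= min_count and c not in base_dictionary}


# Toy corpus: "I pay with Alipay. Alipay is convenient."
corpus = "我用支付宝付款。\n支付宝很方便。"
base = {"支付", "付款", "方便"}  # "pay", "make a payment", "convenient"
new_words = discover_new_words(corpus, base)
segmentation_dictionary = base | new_words
```

After the new word is added, segmenting with `segmentation_dictionary` keeps "支付宝" (Alipay) as one token instead of splitting it into "支付" and "宝". A frequency threshold alone tends to admit false positives on larger corpora, so it should be read only as a stand-in for the judgment step.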
7. The synonym discovery method according to claim 1, characterized in that the remaining operations include an insertion operation and a replacement operation, the edit distance corresponding to a single insertion operation is greater than or equal to the predetermined threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the predetermined threshold.
8. A data processing method, characterized by comprising the synonym discovery method according to any one of claims 1-7.
9. A synonym discovery device, characterized by comprising:
An acquiring unit, adapted to obtain a pending phrase set, the phrase set including a plurality of words;
A synonym determining unit, adapted to, for any pending word in the phrase set, when one or more target words exist in the phrase set such that the smallest edit distance from the pending word to the target word is less than a predetermined threshold, determine the pending word and the corresponding target word to be a synonym pair;
Wherein the smallest edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance corresponding to the deletion operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the deletion operation is less than the predetermined threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the predetermined threshold.
10. The synonym discovery device according to claim 9, characterized in that the device further comprises:
A candidate word selecting unit, adapted to calculate the semantic similarity between the pending word and each of the remaining words in the phrase set, and to select therefrom, as candidate words, the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
A target word determining unit, adapted to calculate the smallest edit distance between the pending word and each candidate word, and to take the candidate words whose smallest edit distance to the pending word is less than the predetermined threshold as target words.
11. The synonym discovery device according to claim 10, characterized in that the candidate word selecting unit comprises:
A vectorization subunit, adapted to vectorize each word in the phrase set;
A cosine similarity computation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the pending word and each of the remaining words, the cosine similarity serving as the semantic similarity.
12. The synonym discovery device according to claim 11, characterized in that the vectorization subunit vectorizes each word in the phrase set by using the word2vec method.
13. The synonym discovery device according to claim 9, characterized in that the acquiring unit comprises:
A segmentation subunit, adapted to perform word segmentation on an input corpus to obtain the phrase set.
14. The synonym discovery device according to claim 13, characterized in that the segmentation subunit segments the input corpus by using a segmentation dictionary, the segmentation dictionary is obtained by a segmentation dictionary acquiring unit, and the segmentation dictionary acquiring unit is adapted to:
Preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; segment the sentence data according to the individual words contained in a basic dictionary, to obtain segmented word data; combine adjacent segmented word data to generate candidate data strings; perform judgment processing on the candidate data strings to discover new words; and add the new words to the segmentation dictionary.
15. The synonym discovery device according to claim 9, characterized in that the remaining operations include an insertion operation and a replacement operation, the edit distance corresponding to a single insertion operation is greater than or equal to the predetermined threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the predetermined threshold.
16. A data processing device, characterized by comprising the synonym discovery device according to any one of claims 9-15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610429937.XA CN106126494B (en) | 2016-06-16 | 2016-06-16 | Synonym finds method and device, data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106126494A (en) | 2016-11-16 |
CN106126494B (en) | 2018-12-28 |
Family ID: 57470670
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201610429937.XA | Active | CN106126494B (en) | 2016-06-16 | 2016-06-16 | Synonym finds method and device, data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126494B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561813A (en) * | 2009-05-27 | 2009-10-21 | 东北大学 | Method for analyzing similarity of character string under Web environment |
US20110060712A1 (en) * | 2009-09-09 | 2011-03-10 | Ichiro Harashima | Method and system for design check knowledge construction |
CN102750282A (en) * | 2011-04-19 | 2012-10-24 | 北京百度网讯科技有限公司 | Synonym template mining method and device as well as synonym mining method and device |
CN102760134A (en) * | 2011-04-28 | 2012-10-31 | 北京百度网讯科技有限公司 | Method and device for mining synonyms |
CN104978356A (en) * | 2014-04-10 | 2015-10-14 | 阿里巴巴集团控股有限公司 | Synonym identification method and device |
CN105095204A (en) * | 2014-04-17 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Method and device for obtaining synonym |
CN105183923A (en) * | 2015-10-27 | 2015-12-23 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
Non-Patent Citations (2)
Title |
---|
李华 等: "基于动态规划的缩写发现算法", 《武汉大学学报(工学版)》 * |
王宝勋 等: "一种基于无监督学习的词变体识别方法", 《中文信息学报》 * |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776543B (en) * | 2016-11-23 | 2019-09-06 | 上海智臻智能网络科技股份有限公司 | New word discovery method, apparatus, terminal and server |
CN106776543A (en) * | 2016-11-23 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | New word discovery method, device, terminal and server |
CN110516235A (en) * | 2016-11-23 | 2019-11-29 | 上海智臻智能网络科技股份有限公司 | New word discovery method, apparatus, terminal and server |
CN106649783A (en) * | 2016-12-28 | 2017-05-10 | 上海智臻智能网络科技股份有限公司 | Synonym mining method and apparatus |
CN106649816A (en) * | 2016-12-29 | 2017-05-10 | 北京奇虎科技有限公司 | Synonym filtering method and device |
CN106777283A (en) * | 2016-12-29 | 2017-05-31 | 北京奇虎科技有限公司 | The method for digging and device of a kind of synonym |
CN106777283B (en) * | 2016-12-29 | 2021-02-26 | 北京奇虎科技有限公司 | Synonym mining method and synonym mining device |
CN106649816B (en) * | 2016-12-29 | 2020-06-09 | 北京奇虎科技有限公司 | Synonym filtering method and device |
CN106933806A (en) * | 2017-03-15 | 2017-07-07 | 北京大数医达科技有限公司 | The determination method and apparatus of medical synonym |
CN107180026A (en) * | 2017-05-02 | 2017-09-19 | 苏州大学 | The event phrase learning method and device of a kind of word-based embedded Semantic mapping |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN107621892A (en) * | 2017-10-18 | 2018-01-23 | 北京百度网讯科技有限公司 | For obtaining the method and device of information |
CN108170806A (en) * | 2017-12-28 | 2018-06-15 | 东软集团股份有限公司 | Sensitive word detection filter method, device and computer equipment |
CN108170806B (en) * | 2017-12-28 | 2020-11-20 | 东软集团股份有限公司 | Sensitive word detection and filtering method and device and computer equipment |
CN108255810A (en) * | 2018-01-10 | 2018-07-06 | 北京神州泰岳软件股份有限公司 | Near synonym method for digging, device and electronic equipment |
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | 中国人民公安大学 | Based on the Text similarity computing method and system for improving LDA topic model |
WO2020061910A1 (en) * | 2018-09-27 | 2020-04-02 | 北京字节跳动网络技术有限公司 | Method and apparatus used for generating information |
CN109741190A (en) * | 2018-12-27 | 2019-05-10 | 清华大学 | A kind of method, system and the equipment of the classification of personal share bulletin |
CN113689923A (en) * | 2020-05-19 | 2021-11-23 | 北京平安联想智慧医疗信息技术有限公司 | Medical data processing apparatus, system and method |
CN113689923B (en) * | 2020-05-19 | 2024-06-18 | 北京平安联想智慧医疗信息技术有限公司 | Medical data processing device, system and method |
CN113761905A (en) * | 2020-07-01 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method and device for constructing domain modeling vocabulary |
Also Published As
Publication number | Publication date |
---|---|
CN106126494B (en) | 2018-12-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106126494B (en) | Synonym finds method and device, data processing method and device | |
CN107480143B (en) | Method and system for segmenting conversation topics based on context correlation | |
CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
CN106528532B (en) | Text error correction method, device and terminal | |
CN107515877B (en) | Sensitive subject word set generation method and device | |
Zhang et al. | Joint word segmentation and POS tagging using a single perceptron | |
US9223779B2 (en) | Text segmentation with multiple granularity levels | |
CN106326303B (en) | A kind of spoken semantic analysis system and method | |
CN105183923A (en) | New word discovery method and device | |
CN101751455B (en) | Method for automatically generating title by adopting artificial intelligence technology | |
CN103970733B (en) | A kind of Chinese new word identification method based on graph structure | |
CN110349568A (en) | Speech retrieval method, apparatus, computer equipment and storage medium | |
CN101021838A (en) | Text handling method and system | |
CN106445921B (en) | Utilize the Chinese text terminology extraction method of quadratic mutual information | |
CN105224682A (en) | New word discovery method and device | |
CN106897290B (en) | Method and device for establishing keyword model | |
CN103268313A (en) | Method and device for semantic analysis of natural language | |
CN110134777B (en) | Question duplication eliminating method and device, electronic equipment and computer readable storage medium | |
WO2017091985A1 (en) | Method and device for recognizing stop word | |
CN103488782B (en) | A kind of method utilizing lyrics identification music emotion | |
KR101638535B1 (en) | Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same | |
US9652997B2 (en) | Method and apparatus for building emotion basis lexeme information on an emotion lexicon comprising calculation of an emotion strength for each lexeme | |
CN109934251A (en) | A kind of method, identifying system and storage medium for rare foreign languages text identification | |
Ahmadi et al. | A hybrid method for Persian named entity recognition | |
Hadj Ameur et al. | Restoration of Arabic diacritics using a multilevel statistical model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||