WO2012153524A1 - Synonymous expression determination device, method, and program - Google Patents
Synonymous expression determination device, method, and program
- Publication number
- WO2012153524A1 (application PCT/JP2012/003023)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- input
- similarity
- word
- distribution
- utterances
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Definitions
- The present invention relates to a synonymous expression determination device, a synonymous expression determination method, and a synonymous expression determination program for determining whether given expressions are synonymous.
- A synonymous expression dictionary is one of the language resources needed to search accurately with queries that have a complex syntactic structure, such as natural-language sentences. Such a dictionary usually has to be built separately for each field of the documents to be searched. However, keeping staff with specialized knowledge assigned to this work for a long time incurs a high labor cost, so a technique for building synonymous expression dictionaries automatically is needed.
- Examples of synonymous expressions of binary relations, each expressed as a pair of a nominal (体言) and a predicate (用言), include "turn on the power" and "turn on the power switch".
- Hereinafter, a predicate that constitutes an input binary relation is called an input predicate, and a nominal that constitutes an input binary relation is called an input nominal.
- Non-Patent Document 1 describes a technique for extracting synonymous expressions of binary relations: the contexts surrounding each binary relation are collected from a document set as feature values, and binary relations with similar feature values are extracted as synonymous expressions.
- As the context, this technique uses predicates related to the input predicate in the document set, or nominals other than the input nominal that are in a case relation with the input predicate. For example, from the sentence "Graduate from a university as a principal and get a job at a company", the features obtained for the binary relation "graduate from a university" are "as a principal" and "get a job".
- Non-Patent Document 2 describes a technique that collects, as the feature value of an input nominal, the distribution of appearance frequencies of the predicates that are in a binary relation with it in a document set, and extracts input nominals with similar feature values as synonymous expressions.
- With the method of Non-Patent Document 1, it is difficult to obtain enough features to extract synonymous expressions of binary relations, because no feature can be acquired from a sentence in which a binary relation appears on its own.
- With the method of Non-Patent Document 2, when the input predicate or the input nominal is ambiguous, the feature values are not similar, so binary relations that are in fact synonymous cannot be identified.
- An object of the present invention is therefore to provide a synonymous expression determination device, a synonymous expression determination method, and a synonymous expression determination program that can correctly determine synonymous expressions of binary relations even when the input predicate or the input nominal is ambiguous.
- The synonymous expression determination device according to the present invention includes: synonym determination means that receives a set of binary relations, each composed of a nominal and a predicate, and determines whether the input binary relations are synonymous using the similarity between the input nominals and the similarity between the input predicates; and inter-predicate similarity calculation means that, when calculating the similarity between the input predicates from the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, uses the distribution of only those nominals that are used in the same kind of concept as the input nominals.
- The synonymous expression determination method according to the present invention receives a set of binary relations, each composed of a nominal and a predicate, and determines whether the input binary relations are synonymous using the similarity between the input nominals and the similarity between the input predicates. When the similarity between the input predicates is calculated from the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, the distribution of only those nominals that are used in the same kind of concept as the input nominals is used.
- The synonymous expression determination program according to the present invention causes a computer to execute: synonym determination processing that receives a set of binary relations, each composed of a nominal and a predicate, and determines whether the input binary relations are synonymous using the similarity between the input nominals and the similarity between the input predicates; and inter-predicate similarity calculation processing that, when calculating the similarity between the input predicates from the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, uses the distribution of only those nominals that are used in the same kind of concept as the input nominals.
- FIG. 1 is a diagram illustrating a configuration example of the synonymous expression determination device according to the present invention.
- The synonymous expression determination device includes a data processing device 1 that operates under program control, a storage device 2 that stores information, an input device 3 such as a keyboard, and an output device 4 such as a display.
- The input device 3 inputs data representing two binary relations to the data processing device 1 in response to, for example, a user operation.
- A binary relation is a pair of a predicate and a nominal that are in a case relation.
- For example, the input device 3 inputs data representing "power - turn on" and data representing "power switch - turn on" to the data processing device 1 as the two binary relations.
- The number of binary relations is not limited to two and may be three or more.
- The output device 4 outputs the processing result of the data processing device 1.
- The output device 4 is realized by, for example, a display device, and shows the processing result of the data processing device 1 on its display unit.
- The data processing device 1 includes appearance frequency calculation means 10, appearance frequency correction means 11, inter-predicate similarity calculation means 12, inter-nominal similarity calculation means 13, and synonym determination means 14.
- The data processing device 1 is realized by an information processing device, such as a personal computer, that operates according to a program.
- The appearance frequency calculation means 10 extracts binary relations from the document data stored in the document storage unit 20 (hereinafter simply "documents") and calculates the appearance frequency of each. Specifically, it is realized by the CPU of an information processing device that operates according to a program.
- The appearance frequency correction means 11 refers to the concept class storage unit 22 and obtains the degree to which each predicate or nominal in the document set is used in the same concept as an input predicate or an input nominal.
- The appearance frequency correction means 11 then corrects the appearance frequencies of the binary relations in the document set according to the obtained degree.
- Specifically, the appearance frequency correction means 11 is realized by the CPU of an information processing device that operates according to a program.
- The inter-predicate similarity calculation means 12 takes, as the feature value of each input predicate, the distribution of the appearance frequencies (or corrected appearance frequencies) of the nominals that are in a binary relation with that input predicate in the document set, and calculates the degree to which the feature values of the input predicates are similar.
- Specifically, the inter-predicate similarity calculation means 12 is realized by the CPU of an information processing device that operates according to a program.
- The inter-nominal similarity calculation means 13 takes, as the feature value of each input nominal, the distribution of the appearance frequencies (or corrected appearance frequencies) of the predicates that are in a binary relation with that input nominal in the document set, and calculates the degree to which the feature values of the input nominals are similar. Specifically, it is realized by the CPU of an information processing device that operates according to a program.
- The synonym determination means 14 determines that the two input binary relations are synonymous expressions when the similarity between the predicates and the similarity between the nominals satisfy a predetermined condition, and outputs the determination result to the output device 4.
- Specifically, the synonym determination means 14 is realized by the CPU of an information processing device that operates according to a program.
- The storage device 2 includes a document storage unit 20, an appearance frequency storage unit 21, a concept class storage unit 22, and a corrected appearance frequency storage unit 23.
- The storage device 2 is realized by an optical disk device, a magnetic disk device, or the like.
- The document storage unit 20 stores a document set.
- The appearance frequency storage unit 21 stores data indicating the appearance frequencies of the binary relations contained in the document set.
- This data is registered in the appearance frequency storage unit 21 by, for example, the appearance frequency calculation means 10.
- The concept class storage unit 22 stores data indicating the concept classes to which each predicate or nominal belongs. This data is, for example, determined manually in advance and registered in the concept class storage unit 22. Alternatively, it may be registered automatically through computation based on statistics or the like.
- The corrected appearance frequency storage unit 23 stores data indicating the corrected appearance frequencies of the binary relations. This data is registered in the corrected appearance frequency storage unit 23 by, for example, the appearance frequency correction means 11.
- FIG. 2 is a flowchart illustrating an example of the processing executed by the synonymous expression determination device.
- The following describes the case where data representing "power - turn on" and data representing "power switch - turn on" are input from the input device 3 to the data processing device 1 as the two binary relations.
- First, the appearance frequency calculation means 10 extracts binary relations from the documents stored in the document storage unit 20 and calculates the appearance frequency of each (step S1 in FIG. 2).
- Here, a binary relation is a pair of a predicate and a nominal that are in a case relation.
- In step S1, the appearance frequency calculation means 10 extracts the binary relations contained in the documents using, for example, a morphological and dependency analysis tool such as CaboCha.
- CaboCha is described at http://chasen.org/~taku/software/cabocha/.
- First, the appearance frequency calculation means 10 splits each sentence into words using a morphological analysis tool and assigns a part of speech to each word. For example, morphological analysis of the sentence "turn on the power switch" outputs "power [noun-general] / switch [noun-general] / (particle) [particle-case particle] / turn-on [noun-sahen connection] / (verb) [verb-independent]".
- Next, the appearance frequency calculation means 10 groups the morphological analysis results into clauses using a dependency analysis tool and assigns dependency relations between the clauses.
- For example, the above result is grouped into two clauses, (1) {power / switch / (particle)} and (2) {turn-on / (verb)}, and a dependency relation is assigned between them with clause (1) as the source and clause (2) as the target.
- The appearance frequency calculation means 10 then extracts binary relations as follows.
- First, it detects predicate clauses.
- A predicate clause is a clause whose first morpheme is tagged "verb-independent", "noun-adjectival verb stem", or "noun-sahen connection".
- Next, it determines whether a clause that depends on the predicate clause is a nominal clause in a case relation with the predicate clause.
- A nominal clause is a clause whose first morpheme is tagged "noun-general", "noun-sahen connection", or "noun-adjectival verb stem". Whether it is in a case relation with the predicate clause is determined by whether the last morpheme of the nominal clause is tagged "particle-case particle" or "particle-binding particle".
- Finally, it takes the word sequence of the nominal clause, with the particle that marks the case relation removed, as the nominal, and the word sequence of the predicate clause as the predicate. In the above example, "power switch - turn on" is obtained.
- The particle that marks the case relation with the predicate may instead be kept in the nominal. In that case, the nominal retains its particle, as in "turn on the power switch". Keeping the particle makes it possible to distinguish differences in meaning between binary relations that arise from different particles. On the other hand, it has the drawback that the appearance frequencies are spread over more entries.
- The appearance frequency calculation means 10 calculates the appearance frequency of each extracted binary relation and stores the result in the appearance frequency storage unit 21.
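- The extraction rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Morpheme` and `Clause` structures, the English tag strings, and the function names are hypothetical, and a real system would fill these structures from an analyzer such as CaboCha rather than by hand. Here "nominal" corresponds to 体言 and "predicate" to 用言.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morpheme:
    surface: str  # word as it appears in the sentence
    pos: str      # part-of-speech tag (tag names here are illustrative)


@dataclass
class Clause:
    morphemes: List[Morpheme]
    head: Optional[int]  # index of the clause this one depends on, or None


# Tag sets taken from the description; the spellings are assumptions.
PREDICATE_POS = {"verb-independent", "noun-adjectival-stem", "noun-sahen"}
NOMINAL_POS = {"noun-general", "noun-sahen", "noun-adjectival-stem"}
CASE_POS = {"particle-case", "particle-binding"}


def extract_binary_relations(clauses: List[Clause]) -> List[Tuple[str, str]]:
    """Return (nominal, predicate) pairs found in one parsed sentence."""
    relations = []
    for clause in clauses:
        if clause.head is None:
            continue
        target = clauses[clause.head]
        starts_as_nominal = clause.morphemes[0].pos in NOMINAL_POS
        ends_with_case = clause.morphemes[-1].pos in CASE_POS
        target_is_predicate = target.morphemes[0].pos in PREDICATE_POS
        if starts_as_nominal and ends_with_case and target_is_predicate:
            # Drop the trailing particle from the nominal clause.
            nominal = " ".join(m.surface for m in clause.morphemes[:-1])
            predicate = " ".join(m.surface for m in target.morphemes)
            relations.append((nominal, predicate))
    return relations


def count_relations(sentences: List[List[Clause]]) -> Counter:
    """Appearance frequency of each binary relation over all sentences."""
    counts: Counter = Counter()
    for clauses in sentences:
        counts.update(extract_binary_relations(clauses))
    return counts
```

- Keeping the particle in the nominal, as the description also allows, would only require joining all morphemes of the nominal clause instead of dropping the last one.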
- FIG. 3 shows an example of the data stored in the appearance frequency storage unit 21.
- In the table, the vertical axis lists the nominals,
- the horizontal axis lists the predicates,
- and each value is the appearance frequency of the corresponding binary relation.
- For example, the appearance frequency of "power switch - turn on" is 10.
- Next, the appearance frequency correction means 11 refers to the concept class storage unit 22 to obtain the degree to which each predicate or nominal in the document set is used in the same concept as the input predicate or the input nominal. It then corrects the appearance frequency of each binary relation in the document set according to the obtained degree (step S2 in FIG. 2).
- The concept class storage unit 22 stores data indicating the concept classes to which each predicate or nominal belongs. These values are stored in advance. The membership probabilities may be determined manually or computed automatically. One automatic method is described below.
- The concept classes to which a nominal belongs are determined using probabilistic clustering such as a GMM (a mixture of multidimensional normal distributions).
- In a GMM, each concept class a is represented by a multidimensional normal distribution.
- A nominal N is represented as a vector whose number of dimensions equals the number of distinct predicates V, and the value of each dimension is the appearance frequency of the corresponding predicate in a binary relation with N. The multidimensional normal distribution therefore also has as many dimensions as there are distinct predicates V.
- The probability P(a|N) that the nominal N belongs to class a is obtained using the EM algorithm.
- First, P(a|N) is given as an initial state.
- Next, the mean and variance of the multidimensional normal distribution of each class a are updated based on P(a|N).
- P(a|N) is then recomputed based on this new multidimensional normal distribution. This is repeated a finite number of times to determine P(a|N).
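- The EM loop described above (start from initial membership probabilities P(a|N), update each class's mean and variance, recompute P(a|N), repeat a fixed number of times) might look like the following sketch. It makes assumptions the text does not state: diagonal covariance, a variance floor to avoid division by zero, and the hypothetical name `em_soft_clusters`. Each input vector stands for one nominal (体言), with one dimension per predicate (用言).

```python
import math
from typing import List


def em_soft_clusters(vectors: List[List[float]],
                     init_resp: List[List[float]],
                     n_iter: int = 20,
                     var_floor: float = 1e-3) -> List[List[float]]:
    """Return P(a|N) for each vector N after n_iter EM iterations.

    vectors[i]   : frequency vector of nominal i
    init_resp[i] : initial P(a|N_i) over the concept classes a
    """
    n, dim = len(vectors), len(vectors[0])
    k = len(init_resp[0])
    resp = [row[:] for row in init_resp]
    for _ in range(n_iter):
        # M-step: update weight, mean, and variance of every class from P(a|N).
        weights, means, variances = [], [], []
        for a in range(k):
            total = sum(resp[i][a] for i in range(n)) + 1e-12
            mean = [sum(resp[i][a] * vectors[i][d] for i in range(n)) / total
                    for d in range(dim)]
            var = [max(sum(resp[i][a] * (vectors[i][d] - mean[d]) ** 2
                           for i in range(n)) / total, var_floor)
                   for d in range(dim)]
            weights.append(total / n)
            means.append(mean)
            variances.append(var)
        # E-step: recompute P(a|N) under the updated normal distributions.
        for i in range(n):
            log_p = []
            for a in range(k):
                ll = math.log(weights[a])
                for d in range(dim):
                    diff = vectors[i][d] - means[a][d]
                    ll -= 0.5 * (math.log(2 * math.pi * variances[a][d])
                                 + diff * diff / variances[a][d])
                log_p.append(ll)
            top = max(log_p)  # subtract the maximum for numerical stability
            exp_p = [math.exp(v - top) for v in log_p]
            z = sum(exp_p)
            resp[i] = [v / z for v in exp_p]
    return resp
```

- The returned rows play the role of the membership probabilities that the concept class storage unit 22 holds; a full-covariance model or a library implementation could be substituted without changing the surrounding steps.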
- FIG. 4A shows an example of how the concept classes to which nominals belong are stored in the concept class storage unit 22.
- The concept classes to which each nominal belongs are given as probabilities P(a|N).
- FIG. 4B shows an example of how the concept classes to which predicates belong are stored in the concept class storage unit 22.
- The concept classes to which each predicate belongs are likewise given as probabilities.
- The appearance frequency correction means 11 refers to the concept class storage unit 22 to obtain the degree to which each predicate or nominal in the document set is used in the same concept as the input predicate or the input nominal. First, it obtains the degree CS(N, IN) to which a nominal N in the document set is used in the same concept as an input nominal IN, using Equation (1).
- Here, a represents a concept class,
- and P(a|N) represents the probability that N belongs to a.
- Next, the appearance frequency correction means 11 obtains the degree CS(N, IN1, IN2) to which a nominal N in the document set is used in the same concept as the input nominals IN1 and IN2, using Equation (2).
- In this example, the input nominals are "power" and "power switch".
- The nominals in the document set, from the figure, are "power", "power switch", "button", "school", and "university".
- Similarly, the appearance frequency correction means 11 obtains the degree to which a predicate P in the document set is used in the same concept as the input predicates IP1 and IP2, using Equations (3) and (4).
- $CS(P, IP_1, IP_2) = \max\{CS(P, IP_1), CS(P, IP_2)\}$ (Equation (3))
- $CS(P, IP) = \sum_{b} \min\{P(P, b), P(IP, b)\}$ (Equation (4))
- In this example, the input predicates are "put in" (入れる) and "turn on" (投入する).
- The predicates in the document set, from the figure, are "put in" (入れる), "turn on" (投入する), "attach" (付ける), "fall" (落ちる), and "stabilize" (安定する). Calculating CS for these gives the values listed under Description.
- CS = 0 may also be used.
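- Equations (3) and (4) can be sketched as follows: the overlap between two words is the sum over concept classes of the smaller membership value, and the score against two input words is the larger of the two overlaps. The function names and the membership values in the usage note are hypothetical and are not the values of FIG. 4.

```python
from typing import Dict, Iterable

Membership = Dict[str, Dict[str, float]]  # word -> {concept class -> probability}


def cs_pair(word: str, input_word: str, membership: Membership) -> float:
    """Equation (4): sum over classes b of min(P(word, b), P(input_word, b))."""
    p_word = membership.get(word, {})
    p_input = membership.get(input_word, {})
    classes = set(p_word) | set(p_input)
    return sum(min(p_word.get(b, 0.0), p_input.get(b, 0.0)) for b in classes)


def cs(word: str, input_words: Iterable[str], membership: Membership) -> float:
    """Equation (3): maximum of the pairwise scores against each input word."""
    return max(cs_pair(word, iw, membership) for iw in input_words)
```

- For instance, with the made-up table `{"put in": {"b1": 0.5, "b2": 0.5}, "turn on": {"b1": 1.0}, "attach": {"b1": 0.7, "b3": 0.3}}`, calling `cs("attach", ["put in", "turn on"], table)` takes the larger of the overlaps 0.5 and 0.7. A word missing from the table gets a score of 0 in this sketch.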
- Next, the appearance frequency correction means 11 corrects the appearance frequency of each binary relation stored in the appearance frequency storage unit 21 using the CS values obtained above.
- The correction is performed in two ways: focusing on the predicate of each binary relation, and focusing on its nominal. The former uses CS(P, IP1, IP2), and the latter uses CS(N, IN1, IN2).
- One correction method is to set the appearance frequency to 0 if the CS value is below a preset threshold.
- FIG. 5A shows an example of the contents of the corrected appearance frequency storage unit 23 when the threshold is set to 0.6 and the appearance frequencies are corrected focusing on the nominals of the binary relations.
- FIG. 5B shows the result of correcting the appearance frequencies with the threshold set to 0.6, focusing on the predicates of the binary relations.
- Another correction method is to multiply the appearance frequency by the CS value.
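- Both correction methods can be sketched in one small function. The names `correct_frequencies` and `focus` are hypothetical, frequencies are keyed by (nominal, predicate) pairs (体言, 用言), and the sample count for the second relation in the usage note is invented for illustration.

```python
from typing import Dict, Optional, Tuple

Relation = Tuple[str, str]  # (nominal, predicate)


def correct_frequencies(freq: Dict[Relation, float],
                        cs_scores: Dict[str, float],
                        focus: int = 0,
                        threshold: Optional[float] = None) -> Dict[Relation, float]:
    """Correct appearance frequencies with CS scores.

    focus=0 looks up the CS score of the nominal, focus=1 that of the predicate.
    With a threshold, frequencies whose CS is below it are set to 0;
    without one, each frequency is multiplied by its CS score.
    """
    corrected = {}
    for relation, count in freq.items():
        score = cs_scores.get(relation[focus], 0.0)
        if threshold is not None:
            corrected[relation] = count if score >= threshold else 0
        else:
            corrected[relation] = count * score
    return corrected
```

- With `freq = {("power switch", "turn on"): 10, ("school", "enter"): 4}` and nominal scores `{"power switch": 1.0, "school": 0.1}`, a threshold of 0.6 keeps the first count and zeroes the second, while the multiplicative variant scales the second count by 0.1.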
- Next, the inter-predicate similarity calculation means 12 takes, as the feature value of each input predicate, the distribution of the appearance frequencies (or corrected appearance frequencies) of the nominals that are in a binary relation with it in the document set, and calculates the degree of similarity between the feature values.
- Likewise, the inter-nominal similarity calculation means 13 takes, as the feature value of each input nominal, the distribution of the appearance frequencies (or corrected appearance frequencies) of the predicates that are in a binary relation with it in the document set, and calculates the degree of similarity between the feature values of the input nominals (step S3 in FIG. 2). The inter-predicate similarity calculation means 12 and the inter-nominal similarity calculation means 13 may execute their processing in either order.
- The inter-predicate similarity calculation means 12 first determines the feature value of each input predicate from the distribution of the appearance frequencies (or corrected appearance frequencies) of the nominals that are in a binary relation with it in the document set. For example, when the input predicates are V1 and V2, the inter-predicate similarity calculation means 12 uses { P (V1
- V2) represent values (here, probabilities) obtained by normalizing the corrected appearance frequencies of the nominals that are in a binary relation with V1 or V2.
- n represents an arbitrary nominal selected from the whole set N.
- The corrected appearance frequency here is the appearance frequency corrected by the appearance frequency correction means 11 focusing on the nominals.
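- The normalization step can be illustrated as follows: the corrected frequencies of the nominals (体言) that co-occur with one predicate (用言) are divided by their total so that they form a probability distribution. The function name and the sample counts in the usage note are hypothetical.

```python
from typing import Dict, Tuple


def feature_distribution(predicate: str,
                         corrected: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Normalize the corrected frequencies of one predicate's nominals.

    corrected maps (nominal, predicate) pairs to corrected frequencies.
    Nominals whose corrected frequency is 0 are left out of the result.
    """
    counts = {n: c for (n, p), c in corrected.items() if p == predicate and c > 0}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: c / total for n, c in counts.items()}
```

- For example, corrected counts of 10 for "power switch" and 6 for "button" with the same predicate become 0.625 and 0.375. The feature value of a nominal is obtained the same way with the roles of the two tuple positions swapped.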
- Next, the inter-predicate similarity calculation means 12 calculates the score Score(V1, V2), which indicates how similar the feature values of the predicates are. Specifically, it uses Equation (5).
- The score calculation method is not limited to Equation (5).
- For example, the cosine similarity between the input predicates may be calculated using { f (V1, n)
- Here, f(V1, n) and f(V2, n) represent the corrected appearance frequencies of the binary relations whose predicate is V1 or V2, respectively.
- In this example, Score(V1, V2) = 0.263.
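- The cosine-similarity alternative mentioned above can be sketched as follows. Each argument is a sparse vector mapping a nominal n to the corrected frequency f(V, n) for one predicate V. This illustrates the alternative only; it is not Equation (5), and it is not claimed to reproduce the value 0.263.

```python
import math
from typing import Dict


def cosine_similarity(f1: Dict[str, float], f2: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * f2.get(key, 0.0) for key, value in f1.items())
    norm1 = math.sqrt(sum(value * value for value in f1.values()))
    norm2 = math.sqrt(sum(value * value for value in f2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm1 * norm2)
```

- Because corrected frequencies of nominals outside the input nominals' concept are 0 or small, they contribute little to the dot product, which is how the correction step affects the score.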
- The inter-nominal similarity calculation means 13 obtains the degree of similarity between the feature values of the input nominals in the same manner as the inter-predicate similarity calculation means 12.
- Let N1 and N2 be the nominals constituting the input binary relations.
- N2) represent values (here, probabilities) obtained by normalizing the corrected appearance frequencies of the predicates that are in a binary relation with N1 or N2.
- v represents an arbitrary predicate selected from the whole set V.
- Next, the inter-nominal similarity calculation means 13 calculates the score Score(N1, N2), which indicates how similar the feature values of the nominals are. Specifically, it uses Equation (6).
- Score (N1, N2) P (power switch
- In this example, Score(N1, N2) = 0.276.
- Next, the synonym determination means 14 determines that the two input binary relations are synonymous expressions when the similarity between the predicates and the similarity between the nominals satisfy a predetermined condition, and outputs the determination result to the output device 4 (step S4 in FIG. 2).
- The predetermined condition is, for example, that the product of the similarity between the predicates and the similarity between the nominals is equal to or greater than a specified value.
- The condition is not limited to this. The similarities may instead be summed or averaged, or both the similarity between the predicates and the similarity between the nominals may each be required to be equal to or greater than a specified value.
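- The determination conditions listed above can be sketched as a single function. The function name, the rule labels, and the threshold of 0.05 used in the usage note are hypothetical; the description does not give a concrete threshold.

```python
def is_synonymous(pred_sim: float, nom_sim: float,
                  threshold: float, rule: str = "product") -> bool:
    """Decide synonymy from the inter-predicate and inter-nominal similarities.

    rule selects the condition: "product", "sum", "mean", or "both"
    (both similarities must individually reach the threshold).
    """
    if rule == "product":
        return pred_sim * nom_sim >= threshold
    if rule == "sum":
        return pred_sim + nom_sim >= threshold
    if rule == "mean":
        return (pred_sim + nom_sim) / 2 >= threshold
    if rule == "both":
        return pred_sim >= threshold and nom_sim >= threshold
    raise ValueError(f"unknown rule: {rule}")
```

- With an illustrative threshold of 0.05, the scores 0.263 and 0.276 from this example pass the product rule, whereas a pair such as 0.192 and 0.2 does not.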
- In this embodiment, the input device 3 and the output device 4 are used as an interface between a human and a computer.
- However, the input device 3 and the output device 4 can also be used to receive input from other devices and systems and to output the determination result to those devices.
- The synonym determination means 14 may output the product of the similarities as it is. Alternatively, instead of using the synonym determination means 14, the calculation results of the inter-predicate similarity calculation means 12 and the inter-nominal similarity calculation means 13 may be output as they are.
- According to this embodiment, synonymous expressions of binary relations can be determined correctly. One reason is that, when the similarity between input predicates is calculated, the distribution of the appearance frequencies of only the nominals used in the same concept as the input nominals is used as the feature value. Another reason is that, when the similarity between input nominals is calculated, the distribution of the appearance frequencies of only the predicates used in the same concept as the input predicates is used as the feature value.
- Suppose that the meaning of the input predicate determined by its relation to the input nominal is meaning A.
- Using as the feature value the appearance frequency distribution of only the nominals of the same kind of concept as the input nominal means using the appearance frequency distribution of the nominals that are in a binary relation with the input predicate used in meaning A. For this reason, the feature values of input predicates that are synonymous expressions are similar.
- For example, suppose that the meaning of "put in" (入れる) and "turn on" (投入する) determined by "turn on the power" and "turn on the power switch" is meaning A.
- Using as the feature value the appearance frequency distribution of only the nominals of the same kind of concept as the input nominals then means using the distribution of the appearance frequencies of the nominals that are in a binary relation with "put in" and "turn on" used in meaning A.
- The feature values of "put in" and "turn on" are therefore similar.
- Similarly, suppose that the meaning of the input nominal determined by its relation to the input predicate is meaning B.
- Using as the feature value the appearance frequency distribution of only the predicates of the same kind of concept as the input predicate means using the appearance frequency distribution of the predicates that are in a binary relation with the input nominal used in meaning B. For this reason, the feature values of input nominals that are synonymous expressions are similar.
- FIG. 6 and FIG. 7 compare the values calculated by the method described in Non-Patent Document 2 with the values calculated by the proposed method (that is, the values calculated in this embodiment).
- With the method of Non-Patent Document 2, the two similarities are 0.192 and 0.2, and their product is 0.038.
- With the proposed method, the two similarities are 0.263 and 0.276, the values given for Score(V1, V2) and Score(N1, N2), and their product is 0.072. This also shows that the proposed method can determine synonymy correctly even when the input predicate or the input nominal is ambiguous.
- As described above, the synonymous expression determination device receives a set of binary relations, each composed of a nominal and a predicate, and determines whether they are synonymous using the similarity between the input nominals and the similarity between the input predicates.
- When calculating the similarity between the input predicates from the distribution of the appearance frequencies of the nominals that are in a binary relation with the input predicates in the document set, the device uses the distribution of only the nominals used in the same kind of concept as the input nominals.
- When calculating the similarity between the input nominals from the distribution of the appearance frequencies of the predicates that are in a binary relation with the input nominals in the document set, the device uses the distribution of only the predicates used in the same kind of concept as the input predicates.
- FIG. 8 is a block diagram illustrating a minimum configuration example of the synonymous expression determination device.
- The synonymous expression determination device includes the synonym determination means 14 and the inter-predicate similarity calculation means 12 as its minimum components.
- The synonym determination means 14 receives a set of binary relations, each composed of a nominal and a predicate, and determines whether the input set of binary relations is synonymous using the similarity between the input nominals and the similarity between the input predicates.
- The inter-predicate similarity calculation means 12, when calculating the similarity between the input predicates from the distribution of the appearance frequencies of the nominals that are in a binary relation with the input predicates in the document set, uses the distribution of only the nominals used in the same kind of concept as the input nominals.
- With the synonymous expression determination device of the minimum configuration, synonymous expressions of binary relations can be determined correctly even when the input predicate or the input nominal is ambiguous.
- The synonymous expression determination device includes: synonym determination means (for example, realized by the synonym determination means 14) that receives a set of binary relations, each composed of a nominal and a predicate, and determines whether the input set of binary relations is synonymous using the similarity between the input nominals and the similarity between the input predicates; and inter-predicate similarity calculation means (for example, realized by the inter-predicate similarity calculation means 12) that, when calculating the similarity between the input predicates from the distribution of the appearance frequencies of the nominals that are in a binary relation with the input predicates in the document set, uses the distribution of only the nominals used in the same kind of concept as the input nominals.
- The synonymous expression determination device may include: the synonym determination means (for example, realized by the synonym determination means 14); the inter-predicate similarity calculation means (for example, realized by the inter-predicate similarity calculation means 12); and inter-nominal similarity calculation means (for example, realized by the inter-nominal similarity calculation means 13) that, when calculating the similarity between the input nominals from the distribution of the appearance frequencies of the predicates that are in a binary relation with the input nominals in the document set, uses the distribution of only the predicates used in the same kind of concept as the input predicates.
- The synonymous expression determination device may include: synonym determination means (for example, realized by the synonym determination means 14) that receives a set of binary relations, each composed of a nominal and a predicate, and determines whether the input set of binary relations is synonymous using the similarity between the input nominals and the similarity between the input predicates; concept class storage means (for example, realized by the concept class storage unit 22) that stores the concept classes to which predicates and nominals belong; appearance frequency correction means (for example, realized by the appearance frequency correction means 11) that refers to the concept class storage means to obtain the degree to which a predicate or nominal in the document set is used in the same concept as an input predicate or an input nominal, and corrects the appearance frequencies of the binary relations in the document set according to the obtained degree; inter-predicate similarity calculation means (for example, realized by the inter-predicate similarity calculation means 12) that takes, as the feature value of each input predicate, the distribution of the appearance frequencies or corrected appearance frequencies of the nominals that are in a binary relation with it in the document set, and calculates the degree of similarity between the feature values of the input predicates; and inter-nominal similarity calculation means (for example, realized by the inter-nominal similarity calculation means 13) that takes, as the feature value of each input nominal, the distribution of the appearance frequencies or corrected appearance frequencies of the predicates that are in a binary relation with it in the document set, and calculates the degree of similarity between the feature values of the input nominals.
- In the synonymous expression determination device, the nominal constituting a binary relation may also include the particle that is in a case relation with the predicate.
- The synonym determination means may be configured to determine that the input pair of binary relations is synonymous when the similarity between the input nominals and the similarity between the input predicates satisfy a predetermined condition.
- The present invention is applicable to uses such as accurate search for queries with a complex syntactic structure, for example natural-language sentences.
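The determination described in the items above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the co-occurrence counts, the sets of same-concept words, the choice of cosine similarity, and the threshold of 0.5 are all hypothetical. The text only states that each distribution is restricted to words used in the same kind of concept as the inputs and that the two similarities must satisfy a predetermined condition.

```python
import math
from collections import Counter


def restrict(dist, allowed):
    """Keep only the co-occurring words that are in the allowed set."""
    return Counter({w: c for w, c in dist.items() if w in allowed})


def cosine(a, b):
    """Cosine similarity between two sparse frequency distributions."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def is_synonymous(pred_dists, nom_dists, same_concept_nominals,
                  same_concept_predicates, threshold=0.5):
    """Judge a pair of binary relations (assumed condition: both
    similarities reach the threshold)."""
    # Predicate similarity uses only nominals in the same concept as the input nominals.
    p1, p2 = (restrict(d, same_concept_nominals) for d in pred_dists)
    # Nominal similarity uses only predicates in the same concept as the input predicates.
    n1, n2 = (restrict(d, same_concept_predicates) for d in nom_dists)
    pred_sim, nom_sim = cosine(p1, p2), cosine(n1, n2)
    return pred_sim >= threshold and nom_sim >= threshold, pred_sim, nom_sim


# Hypothetical counts: nominals seen with each input predicate ...
pred_a = {"power": 10, "button": 4, "school": 20}
pred_b = {"power": 8, "button": 3, "funds": 15}
# ... and predicates seen with each input nominal.
nom_a = {"turn on": 10, "apply": 8, "fall": 5}
nom_b = {"turn on": 6, "apply": 7, "stabilize": 2}

result = is_synonymous((pred_a, pred_b), (nom_a, nom_b),
                       {"power", "power switch", "button"},
                       {"turn on", "apply", "attach"})
print(result)
```

In this toy data, dropping unrelated nominals such as "school" and "funds" raises the predicate similarity, which is the effect the restriction is meant to have.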
Abstract
Description
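CS(電源スイッチ "power switch", 電源スイッチ "power switch") = 1.0
CS(ボタン "button", 電源 "power", 電源スイッチ "power switch") = 0.6
CS(学校 "school", 電源 "power", 電源スイッチ "power switch") = 0.1
CS(大学 "university", 電源 "power", 電源スイッチ "power switch") = 0.1
$$\mathrm{CS}(P, IP) = \sum_{b} \min\{\,P(P, b),\ P(IP, b)\,\} \qquad \text{Equation (4)}$$
CS(入れる "turn on", 入れる "turn on", 投入する "switch on") = 1.0
CS(付ける "put on", 入れる "turn on", 投入する "switch on") = 0.7
CS(落ちる "go off", 入れる "turn on", 投入する "switch on") = 0.2
CS(安定する "stabilize", 入れる "turn on", 投入する "switch on") = 0.2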
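Equation (4) has the form of a histogram intersection. The sketch below evaluates its two-argument form, assuming that P(w, b) denotes the probability that word w belongs to concept class b. That reading of the symbols is an assumption, since the surrounding definitions are not reproduced here. The class names and probabilities are made up for illustration, and the three-argument values listed above are not derived by this code.

```python
def concept_similarity(p_word, p_input):
    """Equation (4): sum over classes b of min(P(word, b), P(input, b)).

    Each argument maps a concept class name to a membership probability.
    """
    classes = p_word.keys() | p_input.keys()
    return sum(min(p_word.get(b, 0.0), p_input.get(b, 0.0)) for b in classes)


# Hypothetical class-membership probabilities (not taken from the source).
p_button = {"device": 0.6, "clothing": 0.4}
p_power = {"device": 1.0}
p_school = {"institution": 0.9, "device": 0.1}

for name, dist in [("button", p_button), ("school", p_school), ("power", p_power)]:
    print(name, concept_similarity(dist, p_power))
```

The measure is symmetric, equals 1.0 when two words have identical class distributions that sum to 1, and equals 0.0 when they share no class.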
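2 Storage device
3 Input device
4 Output device
10 Appearance frequency calculation means
11 Appearance frequency correction means
12 Inter-predicate similarity calculation means
13 Inter-nominal similarity calculation means
14 Synonym determination means
20 Document storage unit
21 Appearance frequency storage unit
22 Concept class storage unit
23 Corrected appearance frequency storage unit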
Claims (7)
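- 1. A synonymous expression determination device comprising:
  synonym determination means that receives as input a pair of binary relations, each composed of a nominal (体言) and a predicate (用言), and determines whether the input pair of binary relations is synonymous by using the similarity between the input nominals and the similarity between the input predicates; and
  inter-predicate similarity calculation means that, when calculating the similarity between the input predicates based on the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, performs the calculation using the distribution of only those nominals that are used in the same kind of concept as the input nominals.
- 2. A synonymous expression determination device comprising:
  synonym determination means that receives as input a pair of binary relations, each composed of a nominal and a predicate, and determines whether the input pair of binary relations is synonymous by using the similarity between the input nominals and the similarity between the input predicates;
  inter-predicate similarity calculation means that, when calculating the similarity between the input predicates based on the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, performs the calculation using the distribution of only those nominals that are used in the same kind of concept as the input nominals; and
  inter-nominal similarity calculation means that, when calculating the similarity between the input nominals based on the distribution of appearance frequencies of the predicates that are in a binary relation with the input nominals in the document set, performs the calculation using the distribution of only those predicates that are used in the same kind of concept as the input predicates.
- 3. A synonymous expression determination device comprising:
  synonym determination means that receives as input a pair of binary relations, each composed of a nominal and a predicate, and determines whether the input pair of binary relations is synonymous by using the similarity between the input nominals and the similarity between the input predicates;
  concept class storage means that stores the types of concept classes to which predicates or nominals belong;
  appearance frequency correction means that obtains the degree to which a predicate or nominal included in a document set is used in the same concept as the input predicate or input nominal, by referring to the concept class types stored in the concept class storage means, and corrects the appearance frequency of the binary relations included in the document set according to that degree;
  inter-predicate similarity calculation means that takes the corrected appearance frequency, or the distribution of corrected appearance frequencies, of the nominals in a binary relation with each input predicate in the document set as the feature quantity of that input predicate, and calculates the degree to which the feature quantities of the input predicates are similar; and
  inter-nominal similarity calculation means that takes the corrected appearance frequency, or the distribution of corrected appearance frequencies, of the predicates in a binary relation with each input nominal in the document set as the feature quantity of that input nominal, and calculates the degree to which the feature quantities of the input nominals are similar.
- 4. The synonymous expression determination device according to any one of claims 1 to 3, wherein the nominal constituting a binary relation also includes a particle that is in a case relation with the predicate.
- 5. The synonymous expression determination device according to any one of claims 1 to 4, wherein the synonym determination means determines that the input pair of binary relations is synonymous when the similarity between the input nominals and the similarity between the input predicates satisfy a predetermined condition.
- 6. A synonymous expression determination method comprising:
  receiving as input a pair of binary relations, each composed of a nominal and a predicate, and determining whether the input pair of binary relations is synonymous by using the similarity between the input nominals and the similarity between the input predicates; and
  when calculating the similarity between the input predicates based on the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, performing the calculation using the distribution of only those nominals that are used in the same kind of concept as the input nominals.
- 7. A synonymous expression determination program for causing a computer to execute:
  synonym determination processing that receives as input a pair of binary relations, each composed of a nominal and a predicate, and determines whether the input pair of binary relations is synonymous by using the similarity between the input nominals and the similarity between the input predicates; and
  inter-predicate similarity calculation processing that, when calculating the similarity between the input predicates based on the distribution of appearance frequencies of the nominals that are in a binary relation with the input predicates in a document set, performs the calculation using the distribution of only those nominals that are used in the same kind of concept as the input nominals.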
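The frequency-correction variant of claim 3 can be illustrated as follows. This is only a sketch under assumptions: the claim says the appearance frequency is corrected "according to the degree" but does not say how, so multiplying the raw frequency by the degree is a guess, and the normalised-overlap similarity is likewise a stand-in for whatever measure the device uses. The degrees reuse the CS values listed in the description (1.0, 0.6, 0.1, 0.1), keyed by English glosses; the raw frequencies are hypothetical.

```python
from collections import Counter


def correct_frequencies(cooccurrence, degree):
    """Scale each raw frequency by the degree to which the co-occurring
    word is used in the same concept as the input word (assumed: product)."""
    return Counter({w: f * degree.get(w, 0.0) for w, f in cooccurrence.items()})


def overlap_similarity(a, b):
    """Shared mass of two normalised frequency distributions (assumed measure)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if not total_a or not total_b:
        return 0.0
    return sum(min(a[w] / total_a, b[w] / total_b) for w in a.keys() & b.keys())


# Degrees for nominals, taken from the CS values in the description.
degree = {"power switch": 1.0, "button": 0.6, "school": 0.1, "university": 0.1}

# Hypothetical raw counts of nominals co-occurring with each input predicate.
raw_a = {"power switch": 5, "button": 10, "school": 30}
raw_b = {"power switch": 8, "button": 5, "university": 20}

corrected = overlap_similarity(correct_frequencies(raw_a, degree),
                               correct_frequencies(raw_b, degree))
uncorrected = overlap_similarity(Counter(raw_a), Counter(raw_b))
print(corrected, uncorrected)
```

With these numbers the corrected similarity is higher than the uncorrected one, because nominals from unrelated concepts contribute little after the correction.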
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201280022780.9A CN103562907B (zh) | 2011-05-10 | 2012-05-09 | 用于评估同义表达的设备、方法和程序 |
SG2013080577A SG194709A1 (en) | 2011-05-10 | 2012-05-09 | Device, method and program for assessing synonymous expressions |
JP2012548252A JP5234232B2 (ja) | 2011-05-10 | 2012-05-09 | 同義表現判定装置、方法及びプログラム |
US14/117,297 US9262402B2 (en) | 2011-05-10 | 2012-05-09 | Device, method and program for assessing synonymous expressions |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011105589 | 2011-05-10 | ||
JP2011-105589 | 2011-05-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012153524A1 (ja) | 2012-11-15 |
Family
ID=47139012
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/003023 WO2012153524A1 (ja) | 2011-05-10 | 2012-05-09 | 同義表現判定装置、方法及びプログラム |
Country Status (5)
Country | Link |
---|---|
US (1) | US9262402B2 (ja) |
JP (1) | JP5234232B2 (ja) |
CN (1) | CN103562907B (ja) |
SG (1) | SG194709A1 (ja) |
WO (1) | WO2012153524A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014119988A (ja) * | 2012-12-17 | 2014-06-30 | Nippon Telegr & Teleph Corp <Ntt> | 同義判定装置、同義学習装置、及びプログラム |
JP2016021136A (ja) * | 2014-07-14 | 2016-02-04 | 株式会社東芝 | 類義語辞書作成装置 |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108140019B (zh) * | 2015-10-09 | 2021-05-11 | 三菱电机株式会社 | 语言模型生成装置、语言模型生成方法以及记录介质 |
CN106777283B (zh) * | 2016-12-29 | 2021-02-26 | 北京奇虎科技有限公司 | 一种同义词的挖掘方法及装置 |
CN107818081A (zh) * | 2017-09-25 | 2018-03-20 | 沈阳航空航天大学 | 基于深度语义模型与语义角色标注的句子相似度评估方法 |
CN110442760B (zh) * | 2019-07-24 | 2022-02-15 | 银江技术股份有限公司 | 一种问答检索***的同义词挖掘方法及装置 |
CN111241124B (zh) * | 2020-01-07 | 2023-10-03 | 百度在线网络技术(北京)有限公司 | 一种需求模型构建方法、装置、电子设备和介质 |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5418716A (en) * | 1990-07-26 | 1995-05-23 | Nec Corporation | System for recognizing sentence patterns and a system for recognizing sentence patterns and grammatical cases |
US6675159B1 (en) * | 2000-07-27 | 2004-01-06 | Science Applic Int Corp | Concept-based search and retrieval system |
NO316480B1 (no) * | 2001-11-15 | 2004-01-26 | Forinnova As | Fremgangsmåte og system for tekstuell granskning og oppdagelse |
US20050071150A1 (en) * | 2002-05-28 | 2005-03-31 | Nasypny Vladimir Vladimirovich | Method for synthesizing a self-learning system for extraction of knowledge from textual documents for use in search |
CA2536265C (en) * | 2003-08-21 | 2012-11-13 | Idilia Inc. | System and method for processing a query |
WO2006119578A1 (en) * | 2005-05-13 | 2006-11-16 | Curtin University Of Technology | Comparing text based documents |
US20070073533A1 (en) * | 2005-09-23 | 2007-03-29 | Fuji Xerox Co., Ltd. | Systems and methods for structural indexing of natural language text |
CN101595474B (zh) * | 2007-01-04 | 2012-07-11 | 思解私人有限公司 | 语言分析 |
US8374844B2 (en) * | 2007-06-22 | 2013-02-12 | Xerox Corporation | Hybrid system for named entity resolution |
US8674462B2 (en) | 2007-07-25 | 2014-03-18 | Infineon Technologies Ag | Sensor package |
WO2009026140A2 (en) * | 2007-08-16 | 2009-02-26 | Hollingsworth William A | Automatic text skimming using lexical chains |
US8868562B2 (en) * | 2007-08-31 | 2014-10-21 | Microsoft Corporation | Identification of semantic relationships within reported speech |
US8594996B2 (en) * | 2007-10-17 | 2013-11-26 | Evri Inc. | NLP-based entity recognition and disambiguation |
WO2009051068A1 (ja) * | 2007-10-19 | 2009-04-23 | Nec Corporation | 文書分析方法、文書分析システム及び文書分析用プログラム |
US20090326925A1 (en) * | 2008-06-27 | 2009-12-31 | Microsoft Corporation | Projecting syntactic information using a bottom-up pattern matching algorithm |
US20090326924A1 (en) * | 2008-06-27 | 2009-12-31 | Microsoft Corporation | Projecting Semantic Information from a Language Independent Syntactic Model |
CN101894102A (zh) | 2010-07-16 | 2010-11-24 | 浙江工商大学 | 一种主观性文本情感倾向性分析方法和装置 |
- 2012
- 2012-05-09 SG SG2013080577A patent/SG194709A1/en unknown
- 2012-05-09 US US14/117,297 patent/US9262402B2/en active Active
- 2012-05-09 WO PCT/JP2012/003023 patent/WO2012153524A1/ja active Application Filing
- 2012-05-09 CN CN201280022780.9A patent/CN103562907B/zh active Active
- 2012-05-09 JP JP2012548252A patent/JP5234232B2/ja active Active
Non-Patent Citations (2)
Title |
---|
CHIKARA HASHIMOTO ET AL.: "Web-jo no Teigibun kara no Iikae Chishiki Kakutoku", THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING DAI 17 KAI NENJI TAIKAI HAPPYO RONBUNSHU, 7 March 2011 (2011-03-07), pages 748 - 751 * |
RYO NISHIMURA ET AL.: "Mailing List ni Toko sareta Mail o Riyo shite Aimai na Shitsumon ni Toikaesu Shitsumon Oto System no Sakusei", THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING DAI 13 KAI NENJI TAIKAI HAPPYO RONBUNSHU, 19 March 2007 (2007-03-19), pages 1164 - 1167 * |
Also Published As
Publication number | Publication date |
---|---|
SG194709A1 (en) | 2013-12-30 |
US9262402B2 (en) | 2016-02-16 |
JPWO2012153524A1 (ja) | 2014-07-31 |
US20140343922A1 (en) | 2014-11-20 |
JP5234232B2 (ja) | 2013-07-10 |
CN103562907A (zh) | 2014-02-05 |
CN103562907B (zh) | 2016-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5234232B2 (ja) | 同義表現判定装置、方法及びプログラム | |
US10262062B2 (en) | Natural language system question classifier, semantic representations, and logical form templates | |
US7890539B2 (en) | Semantic matching using predicate-argument structure | |
EP3179384A1 (en) | Method and device for parsing interrogative sentence in knowledge base | |
US20150227505A1 (en) | Word meaning relationship extraction device | |
KR101573854B1 (ko) | 관계어 기반 확률추정 방법을 이용한 통계적 문맥의존 철자오류 교정 장치 및 방법 | |
US20180082680A1 (en) | Syntactic re-ranking of potential transcriptions during automatic speech recognition | |
KR101495240B1 (ko) | 교정 어휘 쌍을 이용한 통계적 문맥 철자오류 교정 장치 및 방법 | |
KR101627428B1 (ko) | 딥 러닝을 이용하는 구문 분석 모델 구축 방법 및 이를 수행하는 장치 | |
US20220245353A1 (en) | System and method for entity labeling in a natural language understanding (nlu) framework | |
US20240028650A1 (en) | Method, apparatus, and computer-readable medium for determining a data domain associated with data | |
Toral et al. | Linguistically-augmented perplexity-based data selection for language models | |
US20220237383A1 (en) | Concept system for a natural language understanding (nlu) framework | |
Yuwana et al. | On part of speech tagger for Indonesian language | |
Rasooli et al. | Unsupervised morphology-based vocabulary expansion | |
JP2011065380A (ja) | 意見分類装置およびプログラム | |
Channell et al. | Automated grammatical tagging of child language samples | |
US10296585B2 (en) | Assisted free form decision definition using rules vocabulary | |
US20220229986A1 (en) | System and method for compiling and using taxonomy lookup sources in a natural language understanding (nlu) framework | |
US20220229990A1 (en) | System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework | |
US20220245352A1 (en) | Ensemble scoring system for a natural language understanding (nlu) framework | |
US20220229998A1 (en) | Lookup source framework for a natural language understanding (nlu) framework | |
US20220229987A1 (en) | System and method for repository-aware natural language understanding (nlu) using a lookup source framework | |
Sheng et al. | EDMSpell: Incorporating the error discriminator mechanism into chinese spelling correction for the overcorrection problem | |
WO2018025317A1 (ja) | 自然言語処理装置及び自然言語処理方法 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2012548252; Country of ref document: JP; Kind code of ref document: A |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 12782584; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | WIPO information: entry into national phase | Ref document number: 14117297; Country of ref document: US |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 12782584; Country of ref document: EP; Kind code of ref document: A1 |