WO2006137516A1 - Binary relation extracting device - Google Patents

Binary relation extracting device

Info

Publication number
WO2006137516A1
WO2006137516A1 PCT/JP2006/312592 JP2006312592W
Authority
WO
WIPO (PCT)
Prior art keywords
solution
binary relation
feature
candidate
extracted
Prior art date
Application number
PCT/JP2006/312592
Other languages
French (fr)
Japanese (ja)
Inventor
Masaki Murata
Tomohiro Mitsumori
Kouichi Doi
Yasushi Fukuda
Original Assignee
National Institute Of Information And Communications Technology
National University Corporation NARA Institute of Science and Technology
Priority date
Filing date
Publication date
Application filed by National Institute Of Information And Communications Technology, National University Corporation NARA Institute of Science and Technology filed Critical National Institute Of Information And Communications Technology
Publication of WO2006137516A1 publication Critical patent/WO2006137516A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Definitions

  • Binary relation extraction device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing
  • The present invention relates to a binary relation extraction technique for extracting pairs of expressions (words, character strings, etc.) that stand in a binary relation from text data using supervised machine learning processing, and to information retrieval technology that uses the binary relation extraction processing.
  • In Non-Patent Document 1, pattern frames for extracting the target information are defined using the predicate-argument structures produced by parsing; patterns are extracted from a corpus annotated with correct answers, inappropriate patterns are removed, and information is then extracted using the selected patterns.
  • Non-Patent Document 1: Akane Yakushiji et al., "Medical/biological information extraction using predicate-argument structure patterns", 11th Annual Meeting of the Association for Natural Language Processing, March 2005
  • In Non-Patent Document 1, in order to improve the accuracy of the patterns, the patterns are selected against the learning corpus, which improves the accuracy of the binary relation extraction process.
  • An object of the present invention is to provide a technique applicable to any problem of extracting binary relations from text data, and to provide a binary relation extraction device that can extract binary relations with high performance even for complex problems.
  • Another object of the present invention is to provide an information retrieval device using the binary relation extraction processing, the processing methods executed by these devices, and programs for causing a computer to function as these devices.
  • The present invention is a binary relation extraction processing device that uses machine learning processing to extract binary relations appearing in sentence data stored in a computer-readable storage device.
  • The device comprises: 1) teacher data storage means for storing teacher data consisting of cases that pair a problem and its solution, where the problem is a binary relation appearing in the sentence data and the solution indicates whether it should be extracted; 2) solution-feature pair extraction means for taking cases from the teacher data storage means, extracting predetermined information as features for each case, and generating pairs of the solution and the set of extracted features; 3) machine learning means for learning, by machine learning processing over the pairs of solution and feature set, what kind of solution results from what kind of feature set, and storing this as learning result information in the learning result storage means; 4) candidate extraction means for extracting binary relation elements from the text data stored in the storage device, extracting pairs composed of those elements, and treating each extracted pair as a binary relation candidate; 5) feature extraction means for extracting the predetermined information as features for each binary relation candidate; 6) solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of a binary relation candidate is likely to yield the solution; and 7) binary relation extraction means for selecting, as binary relations to be extracted, the candidates whose estimated likelihood exceeds a predetermined level.
  • In the present invention, teacher data, including cases in which solution information indicating whether a binary relation appearing in the sentence data should be extracted is attached to that binary relation, is stored in a teacher data storage means. Cases are then taken from the teacher data storage means by the solution-feature pair extraction means; for each case, the predetermined information is extracted as features, and a pair of the extracted feature set and the solution is generated.
  • The machine learning means performs machine learning processing, based on a predetermined machine learning algorithm, to determine what kind of solution results from what kind of feature set, and stores information indicating "what kind of solution is obtained for what kind of feature set" in the learning result storage means as learning result information.
  • The candidate extraction means extracts binary relation elements from the text data stored in the storage device, extracts pairs composed of those elements, and treats each extracted pair as a binary relation candidate. The feature extraction means extracts the predetermined information as features for each binary relation candidate, by the same extraction process as that performed by the solution-feature pair extraction means. Then, based on the learning result information stored in the learning result storage means, the solution estimation means estimates the degree to which the feature set of each candidate is likely to yield the solution.
  • The binary relation extraction means selects a binary relation candidate as a binary relation to be extracted when, according to the estimation result, the candidate's likelihood of being the solution exceeds a predetermined level.
  • Further, the present invention is an information retrieval device that, in an information retrieval process using a plurality of search keywords, extracts search results using the result of binary relation extraction processing based on supervised machine learning.
  • The device comprises: 1) teacher data storage means for storing teacher data consisting of cases that pair a problem and its solution, where the problem is a binary relation whose elements are search keywords and the solution indicates whether it should be extracted; 2) solution-feature pair extraction means for taking cases from the teacher data storage means, extracting predetermined information as features for each case, and generating pairs of the solution and the extracted feature set; 3) machine learning means for learning, based on a predetermined machine learning algorithm, what kind of solution results from what kind of feature set, and storing information indicating "what kind of solution is obtained for what kind of feature set" in the learning result storage means; 4) candidate extraction means for generating, from a plurality of input search keywords, search keyword pairs as binary relation candidates; 5) feature extraction means for extracting the predetermined information as features for each binary relation candidate, by the same extraction process as that performed by the solution-feature pair extraction means; 6) solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of each binary relation candidate is likely to yield the solution; and 7) search result extraction means for selecting a candidate as a binary relation to be extracted when its likelihood of being the solution exceeds a predetermined level, and extracting text data containing the selected binary relation as a search result.
  • In the present invention, teacher data, including cases in which solution information indicating whether a binary relation whose elements are search keywords should be extracted is attached to that binary relation, is stored in a teacher data storage means.
  • The solution-feature pair extraction means takes cases from the teacher data storage means, extracts predetermined information as features for each case, and generates pairs of the extracted feature set and the solution.
  • The machine learning means performs machine learning processing, based on a predetermined machine learning algorithm, on the pairs of solution and feature set, and stores the processed information in the learning result storage means as learning result information indicating "what kind of solution is obtained for what kind of feature set".
  • A binary relation candidate whose likelihood of being the solution exceeds a predetermined level is selected as a binary relation to be extracted, and text data that includes the selected binary relation is extracted as a search result.
  • Further, the present invention relates to the binary relation extraction processing method and the information retrieval processing method using the binary relation extraction processing realized by the above binary relation extraction device and information retrieval device, respectively.
  • The present invention also provides a binary relation extraction processing program for causing a computer to execute the processing steps of the binary relation extraction processing method, and an information retrieval processing program using the binary relation extraction processing method.
  • According to the present invention, by performing machine learning using, as learning data, text data manually tagged to indicate whether each binary relation should be extracted, it can be determined, when a new binary relation candidate is given, whether that candidate is a binary relation to be extracted. For example, by using as learning data "pairs of interacting protein names" tagged to indicate whether they should be extracted, desired "interacting proteins" can be obtained from a text database.
  • Similarly, for the two search keywords of an AND search in an information retrieval process, "search keyword pairs" tagged to indicate whether the keywords have a meaningful relationship in the retrieved documents can be used as learning data.
  • FIG. 1 is a diagram showing a configuration example of a binary relation extraction device according to the present invention.
  • FIG. 2 is a diagram showing a processing flow of a binary relation extraction device.
  • FIG. 3 is a diagram showing an example of teacher data.
  • FIG. 4 is a diagram showing the concept of margin maximization in the support vector machine method.
  • FIG. 5 is a diagram showing an example of a pair of a solution and the set of features of the binary relation shown in FIG. 3.
  • FIG. 6 is a diagram showing a configuration example of an information search device according to the present invention.
  • FIG. 7 is a diagram showing a flow of processing of the information search device.
  • FIG. 8 is a diagram showing an example of a set of teacher data and a set of features of the binary relation.
  • FIG. 9 is a diagram showing an example of a set of teacher data and a set of features of the binary relation.
  • FIG. 10 is a diagram showing an example of a set of teacher data and a set of features of the binary relation.
  • Explanation of symbols
  • The binary relation extraction device 1 is a processing device that performs machine learning, using teacher data, i.e., text data tagged to indicate whether each binary relation should be extracted, to learn which pairs of terms should be extracted, and that extracts the binary relations 3 to be extracted from given text data 2.
  • FIG. 1 shows a configuration example of a binary relation extraction apparatus 1 according to the present invention.
  • The binary relation extraction device 1 includes a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result storage unit 14, a candidate extraction unit 15, a feature extraction unit 16, a solution estimation unit 17, and a binary relation extraction unit 18.
  • the teacher data storage unit 11 is means for storing text data that is teacher data used in the machine learning process.
  • As the teacher data, cases are used in which the elements of a binary relation appearing in a sentence of the text data (one element is referred to as the first element and the other as the second element) are annotated with a solution indicating whether they form a binary relation to be extracted. Specifically, for a sentence containing two or more binary relation elements, each pair of binary relation elements in that sentence is given a tag indicating its solution: either a pair that should be extracted (positive example) or a pair that should not be extracted (negative example). If three or more binary relation elements are contained in a sentence, a tag is assigned to every pair in the combinations of those elements. As teacher data, data in which only the solution indicating the pairs to be extracted (positive examples) is given may also be used.
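As a minimal sketch of what such tagged cases might look like in memory (the sentence, element names, and tags here are illustrative, not from the patent's actual teacher data):

```python
from itertools import combinations

# Hypothetical in-memory form of one tagged training sentence: the binary
# relation elements are listed, and each element pair carries a solution tag.
sentence = "delta-catenin interacts with presenilin 1 but not with GFAP."
elements = ["delta-catenin", "presenilin 1", "GFAP"]

# With three elements in one sentence, every 2-combination becomes a case.
cases = {pair: None for pair in combinations(elements, 2)}
cases[("delta-catenin", "presenilin 1")] = "positive"  # pair to extract
cases[("delta-catenin", "GFAP")] = "negative"          # pair not to extract
cases[("presenilin 1", "GFAP")] = "negative"

assert len(cases) == 3  # n(n-1)/2 pairs for n = 3 elements
```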
  • the solution-feature pair extraction unit 12 is a processing unit that extracts a set of a solution and a set of features from cases in the text data stored in the teacher data storage unit 11.
  • a feature is information used in machine learning processing.
  • The solution-feature pair extraction unit 12 extracts as features, for example, the binary relation elements themselves, the words/characters appearing around the elements together with their positions and order, part-of-speech information of the elements and surrounding words, morphological analysis information, parsing information, the distance between the elements, and the presence or absence of other binary relation elements between the elements.
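The surface-level part of that feature set can be sketched as below (POS, morphological, and parse features would come from an external analyzer and are omitted; the function name and feature encoding are assumptions for illustration):

```python
def extract_features(tokens, i1, i2, element_idx, window=3):
    """Sketch of surface features: the two elements, surrounding words,
    their distance, and whether another element lies between them."""
    feats = set()
    feats.add("e1=" + tokens[i1])
    feats.add("e2=" + tokens[i2])
    for w in tokens[max(0, i1 - window):i1]:   # words before the 1st element
        feats.add("before_e1=" + w)
    for w in tokens[i2 + 1:i2 + 1 + window]:   # words after the 2nd element
        feats.add("after_e2=" + w)
    for w in tokens[i1 + 1:i2]:                # words between the elements
        feats.add("between=" + w)
    feats.add("dist=" + str(i2 - i1))          # appearance distance
    feats.add("other_elem_between=" +
              str(any(i1 < j < i2 for j in element_idx)))
    return feats

tokens = "delta-catenin interacts with presenilin".split()
f = extract_features(tokens, 0, 3, element_idx={0, 3})
```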
  • The machine learning unit 13 is a processing means that learns, by a supervised machine learning method, what kind of solution is likely to result from what kind of feature set, using the pairs of solution and feature set extracted by the solution-feature pair extraction unit 12.
  • the learning result is stored in the learning result storage unit 14.
  • the feature extraction unit 16 is a processing unit that extracts a predetermined feature for a binary relation candidate extracted from the text data 2.
  • The solution estimation unit 17 is a processing means that refers to the learning result in the learning result storage unit 14 and estimates, for each binary relation candidate, what kind of solution (classification destination) its feature set is likely to have, and to what degree.
  • The binary relation extraction unit 18 is a processing means that outputs, as binary relations 3, those candidates estimated with a high degree to have the solution indicating a binary relation that should be extracted.
  • FIG. 2 shows a processing flow of the binary relation extraction apparatus 1.
  • The teacher data storage unit 11 of the binary relation extraction device 1 stores, as teacher data, text data 2 containing cases in which binary relations, i.e., pairs of elements having a certain meaning, are given "solution" information indicating whether they are binary relations that should be extracted (positive) or binary relations that should not be extracted (negative).
  • Alternatively, text data 2 in which the predetermined solution is given only to the pairs to be extracted may be stored.
  • In this case, a pair of text data 2 to which a solution is given is treated as having been given the (positive) solution indicating a binary relation to be extracted, while a pair to which no solution is given is treated as having been given the (negative) solution indicating a binary relation that should not be extracted.
  • The solution-feature pair extraction unit 12 extracts predetermined features for each case from the teacher data in the teacher data storage unit 11, and generates pairs of the solution (the information given by the tag) and the set of extracted features (step S1).
  • Specifically, the solution-feature pair extraction unit 12 extracts the binary relations from the text data serving as teacher data using the predetermined tags, and, for the extracted binary relation elements, extracts the predetermined features by morphological analysis processing, parsing processing, and calculation of the elements' positions and the distance between elements.
  • The machine learning unit 13 learns, by machine learning processing, what kind of solution (positive or negative) results from the pairs of solution and feature set generated by the solution-feature pair extraction unit 12, and stores the learning result in the learning result storage unit 14 (step S2).
  • the machine learning unit 13 uses, as a supervised machine learning method, for example, a machine learning process using any one of the methods such as the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, and the support vector machine method. I do.
  • the candidate extraction unit 15 inputs the text data 2 from which the binary relation is to be extracted, and extracts the binary relation candidate from the input text data 2 (step S3).
  • The candidate extraction unit 15 divides the text data into sentences, treats as processing targets only sentences in which two or more binary relation elements appear, and extracts binary relation candidates from those sentences.
  • the feature extraction unit 16 extracts features for each binary relation candidate extracted from the text data 2 by processing similar to the processing in the solution-feature pair extraction unit 12 (step S4).
  • The solution estimation unit 17 estimates, for each candidate, what kind of solution its feature set is likely to have, that is, the degree to which the candidate is "likely to be positive" or "likely to be negative", based on the learning result in the learning result storage unit 14 (step S5).
  • The binary relation extraction unit 18 outputs, as binary relations 3 to be extracted, those candidates estimated to be "likely positive" with a degree better than a predetermined level (step S6).
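The flow of steps S1 through S6 can be sketched end to end as follows. The learner here is a toy feature-count scorer standing in for the SVM, Bayes, or decision-list learners the patent actually names; the function names, data, and threshold are illustrative assumptions:

```python
def train(cases, extract_feats):                             # steps S1-S2
    """Build 'learning result information' as per-feature counts of how
    often each feature co-occurs with positive vs. negative solutions."""
    counts = {}
    for text, sol in cases:
        for f in extract_feats(text):
            pos, neg = counts.get(f, (0, 0))
            counts[f] = (pos + (sol == "positive"), neg + (sol == "negative"))
    return counts

def classify(candidate, counts, extract_feats, threshold=0.0):  # steps S3-S6
    """Score a candidate's feature set and keep it if the degree of being
    'positive' is better than the predetermined level (threshold)."""
    score = 0
    for f in extract_feats(candidate):
        pos, neg = counts.get(f, (0, 0))
        score += pos - neg
    return "positive" if score > threshold else "negative"

cases = [("interacts with", "positive"), ("located near", "negative")]
feats = lambda s: set(s.split())
model = train(cases, feats)
```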
  • In this example, the binary relation extraction device 1 extracts binary relations of protein expressions (protein names) that interact with each other from a text database of biomedical papers. Here it is assumed that the protein expressions are identified with 100% accuracy.
  • In the process of creating the teacher data, when named-entity expressions such as protein expressions, disease names, and treatment methods are extracted as binary relation elements, the expressions that are binary relation elements are, for example, extracted based on tags.
  • FIG. 3 shows an example of teacher data.
  • English text data including binary relations with interacting protein expressions as elements is used as teacher data.
  • In the teacher data, a tag indicating the solution (correct/positive) is attached only to the binary relations to be extracted; that is, teacher data containing only positive cases is used in the machine learning process.
  • Fig. 3 (B) shows an example of tags attached to teacher data.
  • The teacher data includes two binary relation pairs P1 and P2.
  • The binary relation (pair) P1 consists of the first element p1 "delta-catenin" and the second element p2 "presenilin 1". The binary relation (pair) P2 consists of the first element p1 "presenilin (PS) 1" and the second element p2 "delta-catenin".
  • The solution-feature pair extraction unit 12 extracts pairs of a solution and a set of features from the cases in the text data stored in the teacher data storage unit 11, extracting, for example, the following information.
  • As words or characters appearing around the binary relation elements: for example, a predetermined number of words/characters before the first element, a predetermined number of words/characters after the second element, and a predetermined number of words/characters between the first and second elements.
  • Part-of-speech information is acquired using an existing morphological analysis processing method such as the morphological analysis system "ChaSen" (see: http://chasen.aist-nara.ac.jp/index.html.ja).
  • Part-of-speech information for English text data can be obtained, for example, with the tagger described in "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" (Eric Brill, Computational Linguistics, Vol. 21, No. 4, pp. 543-565, 1995).
  • The solution-feature pair extraction unit 12 extracts features from cases of tagged teacher data as shown in Fig. 3(B), and generates pairs of a feature set and a solution. For example, for the binary relation P2, a pair of the solution (positive) and the following feature set is generated, as shown in Fig. 5.
  • Based on these pairs of solution and feature set, the machine learning unit 13 performs machine learning processing to learn what kind of feature set is likely to be positive, and stores the learning result in the learning result storage unit 14.
  • The machine learning unit 13 uses, as a supervised machine learning method, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
  • the k-nearest neighbor method is a technique that uses the most similar k cases instead of the most similar case, and obtains a classification destination (solution) by majority decision of these k cases.
  • k is a predetermined integer number, and generally an odd number between 1 and 9 is used.
  • The simple Bayes method estimates the probability of each classification based on Bayes' theorem and takes the classification with the largest probability value as the classification destination.
  • Here, the context b is a set of predefined features f_j (∈ F, 1 ≤ j ≤ k). p(b) is the appearance probability of context b; since it does not depend on the classification a, it is a constant and is not calculated. P̃(a) and P̃(f_j | a) (where P̃ denotes p with a tilde) are probabilities estimated from the teacher data, meaning the appearance probability of classification a and the probability that a case of classification a has the feature f_j, respectively. If the value obtained by maximum likelihood estimation is used as P̃(f_j | a), the value often becomes zero, and it may be impossible to determine the classification destination because the value of Equation (2) becomes zero. Therefore, smoothing is performed; here, smoothing using the following Equation (3) was used.
  • freq(f_j, a) means the number of cases having the feature f_j and classification a, and freq(a) means the number of cases having classification a.
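Since the equation images did not survive in this text, the standard form consistent with the description is p(a | b) ∝ p(a) · ∏_j p(f_j | a), with smoothing applied to the conditional estimates. A minimal sketch, with additive (Laplace) smoothing standing in for the patent's unspecified smoothing formula (3):

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Count-based simple Bayes training; examples are (feature_set, class)."""
    cls_freq = Counter()
    feat_freq = defaultdict(Counter)   # feat_freq[cls][feature]
    vocab = set()
    for feats, cls in examples:
        cls_freq[cls] += 1
        for f in feats:
            feat_freq[cls][f] += 1
            vocab.add(f)
    return cls_freq, feat_freq, vocab

def classify_nb(feats, cls_freq, feat_freq, vocab, alpha=1.0):
    """argmax_a p(a) * prod_j p(f_j | a), smoothed so no factor is zero."""
    n = sum(cls_freq.values())
    best, best_lp = None, -math.inf
    for cls, ca in cls_freq.items():
        lp = math.log(ca / n)
        denom = sum(feat_freq[cls].values()) + alpha * len(vocab)
        for f in feats:
            lp += math.log((feat_freq[cls][f] + alpha) / denom)
        if lp > best_lp:
            best, best_lp = cls, lp
    return best

examples = [({"interacts", "with"}, "positive"), ({"near"}, "negative")]
model = train_nb(examples)
```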
  • In the decision list method, pairs of a feature and a classification destination are used as rules, and these are stored in a list in a predetermined priority order. When input data to be classified is given, the input is compared with the rule features in the order determined in the list, and the classification destination of the first rule whose feature matches is taken as the classification destination of the input.
  • In the decision list method, the probability value of each classification is obtained using only one of the predefined features f_j (∈ F, 1 ≤ j ≤ k).
  • The probability of outputting classification a in a context b is given by P̃(a | f_j) (where P̃ denotes p with a tilde), the rate of occurrence of classification a when the context has the feature f_j.
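A minimal sketch of such a decision list, where rules are ordered by the estimated conditional probability p(class | feature) and the first matching rule decides (the helper names and default class are assumptions for illustration):

```python
from collections import Counter, defaultdict

def build_decision_list(examples):
    """One rule per (feature -> class), ordered by p(class | feature)."""
    feat_cls = defaultdict(Counter)
    for feats, cls in examples:
        for f in feats:
            feat_cls[f][cls] += 1
    rules = []
    for f, ctr in feat_cls.items():
        cls, n = ctr.most_common(1)[0]
        rules.append((n / sum(ctr.values()), f, cls))
    rules.sort(reverse=True)           # highest-probability rule first
    return [(f, cls) for _, f, cls in rules]

def apply_decision_list(feats, rules, default="negative"):
    for f, cls in rules:               # first matching rule decides
        if f in feats:
            return cls
    return default

rules = build_decision_list(
    [({"interacts"}, "positive"), ({"near"}, "negative")])
```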
  • The maximum entropy method obtains the probability distribution p(a, b) that maximizes an expression representing entropy (Equation (7)) while satisfying the following constraint (Equation (6)), where F is the set of predefined features f_j (1 ≤ j ≤ k); among the probabilities of each classification determined according to this distribution, the classification with the highest probability value is obtained.
  • A and B mean the sets of classifications and contexts, and g_j(a, b) is a function that is 1 if context b has the feature f_j and the classification is a, and 0 otherwise.
  • P̃(a, b) (where P̃ denotes p with a tilde) means the rate of occurrence of (a, b) in the known data.
  • The expected value of the frequency of an output-feature pair is obtained by multiplying the probability p by the function g_j, which indicates the appearance of the output-feature pair.
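The equation images referenced above (Equations (6) and (7)) are not reproduced in this text; a reconstruction of the standard maximum-entropy formulation consistent with the surrounding description is:

```latex
% Reconstruction of the standard maximum-entropy constraint and objective;
% the equation numbers (6)-(7) follow the surrounding text.
\begin{align}
\sum_{a \in A,\, b \in B} p(a,b)\, g_j(a,b)
  &= \sum_{a \in A,\, b \in B} \tilde{p}(a,b)\, g_j(a,b)
  \qquad (1 \le j \le k) \tag{6} \\
H(p) &= -\sum_{a \in A,\, b \in B} p(a,b) \log p(a,b) \tag{7}
\end{align}
```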
  • The support vector machine method is a method that classifies data consisting of two classifications by dividing the space with a hyperplane.
  • FIG. 4 shows the concept of margin maximization in the support vector machine method.
  • In the figure, white circles indicate positive examples, black circles indicate negative examples, solid lines indicate the hyperplane dividing the space, and broken lines indicate the planes representing the boundaries of the margin region.
  • Fig. 4(A) is a conceptual diagram of the case where the interval between positive and negative examples is narrow (small margin), and Fig. 4(B) of the case where the interval is wide (large margin).
  • In an expanded form of the method, the learning data may include a small number of cases inside the margin region, and the linear part of the hyperplane is extended to be non-linear (by the introduction of a kernel function).
  • This extended method is equivalent to classification using the following discriminant function, and the two classifications can be discriminated according to whether the output value of the discriminant function is positive or negative.
  • each ⁇ is the value when maximizing Eq. (9) under the constraints of Eqs. (10) and (11).
  • The function K is called a kernel function, and various functions are used; in this embodiment, the following polynomial kernel is used.
  • C and d are experimentally set constants; C was fixed to 1 throughout the entire process, and two values of d, 1 and 2, were tested.
  • The x_i with α_i > 0 are called support vectors, and the summation in Equation (8) is usually computed using only these cases. In other words, only the cases in the learning data called support vectors are used in the actual analysis.
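The equation images referenced here (Equations (8) through (12)) are likewise missing; a reconstruction of the standard soft-margin SVM formulas that matches the surrounding description (dual objective, box constraints with the constant C, and a polynomial kernel of degree d) is:

```latex
% Reconstruction of the standard SVM dual formulation; the equation
% numbers (8)-(12) follow the surrounding text.
\begin{align}
f(x) &= \operatorname{sgn}\Bigl(\sum_{i} \alpha_i y_i K(x_i, x) + b\Bigr) \tag{8} \\
L(\alpha) &= \sum_{i} \alpha_i
  - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{9} \\
0 &\le \alpha_i \le C \tag{10} \\
\sum_{i} \alpha_i y_i &= 0 \tag{11} \\
K(x_1, x_2) &= (x_1 \cdot x_2 + 1)^d \tag{12}
\end{align}
```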
  • The support vector machine method handles data with two classifications. Therefore, when dealing with cases having three or more classifications, it is usually used in combination with a method such as the pairwise method or the one-vs-rest method.
  • The pairwise method generates, for data with n classifications, all pairs of two different classification destinations (n(n−1)/2 pairs), and for each pair obtains a binary classifier, i.e., a support vector machine processing module, that determines which of the two is better; the final classification destination is obtained by majority decision over the classification destinations produced by the n(n−1)/2 binary classifications.
  • The one-vs-rest method, for example with three classification destinations a, b, and c, forms the three sets "classification destination a vs. the rest", "classification destination b vs. the rest", and "classification destination c vs. the rest", and performs learning processing for each set using the support vector machine method. Using the learning results of the three support vector machines, the classification destination of the classifier whose separating hyperplane is farthest from the candidate is obtained. For example, if a candidate is farthest from the separating hyperplane of the support vector machine created by the learning process of the "classification destination a vs. the rest" set, the candidate's classification destination is taken to be a.
  • The candidate extraction unit 15 extracts binary relation candidates from the input new text data 2. Specifically, text data 2 is divided into sentences, and the expressions (character strings) that are binary relation elements in each sentence are extracted. Then, it is checked whether two or more expressions that are binary relation elements exist in a sentence, and all two-element combinations (pairs) of the binary relation elements in a sentence are generated as binary relation candidates.
  • Alternatively, new text data 2 may be divided into paragraphs, the expressions that are binary relation elements in each paragraph extracted, and, for paragraphs having two or more elements, all two-element combinations (pairs) of elements from the same paragraph generated as binary relation candidates.
  • As the method for extracting expressions that are binary relation elements from text data 2, the methods described above for generating teacher data are used: for example, expressions matching a pattern or dictionary entry are extracted, or expressions estimated based on the learning result of supervised machine learning are extracted.
  • The extracted pairs of elements are determined as binary relation candidates; if three or more elements appear in a sentence, every pair in the combinations of those elements becomes a binary relation candidate.
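The candidate extraction step above can be sketched directly; `find_elements` stands in for the pattern/dictionary/NER element spotter, and the dictionary here is illustrative:

```python
from itertools import combinations

def binary_relation_candidates(sentences, find_elements):
    """Keep only sentences with >= 2 recognized elements and emit every
    element pair as a binary relation candidate, as described above."""
    for sent in sentences:
        elems = find_elements(sent)
        if len(elems) >= 2:
            for pair in combinations(elems, 2):
                yield sent, pair

dictionary = {"delta-catenin", "presenilin", "GFAP"}
spot = lambda s: [w for w in s.split() if w in dictionary]
sents = ["delta-catenin binds presenilin near GFAP", "no elements here"]
cands = list(binary_relation_candidates(sents, spot))
# 3 elements in the first sentence -> 3 candidate pairs; none in the second
```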
  • The feature extraction unit 16 extracts features from the binary relation candidates by the same processing as the solution-feature pair extraction unit 12.
  • The solution estimation unit 17 estimates, for the feature set of each candidate, the likelihood of the positive solution. Based on the estimation result of the solution estimation unit 17, the binary relation extraction unit 18 outputs, as binary relations 3, the candidates estimated with a high degree to be likely to have the positive solution.
  • the above features were extracted and the support vector machine method was used as the machine learning process.
  • the F value is the harmonic mean of recall and precision.
  • The recall is the ratio indicating how many of the binary relations that should be extracted from the text data 2 were actually output.
  • The precision (relevance ratio) is the ratio indicating how many of the binary relations output by the binary relation extraction device 1 are binary relations that should be extracted.
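The three measures defined above can be computed as follows (the function name and the toy extracted/gold pair sets are illustrative):

```python
def evaluate(extracted, gold):
    """Recall, precision, and F-value (their harmonic mean) as defined above."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # correctly output pairs
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(extracted) if extracted else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f

# One of two output pairs is correct, and one of two gold pairs was found.
r, p, f = evaluate({("a", "b"), ("a", "c")}, {("a", "b"), ("b", "c")})
```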
  • The machine learning unit 13 performs machine learning processing, based on a predetermined machine learning algorithm, using the pairs of each binary relation's solution and feature set obtained from the given teacher data, to learn what kind of feature set results in what kind of solution; information indicating this is stored in the learning result storage unit 14 as learning result information, and the solution estimation unit 17 estimates, based on the learning result information, the degree to which the feature set of a binary relation candidate is likely to yield the solution.
  • When the k-nearest neighbour method is used, the machine learning unit 13 defines the similarity between cases of the teacher data based on the ratio of overlapping features (the number of shared features) between their feature sets, and stores the defined similarity measure and the cases in the learning result storage unit 14 as learning result information.
  • The solution estimation unit 17 then refers to the similarity measure and the cases in the learning result storage unit 14, selects the k cases most similar to the binary relation candidate extracted from the text data 2, and estimates the classification determined by majority vote among the selected k cases as the classification (solution) of the binary relation candidate. In other words, the solution estimation unit 17 takes the number of votes obtained by a classification as the degree of likelihood that the candidate's feature set leads to that solution.
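The k-nearest-neighbour estimation above can be sketched as follows. The similarity measure (Jaccard-style overlap ratio) and the feature names are assumptions for illustration; the patent only specifies "the ratio of overlapping features".

```python
# k-NN over feature sets: similarity = ratio of overlapping features,
# solution = majority vote over the k most similar teacher-data cases.
def similarity(features_a, features_b):
    return len(features_a & features_b) / max(len(features_a | features_b), 1)

def knn_estimate(candidate_features, training_cases, k=3):
    # training_cases: list of (feature_set, solution) pairs from teacher data
    ranked = sorted(training_cases,
                    key=lambda case: similarity(candidate_features, case[0]),
                    reverse=True)
    votes = [solution for _, solution in ranked[:k]]
    # the winning classification's vote count is the "degree" of the solution
    return max(set(votes), key=votes.count)

training = [({"dist_small", "order_AB"}, "positive"),
            ({"dist_small", "order_BA"}, "positive"),
            ({"dist_large", "order_AB"}, "negative")]
print(knn_estimate({"dist_small", "order_AB"}, training, k=3))
```

The candidate overlaps most with the two positive cases, so the majority vote returns "positive".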
  • When the naive Bayes method is used, the machine learning unit 13 stores the pairs of each case's solution and feature set from the teacher data in the learning result storage unit 14 as learning result information. Then, when new text data 2 is input, the solution estimation unit 17 uses the stored learning result information and Bayes' theorem to calculate the probability of each classification for the feature set of a binary relation candidate acquired by the feature extraction unit 16, and selects the classification with the highest probability as the classification (solution) of the candidate. In other words, the solution estimation unit 17 takes the probability of each classification, here the probability of "to be extracted", as the degree of likelihood that the candidate's feature set leads to that solution.
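A toy version of the naive Bayes estimation above can be written as follows. The add-one smoothing and the feature names are assumptions added so the sketch runs; the patent only specifies classification via Bayes' theorem over feature sets.

```python
# Naive Bayes over feature sets: P(class) * product of per-feature likelihoods,
# assuming feature independence; the highest-scoring class is the solution.
def naive_bayes(candidate_features, training_cases):
    labels = {label for _, label in training_cases}
    scores = {}
    for label in labels:
        cases = [fs for fs, lab in training_cases if lab == label]
        prior = len(cases) / len(training_cases)
        score = prior
        for feat in candidate_features:
            count = sum(1 for fs in cases if feat in fs)
            score *= (count + 1) / (len(cases) + 2)   # add-one smoothing
        scores[label] = score
    return max(scores, key=scores.get)

training = [({"dist_small", "order_AB"}, "positive"),
            ({"dist_small", "order_BA"}, "positive"),
            ({"dist_large", "order_AB"}, "negative")]
print(naive_bayes({"dist_small", "order_AB"}, training))
```

The per-class scores here play the role of the "probability of each classification" in the text: the class with the larger score is selected as the solution.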
  • When the decision list method is used, the machine learning unit 13 stores in the learning result storage unit 14 a list of rules, learned from the teacher data, that pair a feature with a classification, arranged in a predetermined priority order. Then, when new text data 2 is input, the solution estimation unit 17 compares the features of the binary relation candidate extracted from the text data 2 with the rules, in descending order of priority in the list, and estimates the classification of the first rule whose feature matches as the classification (solution) of the candidate. In other words, the solution estimation unit 17 takes the priority in the list, or a numerical value or scale corresponding to it, here the priority of the rule classifying the candidate as "to be extracted", as the degree of likelihood that the candidate's feature set leads to that solution.
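The decision-list estimation above reduces to a first-match scan over prioritized rules. The rules, feature names, and default class below are invented for illustration.

```python
# Decision list: (feature, classification) rules in descending priority;
# the first rule whose feature appears in the candidate's set decides.
decision_list = [
    ("dist_extra_large", "negative"),   # highest-priority rule
    ("order_AB", "positive"),
    ("dist_small", "positive"),
]

def decision_list_estimate(candidate_features, rules, default="negative"):
    for feature, classification in rules:   # scanned in priority order
        if feature in candidate_features:
            return classification
    return default

print(decision_list_estimate({"dist_small", "order_BA"}, decision_list))
```

The position of the matching rule in the list corresponds to the "priority" that the text uses as the degree of the solution.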
  • When the maximum entropy method is used, the machine learning unit 13 identifies the classes that can be solutions from the cases of the teacher data, and obtains the probability distribution over feature sets and solution classes that maximizes an entropy expression while satisfying a predetermined conditional expression, storing it in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 uses the probability distribution in the learning result storage unit 14 to obtain, for the feature set of the binary relation candidate extracted from the text data 2, the probability of each class that can be a solution, identifies the class with the largest probability value, and estimates that class as the solution for the candidate. In other words, the solution estimation unit 17 takes the probability of being classified into each class, here the probability of "to be extracted", as the degree of likelihood that the candidate's feature set leads to that solution.
  • When the support vector machine method is used, the machine learning unit 13 identifies the classes that can be solutions from the cases of the teacher data and divides them into positive and negative examples. Then, in a space in which the feature sets of the cases are mapped to dimensions by a predetermined function using a kernel function, it obtains the hyperplane that maximizes the margin between the positive and negative examples and separates them, and stores it in the learning result storage unit 14. When new text data 2 is input, the solution estimation unit 17 uses the hyperplane in the learning result storage unit 14 to determine on which side of the hyperplane the feature set of the binary relation candidate extracted from the text data 2 falls.
  • In this case, the solution estimation unit 17 takes the distance from the separating hyperplane into the space of the positive examples (binary relations to be extracted) as the degree of likelihood that the candidate's feature set leads to the solution. More specifically, when binary relations to be extracted are positive examples and binary relations that should not be extracted are negative examples, a case located in the space on the positive-example side of the separating hyperplane is given, as its degree, its distance from the separating hyperplane.
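Once a separating hyperplane has been learned, the "degree" described above can be read off as a signed distance from that plane. The weights below are invented for illustration; a real system would learn them (possibly with a kernel) from the teacher data.

```python
# Degree = signed distance of a candidate's feature vector from the learned
# hyperplane w.x + b = 0; positive side = "to be extracted".
import math

w = {"dist_small": 1.5, "order_AB": 0.5, "dist_extra_large": -2.0}  # assumed weights
b = -1.0                                                            # assumed bias

def degree(candidate_features):
    score = b + sum(w.get(f, 0.0) for f in candidate_features)  # w.x + b (binary features)
    norm = math.sqrt(sum(v * v for v in w.values()))
    return score / norm   # signed distance from the separating plane

print(degree({"dist_small", "order_AB"}))   # positive side of the plane
print(degree({"dist_extra_large"}))         # negative side of the plane
```

A candidate deep on the positive side gets a large degree and is the most confidently extracted; a candidate on the negative side is rejected.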
  • the solution-feature pair extraction unit 12 may use, for example, “words of two elements themselves” as the feature.
  • The first and second words/character strings from the front of each element, and the first and second words/character strings from the rear, may also be used as features. In the case of Fig. 3 (A), the features are:
  • the first element is“ presenilin (PS) 1 ”;
  • the second element is "delta-catenin";
  • the first word of the first element is “presenilin”
  • the second word is “(PS)”
  • the second word from the end of the first element is “(PS)”;
  • the first word from the end is “1”;
  • the first word of the second element is "delta"
  • the second word is “-”
  • the first character of the second element is "d"
  • the part of speech of the second word is “verb”;
  • the word before the second element is “of”;
  • The distance between the two elements may also be used as a feature, bucketed into states: a state of fewer than 5 words is labelled "distance medium", a state from 5 to 9 words "distance large", and a state of 10 or more words "distance extra large"; the resulting label is the feature.
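The bucketed distance feature above can be sketched as follows. The boundary for the "medium" bucket is an assumption (the text only gives 5-9 as "large" and 10 or more as "extra large").

```python
# Map the raw word distance between the two elements to a coarse label.
def distance_feature(word_distance):
    if word_distance >= 10:
        return "distance extra large"
    if word_distance >= 5:
        return "distance large"
    return "distance medium"   # assumed bucket for shorter distances

print(distance_feature(3), distance_feature(7), distance_feature(12))
```

Coarse buckets let the learner generalize over exact distances instead of treating every integer distance as a distinct feature.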
  • the appearance order of elements may be used as a feature.
  • For example, the information that "the first element is a disease name and the second element is a treatment method", or that "the first element is a treatment method and the second element is a disease name", is a feature.
  • By giving the binary relation extraction device 1, as teacher data, examples of various binary relations, such as the binary relation between a disease name and a treatment method, between a disease name and a protein expression, between a disease name and an organ name, between interacting protein expressions, and abbreviation-term relations, these binary relations can be extracted from text data 2 of biomedical papers.
  • text data including the following binary relations can be used as teacher data.
  • Oral corticosteroids (element: treatment method) are the preference of many for the treatment of CIDP (element: disease name), being much less expensive than IVIG (element: treatment method) infusion or TA (element: treatment method).
  • CM (complementary metaplasia) is mainly found in gastric mucosa (element: organ name) that harbours gastric cancer (element: disease name)
  • Variant Creutzfeldt-Jakob disease (element: disease name) is a transmissible spongiform encephalopathy believed to be caused by the bovine (element: animal species) spongiform encephalopathy agent, an abnormal isoform of the prion protein (PrP(sc)) (element: protein expression)
  • AIDP disease name
  • CIDP disease name
  • carbohydrate epitope -NeuAcalpha2-8NeuAcalpha2-3Galbetal-4Glc-
  • BSE PrP
  • In this way, teacher data for machine learning processing is used, so that binary relations estimated to be worth extracting can be extracted automatically from new text data. This avoids the complexity of manually generating the patterns used in conventional binary relation extraction processing.
  • Moreover, as the accuracy of supervised machine learning improves, the performance of the binary relation extraction processing can also be expected to improve.
  • The information retrieval device 4 regards the relationship between two search keywords in AND search processing as a potentially meaningful binary relation. It performs machine learning using teacher data in which binary relations having such search keywords as elements are tagged with a solution of either "relation that should be extracted (positive)" or "relation that should not be extracted (negative)".
  • It is a processing device that then outputs, as search results 6, the articles from the search text data 5 that contain the two search keywords as a pair estimated to be a binary relation that should be extracted.
  • FIG. 6 shows a configuration example of the information search device 4 according to the present invention.
  • the information retrieval device 4 includes an information retrieval unit 40, a teacher data storage unit 41, a feature pair extraction unit 42, a machine learning unit 43, a learning result storage unit 44, a candidate extraction unit 45, a feature extraction unit 46, and a solution estimation unit 47. , And a search result extraction unit 48.
  • The teacher data storage unit 41, solution-feature pair extraction unit 42, machine learning unit 43, learning result storage unit 44, candidate extraction unit 45, feature extraction unit 46, and solution estimation unit 47 of the information search device 4 are processing means that perform processing similar to the teacher data storage unit 11, solution-feature pair extraction unit 12, machine learning unit 13, learning result storage unit 14, candidate extraction unit 15, feature extraction unit 16, and solution estimation unit 17 of the binary relation extraction device 1 shown in Fig. 1.
  • the information search unit 40 searches the search text data 5 using the search keyword given in the AND search process, and acquires the corresponding article (text data).
  • the candidate extraction unit 45 extracts a binary relation candidate having the same character string (word) pair as two search keywords included in the article acquired by the information search unit 40 as elements.
  • The search result extraction unit 48 extracts, from the binary relation candidates of the articles searched from the search text data 5, those estimated to be positive (to be extracted) to a degree better than a certain threshold, and outputs as search results 6 the articles containing the extracted binary relation candidates, or information that identifies those articles.
  • Fig. 7 shows the processing flow of the information retrieval device 4.
  • The teacher data storage unit 41 of the information search device 4 stores, as teacher data, text data including cases in which a binary relation having two search keywords given in AND search processing as its elements is annotated with a "solution": either that it is a binary relation to be extracted (positive) or that it is not (negative).
  • The solution-feature pair extraction unit 42 extracts predetermined features for each case from the teacher data in the teacher data storage unit 41, and generates a pair of the solution (the information given by the tag) and the set of extracted features (step S11).
  • Specifically, the solution-feature pair extraction unit 42 extracts the binary relations, identified by a predetermined tag, from the text data that is the teacher data. Predetermined features of the search keywords (elements) are then extracted by, for example, morphological analysis processing, syntax analysis processing, calculating the positions at which the elements appear, and calculating the distance between the elements.
  • The machine learning unit 43 performs machine learning, from the pairs of solution and feature set generated by the solution-feature pair extraction unit 42, of what kind of solution (positive or negative) results from what kind of feature set, and stores the learning result in the learning result storage unit 44 (step S12).
  • The machine learning unit 43 performs this processing using a supervised machine learning method such as the k-nearest neighbour method, the naive Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
  • The candidate extraction unit 45 generates all pairs of the two input search keywords given in the AND search processing (step S13).
  • the information search unit 40 performs an AND search on the search text data 5 using two pairs of input search keywords to extract articles (text data) including the input search keyword pairs.
  • From each extracted article, all combinations (pairs) of the two input search keywords are extracted as binary relation candidates (step S14).
  • The feature extraction unit 46 uses processing similar to that of the solution-feature pair extraction unit 42 to extract a set of predetermined features for each binary relation candidate appearing in the searched articles (step S15).
  • The solution estimation unit 47 estimates, for each candidate, which solution its feature set is likely to lead to, that is, the degree to which it is "likely to be positive" or "likely to be negative", based on the learning results in the learning result storage unit 44 (step S16). Then, the search result extraction unit 48 selects, as binary relations to be extracted, the candidates estimated to be "likely to be positive" to a degree better than a predetermined threshold, and outputs the articles containing these binary relations, or information identifying those articles, as search results 6 (step S17).
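The selection in steps S16-S17 amounts to thresholding the estimated positive degree of each candidate. The article names, degrees, and threshold below are invented for illustration.

```python
# Keep the articles whose candidate pair is estimated "positive" to a degree
# better than a chosen threshold; these form search result 6.
candidates = [
    {"article": "doc-1", "pair": ("Kyoto Univ.", "President"), "degree": 0.92},
    {"article": "doc-2", "pair": ("Kyoto Univ.", "President"), "degree": 0.31},
    {"article": "doc-3", "pair": ("Kyoto Univ.", "President"), "degree": 0.77},
]

THRESHOLD = 0.5   # assumed cut-off for "a degree better than a predetermined threshold"
search_result = sorted({c["article"] for c in candidates if c["degree"] > THRESHOLD})
print(search_result)
```

Only the articles whose keyword pair clears the threshold are returned, which is what distinguishes this device from a plain AND search.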
  • In this way, the information search device 4 treats the search text data 5 as text data containing binary relations whose elements are character strings that can serve as the two search keywords used in AND search processing. It creates binary relation candidates whose elements are the input search keywords given in the AND search processing, retrieves articles from the search text data 5 using these candidates, estimates whether each candidate appearing in a retrieved article should be extracted and to what degree, and outputs the articles containing candidates that should be extracted as search results 6.
  • Figs. 8 to 10 show examples of teacher data stored in the teacher data storage unit 41 and examples of features extracted from the teacher data by the solution-feature pair extraction unit 42.
  • The teacher data D1 and D2 in Figs. 8 and 9 are given a tag indicating that they are binary relations to be extracted, i.e. that the solution is positive.
  • the teacher data D3 in Fig. 10 is given a tag that indicates that the binary relation should not be extracted and that the solution is negative.
  • the teacher data D1 in Fig. 8 includes a binary relation pair P3, which is a pair of two search keywords.
  • The binary relation (pair) P3 consists of the first element p1 (search keyword K1) "Kyoto Univ." and the second element p2 (search keyword K2) "President", and the positive solution is given to the binary relation pair P3.
  • The teacher data D2 in Fig. 9 includes a binary relation pair P4, which is a pair of two search keywords. The binary relation (pair) P4 consists of the first element p1 (search keyword K1) "Kyoto Univ." and the second element p2 (search keyword K2) "President", and the positive solution is given to the binary relation pair P4.
  • This is because the teacher data in Figs. 8 and 9 can be judged to express the content "President of Kyoto University".
  • The teacher data D3 in Fig. 10 includes a binary relation pair P5, which is a pair of two search keywords. The binary relation (pair) P5 consists of the first element p1 (search keyword K1) "Kyoto Univ." and the second element p2 (search keyword K2) "President", and the negative solution is given. This is because, although "Kyoto Univ." and "President" appear in the same data, they are not related to each other, and the data can be judged not to express the content "President of Kyoto University".
  • the solution-feature pair extraction unit 42 extracts a set of a solution and a set of features from the example of the teacher data stored in the teacher data storage unit 41.
  • Here, the features are the two words before and after each element (search keyword) and the parts of speech of those words. For example, taking teacher data D1, the features for the first element are as follows:
  • the next word is “ga”;
  • the part of speech of the next word is “particle”;
  • the second word is “attendance”
  • the part of speech of the word after the second is “noun”.
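The context-window features listed above can be sketched as follows. The POS lookup here is a stub dictionary standing in for real morphological analysis, and the feature-name format is an assumption.

```python
# Extract the two words around an element and their parts of speech as features.
POS = {"ga": "particle", "attendance": "noun", "of": "preposition"}  # stub tagger

def context_features(tokens, element_index, window=2):
    feats = set()
    for offset in range(1, window + 1):
        for side, idx in (("next", element_index + offset),
                          ("prev", element_index - offset)):
            if 0 <= idx < len(tokens):
                word = tokens[idx]
                feats.add(f"{side}{offset}_word={word}")
                feats.add(f"{side}{offset}_pos={POS.get(word, 'unknown')}")
    return feats

tokens = ["Kyoto Univ.", "ga", "attendance"]
feats = context_features(tokens, 0)   # features around the first element
print(sorted(feats))
```

For an element at the start of the sentence there are no preceding words, so only the "next" features are produced, mirroring the D1 example above.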
  • In addition, the solution-feature pair extraction unit 42 can extract the kinds of information described for the binary relation extraction processing as features.
  • The machine learning unit 43 performs machine learning processing of which solution (positive/negative) is likely to result from which feature set, and stores the learning result in the learning result storage unit 44.
  • As the supervised machine learning method, the machine learning unit 43 uses the processing methods described above, such as the k-nearest neighbour method, the naive Bayes method, the decision list method, the maximum entropy method, and the support vector machine method.
  • The information search unit 40 performs an AND search on the search text data 5 based on the given input search keywords "Kyoto Univ." and "President", and acquires articles including the input search keywords.
  • The candidate extraction unit 45 extracts binary relation candidates from the extracted articles. Specifically, binary relation candidates are extracted from the input search keywords included in the articles that are the results of the AND search.
  • The feature extraction unit 46 extracts the same features as the solution-feature pair extraction unit 42 from the binary relation candidates, and the solution estimation unit 47 estimates the degree of likelihood of the solution for each candidate using the learning results stored in the learning result storage unit 44.
  • Based on the estimation result of the solution estimation unit 47, the search result extraction unit 48 extracts from the binary relation candidates those estimated to be positive to a good degree, and outputs the articles containing these binary relations, or information identifying those articles, as search results 6.
  • The candidate extraction unit 45 generates all combinations (pairs) of two input search keywords from the given input search keywords, and takes each generated pair as a binary relation candidate. Then, the information search unit 40 performs AND search processing using the elements of each binary relation candidate (the two input search keywords). The feature extraction unit 46 then extracts a set of predetermined features for the binary relation candidates appearing in the retrieved articles.
  • The solution estimation unit 47 then estimates the degree of likelihood of the solution for each candidate's feature set. When each binary relation candidate that is an input search keyword pair appears only once in a searched article, the degree to which each of these candidates is positive (to be extracted) is estimated, and the articles containing candidates estimated positive to a sufficient degree, or information identifying those articles, are output as search results 6.
  • Alternatively, the candidate extraction unit 45 generates all pairs of two input search keywords from the given input search keywords, and takes each generated pair as a binary relation candidate. Then, the information search unit 40 performs AND search processing using the elements of each binary relation candidate (the two input search keywords), and the feature extraction unit 46 extracts a set of predetermined features for the binary relation candidates appearing in the retrieved articles.
  • The solution estimation unit 47 then estimates the degree of likelihood of the solution for each candidate's feature set. When each binary relation candidate that is an input search keyword pair appears only once in a searched article, the degree to which each candidate is positive (to be extracted) is estimated, and the product of the estimated positive degrees over all the binary relation candidates is taken as the positive degree of the article.
  • The articles whose positive degree is estimated to be sufficiently good, or information identifying those articles, are then output as search results 6.
  • When a binary relation candidate appears multiple times in an article, the degree of positiveness is estimated for each of the multiple appearances, and the highest value among them is taken as the degree of that binary relation candidate. The degree of each binary relation candidate is obtained in this way, and the result of multiplying these degrees together is the positive degree of the article. The articles whose positive degree is estimated to be sufficiently good, or information identifying those articles, are output as search results 6.
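The article-scoring variant above (maximum over occurrences of a pair, product over pairs) can be sketched as follows. The pairs and degree values are invented for illustration.

```python
# Article positive degree: for each candidate pair, take the best degree over
# its occurrences; multiply those per-pair degrees across all pairs.
def article_degree(occurrence_degrees_per_pair):
    # occurrence_degrees_per_pair: {pair: [degree of each occurrence]}
    degree = 1.0
    for degrees in occurrence_degrees_per_pair.values():
        degree *= max(degrees)   # the best occurrence represents the pair
    return degree

article = {("Kyoto Univ.", "President"): [0.4, 0.9],
           ("2000", "Kyoto Univ."): [0.5]}
print(article_degree(article))   # max(0.4, 0.9) * max(0.5)
```

Taking the maximum per pair means one clearly related occurrence is enough; taking the product across pairs requires every keyword pair in the article to look related.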
  • In this way, teacher data for machine learning processing is used: it is only necessary to prepare text data annotated with an evaluation of whether or not the binary relation between two search keywords in AND search processing is a binary relation to be extracted. Articles containing binary relations that deserve to be extracted can then be extracted automatically from new search text data 5.
  • By evaluating, with the binary relation extraction processing, the relation between the search keywords appearing in the articles returned by the AND search processing, the information search device 4 of the present invention can output as search results only those articles in which the search keywords are actually related.
  • improvement in the performance of information retrieval can be expected by improving the accuracy of supervised machine learning.
  • the example of the binary relation composed of two elements has been described in the binary relation extraction process and the information search process.
  • the present invention can also be applied to a ternary relationship composed of three elements.
  • In this case, the solution-feature pair extraction unit 12 determines the features of the ternary relation from, for example, the word information of the first element (the element appearing first), the second element (the element appearing in the middle), and the third element (the last element), of all the words between the first and second elements, and of all the words between the second and third elements.
  • The machine learning unit 13 can then learn the likelihood of each solution based on the feature sets of ternary relations, and the binary relation extraction unit 18 can handle the extraction of ternary relations.
  • the solution given to the ternary relation is the same as in the case of the binary relation: “ternary relation that should be extracted” or “ternary relation that should not be extracted”.
  • Alternatively, each processing means of the binary relation extraction device 1 can treat the binary relations obtained by decomposing the ternary relation of the teacher data, namely the binary relation between the first and second elements, the binary relation between the second and third elements, and the binary relation between the first and third elements, as separate binary relations. A ternary relation that should be extracted is then reconstructed by combining the extracted binary relations.
  • In this case, the binary relation extraction unit 18 can obtain a confidence of extraction when extracting a binary relation 3. The product of the confidences of the combined binary relations is used as the confidence of a ternary relation created by combining multiple binary relations, and the ternary relation with the highest confidence is extracted.
  • As the confidence of a binary relation, the confidence calculated in the normal machine learning processing is used.
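The ternary-relation reconstruction above can be sketched as follows: a triple's confidence is the product of the confidences of its three component binary relations. The confidence values are invented for illustration.

```python
# Confidence of a reconstructed ternary relation = product of the confidences
# of the three binary relations it decomposes into.
binary_confidence = {          # (element_a, element_b) -> extraction confidence
    ("2000", "Kyoto Univ."): 0.8,
    ("Kyoto Univ.", "President"): 0.9,
    ("2000", "President"): 0.7,
    ("2000", "Osaka"): 0.6,
}

def ternary_confidence(e1, e2, e3):
    pairs = [(e1, e2), (e2, e3), (e1, e3)]
    conf = 1.0
    for pair in pairs:
        if pair not in binary_confidence:
            return 0.0          # a missing binary relation rules the triple out
        conf *= binary_confidence[pair]
    return conf

print(ternary_confidence("2000", "Kyoto Univ.", "President"))   # 0.8 * 0.9 * 0.7
```

Among candidate triples, the one with the highest product is extracted; any triple with a missing component binary relation gets confidence 0 and is discarded.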
  • Such ternary relation extraction processing can be performed in the same manner in the information retrieval device 4. For example, when searching for articles related to "the President of Kyoto University in 2000", teacher data containing ternary relations among the three search keywords "2000", "Kyoto University", and "President" is given, and search results 6 of an AND search using these three search keywords are output from the search text data 5.
  • the present invention can be implemented as a program read and executed by a computer.
  • The program for realizing the present invention can be stored in an appropriate computer-readable recording medium such as a portable memory medium, a semiconductor memory, or a hard disk, and is provided by being recorded on such a recording medium, or by transmission and reception using various communication networks via a communication interface.


Abstract

Provided is a device capable of extracting binary relations efficiently even for complex problems. A solution-feature pair extraction unit (12) extracts the features of each case from a teacher data storage unit (11) that stores teacher data containing cases in which a solution, indicating whether it is to be extracted, is given to a binary relation appearing in text data, and creates a pair of the feature set and the solution. A machine learning unit (13) machine-learns, by a predetermined machine learning method, what solution a given feature set leads to, and stores the learning result information in a learning result storage unit (14). A candidate extraction unit (15) extracts candidates for binary relations from text data (2), and a feature extraction unit (16) extracts the feature set of each binary relation candidate. On the basis of the learning result information, a solution estimation unit (17) estimates the degree of likelihood of the solution for the feature set of each binary relation candidate. From the estimation result, a binary relation extraction unit (18) extracts the binary relation candidates with a good estimate of a positive solution.

Description

Specification
Binary relation extraction device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing

Technical Field
[0001] The present invention relates to a binary relation extraction technique for extracting pairs of expressions (words, character strings, etc.) having a binary relation from text data using supervised machine learning processing, and to an information retrieval technique using the binary relation extraction processing.
Background Art
[0002] As a technique for extracting information from sources such as text databases, a method is known that extracts desired information by focusing on binary relations between related words. For example, the method of Non-Patent Document 1 gives pattern frames, based on the predicate-argument structures produced by syntax analysis, for extracting the information to be obtained; patterns are extracted from a corpus with correct answers, inappropriate patterns are eliminated from those extracted, and matching information is then extracted using the selected patterns.
Non-Patent Document 1: Akane Yakushiji et al., "Information Extraction in the Medical and Biological Fields Using Predicate-Argument Structure Patterns", 11th Annual Conference of the Association for Natural Language Processing, March 2005
Disclosure of the Invention
Problems to Be Solved by the Invention
[0003] Conventionally, techniques that extract binary relations using manually created patterns have mainly been used. In the method of Non-Patent Document 1, in order to improve pattern accuracy, the patterns are selected by checking them against a learning corpus, thereby improving the accuracy of the binary relation extraction processing.
[0004] However, when patterns are used as binary relation extraction rules, there is the problem that the patterns become complicated as the target problem becomes complex. For this reason, methods using patterns have their limits. There is also the problem that the performance of such extraction methods does not become high.
[0005] An object of the present invention is to provide a binary relation extraction device that can be used for any problem of extracting binary relations from text data, and that can extract binary relations with good performance even for complex problems. Another object of the present invention is to provide an information retrieval device using the binary relation extraction processing, the processing methods executed by these devices, and programs for causing a computer to function as these devices.
Means for Solving the Problem
[0006] The present invention is a binary relation extraction processing device that extracts, using machine learning processing, binary relations appearing in sentence data stored in a computer-readable storage device, comprising: 1) teacher data storage means storing teacher data that includes cases each consisting of a pair of a problem and a solution, where the problem is a binary relation appearing in the sentence data and the solution indicates that it is a binary relation to be extracted; 2) solution-feature pair extraction means for taking the cases out of the teacher data storage means, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features; 3) machine learning means for machine-learning, based on a predetermined machine learning algorithm, from the pairs of solution and feature set, what kind of feature set leads to the solution, and storing information indicating this in learning result storage means as learning result information; 4) candidate extraction means for extracting the elements of binary relations from text data stored in the storage device, extracting pairs composed of the elements, and taking the extracted pairs as binary relation candidates; 5) feature extraction means for extracting the predetermined information as features for the binary relation candidates by extraction processing similar to that performed by the solution-feature pair extraction means; 6) solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree of likelihood of the solution for the feature set of each binary relation candidate; and 7) binary relation extraction means for selecting a binary relation candidate as a binary relation to be extracted when, in the estimation result, its degree of likelihood of the solution is better than a predetermined level.
ことを特徴とする。  It is characterized by that.
[0007] In the present invention, teacher data including cases in which binary relations appearing in sentence data are annotated with solution information indicating that they are binary relations to be extracted is stored in the teacher data storage means in advance. The solution-feature pair extraction means then takes the cases out of the teacher data storage means and, for each case, extracts predetermined information as features and generates a pair of the extracted feature set and the solution. Further, the machine learning means performs machine learning processing, on the basis of a predetermined machine learning algorithm and on the pairs of solution and feature set, of what kind of feature set leads to what solution, and saves information indicating "what kind of feature set leads to what solution" in the learning result storage means as learning result information.
[0008] Thereafter, when the candidate extraction means extracts the elements of binary relations from text data stored in the storage device, extracts pairs composed of those elements, and treats the extracted pairs as binary relation candidates, the feature extraction means extracts the predetermined information as features for each binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means. Then, on the basis of the learning result information stored in the learning result storage means, the solution estimation means estimates the degree to which the feature set of each binary relation candidate is likely to yield the solution, and the binary relation extraction means extracts, from the estimation results, the binary relation candidates whose degree of likelihood of being the solution is better than a predetermined level.
[0009] The present invention is also an information retrieval device that, in information retrieval processing with a plurality of search keywords, extracts search results by using the results of binary relation extraction processing based on supervised machine learning, comprising: 1) teacher data storage means storing teacher data that includes cases each consisting of a problem and a solution, where the problem is a binary relation whose elements are search keywords and the solution indicates that it is a binary relation to be extracted; 2) solution-feature pair extraction means for taking the cases out of the teacher data storage means, extracting predetermined information as features for each case, and generating pairs of the solution and the set of extracted features; 3) machine learning means for performing machine learning processing, on the basis of a predetermined machine learning algorithm and from the pairs of solution and feature set, of what kind of feature set leads to the solution, and for saving information indicating what kind of feature set leads to the solution in learning result storage means as learning result information; 4) information retrieval means for generating input search keyword pairs from a plurality of input search keywords and extracting and acquiring, from the text data to be searched, text data containing an input search keyword pair; 5) candidate extraction means for generating, from each piece of text data acquired by the search, pairs composed of the input search keywords and treating the generated pairs as binary relation candidates; 6) feature extraction means for extracting the predetermined information as features for each binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; 7) solution estimation means for estimating, on the basis of the learning result information stored in the learning result storage means, the degree to which the feature set of a binary relation candidate is likely to yield the solution; and 8) search result extraction means for selecting, as the estimation result, a binary relation candidate as a binary relation to be extracted when its degree of likelihood of being the solution is better than a predetermined level, and for extracting the text data containing the selected binary relation as a search result.
[0010] In the present invention, teacher data including cases in which binary relations whose elements are search keywords are annotated with solution information indicating that they are binary relations to be extracted is stored in the teacher data storage means in advance. The solution-feature pair extraction means then takes the cases out of the teacher data storage means and, for each case, extracts predetermined information as features and generates a pair of the extracted feature set and the solution. Further, the machine learning means performs machine learning processing, on the basis of a predetermined machine learning algorithm, of what kind of feature set leads to what solution, and saves information indicating "what kind of feature set leads to what solution" in the learning result storage means as learning result information.
[0011] Thereafter, when the information retrieval means generates input search keyword pairs from a plurality of input search keywords and extracts and acquires, from the text data to be searched, the text data containing an input search keyword pair, the candidate extraction means generates, from each piece of text data acquired by the search, pairs composed of the input search keywords and treats the generated pairs as binary relation candidates. The feature extraction means then extracts the predetermined information as features for each binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means. Further, when the solution estimation means estimates, on the basis of the learning result information stored in the learning result storage means, the degree to which the feature set of each binary relation candidate is likely to yield the solution, the search result extraction means selects, from the estimation results, each binary relation candidate whose degree of likelihood of being the solution is better than a predetermined level as a binary relation to be extracted, and extracts the text data containing the selected binary relations as search results.
[0012] The present invention is also a binary relation extraction processing method and an information retrieval processing method using the binary relation extraction processing method, realized by the binary relation extraction device and the information retrieval device, respectively.
[0013] The present invention is also a binary relation extraction processing program for causing a computer to execute each processing step performed in the binary relation extraction processing method, and an information retrieval processing program, using the binary relation extraction processing method, for causing a computer to execute each processing step performed in the information retrieval processing method.
Effects of the Invention
[0014] According to the present invention, by performing machine learning with text data that has been manually annotated with tags indicating whether each binary relation is one to be extracted as training data, it becomes possible, when a new binary relation candidate is given, to judge whether that candidate is a binary relation to be extracted. For example, by using as training data "pairs of names of interacting proteins" annotated with tags indicating whether they are binary relations to be extracted, the desired information on "pairs of names of interacting proteins" can be obtained from a text database or the like.
[0015] Further, by using as training data "pairs of search keywords" in which the two search keywords of an AND search in information retrieval processing are annotated with tags indicating whether they stand in a meaningful relation in the retrieved documents, meaningful search results can be extracted from the text data to be searched.
[0016] Since the present invention can be applied to any problem of extracting binary relations from text data, it is extremely versatile.
Brief Description of the Drawings
[0017] [Fig. 1] A diagram showing a configuration example of the binary relation extraction device according to the present invention.
[Fig. 2] A diagram showing the processing flow of the binary relation extraction device.
[Fig. 3] A diagram showing an example of teacher data.
[Fig. 4] A diagram showing the concept of margin maximization in the support vector machine method.
[Fig. 5] A diagram showing an example of the pairs of the binary relations shown in Fig. 3 and their feature sets.
[Fig. 6] A diagram showing a configuration example of the information retrieval device according to the present invention.
[Fig. 7] A diagram showing the processing flow of the information retrieval device.
[Fig. 8] A diagram showing an example of teacher data and its pairs of binary relations and feature sets.
[Fig. 9] A diagram showing an example of teacher data and its pairs of binary relations and feature sets.
[Fig. 10] A diagram showing an example of teacher data and its pairs of binary relations and feature sets.
Explanation of Reference Numerals
1 Binary relation extraction device
11 Teacher data storage unit
12 Solution-feature pair extraction unit
13 Machine learning unit
14 Learning result storage unit
15 Candidate extraction unit
16 Feature extraction unit
17 Solution estimation unit
18 Binary relation extraction unit
2 Text data
3 Binary relation
4 Information retrieval device
40 Information retrieval unit
41 Teacher data storage unit
42 Solution-feature pair extraction unit
43 Machine learning unit
44 Learning result storage unit
45 Candidate extraction unit
46 Feature extraction unit
47 Solution estimation unit
48 Search result extraction unit
5 Text data for search
6 Search results
Best Mode for Carrying Out the Invention
[0019] An embodiment of the binary relation extraction device 1 of the present invention is described below.
[0020] The binary relation extraction device 1 is a processing device that uses teacher data, namely text data annotated with tags indicating whether each binary relation is one to be extracted, to machine-learn what kinds of word pairs are binary relations to be extracted, acquires binary relation candidates from given text data 2, and extracts the binary relations 3 to be extracted.
[0021] Fig. 1 shows a configuration example of the binary relation extraction device 1 according to the present invention. The binary relation extraction device 1 comprises a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result storage unit 14, a candidate extraction unit 15, a feature extraction unit 16, a solution estimation unit 17, and a binary relation extraction unit 18.
[0022] The teacher data storage unit 11 is a means for storing text data that serves as the teacher data used in the machine learning processing.
[0023] As the teacher data, cases are used in which the problem is a pair of binary relation elements appearing in a sentence of the text data (one element is called the first element, the other the second element), and the solution is information on whether the pair is a binary relation to be extracted. Specifically, only for sentences containing two or more binary relation elements in one sentence of the text data, each pair of binary relation elements in the sentence is manually annotated with a tag indicating one of the two solutions: a pair to be extracted (positive example) or a pair that should not be extracted (negative example). When a sentence contains three or more binary relation elements, a tag is assigned to every pair among all combinations of the elements. As teacher data cases, binary relations annotated only with the solution indicating pairs to be extracted (positive examples) may also be used.
[0024] The solution-feature pair extraction unit 12 is a processing means that extracts pairs of a solution and a feature set from the cases in the text data stored in the teacher data storage unit 11.
[0025] Features are the information used in the machine learning processing. As features, the solution-feature pair extraction unit 12 extracts, for example, the elements of the binary relation; the words or characters appearing around the elements, together with their positions and order; part-of-speech information, morphological analysis information, and syntactic analysis information of the elements and surrounding words; the distance between the elements; and the presence or absence of other binary relation elements between the elements. [0026] The machine learning unit 13 is a processing means that learns, by a supervised machine learning method and from the pairs of solution and feature set extracted by the solution-feature pair extraction unit 12, what kind of feature set is likely to lead to what solution. The learning result is stored in the learning result storage unit 14.
[0027] The feature extraction unit 16 is a processing means that extracts predetermined features for the binary relation candidates extracted from the text data 2.
[0028] The solution estimation unit 17 is a processing means that refers to the learning results in the learning result storage unit 14 and estimates, for each binary relation candidate, the degree to which its feature set is likely to lead to each solution (classification).
[0029] The binary relation extraction unit 18 is a processing means that, on the basis of the estimation results of the solution estimation unit 17, outputs as binary relations 3 those candidates estimated with a high degree to have the solution indicating a binary relation to be extracted.
[0030] Fig. 2 shows the processing flow of the binary relation extraction device 1.
[0031] In the teacher data storage unit 11 of the binary relation extraction device 1, text data 2 is stored in advance as teacher data; it includes cases in which binary relations, that is, pairs of elements having a certain meaning, are annotated with "solution" information of one of two kinds: being a binary relation to be extracted (positive) or being a binary relation that should not be extracted (negative).
[0032] Alternatively, text data 2 including cases in which a predetermined solution is assigned only to the pairs to be extracted may be stored. In this case, the pairs in the text data 2 to which a solution is assigned are regarded as having been given the (positive) solution of being binary relations to be extracted, and the remaining pairs, to which no solution is assigned, are treated as having been given the (negative) solution of being binary relations that should not be extracted.
[0033] First, the solution-feature pair extraction unit 12 extracts predetermined features for each case from the teacher data in the teacher data storage unit 11, and generates pairs of the solution (the information assigned by the tag) and the set of extracted features (step S1). The solution-feature pair extraction unit 12 extracts the binary relations from the text data serving as teacher data by means of the predetermined tags, and extracts the predetermined features for the extracted binary relation elements by performing morphological analysis processing, syntactic analysis processing, calculation of element positions and of the distance between elements, and so on. [0034] Then, the machine learning unit 13 learns, by a machine learning method and from the pairs of solution and feature set generated by the solution-feature pair extraction unit 12, what kind of feature set is likely to lead to what solution (positive or negative), and stores the learning result in the learning result storage unit 14 (step S2). As the supervised machine learning method, the machine learning unit 13 performs the machine learning processing using, for example, any of the k-nearest neighbor method, the Simple Bayes method, the decision list method, the maximum entropy method, and the support vector machine method.
[0035] Thereafter, the candidate extraction unit 15 receives the text data 2 from which binary relations are to be extracted, and extracts binary relation candidates from the input text data 2 (step S3). The candidate extraction unit 15 divides the text data into sentences, treats as processing targets only those sentences in which two or more binary relation elements appear in one sentence, and extracts binary relation candidates from those sentences.
[0036] The feature extraction unit 16 extracts features for each binary relation candidate extracted from the text data 2, by substantially the same processing as that of the solution-feature pair extraction unit 12 (step S4).
[0037] For each candidate, the solution estimation unit 17 estimates, on the basis of the learning results in the learning result storage unit 14, what solution its feature set is likely to lead to, that is, the degree to which it is "likely to be positive" or "likely to be negative" (step S5). The binary relation extraction unit 18 then outputs, from among the candidates estimated to be "likely to be positive" with a better degree, those of a predetermined level as the binary relations 3 to be extracted (step S6).
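The training phase (steps S1 and S2) and the extraction phase (steps S3 to S6) above can be sketched as follows. This is a minimal illustration only: the position-prefixed feature encoding, the toy feature-weighting learner, and the zero threshold are stand-ins chosen here for brevity, not the embodiment's actual feature set or machine learning method (k-nearest neighbor, Simple Bayes, SVM, and so on).

```python
# Sketch of the flow in Fig. 2. The learner below simply weights each
# feature by how often it co-occurs with positive vs. negative cases;
# any supervised method can be substituted for learn()/estimate().

def extract_features(tokens, i, j):
    """Steps S1/S4: feature set for an element pair at token positions i < j."""
    return ({f"before:{w}" for w in tokens[max(0, i - 3):i]}
            | {f"between:{w}" for w in tokens[i + 1:j]}
            | {f"after:{w}" for w in tokens[j + 1:j + 4]})

def learn(examples):
    """Step S2: learn feature weights from (feature set, solution) pairs."""
    weights = {}
    for features, solution in examples:
        delta = 1 if solution == "positive" else -1
        for f in features:
            weights[f] = weights.get(f, 0) + delta
    return weights

def estimate(weights, features):
    """Step S5: degree to which this feature set is likely to be positive."""
    return sum(weights.get(f, 0) for f in features)

def extract_relations(weights, candidates, threshold=0):
    """Step S6: keep the candidate pairs scoring above the threshold."""
    return [pair for pair, feats in candidates
            if estimate(weights, feats) > threshold]
```

Training on one positive and one negative case and then scoring two new candidate pairs keeps only the candidate whose surrounding context resembles the positive case.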
[0038] Next, a specific example of the binary relation extraction processing of the present invention is described. In this example, the binary relation extraction device 1 extracts binary relations of interacting protein expressions (protein names) from a text database of biomedical papers, and it is assumed that the protein expressions in the text database are identified with 100% accuracy.
[0039] It is also assumed that the elements constituting a binary relation appear in the same sentence. Note that the elements constituting a binary relation may instead be elements appearing in the same paragraph or in the same document.
[0040] In the processing of creating the teacher data, when specific expressions that become binary relation elements, for example protein expressions, or disease names and treatment methods, are to be taken out as binary relation elements, this is done as follows.
[0041] 1) Extracting elements using rules. A pattern such as "NF-Kappa [A-Z], where [A-Z] is any single letter from A to Z" is defined manually, and the matching expressions are extracted. With this pattern, elements that are protein name expressions such as NF-Kappa A and NF-Kappa B are extracted.
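A pattern such as the "NF-Kappa [A-Z]" rule above maps directly onto an ordinary regular expression. The embodiment does not fix a concrete pattern syntax, so the rendering below is one possible sketch:

```python
import re

# One rendering of the manually defined rule "NF-Kappa [A-Z]":
# the literal string "NF-Kappa" followed by a single capital letter.
PROTEIN_RULE = re.compile(r"NF-Kappa [A-Z]\b")

def extract_by_rule(sentence):
    """Rule-based element extraction (method 1)."""
    return PROTEIN_RULE.findall(sentence)

print(extract_by_rule("NF-Kappa A binds NF-Kappa B in vitro."))
# ['NF-Kappa A', 'NF-Kappa B']
```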
[0042] 2) Extracting elements using a dictionary. Using dictionaries in which expressions such as disease names and treatment methods are recorded, character strings that exactly match the expressions (character strings, word strings, and so on) in those dictionaries are extracted as elements, that is, as expressions of disease names or treatment methods.
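Exact-match dictionary lookup as in method 2 can be sketched as follows; the dictionary entries are hypothetical examples, not taken from the embodiment:

```python
def extract_by_dictionary(sentence, dictionary):
    """Dictionary-based element extraction (method 2): return every
    dictionary expression that occurs verbatim in the sentence."""
    return [term for term in dictionary if term in sentence]

# Hypothetical disease-name dictionary for illustration.
diseases = ["diabetes", "hypertension", "asthma"]
print(extract_by_dictionary("Treatment of diabetes with insulin.", diseases))
# ['diabetes']
```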
[0043] 3) Extracting elements by machine learning processing. Text data in which start position tags and end position tags are attached before and after expressions such as protein expressions, or disease names and treatment methods, is prepared as training data. Machine learning processing is performed using this tagged training data, and by using the learning result, elements in new, untagged text data are identified by inserting tags at the start and end positions of the corresponding expressions.
[0044] 4) Extracting elements using information indicating predetermined binary relations. Using data in which expressions that can become binary relation elements are tagged in advance, the expressions that are binary relation elements are extracted on the basis of those tags.
[0045] Fig. 3 shows an example of teacher data. English text data containing binary relations whose elements are interacting protein expressions, as shown in Fig. 3(A), is used as teacher data. In this example, tags indicating the solution (positive) are assigned in the teacher data only to the binary relations to be extracted. That is, teacher data containing only positive cases is used in the machine learning processing.
[0046] Fig. 3(B) shows an example of the tags assigned to the teacher data. The teacher data contains two binary relation pairs, P1 and P2. The binary relation (pair) P1 consists of the first element p1 "delta-catenin" and the second element p2 "presenilin 1". The binary relation (pair) P2 consists of the first element p1 "presenilin (PS) 1" and the second element p2 "delta-catenin".
[0047] The solution-feature pair extraction unit 12 extracts pairs of a solution and a feature set from the cases in the text data stored in the teacher data storage unit 11. For example, the following information is extracted as features.
[0048] 1) Words or characters appearing around the binary relation elements. For example, a predetermined number of words/characters before the first element of the binary relation, a predetermined number of words/characters after the second element, and a predetermined number of words/characters between the first and second elements;
2) the positions, order, and so on of the words/characters appearing around the binary relation elements;
3) the two elements of the binary relation;
4) part-of-speech information, morphological analysis information, and the like of the binary relation elements or surrounding words;
5) syntactic analysis information of the binary relation elements or surrounding words;
6) the distance between the occurrences of the first and second elements of the binary relation;
7) the presence or absence of element occurrences between the first and second elements of the binary relation.
Among these features, the part-of-speech information, for example, is obtained using an existing morphological analysis processing method such as the morphological analysis system "ChaSen" (see: http://chasen.aist-nara.ac.jp/index.html.ja). For English text data, the part-of-speech information is obtained using, for example, "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" (Eric Brill, Computational Linguistics, Vol. 21, No. 4, pp. 543-565, 1995).
[0049] Here, when the binary relation elements appear in the same paragraph, information on whether the binary relation elements span sentence boundaries may be used as a feature. When the binary relation elements appear in the same document, information on whether the binary relation elements span sentence boundaries and whether they span paragraph boundaries may be used as features.
[0050] The solution-feature pair extraction unit 12 extracts features from the teacher data cases tagged as shown in Fig. 3(B), and generates pairs of a feature set and a solution. For example, suppose that for the case of binary relation P2, a pair of the solution (positive) and the following feature set is generated, as shown in Fig. 5.
"'for', 'interaction', and 'with' appear within the three words before the first element;
'and', 'cloned', 'the', 'full', '-', 'length', 'cDNA', 'of', and 'human' appear between the elements;
'which', 'encoded', and '1225' appear within the three words after the second element."
[0051] On the basis of such pairs of solution and feature set, the machine learning unit 13 performs machine learning processing of what kind of feature set is likely to lead to the solution (positive), and stores the learning result in the learning result storage unit 14.
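Written out as data, the solution-feature pair generated for P2 might look as follows. The position-prefix encoding ("before:", "between:", "after:") is an assumption made here for concreteness; the embodiment does not prescribe a specific feature representation.

```python
def to_feature_set(before, between, after):
    """Encode windowed context words as position-prefixed features."""
    return ({f"before:{w}" for w in before}
            | {f"between:{w}" for w in between}
            | {f"after:{w}" for w in after})

# The pair of Fig. 5: the solution plus the extracted feature set.
p2_pair = ("positive",
           to_feature_set(
               before=["for", "interaction", "with"],
               between=["and", "cloned", "the", "full", "-",
                        "length", "cDNA", "of", "human"],
               after=["which", "encoded", "1225"]))
```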
[0052] As the supervised machine learning method, the machine learning unit 13 uses, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
[0053] The k-nearest neighbor method, instead of using the single most similar case, uses the k most similar cases and determines the classification (solution) by majority vote among those k cases. Here k is a predetermined integer; generally an odd number between 1 and 9 is used. The simple Bayes method estimates the probability of each classification based on Bayes' theorem, and takes the classification with the largest probability value as the classification to be obtained.
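The k-nearest neighbor classification described in paragraph [0053] can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes training cases are given as (feature set, label) pairs, and uses feature overlap as the similarity measure, as the specification later describes for paragraph [0088].

```python
from collections import Counter

def knn_classify(candidate_features, training_examples, k=3):
    """Classify by majority vote over the k most similar training cases.

    Similarity between two cases is the fraction of overlapping features.
    `training_examples` is a list of (feature_set, label) pairs.
    """
    def similarity(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    # Rank training cases by similarity to the candidate, most similar first.
    ranked = sorted(training_examples,
                    key=lambda ex: similarity(candidate_features, ex[0]),
                    reverse=True)
    # Majority vote among the k nearest cases decides the solution.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 3 and two of the three nearest cases labeled positive, the candidate is classified positive.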
[0054] In the simple Bayes method, the probability of outputting classification a in context b is given by the following equations (1) and (2).

[0055] [Equation 1]

p(a|b) = p(a) · p(b|a) / p(b)    (1)

p(a|b) ≈ p̃(a) · Π_{f_j ∈ F} p̃(f_j|a)    (2)
[0056] Here, the context b is the set of predetermined features f_j (f_j ∈ F, 1 ≤ j ≤ k). p(b) is the occurrence probability of the context b; since it does not depend on the classification a and is a constant, it is not computed. p̃(a) (where p̃ denotes p with a tilde above it) and p̃(f_j|a) are probabilities estimated from the teacher data, meaning the occurrence probability of classification a and the probability of having feature f_j given classification a, respectively. If the value obtained by maximum likelihood estimation is used for p̃(f_j|a), the value is often zero, in which case the value of equation (2) becomes zero and it may be difficult to determine the classification. For this reason, smoothing is performed; here, smoothing using the following equation (3) is used.
[0057] [Equation 2]

p̃(f_j|a) = ( freq(f_j, a) + 0.01 · p̃(f_j) ) / ( freq(a) + 0.01 )    (3)

where p̃(f_j) denotes the occurrence probability of the feature f_j estimated from the teacher data.
[0058] Here, freq(f_j, a) denotes the number of cases that have the feature f_j and whose classification is a, and freq(a) denotes the number of cases whose classification is a.
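The simple Bayes classification of equations (1) and (2) can be sketched as follows. This is a minimal illustration: the smoothing of equation (3) is stood in for by simple additive smoothing with the constant 0.01 from the text, and the (feature_set, label) data structure is an assumption.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """Count class and (feature, class) frequencies from (feature_set, label) pairs."""
    class_count = Counter(label for _, label in examples)
    feat_count = Counter()  # maps (feature, label) -> freq(f_j, a)
    for feats, label in examples:
        for f in feats:
            feat_count[(f, label)] += 1
    return class_count, feat_count, len(examples)

def classify(feats, class_count, feat_count, n):
    """Pick argmax_a p(a) * prod_f p(f|a), in log space for stability.

    Additive smoothing with 0.01 keeps unseen features from zeroing
    out the product, in the spirit of equation (3).
    """
    best, best_lp = None, -math.inf
    for a, freq_a in class_count.items():
        lp = math.log(freq_a / n)
        for f in feats:
            lp += math.log((feat_count[(f, a)] + 0.01) / (freq_a + 0.01))
        if lp > best_lp:
            best, best_lp = a, lp
    return best
```

Because p(b) is constant across classifications, it is simply omitted, exactly as noted in paragraph [0056].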
[0059] The decision list method treats pairs of a feature and a classification as rules, stores them in a list in a predetermined priority order, and, when an input to be detected is given, compares the input data with the features of the rules starting from the highest-priority entry of the list; the classification of the first rule whose feature matches is taken as the classification of the input.
[0060] In the decision list method, the probability value of each classification is obtained using only one of the predetermined features f_j (f_j ∈ F, 1 ≤ j ≤ k) as the context. The probability of outputting classification a in a context b is given by the following equation:
[0061] p(a|b) = p(a|f_max)    (4)

where f_max is given by the following equation:

[0062] [Equation 3]

f_max = argmax_{f_j ∈ F} max_{a} p̃(a|f_j)    (5)
[0063] Here, p̃(a|f_j) (where p̃ denotes p with a tilde above it) is the proportion of occurrences of classification a when the feature f_j is present in the context.
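The decision list lookup of paragraphs [0059] to [0063] can be sketched as follows; a minimal illustration in which each rule is an assumed (feature, label, score) triple, with the score playing the role of p̃(a|f_j) for ordering the list.

```python
def build_decision_list(rules):
    """Sort (feature, label, score) rules so the highest-scoring rule,
    i.e. the one with the largest p(a|f), is tried first."""
    return sorted(rules, key=lambda r: r[2], reverse=True)

def apply_decision_list(dlist, feats, default="negative"):
    """Return the label of the first (highest-priority) rule whose
    feature appears in the candidate's feature set; fall through to a
    default rule if nothing matches."""
    for feature, label, _ in dlist:
        if feature in feats:
            return label
    return default
```

The priority order is fixed at training time, so classification at run time is a single scan down the list.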
[0064] The maximum entropy method is a method that, where F is the set of predetermined features f_j (1 ≤ j ≤ k), obtains the probability distribution p(a, b) that maximizes the entropy expressed by equation (7) while satisfying equation (6) below, and takes, among the probabilities of the classifications determined according to that distribution, the classification with the largest probability value as the classification to be obtained.
[0065] [Equation 4]

Σ_{a∈A, b∈B} p(a, b) · g_j(a, b) = Σ_{a∈A, b∈B} p̃(a, b) · g_j(a, b)    (for all j, 1 ≤ j ≤ k)    (6)

H(p) = − Σ_{a∈A, b∈B} p(a, b) · log( p(a, b) )    (7)
[0066] Here, A and B denote the sets of classifications and contexts, respectively, and g_j(a, b) is a function that takes the value 1 when the context b contains the feature f_j and the classification is a, and takes the value 0 otherwise. p̃(a, b) (where p̃ denotes p with a tilde above it) denotes the proportion of occurrences of the pair (a, b) in the known data.
[0067] Equation (6) obtains the expected value of the frequency of each pair of an output and a feature by multiplying the probability p by the function g_j, which indicates the appearance of that pair. Under the constraint that the expected value in the known data on the right-hand side equals the expected value computed from the probability distribution being sought on the left-hand side, entropy maximization (smoothing of the probability distribution) is performed to obtain the probability distribution of outputs and contexts. For details of the maximum entropy method, see References 1 and 2 below.
Reference 1: Eric Sven Ristad, Maximum Entropy Modeling for Natural Language (ACL/EACL Tutorial Program, Madrid, 1997);
Reference 2: Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta (http://www.mnemonic.com/software/memt, 1998)
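The interplay of constraint (6) and entropy (7) can be illustrated with a deliberately tiny example: three outcomes, a single empirical constraint fixing the mass on the first outcome, and a brute-force grid search (real maximum entropy trainers use iterative scaling or gradient methods instead). Everything here is an assumption made for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum p log p, cf. equation (7)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def maxent_three_outcomes(p0, step=0.01):
    """Among distributions (p0, p1, p2) over three outcomes whose mass
    on outcome 0 is fixed to the empirical value p0 (one constraint of
    the form of equation (6)), find the maximum-entropy one by grid
    search over p1."""
    best, best_h = None, -1.0
    steps = int(round((1.0 - p0) / step))
    for i in range(steps + 1):
        p1 = i * step
        p2 = 1.0 - p0 - p1
        if p2 < -1e-9:
            continue
        cand = (p0, p1, max(p2, 0.0))
        h = entropy(cand)
        if h > best_h:
            best, best_h = cand, h
    return best
```

With p0 = 0.5, the result is approximately (0.5, 0.25, 0.25): the unconstrained remainder of the mass is spread evenly, which is the "smoothing of the probability distribution" the text describes.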
The support vector machine method is a method for classifying data into two classes by dividing the space with a hyperplane.
[0068] Fig. 4 shows the concept of margin maximization in the support vector machine method. In Fig. 4, white circles denote positive examples, black circles denote negative examples, the solid line denotes the hyperplane dividing the space, and the broken lines denote the planes forming the boundaries of the margin region. Fig. 4(A) is a conceptual diagram of the case where the interval between positive and negative examples is narrow (small margin), and Fig. 4(B) is a conceptual diagram of the case where the interval between positive and negative examples is wide (large margin).
[0069] Assuming that the two classes consist of positive and negative examples, the larger the interval (margin) between the positive and negative examples in the training data, the lower the probability of misclassifying open data is considered to be; therefore, as shown in Fig. 4(B), the hyperplane that maximizes this margin is obtained and used for classification.
[0070] The method is basically as described above, but in practice extended versions are used: an extension that allows a small number of cases to fall inside the margin region of the training data, and an extension that makes the linear part of the hyperplane nonlinear (introduction of a kernel function).
[0071] This extended method is equivalent to classifying with the following discriminant function, and the two classes can be discriminated according to whether the output value of the discriminant function is positive or negative.
[0072] [Equation 5]

f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )    (8)

b = − ( max_{i: y_i = −1} b_i + min_{i: y_i = 1} b_i ) / 2

b_i = Σ_{j=1}^{l} α_j y_j K(x_j, x_i)

[0073] Here, x is the context (feature set) of the case to be classified, and x_i and y_i (i = 1, ..., l, y_i ∈ {1, −1}) denote the contexts and classifications of the training data. The function sgn is

sgn(x) = 1 (x ≥ 0)
       = −1 (otherwise)

and each α_i is the value that maximizes equation (9) under the constraints of equations (10) and (11).
[0074] [Equation 6]

L(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j K(x_i, x_j)    (9)

0 ≤ α_i ≤ C    (i = 1, ..., l)    (10)

Σ_{i=1}^{l} α_i y_i = 0    (11)
[0075] The function K is called a kernel function, and various functions can be used; in this embodiment, the following polynomial kernel is used.
[0076] K(x, y) = (x · y + 1)^d    (12)
C and d are constants set experimentally. In the specific example described later, C was fixed to 1 throughout all processing, and two values of d, 1 and 2, were tried. The cases x_i with α_i > 0 are called support vectors, and the summation in equation (8) is usually computed using only these cases. That is, only the cases in the training data called support vectors are used in the actual analysis.
[0077] For details of the extended support vector machine method, see References 3 and 4 below.
Reference 3: Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (Cambridge University Press, 2000);
Reference 4: Taku Kudoh, TinySVM: Support Vector Machines (http://cl.aist-nara.ac.jp/taku-ku/software/TinySVM/index.html, 2000)
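Evaluating the discriminant function of equation (8) with the polynomial kernel of equation (12) can be sketched as follows. This assumes the α_i, y_i, support vectors, and b have already been obtained by training (e.g. with a package such as TinySVM); only the classification step is shown, with feature vectors represented as dicts.

```python
def svm_decision(x, support_vectors, alphas, ys, b, d=1):
    """Evaluate f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ),
    with the polynomial kernel K(x, y) = (x . y + 1)^d of eq. (12).
    Feature vectors are dicts mapping feature name -> value."""
    def dot(u, v):
        return sum(u[k] * v.get(k, 0.0) for k in u)

    def kernel(u, v):
        return (dot(u, v) + 1.0) ** d

    # Only the support vectors (alpha_i > 0) contribute to the sum.
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, ys)) + b
    return 1 if score >= 0 else -1
```

The signed score before taking sgn is the distance-like quantity that the solution estimation unit 17 later uses as the degree of likelihood of a solution.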
The support vector machine method handles data with two classifications. Therefore, when handling cases with three or more classifications, it is usually combined with a technique such as the pairwise method or the one-versus-rest method.
[0078] In the pairwise method, for data with n classifications, all pairs of two different classifications (n(n−1)/2 pairs) are generated; for each pair, the better classification is determined by a binary classifier, that is, a support vector machine processing module, and finally the classification is determined by majority vote over the classifications produced by the n(n−1)/2 binary classifications.
[0079] In the one-versus-rest method, when there are three classifications a, b, and c, for example, three sets are generated: classification a versus the rest, classification b versus the rest, and classification c versus the rest, and each set is trained with the support vector machine method. In the estimation processing based on the learning results, the learning results of the three support vector machines are used. By examining how the binary relation candidate to be estimated is classified by the three support vector machines, the classification that is on the non-rest side and for which the candidate is farthest from the separating hyperplane of the support vector machine is taken as the solution. For example, if a candidate is farthest from the separating hyperplane of the support vector machine created by the learning process for the set "classification a versus the rest", the classification of that candidate is estimated to be a.
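The two multiclass schemes of paragraphs [0078] and [0079] can be sketched as follows. This is a minimal illustration: each underlying binary classifier is assumed to be supplied as a callable, standing in for a trained support vector machine.

```python
from collections import Counter

def pairwise_vote(candidate, pair_classifiers):
    """Pairwise method: `pair_classifiers[(a, b)]` is a binary
    classifier returning the winner between classifications a and b;
    the final classification is the majority vote over all
    n(n-1)/2 pair decisions."""
    votes = Counter(clf(candidate) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]

def one_vs_rest(candidate, margin_by_class):
    """One-versus-rest method: `margin_by_class[c]` returns the signed
    distance of the candidate from the separating hyperplane of the
    classifier 'c versus the rest'; the classification whose hyperplane
    the candidate is farthest from on the non-rest side is chosen."""
    return max(margin_by_class, key=lambda c: margin_by_class[c](candidate))
```

Both reduce an n-class decision to binary support vector machine decisions, differing only in how the binary outputs are combined.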
[0080] Thereafter, the candidate extraction unit 15 extracts binary relation candidates from the newly input text data 2. Specifically, the text data 2 is divided into sentences, and expressions (character strings) that can be elements of a binary relation are extracted from each sentence. Then, whether two or more such expressions exist in one sentence is checked, and every combination (pair) of two binary relation elements in a sentence is generated as a binary relation candidate.
[0081] Alternatively, the new text data 2 may be divided into paragraphs, expressions that can be binary relation elements may be extracted from each paragraph, and, for paragraphs containing two or more elements, every combination (pair) of two elements may be generated as a binary relation candidate. Alternatively, expressions that can be binary relation elements may be extracted from one document of the text data 2, and every combination (pair) of two elements may be generated as a binary relation candidate.
[0082] As the technique for extracting expressions that can be binary relation elements from the text data 2, the techniques described above for generating the teacher data are used: for example, extracting expressions that match patterns or dictionary entries, or extracting expressions estimated on the basis of the learning results of supervised machine learning.
[0083] When two or more elements appear in one sentence of the text data 2, each pair of those elements is taken as a binary relation candidate. When three or more elements appear in one sentence, every pairwise combination of the elements is taken as a binary relation candidate.
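The candidate generation of paragraphs [0080] to [0083] can be sketched as follows; a minimal illustration in which `recognize_elements` is an assumed stand-in for the pattern, dictionary, or machine-learning based element recognizer described above.

```python
from itertools import combinations

def extract_candidates(sentences, recognize_elements):
    """For each sentence, find element mentions and emit every unordered
    pair as a binary relation candidate.

    `recognize_elements` is any function returning the element strings
    found in one sentence (e.g. a dictionary or pattern matcher).
    """
    candidates = []
    for sentence in sentences:
        elements = recognize_elements(sentence)
        if len(elements) >= 2:
            # All pairwise combinations of elements in the sentence.
            candidates.extend(combinations(elements, 2))
    return candidates
```

The same loop applies unchanged at the paragraph or document level; only the unit of text passed in differs.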
[0084] Then, the feature extraction unit 16 extracts from each binary relation candidate the same features as the solution-feature pair extraction unit 12, by the same processing.
[0085] Based on the learning results stored in the learning result storage unit 14, the solution estimation unit 17 estimates, for each binary relation candidate, how likely the candidate's feature set is to yield the positive solution. Based on the estimation results of the solution estimation unit 17, the binary relation extraction unit 18 outputs, from among the binary relation candidates, those with a high degree of likelihood of being a positive solution as extracted binary relations.
[0086] In this example, the above features were extracted, and the support vector machine method was used as the machine learning processing. When the accuracy was examined using 10-fold cross-validation, an accuracy of F-measure = 47.5% was obtained. The F-measure is the harmonic mean of the recall and the precision. The recall is the proportion of the binary relations that should be extracted from the text data 2 that were actually output. The precision is the proportion of the binary relations extracted by the binary relation extraction device 1 that were binary relations that should be extracted.
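The evaluation measures of paragraph [0086] can be computed as follows; a minimal sketch in which extracted and gold-standard relations are assumed to be comparable tuples.

```python
def f_measure(extracted, gold):
    """Harmonic mean of precision (correct / extracted) and recall
    (correct / gold), as defined in the text."""
    correct = len(set(extracted) & set(gold))
    if not extracted or not gold or not correct:
        return 0.0
    precision = correct / len(set(extracted))
    recall = correct / len(set(gold))
    return 2 * precision * recall / (precision + recall)
```

For instance, if one of two extracted relations is correct and the gold standard holds two relations, both precision and recall are 0.5, so the F-measure is 0.5.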
[0087] In the binary relation extraction device 1, the machine learning unit 13, based on a predetermined machine learning algorithm and using the given teacher data, performs machine learning on the pairs of each binary relation's solution and feature set to determine what kind of solution results from what kind of feature set, and stores information indicating this in the learning result storage unit 14 as learning result information; the solution estimation unit 17 then estimates, based on this learning result information, the degree of likelihood of the solution for the feature set of each binary relation candidate.
[0088] When the k-nearest neighbor method is used as the machine learning method in the binary relation extraction device 1, the machine learning unit 13 defines the similarity between cases of the teacher data based on the proportion of overlapping features among the feature sets extracted from those cases (the proportion of features they have in common), and stores the defined similarity and the cases in the learning result storage unit 14 as learning result information.
[0089] Then, when new text data 2 is input, the solution estimation unit 17 refers to the similarity definition and the cases in the learning result storage unit 14, selects, for each binary relation candidate extracted from the text data 2, the k cases most similar to the candidate from the cases in the learning result storage unit 14, and estimates the classification determined by majority vote among the selected k cases as the classification (solution) of the binary relation candidate. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the number of majority votes among the selected k cases, here the number of votes obtained by the classification "should be extracted". When the simple Bayes method is used as the machine learning method, the machine learning unit 13 stores, for each case of the teacher data, the pair of the case's solution and feature set in the learning result storage unit 14 as learning result information. Then, when new text data 2 is input, the solution estimation unit 17, based on the pairs of solutions and feature sets in the learning result information of the learning result storage unit 14, calculates, using Bayes' theorem, the probability of each classification for the feature set of the binary relation candidate obtained by the feature extraction unit 16, and estimates the classification with the largest probability value as the classification (solution) of the candidate's features. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the probability of each classification, here the probability of the classification "should be extracted".
[0090] When the decision list method is used as the machine learning method, the machine learning unit 13 stores, for the cases of the teacher data, a list in which rules pairing features with classifications are arranged in a predetermined priority order in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 compares the features of each binary relation candidate extracted from the text data 2 with the features of the rules in descending order of priority in the list of the learning result storage unit 14, and estimates the classification of the first rule whose feature matches as the classification (solution) of the candidate. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the predetermined priority order, or a numerical value or measure corresponding to it, here the priority in the list of the rules yielding the classification "should be extracted".
[0091] When the maximum entropy method is used as the machine learning method, the machine learning unit 13 identifies the classifications that can be solutions from the cases of the teacher data, obtains the probability distribution over pairs of feature sets and possible classifications that maximizes the expression representing the entropy while satisfying the predetermined conditional expression, and stores it in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 uses the probability distribution in the learning result storage unit 14 to obtain, for the feature set of each binary relation candidate extracted from the text data 2, the probability of each possible classification, identifies the possible classification with the largest probability value, and estimates the identified classification as the candidate's solution. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the probability of each classification, here the probability of the classification "should be extracted".
[0092] When the support vector machine method is used as the machine learning method, the machine learning unit 13 identifies the classifications that can be solutions from the cases of the teacher data, divides the classifications into positive and negative examples, and, in the space whose dimensions are the features of the cases, obtains, according to a predetermined execution function using a kernel function, the hyperplane that maximizes the interval between the positive and negative examples and divides the positive examples from the negative examples, and stores it in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 uses the hyperplane in the learning result storage unit 14 to determine on which side, positive or negative, the feature set of each binary relation candidate extracted from the text data 2 falls in the space divided by the hyperplane, and estimates the classification determined by that result as the candidate's solution. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the distance from the separating hyperplane into the space of the positive examples (binary relations that should be extracted). More specifically, when binary relations that should be extracted are positive examples and binary relations that should not be extracted are negative examples, a case located in the space on the positive-example side of the separating hyperplane is judged to be a "case that should be extracted", and the distance of the case from the separating hyperplane is taken as the degree of that case.
[0093] The solution-feature pair extraction unit 12 may also use, for example, "the words of the two elements themselves" as features, or may use as features "the first and second words/character strings from the front of each element and the first and second words/character strings from the rear". In the case of Fig. 3(A), the features are:
"the first element is "presenilin (PS) 1";
the second element is "delta-catenin";
the first word of the first element is "presenilin";
its second word is "(PS)";
the second word from the end of the first element is "(PS)";
the first word from its end is "1";
the first word of the second element is "delta";
its second word is "-";
the second word from the end of the second element is "-";
the first word from its end is "catenin"".
[0094] Alternatively:

"the first character of the first element is "p";
its first two characters are "pr";
its first three characters are "pre";
its last character is "1";
its last two characters are "space, 1";
its last three characters are "), space, 1";
the first character of the second element is "d";
its first two characters are "de";
its first three characters are "del";
its last character is "n";
its last two characters are "in";
its last three characters are "nin"".
[0095] When the two words before and after each element and their part-of-speech information are used as features, the features are:

"the word two before the first element is "interaction";
its part of speech is "noun";
the word one before is "with";
its part of speech is "preposition";
the word one after is "and";
its part of speech is "conjunction";
the word two after is "cloned";
its part of speech is "verb";
the word two before the second element is "of";
its part of speech is "preposition";
the word one before is "human";
its part of speech is "noun";
the word one after is "which";
its part of speech is "pronoun";
the word two after is "encoded";
its part of speech is "verb"".
[0096] When the number of words between the two elements is used as a feature representing the distance between them, the information "the distance between the two elements is 9" becomes a feature.
[0097] When the states are used as features such that 0 to 1 words between the two elements is "small distance", 2 to 4 words is "medium distance", 5 to 9 words is "large distance", and 10 or more words is "extra-large distance", the information "the distance between the two elements is "large distance"" becomes a feature.
[0098] When whether there is another element between the two elements is used as a feature, the information "there is no other element between the two elements" becomes a feature.
[0099] Furthermore, when different types of terms are set as the elements of a binary relation, the order of appearance of the elements may be used as a feature. For example, in the case of a binary relation between a disease name and a treatment method, the information "the first element is a disease name and the second element is a treatment method" or "the first element is a treatment method and the second element is a disease name" becomes a feature.
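The context-word and binned-distance features of paragraphs [0095] to [0097] can be sketched together as follows; a minimal illustration assuming the sentence is already tokenized and each element occupies a single token position.

```python
def extract_features(tokens, i1, i2):
    """Build a feature set for an element pair: the elements themselves,
    the neighboring words, the words between the elements, and a binned
    inter-element distance. `i1 < i2` are token indices of the elements."""
    feats = set()
    feats.add("elem1=" + tokens[i1])
    feats.add("elem2=" + tokens[i2])
    # Up to three words preceding the first element.
    for w in tokens[max(0, i1 - 3):i1]:
        feats.add("before1=" + w)
    # Up to three words following the second element.
    for w in tokens[i2 + 1:i2 + 4]:
        feats.add("after2=" + w)
    # All words between the two elements.
    for w in tokens[i1 + 1:i2]:
        feats.add("between=" + w)
    # Binned distance, using the thresholds given in the text.
    n_between = i2 - i1 - 1
    if n_between <= 1:
        feats.add("dist=small")
    elif n_between <= 4:
        feats.add("dist=medium")
    elif n_between <= 9:
        feats.add("dist=large")
    else:
        feats.add("dist=extra-large")
    return feats
```

Part-of-speech, character n-gram, and element-order features would be added in the same style as further string-valued entries of the set.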
[0100] By being given, as teacher data, examples of various binary relations other than the binary relation between expressions of interacting proteins, such as the binary relation between a disease name and a treatment method, between a disease name and a protein expression, between a disease name and an organ, between a disease name and an animal species, between a disease name and a related chemical substance, or between a protein expression and the experimental methods that have been applied to that protein, the binary relation extraction device 1 can extract these corresponding binary relations from text data 2 of biomedical papers.
[0101] For example, text data containing binary relations such as the following can be used as teacher data.
"Oral corticosteroids (element: treatment method) are the preference of many for the treatment of CIDP (element: disease name), being much less expensive than IVIG (element: treatment method) infusion or TA (element: treatment method)."
"In the CIDP (element: disease name) patient, the IgG antibody (element: protein expression) titer to GD3 (element: chemical substance expression) was remarkably elevated (titer, 1:10,000), indicating maximal avidity to the tetrasaccharide epitope (-NeuAcalpha2-8NeuAcalpha2-3Galbeta1-4Glc-)."
"Ciliated metaplasia (CM) in the stomach (element: organ name) is mainly found in gastric mucosa (element: organ name) that harbours gastric cancer (element: disease name)."
"Variant Creutzfeldt-Jakob disease (CJD) (element: disease name) is a transmissible spongiform encephalopathy believed to be caused by the bovine (element: animal species) spongiform encephalopathy agent, an abnormal isoform of the prion protein (PrP(sc)) (element: protein expression)."
"AIDP (element: disease name) and CIDP (element: disease name) having specific antibodies to the carbohydrate epitope (-NeuAcalpha2-8NeuAcalpha2-3Galbeta1-4Glc-) of gangliosides (element: chemical substance expression)."
"Gene expression in archived frozen sural nerve biopsies of patients with chronic inflammatory demyelinating polyneuropathy (CIDP) (element: disease name) was compared to that in vasculitic nerve biopsies (VAS) and to normal nerve (NN) by DNA microarray technology (element: experimental method)."
"This novel interaction was identified in a yeast two-hybrid screen (element: experimental method) using PrP(C) (element: protein expression) as bait and confirmed by an in vitro binding assay and co-immunoprecipitations."
"Comparative study of the PrP(BSE) (element: protein expression) distribution in brains (element: organ name) from BSE (element: disease name) field cases using rapid tests (element: test method)."
It is also possible, for example, to extract the pair of a company's product name and the reputation of that product (for example, information such as a good or bad reputation) as a binary relation.
[0102] As described above, according to the binary relation extracting device 1 of the present invention, merely by preparing, as teacher data for machine learning processing, text data annotated with an evaluation (solution) of whether each instance is a binary relation to be extracted, it becomes possible to automatically extract from new text data the binary relations estimated to be worth extracting. This avoids the complexity of generating the patterns used in binary relation extraction processing. Furthermore, improvements in the accuracy of supervised machine learning can be expected to improve the performance of the binary relation extraction processing.
[0103] Next, an embodiment of the information retrieval device 4 of the present invention will be described.
[0104] The information retrieval device 4 regards the relation between the two search keywords of an AND search as a potentially meaningful binary relation. Using teacher data in which binary relations whose elements are such search keywords are tagged with one of two solutions, namely that the relation should be extracted (positive) or that it should not be extracted (negative), the device performs machine learning, and outputs as the search result 6 those articles in the search text data 5 to be searched that contain the two search keywords and in which the keyword pair is estimated to be a binary relation to be extracted.
[0105] Fig. 6 shows a configuration example of the information retrieval device 4 according to the present invention. The information retrieval device 4 comprises an information retrieval unit 40, a teacher data storage unit 41, a solution-feature pair extraction unit 42, a machine learning unit 43, a learning result storage unit 44, a candidate extraction unit 45, a feature extraction unit 46, a solution estimation unit 47, and a search result extraction unit 48.
[0106] The teacher data storage unit 41, solution-feature pair extraction unit 42, machine learning unit 43, learning result storage unit 44, candidate extraction unit 45, feature extraction unit 46, and solution estimation unit 47 of the information retrieval device 4 are processing means that perform the same processing as the teacher data storage unit 11, solution-feature pair extraction unit 12, machine learning unit 13, learning result storage unit 14, candidate extraction unit 15, feature extraction unit 16, and solution estimation unit 17 of the binary relation extracting device 1 shown in Fig. 1, respectively.
[0107] The information retrieval unit 40 searches the search text data 5 using the search keywords given to the AND search processing, and acquires the matching articles (text data).
[0108] The candidate extraction unit 45 extracts, as binary relation candidates, pairs of character strings (words) identical to the two search keywords contained in the articles acquired by the information retrieval unit 40.
[0109] Based on the estimation results of the solution estimation unit 47, the search result extraction unit 48 extracts, from the binary relation candidates in the articles retrieved from the search text data 5, those whose estimated likelihood of the positive solution (being a binary relation to be extracted) is better than a predetermined level, and outputs the articles containing the extracted binary relation candidates, or information identifying those articles, as the search result 6.
[0110] Fig. 7 shows the processing flow of the information retrieval device 4. The teacher data storage unit 41 of the information retrieval device 4 stores, as teacher data, text data containing examples in which a binary relation whose elements are the two search keywords given to the AND search processing is annotated with one of two "solutions": that it is a binary relation to be extracted (positive) or that it is not (negative).
[0111] First, the solution-feature pair extraction unit 42 extracts predetermined features for each example from the teacher data in the teacher data storage unit 41, and generates pairs of the solution (the information assigned by the tag) and the set of extracted features (step S11). The solution-feature pair extraction unit 42 extracts the binary relations from the teacher text data by means of the predetermined tags, and extracts the predetermined features for the elements (search keywords) of each extracted binary relation by performing morphological analysis processing, syntactic analysis processing, and computation of element appearance positions and inter-element distances.
[0112] Then, from the pairs of solutions and feature sets generated by the solution-feature pair extraction unit 42, the machine learning unit 43 learns by a machine learning method which solutions (positive or negative) tend to arise for which sets of features, and stores the learning result in the learning result storage unit 44 (step S12). As the supervised machine learning method, the machine learning unit 43 uses, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
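The patent does not give an implementation of this learning step, but one of the methods it names, the simple Bayes (naive Bayes) method, can be sketched over solution-feature pairs as follows. All function names and the toy feature strings are assumptions for illustration.

```python
from collections import defaultdict
import math

def train_naive_bayes(examples):
    """examples: list of (feature_set, solution) pairs,
    solution being "positive" or "negative"."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for feats, sol in examples:
        class_count[sol] += 1
        for f in feats:
            feat_count[sol][f] += 1
            vocab.add(f)
    return class_count, feat_count, vocab

def score(model, feats, sol):
    """Log-probability of solution sol given the feature set,
    with add-one smoothing over the feature vocabulary."""
    class_count, feat_count, vocab = model
    total = sum(class_count.values())
    logp = math.log(class_count[sol] / total)
    denom = sum(feat_count[sol].values()) + len(vocab)
    for f in feats:
        logp += math.log((feat_count[sol][f] + 1) / denom)
    return logp

def classify(model, feats):
    """Return the solution with the higher estimated likelihood."""
    return max(("positive", "negative"), key=lambda s: score(model, feats, s))
```

The per-class log-scores produced by `score` also serve as the "degree of likelihood" of each solution that the later estimation step (step S16) relies on.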
[0113] Thereafter, the candidate extraction unit 45 generates all pairwise combinations of the two input search keywords given to the AND search processing (step S13). The information retrieval unit 40 performs AND search processing on the search text data 5 using the pairs of input search keywords and extracts the articles (text data) containing the input search keyword pairs, and the candidate extraction unit 45 extracts, as binary relation candidates, all pairwise combinations of the input search keywords appearing in the articles extracted by the search processing (step S14).
[0114] Then, by processing substantially the same as that of the solution-feature pair extraction unit 42, the feature extraction unit 46 extracts a predetermined set of features for each binary relation candidate appearing in the retrieved articles (step S15).
[0115] For each candidate, the solution estimation unit 47 estimates, based on the learning results in the learning result storage unit 44, which solution is likely for that candidate's set of features, that is, the degree to which it is "likely to be positive" or "likely to be negative" (step S16). The search result extraction unit 48 then selects, as binary relations to be extracted, those candidates estimated to be "likely to be positive" to a degree better than a predetermined level, and outputs the articles containing these binary relations, or information identifying those articles, as the search result 6 (step S17).
[0116] Next, a specific example of the information retrieval processing of the present invention will be described. In this example, the information retrieval device 4 uses as teacher data text data, taken from the search text data 5, that contains binary relations whose elements are character strings that can serve as the two search keywords of an AND search. It then creates binary relation candidates whose elements are the input search keywords given to the AND search processing, searches the search text data 5 using these binary relation candidates, and extracts articles. It estimates whether the binary relation candidates formed by the input search keywords contained in each retrieved article should be extracted, and outputs as the search result 6 the articles containing binary relation candidates whose estimated degree of being ones to extract is good.
[0117] Assume that 「京大」 (Kyoto University) and 「総長」 (president) are set as the search keywords of an AND search. Whether the binary relation between the search keywords is positive or negative is judged by a human, and a tag indicating the positive or negative solution is assigned manually. Teacher data containing both positive and negative examples is therefore used in the machine learning processing.
[0118] Figs. 8 to 10 show examples of the teacher data stored in the teacher data storage unit 41 and of the features extracted from that teacher data by the solution-feature pair extraction unit 42. In this example, the teacher data D1 and D2 in Figs. 8 and 9 are given tags indicating that the solution is positive, i.e., that the binary relation should be extracted. The teacher data D3 in Fig. 10 is given a tag indicating that the solution is negative, i.e., that the binary relation should not be extracted.
[0119] The teacher data D1 in Fig. 8 contains the binary relation pair P3, a pair of the two search keywords: the binary relation (pair) P3 consists of the first element p1 (search key K1) 「京大」 and the second element p2 (search key K2) 「総長」, and the pair P3 is given the positive solution.
[0120] Similarly, the teacher data D2 in Fig. 9 contains the binary relation pair P4, a pair of the two search keywords: the binary relation (pair) P4 consists of the first element p1 (search key K1) 「京大」 and the second element p2 (search key K2) 「総長」, and the pair P4 is given the positive solution. This is because the teacher data of Figs. 8 and 9 can be judged to be about 「京大の総長」 (the president of Kyoto University).
[0121] The teacher data D3 in Fig. 10 contains the binary relation pair P5, a pair of the two search keywords: the binary relation (pair) P5 consists of the first element p1 (search key K1) 「京大」 and the second element p2 (search key K2) 「総長」, and the pair P5 is given the negative solution. This is because, although 「京大」 and 「総長」 both appear in the same data, they bear no relation to each other, and the data can be judged not to be about "the president of Kyoto University."
[0122] The solution-feature pair extraction unit 42 extracts pairs of a solution and a set of features from the teacher data examples stored in the teacher data storage unit 41. For example, the two words before and after each element (search keyword), both the words themselves and their parts of speech, are used as features. Taking the teacher data D1 as an example, the features are:
the word two before the first element is 「今日」 (today);
the part of speech of that word is "noun";
the word one before the first element is 「,」;
the part of speech of that word is "comma";
the word one after the first element is 「で」;
the part of speech of that word is "particle";
the word two after the first element is 「の」;
the part of speech of that word is "particle";
the word two before the second element is 「で」;
the part of speech of that word is "particle";
the word one before the second element is 「,」;
the part of speech of that word is "comma";
the word one after the second element is 「が」;
the part of speech of that word is "particle";
the word two after the second element is 「出席」 (attendance);
and the part of speech of that word is "noun".
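The surrounding-word features listed above can be sketched as follows. The helper names are hypothetical, and a real implementation would take its tokens and part-of-speech tags from a morphological analyzer rather than from hand-built lists.

```python
def context_features(tokens, pos_tags, idx, prefix):
    """Features for the element at token position idx:
    the two words before and after it, plus their parts of speech."""
    feats = []
    for offset in (-2, -1, 1, 2):
        j = idx + offset
        if 0 <= j < len(tokens):
            feats.append(f"{prefix}:word[{offset:+d}]={tokens[j]}")
            feats.append(f"{prefix}:pos[{offset:+d}]={pos_tags[j]}")
    return feats

def pair_features(tokens, pos_tags, i1, i2):
    """Combined context features for the two elements of a candidate pair."""
    return (context_features(tokens, pos_tags, i1, "e1")
            + context_features(tokens, pos_tags, i2, "e2"))
```

For a tokenized sentence in which 「京大」 is the first element and 「総長」 the second, `pair_features` yields strings such as "e1:word[-2]=今日" and "e2:word[+2]=出席", which play the role of the feature list above.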
[0123] Note that the solution-feature pair extraction unit 42 can also extract as features the kinds of information described for the binary relation extraction processing.
[0124] Based on these pairs of solutions and feature sets, the machine learning unit 43 performs machine learning processing to determine which solution (positive/negative) tends to arise for which set of features, and stores the learning result in the learning result storage unit 44. As the supervised machine learning method, the machine learning unit 43 uses one of the processing methods described above, for example the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
[0125] Thereafter, the information retrieval unit 40 performs an AND search on the search text data 5 based on the given input search keywords 「京大」 and 「総長」, and acquires the articles containing the input search keywords. The candidate extraction unit 45 then extracts binary relation candidates from the retrieved articles; specifically, it extracts them from the input search keywords contained in the articles returned by the AND search. The feature extraction unit 46 extracts from each binary relation candidate the same features as the solution-feature pair extraction unit 42, and the solution estimation unit 47 estimates, based on the learning results stored in the learning result storage unit 44, the degree to which each binary relation candidate is likely to be positive or negative given its set of features. Based on the estimation results of the solution estimation unit 47, the search result extraction unit 48 extracts from the binary relation candidates those with a good estimated likelihood of being positive, and outputs the articles containing these binary relations, or information identifying those articles, as the search result 6.
[0126] For example, the candidate extraction unit 45 generates all pairwise combinations of the given input search keywords and treats the generated pairs as binary relation candidates. The information retrieval unit 40 then performs AND search processing using the elements of each binary relation candidate (the two input search keywords), and the feature extraction unit 46 extracts a predetermined set of features for each binary relation candidate appearing in the retrieved articles.
[0127] Based on the learning results in the learning result storage unit 44, the solution estimation unit 47 estimates, for each binary relation candidate, the likelihood of each solution given the candidate's set of features. When each binary relation candidate formed by a pair of input search keywords appears only once in a retrieved article, and all of those binary relation candidates are estimated to have a good degree of being positive (to be extracted), the article, or information identifying it, is taken as the search result 6.
[0128] When a binary relation formed by a pair of input search keywords appears more than once in a retrieved article, the condition is that at least one of the multiple appearing candidates is estimated to have a good degree of being positive (to be extracted); when every binary relation candidate satisfies all of the above conditions and is estimated to have a good positive degree, the article, or information identifying it, is taken as the search result 6.
[0129] Further, the candidate extraction unit 45 generates all pairs of two input search keywords from the given input search keywords and treats the generated pairs as binary relation candidates. The information retrieval unit 40 performs AND search processing using the elements of each binary relation candidate (the two input search keywords), and the feature extraction unit 46 extracts a predetermined set of features for each binary relation candidate appearing in the retrieved articles.
[0130] Based on the learning results in the learning result storage unit 44, the solution estimation unit 47 estimates, for each binary relation candidate, the likelihood of each solution given the candidate's set of features. When each binary relation candidate formed by a pair of input search keywords appears only once in a retrieved article, the positive (to-be-extracted) degree is estimated for all of those candidates, and the product of the estimated positive degrees over all the candidates is taken as the positive degree of the article. The articles estimated to have a good positive degree, or information identifying them, are then taken as the search result 6.
[0131] When a binary relation formed by a pair of input search keywords appears more than once in a retrieved article, the positive degree is estimated for each of the appearing candidates, and the best of those estimated degrees is taken as the degree of that binary relation candidate. The degree of each binary relation is obtained in this way, and the product of the obtained degrees is taken as the positive degree of the article. The articles estimated to have a good positive degree, or information identifying them, are then taken as the search result 6.
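A minimal sketch of the article scoring described in paragraphs [0130] and [0131]: for each keyword-pair relation the best degree among its occurrences in the article is taken, and the per-pair degrees are multiplied to give the article's positive degree. The function name and the assumption that degrees are given as probabilities in [0, 1] are illustrative, not from the patent.

```python
def article_score(candidate_degrees):
    """candidate_degrees maps each keyword pair to the list of
    estimated positive degrees of its occurrences in one article.
    The article score is the product, over the pairs, of the best
    degree observed for each pair."""
    score = 1.0
    for degrees in candidate_degrees.values():
        score *= max(degrees)
    return score
```

Articles whose score exceeds a threshold (or the highest-scoring articles) would then be returned as the search result 6.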
[0132] As described above, according to the information retrieval device 4 of the present invention, merely by preparing, as teacher data for machine learning processing, text data in which the binary relations between the two search keywords of AND search processing are annotated with an evaluation of whether they are binary relations to be extracted, it becomes possible to automatically extract from new search text data 5 the articles containing binary relations judged worth extracting.
[0133] By evaluating, through the binary relation extraction processing, the relations between the search keywords appearing in the articles returned by AND search processing, the information retrieval device 4 of the present invention can exclude articles that were hit merely because they contain the search keywords but in which the keywords are only weakly related to each other, so that the content is irrelevant and deviates, so to speak, from the search intent. Furthermore, improvements in the accuracy of supervised machine learning can be expected to improve the performance of the information retrieval processing.
[0134] In the above embodiments, examples of binary relations composed of two elements were described for the binary relation extraction processing and the information retrieval processing. The present invention can also be applied to ternary relations composed of three elements.
[0135] For example, in the binary relation extracting device 1, data containing ternary relations of three elements is prepared as teacher data. The solution-feature pair extraction unit 12 then takes as the features of such a ternary relation, for example, the word information of the two words preceding the first element (the element that appears first), the two words following the third element (the element that appears last), all the words between the first element and the second element (the element that appears in the middle), and all the words between the second element and the third element. The machine learning unit 13 can thereby learn the likelihood of each solution from the sets of ternary relation features, and the binary relation extraction unit 18 can handle the extraction of ternary relations. As in the binary case, the solution given to a ternary relation is either "a ternary relation to be extracted" or "a ternary relation not to be extracted."
[0136] Alternatively, in the binary relation extracting device 1, data containing ternary relations of three elements is prepared as teacher data, and each processing means of the binary relation extracting device 1 treats each of the binary relations obtained by decomposing the ternary relation of the teacher data (the binary relation between the first and second elements, between the second and third elements, and between the first and third elements) as a separate binary relation. For every one of these binary relations, the degree of the solution that the relation should be extracted is computed, and the value obtained by multiplying the computed degrees together is taken as the degree of the ternary relation. Ternary relations with a large degree are then taken out as the ternary relations to be extracted.
[0137] At this time, when the machine learning unit 13 uses the support vector machine method, the classification targets are two (positive or negative), so the ternary relations are machine-learned using the pairwise method or the one-versus-rest method.
[0138] Also, the binary relation extraction unit 18 is configured to obtain a confidence of extraction when extracting the binary relations 3. As the confidence of a ternary relation created by combining several binary relations, the product of the confidences of the combined binary relations is used, and ternary relations with a large confidence are taken out. The confidence of a binary relation uses the confidence computed in ordinary machine learning processing.
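The ternary confidence of paragraph [0138], the product of the confidences of the three constituent binary relations, can be sketched as follows. The names are hypothetical, and the confidences are assumed to be given as probability-like values.

```python
from itertools import combinations

def ternary_confidence(binary_confidence, elements):
    """Confidence of a ternary relation over three elements, computed
    as the product of the confidences of its constituent binary
    relations: (1st, 2nd), (1st, 3rd), and (2nd, 3rd)."""
    conf = 1.0
    for a, b in combinations(elements, 2):
        conf *= binary_confidence[(a, b)]
    return conf
```

Ternary relations whose confidence is large would then be taken out as the relations to be extracted.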
[0139] Such ternary relation extraction processing can likewise be performed in the information retrieval device 4. For example, when searching for articles about "the president of Kyoto University in Heisei 12 (the year 2000)," data containing ternary relations of the three search keywords 「平成12年」, 「京大」, and 「総長」 is given as teacher data, and the search result 6 of an AND search using these three search keywords is output from the search text data 5.
[0140] In this example, the solution information given to the binary or ternary relations of the examples was described as "positive (a binary relation to be extracted)" or "negative (a binary relation not to be extracted)," but the solution information may instead be multi-class, for example "interacting," "counteracting," and "no effect."
[0141] 以上,本発明をその実施の形態により説明したが,本発明はその主旨の範囲におい て種々の変形が可能であることは当然である。  [0141] While the present invention has been described with reference to the embodiments, it is obvious that the present invention can be variously modified within the scope of the gist thereof.
[0142] また,本発明は,コンピュータにより読み取られ実行されるプログラムとして実施することができる。本発明を実現するプログラムは,コンピュータが読み取り可能な,可搬媒体メモリ,半導体メモリ,ハードディスクなどの適当な記録媒体に格納することができ,これらの記録媒体に記録して提供され,または,通信インタフェースを介して種々の通信網を利用した送受信により提供されるものである。  [0142] The present invention can also be implemented as a program read and executed by a computer. A program realizing the present invention can be stored in an appropriate computer-readable recording medium such as a portable memory medium, a semiconductor memory, or a hard disk, and is provided either recorded on such a recording medium or by transmission and reception over various communication networks via a communication interface.

Claims

請求の範囲  The scope of the claims
[1] コンピュータが読み取り可能な記憶装置に格納された文データ中に出現する二項 関係を,機械学習処理を用いて抽出する処理装置であって,  [1] A processing device that extracts binary relations that appear in sentence data stored in a computer-readable storage device using machine learning processing.
問題と解との組で構成される事例であって,問題が文データ中に出現する二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段と,  teacher data storage means storing teacher data that includes cases each consisting of a pair of a problem and a solution, the problem being a binary relation appearing in the sentence data and the solution indicating that the binary relation is to be extracted;
前記教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出手段と,  solution-feature pair extraction means for retrieving the cases from the teacher data storage means, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習手段と,  machine learning means for performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
前記記憶装置に格納されたテキストデータから,前記二項関係の要素を抽出し,前記要素で構成される対を抽出し,前記抽出した対を二項関係の候補とする候補抽出手段と,  candidate extraction means for extracting the elements of the binary relation from the text data stored in the storage device, extracting pairs composed of the elements, and using the extracted pairs as binary relation candidates;
前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出手段と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定手段と,  feature extraction means for extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択する二項関係抽出手段とを備える  and binary relation extraction means for selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level,
ことを特徴とする二項関係抽出装置。  A binary relation extraction apparatus characterized by that.
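As an illustrative, non-normative sketch of how the means of claim 1 fit together: the feature definition (pair elements plus intervening words) and the frequency-based stand-in learner are assumptions for demonstration only; the claim covers any machine learning algorithm.

```python
def extract_features(pair, sentence):
    """Illustrative feature extractor: the pair elements themselves plus the
    words appearing between them (the claim leaves the feature set open)."""
    words = sentence.split()
    i, j = sorted((words.index(pair[0]), words.index(pair[1])))
    return frozenset(pair) | frozenset(words[i + 1:j])

def learn(examples):
    """Stand-in 'learning result': per feature, how often it co-occurred
    with a positive ('extract') solution in the teacher data."""
    pos, total = {}, {}
    for feats, label in examples:
        for f in feats:
            total[f] = total.get(f, 0) + 1
            if label:
                pos[f] = pos.get(f, 0) + 1
    return {f: pos.get(f, 0) / total[f] for f in total}

def estimate(model, feats):
    """Degree to which the candidate's feature set suggests the positive
    solution: mean per-feature positive rate (0.5 for unseen features)."""
    scores = [model.get(f, 0.5) for f in feats]
    return sum(scores) / len(scores)

def extract_relations(model, candidates, threshold=0.5):
    """Select candidates whose estimated degree exceeds the threshold."""
    return [pair for pair, feats in candidates
            if estimate(model, feats) > threshold]
```

The same skeleton applies to the method claims (claim 15) and the program claims (claim 17); only the claimed category differs.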
[2] 前記教師データ記憶手段は,前記事例として,問題の二項関係が,抽出するべき二項関係であることを示す正の解が与えられた正の事例と,問題の二項関係が,抽出するべきではない二項関係であることを示す負の解が与えられた負の事例とを含む教師データが格納される  [2] wherein the teacher data storage means stores teacher data including, as the cases, positive cases given a positive solution indicating that the binary relation of the problem is a binary relation to be extracted, and negative cases given a negative solution indicating that the binary relation of the problem is a binary relation that should not be extracted,
ことを特徴とする請求項 1記載の二項関係抽出装置。  The binary relation extraction device according to claim 1.
[3] 前記機械学習手段は,前記教師データから,前記所定の情報である素性の集合と解を示す情報との対で構成した規則を設定し,前記規則を所定の順序でリスト上に並べたものを学習結果とし,前記規則のリストを学習結果情報として前記学習結果記憶手段に格納し,  [3] wherein the machine learning means sets rules each composed of a pair of a feature set, which is the predetermined information, and information indicating a solution, takes a list of the rules arranged in a predetermined order as the learning result, and stores the list of rules in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記規則のリストを先頭からチェックして,前記二項関係の候補から抽出された素性の集合と一致する規則を検出し,検出した規則の解を示す情報をもとに,前記二項関係の候補の解として推定する  the solution estimation means checks, from the top, the list of rules that is the learning result information stored in the learning result storage means, detects a rule matching the set of features extracted from the binary relation candidate, and estimates the solution of the binary relation candidate based on the information indicating the solution of the detected rule,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
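A sketch of claim 3's decision-list learning: rules pair a feature set with a solution, are ordered (here by observed reliability, one concrete choice for the claim's "predetermined order"), and are checked from the top until one matches. Single-feature rules and the reliability ordering are illustrative assumptions.

```python
def build_decision_list(examples):
    """Build a decision list of (feature set, solution) rules from teacher
    data, ordered by how reliably each feature predicted one solution."""
    stats = {}
    for feats, label in examples:
        for f in feats:
            pos, neg = stats.get(f, (0, 0))
            stats[f] = (pos + 1, neg) if label else (pos, neg + 1)
    rules = []
    for f, (pos, neg) in stats.items():
        label = pos >= neg
        reliability = max(pos, neg) / (pos + neg)
        rules.append((reliability, frozenset([f]), label))
    rules.sort(key=lambda r: -r[0])
    rules.append((0.0, frozenset(), False))  # default rule at the end
    return [(feats, label) for _, feats, label in rules]

def apply_decision_list(rules, feats):
    """Check rules from the top; the first rule whose feature set is
    contained in the candidate's features gives the estimated solution."""
    for rule_feats, label in rules:
        if rule_feats <= feats:
            return label
    return False
```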
[4] 前記機械学習手段は,前記教師データから,解となりうる分類を特定し,所定の条件式を満足しかつエントロピーを示す式を最大にするときの素性の集合と解となりうる分類の二項からなる確率分布を求め,前記確率分布を前記学習結果情報として前記学習結果記憶部に格納し,  [4] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, obtains a probability distribution over the pairs of a feature set and a classification that can be a solution when predetermined conditional expressions are satisfied and an expression representing entropy is maximized, and stores the probability distribution in the learning result storage unit as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記確率分布を利用して,前記二項関係の候補の素性の集合の場合のそれぞれの解となりうる分類の確率を求めて,最も大きい確率値を持つ解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  the solution estimation means uses the probability distribution, which is the learning result information stored in the learning result storage means, to obtain the probability of each classification that can be a solution for the feature set of the binary relation candidate, identifies the classification with the largest probability value, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
[5] 前記機械学習手段は,前記教師データから解となりうる分類を特定し,前記分類を正例と負例とに分割し,所定のカーネル関数を用いたサポートベクトルマシン法を実行する関数にしたがって,前記二項関係の候補から抽出された素性の集合を次元とする空間上で前記正例と前記負例との間隔を最大にしかつ超平面で分割する超平面を求め,前記超平面を前記学習結果情報として前記学習結果記憶手段に格納し,前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記超平面を利用して,前記二項関係の候補から抽出された素性の集合が,前記超平面で分割された前記空間において前記正例の側か前記負例の側のどちらにあるかを特定し,前記特定された結果にもとづいて定まる解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  [5] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, divides the classifications into positive examples and negative examples, obtains, according to a function executing a support vector machine method using a predetermined kernel function, a hyperplane that maximizes the margin between the positive examples and the negative examples and divides them, in a space whose dimensions are the set of features extracted from the binary relation candidate, and stores the hyperplane in the learning result storage means as the learning result information, and the solution estimation means uses the hyperplane, which is the learning result information stored in the learning result storage means, to identify whether the set of features extracted from the binary relation candidate lies on the positive example side or the negative example side of the space divided by the hyperplane, identifies a classification that can be the solution determined based on the identified result, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
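Claim 5's support vector machine can be illustrated with a deliberately minimal linear, bias-free variant trained by Pegasos-style sub-gradient steps; the kernel function and the exact max-margin optimization of the claim are omitted, and all names are assumptions.

```python
import random

def train_linear_svm(examples, dims, lam=0.01, epochs=300, seed=0):
    """Minimal linear SVM via sub-gradient descent on the hinge loss.
    `examples` are (feature_vector, label) with labels in {+1, -1}."""
    rng = random.Random(seed)
    w = [0.0] * dims
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(examples, len(examples)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - eta * lam) * wi for wi in w]  # regularization shrink
            if margin < 1:  # margin violation: move the separating hyperplane
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    """Which side of the hyperplane the candidate's feature vector lies on:
    +1 (positive example side, extract) or -1 (negative example side)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```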
[6] 前記機械学習手段は,前記教師データの事例同士について,その事例から抽出された素性の集合のうち重複する素性の割合にもとづく事例同士の類似度を定義しておき,前記定義した類似度と事例を前記学習結果情報として前記学習結果記憶手段に格納し,  [6] wherein the machine learning means defines a similarity between cases of the teacher data based on the proportion of overlapping features among the sets of features extracted from those cases, and stores the defined similarity and the cases in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記定義した類似度と前記事例を参照して,前記二項関係の候補についてその候補との類似度が高い順に k個の事例を選択し,前記選択した k個の事例での多数決によって定めた分類先を,前記二項関係の候補の解と推定する  the solution estimation means refers to the defined similarity and the cases, which are the learning result information stored in the learning result storage means, selects, for the binary relation candidate, k cases in descending order of similarity to the candidate, and estimates the classification determined by a majority vote among the selected k cases as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
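A sketch of claim 6's similarity-based (k-nearest-neighbour) estimation, using Jaccard overlap as one concrete reading of "the proportion of overlapping features"; the function names are illustrative.

```python
def similarity(f1, f2):
    """Jaccard similarity between two feature sets: the size of the overlap
    divided by the size of the union."""
    if not f1 and not f2:
        return 1.0
    return len(f1 & f2) / len(f1 | f2)

def knn_estimate(examples, feats, k=3):
    """Select the k teacher cases most similar to the candidate's feature
    set and estimate the solution by majority vote among their labels."""
    ranked = sorted(examples, key=lambda ex: similarity(ex[0], feats),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```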
[7] 前記機械学習手段は,前記解と素性の集合との組を前記学習結果情報として前記学習結果記憶手段に格納し,  [7] wherein the machine learning means stores the pairs of the solution and the feature set in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段の前記解と素性の集合との組をもとに,ベイズの定理にもとづいて,前記素性抽出手段から得た前記二項関係の候補の素性の集合の場合の各分類になる確率を算出し,前記確率の値が最も大きい分類を,前記二項関係の候補の解と推定する  the solution estimation means calculates, based on Bayes' theorem and on the pairs of the solution and the feature set in the learning result storage means, the probability of each classification for the set of features of the binary relation candidate obtained from the feature extraction means, and estimates the classification with the largest probability value as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
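A sketch of claim 7's Bayes-rule estimation. Add-one smoothing is added so that an unseen feature does not zero out a class; the smoothing and the function names are assumptions beyond the claim text.

```python
import math

def train_nb(examples):
    """The 'learning result' of claim 7 is essentially the solution-feature
    pairs themselves; store them as counts."""
    class_count, feat_count, vocab = {}, {}, set()
    for feats, label in examples:
        class_count[label] = class_count.get(label, 0) + 1
        for f in feats:
            feat_count[(label, f)] = feat_count.get((label, f), 0) + 1
            vocab.add(f)
    return class_count, feat_count, vocab

def nb_estimate(model, feats):
    """P(class | features) proportional to P(class) * product of
    P(feature | class) with add-one smoothing; return the most probable
    class as the estimated solution."""
    class_count, feat_count, vocab = model
    n = sum(class_count.values())
    best, best_lp = None, -math.inf
    for c, cc in class_count.items():
        lp = math.log(cc / n)
        for f in feats:
            lp += math.log((feat_count.get((c, f), 0) + 1) / (cc + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```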
[8] 複数の検索キーワードによる情報検索処理において,教師あり機械学習処理を用いた二項関係抽出処理結果を利用して検索結果を抽出する処理装置であって,問題と解との組で構成される事例であって,問題が検索キーワードを要素とする二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段と,  [8] A processing device that, in information retrieval processing using a plurality of search keywords, extracts search results by using the results of binary relation extraction processing using supervised machine learning processing, comprising: teacher data storage means storing teacher data that includes cases each consisting of a pair of a problem and a solution, the problem being a binary relation whose elements are search keywords and the solution indicating that the binary relation is to be extracted;
前記教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出手段と,  solution-feature pair extraction means for retrieving the cases from the teacher data storage means, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習手段と,  machine learning means for performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
入力された複数の検索キーワードを用いた入力検索キーワード対を生成し,検索対象となるテキストデータから前記入力検索キーワード対を含むテキストデータを抽出して取得する情報検索手段と,  information retrieval means for generating input search keyword pairs using a plurality of input search keywords, and extracting and acquiring, from the text data to be searched, text data containing the input search keyword pairs;
前記検索して取得された各テキストデータから,前記入力検索キーワードで構成される対を生成し,前記生成した対を二項関係の候補とする候補抽出手段と,  candidate extraction means for generating, from each piece of text data acquired by the search, a pair composed of the input search keywords, and using the generated pair as a binary relation candidate;
前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出手段と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定手段と,  feature extraction means for extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択し,前記選択した二項関係を含むテキストデータを検索結果として抽出する検索結果抽出手段とを備える  and search result extraction means for selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level, and extracting text data containing the selected binary relation as a search result,
ことを特徴とする二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing characterized by the above.
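The retrieval flow of claim 8 (AND search followed by filtering on the learned relation judgement) can be sketched independently of any particular learner; `relation_score` here stands in for the solution estimation means and is an assumption, as are all names.

```python
def retrieve(texts, kw_pair, relation_score, threshold=0.5):
    """AND search, then relation filtering: keep only texts in which the
    input keyword pair is judged a binary relation to be extracted.
    `relation_score(pair, text)` is any learned estimator returning the
    degree (in [0, 1]) that the pair is an extractable relation."""
    # Step 1: AND search over whitespace tokens (illustrative tokenization).
    hits = [t for t in texts if all(k in t.split() for k in kw_pair)]
    # Step 2: keep only hits whose keyword pair passes the learned judgement.
    return [t for t in hits if relation_score(kw_pair, t) >= threshold]
```

This mirrors the point of the claim: a plain AND search returns every co-occurrence, and the learned binary relation judgement prunes the spurious ones.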
[9] 前記教師データ記憶手段は,前記事例として,問題の二項関係が,抽出するべき二項関係であることを示す正の解が与えられた正の事例と,問題の二項関係が,抽出するべきではない二項関係であることを示す負の解が与えられた負の事例とを含む教師データが格納される  [9] wherein the teacher data storage means stores teacher data including, as the cases, positive cases given a positive solution indicating that the binary relation of the problem is a binary relation to be extracted, and negative cases given a negative solution indicating that the binary relation of the problem is a binary relation that should not be extracted,
ことを特徴とする請求項 8記載の二項関係抽出処理を用いた情報検索装置。 [10] 前記機械学習手段は,前記教師データから,前記所定の情報である素性の集合と解を示す情報との対で構成した規則を設定し,前記規則を所定の順序でリスト上に並べたものを学習結果とし,前記規則のリストを学習結果情報として前記学習結果記憶手段に格納し,  An information retrieval apparatus using binary relation extraction processing according to claim 8. [10] wherein the machine learning means sets rules each composed of a pair of a feature set, which is the predetermined information, and information indicating a solution, takes a list of the rules arranged in a predetermined order as the learning result, and stores the list of rules in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記規則のリストを先頭からチェックして,前記二項関係の候補から抽出された素性の集合と一致する規則を検出し,検出した規則の解を示す情報をもとに,前記二項関係の候補の解として推定する  the solution estimation means checks, from the top, the list of rules that is the learning result information stored in the learning result storage means, detects a rule matching the set of features extracted from the binary relation candidate, and estimates the solution of the binary relation candidate based on the information indicating the solution of the detected rule,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[11] 前記機械学習手段は,前記教師データから,解となりうる分類を特定し,所定の条件式を満足しかつエントロピーを示す式を最大にするときの素性の集合と解となりうる分類の二項からなる確率分布を求め,前記確率分布を前記学習結果情報として前記学習結果記憶部に格納し,  [11] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, obtains a probability distribution over the pairs of a feature set and a classification that can be a solution when predetermined conditional expressions are satisfied and an expression representing entropy is maximized, and stores the probability distribution in the learning result storage unit as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記確率分布を利用して,前記二項関係の候補の素性の集合の場合のそれぞれの解となりうる分類の確率を求めて,最も大きい確率値を持つ解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  the solution estimation means uses the probability distribution, which is the learning result information stored in the learning result storage means, to obtain the probability of each classification that can be a solution for the feature set of the binary relation candidate, identifies the classification with the largest probability value, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[12] 前記機械学習手段は,前記教師データから解となりうる分類を特定し,前記分類を正例と負例とに分割し,所定のカーネル関数を用いたサポートベクトルマシン法を実行する関数にしたがって,前記二項関係の候補から抽出された素性の集合を次元とする空間上で前記正例と前記負例との間隔を最大にしかつ超平面で分割する超平面を求め,前記超平面を前記学習結果情報として前記学習結果記憶手段に格納し,前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記超平面を利用して,前記二項関係の候補から抽出された素性の集合が,前記超平面で分割された前記空間において前記正例の側か前記負例の側のどちらにあるかを特定し,前記特定された結果にもとづいて定まる解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  [12] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, divides the classifications into positive examples and negative examples, obtains, according to a function executing a support vector machine method using a predetermined kernel function, a hyperplane that maximizes the margin between the positive examples and the negative examples and divides them, in a space whose dimensions are the set of features extracted from the binary relation candidate, and stores the hyperplane in the learning result storage means as the learning result information, and the solution estimation means uses the hyperplane, which is the learning result information stored in the learning result storage means, to identify whether the set of features extracted from the binary relation candidate lies on the positive example side or the negative example side of the space divided by the hyperplane, identifies a classification that can be the solution determined based on the identified result, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[13] 前記機械学習手段は,前記教師データの事例同士について,その事例から抽出された素性の集合のうち重複する素性の割合にもとづく事例同士の類似度を定義しておき,前記定義した類似度と事例を前記学習結果情報として前記学習結果記憶手段に格納し,  [13] wherein the machine learning means defines a similarity between cases of the teacher data based on the proportion of overlapping features among the sets of features extracted from those cases, and stores the defined similarity and the cases in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記定義した類似度と前記事例を参照して,前記二項関係の候補についてその候補との類似度が高い順に k個の事例を選択し,前記選択した k個の事例での多数決によって定めた分類先を,前記二項関係の候補の解と推定する  the solution estimation means refers to the defined similarity and the cases, which are the learning result information stored in the learning result storage means, selects, for the binary relation candidate, k cases in descending order of similarity to the candidate, and estimates the classification determined by a majority vote among the selected k cases as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[14] 前記機械学習手段は,前記解と素性の集合との組を前記学習結果情報として前記学習結果記憶手段に格納し,  [14] wherein the machine learning means stores the pairs of the solution and the feature set in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段の前記解と素性の集合との組をもとに,ベイズの定理にもとづいて,前記素性抽出手段から得た前記二項関係の候補の素性の集合の場合の各分類になる確率を算出し,前記確率の値が最も大きい分類を,前記二項関係の候補の解と推定する  the solution estimation means calculates, based on Bayes' theorem and on the pairs of the solution and the feature set in the learning result storage means, the probability of each classification for the set of features of the binary relation candidate obtained from the feature extraction means, and estimates the classification with the largest probability value as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[15] コンピュータが読み取り可能な記憶装置に格納された文データ中に出現する二項関係を,機械学習処理を用いて抽出する二項関係抽出処理方法であって,問題と解との組で構成される事例であって,問題が文データ中に出現する二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  [15] A binary relation extraction processing method in which a computer extracts, using machine learning processing, binary relations appearing in sentence data stored in a readable storage device, comprising: a solution-feature pair extraction process of retrieving cases, each consisting of a pair of a problem and a solution, the problem being a binary relation appearing in the sentence data and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
前記記憶装置に格納されたテキストデータから,前記二項関係の要素を抽出し,前記要素で構成される対を抽出し,前記抽出した対を二項関係の候補とする候補抽出処理過程と,  a candidate extraction process of extracting the elements of the binary relation from the text data stored in the storage device, extracting pairs composed of the elements, and using the extracted pairs as binary relation candidates;
前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択する二項関係抽出処理過程とを備えることを特徴とする二項関係抽出処理方法。  and a binary relation extraction process of selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level. A binary relation extraction processing method characterized by comprising the above processes.
[16] コンピュータが複数の検索キーワードによる情報検索処理を行う場合に,教師あり機械学習処理を用いた二項関係抽出処理結果を利用して検索結果を抽出する情報検索処理方法であって,  [16] An information retrieval processing method for extracting search results by using the results of binary relation extraction processing using supervised machine learning processing when a computer performs information retrieval processing using a plurality of search keywords, comprising:
問題と解との組で構成される事例であって,問題が検索キーワードを要素とする二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  a solution-feature pair extraction process of retrieving cases, each consisting of a pair of a problem and a solution, the problem being a binary relation whose elements are search keywords and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
入力された複数の検索キーワードを用いた入力検索キーワード対を生成し,検索対象となるテキストデータから前記入力検索キーワード対を含むテキストデータを抽出して取得する情報検索処理過程と,  an information retrieval process of generating input search keyword pairs using a plurality of input search keywords, and extracting and acquiring, from the text data to be searched, text data containing the input search keyword pairs;
前記検索して取得された各テキストデータから,前記入力検索キーワードで構成される対を生成し,前記生成した対を二項関係の候補とする候補抽出処理過程と,前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a candidate extraction process of generating, from each piece of text data acquired by the search, a pair composed of the input search keywords, and using the generated pair as a binary relation candidate; a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択し,前記選択した二項関係を含むテキストデータを検索結果として抽出する検索結果抽出処理過程とを備える  and a search result extraction process of selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level, and extracting text data containing the selected binary relation as a search result,
ことを特徴とする二項関係抽出処理を用いた情報検索処理方法。  An information retrieval processing method using binary relation extraction processing characterized by the above.
[17] コンピュータに,読み取り可能な記憶装置に格納された文データ中に出現する二項関係を,機械学習処理を用いて抽出する処理方法として,  [17] A program for causing a computer to execute, as a processing method for extracting binary relations appearing in sentence data stored in a readable storage device using machine learning processing:
問題と解との組で構成される事例であって,問題が文データ中に出現する二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から,前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  a solution-feature pair extraction process of retrieving the cases, each consisting of a pair of a problem and a solution, the problem being a binary relation appearing in the sentence data and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
前記記憶装置に格納されたテキストデータから,前記二項関係の要素を抽出し,前記要素で構成される対を抽出し,前記抽出した対を二項関係の候補とする候補抽出処理過程と,前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a candidate extraction process of extracting the elements of the binary relation from the text data stored in the storage device, extracting pairs composed of the elements, and using the extracted pairs as binary relation candidates; a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択する二項関係抽出処理過程とを,  and a binary relation extraction process of selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level,
実行させるための二項関係抽出処理プログラム。  A binary relation extraction processing program for execution.
[18] コンピュータに,複数の検索キーワードによる情報検索処理を行う場合に,教師あり機械学習処理を用いた二項関係抽出処理結果を利用して検索結果を抽出する方法として,  [18] A program for causing a computer to execute, as a method for extracting search results by using the results of binary relation extraction processing using supervised machine learning processing when performing information retrieval processing using a plurality of search keywords:
問題と解との組で構成される事例であって,問題が検索キーワードを要素とする二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  a solution-feature pair extraction process of retrieving the cases, each consisting of a pair of a problem and a solution, the problem being a binary relation whose elements are search keywords and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
入力された複数の検索キーワードを用いた入力検索キーワード対を生成し,検索対象となるテキストデータから前記入力検索キーワード対を含むテキストデータを抽出して取得する情報検索処理過程と,  an information retrieval process of generating input search keyword pairs using a plurality of input search keywords, and extracting and acquiring, from the text data to be searched, text data containing the input search keyword pairs;
前記検索して取得された各テキストデータから,前記入力検索キーワードで構成される対を生成し,前記生成した対を二項関係の候補とする候補抽出処理過程と,前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a candidate extraction process of generating, from each piece of text data acquired by the search, a pair composed of the input search keywords, and using the generated pair as a binary relation candidate; a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係である ことを示す解となりやす 、度合 、が所定の程度より良 、場合に,前記二項関係の候 補を抽出するべき二項関係として選択し,前記選択した二項関係を含むテキストデ ータを検索結果として抽出する検索結果抽出処理過程とを,  As a result of the estimation, the candidate for the binomial relationship is likely to be a solution indicating that the binomial relationship should be extracted. If the degree is better than a predetermined level, the candidate for the binomial relationship should be extracted. A search result extraction process for selecting as a binary relation and extracting text data including the selected binary relation as a search result.
実行させるための二項関係抽出処理を用いた情報検索処理プログラム。  An information search processing program using binary relation extraction processing for execution.
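The claimed pipeline (learn from labeled keyword-pair cases which feature sets indicate a binary relation to extract, then score and threshold candidate pairs from retrieved text) can be sketched roughly as follows. This is a minimal illustration only: the claim leaves the machine learning algorithm and the feature set unspecified, so the context-word features, the toy data, and the simple count-based weighting below are all assumptions, not the patented method.

```python
from collections import defaultdict

# Hypothetical teacher data: each case pairs a "problem" (a keyword pair)
# with a feature set and a label saying whether that pair is a binary
# relation that should be extracted (the "solution").
teacher_data = [
    (("protein", "gene"), {"ctx:activates", "dist:1"}, True),
    (("protein", "gene"), {"ctx:and", "dist:5"}, False),
    (("drug", "disease"), {"ctx:treats", "dist:2"}, True),
    (("drug", "disease"), {"ctx:near", "dist:8"}, False),
]

# "Machine learning" step: learn a per-feature weight toward the positive
# class (a naive count-based stand-in for the unspecified algorithm);
# the weights play the role of the stored learning result information.
weights = defaultdict(float)
for _pair, feats, is_solution in teacher_data:
    for f in feats:
        weights[f] += 1.0 if is_solution else -1.0

def solution_score(features):
    """Estimate the degree to which a candidate's feature set is likely
    to yield the solution, using the learned weights."""
    return sum(weights[f] for f in features)

# Candidate extraction + selection: keyword pairs found in retrieved text
# become candidates, and candidates whose estimated degree exceeds a
# predetermined threshold are selected as binary relations to extract.
candidates = [
    (("aspirin", "headache"), {"ctx:treats", "dist:2"}),
    (("aspirin", "water"), {"ctx:near", "dist:8"}),
]
threshold = 0.0
selected = [pair for pair, feats in candidates if solution_score(feats) > threshold]
```

In this sketch only the first candidate scores above the threshold, so only its pair would be carried forward as a search result.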
PCT/JP2006/312592 2005-06-23 2006-06-23 Binary relation extracting device WO2006137516A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005-183495 2005-06-23
JP2005183495A JP4565106B2 (en) 2005-06-23 2005-06-23 Binary relation extracting device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing

Publications (1)

Publication Number Publication Date
WO2006137516A1 true WO2006137516A1 (en) 2006-12-28

Family

ID=37570533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/312592 WO2006137516A1 (en) 2005-06-23 2006-06-23 Binary relation extracting device

Country Status (3)

Country Link
JP (1) JP4565106B2 (en)
CN (1) CN101253497A (en)
WO (1) WO2006137516A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008225565A (en) * 2007-03-08 2008-09-25 Nippon Telegr & Teleph Corp <Ntt> Device and method for extracting sets of interrelated specific expressions
JP4793932B2 (en) * 2007-03-08 2011-10-12 日本電信電話株式会社 Apparatus and method for extracting sets of interrelated specific expressions
JP4646078B2 (en) * 2007-03-08 2011-03-09 日本電信電話株式会社 Apparatus and method for extracting sets of interrelated specific expressions
JP4793931B2 (en) * 2007-03-08 2011-10-12 日本電信電話株式会社 Apparatus and method for extracting sets of interrelated specific expressions
JP5116775B2 (en) * 2007-11-19 2013-01-09 日本電信電話株式会社 Information retrieval method and apparatus, program, and computer-readable recording medium
JP4671440B2 (en) * 2007-12-04 2011-04-20 日本電信電話株式会社 Reputation relationship extraction device, method and program thereof
WO2009123288A1 (en) * 2008-04-03 2009-10-08 日本電気株式会社 Word classification system, method, and program
JP5858456B2 (en) * 2011-01-21 2016-02-10 国立研究開発法人情報通信研究機構 Information retrieval service providing apparatus and computer program
EP2953064B1 (en) 2013-02-01 2022-07-06 Fujitsu Limited Information conversion method, information conversion device, and information conversion program
WO2014118978A1 (en) 2013-02-01 2014-08-07 富士通株式会社 Learning method, image processing device and learning program
JP6004014B2 (en) 2013-02-01 2016-10-05 富士通株式会社 Learning method, information conversion apparatus, and learning program
JP6505421B2 (en) 2014-11-19 2019-04-24 株式会社東芝 Information extraction support device, method and program
JP6775935B2 (en) 2015-11-04 2020-10-28 株式会社東芝 Document processing equipment, methods, and programs
JP6490607B2 (en) 2016-02-09 2019-03-27 株式会社東芝 Material recommendation device
JP6602243B2 (en) 2016-03-16 2019-11-06 株式会社東芝 Learning apparatus, method, and program
JP6622172B2 (en) 2016-11-17 2019-12-18 株式会社東芝 Information extraction support device, information extraction support method, and program


Patent Citations (5)

Publication number Priority date Publication date Assignee Title
JPH08147307A (en) * 1994-11-22 1996-06-07 Gijutsu Kenkyu Kumiai Shinjoho Shiyori Kaihatsu Kiko Semantic knowledge acquisition device
JP2003186894A (en) * 2001-12-21 2003-07-04 Hitachi Ltd Substance dictionary creating method, and inter- substance binary relationship extracting method, predicting method and displaying method
JP2003196636A (en) * 2001-12-26 2003-07-11 Communication Research Laboratory Notation error detection processing method using machine learning method having teacher, its processing device and its processing program
JP2003223456A (en) * 2002-01-31 2003-08-08 Communication Research Laboratory Method and device for automatic summary evaluation and processing, and program therefor
JP2005157524A (en) * 2003-11-21 2005-06-16 National Institute Of Information & Communication Technology Question response system, and method for processing question response

Non-Patent Citations (1)

Title
MITSUMORI T. ET AL.: "Gene/protein name recognition based on support vector machine using dictionary as features", BMC BIOINFORMATICS 2005, vol. 6, no. SUPPL. 1, 24 May 2005 (2005-05-24), XP021001017 *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN103678681A (en) * 2013-12-25 2014-03-26 中国科学院深圳先进技术研究院 Self-adaptive parameter multiple kernel learning classification method based on large-scale data
CN104361224A (en) * 2014-10-31 2015-02-18 深圳信息职业技术学院 Confidence classification method and confidence machine
CN109791632A (en) * 2016-09-26 2019-05-21 国立研究开发法人情报通信研究机构 Scene segment classifier, scene classifier and the computer program for it
CN109791632B (en) * 2016-09-26 2023-07-21 国立研究开发法人情报通信研究机构 Scene segment classifier, scene classifier, and recording medium
JP2020052902A (en) * 2018-09-28 2020-04-02 株式会社東芝 Named entity extraction apparatus, method, and program
WO2020067313A1 (en) * 2018-09-28 2020-04-02 株式会社 東芝 Named entity extraction device, method, and storage medium
JP7286291B2 (en) 2018-09-28 2023-06-05 株式会社東芝 Named entity extraction device, method and program
US11868726B2 (en) 2018-09-28 2024-01-09 Kabushiki Kaisha Toshiba Named-entity extraction apparatus, method, and non-transitory computer readable storage medium
WO2020095655A1 (en) * 2018-11-05 2020-05-14 日本電信電話株式会社 Selection device and selection method
JP2020077054A (en) * 2018-11-05 2020-05-21 日本電信電話株式会社 Selection device and selection method

Also Published As

Publication number Publication date
CN101253497A (en) 2008-08-27
JP4565106B2 (en) 2010-10-20
JP2007004458A (en) 2007-01-11

Similar Documents

Publication Publication Date Title
JP4565106B2 (en) Binary relation extracting device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing
Arora et al. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
Oussous et al. ASA: A framework for Arabic sentiment analysis
Levy et al. Neural word embedding as implicit matrix factorization
Hermann et al. Semantic frame identification with distributed word representations
US9262406B1 (en) Semantic frame identification with distributed word representations
Shen et al. Voting between multiple data representations for text chunking
Fernandes et al. Learning from partially annotated sequences
Xing et al. Um-checker: A hybrid system for english grammatical error correction
Şenel et al. Measuring cross-lingual semantic similarity across European languages
Hu et al. Bootstrapping object coreferencing on the semantic web
Kocmi et al. SubGram: extending skip-gram word representation with substrings
Huang et al. Analyzing multiple medical corpora using word embedding
JP5366179B2 (en) Information importance estimation system, method and program
JP4895645B2 (en) Information search apparatus and information search program
Jaber et al. NER in English translation of hadith documents using classifiers combination
Choi et al. How to generate data for acronym detection and expansion
Kaewphan et al. TurkuNLP entry for interactive Bio-ID assignment
Suzdaltseva et al. De-identification of Medical Information for Forming Multimodal Datasets to Train Neural Networks.
Chen et al. Extract protein-protein interactions from the literature using support vector machines with feature selection
JP3780341B2 (en) Language analysis processing system and sentence conversion processing system
Pham Sensitive keyword detection on textual product data: an approximate dictionary matching and context-score approach
JP2008021093A (en) Sentence conversion processing system, translation processing system having sentence conversion function, voice recognition processing system having sentence conversion function, and speech synthesis processing system having sentence conversion function
Wang et al. Learning functional sections in medical conversations: iterative pseudo-labeling and human-in-the-loop approach
Stampolidou Extracting Local Features to Improve Transformer-based Biomedical Question Answering Models

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200680022356.9

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06780635

Country of ref document: EP

Kind code of ref document: A1