WO2009096523A1 - Information analysis apparatus, search system, information analysis method, and information analysis program - Google Patents
- Publication number
- WO2009096523A1 PCT/JP2009/051581 JP2009051581W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language expression
- document
- expression
- candidate
- analysis
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- the present invention relates to an information analysis apparatus, an information analysis method, and an information analysis program for analyzing information.
- the present invention also relates to a search system using an information analysis apparatus.
- a description that expresses a specific noun, topic, opinion, or thing in the text is called “language expression”.
- language expressions include noun expressions such as event names, case names, and product names (for example, “race games”, “earthquake-resistant gels”, “food disguise”), as well as expressions and sentences in which a noun expression is further combined with a predicate or modifier (for example, “An earthquake-resistant gel is effective” and “A diesel engine is good for the environment”).
- a language expression may be a character string itself appearing in the text, or it may be the result of analyzing the text using existing natural language processing techniques such as morphological analysis, syntax analysis, dependency analysis, and synonym processing.
- for example, “school” and “student” are language expressions each consisting of one word. Likewise, a relationship between words such as “school → go”, obtained by dependency analysis of differently worded texts that all mean “go to school”, is an analysis result that is also a language expression representing a single meaning.
- Non-Patent Document 1 describes a correlation analysis based on the co-occurrence degree as a text mining technique for analyzing a free description questionnaire.
- the correlation analysis based on the co-occurrence degree is a technique for determining that the relevance between these words is high based on the information that a plurality of words appear in the same document.
- the processing unit need not be a single word: any language expression, such as a predicate consisting of multiple words or a dependency relation between words, can be used. By looking at the co-occurrence relationship between one language expression and another, it is possible to extract a language expression highly relevant to a specific language expression.
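The co-occurrence idea described above can be sketched as follows. This is a minimal illustration, assuming each document has already been reduced to a list of language expressions; the function name `cooccurrence_counts` and the toy data are hypothetical and are not taken from Non-Patent Document 1.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents contain each unordered pair of expressions."""
    pair_counts = Counter()
    for expressions in documents:
        # Deduplicate and sort so each pair is counted once per document
        # and always stored in the same order.
        for a, b in combinations(sorted(set(expressions)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy documents, each already reduced to its language expressions.
docs = [
    ["earthquake", "gel", "furniture"],
    ["earthquake", "gel"],
    ["gel", "price"],
]
counts = cooccurrence_counts(docs)
print(counts[("earthquake", "gel")])  # prints 2
```

Pairs with a high count relative to the individual frequencies of their members would be the candidates for "highly relevant" expressions; the exact scoring is left open here.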
- a language expression that is highly related to the target language expression may be a cause or result of the target language expression, a different result arising from a common cause, or simply an event that correlates strongly because it arises from a common situation or environment. In any case, the highly relevant language expression is an important finding for the language expression of interest.
- time information such as a transmission date and time, a creation date and time, or a response date and time is usually given to documents in collections such as blogs on the Internet, e-mails, and response histories in a call center.
- Non-Patent Document 2 describes a technique called BlogWatcher IV.
- it describes a method of plotting, as line graphs, the time-series changes in the number of times a specific topic word appears across all collected blogs, the number of times the topic word is described positively, and the number of times it is described negatively (hereinafter referred to as the second related technology).
- Non-Patent Document 2 also describes a function of detecting, as a burst, a point in time when the number of appearances of a topic word of interest has changed abruptly.
- a burst means that the number of appearances of the topic word of interest suddenly increases or decreases within a certain time.
- Non-Patent Document 2 describes detecting bursts not only from a simple increase or decrease in the count, but also from changes in the count normalized by the total population of collected blogs.
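As a rough illustration of burst detection on normalized counts, the sketch below flags periods whose normalized frequency changes by at least a given factor relative to the previous period. The ratio test, the `threshold` parameter, and the function name `detect_bursts` are assumptions made for this example; the source does not state how Non-Patent Document 2 actually detects bursts.

```python
def detect_bursts(counts, totals, threshold=2.0):
    """Return indices of periods whose normalized frequency jumps or drops sharply.

    counts: appearances of the topic word per period.
    totals: total number of collected documents per period.
    The ratio test and `threshold` are illustrative assumptions.
    """
    rates = [c / t if t else 0.0 for c, t in zip(counts, totals)]
    bursts = []
    for i in range(1, len(rates)):
        prev, cur = rates[i - 1], rates[i]
        if prev == 0.0 or cur == 0.0:
            # Appearing from nothing, or vanishing entirely, counts as a burst.
            if prev != cur:
                bursts.append(i)
        elif cur / prev >= threshold or prev / cur >= threshold:
            bursts.append(i)
    return bursts


counts = [10, 12, 60, 55]
totals = [1000, 1200, 1000, 1100]
print(detect_bursts(counts, totals))  # prints [2]
```

Note that the raw count rises from 10 to 12 in the second period, but the normalized rate is unchanged, so no burst is reported there.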
- a document set including the target language expression (hereinafter referred to as the target document set) is selected as the analysis target from the population of documents. Then, in each text of the target document set, a language expression that co-occurs statistically frequently with the target language expression is extracted as a highly relevant language expression. Therefore, language expressions that do not appear frequently in the target document set cannot be extracted even if they are highly related to the target language expression.
- a language expression representing such a cause or result is not always described in the same document as the original target language expression. Even if the target language expression and a highly relevant language expression co-occur in some documents of the target document set, it generally cannot be expected that the related language expression appears statistically frequently in most of the target document set.
- regression analysis is a basic method of statistical analysis. When there are multiple sets of time-series data, such as the number of occurrences or the price of each event at each time point, it detects highly relevant events by examining the correlation between the time changes of those series. For example, when the time change of one stock price is correlated with that of another, performing regression analysis with the prices of the two stocks as time-series data makes it possible to calculate how strongly the two prices were correlated.
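One common way to quantify the correlation between two time series is the Pearson correlation coefficient. The sketch below is offered as an illustration of the stock-price example, not as the specific analysis the document prescribes; the function name `pearson` and the price values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no trend to correlate
    return cov / sqrt(var_x * var_y)


stock_a = [100, 102, 101, 105, 107]
stock_b = [50, 51, 50.5, 52.5, 53.5]  # moves with stock_a
stock_c = [30, 28, 29, 25, 23]        # moves against stock_a
print(round(pearson(stock_a, stock_b), 3))  # prints 1.0
print(round(pearson(stock_a, stock_c), 3))  # prints -1.0
```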
- if time information is given to the document set serving as the analysis population, time-series data of each language expression can be obtained by using the second related technology. In this case, the document set is divided into periods using the time information, and the number of documents including each language expression, or the number of appearances of the language expression, in each period is used as the time-series data of that language expression.
- however, even when the population of documents to be analyzed is given, each document can contain a large number of language expressions. Therefore, in order to obtain a language expression that is highly correlated in time with a specific language expression of interest, it is necessary to calculate the temporal correlation with each of these very many language expressions.
- when the population of documents to be analyzed is large, such as documents on the Internet or a large number of response histories, obtaining the temporal correlation of time-series data for every language expression in this way is not realistic from the viewpoint of computational complexity.
- an exemplary object of the present invention is to provide an information analysis apparatus, a search system, an information analysis method, and an information analysis program that can analyze the relevance between a language expression of interest and a language expression that seldom co-occurs statistically with it in the same document.
- An exemplary information analysis apparatus includes: a target language expression time-series data acquisition unit that acquires time-series data corresponding to an input language expression to be analyzed; a related language expression candidate generation unit that generates candidates for language expressions highly relevant to the language expression as related language expression candidates; a related language expression candidate time-series data acquisition unit that acquires, for each related language expression candidate generated by the related language expression candidate generation unit, time-series data corresponding to that candidate; a time series analysis unit that analyzes the temporal correlation between the time-series data acquired by the target language expression time-series data acquisition unit and the time-series data acquired by the related language expression candidate time-series data acquisition unit; and a relevance calculation unit that, using the analysis result of the time series analysis unit, calculates the degree of relevance between the language expression and each related language expression candidate generated by the related language expression candidate generation unit.
- An exemplary search system includes the above-described information analysis apparatus and searches a plurality of search target documents using, as search conditions, related language expressions that are output from the information analysis apparatus and have a high degree of relevance to the target language expression.
- An exemplary information analysis method includes: acquiring time-series data corresponding to an input language expression to be analyzed; generating candidates for language expressions highly relevant to the language expression as related language expression candidates; acquiring, for each generated related language expression candidate, time-series data corresponding to that candidate; analyzing the temporal correlation between the time-series data corresponding to the language expression and the time-series data corresponding to the related language expression candidate; and calculating the degree of relevance between the language expression and the related language expression candidate using the analysis result of the temporal correlation between the time-series data.
- An exemplary information analysis program causes a computer to execute:
- target language expression time-series data acquisition processing for acquiring time-series data corresponding to the input language expression to be analyzed;
- related language expression candidate generation processing for generating candidates for language expressions highly relevant to the language expression as related language expression candidates;
- related language expression candidate time-series data acquisition processing for acquiring time-series data corresponding to each related language expression candidate;
- time series analysis processing for analyzing the temporal correlation between the acquired time-series data corresponding to the language expression and the acquired time-series data corresponding to the related language expression candidate; and
- relevance calculation processing for calculating, using the analysis result of the temporal correlation between the time-series data, the degree of relevance between the language expression and the related language expression candidate.
- Embodiment 1 relates to an information analysis apparatus using an information analysis method that extracts, from a document set, related language expressions that are highly correlated in time series with the language expression of interest to be analyzed.
- FIG. 1 is a block diagram showing a configuration of a first embodiment of an information analysis apparatus according to the present invention.
- the information analysis apparatus includes a target language expression time-series data acquisition unit 20, a related language expression candidate generation unit 40, a related language expression candidate time-series data acquisition unit 50, a time series analysis unit 60, and a relevance calculation unit 70.
- the document set database 30 provides a means for accessing a document set defined as a population of documents to be analyzed.
- the target language expression input unit 10 inputs the language expression to be analyzed to the target language expression time-series data acquisition unit 20.
- the related information output device 80 outputs related information related to the language expression to be analyzed.
- the information analysis device may include some or all of the target language expression input unit 10, the related information output device 80, and the document set database 30.
- the information analysis apparatus is specifically realized by an information processing apparatus such as a personal computer that operates according to a program.
- for example, the information analysis apparatus can be applied to a search system that uses the apparatus to present language expressions highly relevant to an input language expression as related information or as related search conditions.
- the target language expression input unit 10 inputs a language expression to be analyzed.
- the target language expression time-series data acquisition unit 20 acquires time-series data of the target language expression input to the target language expression input unit 10.
- the document set database 30 provides a means for accessing a document set defined as a population of documents to be analyzed.
- the related language expression candidate generation unit 40 generates language expression candidates that are highly related to the input target language expression as related language expression candidates.
- the related language expression candidate time-series data acquisition unit 50 acquires time-series data for each of the generated related language expression candidates.
- the time series analysis unit 60 examines the temporal correlation between the time-series data obtained by the target language expression time-series data acquisition unit 20 and the time-series data obtained by the related language expression candidate time-series data acquisition unit 50.
- the relevance calculation unit 70 calculates the relevance between the target language expression and the related language expression candidate using the analysis result of the time series analysis unit 60.
- the related information output device 80 outputs a linguistic expression having a high degree of relevance with the language expression of interest from the result of the relevance calculation unit 70.
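The data flow among units 20 to 70 described above can be sketched as a small pipeline. Everything named here (`analyze`, `same_direction_ratio`, the toy series) is hypothetical, and the direction-agreement score is only a stand-in for whatever temporal correlation analysis the time series analysis unit 60 actually performs.

```python
def same_direction_ratio(xs, ys):
    """Fraction of period-to-period moves in which both series move the same way.

    A deliberately simple stand-in for the temporal correlation analysis.
    """
    moves = [(b - a) * (d - c) > 0 for a, b, c, d in zip(xs, xs[1:], ys, ys[1:])]
    return sum(moves) / len(moves) if moves else 0.0


def analyze(target, get_series, generate_candidates, correlate, top_n=10):
    target_series = get_series(target)                 # unit 20
    scored = []
    for candidate in generate_candidates(target):      # unit 40
        candidate_series = get_series(candidate)       # unit 50
        score = correlate(target_series, candidate_series)  # unit 60
        scored.append((candidate, score))              # unit 70: relevance = score here
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]                              # handed to output device 80


# Toy weekly counts per language expression (made-up numbers).
series = {
    "DVD recorder": [5, 9, 4, 12],
    "broadcast format": [2, 6, 1, 8],
    "weather": [7, 3, 9, 2],
}
result = analyze(
    "DVD recorder",
    get_series=series.__getitem__,
    generate_candidates=lambda target: ["broadcast format", "weather"],
    correlate=same_direction_ratio,
)
print(result)  # prints [('broadcast format', 1.0), ('weather', 0.0)]
```

Passing the series lookup, candidate generator, and correlation measure as arguments mirrors the separation into units, so each can be replaced independently.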
- the target language expression input unit 10 is realized by a CPU of an information processing device that operates according to a program and an input device such as a keyboard and a mouse.
- the target language expression input unit 10 has a function of inputting a language expression to be analyzed in accordance with a user operation.
- the target language expression input unit 10 may accept the target language expression by having the user designate a part of the text in a document, by text input from the keyboard, or in any other input format as long as the language expression can be specified. The target language expression may be input in a text format such as “A product is cool”, or in a data format obtained as a result of existing language processing such as morphological analysis, syntax analysis, dependency analysis, or synonym processing, such as “A product → cool”.
- the target language expression time-series data acquisition unit 20 is realized by a CPU of an information processing apparatus that operates according to a program.
- the target language expression time-series data acquisition unit 20 has a function of acquiring, from the document set database 30, time-series data for the target language expression input by the target language expression input unit 10.
- specifically, the target language expression time-series data acquisition unit 20 divides the document set accessible through the document set database 30 into periods based on the time information given to each document. It then obtains, as the time-series data of the target language expression, the number of documents including the target language expression in each period or the number of appearances of the target language expression in each period.
- for example, when the period is set to one week, the target language expression time-series data acquisition unit 20 obtains the number of documents containing the target language expression as 52 in the first week of January, 48 in the second week of January, 192 in the third week of January, 218 in the fourth week of January, and so on.
- the target language expression time-series data acquisition unit 20 uses this series of counts as the time-series data for the target language expression.
- the target language expression time-series data acquisition unit 20 may use the number of documents including the target language expression or the number of appearances of the target language expression as is, or may use a value normalized by the total number of documents in the analysis target population for each period.
- the range of the time-series data (for example, the start time and end time) and the length of each period (for example, every hour, every day, or every week) are determined as appropriate according to the purpose and use of the analysis, the nature of the population to be analyzed, and the like.
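The period-based counting and optional normalization described above can be sketched as follows. This is a minimal illustration assuming one-week periods keyed by ISO year and week, documents given as (date, text) pairs, and plain substring matching in place of real language analysis; the name `weekly_series` and the sample documents are hypothetical.

```python
from collections import Counter
from datetime import date


def weekly_series(documents, expression, normalize=False):
    """Build weekly time-series data for `expression`.

    documents: iterable of (date, text) pairs.
    Returns {(iso_year, iso_week): value}, where value is the number of
    documents containing the expression, or that number divided by the
    total number of documents in the week when normalize=True.
    """
    hits = Counter()
    totals = Counter()
    for created, text in documents:
        iso = created.isocalendar()
        period = (iso[0], iso[1])
        totals[period] += 1
        if expression in text:  # substring match is a simplifying assumption
            hits[period] += 1
    if normalize:
        return {p: hits[p] / totals[p] for p in sorted(totals)}
    return {p: hits[p] for p in sorted(totals)}


docs = [
    (date(2008, 1, 1), "A product is cool"),
    (date(2008, 1, 3), "unrelated entry"),
    (date(2008, 1, 8), "I like A product"),
]
print(weekly_series(docs, "A product"))
# prints {(2008, 1): 1, (2008, 2): 1}
print(weekly_series(docs, "A product", normalize=True))
# prints {(2008, 1): 0.5, (2008, 2): 1.0}
```

Counting appearances rather than documents would only change the increment inside the loop, for example by adding `text.count(expression)` instead of 1.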
- processing that identifies synonymous expressions using existing language processing techniques, such as synonym processing or the unification of analysis results that differ in wording or syntax but are considered to have the same meaning, may be used as necessary. Which words and expressions are treated as identical is determined in advance according to the purpose and use of the information analysis apparatus, the characteristics of the analysis target population, and the like.
- the document set database 30 is realized by a database device such as a magnetic disk device or an optical disk device, or a network device.
- the document set database 30 is a database that accumulates various electronic documents with time information and provides means for accessing a document set defined as a population of documents to be analyzed.
- the document set database 30 is, for example, a database device provided in a call center.
- the time information given to the electronic document may be the creation time of each document, or may be any other time information such as the transmission time or the last update date and time. However, which type of time information is used for the time-series data acquired by the target language expression time-series data acquisition unit 20 is determined in advance (for example, one type of time information is chosen).
- the actual document data may be held inside the information analysis apparatus or may be held outside.
- the document set database 30 may be a blog search engine that searches a blog on the Internet by specifying a keyword or a date and time instead of a database device.
- in this case, the analysis target population may be the blog data searched by the blog search engine, the text may be the text of each blog article, and the time information may be the date given to each blog article.
- the related language expression candidate generation unit 40 is realized by a CPU of an information processing apparatus that operates according to a program.
- the related language expression candidate generation unit 40 has a function of generating, as related language expression candidates, language expression candidates that are highly relevant to the target language expression input by the target language expression input unit 10.
- the related language expression candidate generation unit 40 generates the related language expression candidates using the text content of the input target language expression, the text content of documents including the target language expression, or meta information attached to documents including the target language expression.
- the related language expression candidate generation unit 40 first generates language expressions having a certain relationship with the target language expression or the target document set as candidates for language expressions highly related to the target language expression.
- FIG. 2 is a block diagram showing an example of a more detailed configuration of the related language expression candidate generation unit 40.
- the related language expression candidate generation unit 40 includes a survey target document condition selection unit 410, a survey target document set acquisition unit 420, and a characteristic language expression extraction unit 430.
- the survey target document condition selection unit 410 selects the conditions that define the documents to be surveyed. The survey target document set acquisition unit 420 acquires the set of documents that satisfy the selected conditions. The characteristic language expression extraction unit 430 extracts characteristic language expressions from the acquired document set.
- the survey target document condition selection unit 410 has a function of selecting, in order to obtain related language expression candidates, the conditions that define a document set which differs from the target document set including the target language expression but has a certain relationship with the target language expression or the target document set.
- the survey target document condition selection unit 410 selects the extraction conditions for the comparison target documents using the text content of the electronic documents including the input language expression or the meta information attached to those documents.
- a document having a certain relationship with the target language expression or the target document set is referred to as a survey target document.
- the set of survey target documents is referred to as a survey target document set.
- Table 1 is a table showing an example of the document condition to be investigated and an example of a related language expression candidate condition.
- examples of conditions for defining the survey target documents include the conditions shown in the first column of the first to fourth rows of Table 1.
- Examples of the related language expression candidate conditions include the conditions shown in the first row and the second column, the fourth row and the second column, the fifth row and the second column, the sixth row and the second column, and the seventh row and the second column in Table 1.
- the condition shown in the first row and the first column of the table of Table 1 is a condition that “a document set of the same field or topic as the target document set” is selected.
- that is, under the condition shown in the first row and the first column of Table 1, a document qualifies as a survey target document if it belongs to the same field or the same topic as the target document set.
- in this case, the survey target document condition selection unit 410 selects, as the extraction condition for the comparison target documents, that an electronic document is in the same or a similar field, or on the same or a similar topic, as part or all of the set of electronic documents including the input language expression.
- to determine the field or topic, existing techniques for determining the field or topic from text can be used. Further, when a field, topic, or the like is given as meta information to each document in the target document set, the meta information may be used.
- when the documents belonging to the target document set cover a plurality of fields or topics, all of them may be used as the field or topic condition. Alternatively, only those fields or topics to which at least a certain number of documents in the target document set belong may be used as the condition.
- the method of determining the field and topic, the classification scheme, and the conditions for regarding fields or topics as the same are set in advance according to the purpose and use of the information analysis apparatus, the characteristics of the analysis target population, and the like. For example, suppose that blogs on the Internet are the analysis target population, the target language expression is “I bought a DVD recorder”, and “AV device” is the most common value of the meta information “category” given to the documents belonging to the target document set. In this case, belonging to the category “AV device” can be set as the condition for survey target documents.
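The category-based condition in the DVD recorder example can be sketched as follows. This is a minimal illustration assuming each document is a dictionary with hypothetical `id`, `category`, and `text` keys; excluding the target documents themselves from the result is also an assumption, made because the survey target set is described as different from the target document set.

```python
from collections import Counter


def dominant_category(target_docs):
    """Return the most common category among the target documents, or None."""
    counts = Counter(d["category"] for d in target_docs if d.get("category"))
    return counts.most_common(1)[0][0] if counts else None


def survey_documents_by_category(population, target_docs):
    """Select documents sharing the dominant category of the target set."""
    category = dominant_category(target_docs)
    if category is None:
        return []
    target_ids = {d["id"] for d in target_docs}
    return [
        d for d in population
        if d.get("category") == category and d["id"] not in target_ids
    ]


population = [
    {"id": 1, "category": "AV device", "text": "I bought a DVD recorder"},
    {"id": 2, "category": "AV device", "text": "I bought a DVD recorder too"},
    {"id": 3, "category": "travel", "text": "I bought a DVD recorder abroad"},
    {"id": 4, "category": "AV device", "text": "new tuner review"},
    {"id": 5, "category": "cooking", "text": "curry recipe"},
]
target_docs = [d for d in population if "I bought a DVD recorder" in d["text"]]
survey = survey_documents_by_category(population, target_docs)
print(dominant_category(target_docs))  # prints AV device
print([d["id"] for d in survey])       # prints [4]
```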
- the condition shown in the 2nd row and the 1st column of the table in Table 1 is a condition of selecting “a set of documents linked within a certain number of hops from the set of target documents”.
- that is, under the condition shown in the second row and the first column of Table 1, a document qualifies as a survey target document if it is linked within a certain number of hops from a document belonging to the target document set. In this case, the survey target document condition selection unit 410 selects, as the extraction condition for the comparison target documents, that the electronic document is linked within a certain number of hops from an electronic document including the input language expression.
- this method presupposes that link information to other related documents is attached as meta information to all or part of the documents belonging to the analysis target population.
- Examples of such links include hyperlinks and trackbacks in Web texts, original mail IDs in returned e-mails, original articles on electronic bulletin boards, and the like.
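Collecting documents linked within a certain number of hops is a breadth-first traversal of the link graph. The sketch below assumes the link meta information has been gathered into a dictionary mapping each document ID to the IDs it links to; the name `within_hops` and the sample graph are hypothetical.

```python
from collections import deque


def within_hops(links, start_ids, max_hops):
    """Return IDs reachable from any start document in at most max_hops links.

    links: dict mapping a document ID to a list of linked document IDs.
    The start documents themselves are not included in the result.
    """
    seen = set(start_ids)
    frontier = deque((doc_id, 0) for doc_id in start_ids)
    reached = set()
    while frontier:
        doc_id, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for linked in links.get(doc_id, ()):
            if linked not in seen:
                seen.add(linked)
                reached.add(linked)
                frontier.append((linked, hops + 1))
    return reached


links = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(sorted(within_hops(links, ["a"], 1)))  # prints ['b']
print(sorted(within_hops(links, ["a"], 2)))  # prints ['b', 'c']
```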
- the condition shown in the third row and the first column of Table 1 is a condition of selecting “a set of documents whose text similarity to a document belonging to the target document set is at or above a certain value (that is, similar documents)”.
- that is, under this condition, a document qualifies as a survey target document if its text similarity to a document belonging to the target document set is at or above a certain value. In this case, the survey target document condition selection unit 410 selects, as the extraction condition for the comparison target documents, that the electronic document has a text similarity at or above a certain value with respect to an electronic document including the input language expression.
- a plurality of documents belong to the target document set. It can therefore be set arbitrarily whether a document needs only to have at least a certain similarity to at least one of them, or whether the target document set is regarded as one document cluster and the document must have at least a certain similarity to the center of the cluster.
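The two settings described above (similar to at least one target document, or similar to the center of the target cluster) can be sketched with bag-of-words cosine similarity. The similarity measure, the whitespace tokenization, and all function names are assumptions for illustration; the source does not specify how text similarity is computed, and Japanese text would need morphological analysis rather than `split()`.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words vector using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_to_any(candidate, targets, threshold):
    """True if the candidate is similar enough to at least one target document."""
    vec = bow(candidate)
    return any(cosine(vec, bow(t)) >= threshold for t in targets)


def similar_to_centroid(candidate, targets, threshold):
    """True if the candidate is similar enough to the center of the target cluster."""
    centroid = Counter()
    for t in targets:
        centroid.update(bow(t))  # summed counts point the same way as the mean
    return cosine(bow(candidate), centroid) >= threshold


targets = ["dvd recorder is cheap", "bought a dvd recorder"]
print(similar_to_any("dvd recorder price", targets, 0.5))        # prints True
print(similar_to_centroid("dvd recorder price", targets, 0.5))   # prints True
print(similar_to_any("weather today sunny", targets, 0.5))       # prints False
```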
- the condition shown in the 4th row and the 1st column of the table of Table 1 is a condition that “the set of other documents created or transmitted by the creator or sender of the target document set” is selected.
- that is, under the condition shown in the fourth row and the first column of Table 1, a document qualifies as a survey target document if it is another document created or sent by the creator or sender of a document belonging to the target document set.
- in this case, the survey target document condition selection unit 410 selects, as the extraction condition for the comparison target documents, that an electronic document shares its author or sender with part or all of the set of electronic documents including the input language expression. This method presupposes that meta information indicating the author or sender is given to all or part of the documents belonging to the analysis target population.
- a plurality of documents belong to the target document set. It can therefore be set arbitrarily whether a document qualifies if it shares an author or sender with at least one of them, or whether only other documents created or sent by authors or senders who have at least a certain number of documents in the target document set are used as survey target documents.
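Both author-based settings can be covered by one function with a minimum document count per author. The sketch assumes each document is a dictionary with hypothetical `id` and `author` keys; `min_docs=1` corresponds to "shares an author with at least one target document".

```python
from collections import Counter


def documents_by_shared_authors(population, target_docs, min_docs=1):
    """Select other documents by authors with at least min_docs target documents."""
    author_counts = Counter(d["author"] for d in target_docs)
    authors = {a for a, n in author_counts.items() if n >= min_docs}
    target_ids = {d["id"] for d in target_docs}
    return [
        d for d in population
        if d["author"] in authors and d["id"] not in target_ids
    ]


population = [
    {"id": 1, "author": "alice"},
    {"id": 2, "author": "alice"},
    {"id": 3, "author": "bob"},
    {"id": 4, "author": "alice"},
    {"id": 5, "author": "bob"},
    {"id": 6, "author": "carol"},
]
target_docs = population[:3]  # two documents by alice, one by bob
any_author = documents_by_shared_authors(population, target_docs)
frequent_only = documents_by_shared_authors(population, target_docs, min_docs=2)
print([d["id"] for d in any_author])     # prints [4, 5]
print([d["id"] for d in frequent_only])  # prints [4]
```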
- the conditions shown in the first to fourth rows of Table 1 are examples of conditions for defining the survey target documents, and the conditions for defining the survey target documents are not limited to these.
- time-related conditions such as “a document created / sent within a certain period from the date / time when the document of interest was created / sent” may be used.
- compound conditions may also be defined by combining multiple conditions with AND/OR. For example, a compound condition such as “a document linked in one hop from any document in the target document set, or a document linked in two hops from any document in the target document set that is in the same field as the link-source target document” may be defined.
- the conditions for defining the survey target documents, such as those shown in the first to fourth rows of Table 1, are determined in advance according to the purpose and use of the information analysis apparatus, the characteristics of the analysis target population, and the like.
- the survey target document condition selection unit 410 reads the target language expression or the target document set and makes the predetermined condition concrete.
- for example, suppose the predetermined condition is “a document belonging to the category to which the largest number of documents in the target document set belong”.
- if the survey target document condition selection unit 410 actually reads the target document set and the largest category is “AV device”, the condition is made concrete, and “a document belonging to the category ‘AV device’” is set as the condition defining the survey target documents.
- the survey target document set acquisition unit 420 has a function of acquiring (extracting) from the document set database 30, as the survey target document set, the set of documents that satisfy the conditions determined by the survey target document condition selection unit 410.
- the characteristic language expression extraction unit 430 has a function of performing language analysis on each survey target document acquired by the survey target document set acquisition unit 420. Next, the characteristic linguistic expression extraction unit 430 has a function of extracting a characteristic linguistic expression from among the linguistic expressions included in the investigation target document based on the linguistic analysis result. Further, the characteristic language expression extraction unit 430 has a function of obtaining the extracted characteristic language expression as a related language expression candidate.
- the second column of the first to fourth rows of Table 1 shows examples of conditions for extracting characteristic language expressions from the survey target documents as related language expression candidates.
- the condition of the related language expression candidate in the second column of the first row in Table 1 is “language expression characteristic in the document to be investigated”.
- Characteristic linguistic expressions are, for example, “linguistic expressions that appear frequently”, “linguistic expressions that appear significantly more frequently in the survey target documents than in the population of the document set”, or “linguistic expressions representing the subject of the survey target documents”.
- Language expressions other than those exemplified may also be used as characteristic language expressions. The threshold values used for determinations such as “language expression that appears frequently” are also set in advance.
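- One possible reading of “appears significantly more frequently in the survey target documents than in the population” is a simple ratio of relative frequencies, sketched below in Python. The function name, the thresholds `min_count` and `min_ratio`, and the representation of each document as a list of already analyzed language expressions are assumptions for illustration; a statistical test could be substituted.

```python
from collections import Counter

def characteristic_expressions(survey_docs, population_docs,
                               min_count=2, min_ratio=2.0):
    """Return expressions that are frequent in the survey documents and
    relatively more frequent there than in the whole population."""
    survey = Counter(e for doc in survey_docs for e in doc)
    population = Counter(e for doc in population_docs for e in doc)
    n_survey = sum(survey.values())
    n_population = sum(population.values())
    selected = []
    for expr, count in survey.items():
        if count < min_count:          # preset threshold for "appears frequently"
            continue
        survey_rate = count / n_survey
        population_rate = population[expr] / n_population
        # An expression absent from the population counts as characteristic.
        if population_rate == 0 or survey_rate / population_rate >= min_ratio:
            selected.append(expr)
    return sorted(selected)

# Hypothetical toy data; the population includes the survey documents.
survey_docs = [["gel", "earthquake"], ["gel", "shelf"]]
population_docs = survey_docs + [
    ["shelf", "price"], ["price", "earthquake"],
    ["price", "shop"], ["shop", "shelf"],
]
candidates = characteristic_expressions(survey_docs, population_docs)
```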
- The conditions of the related language expression candidates in the second column of the second to fourth rows of Table 1 are the same as the condition in the second column of the first row, except that the conditions of the documents to be investigated differ.
- Accordingly, the characteristic linguistic expressions in the second column of the second to fourth rows are the same as those in the second column of the first row.
- The condition of the related language expression candidate in the second column of the fifth row of Table 1 is “a language expression that appears in the target document set and correlates with the target language expression at a certain value or more”.
- Alternatively, the target document set may be subdivided according to the time information, category information, or text content given to each document, and a linguistic expression that appears correlated with the target language expression at a certain value or more within a subdivided document set may be used.
- The characteristic language expression extraction unit 430 may set all characteristic language expressions found in each survey target document as related language expression candidates. Alternatively, the characteristic language expression extraction unit 430 may extract characteristic language expressions from the entire survey target document set using a text mining technique or a multi-document summarization technique and use the extracted language expressions as related language expression candidates.
- The three functional units, namely the survey target document condition selection unit 410, the survey target document set acquisition unit 420, and the characteristic language expression extraction unit 430, together function as the related language expression candidate generation unit 40, which generates related language expression candidates.
- the related language expression candidate time-series data acquisition unit 50 is realized by a CPU of an information processing apparatus that operates according to a program.
- the related language expression candidate time-series data acquisition unit 50 has a function of acquiring (extracting) time-series data of each related language expression candidate generated by the related language expression candidate generation unit 40 from the document set database 30.
- The processing method by which the related language expression candidate time-series data acquisition unit 50 extracts the time-series data is the same as the processing method by which the target language expression time-series data acquisition unit 20 extracts time-series data.
- The acquisition range (start time and end time) and the period length of the time-series data are made the same as the range and period length of the target language expression time-series data, so that the time series analysis unit 60 can analyze the temporal correlation with the target language expression time-series data.
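- The following Python sketch illustrates one way to build such time-series data: it counts, per month, the documents that contain a given expression, over an explicitly specified range. Monthly periods, substring matching, the `date` and `text` keys, and the function names are all assumptions for illustration. Calling the same function with the same range for the target language expression and for each candidate yields series of equal length that can be compared directly.

```python
from datetime import date

def month_range(start, end):
    """List (year, month) pairs from `start` to `end`, both inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def monthly_series(docs, expression, start, end):
    """Count documents containing `expression` in each month of the range."""
    months = month_range(start, end)
    counts = dict.fromkeys(months, 0)
    for doc in docs:
        key = (doc["date"].year, doc["date"].month)
        if key in counts and expression in doc["text"]:
            counts[key] += 1
    return [counts[m] for m in months]

# Hypothetical toy documents with time information.
docs = [
    {"date": date(2004, 10, 3), "text": "an earthquake-resistant gel is effective"},
    {"date": date(2004, 12, 1), "text": "earthquake-resistant gel on sale"},
    {"date": date(2004, 12, 20), "text": "gel"},
    {"date": date(2005, 2, 1), "text": "earthquake-resistant gel"},
]
series = monthly_series(docs, "earthquake-resistant gel", (2004, 10), (2005, 1))
```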
- the time series analysis unit 60 is specifically realized by a CPU of an information processing device that operates according to a program.
- The time series analysis unit 60 has a function of analyzing the presence or absence of a temporal correlation between the time-series data acquired by the target language expression time-series data acquisition unit 20 and the time-series data of each related language expression candidate acquired by the related language expression candidate time-series data acquisition unit 50. That is, when there are three related language expression candidates, candidate 1, candidate 2, and candidate 3, the time series analysis unit 60 analyzes the presence or absence of a temporal correlation for each of the three combinations (target language expression, candidate 1), (target language expression, candidate 2), and (target language expression, candidate 3).
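- The description does not fix a particular correlation measure. As a hedged illustration, the Python sketch below uses the Pearson correlation coefficient and searches a small range of time delays for the strongest correlation between the target series and one candidate series. The function names and the parameter `max_lag` are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def best_lagged_correlation(target, candidate, max_lag=2):
    """Return (lag, r) with the largest |r|.
    A positive lag means the candidate precedes the target by `lag` periods."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = target[lag:], candidate[:len(candidate) - lag]
        else:
            x, y = target[:lag], candidate[-lag:]
        if len(x) < 3:                 # too few points to be meaningful
            continue
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Hypothetical series: the candidate rises one period before the target.
target = [0, 0, 1, 5, 9, 9]
candidate = [0, 1, 5, 9, 9, 9]
lag, r = best_lagged_correlation(target, candidate)
```

- Running the same function once per candidate covers the pairwise combinations described above; a negative coefficient corresponds to a negative correlation.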
- The amount of calculation required to investigate the temporal correlation between two time series increases as the time range of the time-series data to be examined becomes larger and as the allowable time delay becomes longer. Therefore, before the temporal correlation between two time-series data is examined, change points at which a large change occurs may first be detected in each time-series data. It is then checked whether a change point corresponding to a change point of one time-series data exists in the other time-series data, and the temporal correlation is examined only in the section surrounding change points that may correspond. Alternatively, only a fixed interval around the change points in each time-series data may be set as the target range for time series analysis.
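- The narrowing step can be sketched as follows in Python. Here a change point is simply a period whose value differs from the previous period by at least a threshold, which is only one of many possible detectors; `threshold`, `max_delay`, `margin`, and the function names are assumptions for illustration.

```python
def change_points(series, threshold):
    """Indices where the value jumps by at least `threshold` from the previous one."""
    return [t for t in range(1, len(series))
            if abs(series[t] - series[t - 1]) >= threshold]

def windows_to_examine(target, candidate, threshold, max_delay, margin):
    """Return (start, end) index ranges around pairs of change points that lie
    within `max_delay` periods of each other; `end` is exclusive."""
    n = len(target)
    windows = []
    for t in change_points(target, threshold):
        for c in change_points(candidate, threshold):
            if abs(t - c) <= max_delay:
                start = max(0, min(t, c) - margin)
                end = min(n, max(t, c) + margin + 1)
                windows.append((start, end))
    return windows

# Hypothetical series with one jump each, one period apart.
target = [0, 0, 0, 10, 10, 10, 10, 10]
candidate = [0, 0, 0, 0, 8, 8, 8, 8]
windows = windows_to_examine(target, candidate, threshold=5, max_delay=1, margin=2)
```

- Only the returned windows would then be passed to the correlation analysis, so a candidate without any corresponding change point costs almost nothing.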
- In time-series data, a time point at which the value changes from 0 (or a very small value) to a positive value is called an appearance point, and a time point at which the value changes from a positive value to 0 (or a very small value) is called a vanishing point.
- Attention may be paid to the appearance points or vanishing points of either of the time-series data, and a certain interval around an appearance point or vanishing point may be set as a target range in which time series analysis is performed preferentially.
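- The two definitions translate directly into code. In the Python sketch below, `eps` stands for the “very small value” mentioned above; its default of 0 and the function name are assumptions for illustration.

```python
def appearance_and_vanishing_points(series, eps=0):
    """Return (appearance_points, vanishing_points) as lists of indices.
    Values at or below `eps` are treated as zero."""
    appearance, vanishing = [], []
    for t in range(1, len(series)):
        previous, current = series[t - 1], series[t]
        if previous <= eps < current:
            appearance.append(t)       # zero (or tiny) -> positive
        elif current <= eps < previous:
            vanishing.append(t)        # positive -> zero (or tiny)
    return appearance, vanishing

appearance, vanishing = appearance_and_vanishing_points([0, 0, 3, 5, 0, 0, 2])
```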
- FIG. 3 is an explanatory diagram showing an example of time-series data of related language expression candidates that have a positive correlation with the target language expression.
- In this example, the target language expression is “An earthquake-resistant gel is effective”, and the related language expression candidates are “A Chuetsu earthquake occurs” and “Use a stick”.
- The number of appearances of each language expression on the Internet is used as time series data.
- The target language expression “An earthquake-resistant gel is effective” has increased rapidly since the latter half of 2004.
- The related language expression candidate “A Chuetsu earthquake occurs” appears to correlate positively with the increase in the target language expression “An earthquake-resistant gel is effective”.
- FIG. 4 is an explanatory diagram illustrating an example of time-series data of related language expression candidates that have a negative correlation with the target language expression.
- In this example, the target language expression is “diesel vehicles are bad for the environment”, and the related language expression candidate is “diesel vehicles are low pollution”.
- The number of appearances of each language expression on the Internet is used as time series data.
- The target language expression “diesel vehicles are bad for the environment” has decreased sharply since the middle of 2005, while the related language expression candidate “diesel vehicles are low pollution” has increased sharply since around May 2005. A negative correlation is seen in the range around November 2005.
- the relevance calculation unit 70 is realized by a CPU of an information processing device that operates according to a program.
- The relevance calculation unit 70 has a function of calculating the relevance between the target language expression and each related language expression candidate using the analysis result of the time series analysis unit 60.
- The relevance calculation unit 70 may calculate the relevance for every related language expression candidate generated by the related language expression candidate generation unit 40. Alternatively, the relevance calculation unit 70 may calculate the relevance only for the related language expression candidates for which the time series analysis unit 60 has detected a temporal correlation with the target language expression that is equal to or greater than a certain value.
- The relevance basically indicates the strength of the temporal correlation detected by the time series analysis unit 60. Specifically, a correlation coefficient indicating the degree of correlation between the time-series data of the target language expression and the time-series data of the related language expression candidate can be used as the relevance. The relevance calculation unit 70 may obtain the relevance by averaging the correlation coefficient over the time ranges in which a correlation is found, or by taking the maximum value over those time ranges. The relevance calculation unit 70 may also use, as the relevance, a value obtained by applying some normalization or representative-value processing to the correlation coefficient.
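- The averaging and maximum options can be sketched in Python as follows. Taking the absolute value, so that a strong negative correlation also yields a high relevance, is an assumption of this sketch, as are the function name and the `mode` parameter.

```python
def relevance_from_windows(correlations, mode="max"):
    """Combine per-window correlation coefficients into one relevance value.
    Assumption: the sign is discarded, so negative correlation also counts."""
    if not correlations:
        return 0.0
    strengths = [abs(r) for r in correlations]
    if mode == "max":
        return max(strengths)
    if mode == "mean":
        return sum(strengths) / len(strengths)
    raise ValueError("mode must be 'max' or 'mean'")

# Hypothetical coefficients from three time ranges.
window_correlations = [0.2, -0.9, 0.4]
relevance_max = relevance_from_windows(window_correlations, "max")
relevance_mean = relevance_from_windows(window_correlations, "mean")
```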
- When generating related language expression candidates, the related language expression candidate generation unit 40 may use some measure for selecting the candidates.
- Measures used to select related language expression candidates include the number of link hops from a target document to a document including the related language expression candidate, and the text similarity between the target document set and the documents including the related language expression candidate.
- the relevance calculation unit 70 has a function of passing (outputting) the related language expression candidate and the calculation result of the relevance to the related information output device 80.
- The relevance calculation unit 70 may pass the analysis result of the time series analysis unit 60 and the time range in which a temporal correlation is detected to the related information output device 80 in addition to the relevance.
- The related information output device 80 is specifically realized by a CPU of an information processing device that operates according to a program and by an output device such as a liquid crystal display device.
- The related information output device 80 has a function of outputting, as related information of the target language expression, language expressions having a high relevance to the target language expression, based on the calculation result of the relevance calculation unit 70.
- Among the related language expression candidates for which the relevance calculation unit 70 has calculated the relevance, the related information output device 80 may output only those whose relevance is equal to or greater than a separately defined threshold. Alternatively, it may output all pairs of related language expression candidate and relevance.
- The related information output device 80 may output, for each related language expression candidate, the time range in which a temporal correlation with the target language expression exists. The related information output device 80 may further output the time series data of the related language expression candidates.
- The information analysis apparatus can analyze the relevance to the target language expression even for a language expression that shows little statistical tendency to co-occur in the same document as the target language expression to be analyzed. Conversely, a language expression that tends to appear together with the target language expression in the target document set is obviously related to the target language expression even without the information analysis apparatus shown in the present embodiment. Therefore, even if such a trivial language expression is included in the related language expression candidates, the related information output device 80 may refrain from outputting it and output only non-trivial language expressions.
- The screening of the related language expression candidates to be output as described above may be performed by any of the following functional units: the related language expression candidate generation unit 40, the related language expression candidate time-series data acquisition unit 50, the time series analysis unit 60, the relevance calculation unit 70, and the related information output device 80.
- For example, the degree of co-occurrence with the target language expression in the target document set may be examined using text mining technology, and language expressions that statistically appear with a correlation at or above a certain value with the target language expression may be excluded from the related language expression candidates.
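- A hedged Python sketch of this screening follows. It assumes that each candidate already carries a relevance value and a co-occurrence score with respect to the target language expression; the keys `expression`, `relevance`, and `cooccurrence`, the two thresholds, and the function name are hypothetical.

```python
def select_output(candidates, relevance_threshold, cooccurrence_limit):
    """Keep candidates that are relevant enough but do not trivially
    co-occur with the target expression; sort by descending relevance."""
    kept = [c for c in candidates
            if c["relevance"] >= relevance_threshold
            and c["cooccurrence"] < cooccurrence_limit]
    return sorted(kept, key=lambda c: c["relevance"], reverse=True)

# Hypothetical candidates and scores.
candidates = [
    {"expression": "A", "relevance": 0.95, "cooccurrence": 0.80},  # trivial
    {"expression": "B", "relevance": 0.70, "cooccurrence": 0.05},
    {"expression": "C", "relevance": 0.90, "cooccurrence": 0.10},
    {"expression": "D", "relevance": 0.20, "cooccurrence": 0.02},  # weak
]
output = select_output(candidates, relevance_threshold=0.6, cooccurrence_limit=0.5)
```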
- With the configuration described above, even for a language expression that does not statistically frequently co-occur in documents with the input target language expression, the information analysis apparatus can output that language expression as related information of the target language expression when the time-series data of the two language expressions have a temporal correlation.
- The storage device of the information processing apparatus that implements the information analysis apparatus stores various programs for analyzing information such as documents with time information. For example, the storage device stores an information analysis program for causing a computer to execute a related language expression candidate generation process, which generates candidates for language expressions highly relevant to the input language expression to be analyzed as related language expression candidates, and a relevance calculation process, which calculates the relevance between the input language expression and each generated related language expression candidate.
- FIG. 13 is a block diagram showing a configuration example of a computer constituting the failure cause analysis system of this embodiment.
- A program describing the functions of the time-series data acquisition unit 50, the time-series analysis unit 60, and the relevance calculation unit 70 is stored in a disk device 1005 such as a hard disk device, and the data of the document set database 30 is also stored in the disk device 1005.
- the CPU 1004 executes the program.
- the input unit 1001 constitutes a part of the target language expression input unit 10 and serves as an input device such as a keyboard.
- a display unit 1002 such as a liquid crystal display constitutes a part of the related information output device 80.
- The units of the information analysis apparatus are connected by a bus 1006 such as a data bus, and information necessary for information processing by the CPU 1004 is stored in a memory 1003 such as a DRAM.
- Each component shown in FIG. 1 is realized as a program for controlling the corresponding function. These programs are stored in a computer-readable information recording medium such as a flexible disk (FD), a CD-ROM, a DVD, or a flash memory, or are provided through a network such as the Internet. The information analysis apparatus may then be realized by reading these programs into a computer or the like and executing them.
- FIG. 5 is a flowchart showing the overall processing of the related information output operation executed by the information analysis apparatus.
- The target language expression input unit 10 receives, as an input, a language expression to be analyzed in accordance with a user operation (step A1).
- the target language expression time-series data acquisition unit 20 accesses the document set database 30 and acquires (extracts) time-series data for the target language expression from the document set database 30 (step A2).
- Since the process in step A2 is highly independent of the processes in steps A3 and A4 described later, the order of step A2 relative to steps A3 and A4 may be changed, provided that all of them are completed before step A5.
- the related language expression candidate generation unit 40 generates language expression candidates highly relevant to the target language expression input by the target language expression input unit 10 as related language expression candidates (step A3).
- For each related language expression candidate generated by the related language expression candidate generation unit 40, the related language expression candidate time-series data acquisition unit 50 acquires (extracts) the time-series data of that candidate from the document set database 30 by the same process as in step A2 (step A4).
- The time series analysis unit 60 performs a time series analysis to obtain the temporal correlation between the time series data for the target language expression acquired in step A2 and the time series data of each related language expression candidate acquired in step A4 (step A5).
- the relevance calculation unit 70 calculates the relevance between the target language expression and the related language expression candidate using the analysis result of the time series analysis obtained in step A5 (step A6).
- the related information output device 80 outputs a related language expression having a high degree of relevance as related information for the target language expression based on the relevance degree obtained by the relevance degree calculation unit 70 (step A7).
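- The overall flow of steps A1 to A7 can be summarized in a short Python sketch. The data access and candidate generation are passed in as functions because their details depend on the embodiment; `analyze`, `get_series`, `generate_candidates`, the Pearson coefficient, the use of its absolute value as the relevance, and the toy data are all assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def analyze(target, get_series, generate_candidates, threshold):
    """Run steps A2 to A7 for a target expression received in step A1."""
    target_series = get_series(target)                 # step A2
    candidates = generate_candidates(target)           # step A3
    results = []
    for candidate in candidates:
        series = get_series(candidate)                 # step A4
        r = pearson(target_series, series)             # step A5
        relevance = abs(r)                             # step A6
        if relevance >= threshold:
            results.append((candidate, relevance))
    # step A7: output the candidates in descending order of relevance
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-ins for the document set database and for unit 40.
TOY_SERIES = {
    "target": [0, 1, 4, 9],
    "candidate A": [0, 2, 8, 18],
    "candidate B": [5, 5, 5, 5],
}
related = analyze("target", TOY_SERIES.__getitem__,
                  lambda expression: ["candidate A", "candidate B"],
                  threshold=0.8)
```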
- FIG. 6 is a flowchart illustrating an example of related language expression candidate generation processing executed by the related language expression candidate generation unit 40.
- In order to obtain related language expression candidates, the survey target document condition selection unit 410 selects, as the condition of the documents to be investigated, a condition defining a document set that differs from the target document set including the target language expression but has a certain relationship with the target language expression or the target document set (step B1).
- the survey target document set acquisition unit 420 acquires (extracts) a set of survey target documents that satisfy the conditions selected in step B1 from the document set database 30 (step B2).
- The characteristic language expression extraction unit 430 extracts characteristic language expressions from the survey target document set acquired by the survey target document set acquisition unit 420 as related language expression candidates (step B3), and the related language expression candidate generation process ends.
- Candidates for language expressions highly relevant to the input language expression to be analyzed are generated as related language expression candidates. Then, the relevance between the input language expression and each generated related language expression candidate is calculated. Therefore, a language expression can be obtained as a highly relevant language expression even if it does not co-occur in the same document as the target language expression to be analyzed. In other words, the relevance to the target language expression can be analyzed even for a language expression that shows little statistical tendency to co-occur in the same document as the target language expression.
- Candidates for highly relevant language expressions are narrowed down on the basis of the content of the target language expression, the text content of the documents including the target language expression, and the meta information attached to those documents. Then, by performing time series analysis on the narrowed-down related language expression candidates and the target language expression, language expressions highly relevant to the target language expression can be output.
- When the related language expression candidate generation unit 40 has the detailed configuration shown in FIG. 2, survey target documents that are not the target document set itself but have a certain relationship with the target language expression or the target document set are first selected, and the language expressions included in the selected survey target documents can be used as related language expression candidates. Therefore, the number of language expression candidates for which the time series analysis unit 60 obtains a temporal correlation can be appropriately narrowed down, and the processing can be made efficient.
- This approach rests on the idea that a related language expression, even if it does not appear frequently in the target document set itself, is highly likely to appear in documents that have a certain relationship with the target language expression or the target document set. By appropriately selecting the survey target documents, the related language expression candidates that truly have a high temporal correlation are narrowed down to the characteristic language expressions appearing in the survey target documents. In addition, even a language expression that never appears in the target document set can be output as a related language expression if it is included in the survey target documents and has a temporal correlation with the target language expression in the analysis target population.
- FIG. 7 is a block diagram illustrating a configuration example of the related language expression candidate generation unit 40 according to the second embodiment.
- The second embodiment differs from the first embodiment in that the related language expression candidate generation unit 40 includes a target document set correlation analysis unit 440 and a limited correlation language expression extraction unit 450.
- The related language expression candidate generation unit 40 may include the target document set correlation analysis unit 440 and the limited correlation language expression extraction unit 450 in addition to the components shown in the first embodiment.
- The only difference from the first embodiment is the internal configuration of the related language expression candidate generation unit 40; the overall configuration of the information analysis apparatus is the same as that of the first embodiment (see FIG. 1), so its description is omitted.
- the related language expression candidate generation unit 40 includes a focused document set correlation analysis unit 440 and a limited correlation language expression extraction unit 450.
- the target document set correlation analysis unit 440 analyzes the target document set for the presence or absence of a language expression that appears in a limited correlation with the target language expression.
- the limited correlation language expression extraction unit 450 extracts language expressions that are correlated in a limited manner from the analysis result of the target document set.
- the target document set correlation analysis unit 440 has a function of analyzing the correlation between the language expression included in the target document set and the target language expression using text mining technology.
- the target document set correlation analysis unit 440 obtains a language expression that appears in correlation with the input language expression in a part or all of the set of electronic documents including the input language expression.
- The target document set correlation analysis unit 440 may subdivide the target document set into several subsets and, instead of analyzing the entire target document set, analyze within each subset the correlation between the language expressions included in that subset and the target language expression.
- When meta information is given to each document, the target document set correlation analysis unit 440 may subdivide the documents by classifying them according to that meta information.
- The target document set correlation analysis unit 440 may also divide the documents at regular intervals using the time information given to each document.
- The target document set correlation analysis unit 440 may also subdivide the documents according to their text contents by using an existing text clustering technique.
- The limited correlation language expression extraction unit 450 has a function of receiving the analysis result of the target document set correlation analysis unit 440 and obtaining, as related language expression candidates, language expressions that correlate in a limited manner with the target language expression.
- The limited correlation language expression extraction unit 450 uses the calculation result of the target document set correlation analysis unit 440 to extract, as related language expression candidates, language expressions that appear correlated with the input language expression at a certain value or more.
- Here, when the target document set correlation analysis unit 440 analyzes the entire target document set, a language expression that correlates in a limited manner means a language expression whose value indicating the degree of correlation with the target language expression lies between a certain lower limit value and an upper limit value.
- Linguistic expressions whose degree of correlation with the target language expression is higher than a certain threshold can be obtained using text mining technology. Therefore, when realizing an information analysis apparatus, if a language expression that can be obtained by a related technique such as the text mining technique is not targeted, a threshold value therefor may be set as the upper limit value. Conversely, when it is desired to obtain language expressions that can be obtained using the text mining technique at once, the upper limit value need not be set.
- If the lower limit value is set too low, the number of language expressions extracted as related language expression candidates increases, and the amount of calculation in the time series analysis unit 60 also increases. Therefore, the lower limit value is set in advance in consideration of the use and purpose of the information analysis apparatus, the properties of the analysis target population, and the like.
- When the target document set correlation analysis unit 440 analyzes the correlation with the target language expression for subsets of the target document set, the limited correlation language expression extraction unit 450 extracts, as language expressions that correlate in a limited manner, those whose correlation with the target language expression in some subset is at or above a certain value, and obtains them as related language expression candidates. In this way, for example, a language expression is extracted that cannot be said to correlate particularly with the target language expression in the entire target document set but correlates highly with it when the analysis is restricted to documents of a certain period or category.
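- The following Python sketch illustrates limited correlation on toy data. The Jaccard coefficient of document sets is used as the correlation value purely as an assumption, and the sketch applies a single value `lower` both to the entire set and to each subset, which simplifies the description above. The function names and the `(subset key, set of expressions)` document layout are also hypothetical.

```python
from collections import defaultdict

def jaccard(docs, target, expr):
    """Jaccard coefficient of the documents containing `target` and `expr`."""
    with_target = {i for i, d in enumerate(docs) if target in d}
    with_expr = {i for i, d in enumerate(docs) if expr in d}
    union = with_target | with_expr
    return len(with_target & with_expr) / len(union) if union else 0.0

def limited_correlation_candidates(docs, target, lower, upper=None):
    """docs: list of (subset_key, set_of_expressions).
    Keep expressions whose correlation reaches `lower` overall or in some
    subset, skipping those at or above `upper` in the entire set."""
    groups = defaultdict(list)
    for key, exprs in docs:
        groups[key].append(exprs)
    everything = [exprs for _, exprs in docs]
    vocabulary = set().union(*everything) - {target}
    selected = set()
    for expr in vocabulary:
        overall = jaccard(everything, target, expr)
        if upper is not None and overall >= upper:
            continue               # obtainable by ordinary text mining
        if overall >= lower or any(jaccard(g, target, expr) >= lower
                                   for g in groups.values()):
            selected.add(expr)
    return sorted(selected)

# Hypothetical documents, subdivided by year.
docs = [
    ("2004", {"gel", "earthquake"}), ("2004", {"gel", "earthquake"}),
    ("2005", {"gel", "shelf"}), ("2005", {"gel", "shelf"}),
    ("2005", {"earthquake", "news"}), ("2005", {"earthquake", "news"}),
    ("2005", {"gel", "price"}),
]
only_in_subset = limited_correlation_candidates(docs, "gel", lower=0.9)
bounded = limited_correlation_candidates(docs, "gel", lower=0.3, upper=0.35)
```

- In the toy data, “earthquake” correlates only weakly with the target over the entire set but perfectly within the 2004 subset, so it is picked up as a candidate for the subsequent time series analysis.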
- the information analysis apparatus includes the internal configuration of the related language expression candidate generation unit 40 described above and the overall configuration illustrated in FIG.
- Each component shown in FIGS. 1 and 7 is realized as a program for controlling the corresponding function. These programs are stored in a computer-readable information recording medium such as a flexible disk (FD), a CD-ROM, a DVD, or a flash memory, or are provided through a network such as the Internet. The information analysis apparatus may then be realized by reading these programs into a computer or the like and executing them.
- FIG. 8 is a flowchart illustrating an example of related language expression candidate generation processing executed by the related language expression candidate generation unit 40 according to the second embodiment.
- the target document set correlation analysis unit 440 performs correlation analysis of the target language expression in all or a partial subset of the target document set (step C1).
- the limited correlation language expression extraction unit 450 extracts a language expression that correlates limitedly with the target language expression based on the result of the correlation analysis in Step C1, and outputs it as a related language expression candidate (Step C2).
- the related language expression candidate generation process in the present embodiment is completed.
- The related language expression candidate generation unit 40 is configured to have the detailed configuration shown in FIG. With this configuration, the correlation with the target language expression can be detected even for language expressions for which no large correlation with the target language expression can be found using the text mining technique described as related technology. That is, in this embodiment, language expressions that correlate with the target language expression only in a limited manner in the target document set are first extracted as related language expression candidates. Then, for these related language expression candidates, the temporal correlation between the target language expression and each candidate in the entire analysis target population is examined. By checking in this way whether a candidate is really related to the target language expression, the correlation with the target language expression can be detected even for language expressions for which text mining technology alone finds no large correlation.
- FIG. 9 is a block diagram illustrating a configuration example of the related language expression candidate generation unit 40 according to the third embodiment.
- the related language expression candidate generation unit 40 includes a target document set analysis unit 460 and a relevance suggestion language expression extraction unit 470.
- The related language expression candidate generation unit 40 may include the target document set analysis unit 460 and the relevance suggestion language expression extraction unit 470 in addition to the components shown in the first embodiment or the second embodiment.
- The only difference from the first embodiment is the internal configuration of the related language expression candidate generation unit 40; the overall configuration of the information analysis apparatus is the same as that of the first embodiment (see FIG. 1), so its description is omitted.
- the related language expression candidate generation unit 40 includes a target document set analysis unit 460 and a relevance suggestion language expression extraction unit 470.
- the target document set analysis unit 460 performs language analysis of the target document set.
- the relevance-suggesting language expression extraction unit 470 extracts a language expression having a description that suggests relevance to the target language expression from the result of language analysis.
- the target document set analysis unit 460 has a function of first obtaining a target document set and performing language analysis of each document of the determined target document set.
- The target document set analysis unit 460 performs language analysis on a part or all of the set of electronic documents including the input language expression. The specific processing to be performed as language analysis is determined according to the type and form of the language expressions handled when the information analysis apparatus is realized. If the language analysis of each document has already been completed in the process of obtaining the target document set, there is no need to perform a new language analysis.
- the relevance-suggesting language expression extraction unit 470 has a function of examining, for each document in the target document set, the language analysis result around the target language expression and searching for descriptions of other language expressions that are suggested to be related to the target language expression.
- the relevance-suggesting language expression extraction unit 470 uses the analysis result of the target document set analysis unit 460 to extract, as related language expression candidates, language expressions that are suggested to be related to the input language expression.
- the relevance-suggesting language expression extraction unit 470 extracts all language expressions for which such a relation is suggested and outputs them as related language expression candidates.
- the relevance-suggesting language expression extraction unit 470 may perform syntax analysis and semantic analysis on each document in the target document set and extract, from the analysis result, language expressions for which a relationship with the target language expression is suggested.
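As a rough illustration of searching around the target language expression for expressions whose relevance is suggested, the sketch below uses plain cue phrases instead of real syntax or semantic analysis. The cue phrases, the pattern list `CUE_PATTERNS`, and the function name are assumptions made for this example only; the text above does not specify which cues are used.

```python
import re

# Hypothetical cue patterns; {target} is replaced by the escaped target expression.
# The candidate is captured greedily up to the end of the lowercase word run.
CUE_PATTERNS = [
    r"{target} (?:is|was) (?:related to|caused by|linked to) (?P<cand>[a-z][a-z ]*[a-z])",
    r"(?P<cand>[a-z][a-z ]*[a-z]) (?:is|was) (?:related to|caused by|linked to) {target}",
]

def extract_suggested_expressions(target, documents):
    """Return expressions that some document explicitly links to `target`."""
    found = []
    for text in documents:
        lowered = text.lower()
        for template in CUE_PATTERNS:
            pattern = template.format(target=re.escape(target.lower()))
            for match in re.finditer(pattern, lowered):
                cand = match.group("cand").strip()
                if cand not in found:
                    found.append(cand)
    return found
```

Every match is kept, since at this stage the results are only candidates; a real implementation would presumably replace the regular expressions with dependency or semantic analysis.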
- the information analysis apparatus includes the internal configuration of the related language expression candidate generation unit 40 described above and the overall configuration illustrated in FIG.
- each component shown in FIGS. 1 and 9 may be realized as a program that controls the corresponding function. The programs are stored on a computer-readable information recording medium such as a flexible disk (FD, floppy disk), a CD-ROM, a DVD, or a flash memory, or are provided through a network such as the Internet. The information analysis apparatus may then be realized by loading these programs into a computer or the like and executing them.
- FIG. 10 is a flowchart illustrating an example of related language expression candidate generation processing executed by the related language expression candidate generation unit 40 according to the third embodiment.
- the target document set analysis unit 460 performs language analysis of the target document set (step D1).
- the relevance-suggesting language expression extraction unit 470 searches each document in the target document set for descriptions of other language expressions that are suggested to be related to the target language expression.
- the relevance-suggesting language expression extraction unit 470 extracts the language expressions found by the search and outputs them as related language expression candidates (step D2), which ends the related language expression candidate generation process in the present embodiment.
- when the related language expression candidate generation unit 40 has the detailed configuration shown in FIG. 9, a relationship between the target language expression and another language expression can be detected as long as even one creator of a target document has recognized that relationship and described it in the document. Of course, descriptions by document creators contain many mistakes, so language expressions for which relevance is suggested are first extracted only as related language expression candidates. The temporal correlation between the target language expression and each related language expression candidate is then examined over the entire analysis target population. Confirming in this way whether a candidate is really related to the target language expression makes it possible to detect related information with high accuracy.
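The confirmation step, in which candidates are checked by temporal correlation over the whole population, could look roughly like the sketch below. It assumes that the time-series data are per-day document counts and that the correlation measure is the Pearson coefficient with a fixed threshold; this passage fixes none of these, so all names, the data layout, and the threshold value are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def daily_counts(expression, dated_documents, days):
    """Count, for each day, the documents whose text mentions `expression`."""
    counts = Counter(day for day, text in dated_documents if expression in text)
    return [counts[day] for day in days]

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def confirm_candidates(target, candidates, dated_documents, days, threshold=0.5):
    """Keep only candidates whose daily counts track those of the target."""
    target_series = daily_counts(target, dated_documents, days)
    return [
        cand for cand in candidates
        if pearson(target_series, daily_counts(cand, dated_documents, days)) >= threshold
    ]
```

Candidates that were suggested by a single mistaken description tend to have series that do not move with the target, so they fall below the threshold and are dropped.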
- FIG. 11 is a block diagram illustrating a configuration example of the related language expression candidate generation unit 40 in the fourth embodiment.
- the related language expression candidate generation unit 40 includes a target language expression analysis unit 480 and an opposing language expression generation unit 490, which is the difference from the first embodiment.
- the related language expression candidate generation unit 40 may include a target language expression analysis unit 480 and an opposing language expression generation unit 490 in addition to the constituent elements shown in the first to third embodiments.
- the only difference from the first embodiment is the internal configuration of the related language expression candidate generation unit 40; the overall configuration of the information analysis apparatus is the same as in the first embodiment (see FIG. 1), so its description is omitted.
- the target language expression analysis unit 480 performs language analysis of the target language expression.
- the opposing language expression generation unit 490 generates a language expression that opposes the target language expression from the result of language analysis.
- the target language expression analysis unit 480 has a function of performing language analysis of the target language expression.
- the specific contents of the language analysis depend on the processing performed by the opposing language expression generation unit 490 described later. For example, when the opposing language expression generation unit 490 generates a negated form of the target language expression, the target language expression analysis unit 480 needs to perform morphological analysis and syntax analysis.
- the opposing language expression generation unit 490 has a function of reading the language analysis result of the target language expression and generating a language expression that is semantically opposed to the target language expression.
- the opposing language expression generation unit 490 uses the analysis result of the target language expression analysis unit 480 to generate, as a related language expression candidate, a language expression that opposes the input language expression.
- the opposing language expression generation unit 490 generates, for example, a sentence in which an originally affirmative sentence is changed to the negative form.
- the opposing language expression generation unit 490 generates, for example, a sentence in which an originally negative sentence is changed to the affirmative form.
- the opposing language expression generation unit 490 generates semantically opposed language expressions by, for example, adding an adjective, adverb, prefix, or the like that expresses negation.
- the opposing language expression generation unit 490 can generate, from the target language expression “the earthquake resistant gel is valid”, opposing language expressions such as “the earthquake resistant gel is not valid” or “the earthquake resistant gel is invalid”. Such transformations are possible using pattern matching and syntax analysis techniques.
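The pattern-matching transformation from an affirmative sentence to its negated forms (and back) might be sketched as follows. This is an assumption-laden toy: it handles only English sentences of the shape "X is ADJ" or "X is not ADJ", and the table `NEGATIVE_PREFIX_FORMS` and the function name are hypothetical.

```python
import re

# Hypothetical table of adjectives with a known negative-prefix form.
NEGATIVE_PREFIX_FORMS = {"valid": "invalid", "effective": "ineffective"}

def generate_opposing(expression):
    """Generate opposing forms of a sentence shaped 'X is ADJ' or 'X is not ADJ'."""
    match = re.fullmatch(r"(?P<subject>.+) is (?P<neg>not )?(?P<adj>\w+)", expression)
    if match is None:
        return []  # shape not covered by this sketch
    subject, adj = match.group("subject"), match.group("adj")
    if match.group("neg"):
        # Negative form: drop the negation to obtain the affirmative form.
        return [f"{subject} is {adj}"]
    results = [f"{subject} is not {adj}"]  # add a negating adverb
    if adj in NEGATIVE_PREFIX_FORMS:
        results.append(f"{subject} is {NEGATIVE_PREFIX_FORMS[adj]}")  # add a negating prefix
    return results
```

For the example sentence in the text, `generate_opposing("the earthquake resistant gel is valid")` yields both the "is not valid" and the "is invalid" variants.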
- the opposing language expression generation unit 490 can also generate opposing language expressions by using various dictionary resources. For example, assume that the synonym dictionary stores the knowledge that “good for the environment” and “low pollution” are synonymous expressions. In this case, the opposing language expression generation unit 490 first generates the opposing expression “diesel vehicle is good for the environment” from the target language expression “diesel vehicle is bad for the environment”, and then uses the synonym dictionary to further generate “diesel vehicle is low pollution”.
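The dictionary-based expansion in the diesel vehicle example can be pictured with two small lookup tables. The tables `ANTONYMS` and `SYNONYMS` and the function `expand_opposing` are hypothetical stand-ins for the dictionary resources mentioned in the text, filled only with the phrases from that example.

```python
# Hypothetical dictionary resources, populated from the example in the text.
ANTONYMS = {"bad for the environment": "good for the environment"}
SYNONYMS = {"good for the environment": ["low pollution"]}

def expand_opposing(expression):
    """Generate opposing expressions, then widen them with synonyms."""
    results = []
    for phrase, opposite in ANTONYMS.items():
        if phrase in expression:
            opposed = expression.replace(phrase, opposite)
            results.append(opposed)  # direct opposite
            for synonym in SYNONYMS.get(opposite, []):
                # Synonym of the opposite phrase gives a further candidate.
                results.append(opposed.replace(opposite, synonym))
    return results
```

Keeping the antonym and synonym steps separate mirrors the two-stage generation described in the text.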
- the information analysis apparatus includes the internal configuration of the related language expression candidate generation unit 40 described above and the overall configuration illustrated in FIG.
- each component shown in FIGS. 1 and 11 may be realized as a program that controls the corresponding function. The programs are stored on a computer-readable information recording medium such as a flexible disk (FD, floppy disk), a CD-ROM, a DVD, or a flash memory, or are provided through a network such as the Internet. The information analysis apparatus may then be realized by loading these programs into a computer or the like and executing them.
- FIG. 12 is a flowchart illustrating an example of related language expression candidate generation processing executed by the related language expression candidate generation unit 40 according to the fourth embodiment.
- the target language expression analysis unit 480 performs language analysis of the target language expression (step E1).
- the opposing language expression generation unit 490 generates an opposing language expression that is semantically opposed to the target language expression from the language analysis result of the target language expression, and outputs it as a related language expression candidate (step E2).
- the related language expression candidate generation process in this embodiment is completed.
- when the related language expression candidate generation unit 40 has the detailed configuration shown in FIG. 11, an opposing language expression that is semantically opposed to the target language expression is generated directly by language processing technology. This makes it possible to detect relevance to the target language expression regardless of whether the opposing language expression appears in the target document set or the search target document set. However, not every opposing language expression actually correlates with the target language expression, so the opposing language expressions are first extracted only as related language expression candidates. The temporal correlation between the target language expression and each related language expression candidate is then examined over the entire analysis target population. Confirming in this way whether a candidate is really related to the target language expression makes it possible to detect related information with high accuracy.
- the information analysis apparatus of each embodiment described above can be realized by an information processing apparatus, such as a computer, that operates according to a program. That is, the information analysis apparatus according to the present invention can be realized by software. However, all or part of each unit of the information analysis apparatus shown in FIGS. 1, 2, 7, 9, and 11 may instead be implemented in hardware as a dedicated IC.
- when the information analysis device is a server connected to a terminal via a network, the target language expression input unit 10 and the related information output device 80 are communication units that communicate with the terminal, and a keyboard, a mouse, and a liquid crystal display device are not necessary.
- the information analysis apparatus of each embodiment described above can be applied to a search system that uses the information analysis apparatus to present language expressions highly relevant to an input language expression as related information or as related search conditions.
- FIG. 14 is a block diagram showing the configuration of the search system according to the present invention.
- the search system shown in FIG. 14 includes an information analysis device 200, a related information containing document search unit 90, a related document output device 100, and a search target document database 110.
- the information analysis apparatus 200 is the information analysis apparatus of the first embodiment shown in FIG. 1, but any of the information analysis apparatuses of the second to fourth embodiments may be used.
- the related information containing document search unit 90 receives, as a search condition, a related language expression output as related information by the related information output device 80, and searches the plurality of documents accessible in the search target document database 110 for documents that contain the received related language expression.
- the related document output apparatus 100 outputs the document searched by the related information containing document search unit 90 as a related document.
- the search target document database 110 is a database that enables access to a set of documents to be searched.
- the configuration of the search target document database 110 may be the same as that of the text set database 30 or may be a database that provides access to a document set such as Internet text.
- a set of documents to be searched may be stored in the search target document database 110, or only a means for accessing each document such as a URL may be provided, and the substance of the document may be stored externally.
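A search over a database that holds either the document body or only a means of access to it could be sketched as below. The `DocumentEntry` layout, the `fetch` callback that resolves a locator such as a URL to text, and the function name are assumptions for illustration; the text only states that either storage arrangement is possible.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentEntry:
    doc_id: str
    text: Optional[str] = None     # body stored in the database itself
    locator: Optional[str] = None  # e.g. a URL when the body is stored externally

def search_related_documents(related_expressions, entries, fetch):
    """Return ids of documents containing at least one related expression.

    `fetch` is any callable that turns a locator into document text.
    """
    hits = []
    for entry in entries:
        body = entry.text if entry.text is not None else fetch(entry.locator)
        if any(expression in body for expression in related_expressions):
            hits.append(entry.doc_id)
    return hits
```

Injecting `fetch` keeps the sketch free of any network code and lets the same search loop serve both storage arrangements.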
- the related information output device 80 only needs to have a function of outputting, as related information of the target language expression, language expressions having a high degree of relevance to it based on the calculation result of the relevance calculation unit 70; it need not itself include an output device.
- the present invention can be applied to the analysis of text data on the Internet, such as blogs, and of document data to which time information is attached, such as call center response histories. The present invention can also be applied to the analysis of periodically conducted questionnaire surveys and of market survey results. Furthermore, by detecting language expressions highly relevant to the target language expression, the present invention can be applied to uses such as document search navigation and search result classification.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
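a related language expression candidate generation unit that generates candidates for language expressions highly relevant to the language expression as related language expression candidates;
a related language expression candidate time-series data acquisition unit that acquires, for the related language expression candidates generated by the related language expression candidate generation unit, time-series data corresponding to the related language expression candidates;
a time-series analysis unit that analyzes the temporal correlation between the time-series data acquired by the target language expression time-series data acquisition unit and the time-series data acquired by the related language expression candidate time-series data acquisition unit; and
a relevance calculation unit that calculates, using the analysis result of the time-series analysis unit, the degree of relevance between the language expression and the related language expression candidates generated by the related language expression candidate generation unit,
characterized by including the units listed above.
and a related document output unit that outputs the documents retrieved by the related information containing document search unit, thereby constituting a search system.
generating candidates for language expressions highly relevant to the language expression as related language expression candidates,
acquiring, for the generated related language expression candidates, time-series data corresponding to the related language expression candidates,
analyzing the temporal correlation between the acquired time-series data corresponding to the language expression and the acquired time-series data corresponding to the related language expression candidates,
and calculating, using the analysis result of the temporal correlation between the time-series data, the degree of relevance between the language expression and the related language expression candidates; a search system including the foregoing.
a target language expression time-series data acquisition process of acquiring time-series data corresponding to an input language expression to be analyzed;
a related language expression candidate generation process of generating candidates for language expressions highly relevant to the language expression as related language expression candidates;
a related language expression candidate time-series data acquisition process of acquiring, for the generated related language expression candidates, time-series data corresponding to the related language expression candidates;
a time-series analysis process of analyzing the temporal correlation between the acquired time-series data corresponding to the language expression and the acquired time-series data corresponding to the related language expression candidates; and
a relevance calculation process of calculating, using the analysis result of the temporal correlation between the time-series data, the degree of relevance between the language expression and the related language expression candidates;
an information analysis program for causing the above processes to be executed.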
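20 Target language expression time-series data acquisition unit
30 Document set database
40 Related language expression candidate generation unit
50 Related language expression candidate time-series data acquisition unit
60 Time-series analysis unit
70 Relevance calculation unit
80 Related information output device
410 Survey target document condition selection unit
420 Survey target document set acquisition unit
430 Characteristic language expression extraction unit
440 Target document set correlation analysis unit
450 Limited correlation language expression extraction unit
460 Target document set analysis unit
470 Relevance-suggesting language expression extraction unit
480 Target language expression analysis unit
490 Opposing language expression generation unit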
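Hereinafter, an exemplary first embodiment of the present invention will be described with reference to the drawings. The present invention relates to an information analysis apparatus using an information analysis scheme that extracts, from a document set, related language expressions that are highly correlated in time series with the language expression of interest as the analysis target.
Next, an exemplary second embodiment of the present invention will be described with reference to the drawings. FIG. 7 is a block diagram showing a configuration example of the related language expression candidate generation unit 40 in the second embodiment. As shown in FIG. 7, the information analysis apparatus of this embodiment differs from the first embodiment in that the related language expression candidate generation unit 40 includes a target document set correlation analysis unit 440 and a limited correlation language expression extraction unit 450. Note that the related language expression candidate generation unit 40 may include the target document set correlation analysis unit 440 and the limited correlation language expression extraction unit 450 in addition to the components shown in the first embodiment.
Next, an exemplary third embodiment of the present invention will be described with reference to the drawings. FIG. 9 is a block diagram showing a configuration example of the related language expression candidate generation unit 40 in the third embodiment. As shown in FIG. 9, the information analysis apparatus of this embodiment differs from the first embodiment in that the related language expression candidate generation unit 40 includes a target document set analysis unit 460 and a relevance-suggesting language expression extraction unit 470. Note that the related language expression candidate generation unit 40 may include the target document set analysis unit 460 and the relevance-suggesting language expression extraction unit 470 in addition to the components shown in the first embodiment or the second embodiment.
Next, an exemplary fourth embodiment of the present invention will be described with reference to the drawings. FIG. 11 is a block diagram showing a configuration example of the related language expression candidate generation unit 40 in the fourth embodiment. As shown in FIG. 11, the information analysis apparatus of this embodiment differs from the first embodiment in that the related language expression candidate generation unit 40 includes a target language expression analysis unit 480 and an opposing language expression generation unit 490. Note that the related language expression candidate generation unit 40 may include the target language expression analysis unit 480 and the opposing language expression generation unit 490 in addition to the components shown in the first to third embodiments.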
Claims (28)
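- An information analysis apparatus comprising:
a target language expression time-series data acquisition unit that acquires time-series data corresponding to an input language expression to be analyzed;
a related language expression candidate generation unit that generates candidates for language expressions highly relevant to the language expression as related language expression candidates;
a related language expression candidate time-series data acquisition unit that acquires, for the related language expression candidates generated by the related language expression candidate generation unit, time-series data corresponding to the related language expression candidates;
a time-series analysis unit that analyzes the temporal correlation between the time-series data acquired by the target language expression time-series data acquisition unit and the time-series data acquired by the related language expression candidate time-series data acquisition unit; and
a relevance calculation unit that calculates, using the analysis result of the time-series analysis unit, the degree of relevance between the language expression and the related language expression candidates generated by the related language expression candidate generation unit.
- The information analysis apparatus according to claim 1, wherein the related language expression candidate generation unit includes:
a survey target document condition selection unit that selects an extraction condition for documents in which the related language expression candidates are to be investigated, using the text content of electronic documents containing the language expression or meta information attached to documents containing the language expression;
a survey target document set acquisition unit that acquires a set of electronic documents satisfying the extraction condition; and
a characteristic language expression extraction unit that extracts characteristic language expressions, as related language expression candidates, from the set of electronic documents acquired by the survey target document set acquisition unit.
- The information analysis apparatus according to claim 2, wherein the survey target document condition selection unit selects, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document in the same or a similar field, or an electronic document on the same or a similar topic, with respect to part or all of the set of electronic documents containing the language expression.
- The information analysis apparatus according to claim 2, wherein the survey target document condition selection unit selects, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document linked within a fixed number of hops from an electronic document containing the language expression.
- The information analysis apparatus according to claim 2, wherein the survey target document condition selection unit selects, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document having a text similarity within a fixed value with respect to an electronic document containing the language expression.
- The information analysis apparatus according to claim 2, wherein the survey target document condition selection unit selects, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be another electronic document that shares an author or sender with part or all of the set of electronic documents containing the language expression.
- The information analysis apparatus according to claim 1, wherein the related language expression candidate generation unit includes:
a target document set correlation analysis unit that determines language expressions that appear in correlation with the language expression in part or all of the set of electronic documents containing the language expression; and
a limited correlation language expression extraction unit that extracts, as related language expression candidates, language expressions that appear in correlation with the language expression at or above a fixed value, using the calculation result of the target document set correlation analysis unit.
- The information analysis apparatus according to claim 1, wherein the related language expression candidate generation unit includes:
a target document set analysis unit that performs language analysis on part or all of the set of electronic documents containing the language expression; and
a relevance-suggesting language expression extraction unit that extracts, as related language expression candidates, language expressions for which relevance to the language expression is suggested, using the analysis result of the target document set analysis unit.
- The information analysis apparatus according to claim 1, wherein the related language expression candidate generation unit includes:
a target language expression analysis unit that performs language analysis on the language expression; and
an opposing language expression generation unit that generates, as related language expression candidates, language expressions that oppose the language expression, using the analysis result of the target language expression analysis unit.
- A search system comprising:
the information analysis apparatus according to any one of claims 1 to 9;
a related information containing document search unit that searches a plurality of search target documents for documents containing a related language expression, using as a search condition the related language expression that is output from the information analysis apparatus and has a high degree of relevance to the target language expression; and
a related document output unit that outputs the documents retrieved by the related information containing document search unit.
- An information analysis method comprising:
acquiring time-series data corresponding to an input language expression to be analyzed;
generating candidates for language expressions highly relevant to the language expression as related language expression candidates;
acquiring, for the generated related language expression candidates, time-series data corresponding to the related language expression candidates;
analyzing the temporal correlation between the acquired time-series data corresponding to the language expression and the acquired time-series data corresponding to the related language expression candidates; and
calculating, using the analysis result of the temporal correlation between the time-series data, the degree of relevance between the language expression and the related language expression candidates.
- The information analysis method according to claim 11, wherein generating the related language expression candidates includes:
selecting an extraction condition for documents in which the related language expression candidates are to be investigated, using the text content of electronic documents containing the language expression or meta information attached to documents containing the language expression;
acquiring a set of electronic documents satisfying the extraction condition; and
extracting characteristic language expressions, as related language expression candidates, from the acquired set of electronic documents.
- The information analysis method according to claim 12, wherein selecting the extraction condition is
selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document in the same or a similar field, or an electronic document on the same or a similar topic, with respect to part or all of the set of electronic documents containing the language expression.
- The information analysis method according to claim 12, wherein selecting the extraction condition is
selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document linked within a fixed number of hops from an electronic document containing the language expression.
- The information analysis method according to claim 12, wherein selecting the extraction condition is
selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document having a text similarity within a fixed value with respect to an electronic document containing the language expression.
- The information analysis method according to claim 12, wherein selecting the extraction condition is
selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be another electronic document that shares an author or sender with part or all of the set of electronic documents containing the language expression.
- The information analysis method according to claim 11, wherein generating the related language expression candidates includes:
determining language expressions that appear in correlation with the language expression in part or all of the set of electronic documents containing the language expression; and
extracting, as related language expression candidates, language expressions that appear in correlation with the language expression at or above a fixed value, using the calculation result for the language expressions that appear in correlation.
- The information analysis method according to claim 11, wherein generating the related language expression candidates includes:
performing language analysis on part or all of the set of electronic documents containing the language expression; and
extracting, as related language expression candidates, language expressions for which relevance to the language expression is suggested, using the analysis result of the language analysis.
- The information analysis method according to claim 12, wherein generating the related language expression candidates is
performing language analysis on the language expression, and
generating, as related language expression candidates, language expressions that oppose the language expression, using the analysis result of the language analysis.
- An information analysis program for causing a computer to execute:
a target language expression time-series data acquisition process of acquiring time-series data corresponding to an input language expression to be analyzed;
a related language expression candidate generation process of generating candidates for language expressions highly relevant to the language expression as related language expression candidates;
a related language expression candidate time-series data acquisition process of acquiring, for the generated related language expression candidates, time-series data corresponding to the related language expression candidates;
a time-series analysis process of analyzing the temporal correlation between the acquired time-series data corresponding to the language expression and the acquired time-series data corresponding to the related language expression candidates; and
a relevance calculation process of calculating, using the analysis result of the temporal correlation between the time-series data, the degree of relevance between the language expression and the related language expression candidates.
- The information analysis program according to claim 20, causing the computer to execute,
in the related language expression candidate generation process:
a process of selecting an extraction condition for documents in which the related language expression candidates are to be investigated, using the text content of electronic documents containing the language expression or meta information attached to documents containing the language expression;
a process of acquiring a set of electronic documents satisfying the extraction condition; and
a process of extracting characteristic language expressions, as related language expression candidates, from the acquired set of electronic documents.
- The information analysis program according to claim 21, causing the computer to execute,
in the survey target document condition selection process, a process of selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document in the same or a similar field, or an electronic document on the same or a similar topic, with respect to part or all of the set of electronic documents containing the language expression.
- The information analysis program according to claim 21, causing the computer to execute,
in the survey target document condition selection process, a process of selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document linked within a fixed number of hops from an electronic document containing the language expression.
- The information analysis program according to claim 21, causing the computer to execute,
in the survey target document condition selection process, a process of selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be an electronic document having a text similarity within a fixed value with respect to an electronic document containing the language expression.
- The information analysis program according to claim 21, causing the computer to execute,
in the survey target document condition selection process, a process of selecting, as the extraction condition for documents in which the related language expression candidates are to be investigated, that a document be another electronic document that shares an author or sender with part or all of the set of electronic documents containing the language expression.
- The information analysis program according to claim 20, causing the computer to execute,
in the related language expression candidate generation process:
a process of determining language expressions that appear in correlation with the language expression in part or all of the set of electronic documents containing the language expression; and
a process of extracting, as related language expression candidates, language expressions that appear in correlation with the language expression at or above a fixed value, using the calculation result for the language expressions that appear in correlation.
- The information analysis program according to claim 20, causing the computer to execute,
in the related language expression candidate generation process:
a process of performing language analysis on part or all of the set of electronic documents containing the language expression; and
a process of extracting, as related language expression candidates, language expressions for which relevance to the language expression is suggested, using the analysis result of the language analysis.
- The information analysis program according to claim 20, causing the computer to execute,
in the related language expression candidate generation process:
a process of performing language analysis on the language expression; and
a process of generating, as related language expression candidates, language expressions that oppose the language expression, using the analysis result of the language analysis.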
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009551600A JPWO2009096523A1 (ja) | 2008-01-30 | 2009-01-30 | Information analysis device, search system, information analysis method, and information analysis program |
US12/864,780 US20100318526A1 (en) | 2008-01-30 | 2009-01-30 | Information analysis device, search system, information analysis method, and information analysis program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008019014 | 2008-01-30 | ||
JP2008-019014 | 2008-01-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009096523A1 (ja) | 2009-08-06 |
Family
ID=40912866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/051581 WO2009096523A1 (ja) | 2008-01-30 | 2009-01-30 | Information analysis device, search system, information analysis method, and information analysis program |
Country Status (3)
Country | Link |
---|---|
US (1) | US20100318526A1 (ja) |
JP (1) | JPWO2009096523A1 (ja) |
WO (1) | WO2009096523A1 (ja) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013093010A (ja) * | 2011-10-05 | 2013-05-16 | Keiko Takeda | Content sales/sales system and method thereof |
JP2013114635A (ja) * | 2011-12-01 | 2013-06-10 | Hitachi Systems Ltd | Text data management method and text data management system |
JP2014002653A (ja) * | 2012-06-20 | 2014-01-09 | Ntt Docomo Inc | Device and program for identifying co-occurring words |
JP2016197332A (ja) * | 2015-04-03 | 2016-11-24 | NTT Communications Corporation | Information processing system, information processing method, and computer program |
JP2018092367A (ja) * | 2016-12-02 | 2018-06-14 | Japan Broadcasting Corporation (NHK) | Related word extraction device and program |
JP2019101841A (ja) * | 2017-12-05 | 2019-06-24 | Fujitsu Limited | Search processing program, search processing method, and search processing device |
US10970648B2 (en) | 2017-08-30 | 2021-04-06 | International Business Machines Corporation | Machine learning for time series using semantic and time series data |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
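JP5387577B2 (ja) * | 2008-09-25 | 2014-01-15 | NEC Corporation | Information analysis device, information analysis method, and program |
JP5884740B2 (ja) * | 2011-02-15 | 2016-03-15 | NEC Corporation | Time-series document summarization device, time-series document summarization method, and time-series document summarization program |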
US8880389B2 (en) * | 2011-12-09 | 2014-11-04 | Igor Iofinov | Computer implemented semantic search methodology, system and computer program product for determining information density in text |
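CN102710874A (zh) * | 2012-04-11 | 2012-10-03 | PCI-Suntek Technology Co., Ltd. | ACD queuing and routing method based on microblog call access |
CN104272301B (zh) * | 2012-04-25 | 2018-01-23 | International Business Machines Corporation | Method, computer-readable medium, and computer for extracting a portion of text |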
US8543576B1 (en) * | 2012-05-23 | 2013-09-24 | Google Inc. | Classification of clustered documents based on similarity scores |
US9355170B2 (en) | 2012-11-27 | 2016-05-31 | Hewlett Packard Enterprise Development Lp | Causal topic miner |
US10474747B2 (en) | 2013-12-16 | 2019-11-12 | International Business Machines Corporation | Adjusting time dependent terminology in a question and answer system |
CN106910501B (zh) * | 2017-02-27 | 2019-03-01 | Tencent Technology (Shenzhen) Co., Ltd. | Text entity extraction method and device |
RU2680765C1 (ru) * | 2017-12-22 | 2019-02-26 | ABBYY Production LLC | Automated detection and cropping of an ambiguous document contour in an image |
US11416507B2 (en) * | 2020-10-26 | 2022-08-16 | Sap Se | Integration of timeseries data and time dependent semantic data |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
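JP2007249584A (ja) * | 2006-03-15 | 2007-09-27 | Softec:Kk | Client database construction method, data search method, data search system, data search filtering system, client database construction program, data search program, data search filtering program, computer-readable recording medium storing the programs, and device on which they are recorded |
JP2007328714A (ja) * | 2006-06-09 | 2007-12-20 | Hitachi Ltd | Document search device and document search program |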
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0437615B1 (en) * | 1989-06-14 | 1998-10-21 | Hitachi, Ltd. | Hierarchical presearch-type document retrieval method, apparatus therefor, and magnetic disc device for this apparatus |
JP3525948B2 (ja) * | 1994-05-31 | 2004-05-10 | 富士通株式会社 | 情報検索装置 |
US5842206A (en) * | 1996-08-20 | 1998-11-24 | Iconovex Corporation | Computerized method and system for qualified searching of electronically stored documents |
US5941944A (en) * | 1997-03-03 | 1999-08-24 | Microsoft Corporation | Method for providing a substitute for a requested inaccessible object by identifying substantially similar objects using weights corresponding to object features |
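JP2001216311A (ja) * | 2000-02-01 | 2001-08-10 | Just Syst Corp | Event analysis device and program device storing an event analysis program |
JP2002351897A (ja) * | 2001-05-22 | 2002-12-06 | Fujitsu Ltd | Information usage frequency prediction program, information usage frequency prediction device, and information usage frequency prediction method |
JP2004326379A (ja) * | 2003-04-24 | 2004-11-18 | Hitachi Ltd | Asset management support system and asset management support method |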
US7505964B2 (en) * | 2003-09-12 | 2009-03-17 | Google Inc. | Methods and systems for improving a search ranking using related queries |
US20050160107A1 (en) * | 2003-12-29 | 2005-07-21 | Ping Liang | Advanced search, file system, and intelligent assistant agent |
JP2006031066A (ja) * | 2004-07-12 | 2006-02-02 | Ksl:Kk | Chronology search device, chronology search method, program, and recording medium |
US7577651B2 (en) * | 2005-04-28 | 2009-08-18 | Yahoo! Inc. | System and method for providing temporal search results in response to a search query |
US7577646B2 (en) * | 2005-05-02 | 2009-08-18 | Microsoft Corporation | Method for finding semantically related search engine queries |
US7403932B2 (en) * | 2005-07-01 | 2008-07-22 | The Boeing Company | Text differentiation methods, systems, and computer program products for content analysis |
WO2007043322A1 (ja) * | 2005-09-30 | 2007-04-19 | Nec Corporation | Trend evaluation device, and method and program therefor |
US7685091B2 (en) * | 2006-02-14 | 2010-03-23 | Accenture Global Services Gmbh | System and method for online information analysis |
EP2122506A4 (en) * | 2007-01-10 | 2011-11-30 | Sysomos Inc | METHOD AND SYSTEM FOR INFORMATION DISCOVERY AND TEXT ANALYSIS |
-
2009
- 2009-01-30 WO PCT/JP2009/051581 patent/WO2009096523A1/ja active Application Filing
- 2009-01-30 US US12/864,780 patent/US20100318526A1/en not_active Abandoned
- 2009-01-30 JP JP2009551600A patent/JPWO2009096523A1/ja active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
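JP2007249584A (ja) * | 2006-03-15 | 2007-09-27 | Softec:Kk | Client database construction method, data search method, data search system, data search filtering system, client database construction program, data search program, data search filtering program, computer-readable recording medium storing the programs, and device on which they are recorded |
JP2007328714A (ja) * | 2006-06-09 | 2007-12-20 | Hitachi Ltd | Document search device and document search program |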
Non-Patent Citations (3)
Title |
---|
"Dai 21 Kai Annual Conference of JSAI, 18 June, 2007 (18.06. 07)", vol. 3H9-3, article KEN'ICHI YAMAMOTO ET AL.: "Doko Joho no Kensaku ni yoru Joho Hensan", pages: 1 - 4 * |
NASUKAWA T. ET AL.: "Field o Hirogeru Shizen Gengo Shori", JOHO SHORI, vol. 40, no. 4, 15 April 1999 (1999-04-15), pages 358 - 364 * |
TOMOYUKI NANNO, ET AL.: "Automatic Collection and Monitoring of Japanese Weblogs", WWW2004 WORKSHOP ON THE WEBLOGGING ECOSYSTEM: AGGREGATION, ANALYSIS AND DYNAMICS,, 2004, Retrieved from the Internet <URL:http://www.blogpulse.com/papers/www2004nanno> * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013093010A (ja) * | 2011-10-05 | 2013-05-16 | Keiko Takeda | Content sales/sales system and method thereof |
JP2013114635A (ja) * | 2011-12-01 | 2013-06-10 | Hitachi Systems Ltd | Text data management method and text data management system |
JP2014002653A (ja) * | 2012-06-20 | 2014-01-09 | Ntt Docomo Inc | Device and program for identifying co-occurring words |
JP2016197332A (ja) * | 2015-04-03 | 2016-11-24 | NTT Communications Corporation | Information processing system, information processing method, and computer program |
JP2018092367A (ja) * | 2016-12-02 | 2018-06-14 | Japan Broadcasting Corporation (NHK) | Related word extraction device and program |
US10970648B2 (en) | 2017-08-30 | 2021-04-06 | International Business Machines Corporation | Machine learning for time series using semantic and time series data |
US11010689B2 (en) | 2017-08-30 | 2021-05-18 | International Business Machines Corporation | Machine learning for time series using semantic and time series data |
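JP2019101841A (ja) * | 2017-12-05 | 2019-06-24 | Fujitsu Limited | Search processing program, search processing method, and search processing device |
JP7059599B2 (ja) | 2017-12-05 | 2022-04-26 | Fujitsu Limited | Search processing program, search processing method, and search processing device |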
Also Published As
Publication number | Publication date |
---|---|
JPWO2009096523A1 (ja) | 2011-05-26 |
US20100318526A1 (en) | 2010-12-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2009096523A1 (ja) | Information analysis device, search system, information analysis method, and information analysis program | |
Anderka et al. | Predicting quality flaws in user-generated content: the case of wikipedia | |
US8630972B2 (en) | Providing context for web articles | |
Wanner et al. | State-of-the-Art Report of Visual Analysis for Event Detection in Text Data Streams. | |
US9720977B2 (en) | Weighting search criteria based on similarities to an ingested corpus in a question and answer (QA) system | |
US20090300046A1 (en) | Method and system for document classification based on document structure and written style | |
Spina et al. | Discovering filter keywords for company name disambiguation in twitter | |
US10810245B2 (en) | Hybrid method of building topic ontologies for publisher and marketer content and ad recommendations | |
Karkali et al. | Using temporal IDF for efficient novelty detection in text streams | |
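JP5136910B2 (ja) | Information analysis device, information analysis method, information analysis program, and search system | |
JP5387577B2 (ja) | Information analysis device, information analysis method, and program | |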
Jeaco | Key words when text forms the unit of study: Sizing up the effects of different measures | |
McEnery et al. | Building a written corpus: what are the basics? | |
Dąbrowski et al. | Mining and searching app reviews for requirements engineering: Evaluation and replication studies | |
Rawat et al. | Topic modelling of legal documents using NLP and bidirectional encoder representations from transformers | |
Chen et al. | Research on clustering analysis of Internet public opinion | |
Nazemi et al. | Comparison of full-text articles and abstracts for visual trend analytics through natural language processing | |
JP4539616B2 (ja) | Opinion collection and analysis device, opinion collection and analysis method used therein, and program therefor | |
Lukin et al. | Adjectives and adverbs as stylometric analysis parameters | |
JP2005196572A (ja) | Method for creating summaries of multiple documents | |
Van den Hoven et al. | Beyond reported history: Strikes that never happened | |
Szwoch et al. | Creation of Polish Online News Corpus for Political Polarization Studies | |
Lomotey et al. | Terms analytics service for CouchDB: a document-based NoSQL | |
Kejriwal et al. | Empirical best practices on using product-specific schema. org | |
Kowalski | Information system evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09705423; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2009551600; Country of ref document: JP |
| | WWE | WIPO information: entry into national phase | Ref document number: 12864780; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 09705423; Country of ref document: EP; Kind code of ref document: A1 |