WO2009110550A1 - Attribute extraction method, system, and program - Google Patents
Attribute extraction method, system, and program
- Publication number
- WO2009110550A1 (application PCT/JP2009/054170)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- attribute
- group
- image
- character string
- attribute name
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
Definitions
- the present invention relates to an attribute extraction method, system, and program.
- Attribute is a property or characteristic provided for things or matters, and here consists of attribute name and attribute value.
- A single thing or matter can have multiple attributes. For example, if the CPU of a personal computer is 1 GHz and the memory is 500 MB, the personal computer has two attributes: attribute name CPU / attribute value 1 GHz, and attribute name memory / attribute value 500 MB.
- Conventional attribute extraction systems of this type required time and effort, because a program matching the format of each document had to be created manually.
- The present invention has been made in view of the above problem, and its object is to provide an attribute extraction method, system, and program capable of extracting an attribute name and an attribute value based on the drawing position of a character string or image in a document.
- The present invention that solves the above problem is an attribute extraction method comprising: extracting, as an attribute group, a set of character strings or images whose drawing positions in a document are aligned in one direction; calculating an attribute name score indicating the degree to which each attribute group is a set of attribute names; selecting an attribute name group from the attribute groups based on the attribute name score; selecting an attribute group that contains a character string or image identical to at least one character string or image of the attribute name group, drawn at the same drawing position as that character string or image in the attribute name group; extracting an attribute name from the character string or image at that shared drawing position; and extracting the attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the one at the shared drawing position.
- The present invention that solves the above problem is also an attribute extraction system comprising: a document storage unit in which documents are stored; an attribute group extraction unit that extracts, as an attribute group, a set of character strings or images whose drawing positions in a document stored in the document storage unit are aligned in one direction; an attribute name group selection unit that calculates an attribute name score indicating the degree to which each attribute group is a set of attribute names and selects an attribute name group from the attribute groups based on the attribute name score; and an attribute extraction unit that selects an attribute group containing a character string or image identical to at least one character string or image of the attribute name group and drawn at the same drawing position, extracts an attribute name from the character string or image at that shared drawing position, and extracts the attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the one at the shared drawing position.
- The present invention that solves the above problem is also a program that causes an information processing apparatus to execute: an attribute group extraction process of extracting, as an attribute group, a set of character strings or images whose drawing positions in a document are aligned in one direction; an attribute name group selection process of calculating an attribute name score indicating the degree to which each attribute group is a set of attribute names and selecting an attribute name group from the attribute groups based on the attribute name score; an attribute name extraction process of selecting an attribute group that contains a character string or image identical to at least one character string or image of the attribute name group and drawn at the same drawing position, and extracting an attribute name from the character string or image at that shared drawing position; and an attribute value extraction process of extracting the attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the one at the shared drawing position.
- According to the present invention, an attribute name and an attribute value can be extracted based on the drawing position of a character string or image in a document.
- FIG. 1 is a configuration diagram of an attribute extraction system according to the first embodiment.
- FIG. 2 is a diagram showing an example of input words.
- FIG. 3 is a diagram showing an example of a document group stored in the document group storage unit 2.
- FIG. 4 is a diagram showing an example of a delimiter pattern stored in the delimiter pattern group storage unit 3.
- FIG. 5 is a diagram showing an example of registration in the attribute group group storage unit 4 by the attribute group extraction unit 6.
- FIG. 6 is a diagram showing an example of registering scores in the attribute group group storage unit 4 by the attribute group selection unit 7.
- FIG. 7 is a diagram for explaining the first embodiment.
- FIG. 8 is an operation flowchart according to the first embodiment.
- FIG. 9 is a configuration diagram of an attribute extraction system according to the second embodiment.
- FIG. 10 is a diagram illustrating a storage example of the attribute candidate group storage unit 10.
- FIG. 11 is a diagram showing an example of the co-occurrence frequency dictionary 11.
- FIG. 12 is an operation flowchart of the second embodiment.
- FIG. 13 is a configuration diagram of an attribute extraction system according to the third embodiment.
- FIG. 14 is a diagram illustrating a storage example of the second attribute group group storage unit 21.
- FIG. 15 is an operation flowchart of the third embodiment.
- FIG. 16 is a block diagram according to the fourth embodiment.
- FIG. 17 is a diagram showing a storage example of the output word storage unit 5.
- FIG. 18 is a diagram showing a storage example of the output word storage unit 5.
- FIG. 19 is a diagram showing a storage example of the output word storage unit 5.
- FIG. 1 is a configuration diagram of an attribute extraction system according to the first embodiment.
- The attribute extraction system includes an input word storage unit 1, a document group storage unit 2, a delimiter pattern group storage unit 3, an attribute group group storage unit 4, an output word storage unit 5, an attribute group extraction unit 6, an attribute group selection unit 7, and an attribute value selection unit 8.
- the input word storage unit 1 is a storage unit that stores a list of things and matters that the user wants to know the attributes of.
- The user registers in the input word storage unit 1 input words representing the things or matters whose attributes are wanted.
- An example of the input words is shown in FIG. 2. In the example of FIG. 2, “product A” to “product E” are registered as input words.
- the document group storage unit 2 stores documents for attribute extraction.
- An example of a document group stored in the document group storage unit 2 is shown in FIG. 3. In FIG. 3, a document ID identifying each stored document is stored in association with the document data of that document.
- the document to be stored is not limited to a structured document such as HTML, but may be a text separated by commas.
- the delimiter pattern group storage unit 3 stores a delimiter pattern group for delimiting into character strings or images (hereinafter referred to as blocks) constituting a structure such as a table or a list.
- A delimiter pattern is a pattern for extracting a block that is an element of a table or list structure. For example, the content of an HTML td tag or li tag is a block, and in a text document “,” or “:” serves as a delimiter.
- An example of a delimiter pattern stored in the delimiter pattern group storage unit 3 is shown in FIG. 4.
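As a minimal sketch of how delimiter patterns like these might be applied, assuming they are expressed as regular expressions: the pattern list and the function name `split_into_blocks` below are illustrative assumptions, not taken from the patent.

```python
import re

# Hypothetical delimiter patterns: each regex captures the content of one block.
DELIMITER_PATTERNS = [
    re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<li[^>]*>(.*?)</li>", re.IGNORECASE | re.DOTALL),
]


def split_into_blocks(document: str) -> list[str]:
    """Return the text of every block matched by any pattern, in document order."""
    hits = []
    for pattern in DELIMITER_PATTERNS:
        for match in pattern.finditer(document):
            hits.append((match.start(), match.group(1).strip()))
    hits.sort(key=lambda hit: hit[0])  # restore document order across patterns
    return [text for _, text in hits]


document_a = "<table><tr><td>CPU</td><td>1 GHz</td></tr></table>"
blocks = split_into_blocks(document_a)
print(blocks)  # ['CPU', '1 GHz']
```

A regex is only adequate for simple, well-formed markup; a real HTML parser would be the safer choice for arbitrary pages.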
- The attribute group extraction unit 6 delimits each document stored in the document group storage unit 2 using the delimiter patterns stored in the delimiter pattern group storage unit 3. Then, for each resulting block, it calculates the drawing position of the block, extracts each set of blocks whose drawing positions are aligned in the vertical or horizontal direction as an attribute group, and registers it in the attribute group group storage unit 4.
- A general rendering engine is used to calculate the drawing position of a block. For an HTML document, for example, a publicly available web browser can serve as the rendering engine.
- An example of registration in the attribute group group storage unit 4 by the attribute group extraction unit 6 is shown in FIG. 5.
- the document ID, the input word, the group ID, the word extracted from the block, the upper left position and the lower right position of the block, and the score are stored in association with each other. The score will be described later.
- Each column corresponds to a block in the attribute group.
- The document ID is the ID of the document in which the attribute group appears.
- The input word is the input word that appears before the position of the attribute group in the document.
- The group ID is the ID of the attribute group.
- The upper left position and the lower right position indicate the drawing position of the block when the document with that document ID is drawn. Words with the same group ID belong to the same attribute group.
- Hereinafter, a word extracted from a character string block or an image block is simply referred to as a word.
- To extract words from blocks other than character strings, for example image blocks, techniques such as character recognition on the image can be used.
- The attribute group selection unit 7 calculates, for each attribute group stored in the attribute group group storage unit 4, an attribute name score (hereinafter simply referred to as a score) indicating the degree to which the attribute group is a set of attribute names.
- a score is calculated from a statistic using the appearance frequency and appearance probability of a word, and an attribute group with a high possibility that an attribute name is described is selected.
- the average appearance probability of words in each block is measured and used as the attribute group score.
- An example of the registration of scores in the attribute group group storage unit 4 by the attribute group selection unit 7 is shown in FIG. 6.
- Here the word appearance probability is registered as a score for each word extracted from the blocks in the attribute group, but the average of the appearance probabilities of the words may instead be registered as the score of the attribute group as a whole.
- The attribute group selection unit 7 then selects, based on the score, an attribute name group that is likely to contain attribute names. Specifically, attribute group candidates whose score is equal to or higher than a certain threshold, or whose score ranks in the top several percent, are selected as attribute name groups.
- The attribute value selection unit 8 uses a word of the attribute name group selected by the attribute group selection unit 7 as an attribute name, and extracts the attribute value corresponding to that attribute name from an attribute group that has a block whose drawing position overlaps the block of that word. The attribute value is extracted from the words of that attribute group other than the word extracted as the attribute name.
- For example, suppose the drawing positions of the block of the word “CPU” with group ID “1” and the block of the word “CPU” with group ID “2” overlap. Then the word “CPU” is used as an attribute name, and the word “1 GHz”, from a block of group ID “2” other than the block of the word “CPU”, is extracted as the attribute value of the attribute name “CPU”.
- the attribute value is selected from the attribute group having the blocks whose drawing positions overlap each other. The attribute name and attribute value obtained here are stored in the output word storage unit 5 together with the input word.
- In this example the word “CPU” with group ID “1” and the word “CPU” with group ID “2” are identical, but they need not match exactly; differences that do not change the meaning are treated as the same. For example, when a character string consists of “CPU” wrapped in enclosing symbols such as quotation marks, the appropriate word for the attribute name is “CPU” and the enclosing symbols are unnecessary. Even if the word is extracted with the enclosing symbols from one attribute group and without them from another, the two words are regarded as the same, and the attribute name and attribute value are extracted.
- FIG. 8 is an operation flowchart in the present embodiment.
- the attribute group extraction unit 6 selects one document including the input word from the document group storage unit 2 (Step 100). For example, in the example illustrated in FIG. 3, when the input word “product A” is selected, a document “document A” including “product A” is selected.
- the attribute group extraction unit 6 acquires a delimiter pattern from the delimiter pattern group storage unit 3 and calculates the drawing position of the block delimited by the delimiter pattern (Step 101).
- A general rendering engine is used to calculate the drawing position of the block. For an HTML document, for example, a publicly available web browser can be used. In “Document A”, for example, the drawing position is calculated for each part that matches the delimiter pattern “<td>*</td>”, such as “<td>CPU</td>” and “<td>1 GHz</td>”.
- the drawing position is specified by upper left coordinates (X, Y) and lower right coordinates (X, Y) on the screen.
- the coordinates on the screen are expressed with the upper left corner of the screen as the origin, the horizontal direction as the X axis, and the vertical direction as the Y axis.
- The drawing position of “<td>CPU</td>” in “Document A” obtained by calculation is assumed to be upper left coordinates (10, 10) and lower right coordinates (40, 20).
- The drawing position of “<td>1 GHz</td>” is assumed to be upper left coordinates (40, 10) and lower right coordinates (80, 20).
- the attribute group extraction unit 6 extracts, as attribute groups, blocks that continue in the vertical direction or the horizontal direction from the blocks delimited by the delimiter pattern, and stores them in the attribute group group storage unit 4 (Step 102).
- The drawing positions of “<td>CPU</td>” and “<td>1 GHz</td>” have the same upper left Y coordinate and the same lower right Y coordinate, and the lower right X coordinate of “<td>CPU</td>” equals the upper left X coordinate of “<td>1 GHz</td>”, so the two blocks continue in the horizontal direction. Therefore, this set of blocks is an attribute group.
- For the horizontal direction, the Y coordinates do not have to match exactly; a margin of error may be allowed. A margin of error may also be allowed at the point where the blocks join.
- Attribute group candidates continuing in the vertical direction are extracted in the same way. Here too, the X coordinates need not match exactly, and a margin of error may be allowed.
- The document ID, group ID, word (the character string in the block excluding the delimiter, or the character string extracted from the block), upper left coordinates, and lower right coordinates are stored in the attribute group group storage unit 4. The same group ID is assigned to blocks of the same attribute group candidate. The score is empty at this point. The above is repeated until there are no more documents including the input word.
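The continuity test in Step 102 can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: the `Block` type, the function names, and the `tol` error margin parameter are hypothetical, and the coordinates for “Memory” and “500 MB” are invented to give the example a second row. Only the “CPU” and “1 GHz” coordinates come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    word: str
    left: int
    top: int
    right: int
    bottom: int


def continues_horizontally(a: Block, b: Block, tol: int = 0) -> bool:
    """b sits immediately to the right of a, on the same row (within tol)."""
    return (abs(a.top - b.top) <= tol and abs(a.bottom - b.bottom) <= tol
            and abs(a.right - b.left) <= tol)


def continues_vertically(a: Block, b: Block, tol: int = 0) -> bool:
    """b sits immediately below a, in the same column (within tol)."""
    return (abs(a.left - b.left) <= tol and abs(a.right - b.right) <= tol
            and abs(a.bottom - b.top) <= tol)


def extract_groups(blocks, continues, tol=0):
    """Chain blocks where each continues from the previous; keep chains of 2+."""
    has_predecessor = {b for a in blocks for b in blocks
                       if a is not b and continues(a, b, tol)}
    groups = []
    for start in blocks:
        if start in has_predecessor:
            continue  # only begin a chain at its first block
        chain = [start]
        while True:
            nxt = next((b for b in blocks
                        if b not in chain and continues(chain[-1], b, tol)), None)
            if nxt is None:
                break
            chain.append(nxt)
        if len(chain) > 1:
            groups.append([b.word for b in chain])
    return groups


blocks = [
    Block("CPU", 10, 10, 40, 20),      # coordinates from the text
    Block("1 GHz", 40, 10, 80, 20),    # coordinates from the text
    Block("Memory", 10, 20, 40, 30),   # assumed for illustration
    Block("500 MB", 40, 20, 80, 30),   # assumed for illustration
]
horizontal = extract_groups(blocks, continues_horizontally)
vertical = extract_groups(blocks, continues_vertically)
print(horizontal)  # [['CPU', '1 GHz'], ['Memory', '500 MB']]
print(vertical)    # [['CPU', 'Memory'], ['1 GHz', '500 MB']]
```

Passing a positive `tol` corresponds to allowing the margin of error mentioned above.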
- the attribute group selection unit 7 calculates the attribute group score and assigns it to the attribute group (Step 103).
- the calculated score is a score indicating that the attribute group candidate is a set of attribute names, and a larger value is more likely to be a set of attribute names.
- the attribute name is often used in multiple documents. Therefore, it is highly possible that an attribute group candidate that includes more frequently-occurring words is a set of attribute names.
- the attribute group score is the average of the appearance probabilities of each word of the attribute group candidate.
- Wj is the appearance frequency in the attribute group candidate group of the jth word of the attribute group candidate of group G
- N is the total appearance frequency of all words in the attribute group candidate group
- Pj is the appearance probability of the jth word
- J is the number of words in group G.
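The equation itself does not survive in this text. Going only by the symbol definitions above, the score of group G can be written as follows; this is a reconstruction under that assumption, not the patent's verbatim formula.

```latex
P_j = \frac{W_j}{N}, \qquad \mathrm{Score}(G) = \frac{1}{J} \sum_{j=1}^{J} P_j
```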
- the attribute group score of group ID “2” (hereinafter referred to as group 2) is an average of word appearance probabilities of “CPU” and “1 GHz”.
- W is the number of all words in the attribute group candidate group.
- W0 is the average value of word appearance frequencies. The score can also be calculated as the total or the average of word appearance frequencies. However, a raw word appearance frequency varies with the size of the document group, so it is better to use the appearance probability or the difference from the average.
- The attribute group selection unit 7 refers to the scores of the attribute groups and selects an attribute name group, that is, a set of attribute names (Step 104).
- An attribute group whose score is equal to or higher than a threshold set in advance in the system is selected as the attribute name group. For example, in the above example, when the threshold value is 0.03, group 1 is selected as the attribute name group.
- Alternatively, the groups ranked in the top several percent by score may be selected as attribute name groups. When the score reduces the bias caused by the size of the document group, as the average word appearance probability does, an absolute threshold on the score can be set. When the score is biased by the size of the document group, as a raw word appearance frequency is, an absolute threshold is hard to set, so selecting the top several percent of the score ranking is preferable. The two criteria may also be applied together.
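Steps 103 and 104 can be illustrated with a small sketch that scores each group by the average appearance probability of its words and then selects by threshold or by top fraction. The function names and the sample groups are assumptions made for this example; they do not reproduce the figures.

```python
from collections import Counter


def attribute_name_scores(groups: dict[int, list[str]]) -> dict[int, float]:
    """Score each group by the mean appearance probability of its words."""
    counts = Counter(word for words in groups.values() for word in words)
    total = sum(counts.values())  # total appearance frequency of all words
    return {
        gid: sum(counts[w] / total for w in words) / len(words)
        for gid, words in groups.items()
    }


def select_by_threshold(scores: dict[int, float], threshold: float) -> list[int]:
    """Keep groups whose score is at or above a preset threshold."""
    return [gid for gid, score in scores.items() if score >= threshold]


def select_top_fraction(scores: dict[int, float], fraction: float) -> list[int]:
    """Keep the top-ranked share of groups (at least one)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical attribute groups: group 1 is a column of attribute names.
groups = {
    1: ["CPU", "Memory"],
    2: ["CPU", "1 GHz"],
    3: ["Memory", "500 MB"],
    4: ["CPU", "2 GHz"],
    5: ["Memory", "1 GB"],
}
scores = attribute_name_scores(groups)
print(select_by_threshold(scores, 0.25))  # [1]
print(select_top_fraction(scores, 0.2))   # [1]
```

Words that recur across many groups, such as “CPU”, raise the score of the group that collects them, which is why the column of attribute names ranks first here.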
- the attribute value selection unit 8 extracts an attribute name, an attribute value, and an input word from the selected attribute name group and the attribute group that crosses this attribute name group (Step 105).
- A crossing attribute group is an attribute group that shares a block of the same word, at the same drawing position, with the selected attribute name group and extends in the perpendicular direction.
- The attribute value selection unit 8 extracts shared words as attribute names and unshared words as attribute values. The input word corresponding to the attribute group is also extracted.
- For example, attribute group 1 is an attribute group that continues in the vertical direction, and attribute group 2 is an attribute group candidate that continues in the horizontal direction.
- The two groups share the block of the word “CPU”.
- The word “CPU” shared by attribute group 1 and attribute group 2 is therefore extracted as the attribute name.
- Among the words of attribute group 2 (words extracted from its blocks), the word “1 GHz”, which is not shared with attribute group 1, is extracted as the attribute value. The input word “product A” of attribute group 1 is also selected. A word that is identical to the input word is not selected as an attribute value.
- When several candidate attribute values remain, either the word with the highest word appearance probability is selected or all of them are selected as attribute values.
- an attribute name, an attribute value, and an input word are selected from other attribute groups.
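A rough sketch of Step 105, assuming each block is represented as a tuple of word and coordinates and that "sharing a block" means an identical word at an identical drawing position. The function name `extract_attributes` and the “Memory” row are hypothetical.

```python
# A block is (word, left, top, right, bottom); this layout is an assumption.
Block = tuple[str, int, int, int, int]


def extract_attributes(name_group: list[Block],
                       other_groups: list[list[Block]],
                       input_word: str) -> list[tuple[str, str, str]]:
    """Return (input word, attribute name, attribute value) records."""
    name_blocks = set(name_group)
    records = []
    for group in other_groups:
        shared = [blk for blk in group if blk in name_blocks]
        if not shared:
            continue  # this group does not cross the attribute name group
        attribute_name = shared[0][0]
        for blk in group:
            word = blk[0]
            # Unshared words become values; the input word itself is skipped.
            if blk not in name_blocks and word != input_word:
                records.append((input_word, attribute_name, word))
    return records


group_1 = [("CPU", 10, 10, 40, 20), ("Memory", 10, 20, 40, 30)]   # vertical
group_2 = [("CPU", 10, 10, 40, 20), ("1 GHz", 40, 10, 80, 20)]    # horizontal
group_3 = [("Memory", 10, 20, 40, 30), ("500 MB", 40, 20, 80, 30)]

records = extract_attributes(group_1, [group_2, group_3], "product A")
print(records)
# [('product A', 'CPU', '1 GHz'), ('product A', 'Memory', '500 MB')]
```

Exact tuple equality stands in for the overlap test; a tolerant comparison of coordinates would be needed to honor the margin of error discussed earlier.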
- FIG. 17 shows an example of the output word storage unit 5.
- the input word, the attribute name, and the attribute value constitute one record, and the input word, the attribute name, and the attribute value are stored in association with each other.
- the input word “product A”, the attribute name “CPU”, and the attribute value “1 GHz” are stored in association with each other.
- In this embodiment, the attribute group extraction unit creates attribute groups based on the drawing positions of words, so no template needs to be prepared for each document format.
- In addition, the attribute group selection unit can recognize attribute names from statistical information on the words of the attribute groups.
- FIG. 9 shows a configuration diagram of the second embodiment.
- The second embodiment differs from the first embodiment in that an attribute candidate group storage unit 10, a co-occurrence frequency dictionary 11, and a co-occurrence frequency calculation unit 12 are added.
- The attribute candidate group storage unit 10 is a database that stores words that are candidates for attribute names (hereinafter referred to as attribute candidates).
- A storage example of the attribute candidate group storage unit 10 is shown in FIG. 10. Each record shown in FIG. 10 describes one attribute candidate.
- the co-occurrence frequency dictionary 11 is a database in which the co-occurrence frequencies of attribute group candidate words are accumulated.
- The dictionary accumulates the results calculated by the co-occurrence frequency calculation unit 12.
- An example of the co-occurrence frequency dictionary 11 is shown in FIG. 11. The example in FIG. 11 shows two co-occurring words, “word 1” and “word 2”, and their co-occurrence frequency, “frequency”.
- The co-occurrence frequency calculation unit 12 reads the attribute candidates, calculates the co-occurrence frequency between each attribute candidate and the words of the attribute groups, and stores the result in the co-occurrence frequency dictionary 11.
- word 1 is an attribute candidate and word 2 is a word other than the attribute candidate.
- The attribute value selection unit 8 refers to the co-occurrence frequencies. Among the words of the attribute group selected in the same manner as in the first embodiment, it treats as attribute names only the attribute candidate words and the words that have a high co-occurrence frequency with an attribute candidate word, and it selects attribute values only from attribute groups that cross the blocks of these attribute names.
- A word with a high co-occurrence frequency is a word whose frequency is equal to or higher than a lower limit, a word whose frequency ranks in the top several percent, or a word whose appearance probability is equal to or higher than a threshold.
- FIG. 12 is an operation flowchart of the second embodiment.
- the attribute group extraction unit 6 selects a document including an input word from a document group (Step 100).
- the attribute group extraction unit 6 acquires a delimiter pattern from the delimiter pattern group, and calculates a drawing position for each block delimited by the delimiter pattern (Step 101).
- The attribute group extraction unit 6 extracts, as attribute groups, sets of blocks that continue in the vertical or horizontal direction from the blocks delimited by the delimiter pattern, and stores them in the attribute group group storage unit 4 (Step 102).
- The co-occurrence frequency calculation unit 12 refers to the attribute candidate group storage unit 10 and the attribute group group storage unit 4. For each attribute group that contains an attribute candidate as one of its words (words extracted from blocks), it calculates the co-occurrence frequency between the attribute candidate and the other words of that group and stores it in the co-occurrence frequency dictionary 11 (Step 200).
- the attribute group including the attribute candidate “CPU” in FIG. 11 is attribute group 1 in FIG.
- The attribute group selection unit 7 calculates the score of each attribute group and assigns it to the attribute group (Step 103). Then, as in the first embodiment, the attribute group selection unit 7 refers to the scores and selects an attribute name group, that is, a set of attribute names (Step 104).
- The attribute value selection unit 8 extracts attribute names, attribute values, and input words from the attribute groups that cross the selected attribute name group. Of the extracted results, however, only those combinations of attribute name, attribute value, and input word whose attribute name is a word registered in the attribute candidate group or a word with a high frequency in the co-occurrence frequency dictionary 11 are selected and passed to the next step.
- a word with high frequency on the co-occurrence word frequency dictionary 11 is a word that is equal to or higher than the lower limit of the frequency, a word that is in the upper few percent of the frequency, or a word whose appearance probability is equal to or higher than a threshold value.
- a threshold such as a lower frequency limit is registered in the system in advance.
- Fi is the frequency of the word Ri in the co-occurrence word dictionary
- RN is the sum of all frequencies of the co-occurrence word dictionary.
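The appearance-probability criterion can be sketched as below, assuming the probability of word Ri is Fi divided by RN as the two definitions suggest. The dictionary rows, the threshold value, and the function names are invented for illustration.

```python
# Each dictionary row is (word 1, word 2, frequency), where word 1 is an
# attribute candidate. The rows below are made-up sample data.
CooccurrenceRow = tuple[str, str, int]


def appearance_probabilities(dictionary: list[CooccurrenceRow]) -> dict[str, float]:
    """Probability of each co-occurring word: its frequency Fi over the total RN."""
    total = sum(freq for _, _, freq in dictionary)  # RN
    by_word: dict[str, int] = {}
    for _, word2, freq in dictionary:
        by_word[word2] = by_word.get(word2, 0) + freq  # Fi
    return {word: freq / total for word, freq in by_word.items()}


def allowed_attribute_names(candidates: list[str],
                            dictionary: list[CooccurrenceRow],
                            threshold: float) -> set[str]:
    """Attribute candidates plus words that co-occur with them often enough."""
    probs = appearance_probabilities(dictionary)
    frequent = {word for word, p in probs.items() if p >= threshold}
    return set(candidates) | frequent


dictionary = [
    ("liquid crystal", "price", 8),
    ("liquid crystal", "color", 1),
    ("CPU", "clock", 1),
]
names = allowed_attribute_names(["liquid crystal", "CPU"], dictionary, threshold=0.5)
print(sorted(names))  # ['CPU', 'liquid crystal', 'price']
```

The threshold here plays the role of the value registered in the system in advance; the frequency lower limit and top-percent criteria would be alternative filters on the same dictionary.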
- FIG. 18 shows an example of the output word storage unit 5.
- an input word, attribute name, and attribute value constitute one record, and the input word, attribute name, and attribute value are stored in association with each other.
- the input word “product A”, the attribute name “CPU”, and the attribute value “1 GHz” are stored in association with each other.
- In addition, “price”, which has a high co-occurrence frequency with the attribute candidate “liquid crystal” shown in FIG. 10, is output and stored as an attribute name, and the attribute value “210,000 yen” of “price” is also output and stored.
- In the second embodiment, candidate attribute names are prepared in advance, the co-occurrence frequency calculation unit 12 calculates the co-occurrence frequency between the attribute name candidates and the words of the attribute groups, and the attribute value selection unit 8 evaluates that co-occurrence frequency when selecting attribute names.
- FIG. 13 is a configuration diagram of an attribute extraction system according to the third embodiment. Referring to FIG. 13, the input word respecifying unit 20 is added as compared with the first embodiment. Further, the attribute group group storage unit is changed to the second attribute group group storage unit 21.
- An example of storage in the second attribute group group storage unit 21 is shown in FIG. 14.
- Compared with the above-described embodiments, each attribute group candidate stored in the second attribute group group storage unit 21 has an additional re-input field indicating whether its words are to be handled again as input words.
- The input word re-specifying unit 20 identifies, among the attribute group candidates that include an input word, attribute groups that contain many words in the same category as the input word. Specifically, among the attribute groups produced by the attribute group extraction unit 6 that include an input word, it identifies those that include many input words and sets the re-input field of the corresponding second attribute group candidate to “YES”.
- An attribute group is regarded as including many input words when the ratio of the number of input words appearing in the attribute group to the number of input words appearing in the same document is larger than a threshold. Alternatively, a lower limit on the number of input words appearing in one attribute group may be used. Both conditions may also be required together.
- For example, suppose group ID 1 stored in the second attribute group group storage unit 21 shown in FIG. 14 contains three input words, “product A”, “product B”, and “product C”. If five input words appear in “Document A”, the ratio of the number of input words appearing in this attribute group is 3/5.
- If the condition is that the ratio of the number of input words is 60% or more and the number of input words is at least the lower limit of three, group ID 1 can be determined to be an attribute group containing words that can become re-input words. As a result, “YES” is assigned to the re-input field of the attribute group of group ID 1.
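The re-input decision in this example reduces to two comparisons, sketched below. The function name and parameter names are hypothetical; the 60% ratio and the lower limit of three are the values from the example above.

```python
def is_reinput_group(group_words: list[str],
                     input_words_in_document: list[str],
                     min_ratio: float = 0.6,
                     min_count: int = 3) -> bool:
    """Decide whether a group's re-input field should be set to YES."""
    document_inputs = set(input_words_in_document)
    if not document_inputs:
        return False
    hits = set(group_words) & document_inputs  # input words found in the group
    ratio = len(hits) / len(document_inputs)
    return ratio >= min_ratio and len(hits) >= min_count


group_1_words = ["product A", "product B", "product C"]
document_a_inputs = ["product A", "product B", "product C", "product D", "product E"]

print(is_reinput_group(group_1_words, document_a_inputs))  # True (3/5 = 60%, 3 hits)
print(is_reinput_group(["product A", "CPU"], document_a_inputs))  # False
```

Once a group is flagged, its remaining words are the ones treated as re-input words in later steps.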
- FIG. 15 is an operation flowchart of the third embodiment.
- The attribute group extraction unit 6 selects from the document group a document in which an input word appears, or in which a word of a record with “YES” in the re-input field among the attribute groups stored in the second attribute group group storage unit 21 appears (Step 300).
- the attribute group extraction unit 6 acquires a delimiter pattern from the delimiter pattern group, and calculates a drawing position for each block delimited by the delimiter pattern (Step 301).
- The attribute group extraction unit 6 extracts, as attribute group candidates, blocks that continue in the vertical or horizontal direction from the blocks delimited by the delimiter pattern, and stores them in the second attribute group group storage unit 21 (Step 102). The above is repeated until there are no more documents including the input word.
- The input word re-specifying unit 20 identifies attribute groups that include words in the same category as the input word, based on the appearance rate and appearance frequency of the input words, sets the re-input field of the second attribute group candidate to “YES”, and thereby specifies the re-input words (Step 301).
- the attribute name group selection unit 7 calculates an attribute group score and assigns it to the attribute group (Step 103).
- the attribute group selection unit 7 refers to the score of the attribute group and selects an attribute name group that is a set of attribute values (Step 104).
- the attribute value selection unit 8 extracts an attribute name, an attribute value, and an input word from the selected attribute name group and an attribute group that crosses this attribute name group (Step 105). However, a word with “YES” in the re-input field is treated as an input word, not an attribute value.
- FIG. 19 shows an example of the output word storage unit 5.
- the input word, attribute name, and attribute value constitute one record, and the input word, attribute name, and attribute value are stored in association with each other.
- the input word “product A”, the attribute name “CPU”, and the attribute value “1 GHz” are stored in association with each other.
- “product I”, which is a re-input word, an attribute name “CPU”, and an attribute value “2 GHz” are stored in association with each other.
- the re-input word specifying unit increases the number of words in the same category as the input word, so attribute names and attribute values can also be acquired for product names in the same category as the input word.
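The way an attribute name, an attribute value, and an input word end up in one output record can be sketched as follows. This is a minimal illustration, not the patented implementation: the coordinate values, the dictionary-of-cells representation, and the function name `extract_records` are assumptions, and the sketch only handles a vertical attribute name group crossed by a horizontal attribute group.

```python
def extract_records(name_group, crossing_group, word_group):
    """Build (input word, attribute name, attribute value) records.

    Each argument maps a drawing position (x, y) to the text drawn there:
    name_group     - the selected attribute name group (one column here),
    crossing_group - an attribute group (one row) that crosses it,
    word_group     - the row holding input words and re-input words.
    """
    # The cell shared by both groups, with identical text, is the attribute name.
    shared = [pos for pos, text in crossing_group.items()
              if name_group.get(pos) == text]
    if not shared:
        return []
    name_pos = shared[0]
    attribute_name = crossing_group[name_pos]

    # Words are matched to values through their common x coordinate.
    words_by_x = {x: text for (x, _y), text in word_group.items()}
    records = []
    for (x, y), value in sorted(crossing_group.items()):
        if (x, y) == name_pos:
            continue  # every other cell of the crossing group is a value
        word = words_by_x.get(x)
        if word is not None:
            records.append((word, attribute_name, value))
    return records


name_group = {(0, 20): "CPU", (0, 40): "Memory"}
crossing_group = {(0, 20): "CPU", (100, 20): "1 GHz", (200, 20): "2 GHz"}
word_group = {(100, 0): "product A", (200, 0): "product I"}
records = extract_records(name_group, crossing_group, word_group)
print(records)
```

The two resulting records correspond to the stored examples: “product A” with “CPU” and “1 GHz”, and the re-input word “product I” with “CPU” and “2 GHz”. The “Memory” cell is a made-up second attribute name added only to make the name group a set.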
- FIG. 16 is a block diagram according to the fourth embodiment.
- the fourth embodiment includes an attribute extraction system 1000 according to the present invention, a dictionary service system 2000 that operates and manages the attribute extraction system 1000, and an attribute dictionary database list 3000 created by the attribute extraction system 1000.
- the dictionary creator creates an attribute dictionary using the attribute extraction system 1000 and the dictionary service system 2000 managed by the system operator, and registers it in the attribute dictionary database list 3000.
- the attribute dictionary purchaser searches the attribute dictionary database list 3000 for the desired attribute dictionary and, if it exists, purchases it from the dictionary creator via the system operator.
- when a sale is made, the system operator receives a payment-handling fee and an attribute extraction system usage fee from the dictionary creator.
- the input word storage unit 1, the attribute group extraction unit 6, the attribute group selection unit 7, the attribute value selection unit 8, and the like are configured by hardware, but they can also be implemented by a CPU or the like that operates under a program.
- a set of character strings or images whose drawing positions in a document are aligned in one direction is extracted as an attribute group; an attribute name score indicating the degree to which the attribute group is a set of attribute names is calculated; an attribute name group is selected from the attribute groups based on the attribute name score; an attribute group is selected that includes a character string or image identical to at least one character string or image of the attribute name group, with the identical character string or image at the same drawing position as in the attribute name group; an attribute name is extracted from the character string or image at that same drawing position; and an attribute value corresponding to the attribute name is extracted from the character strings or images of the selected attribute group other than the one at that same drawing position. This is the attribute extraction method.
- an input word related to an object whose attribute is desired is registered, and a document including the input word is extracted from a document group.
- the document is divided into character strings or images based on a predetermined rule, the drawing position of each character string or image is calculated, and a set of character strings or images whose drawing positions are aligned in one direction is defined as an attribute group.
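Grouping blocks by aligned drawing positions can be sketched as below. The sketch makes several assumptions that are not in the source: each block carries a single (x, y) drawing position, "aligned" means an exactly equal coordinate, contiguity of blocks is not checked, and the names `Block` and `extract_attribute_groups` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    text: str
    x: int  # horizontal drawing position
    y: int  # vertical drawing position


def extract_attribute_groups(blocks, min_size=2):
    """Return (direction, texts) pairs for blocks aligned in one direction."""
    by_x = defaultdict(list)  # same x: blocks stacked vertically
    by_y = defaultdict(list)  # same y: blocks laid out horizontally
    for block in blocks:
        by_x[block.x].append(block)
        by_y[block.y].append(block)

    groups = []
    for _x, column in sorted(by_x.items()):
        if len(column) >= min_size:
            ordered = sorted(column, key=lambda b: b.y)
            groups.append(("vertical", [b.text for b in ordered]))
    for _y, row in sorted(by_y.items()):
        if len(row) >= min_size:
            ordered = sorted(row, key=lambda b: b.x)
            groups.append(("horizontal", [b.text for b in ordered]))
    return groups


blocks = [
    Block("Item", 0, 0), Block("product A", 100, 0),
    Block("CPU", 0, 20), Block("1 GHz", 100, 20),
]
groups = extract_attribute_groups(blocks)
for direction, texts in groups:
    print(direction, texts)
```

A real layout would probably need a tolerance when comparing coordinates; exact equality is used here only to keep the example short.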
- the attribute group that is selected includes a character string or image identical to at least one character string or image of the attribute name group, has that identical character string or image at the same drawing position as in the attribute name group, and has its character strings or images drawn in a direction perpendicular to the direction in which the character strings or images of the attribute name group are drawn.
- an attribute group having the attribute name score larger than a predetermined threshold is selected as the attribute name group.
- the attribute name score is an average of the appearance probabilities of each character string or image in the attribute group.
- a co-occurrence probability is calculated between an attribute name candidate, which is a candidate for an attribute name, and a character string or image of an attribute group that includes the attribute name candidate; an attribute name is selected from those character strings or images of the attribute name group that are the attribute name candidate or that were selected based on the co-occurrence probability with the attribute name candidate; and the attribute value is extracted from the character strings or images of an attribute group that includes the character string or image containing the attribute name at the same drawing position as in the attribute name group.
- a second input word that can serve as an input word is extracted from the character strings or images of an attribute group that includes the input word, and documents containing the second input word are extracted.
- the attribute extraction system has: a document storage unit storing documents; an attribute group extraction unit that extracts, as an attribute group, a set of character strings or images whose drawing positions in a document stored in the document storage unit are aligned in one direction; an attribute name group selection unit that calculates an attribute name score indicating the degree to which the attribute group is a set of attribute names and, based on the attribute name score, selects an attribute name group from the attribute group candidates; and an attribute extraction unit that selects an attribute group including a character string or image identical to at least one character string or image of the attribute name group at the same drawing position as in the attribute name group, extracts an attribute name from the character string or image at that same drawing position, and extracts an attribute value corresponding to the attribute name from the other character strings or images of the selected attribute group.
- the system has an input word storage unit in which an input word related to an object whose attribute is desired is stored, and the attribute group extraction unit targets documents including the input word.
- the attribute group extraction unit divides a document into character strings or images based on a predetermined rule, calculates the drawing position of each character string or image, and defines a set of character strings or images whose drawing positions are aligned in one direction as an attribute group.
- the attribute extraction unit selects an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, has that identical character string or image at the same drawing position as in the attribute name group, and has its character strings or images drawn in a direction perpendicular to the drawing position direction of the attribute name group.
- the attribute name group selection unit selects an attribute group having an attribute name score larger than a predetermined threshold as an attribute name group.
- the attribute name group selection unit calculates an average of appearance probabilities of words included in a character string or an image of the attribute group as an attribute name score.
- the system has an attribute candidate storage unit storing attribute candidates, which are candidate words for attribute names, and a co-occurrence probability calculation unit that calculates the co-occurrence probability between an attribute name candidate and a character string or image of an attribute group including the attribute name candidate; the attribute extraction unit selects an attribute name from those character strings or images of the attribute name group that are the attribute name candidate or that were selected based on the co-occurrence probability with the attribute name candidate, and extracts the attribute value from the character strings or images of an attribute group that includes the character string or image containing the attribute name at the same drawing position as in the attribute name group.
- an input word extraction unit for extracting a second input word that can be an input word from a character string or an image of an attribute group that includes the input word in the character string or the image.
- the program causes an information processing apparatus to execute: an attribute group extraction process of extracting, as an attribute group, a set of character strings or images whose drawing positions in a document are aligned in one direction; an attribute name group selection process of calculating an attribute name score indicating the degree to which the attribute group is a set of attribute names and selecting an attribute name group from the attribute groups based on the attribute name score; an attribute name extraction process of selecting an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group at the same drawing position as in the attribute name group, and extracting an attribute name from the character string or image at that same drawing position; and an attribute value extraction process of extracting an attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the one at that same drawing position.
- an input word related to an object whose attribute is desired is registered, and a document extraction process for extracting a document including the input word from the document group is executed in the information processing apparatus.
- the attribute name group selection process includes: a process of dividing a document into character strings or images based on a predetermined rule; a process of calculating the drawing position of each character string or image; and a process of extracting, as an attribute group, a set of character strings or images whose drawing positions are aligned in one direction.
- the attribute name extraction process selects an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, has that identical character string or image at the same drawing position as in the attribute name group, and has its character strings or images drawn in a direction perpendicular to the drawing position direction of the character strings or images of the attribute name group.
- the attribute name group selection process selects an attribute group having an attribute name score greater than a predetermined threshold as an attribute name group.
- the attribute name score is an average of the appearance probabilities of each character string or image of the attribute group.
- the attribute name extraction process calculates a co-occurrence probability between an attribute name candidate, which is a candidate for an attribute name, and a character string or image of an attribute group that includes the attribute name candidate, and selects an attribute name from those character strings or images of the attribute name group that are the attribute name candidate or that were selected based on the co-occurrence probability with the attribute name candidate; the attribute value extraction process extracts attribute values from the character strings or images of an attribute group that includes the character string or image containing the attribute name at the same drawing position as in the attribute name group.
- a second input word that can serve as an input word is extracted from the character strings or images of an attribute group that includes the input word, and the information processing apparatus is caused to execute a process of extracting documents containing the second input word.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
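2 Document group storage unit
3 Delimiter pattern group storage unit
4 Attribute group group storage unit
5 Output word storage unit
6 Attribute group extraction unit
7 Attribute group selection unit
8 Attribute value selection unit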
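The first embodiment will be described with reference to the drawings.
Pj = Wj/N
Here, Wj is the frequency with which the j-th word of the attribute group candidate of group G appears in the attribute group candidate set, N is the total appearance frequency of all words in the attribute group candidate set, Pj is the appearance probability of the j-th word, and J is the number of words in group G.
Score(group 1) = 1/3 * (5/300 + 5/300 + 20/300)
= 10/300 = 0.033
The score can be calculated in this way. Note that the score field of the attribute group candidate group storage unit 4 stores the appearance probability of each word.
Score(group 2) = 1/2 * (5/300 + 3/300) = 4/300 = 0.013
The score can be calculated in this way.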
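The two worked scores can be checked with a short sketch. It assumes that the score of a group is the mean of the appearance probabilities Pj = Wj/N of its J words, which is how the claims define the attribute name score; the function names and the threshold value 0.02 are hypothetical and only serve the illustration.

```python
from fractions import Fraction


def attribute_name_score(word_frequencies, total_frequency):
    """Mean appearance probability (1/J) * sum(Wj / N) of one attribute group."""
    if not word_frequencies:
        raise ValueError("an attribute group needs at least one word")
    probabilities = [Fraction(w, total_frequency) for w in word_frequencies]
    return sum(probabilities) / len(probabilities)


def select_attribute_name_groups(scores, threshold):
    """Keep the groups whose score is larger than the threshold."""
    return [name for name, score in scores.items() if score > threshold]


scores = {
    "group 1": attribute_name_score([5, 5, 20], 300),  # 10/300
    "group 2": attribute_name_score([5, 3], 300),      # 4/300
}
for name, score in scores.items():
    print(name, score, round(float(score), 3))

# 0.02 is an arbitrary threshold chosen only for this illustration.
selected = select_attribute_name_groups(scores, Fraction(2, 100))
print(selected)
```

Exact fractions are used so that the intermediate values 10/300 and 4/300 match the worked example before rounding to 0.033 and 0.013.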
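W0 = N/W
Here, W is the total number of words in the attribute group candidate set, and W0 is the average word appearance frequency. The score can also be calculated as the sum or the average of the word appearance frequencies. However, a raw word appearance frequency varies with the size of the document group, so it is better to use the appearance probability or the difference from the average.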
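The second embodiment will be described with reference to the drawings.
Appearance probability (Ri) = Fi/RN
Here, Fi is the frequency of the word Ri in the co-occurrence word dictionary, and RN is the sum of all frequencies in the co-occurrence word dictionary.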
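As a small illustration of this formula, the appearance probability of each word can be derived from a frequency table. The dictionary contents below are invented for the example and the function name is hypothetical; only the relation Fi/RN comes from the text.

```python
from fractions import Fraction


def appearance_probabilities(cooccurrence_frequencies):
    """Map each word Ri to Fi / RN, where RN is the sum of all frequencies."""
    rn = sum(cooccurrence_frequencies.values())
    if rn == 0:
        return {}
    return {word: Fraction(frequency, rn)
            for word, frequency in cooccurrence_frequencies.items()}


# Hypothetical co-occurrence word dictionary: word -> frequency Fi.
dictionary = {"CPU": 6, "memory": 3, "price": 1}
probabilities = appearance_probabilities(dictionary)
for word, probability in probabilities.items():
    print(word, probability)
```

Because every frequency is divided by the same total RN, the probabilities sum to one, which makes them comparable across dictionaries of different sizes.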
The third embodiment will be described with reference to the drawings.
The fourth embodiment will be described with reference to the drawings.
Claims (24)
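- An attribute extraction method comprising:
extracting, as an attribute group, a set of character strings or images whose drawing positions in a document are aligned in one direction;
calculating an attribute name score indicating the degree to which the attribute group is a set of attribute names, and selecting an attribute name group from among the attribute groups based on the attribute name score;
selecting an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, the drawing position of the identical character string or image being identical to the drawing position of that character string or image in the attribute name group;
extracting an attribute name from the character string or image at the identical drawing position; and
extracting an attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the character string or image at the identical drawing position.
- The attribute extraction method according to claim 1, wherein an input word relating to a thing whose attributes are to be learned is registered, and a document including the input word is extracted from a document group.
- The attribute extraction method according to claim 1 or claim 2, wherein
a document is divided into character strings or images based on a predetermined rule,
the drawing position of each character string or image is calculated, and
a set of character strings or images whose drawing positions are aligned in one direction is taken as an attribute group.
- The attribute extraction method according to any one of claims 1 to 3, wherein the attribute group that is selected includes a character string or image identical to at least one character string or image of the attribute name group, has the identical character string or image at a drawing position identical to that of the character string or image in the attribute name group, and has its character strings or images drawn in a direction perpendicular to the drawing position direction of the character strings or images of the attribute name group.
- The attribute extraction method according to any one of claims 1 to 4, wherein an attribute group whose attribute name score is larger than a predetermined threshold is selected as the attribute name group.
- The attribute extraction method according to any one of claims 1 to 5, wherein the attribute name score is the average of the appearance probabilities of the character strings or images of the attribute group.
- The attribute extraction method according to any one of claims 1 to 6, wherein
a co-occurrence probability is calculated between an attribute name candidate, which is a candidate for an attribute name, and a character string or image of an attribute group including the attribute name candidate, and
an attribute name is selected from among the character strings or images of the attribute name group that are the attribute name candidate or that were selected based on the co-occurrence probability with the attribute name candidate, and an attribute value is extracted from the character strings or images of an attribute group that has a character string or image including the attribute name, the drawing position of that character string or image being identical to its drawing position in the attribute name group.
- The attribute extraction method according to any one of claims 2 to 7, wherein a second input word that can serve as an input word is extracted from the character strings or images of an attribute group whose character strings or images include the input word, and a document including the second input word is extracted.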
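- An attribute extraction system comprising:
a document storage unit in which documents are stored;
an attribute group extraction unit that selects a set of character strings or images whose drawing positions in a document stored in the document storage unit are aligned in one direction, and extracts it as an attribute group;
an attribute name group selection unit that calculates an attribute name score indicating the degree to which the attribute group is a set of attribute names, and selects an attribute name group from among the attribute group candidates based on the attribute name score; and
an attribute extraction unit that selects an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, the drawing position of the identical character string or image being identical to the drawing position of that character string or image in the attribute name group, extracts an attribute name from the character string or image at the identical drawing position, and extracts an attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the character string or image at the identical drawing position.
- The attribute extraction system according to claim 9, further comprising an input word storage unit in which an input word relating to a thing whose attributes are to be learned is stored,
wherein the attribute group extraction unit targets documents including the input word.
- The attribute extraction system according to claim 9 or claim 10, wherein the attribute group extraction unit divides a document into character strings or images based on a predetermined rule, calculates the drawing position of each character string or image, and takes a set of character strings or images whose drawing positions are aligned in one direction as an attribute group.
- The attribute extraction system according to any one of claims 9 to 11, wherein the attribute extraction unit selects an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, has the identical character string or image at a drawing position identical to that of the character string or image in the attribute name group, and has its character strings or images drawn in a direction perpendicular to the drawing position direction of the character strings or images of the attribute name group.
- The attribute extraction system according to any one of claims 9 to 12, wherein the attribute name group selection unit selects, as the attribute name group, an attribute group whose attribute name score is larger than a predetermined threshold.
- The attribute extraction system according to any one of claims 9 to 13, wherein the attribute name group selection unit calculates, as the attribute name score, the average of the appearance probabilities of the words included in the character strings or images of the attribute group.
- The attribute extraction system according to any one of claims 9 to 14, further comprising:
an attribute candidate storage unit in which attribute candidates, which are candidate words for attribute names, are stored; and
a co-occurrence probability calculation unit that calculates a co-occurrence probability between an attribute name candidate, which is a candidate for an attribute name, and a character string or image of an attribute group including the attribute name candidate,
wherein the attribute extraction unit selects an attribute name from among the character strings or images of the attribute name group that are the attribute name candidate or that were selected based on the co-occurrence probability with the attribute name candidate, and extracts an attribute value from the character strings or images of an attribute group that has a character string or image including the attribute name, the drawing position of that character string or image being identical to its drawing position in the attribute name group.
- The attribute extraction system according to any one of claims 10 to 15, further comprising an input word extraction unit that extracts a second input word that can serve as an input word from the character strings or images of an attribute group whose character strings or images include the input word.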
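- A program causing an information processing apparatus to execute:
an attribute group extraction process of extracting, as an attribute group, a set of character strings or images whose drawing positions in a document are aligned in one direction;
an attribute name group selection process of calculating an attribute name score indicating the degree to which the attribute group is a set of attribute names, and selecting an attribute name group from among the attribute groups based on the attribute name score;
an attribute name extraction process of selecting an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, the drawing position of the identical character string or image being identical to the drawing position of that character string or image in the attribute name group, and extracting an attribute name from the character string or image at the identical drawing position; and
an attribute value extraction process of extracting an attribute value corresponding to the attribute name from the character strings or images of the selected attribute group other than the character string or image at the identical drawing position.
- The program according to claim 17, causing the information processing apparatus to execute a document extraction process of registering an input word relating to a thing whose attributes are to be learned and extracting a document including the input word from a document group.
- The program according to claim 17 or claim 18, wherein the attribute name group selection process includes:
a process of dividing a document into character strings or images based on a predetermined rule;
a process of calculating the drawing position of each character string or image; and
a process of extracting, as an attribute group, a set of character strings or images whose drawing positions are aligned in one direction.
- The program according to any one of claims 17 to 19, wherein the attribute name extraction process selects an attribute group that includes a character string or image identical to at least one character string or image of the attribute name group, has the identical character string or image at a drawing position identical to that of the character string or image in the attribute name group, and has its character strings or images drawn in a direction perpendicular to the drawing position direction of the character strings or images of the attribute name group.
- The program according to any one of claims 17 to 20, wherein the attribute name group selection process selects, as the attribute name group, an attribute group whose attribute name score is larger than a predetermined threshold.
- The program according to any one of claims 17 to 21, wherein the attribute name score is the average of the appearance probabilities of the character strings or images of the attribute group.
- The program according to any one of claims 17 to 22, wherein
the attribute name extraction process calculates a co-occurrence probability between an attribute name candidate, which is a candidate for an attribute name, and a character string or image of an attribute group including the attribute name candidate, and selects an attribute name from among the character strings or images of the attribute name group that are the attribute name candidate or that were selected based on the co-occurrence probability with the attribute name candidate, and
the attribute value extraction process extracts an attribute value from the character strings or images of an attribute group that has a character string or image including the attribute name, the drawing position of that character string or image being identical to its drawing position in the attribute name group.
- The program according to any one of claims 18 to 23, causing the information processing apparatus to execute a process of extracting a second input word that can serve as an input word from the character strings or images of an attribute group whose character strings or images include the input word, and extracting a document including the second input word.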
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010501954A JP5445787B2 (ja) | 2008-03-06 | 2009-03-05 | 属性抽出方法、システム及びプログラム |
US12/866,215 US8463738B2 (en) | 2008-03-06 | 2009-03-05 | Attribute extraction method, system, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-055789 | 2008-03-06 | ||
JP2008055789 | 2008-03-06 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009110550A1 true WO2009110550A1 (ja) | 2009-09-11 |
Family
ID=41056102
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/054170 WO2009110550A1 (ja) | 2008-03-06 | 2009-03-05 | 属性抽出方法、システム及びプログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US8463738B2 (ja) |
JP (1) | JP5445787B2 (ja) |
WO (1) | WO2009110550A1 (ja) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012105898A1 (en) * | 2011-02-03 | 2012-08-09 | Shl Group Ab | Medicament delivery device |
JP2013517561A (ja) * | 2010-01-13 | 2013-05-16 | アリババ・グループ・ホールディング・リミテッド | 標準製品ユニットのための属性集約 |
WO2016075833A1 (ja) * | 2014-11-14 | 2016-05-19 | 富士通株式会社 | データ取得プログラム、データ取得方法及びデータ取得装置 |
JP2017059124A (ja) * | 2015-09-18 | 2017-03-23 | 富士フイルム株式会社 | 画像抽出システム,画像抽出方法,画像抽出プログラムおよびそのプログラムを格納した記録媒体 |
US10839146B2 (en) | 2015-03-02 | 2020-11-17 | Canon Kabushiki Kaisha | Information processing system, information processing apparatus, control method, and storage medium |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102193934B (zh) * | 2010-03-11 | 2013-05-29 | 株式会社理光 | 用于寻找图像集合中的代表性图像的***和方法 |
WO2015129044A1 (ja) * | 2014-02-28 | 2015-09-03 | 楽天株式会社 | 情報処理システム、情報処理方法、および情報処理プログラム |
JP6123764B2 (ja) * | 2014-09-11 | 2017-05-10 | トヨタ自動車株式会社 | 電源システム |
TWI571753B (zh) * | 2014-11-07 | 2017-02-21 | 財團法人資訊工業策進會 | 用於產生一影像之一互動索引碼圖之電子計算裝置、其方法及其電腦程式產品 |
US11010768B2 (en) | 2015-04-30 | 2021-05-18 | Oracle International Corporation | Character-based attribute value extraction system |
CN108885617B (zh) * | 2016-03-23 | 2022-05-31 | 株式会社野村综合研究所 | 语句解析***以及程序 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH02116970A (ja) * | 1988-10-27 | 1990-05-01 | Fujitsu Ltd | 表内データ自動抽出処理方式 |
JPH04319770A (ja) * | 1991-04-18 | 1992-11-10 | Fuji Xerox Co Ltd | 電子ファイリング装置 |
JPH11259524A (ja) * | 1998-03-06 | 1999-09-24 | Omron Corp | 情報検索システム、情報検索システムにおける情報処理方法および記録媒体 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5666549A (en) * | 1992-03-10 | 1997-09-09 | Hitachi, Ltd. | Method and system for processing a document transmitted via facsimile in an initially input form stored in a knowledge base |
JPH11232378A (ja) * | 1997-12-09 | 1999-08-27 | Canon Inc | デジタルカメラ、そのデジタルカメラを用いた文書処理システム、コンピュータ可読の記憶媒体、及び、プログラムコード送出装置 |
JP4856925B2 (ja) * | 2005-10-07 | 2012-01-18 | 株式会社リコー | 画像処理装置、画像処理方法及び画像処理プログラム |
-
2009
- 2009-03-05 JP JP2010501954A patent/JP5445787B2/ja active Active
- 2009-03-05 WO PCT/JP2009/054170 patent/WO2009110550A1/ja active Application Filing
- 2009-03-05 US US12/866,215 patent/US8463738B2/en active Active
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013517561A (ja) * | 2010-01-13 | 2013-05-16 | アリババ・グループ・ホールディング・リミテッド | 標準製品ユニットのための属性集約 |
WO2012105898A1 (en) * | 2011-02-03 | 2012-08-09 | Shl Group Ab | Medicament delivery device |
AU2012212720B2 (en) * | 2011-02-03 | 2015-01-22 | Shl Medical Ag | Medicament delivery device |
WO2016075833A1 (ja) * | 2014-11-14 | 2016-05-19 | 富士通株式会社 | データ取得プログラム、データ取得方法及びデータ取得装置 |
JPWO2016075833A1 (ja) * | 2014-11-14 | 2017-09-28 | 富士通株式会社 | データ取得プログラム、データ取得方法及びデータ取得装置 |
US10839146B2 (en) | 2015-03-02 | 2020-11-17 | Canon Kabushiki Kaisha | Information processing system, information processing apparatus, control method, and storage medium |
JP2017059124A (ja) * | 2015-09-18 | 2017-03-23 | 富士フイルム株式会社 | 画像抽出システム,画像抽出方法,画像抽出プログラムおよびそのプログラムを格納した記録媒体 |
Also Published As
Publication number | Publication date |
---|---|
US8463738B2 (en) | 2013-06-11 |
JPWO2009110550A1 (ja) | 2011-07-14 |
US20100318525A1 (en) | 2010-12-16 |
JP5445787B2 (ja) | 2014-03-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09716541 Country of ref document: EP Kind code of ref document: A1 |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 12866215 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010501954 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09716541 Country of ref document: EP Kind code of ref document: A1 |