WO2013179348A1 - Index generation program and search program - Google Patents
- Publication number
- WO2013179348A1 (application PCT/JP2012/003592)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- file
- search
- block
- information
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/185—Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
Definitions
- the present invention relates to a document data search technique.
- index information is a bit string in which bits are allocated per document element (a unit such as a chapter, section, or term), each bit indicating whether character information exists in that document element of a file (for example, Patent Document 1).
- HTML (Hyper Text Markup Language)
- Document data described in HTML is divided into document elements constituting the document by tag information in the document data. For example, data from a start tag to an end tag for one tag is one document element. For a document element, data from another start tag to an end tag included in the document element is a child element of the document element described above. As described above, the hierarchical relationship between the document elements is shown according to the inclusion relationship of the range indicated by the set of the start tag and the end tag.
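The inclusion relationship between start tags and end tags described above can be illustrated with a minimal sketch. It uses the standard `html.parser` module and assumes well-formed input in which every start tag has a matching end tag; the class name `ElementDepths` and the sample markup are hypothetical and do not come from the source.

```python
from html.parser import HTMLParser


class ElementDepths(HTMLParser):
    """Records (depth, tag) for every element, in document order.

    Depth 0 is a top-level element; an element found between the start
    and end tag of another element is recorded one level deeper, which
    makes it a child element of the enclosing element.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def element_hierarchy(html_text):
    parser = ElementDepths()
    parser.feed(html_text)
    parser.close()
    return parser.elements


sample = "<div><section><p>a</p><p>b</p></section><section><p>c</p></section></div>"
hierarchy = element_hierarchy(sample)
```

Unpaired tags such as a bare `<br>` would skew the depth counter, so real HTML would need extra handling that this sketch leaves out.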
- the blocks obtained by the division do not necessarily have the same data size.
- the number of types of character information included in each block tends to be different. For example, in a chaptered academic book, if a certain chapter is long, the number of types of character information may increase only in the block corresponding to that chapter. In such a case, in the index information, a particular block stands out and the presence of many types of character information is indicated.
- the compressed index information is index information in which the presence/absence information for a plurality of pieces of character information is superimposed. That is, in the compressed index information, each block is associated with information indicating whether at least one of a plurality of pieces of character information is included. Because the presence/absence information for a plurality of pieces of character information is superimposed, the data size of the index information itself is suppressed.
- a file may be divided at a position other than a boundary between document elements or subordinate document elements (parts, chapters, sections, items, and so on).
- a file corresponding to Chapter 1 is divided into a first block including a part of the first section and the second section, and a second block including a part of the second section and the third section.
- the terms included in the same section often include related contents.
- character information included in terms characteristic to each item in the second section may exist in both the first block and the second block.
- character information included in terms characteristic of the second section may exist only in the second block (when the characteristic terms exist in neither the first section nor the third section).
- An object of one aspect of the present disclosure is to suppress narrowing noise in narrowing down the target of character string search for document data.
- the generation program causes a computer to execute a process of: switching, depending on whether a document element having a predetermined number or more of child elements exists in a document file, whether the control of which of a plurality of blocks the data in the document file is included in is performed for each document element in the hierarchy of the child elements, or for each document element in the hierarchy of that document element or of an element higher than that document element; dividing the document file into the plurality of blocks under the control selected by the switching; and generating, for each block obtained by the division, index information indicating whether the block includes predetermined character information.
- a generation method is used in which a computer executes the same switching, division, and index information generation.
- the generation apparatus includes a division unit that performs the same switching and, under the control selected by the switching, divides the document file into the plurality of blocks, and a generation unit that generates, for each block obtained by the division, index information indicating whether the block includes predetermined character information.
- when the search program accepts a search character string, it causes a computer to execute a process of: referring to index information, based on character information included in the search character string; specifying the blocks that the index information indicates as including the character information; and performing a character string search with the search character string on the specified blocks. The index information associates each block with whether the block includes the character information. The blocks are obtained by a division in which the control of which of a plurality of blocks the data in a document file is included in is switched, depending on whether a document element having a predetermined number or more of child elements exists in the document file, between being performed for each document element in the hierarchy of the child elements and being performed for each document element in the hierarchy of that document element or of an element higher than that document element.
- a search method is used in which a computer, on receiving a search character string, executes the same reference to the index information, specification of blocks, and character string search.
- the search device includes a reception unit that receives a search character string; a storage unit that stores the index information described above; a narrowing unit that, based on character information included in the search character string received by the reception unit, refers to the index information stored in the storage unit and specifies the blocks that the index information indicates as including the character information; and a search unit that performs a character string search with the search character string on the specified blocks.
- FIGS. 1A and 1B show an example of index information and an example of a bit string generated based on the index information.
- FIG. 2A shows an example of a hierarchical structure of document data.
- FIG. 2B shows an example of a hierarchical structure of document data.
- FIG. 3 shows an example of functional blocks of the computer 1.
- FIG. 4 shows an example of functional blocks of the generation unit 13.
- FIG. 5 shows the correspondence between block numbers and block reading positions.
- FIG. 6 shows an example of functional blocks of the narrowing-down unit 15.
- FIG. 7 shows an example of the hardware configuration of the computer 1.
- FIG. 8 shows a configuration example of software that runs on the computer 1.
- FIG. 9 shows an example of an index generation processing procedure.
- FIG. 10A shows a processing procedure example of the document structure analysis processing.
- FIG. 10B shows a processing procedure example of the document structure analysis processing.
- FIG. 11 shows an example of a document structure table.
- FIG. 12A shows an example of a processing procedure for file division processing.
- FIG. 12B shows an example of a processing procedure for file division processing.
- FIG. 13 shows a processing procedure example of the full-text search process.
- FIG. 14 shows a processing procedure for index reference processing.
- FIG. 15 shows an example of a table for storing search results.
- FIG. 1A shows index information I1 based on search target file groups F1 to Fn.
- the file number shown at the top of the index information I1 is a number corresponding to each of the search target file groups F1 to Fn.
- each of the character information groups C1 to Cm is associated with a bit string indicating its presence / absence in each of the files F1 to Fn.
- the character information Cj included in the character information groups C1 to Cm is, for example, a character string composed of one character or a combination of a plurality of characters. Alternatively, the character information Cj may be a part of a binary code corresponding to the character information.
- the character information groups C1 to Cm may be all combinations of characters that are assumed to be used (for example, characters to which a JIS code is assigned). For example, it is assumed that a file Fi (file number is i) in the file group F1 to Fn is a file including a character string “life is a tragedy when viewed in close-up and a comedy when viewed in a long shot”.
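- in the original Japanese sentence, the file Fi is a file including single-character information (each character of the sentence, from its first character through the last character of “comedy”), and it is also a file including two-character information (each pair of adjacent characters, from the first pair, “life”, through the last pair, “comedy”).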
- the case where each of the character information groups C1 to Cm is character information of two characters is exemplified.
- whether the character information Cj is included in the file groups F1 to Fn is indicated by storing, for each number i from 1 to n, information about whether the character information Cj is included in the file Fi in the storage area corresponding to the character information Cj and the file Fi.
- the storage location of the presence / absence information regarding whether the file Fi includes the character information Cj is identified by the address Pj, obtained by substituting the binary code corresponding to the character information Cj into a hash function, and by the file number i.
- the binary code corresponding to the character information is, for example, 0x346E3760 (0x means hexadecimal notation) if it is a binary code (character code based on JIS) corresponding to the character information “comedy”.
- the presence / absence information of the character information Cj is indicated by a bit having a value of “1” if the character information Cj exists in the file Fi. If the character information Cj does not exist, it is indicated by a bit having a value of “0”.
- a plurality of pieces of character information (for example, character information Cj and character information Ck) may correspond to the same address.
- in that case, the presence / absence information is indicated by a bit having a value of “1” if at least one of the character information Cj and the character information Ck exists in the file Fi, and by a bit having a value of “0” if neither the character information Cj nor the character information Ck exists in the file Fi.
- presence / absence may be indicated by a plurality of bits.
- the fact that character information is included is indicated by a bit having a value of “1”.
- since the file Fi includes character information other than “comedy”, such as “life”, the bits at the positions corresponding to that character information also have a value of “1”, not only the bit for “comedy”.
- the bit at the position corresponding to the character information included in each file has a value of “1”.
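The index layout described above (an address computed from the character information by a hash function, and one bit per file number) might be sketched as follows. This is only an illustration under stated assumptions: `zlib.crc32` over UTF-8 bytes stands in for the unspecified hash function over JIS binary codes, `NUM_ADDRESSES` is an arbitrary table size, and a dictionary of integer bitmasks stands in for the bit matrix. None of these choices come from the source.

```python
import zlib

NUM_ADDRESSES = 1 << 16  # assumed table size; the source does not state one


def bigrams(text):
    """Two-character character information contained in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def address(char_info):
    """Stand-in for the hash function that maps character information to an address."""
    return zlib.crc32(char_info.encode("utf-8")) % NUM_ADDRESSES


def build_index(files):
    """Return {address: bitmask}; bit i is 1 when file i contains some
    character information whose address is that key."""
    index = {}
    for i, text in enumerate(files):
        for cj in bigrams(text):
            pj = address(cj)
            index[pj] = index.get(pj, 0) | (1 << i)
    return index


files = ["life is a comedy", "a tragedy in close-up"]
index = build_index(files)
```

A bit of “1” only says that some character information with that address occurs in the file, so a set bit is a necessary condition for a match and not a sufficient one.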
- the search target file is narrowed down using the index information I1 shown in FIG. 1A.
- the search character string “comedy king” includes character information “comedy” and character information “drama king”.
- the files to be searched for the character string are determined from the bit string indicated by the address (Pj in FIG. 1A) calculated based on “comedy” and the bit string indicated by the address (Pk in FIG. 1A) calculated based on “Drama King”.
- for example, a bit string A1 that is the logical product (AND) of the bit string corresponding to the address Pj and the bit string corresponding to the address Pk is as shown in FIG. 1B.
- the file corresponding to the bit that is “1” becomes the character string search target file.
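The narrowing step above amounts to a bitwise AND over the bit strings read for each piece of character information. A minimal sketch follows; the bit strings `bits_pj` and `bits_pk` are made-up values for four files and are not taken from FIG. 1A or FIG. 1B.

```python
def narrow(bit_strings, num_files):
    """AND the bit strings and return the file numbers whose bit stays 1."""
    result = (1 << num_files) - 1  # start with every file as a candidate
    for bits in bit_strings:
        result &= bits
    return [i for i in range(num_files) if (result >> i) & 1]


# Hypothetical bit strings; bit i (counted from the least significant bit)
# belongs to file number i.
bits_pj = 0b1011  # files 0, 1 and 3 have a "1" at address Pj
bits_pk = 0b0110  # files 1 and 2 have a "1" at address Pk
candidates = narrow([bits_pj, bits_pk], 4)
```

Only file 1 has a “1” in both bit strings, so it is the only file handed to the character string search.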
- suppose that a plurality of pieces of character information, for example “See” and “Drama King”, correspond to the address Pk.
- the file Fi does not include “Drama King” but includes “See”. Therefore, the bit of the file Fi in the bit string at the address Pk, which corresponds to both “See” and “Drama King”, is also “1”.
- when the search target files are narrowed down with the index information I1 using the character information “comedy” and “Drama King”, the file Fi is judged to include both “comedy” and “Drama King” even though “Drama King” is not included in the file Fi, and so it becomes a search target file.
- the file Fi includes a character string “Life is a tragedy when seen in close-up, but a comedy in long-shot.”. Then, for example, in the index information, the bit at the position indicated by the address Pj calculated based on the character information “come” and by the file number i indicates “1”. Likewise, the bit at the position indicated by the address Pk calculated based on the character information “medy” and by the file number i indicates “1”.
- if the search character string is “comedian”, for example, it is assumed that the search target files are narrowed down to files including both “come” and “dian” based on the index information. In this case, if the address calculated based on the character information “dian” happens to be the same as the address Pk calculated based on the character information “medy”, the file Fi becomes a search target file for “comedian” even though it does not include “dian”.
- noise may thus be generated in file narrowing. This is because the pointers for character information not included in the file Fi (such as “Drama King” and “dian”) overlap with the pointers for character information included in the file Fi (such as “see” and “medy”). Since the bit is set to “1” by the presence of the character information included in the file Fi (“see”, “medy”, etc.), the index information cannot show that the character information not included in the file Fi (“Drama King”, “dian”, etc.) is absent. Conversely, if the file includes none of the overlapping pieces of character information, the bit is “0”, so the index information makes clear that none of them exists in the file.
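The narrowing noise described above can be reproduced with a toy example. To keep it deterministic, the sketch fixes the addresses by hand instead of hashing: the `ADDRESS` table, in which “dian” and “medy” share address 1, is a hypothetical assignment chosen to force the overlap, and the two sample files are made up.

```python
# Hypothetical address assignment in which "dian" and "medy" collide.
ADDRESS = {"come": 0, "medy": 1, "dian": 1}

files = ["a comedy in long-shot", "a tragedy in close-up"]


def build_bits(texts):
    """Return {address: bitmask}; bit i is 1 if file i contains any
    character information assigned to that address."""
    bits = {}
    for i, text in enumerate(texts):
        for gram, addr in ADDRESS.items():
            if gram in text:
                bits[addr] = bits.get(addr, 0) | (1 << i)
    return bits


bits = build_bits(files)


def candidates(query_grams):
    """Narrow by ANDing the bit strings of every piece of character information."""
    mask = (1 << len(files)) - 1
    for gram in query_grams:
        mask &= bits.get(ADDRESS[gram], 0)
    return [i for i in range(len(files)) if (mask >> i) & 1]


narrowed = candidates(["come", "dian"])                       # narrowing for "comedian"
actual = [i for i in narrowed if "comedian" in files[i]]      # real string search
```

File 0 survives the narrowing because “medy” set the bit that “dian” also maps to, yet the character string search finds no “comedian” in it. The file was read for nothing, which is the cost the text calls narrowing noise.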
- narrowing noise is more likely to occur in a file where the pointers of character information included in the file and the pointers of character information not included in the file tend to overlap.
- files such as indexes and tables of contents are more likely to contain more types of characters than files in the main part, so even among files of the same e-book the types of character information included differ from file to file.
- even in the main files there is a difference in the type of character information included in the file between a file with a large data size and a file with a small data size.
- the index information of the file groups F1 to Fn is a sparse matrix as a whole
- narrowing noise due to overlapping pointers between character information is likely to occur in a file containing many types of character information.
- an example of a file including many character types is a file having a file size larger than other files.
- in the index information, information regarding whether character information is included may be associated with each block obtained by dividing a file, rather than with each file. This suppresses the amount of data that is read needlessly, because of narrowing noise, when the character string search is performed.
- the document structure may vary greatly depending on the document data.
- a dictionary or the like has a document structure in which document elements of a specific hierarchy (for example, document elements corresponding to clauses and terms) are listed.
- each document element has an independent meaning and content, and for example, there are many cases where adjacent document elements do not include a common term (a lot of non-common terms are included).
- academic books and the like have a document structure in which document elements have a hierarchical relationship, and common terms are easily used among child elements having a common parent element.
- novels tend to have a small number of document elements in only one layer. In the novel, common terms are easily used throughout the main story.
- a dictionary or the like tends to include a list of specific document elements.
- An enumeration of document elements is often used when information about independent and distinct events is expressed in some common format.
- a word corresponds to each item, and each item to be listed is expressed in a common format that is a word and information (meaning, usage, etc.) regarding the word.
- child elements whose parent element is a word group whose first character is “A” are “Ashika” and “Ashigarayama”.
- index information is generated by associating information regarding whether or not character information is included for each block obtained by dividing a file.
- FIG. 2A shows an example of a hierarchical structure of document data described in a markup language such as HTML (Hyper Text Markup Language).
- the division in units of child elements identified by the <h2> tag may be performed without attempting the division in units of parent elements identified by the <h1> tag.
- as in (A) of FIG. 2A, the file may be divided into block AA-1 and block AA-2.
- suppose that block division yields a block containing both the part where musical features are described (child element 1-2), which lies within the part where the features of the films are described (parent element 1), and the part where the life is described (parent element 2).
- that block is likely to include both words characteristic of the parent element 1, such as “appearance”, “style”, “story”, and words representing thought, and words characteristic of the parent element 2, such as “marriage” and “migrating”. For example, if the target of character string search is narrowed down based on the index information corresponding to blocks divided in this way, both the block of the parent element 1 and the block containing the parent element 2 are selected for the search character string “style”.
- if the file is instead divided at the boundary between the parent elements, the parent element 1 block may not contain words characteristic of the parent element 2 such as “marriage” and “migrant”, and the parent element 2 block may not contain words characteristic of the parent element 1 such as “appearance”, “style”, and “story”. If the parent element 2 block contains no word characteristic of the parent element 1, narrowing down the character string search target with a search character string such as “style” does not select the parent element 2 block.
- FIG. 2B shows an example of a hierarchical structure of document data.
- Each of (A), (B), and (C) shown in FIG. 2B shows an example of block division of a file.
- the block BA-1 and the block BA-2 are obtained by division at the element of the hierarchy corresponding to the <h1> tag.
- division is performed by elements of the hierarchy corresponding to the <h3> tag, and a block BB-1 and a block BB-2 are obtained.
- the block BC-1 and the block BC-2 are obtained by division at the element of the hierarchy corresponding to the <h3> tag.
- when the file is divided as in division example (C), the block BC-1 also becomes a target of character string search when the search character string includes a term that is characteristic of the first of the elements identified by the <h1> tag.
- when blocks are divided in generating the index information, dividing document data that has such a document structure in units of the upper hierarchy, so that its parts are not absorbed into a block having a large data size, contributes to efficient narrowing down of the target of character string search. That is, by controlling the priority of the determination criterion for determining the block division position according to the document structure, the noise of file narrowing by the generated index information is suppressed.
- the hierarchical structure of the document data shown in FIG. 2A is used as an example in which a predetermined number or more child elements exist in one element.
- the number of words whose initial is “shi” is 15921 and the number of words whose initial is “ka” is 13895.
- the number of words with the initial “Nu” is 662
- the number of words with the initial “RU” is 444
- the number of words starting with “O” is 6, and the number of words starting with “N” is 8.
- Excluding “O” and “N” there are 444 or more child elements with each initial as a parent element.
- the predetermined number used to determine the document structure may be “10” or “100”, for example. If the predetermined number is “10”, when there are 10 or more child elements in one element, control is performed so that block division is performed in the hierarchy of the child elements.
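The switching rule above can be sketched as a function that picks the hierarchy in which block division is performed. This is one possible reading under stated assumptions and not the patented procedure itself: the `Element` class, the function names, and the choice of "top-level elements" as the upper hierarchy are all hypothetical, and `THRESHOLD` uses the example value of 10 from the text.

```python
from dataclasses import dataclass, field

THRESHOLD = 10  # the "predetermined number"; 10 and 100 are the examples given


@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)


def iter_elements(element):
    """Yield the element and all of its descendants in document order."""
    yield element
    for child in element.children:
        yield from iter_elements(child)


def division_units(root, threshold=THRESHOLD):
    """Return the elements whose boundaries serve as block boundaries."""
    enumerating = [e for e in iter_elements(root) if len(e.children) >= threshold]
    if enumerating:
        # An element lists many children: divide in the hierarchy of the child elements.
        return [child for e in enumerating for child in e.children]
    # No such element: divide in the upper hierarchy (top-level elements in this sketch).
    return list(root.children) or [root]


# Dictionary-like file: one initial with 12 entries.
dictionary = Element("dict", [Element("A", [Element(f"word{i}") for i in range(12)])])
# Book-like file: two chapters with three sections each.
book = Element("book", [Element(f"ch{c}", [Element(f"sec{s}") for s in range(3)])
                        for c in range(2)])

dict_units = [e.name for e in division_units(dictionary)]
book_units = [e.name for e in division_units(book)]
```

The dictionary-like file is divided per entry, while the book-like file is kept together per chapter. Nested enumerations would be counted at several levels by this sketch, which a real implementation would need to resolve.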
- index information indicating the presence / absence of each character information is used for the search for each record unit or page unit.
- Index information indicating the presence / absence of each character information may be used for the search for a block unit divided so as to include a plurality of records or a plurality of pages instead of a record unit or a page unit.
- the database is also characterized by a hierarchical structure, similar to electronic books.
- records that are records of each event are added, so that data is listed in units of records.
- information required for recording each event is different.
- as an example of a customer information database, there is a database in which information corresponding to items such as ID, company name, department, person in charge, address, and telephone number is stored for each customer.
- a database has a format in which each record as customer information is listed, and has a hierarchical structure similar to a dictionary in an electronic dictionary.
- administration history information is stored for each administration.
- a record including information such as administration time, administered drug, investigator's condition (body temperature, etc.), side effect symptoms, etc. is generated.
- the investigator may provide an item for storing information indicating the investigator's condition or an item for storing information on side effects.
- since the hierarchy of the data structure is determined according to the characteristics of the event in this way, the database has a hierarchical structure similar to that of academic books in the electronic dictionary.
- the data may be a small amount of data if no side effects occur, but the amount of data increases if side effects occur.
- the hierarchical structure of the database is also different. Therefore, similarly to the electronic book, by performing block division according to the characteristics of the hierarchical structure, generation of noise for narrowing down the character string search target is suppressed.
- FIG. 3 shows an example of functional blocks of the computer 1 in the first embodiment.
- the computer 1 includes a processing unit 11 and a storage unit 12.
- the processing unit 11 generates index information and performs a search using the generated index information.
- the storage unit 12 stores information used for processing by the processing unit 11 (for example, file groups F1 to Fn to be searched and index information).
- the processing unit 11 includes a generation unit 13.
- the generation unit 13 generates index information and stores it in the storage unit 12.
- FIG. 4 shows an example of functional blocks of the generation unit 13.
- the generation unit 13 includes a control unit 131, a reading unit 132, an analysis unit 133, and a determination unit 134.
- the control unit 131 sequentially designates the file F1 to the file Fn, and causes the reading unit 132, the analysis unit 133, and the determination unit 134 to execute the respective processes for the designated file.
- the reading unit 132 reads, from the storage unit 12, the file Fi designated by the control unit 131 among the file groups F1 to Fn.
- the analysis unit 133 analyzes the document structure in the file for each file read by the reading unit 132.
- the control unit 131 divides the file based on the analysis result of the analysis unit 133.
- for each block obtained by the division by the control unit 131 (the file itself if it is not divided), the determination unit 134 determines whether the block includes each character information Cj in the set character information groups C1 to Cm.
- the control unit 131 calculates an address based on the character information Cj and the number i of the block Bi, and stores, in the storage location indicated by the calculated address, information indicating that the character information Cj is included.
- FIG. 5 shows an example of the table T1, which stores the correspondence between block numbers and block reading positions.
- the control unit 131 assigns a number to each block obtained by the division, and stores the block read position and the block number in association with each other in the table T1.
- Information in the table T1 is referred to by a character string search unit 16 described later.
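A table like T1 can be pictured as a mapping from block number to read position. The sketch below assumes that blocks are stored back to back in one byte sequence and that a block ends where the next one starts. The source states neither point, and the function names and sample data are hypothetical.

```python
def build_table(blocks):
    """Return {block number: read position}, numbering blocks from 1."""
    table = {}
    position = 0
    for number, block in enumerate(blocks, start=1):
        table[number] = position
        position += len(block)
    return table


def read_block(data, table, number):
    """Read one block: from its read position up to the next block's position."""
    start = table[number]
    end = table.get(number + 1, len(data))
    return data[start:end]


blocks = [b"section one ", b"section two ", b"section three"]
data = b"".join(blocks)
t1 = build_table(blocks)
second = read_block(data, t1, 2)
```

With such a table, only the blocks that survive narrowing need to be read, each directly from its read position.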
- the processing unit 11 further includes a search control unit 14, a narrowing unit 15, and a character string search unit 16.
- the search control unit 14 performs search processing according to the search request by controlling the narrowing unit 15 and the character string search unit 16.
- the narrowing-down unit 15 narrows down search target files using the index information generated by the generation unit 13.
- the search control unit 14 extracts the character information Ca from the search character string included in the received search request, and notifies the narrowing unit 15 of the extracted character information Ca.
- the narrowing-down unit 15 notifies the search control unit 14 of the block numbers of the blocks among the block groups B1 to Bp, excluding the blocks that do not include the character information Ca notified by the search control unit 14.
- the character string search unit 16 reads the block data from the read position stored in the table T1 for the blocks narrowed down by the narrowing unit 15, and performs a character string search based on the search request received by the search control unit 14.
- FIG. 6 shows an example of functional blocks of the narrowing-down unit 15.
- the narrowing-down unit 15 includes a reference unit 151 and a determination unit 152.
- the reference unit 151 reads a portion corresponding to the character information Ca notified from the search control unit 14 among the index information stored in the storage unit 12.
- An address indicating a portion corresponding to the character information Ca is calculated according to the character information Ca.
- the reference unit 151 calculates an address based on the character information Ca, and reads a bit string corresponding to the address.
- the determination unit 152 determines the blocks that do not include the character information Ca based on the bit string read by the reference unit 151, and excludes them from the block groups B1 to Bp. The character string search unit 16 is notified of the remaining blocks.
- the search control unit 14 may extract a plurality of character information (for example, character information Ca and character information Cb) from the search character string. Then, the reference unit 151 reads the bit string in the index information for each of the plurality of character information Ca and Cb. Further, the determination unit 152 calculates a logical product (AND) of the presence / absence information included in the bit string corresponding to the character information Ca and the presence / absence information included in the bit string corresponding to the character information Cb, and determines the presence / absence of the character information Ca and Cb in each file based on the calculation result. The file number of a file determined to lack either the character information Ca or the character information Cb is not notified to the character string search unit 16.
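Putting the steps together, the flow of extracting character information, narrowing by logical product, and then running the character string search only on the remaining blocks might look like the sketch below. It is an illustration under assumptions: two-character character information, `zlib.crc32` as a stand-in hash function, an arbitrary `TABLE_SIZE`, blocks numbered from 0, and made-up block contents.

```python
import zlib

TABLE_SIZE = 1 << 12  # assumed number of addresses


def grams(text, n=2):
    """Character information extracted from a text: every run of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def address(gram):
    """Stand-in hash function from character information to an address."""
    return zlib.crc32(gram.encode("utf-8")) % TABLE_SIZE


def build_index(blocks):
    """Return {address: bitmask}; bit b is 1 if block b may contain the gram."""
    index = {}
    for number, block in enumerate(blocks):
        for g in set(grams(block)):
            a = address(g)
            index[a] = index.get(a, 0) | (1 << number)
    return index


def search(query, blocks, index):
    """Narrow with the index (AND of bit strings), then search the survivors."""
    mask = (1 << len(blocks)) - 1
    for g in grams(query):
        mask &= index.get(address(g), 0)
    narrowed = [b for b in range(len(blocks)) if (mask >> b) & 1]
    return [b for b in narrowed if query in blocks[b]]


blocks = ["his films mix style and story", "marriage and later life", "musical style"]
index = build_index(blocks)
hits = search("style", blocks, index)
```

Because the final character string search checks the actual block contents, address collisions can only add blocks to `narrowed`; they never change the reported hits.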
- FIG. 7 shows a hardware configuration example of the computer 1.
- the computer 1 includes, for example, a processor 301, a RAM (Random Access Memory) 302, a ROM (Read Only Memory) 303, a drive device 304, a storage medium 305, an input interface (I / F) 306, an input device 307, an output interface (I / F) 308, output device 309, communication interface (I / F) 310, and the like.
- Each piece of hardware is connected via a bus 311.
- a communication I / F 310 controls communication via the network 4.
- the input interface 306 is connected to the input device 307 and transmits an input signal received from the input device 307 to the processor 301.
- the output interface 308 is connected to the output device 309 and causes the output device 309 to execute output in accordance with an instruction from the processor 301.
- the RAM 302 is a readable / writable memory device. For example, a semiconductor memory such as SRAM (Static RAM) or DRAM (Dynamic RAM) is used, or a flash memory may be used instead of a RAM.
- the ROM 303 includes a PROM (Programmable ROM).
- the drive device 304 is a device that performs at least one of reading and writing of information recorded in the storage medium 305.
- the storage medium 305 stores information written by the drive device 304.
- the storage medium 305 is, for example, a storage medium such as a hard disk, a CD (Compact Disc), a DVD (Digital Versatile Disc), or a Blu-ray disc.
- the computer 1 includes a drive device 304 and a storage medium 305 for each of a plurality of types of storage media.
- the input device 307 is a device that transmits an input signal according to an operation.
- the input device 307 is, for example, a key device such as a keyboard or a button attached to the main body of the computer 1, or a pointing device such as a mouse or a touch panel.
- the output device 309 is a device that outputs information according to the control of the computer 1.
- the output device 309 is, for example, an image output device (display device) such as a display, or an audio output device such as a speaker.
- an input / output device such as a touch screen is used as the input device 307 and the output device 309.
- information stored in the storage medium 305 may be stored in the storage device 3 controlled by the computer 2 connected via the network 4.
- the processor 301 acquires the information stored in the storage device 3 via the communication interface 310, so that the reading unit 132, the character string search unit 16 and the like read the block.
- the processor 301 reads a program stored in the ROM 303 or the storage medium 305 to the RAM 302, and performs processing of the processing unit 11 according to the read program procedure. At that time, the RAM 302 is used as a work area of the processor 301.
- the functions of the storage unit 12 are realized by the ROM 303 and the storage medium 305 storing programs and file groups F1 to Fn and the RAM 302 being used as a work area of the processor 301.
- a program read by the processor 301 will be described with reference to FIG.
- FIG. 8 shows a configuration example of software operating on the computer 1.
- an OS (operating system) 22
- the processor 301 operates according to the procedure defined by the OS 22 to control and manage the hardware 21, whereby processing by the application program and middleware is executed by the hardware 21.
- the index generation program 23a and the search processing program 23b are read into the RAM 302 and executed by the processor 301.
- the processor 301 performs processing based on the index generation program 23a (by controlling the hardware 21 based on the OS 22), so that the function of the generation unit 13 is realized.
- the processor 301 performs processing based on the search processing program 23b (by controlling the hardware 21 based on the OS 22), so that the functions of the search control unit 14, the narrowing unit 15, and the character string search unit 16 are realized.
- although the index generation program 23a and the search processing program 23b are shown as separate programs in FIG. 8, both programs may be combined into one program.
- FIG. 9 shows an example of an index generation processing procedure.
- the control unit 131 performs preprocessing (S101).
- the pre-processing of S101 is, for example, a process of reading the file path list of the search target file groups F1 to Fn and the character information groups C1 to Cm into the storage unit 12.
- the control unit 131 determines whether or not the generation of index information is requested (S102), and repeatedly determines until the generation of index information is requested (S102: NO).
- the control unit 131 secures a storage area for storing the index information (S103). For example, each bit in the storage area secured in S103 is set to “0”.
- the reading unit 132 refers to the list of file paths, reads the search target file groups F1 to Fn, and the analysis unit 133 performs a process of analyzing the document structure for each of the read files (S104).
- the control unit 131 divides the file according to the analysis result of the document structure by the analysis unit 133 and, for each block obtained by the division, stores the block number and the information indicating the read position of the block in the table T1 shown in FIG. 5 (S105). Detailed processing of S104 and S105 will be described later.
- the control unit 131 selects the block number i from the table T1 shown in FIG. 5, and causes the reading unit 132 to read the block Bi of the selected block number i (S106). For example, in S106, the control unit 131 selects records in the table T1 in the order of block numbers.
- the determination unit 134 selects one character information Cj from the character information C1 to Cm (S107). For example, in S107, the determination unit 134 may sequentially select the character information from the list of character information C1 to Cm held in the storage unit 12, or may generate the character information in order by incrementing the character code within a predetermined numerical range.
- the determination unit 134 determines whether or not the block Bi includes character information Cj (S108).
- when the determination unit 134 determines that the block Bi includes the character information Cj (S108: YES), the control unit 131 calculates an address based on the block number i and the character information Cj.
- the control unit 131 updates the bit at the position corresponding to the calculated address to “1” (S109). That is, the control unit 131 stores the result of the logical sum (OR) operation of the bit at the position corresponding to the calculated address and “1” at the position corresponding to the calculated address. For example, the i-th bit of the bit string corresponding to the value obtained by substituting the binary code of the character information Cj into a predetermined hash function is set to “1”.
- after the process of S109, the determination unit 134 performs the process of S110.
- when the determination unit 134 determines that the block Bi does not include the character information Cj (S108: NO), the determination unit 134 performs the process of S110.
- if there is unselected character information among the character information C1 to Cm, the determination unit 134 performs the process of S107 again (S110). If there is no unselected character information among the character information C1 to Cm, the process of S111 is performed.
- if there is an unselected block among the block groups B1 to Bp, the reading unit 132 performs the process of S106 again (S111). If there is no unselected block in the block groups B1 to Bp, the process of S112 is performed.
- the control unit 131 notifies that the index information generation processing for the file groups F1 to Fn has been completed (S112). In S112, the control unit 131 further saves information in the area secured in S103 as an index file. After the process of S112, it is determined whether an end instruction has been received (S113). If an end instruction has been received (S113: YES), the processing unit 11 ends the index generation program 23a (S114). If the end instruction has not been received (S113: NO), the process of S102 is performed again.
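The loop of S103 and S106 to S111 can be sketched roughly as below. This is an illustration under stated assumptions: blocks are already available as strings, the "predetermined hash function" is stood in for by SHA-256 reduced modulo the number of slots (the source does not name a hash function), and `slot_for` and `build_index` are hypothetical names.

```python
import hashlib
from typing import List


def slot_for(char_info: str, num_slots: int) -> int:
    """Assumed stand-in for the predetermined hash function (not from the source)."""
    digest = hashlib.sha256(char_info.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slots


def build_index(blocks: List[str], char_infos: List[str], num_slots: int = 1024) -> List[int]:
    """Build one integer bit string per slot; bit i is set when block i holds the character."""
    index = [0] * num_slots                         # S103: every bit starts at 0
    for i, block in enumerate(blocks):              # S106: select block Bi
        for c in char_infos:                        # S107: select character information Cj
            if c in block:                          # S108: does Bi include Cj?
                # S109: OR the i-th bit of the bit string for Cj with 1
                index[slot_for(c, num_slots)] |= 1 << i
    return index


index = build_index(["abc", "bcd"], ["a", "b", "d"])
```

Because several characters can hash to the same slot in this sketch, a set bit means "possibly present"; this is consistent with the later character string search, which confirms actual matches.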
- FIGS. 10A and 10B show an example of a processing procedure for document structure analysis processing.
- the number of child elements of each document element included in the file is counted for each file.
- the control unit 131 sequentially selects files from the files F1 to Fn, and the reading unit 132 reads the selected file Fi (S201).
- the analysis unit 133 reads tag information in order from the file Fi (S202).
- the analysis unit 133 determines whether the tag information read in S202 is a </body> tag (S203). If the tag information read in S202 is a </body> tag (S203: YES), the analysis unit 133 stores the document structure table created for the file Fi in the storage unit 12 (S204).
- the analysis unit 133 performs the process of S201 if there is a file that has not been subjected to the document structure analysis process, and ends the document structure analysis process if there is no file that has not been subjected to the document structure analysis process (S206).
- the process of S105 is performed (S205).
- when the tag information read in S202 is not a </body> tag (S203: NO), the analysis unit 133 determines whether the tag information indicates a document structure hierarchy (S207).
- the tag information indicating the document structure hierarchy is, for example, <body>, <h1>, <h2>. If the tag information read in S202 is not tag information indicating a hierarchy (S207: NO), the process of S202 is performed again.
- the analysis unit 133 determines whether the read tag information is tag information indicating the start (S208).
- for example, for the <body> tag, <body> indicates the start and </body> indicates the end. Likewise, for <h1>, <h1> indicates the start and </h1> indicates the end.
- when the tag information read in S202 is a tag indicating the end (S208: NO), the analysis unit 133 sets an end flag for the counting of the number of child elements, which is described later (S214).
- when the tag information read in S202 is a tag indicating the start (S208: YES), the analysis unit 133 generates a record in the document structure table T2 (S209). The first time S209 is performed for each file, the analysis unit 133 secures a storage area for the document structure table T2. In S209, the analysis unit 133 generates a new tag ID and stores the generated tag ID in the tag ID item of the document structure table. For example, the tag ID is generated by incrementing the previously generated ID value.
- FIG. 11 shows the document structure table T2.
- the document structure table T2 includes items of tag ID, number of layers, number of child elements, and flag.
- the tag ID item stores an ID assigned to tag information included in the document.
- the number-of-hierarchies item stores the number of hierarchies indicated by the tag information.
- the number-of-child-elements item stores the number of child elements of the document element indicated by the tag information.
- the flag indicates whether or not counting of the number of child elements for the tag information stored in the document structure table is completed.
- the document structure table T2 is generated for each of the files F1 to Fn.
- when generating the record in the document structure table T2, the analysis unit 133 stores the number of layers indicated by the read tag information in the number-of-layers item of the generated record (S210). For example, if the read tag information is <body>, the number of layers is 0; if it is <h1>, the number of layers is 1; if it is <h2>, the number of layers is 2; and if it is <h3>, the number of layers is 3. Next, the analysis unit 133 counts the number of child elements (S211 to S213). The analysis unit 133 sets j to the number obtained by subtracting 1 from the number of hierarchies of the read tag information, and performs the process of S212.
- the analysis unit 133 searches the records in the document structure table T2, starting from the record generated in S209 and proceeding in the direction in which the tag ID becomes smaller, and extracts a record whose number of hierarchies is j.
- the analysis unit 133 increments the value of the number-of-child-elements item of the extracted record.
- the analysis unit 133 performs the process of S202 again.
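As a rough illustration of how the document structure table T2 could be filled in, the sketch below uses Python's standard `html.parser.HTMLParser` to read tags in order. The `Record` fields mirror the items of T2; the class and function names, and the use of `HTMLParser` itself, are assumptions made for this example rather than details from the source.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List

# Tags treated as indicating a hierarchy, with their number of layers (S210).
LEVELS = {"body": 0, "h1": 1, "h2": 2, "h3": 3}


@dataclass
class Record:
    tag_id: int
    level: int
    children: int = 0
    done: bool = False      # end flag (S214)


class StructureParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.table: List[Record] = []

    def handle_starttag(self, tag, attrs):
        if tag not in LEVELS:                       # S207: NO
            return
        rec = Record(tag_id=len(self.table) + 1, level=LEVELS[tag])  # S209, S210
        self.table.append(rec)
        j = rec.level - 1                           # S211
        for earlier in reversed(self.table[:-1]):   # S212: toward smaller tag IDs
            if earlier.level == j:
                earlier.children += 1               # count this element as a child
                break

    def handle_endtag(self, tag):
        if tag not in LEVELS:
            return
        for rec in reversed(self.table):            # S214: close the latest open record
            if rec.level == LEVELS[tag] and not rec.done:
                rec.done = True
                break


def analyze(html: str) -> List[Record]:
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.table
```

For instance, `analyze("<body><h1><h2>a</h2><h2>b</h2></h1></body>")` yields four records, in which the `<body>` record has one child and the `<h1>` record has two.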
- FIGS. 12A and 12B show a processing procedure example of the file division processing.
- the determination unit 134 determines whether the data read from each file exceeds a predetermined data size.
- the control unit 131 selects one of the files F1 to Fn (S301). That is, one of 1 to n is selected.
- the control unit 131 reads the document structure table T2 corresponding to the file selected in S301 (S302).
- the determination unit 134 extracts records in which the number of child elements is a predetermined number or more from the read document structure table T2 (S303).
- when there is a record in which the number of child elements is equal to or larger than the predetermined number (S303: YES), the number of hierarchies of the record having the smallest number of hierarchies among those records is selected (S304).
- if there is no record in which the number of child elements is equal to or larger than the predetermined number (S303: NO), 0 is selected as the number of hierarchies (S305).
- the determination unit 134 reads an element indicating the selected number of hierarchies from the file Fi. Further, the determination unit 134 counts the data amount of the read element (S306). For example, the determination unit 134 sequentially extracts records in which the selected number of hierarchies is stored in the item of the number of hierarchies from the document structure table T2. In S306, the determination unit 134 reads data from the tag information indicated in the extracted record to the corresponding end tag from the file Fi.
- the determination unit 134 determines whether the data amount read in S306 is smaller than the first predetermined value (S307). If the amount of data read in S306 is smaller than the first predetermined value (S307: YES), it is determined whether there is unread data in the file Fi (S308).
- when there is unread data in the file Fi (S308: YES), the determination unit 134 adds the data amount counted in S306 to the integrated value S (S309). The initial value of the integrated value for each file is zero. The determination unit 134 determines whether or not the integrated value is greater than a second predetermined value (S310). When the integrated value is not larger than the second predetermined value (S310: NO), the determination unit 134 performs the process of S306 again. When the integrated value is larger than the second predetermined value (S310: YES), the read end position when the data is read in S306 is stored in the table T1 shown in FIG. 5 (S311).
- the reading position is stored in the table T1 as the reading position of the second block of the file Fi. Further, the determination unit 134 clears the integrated value (S312). When the process of S312 is completed, the determination unit 134 performs the process of S306 again.
- as the second predetermined value, for example, a value smaller than the first predetermined value is used.
- when the amount of data read in S306 is not smaller than the first predetermined value (S307: NO), the determination unit 134 increments the number of hierarchies that determines the unit for reading data (S318). As a result, the determination unit 134 can divide the file into blocks in smaller units.
- the determination unit 134 reads data from the file Fi based on the number of hierarchies determined in S318, and counts the data amount (S319). Further, the determination unit 134 determines whether or not the data amount read in S319 is smaller than the first predetermined value (S320). When the amount of data read in S319 is not smaller than the first predetermined value (S320: NO), the determination unit 134 performs S318 again.
- the determination unit 134 determines whether or not all of the data at the level one above the number of hierarchies selected in S318 (the number of hierarchies minus 1; for example, the data read immediately before in S306) has been read with the number of hierarchies selected in S318 (S321). If it is determined in S321 that all the data has been read (S321: YES), the determination unit 134 performs the process of S309 (S322).
- the determination unit 134 performs the same determination as in S310 (S323), and when YES is determined in S323 (S323: YES), performs the same processing as S311 and S312 (S324 and S325), and performs the process of S319 again. When it is determined NO in S323 (S323: NO), the determination unit 134 performs the process of S319.
- the determination unit 134 performs the same processing as S311 and S312 (S326 and S327). Next, the determination unit 134 decrements the number of layers to be selected (S328). The determination unit 134 determines whether the number of selected hierarchies is 0 or the number of hierarchies selected in S304 (S329). In the determination of S329, when the number of selected hierarchies is 0 or the number of hierarchies selected in S304 (S329: YES), the determination unit 134 performs the process of S306 again. If neither condition is satisfied in S329 (S329: NO), the determination unit 134 performs the process of S319 again.
- the determination unit 134 clears the integrated value (S313). If the file Fi is not the file Fn, the generation unit 13 performs the process again from S301 (S314). When the file Fi is the file Fn, the total number of blocks obtained by dividing the files F1 to Fn is set to p (S315). Further, the generation unit 13 performs the process of S106 (S316).
- the reading position may be returned to the position of the most recently read punctuation mark. This avoids placing a block boundary in the middle of a sentence. Furthermore, for example, the reading position may be returned to the previous line feed.
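Two core pieces of the division procedure can be sketched in simplified form: choosing the number of hierarchies (S303 to S305) and accumulating element sizes until the integrated value exceeds the second predetermined value (S306, S309 to S312). The descent into lower hierarchies for oversized elements (S318 onward) is deliberately omitted, and the function names, the dictionary-based records, and the use of byte counts as sizes are assumptions for illustration only.

```python
from typing import Dict, List


def select_level(records: List[Dict[str, int]], min_children: int) -> int:
    """S303-S305: smallest level among records with enough child elements, else 0."""
    levels = [r["level"] for r in records if r["children"] >= min_children]
    return min(levels) if levels else 0


def split_positions(element_sizes: List[int], second_value: int) -> List[int]:
    """Return read end positions at which a block boundary is recorded (S311)."""
    boundaries: List[int] = []
    integrated = 0      # integrated value S, zero at the start of each file
    position = 0        # read end position within the file
    for size in element_sizes:              # S306: read one element, count its size
        position += size
        integrated += size                  # S309
        if integrated > second_value:       # S310
            boundaries.append(position)     # S311: store the boundary in T1
            integrated = 0                  # S312: clear the integrated value
    return boundaries


records = [
    {"level": 0, "children": 1},
    {"level": 1, "children": 5},
    {"level": 2, "children": 7},
]
level = select_level(records, min_children=3)
cuts = split_positions([40, 30, 50, 20, 90], second_value=60)
```

In this sketch small elements are merged until a block is large enough, so block boundaries always coincide with element boundaries at the selected hierarchy.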
- FIG. 13 shows an example of a full text search processing procedure.
- the search control unit 14 performs preprocessing (S401).
- the preprocessing of S401 is reading of the table T1 shown in FIG. 5 and reading of index information.
- the search control unit 14 determines whether or not a search request has been received (S402), and repeats the determination in S402 until a search request is received (S402: NO). If a search request is received (S402: YES), an index reference process is executed (S403).
- FIG. 14 shows an example of a reference processing procedure for index information.
- the search control unit 14 extracts the search character string included in the search request, and extracts, from among the character information C1 to Cm, the character information Ca, Cb, ... included in the search character string (S501).
- for each of the block groups B1 to Bp, the narrowing-down unit 15 determines whether or not the block is one that does not include any one of the extracted character information Ca, Cb, .... Specifically, first, one of the extracted character information is selected (S502). The reference unit 151 calculates an address based on the selected character information, and reads information stored at the position indicated by the calculated address (S503). In S503, the reference unit 151 calculates an address by the same calculation as in S109. At that time, for example, the reference unit 151 reads a bit string corresponding to a value obtained by substituting the binary code of the selected character information into a predetermined hash function.
- if there is unselected character information among the extracted character information Ca, Cb, ..., the narrowing-down unit 15 performs the process of S502 again. If there is no unselected character information, the index reference process is terminated (S504, S505).
- the narrowing down unit 15 extracts the block number of the search target block (S404).
- the determination unit 152 calculates a logical product (AND) of the bit strings read by the reference unit 151 for each of the character information Ca, Cb, ....
- the determination unit 152 generates numbers indicating the positions of the bits that are “1” in the calculated bit string. For example, if the xth bit and the yth bit are “1” in the calculated bit string, the determination unit 152 generates x and y.
- the search control unit 14 selects a number i that is one of the numbers x, y,... Generated by the determination unit 152 (S405).
- the character string search unit 16 reads out a block Bi whose block number is the selected number i (S406).
- the character string search unit 16 reads a block from the reading position associated with the block number i in the table T1 shown in FIG. 5.
- the character string search unit 16 searches the read block Bi with the search character string (S407). For example, when the character string search unit 16 detects a character string that matches the search character string in the block Bi, the character string search unit 16 generates information indicating the position of the matched character string in the block Bi, and the block number of the block Bi.
- a counter that counts the amount of data collated with the search character string is provided in advance, and the value of the counter when the matching of the character string is detected is used as information indicating the position in the file.
- FIG. 15 shows an example of a table for storing search results.
- the table T2 illustrated in FIG. 15 includes a record indicating a position where a character string that matches the search character string exists.
- the position of the character string that matches the search character string is indicated, for example, by the number of the block that includes the character string and the value of the counter that is incremented each time the character information of each block is read.
- the counter value is read, for example, when a match is detected.
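The search over the narrowed-down blocks (S405 to S407) might look like the following sketch. It assumes blocks are held in a dictionary keyed by block number and uses the character offset inside the block as a stand-in for the counter value; `search_blocks` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def search_blocks(blocks: Dict[int, str], candidates: List[int], query: str) -> List[Tuple[int, int]]:
    """Search only the candidate blocks; return (block number, offset) for each match."""
    hits: List[Tuple[int, int]] = []
    for number in candidates:                   # S405: select a number i
        text = blocks[number]                   # S406: read block Bi
        start = text.find(query)                # S407: collate with the search string
        while start != -1:
            hits.append((number, start))        # record block number and position
            start = text.find(query, start + 1)
    return hits


blocks = {1: "abcabc", 2: "xyz", 3: "zabc"}
results = search_blocks(blocks, [1, 3], "abc")
```

Block 2 is never read here because the narrowing step did not list it, which is the point of consulting the index first.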
- after the process of S407, if there is an unselected number among the numbers x, y,... generated by the determination unit 152, the search control unit 14 performs the process of S405 (S408). When there is no unselected number among the numbers x, y,... generated by the determination unit 152, the search control unit 14 performs the process of S409.
- the search control unit 14 performs search result output processing (S409). For example, a character string near the position indicated in the information stored in the table T2 shown in FIG. 15 is extracted in the process of S407, and the extracted character string is combined with the file name of the file corresponding to the block number. Processing such as displaying on a display device is performed.
- the processing unit 11 determines whether or not there is an instruction to end (S410). If there is no end instruction (S410: NO), the search control unit 14 performs the process of S402. When there is an instruction to end (S410: YES), the processing unit 11 ends the search processing program 23b (S411).
Abstract
Description
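a narrowing unit that identifies, by referring to the index information stored in the storage unit, a block that the index information indicates as including the character information; and a search unit that performs a character string search using the search character string on the identified block.
2 Computer
3 Storage device
4 Network
11 Processing unit
12 Storage unit
13 Generation unit
14 Search control unit
15 Narrowing unit
16 Character string search unit
131 Control unit
132 Reading unit
133 Analysis unit
134 Determination unit
151 Reference unit
152 Determination unit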
Claims (8)
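- A generation program causing a computer to execute a process comprising:
switching, depending on whether a document element having a predetermined number or more of child elements exists in a document file, between performing control of which of a plurality of blocks data in the document file is to be included in for each document element at the hierarchy of the child elements, and performing the control for each document element at the hierarchy of the document element or of an element higher than the document element;
dividing the document file into the plurality of blocks by the control according to the switching; and
generating, for each block obtained by the division, index information indicating whether each block includes predetermined character information.
- The generation program according to claim 1, further causing the computer to execute a process comprising:
executing the control for each document element one hierarchy lower when the data size of a document element at the hierarchy of the document element or of a document element higher than the document element is larger than a predetermined value.
- The generation program according to claim 1 or claim 2, wherein each document element included in the document file is a group of character information included in the range from a start tag to an end tag of a tag included in the document file.
- A generation method causing a computer to execute a process comprising:
switching, depending on whether a document element having a predetermined number or more of child elements exists in a document file, between performing control of which of a plurality of blocks data in the document file is to be included in for each document element at the hierarchy of the child elements, and performing the control for each document element at the hierarchy of the document element or of an element higher than the document element;
dividing the document file into the plurality of blocks by the control according to the switching; and
generating, for each block obtained by the division, index information indicating whether each block includes predetermined character information.
- A generation apparatus comprising:
a division unit that switches, depending on whether a document element having a predetermined number or more of child elements exists in a document file, between performing control of which of a plurality of blocks data in the document file is to be included in for each document element at the hierarchy of the child elements, and performing the control for each document element at the hierarchy of the document element or of an element higher than the document element, and that divides the document file into the plurality of blocks by the control according to the switching; and
a generation unit that generates, for each block obtained by the division, index information indicating whether each block includes predetermined character information.
- A search program causing a computer to execute a process comprising:
referring, upon receiving a search character string and based on character information included in the search character string, to index information in which each block is associated with whether the block includes the character information, the blocks being obtained by a division performed while switching, depending on whether a document element having a predetermined number or more of child elements exists in a document file, between performing control of which of a plurality of blocks data in the document file is to be included in for each document element at the hierarchy of the child elements, and performing the control for each document element at the hierarchy of the document element or of an element higher than the document element;
identifying, by referring to the index information, a block that the index information indicates as including the character information; and
performing a character string search using the search character string on the identified block.
- A search method causing a computer to execute a process comprising:
referring, upon receiving a search character string and based on character information included in the search character string, to index information in which each block is associated with whether the block includes the character information, the blocks being obtained by a division performed while switching, depending on whether a document element having a predetermined number or more of child elements exists in a document file, between performing control of which of a plurality of blocks data in the document file is to be included in for each document element at the hierarchy of the child elements, and performing the control for each document element at the hierarchy of the document element or of an element higher than the document element;
identifying, by referring to the index information, a block that the index information indicates as including the character information; and
performing a character string search using the search character string on the identified block.
- A search apparatus comprising:
a reception unit that receives a search character string;
a storage unit that stores index information in which each block is associated with whether the block includes character information included in the search character string received by the reception unit, the blocks being obtained by a division performed while switching, depending on whether a document element having a predetermined number or more of child elements exists in a document file, between performing control of which of a plurality of blocks data in the document file is to be included in for each document element at the hierarchy of the child elements, and performing the control for each document element at the hierarchy of the document element or of an element higher than the document element;
a narrowing unit that identifies, by referring to the index information stored in the storage unit, a block that the index information indicates as including the character information; and
a search unit that performs a character string search using the search character string on the identified block.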
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201280073480.3A CN104380286A (zh) | 2012-05-31 | 2012-05-31 | Index generation program and search program |
EP12877979.0A EP2857986A4 (en) | 2012-05-31 | 2012-05-31 | INDEX GENERATION PROGRAM AND RESEARCH PROGRAM |
JP2014518093A JP5880699B2 (ja) | 2012-05-31 | 2012-05-31 | Index generation program and search program |
PCT/JP2012/003592 WO2013179348A1 (ja) | 2012-05-31 | 2012-05-31 | Index generation program and search program |
US14/556,012 US20150088944A1 (en) | 2012-05-31 | 2014-11-28 | Generating method, generating apparatus, and recording medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2012/003592 WO2013179348A1 (ja) | 2012-05-31 | 2012-05-31 | Index generation program and search program |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/556,012 Continuation US20150088944A1 (en) | 2012-05-31 | 2014-11-28 | Generating method, generating apparatus, and recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013179348A1 true WO2013179348A1 (ja) | 2013-12-05 |
Family
ID=49672607
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/003592 WO2013179348A1 (ja) | 2012-05-31 | 2012-05-31 | Index generation program and search program |
Country Status (5)
Country | Link |
---|---|
US (1) | US20150088944A1 (ja) |
EP (1) | EP2857986A4 (ja) |
JP (1) | JP5880699B2 (ja) |
CN (1) | CN104380286A (ja) |
WO (1) | WO2013179348A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834277A (zh) * | 2014-02-07 | 2015-08-12 | 富士通株式会社 | Management method, management device, and management system |
WO2016001991A1 (ja) * | 2014-06-30 | 2016-01-07 | 株式会社日立製作所 | Search method |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105844726B (zh) * | 2016-03-18 | 2018-04-17 | 吉林大学 | Handwritten signature check-in management system |
EP3608800A4 (en) * | 2017-04-06 | 2020-04-01 | Fujitsu Limited | INDEX GENERATION PROGRAM, INDEX GENERATION DEVICE, INDEX GENERATION METHOD, SEARCH PROGRAM, SEARCH DEVICE, AND SEARCH METHOD |
JP6911877B2 (ja) * | 2018-02-19 | 2021-07-28 | 日本電信電話株式会社 | Information management device, information management method, and information management program |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06290217A (ja) * | 1993-03-31 | 1994-10-18 | Ricoh Co Ltd | Document retrieval method |
JPH08147311A (ja) * | 1994-11-17 | 1996-06-07 | Hitachi Ltd | Structured document retrieval method and apparatus |
JPH08314966A (ja) | 1995-05-19 | 1996-11-29 | Toshiba Corp | Index creation method for document retrieval apparatus, and document retrieval apparatus |
JPH08329116A (ja) * | 1995-06-05 | 1996-12-13 | Hitachi Ltd | Structured document retrieval method |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5438657A (en) * | 1992-04-24 | 1995-08-01 | Casio Computer Co., Ltd. | Document processing apparatus for extracting a format from one document and using the extracted format to automatically edit another document |
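JP2758826B2 (ja) * | 1994-03-02 | 1998-05-28 | 株式会社リコー | Document retrieval apparatus |
JP3520554B2 (ja) * | 1994-03-11 | 2004-04-19 | ヤマハ株式会社 | Digital data reproduction method and apparatus |
JPH08241325A (ja) * | 1995-03-03 | 1996-09-17 | Matsushita Electric Ind Co Ltd | Electronic dictionary, method of manufacturing the same, and index compression / decompression device |
JP3160201B2 (ja) * | 1996-03-25 | 2001-04-25 | インターナショナル・ビジネス・マシーンズ・コーポレ−ション | Information retrieval method and information retrieval apparatus |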
US5774715A (en) * | 1996-03-27 | 1998-06-30 | Sun Microsystems, Inc. | File system level compression using holes |
CA2242158C (en) * | 1997-07-01 | 2004-06-01 | Hitachi, Ltd. | Method and apparatus for searching and displaying structured document |
US6704753B1 (en) * | 1998-01-29 | 2004-03-09 | International Business Machines Corporation | Method of storage management in document databases |
US20020129006A1 (en) * | 2001-02-16 | 2002-09-12 | David Emmett | System and method for modifying a document format |
US7248737B2 (en) * | 2001-10-02 | 2007-07-24 | Siemens Corporate Research, Inc. | Page decomposition using local orthogonal transforms and a map optimization |
JP4322031B2 (ja) * | 2003-03-27 | 2009-08-26 | 株式会社日立製作所 | Storage device |
US7366837B2 (en) * | 2003-11-24 | 2008-04-29 | Network Appliance, Inc. | Data placement technique for striping data containers across volumes of a storage system cluster |
JP4314204B2 (ja) * | 2005-03-11 | 2009-08-12 | 株式会社東芝 | Document management method, system, and program |
US7797310B2 (en) * | 2006-10-16 | 2010-09-14 | Oracle International Corporation | Technique to estimate the cost of streaming evaluation of XPaths |
US8412677B2 (en) * | 2008-11-26 | 2013-04-02 | Commvault Systems, Inc. | Systems and methods for byte-level or quasi byte-level single instancing |
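CN102741838B (zh) * | 2009-10-02 | 2017-05-03 | A·穆苏卢里 | System and method for block segmentation, identification and indexing of visual elements, and searching documents |
JP5083367B2 (ja) * | 2010-04-27 | 2012-11-28 | カシオ計算機株式会社 | Search device, search method, and computer program |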
US9501661B2 (en) * | 2014-06-10 | 2016-11-22 | Salesforce.Com, Inc. | Systems and methods for implementing an encrypted search index |
-
2012
- 2012-05-31 JP JP2014518093A patent/JP5880699B2/ja active Active
- 2012-05-31 WO PCT/JP2012/003592 patent/WO2013179348A1/ja active Application Filing
- 2012-05-31 EP EP12877979.0A patent/EP2857986A4/en not_active Withdrawn
- 2012-05-31 CN CN201280073480.3A patent/CN104380286A/zh active Pending
-
2014
- 2014-11-28 US US14/556,012 patent/US20150088944A1/en not_active Abandoned
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06290217A (ja) * | 1993-03-31 | 1994-10-18 | Ricoh Co Ltd | Document retrieval method |
JPH08147311A (ja) * | 1994-11-17 | 1996-06-07 | Hitachi Ltd | Structured document retrieval method and apparatus |
JPH08314966A (ja) | 1995-05-19 | 1996-11-29 | Toshiba Corp | Index creation method for document retrieval apparatus, and document retrieval apparatus |
JPH08329116A (ja) * | 1995-06-05 | 1996-12-13 | Hitachi Ltd | Structured document retrieval method |
Non-Patent Citations (3)
Title |
---|
HIROTO KURITA: "Efficiency of Distributed Query Processing for Huge XML Data", IPSJ SIG NOTES, vol. 2006, no. 33, 22 March 2006 (2006-03-22), pages 23 - 30, XP031095315 * |
See also references of EP2857986A4 |
TOM WHITE: "Hadoop", 2011 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834277A (zh) * | 2014-02-07 | 2015-08-12 | 富士通株式会社 | Management method, management device, and management system |
WO2016001991A1 (ja) * | 2014-06-30 | 2016-01-07 | 株式会社日立製作所 | Search method |
JPWO2016001991A1 (ja) * | 2014-06-30 | 2017-04-27 | 株式会社日立製作所 | Search method |
Also Published As
Publication number | Publication date |
---|---|
US20150088944A1 (en) | 2015-03-26 |
EP2857986A1 (en) | 2015-04-08 |
CN104380286A (zh) | 2015-02-25 |
JPWO2013179348A1 (ja) | 2016-01-14 |
EP2857986A4 (en) | 2015-10-14 |
JP5880699B2 (ja) | 2016-03-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12877979 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2014518093 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
REEP | Request for entry into the european phase |
Ref document number: 2012877979 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2012877979 Country of ref document: EP |