CN109284273A - Method and system for querying massive numbers of small files using a suffix array index

Info

Publication number: CN109284273A
Application number: CN201811133108.2A
Authority: CN
Inventors: 赵鑫; 孙茜; 农革
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-09-27
Filing date: 2018-09-27
Publication date: 2019-01-29
Anticipated expiration: 2038-09-27
Also published as: CN109284273B

Abstract

The invention discloses a method for querying massive numbers of small files using a suffix array index. Small files are merged before being stored in a distributed file system, which improves space utilization. At the same time, a suffix array index record is established for each small file, recording its storage information and the attribute information of the small file itself. The invention also provides an effective small-file update method and supports multiple ways of querying small files, avoiding the single, inefficient query mode of conventional approaches and ensuring that queries are timely, accurate, and efficient. It addresses the problems caused in the prior art by simply merging small files: a single query mode, low read efficiency, difficult updates, and poor query timeliness.

Description

Method and system for querying massive numbers of small files using a suffix array index

Technical field

The present invention relates to the field of big data management, and more particularly to a method and system for querying massive numbers of small files using a suffix array index.

Background technique

We are currently in the era of big data. All kinds of modern information applications produce massive amounts of data, which brings corresponding pressure on storage and management. Common distributed file systems, of which HDFS is representative, are designed to be better suited to storing large files. When a small file is stored, it occupies a complete storage unit on its own, which wastes space. Storing small files directly on a distributed file system also consumes a large amount of server memory for the metadata created for each small file, and once the number of small files reaches a certain scale, storage and retrieval slow down accordingly.

The common way to solve these problems is to merge small files before storing them in the distributed file system. However, the prior art mainly builds an index directly on the offset of each small file within the large file, for example a hash index, so that the files are simply merged. This way of merging leads to problems such as a single query mode for small files, low read efficiency, difficult updates, and no guarantee of query timeliness.

Summary of the invention

The object of the invention is to overcome the problems caused in the prior art by simply merging small files, such as a single query mode, low read efficiency, difficult updates, and poor query timeliness, and to provide a method for querying massive numbers of small files using a suffix array index.

To achieve the above object, the following technical means are adopted:

A method for querying massive numbers of small files using a suffix array index, comprising:

Small-file storing step:

The client submits a file upload request;

The size of each file is obtained and evaluated. If a file is judged not to be a small file, a suffix array index is established for it and it is uploaded to the distributed file system; if it is judged to be a small file, it is put into a merge queue for merging, a suffix array index is established for each small file, and the merged file is uploaded to the distributed file system.

Small-file query step:

Obtain and parse the query request;

Determine the query type;

Determine the specified field to be queried and the query condition;

Search the specified field in the suffix array index according to the query condition to obtain the matching suffix array index records;

Obtain the location of each small file in the distributed file system from its suffix array index record, and retrieve the corresponding small file from the distributed file system.

In the above scheme, a suffix array index record is established for each small file, recording its storage information and the attribute information of the small file itself, and the small files are then merged and stored in the distributed file system. As a result, the invention supports multiple ways of querying small files, avoids the single, inefficient query mode of conventional approaches, and ensures that queries are timely, accurate, and efficient.

Preferably, the judgment in the storing step proceeds as follows: the size of the default storage unit of the distributed file system is defined as threshold b, and threshold a is defined as a value less than threshold b; a file smaller than threshold a is a small file, and a file greater than or equal to threshold a is not a small file.
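
The size judgment can be sketched in a few lines of Python. This is only an illustration: the function name `is_small_file` and the constant names are hypothetical, and the values 128 MB and 32 MB are the ones used later in the embodiment rather than values required by the method.

```python
# Threshold b: size of the default storage unit (128 MB in the embodiment).
THRESHOLD_B = 128 * 1024 * 1024
# Threshold a: any value below b (32 MB in the embodiment).
THRESHOLD_A = 32 * 1024 * 1024


def is_small_file(size_bytes, threshold_a=THRESHOLD_A, threshold_b=THRESHOLD_B):
    """Return True if a file of the given size counts as a small file."""
    if not threshold_a < threshold_b:
        raise ValueError("threshold a must be less than threshold b")
    # Files smaller than a are small; files of size >= a are not.
    return size_bytes < threshold_a
```

A file of exactly 32 MB is therefore treated as a non-small file, since the comparison with threshold a is strict.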

Preferably, the suffix array index in the storing step comprises five fields: the small file's name, the small file's size, the name of the file in the distributed file system in which the small file is stored, the small file's offset within that stored file, and the creation time. Each field comprises the metadata that records the file's content for that field, a suffix array, and a field information structure;

The field information structure comprises the number of files stored in the field, the metadata size of the field, and a file information structure FileInfo for each file recorded in the field;

FileInfo comprises an index deletion marker, the size of the metadata holding the file's attribute content in this field, the offset of the first byte of the file's attribute metadata within this field's metadata, and the file ID.
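
One possible in-memory layout for a single field is sketched below in Python. The class names `FieldIndex` and `FileInfo` and the helper `build_field_index` are illustrative, the UTF-8 encoding of values is an assumption, and the suffix array is built by naive sorting purely for brevity; the patent does not specify a construction algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class FileInfo:
    delete: int   # index deletion marker: 0 = not deleted, 1 = deleted
    size: int     # size of this file's value in the field metadata
    offset: int   # offset of the value's first byte in the field metadata
    file_id: int  # file ID shared across all five fields


@dataclass
class FieldIndex:
    metadata: bytes = b""                              # concatenated field values
    suffix_array: list = field(default_factory=list)   # sorted suffix start positions
    file_infos: list = field(default_factory=list)     # one FileInfo per file

    @property
    def file_num(self):       # number of files stored in the field
        return len(self.file_infos)

    @property
    def current_size(self):   # metadata size of the field
        return len(self.metadata)


def build_field_index(values):
    """Build the index of one field from one value per file (assumed layout)."""
    index = FieldIndex()
    chunks, offset = [], 0
    for file_id, value in enumerate(values):
        raw = str(value).encode("utf-8")
        index.file_infos.append(FileInfo(0, len(raw), offset, file_id))
        chunks.append(raw)
        offset += len(raw)
    index.metadata = b"".join(chunks)
    # Naive construction: sort suffix start positions by the suffix they begin.
    index.suffix_array = sorted(range(len(index.metadata)),
                                key=lambda i: index.metadata[i:])
    return index
```

Under this layout a complete index would hold five such `FieldIndex` objects, one per field, with the same file ID identifying a file in each of them.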

Preferably, the storing step further comprises: when the total size of the files in the merge queue reaches threshold b, merging the files in binary form, establishing a suffix array index for each file, and then uploading the merged file; after the upload completes, the merge queue is emptied and reclaimed.

Preferably, the query types in the query step include exact query and fuzzy query.

Preferably, searching the specified field in the query step specifically comprises: querying the metadata and the suffix array to find the offset of each match in the metadata of the suffix array index record, and finding the corresponding file ID in FileInfo according to that offset;
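
The lookup can be illustrated with a binary search over the suffix array followed by a mapping from the matching offset to a file ID. The sketch below is an assumption-laden example, not the patented implementation: the function names are hypothetical, FileInfo entries are simplified to `(offset, size, file_id, deleted)` tuples sorted by offset, and an exact match is taken to mean that the match starts at a record's offset and spans its whole value.

```python
from bisect import bisect_right


def build_suffix_array(text):
    # Naive construction, adequate for a small illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all offsets in text at which pattern occurs."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-byte prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose m-byte prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


def exact_lookup(metadata, suffix_array, records, value):
    """Return the file IDs whose whole field value equals `value`."""
    starts = [r[0] for r in records]
    hits = []
    for pos in find_occurrences(metadata, suffix_array, value):
        offset, size, file_id, deleted = records[bisect_right(starts, pos) - 1]
        if pos == offset and size == len(value) and not deleted:
            hits.append(file_id)
    return hits
```

The check against the record's offset and size filters out matches that merely fall inside, or straddle, other values in the concatenated metadata.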

Preferably, the method further comprises:

Small-file update step:

Obtain the small file to be updated;

Search the suffix array index and mark the small file to be updated as deleted;

Upload the updated small file;

Physically reorganize the merged files in the distributed file system that contain old versions of small files and satisfy the reorganization condition. The reorganization condition is defined as follows: the total size of the small files in a merged file that have not been updated is that merged file's effectively used space; a threshold is set for the effectively used space of each merged file in the distributed file system; the reorganization condition is satisfied when the number of merged files whose effectively used space is below the threshold reaches a specified number.

Preferably, the effectively used space in the update step is calculated as follows: query the suffix array index records of small files whose deletion marker is 0, obtain the merged file in which each is stored, and calculate the effectively used space of each merged file in the distributed file system; determine whether it reaches the threshold, and if it does not, count the merged file as one whose effectively used space is below the threshold.
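
The reorganization condition amounts to a small aggregation over the index records. The Python sketch below assumes each record is reduced to a `(merged_file_name, size, delete_marker)` tuple; the function names and both threshold parameters are hypothetical.

```python
from collections import defaultdict


def effective_space(records):
    """Sum the sizes of non-deleted small files per merged file."""
    totals = defaultdict(int)
    for merged_name, size, delete in records:
        # Deleted records contribute nothing but keep the merged file listed.
        totals[merged_name] += 0 if delete else size
    return dict(totals)


def reorganization_needed(records, space_threshold, count_threshold):
    """Return (condition_met, merged files whose effective space is too low)."""
    below = sorted(name for name, space in effective_space(records).items()
                   if space < space_threshold)
    return len(below) >= count_threshold, below
```

For example, with a space threshold of 30 and a count threshold of 2, two merged files holding 10 and 0 live bytes would together trigger a reorganization.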

The invention also provides a system applying the above method, comprising:

File size judgment module, for determining whether each file to be uploaded is a small file;

Merging module, for merging multiple small files;

File upload module, for uploading merged files or files that are not small files;

Index module, for creating a suffix array index for each file;

Query module, for providing multiple query types for querying massive numbers of small files.

Query file acquisition module, for obtaining the queried small files from the distributed file system according to the suffix array index records.

File update module, for updating small files.

Merged file reorganization module, for reorganizing the merged files in the distributed file system after small files are updated: old versions of small files are deleted, and new merged files are generated and stored in the distributed file system.

Preferably, the query types provided by the query module include exact query and fuzzy query.

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

The method for querying massive numbers of small files using a suffix array index establishes a suffix array index for each small file, recording its storage information and the attribute information of the small file itself, and then merges the small files and stores them in the distributed file system. The invention also provides an effective small-file update method that combines logical deletion with physical reorganization, which can save a large amount of I/O overhead compared with conventional direct physical deletion and rebuilding. As a result, the invention supports multiple ways of querying small files, avoids the single, inefficient query mode of conventional approaches, and solves the problems caused in the prior art by simply merging small files, such as a single query mode, low read efficiency, difficult updates, and poor query timeliness.

Detailed description of the invention

Fig. 1 is a flow chart of the small-file storage method of one embodiment of the invention.

Fig. 2 is a flow chart of the small-file query method of one embodiment of the invention.

Fig. 3 is a flow chart of the small-file update method of one embodiment of the invention.

Fig. 4 is a module connection diagram of the system of the invention.

Specific embodiment

The drawings are for illustrative purposes only and shall not be construed as limiting this patent;

To better illustrate the embodiment, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The following further describes the technical solution of the present invention with reference to the accompanying drawings and examples.

In this embodiment, the invention is applied to the Hadoop Distributed File System (HDFS).

The attribute data of the two small files in this embodiment are shown in Table 1. There are five attributes, i.e. five fields: the small file name filename, the small file size filesize, the name unionfilename of the file in the distributed system in which the small file is stored, the offset fileoffset within that stored file, and the creation time date.

Table 1

The suffix array index corresponding to the two files is shown in Table 2. Each field comprises the metadata that records the file's content for that field, a suffix array, and a field information structure;

Table 2

The field information structure of the filename field for the two files is shown in Table 3. It comprises the number FileNum of files stored in the field, the metadata size currentSize of the field, and the file information structure FileInfo of each file recorded in the field. FileInfo contains the index deletion marker delete (0 means not deleted, 1 means deleted), the size size of the metadata holding the file's attribute content in this field, the offset offset of the first byte of the file's attribute metadata within this field's metadata, and the file ID fileID. Because the files corresponding to the records in this embodiment have not been deleted, delete is 0.

Table 3

As shown in Fig. 1, the small-file storing step comprises:

A1. Submit a file upload request and obtain a large number of small files of any type;

A2. Evaluate the size of each file to be stored in HDFS, one by one;

Because the default block size in Hadoop 2.x is 128 MB, threshold b is set to 128 MB; threshold a is set to 32 MB according to specific requirements.

If the file size is less than threshold a, go to step A3; otherwise, put the file into a new upload queue, which thereby satisfies the upload condition, and go to step A5;

A3. Put each file judged to be a small file into the merge queue. Before doing so, make a second judgment: whether the total size of all files in the merge queue exceeds threshold b. If it does, put the small file into a new merge queue, because the current merge queue already satisfies the upload condition, and proceed to step A4; otherwise, return to step A2 to evaluate the size of the next file;

A4. Merge, in binary form, the files in the merge queue that satisfies the upload condition of A3, forming an upload queue that satisfies the upload condition;

A5. Establish a suffix array index for all files in the upload queue that satisfies the upload condition (if files were merged, for the small files as they were before merging). The index records each file's location information and own information, namely the small file name, the small file size, the name of the file on HDFS in which the small file is stored, the offset within that file on HDFS, and the creation time; the index information is stored on the server that maintains the index;

A6. Upload the merged file in binary form, or the single large file (likewise converted to binary form), to HDFS, then empty and reclaim the upload queue from A5;

A7. Determine whether any files remain to be uploaded. If none remain, the upload request is complete; otherwise, go to step A2.

When the files in a merge queue are merged, the files in the queue are retained; a new merged file is simply created on HDFS and the files in the queue are copied into it. The information of each file before merging can therefore be obtained from the files in the queue when the file index is established.
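
The storing flow can be approximated by the following Python sketch. It is a simplification under stated assumptions: an in-memory dictionary stands in for HDFS, the merge queue is flushed when adding the next small file would push it past threshold b (one reading of step A3), the creation time is omitted, and names such as `store_files`, `read_small_file`, and `merged_0` are hypothetical.

```python
def store_files(files, threshold_a, threshold_b):
    """Store (name, data) pairs; return (uploaded, index).

    `uploaded` is a dict standing in for the distributed file system and
    `index` holds one location record per original file.
    """
    uploaded, index = {}, []
    queue, queued_size, merged_count = [], 0, 0

    def flush():
        # Merge the queued small files in binary form and "upload" the result.
        nonlocal queue, queued_size, merged_count
        if not queue:
            return
        merged_name = f"merged_{merged_count}"
        merged_count += 1
        offset = 0
        for name, data in queue:
            index.append({"filename": name, "filesize": len(data),
                          "unionfilename": merged_name, "fileoffset": offset})
            offset += len(data)
        uploaded[merged_name] = b"".join(data for _, data in queue)
        queue, queued_size = [], 0

    for name, data in files:
        if len(data) >= threshold_a:
            # Not a small file: index it and upload it on its own.
            index.append({"filename": name, "filesize": len(data),
                          "unionfilename": name, "fileoffset": 0})
            uploaded[name] = data
            continue
        if queued_size + len(data) > threshold_b:
            flush()  # the current queue already satisfies the upload condition
        queue.append((name, data))
        queued_size += len(data)
    flush()  # upload whatever remains once all files have been judged
    return uploaded, index


def read_small_file(uploaded, record):
    """Slice one small file back out of its merged file."""
    start = record["fileoffset"]
    return uploaded[record["unionfilename"]][start:start + record["filesize"]]
```

Because each record keeps the merged file name, the offset, and the size, a small file can be read back with a single slice of the merged file, which corresponds to how the query process later retrieves it.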

This embodiment queries for the small file in Table 1 whose file name filename is picture. As shown in Fig. 2, the small-file query process comprises:

B1. Obtain the query request;

B2. Parse the query request;

B3. Determine whether the query request is an exact query or a fuzzy query;

B4. Determine the specified field to be queried and the query condition from the parsed query request; in this embodiment the specified field is the filename field and the query condition is picture.

B5. Search the filename field in the suffix array index: query the metadata and the suffix array to find the offset offset of picture in the metadata, then find the corresponding file ID in FileInfo according to offset;

B6. Use the file ID to obtain the corresponding data from the other fields of the suffix array index, yielding the complete small-file information: the small file name, the small file size, the name of the file on HDFS in which it is stored, the offset within that HDFS file, and the creation time.

B7. Obtain the corresponding storage file on HDFS and read the queried small file according to the offset and the small file size.

An exact query specifies a complete value, for example searching for the small file named book.txt. A fuzzy query requires a defined format and set of wildcards: for example, * matches any characters, _ matches a single character, and [abcd] matches any single character in the string abcd. For example, searching for small files named b_e.text returns multiple small files that satisfy the fuzzy query condition, such as bce.text and bde.text.
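
The wildcard semantics can be illustrated by translating a fuzzy pattern into a regular expression. This Python sketch only demonstrates the matching rules: it scans a list of names linearly rather than using the suffix array index, it assumes that * matches any sequence of characters (including none), and the function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate *, _ and [chars] wildcards into a compiled regular expression."""
    parts, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            parts.append(".*")   # assumed: any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        elif ch == "[" and pattern.find("]", i + 2) != -1:
            end = pattern.index("]", i + 2)
            # Any single character from the listed set.
            parts.append("[" + re.escape(pattern[i + 1:end]) + "]")
            i = end
        else:
            parts.append(re.escape(ch))  # literal character
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def fuzzy_match(pattern, names):
    """Return the names that match the whole fuzzy pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]
```

With these rules, `fuzzy_match("b_e.text", ["bce.text", "bde.text", "be.text"])` keeps bce.text and bde.text but rejects be.text, because _ must consume exactly one character.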

As shown in Fig. 3, the small-file update step comprises:

C1. Obtain the small file to be updated;

C2. Search the suffix array index and obtain the fileID of the corresponding index record from information such as the small file name;

C3. Find the corresponding index information in all fields according to fileID and change the delete marker to 1;

C4. Upload the updated small file;

C5. Determine whether the reorganization condition for HDFS merged files is met;

C6. If it is met, perform the reorganization.
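
Step C3, the logical deletion, can be sketched as follows. The layout is assumed for illustration: the index is a dictionary mapping each field name to a list of FileInfo-like dictionaries with `fileID` and `delete` keys, and `mark_deleted` is a hypothetical helper name.

```python
def mark_deleted(field_infos, file_id):
    """Set delete to 1 for file_id in every field; return the number changed."""
    changed = 0
    for infos in field_infos.values():
        for info in infos:
            if info["fileID"] == file_id and info["delete"] == 0:
                info["delete"] = 1  # logical deletion only; data stays on HDFS
                changed += 1
    return changed
```

Only the markers change at this point; the old bytes remain in the merged file until a later reorganization, which is what spares the immediate physical delete and rebuild.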

As shown in Fig. 4, the query system to which the method of the invention is applied comprises:

File acquisition module 1, for obtaining the massive number of files to be uploaded; files of any format are supported;

File size judgment module 2, for determining whether each file to be uploaded is a small file;

Merging module 3, for merging multiple small files;

File upload module 4, for uploading merged files or files that are not small files;

Index module 5, for creating a suffix array index for each file;

Query module 6, for providing multiple query types for querying massive numbers of small files, such as exact query and fuzzy query.

Query file acquisition module 7, for obtaining the queried small files from the distributed file system according to the suffix array index records.

File update module 8, for updating small files.

Merged file reorganization module 9, for reorganizing the merged files in the distributed file system after small files are updated: old versions of small files are deleted, and new merged files are generated and stored in the distributed file system.

The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;

Obviously, the above embodiment is merely an example given to clearly illustrate the invention and does not limit its embodiments. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A method for querying massive numbers of small files using a suffix array index, characterized by comprising:

Small-file storing step:

The client submits a file upload request;

The size of each file is obtained and evaluated. If a file is judged not to be a small file, a suffix array index is established for it and it is uploaded to the distributed file system; if it is judged to be a small file, it is put into a merge queue for merging, a suffix array index is established for each small file, and the merged file is uploaded to the distributed file system.

Small-file query step:

Obtain and parse the query request;

Determine the query type;

Search the specified field in the suffix array index according to the query condition to obtain the matching suffix array index records;

Obtain the location of each small file in the distributed file system from its suffix array index record, and retrieve the corresponding small file from the distributed file system.

2. The method for querying massive numbers of small files according to claim 1, characterized in that the judgment in the storing step proceeds as follows: the size of the default storage unit of the distributed file system is defined as threshold b, and threshold a is defined as a value less than threshold b; a file smaller than threshold a is a small file, and a file greater than or equal to threshold a is not a small file.

3. The method for querying massive numbers of small files according to claim 1, characterized in that the suffix array index in the storing step comprises five fields: the small file's name, the small file's size, the name of the file in the distributed file system in which the small file is stored, the small file's offset within that stored file, and the creation time; each field comprises the metadata that records the file's content for that field, a suffix array, and a field information structure;

The field information structure comprises the number of files stored in the field, the metadata size of the field, and a file information structure FileInfo for each file recorded in the field;

FileInfo comprises an index deletion marker, the size of the metadata holding the file's attribute content in this field, the offset of the first byte of the file's attribute metadata within this field's metadata, and the file ID.

4. The method for querying massive numbers of small files according to claim 1, characterized in that the storing step further comprises: when the total size of the files in the merge queue reaches threshold b, merging the files in binary form, establishing a suffix array index for each file, and then uploading the merged file; after the upload completes, the merge queue is emptied and reclaimed.

5. The method for querying massive numbers of small files according to claim 1, characterized in that the query types in the query step include exact query and fuzzy query.

6. The method for querying massive numbers of small files according to claim 1, characterized in that searching the specified field in the query step specifically comprises: querying the metadata and the suffix array to find the offset of each match in the metadata of the suffix array index record, and finding the corresponding file ID in FileInfo according to that offset.

7. The method for querying massive numbers of small files according to claim 6, characterized by further comprising:

Small-file update step:

Obtain the small file to be updated;

Upload the updated small file;

Physically reorganize the merged files in the distributed file system that contain old versions of small files and satisfy the reorganization condition. The reorganization condition is defined as follows: the total size of the small files in a merged file that have not been updated is that merged file's effectively used space; a threshold is set for the effectively used space of each merged file in the distributed file system; the reorganization condition is satisfied when the number of merged files whose effectively used space is below the threshold reaches a specified number.

8. The method for querying massive numbers of small files according to claim 7, characterized in that the effectively used space in the update step is calculated as follows: query the suffix array index records of small files whose deletion marker is 0, obtain the merged file in which each is stored, and calculate the effectively used space of each merged file in the distributed file system; determine whether it reaches the threshold, and if it does not, count the merged file as one whose effectively used space is below the threshold.

9. A system for querying massive numbers of small files using a suffix array index, characterized by comprising:

File acquisition module, for obtaining the massive number of files to be uploaded; files of any format are supported;

Merging module, for merging multiple small files;

Query file acquisition module, for obtaining the queried small files from the distributed file system according to the suffix array index records.

File update module, for updating small files.

Merged file reorganization module, for reorganizing the merged files in the distributed file system after small files are updated: old versions of small files are deleted, and new merged files are generated and stored in the distributed file system.

10. The system for querying massive numbers of small files using a suffix array index according to claim 9, characterized in that the query types provided by the query module include exact query and fuzzy query.