CN112199333A - Storage method and device supporting multi-value index file

Info

Publication number: CN112199333A
Application number: CN202011014922.XA
Authority: CN
Inventors: 牛晨光; 王梦来; 李竞
Original assignee: Wuhan Greenet Information Service Co Ltd
Current assignee: Wuhan Greenet Information Service Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2021-01-08
Anticipated expiration: 2040-09-24
Also published as: CN112199333B

Abstract

The invention relates to the technical field of data storage and search, and provides a storage method and a storage device supporting a multi-value index file. The method comprises: calculating the number value j of the hash bucket through a preset hash algorithm; when hash bucket j conflicts, starting matching from the first data block in hash bucket j and matching all ArrayCount data blocks recorded by the data management structure of hash bucket j; if a consistent data block is found during matching, directly using the record item of that data block and ending the matching process; and if no data block matches, allocating a new data block and entering a data block allocation flow. The invention provides a method of borrowing from the hash bucket adjacent to a given hash bucket, and solves the prior-art problem that data blocks allocated in batches form scattered, discontiguous memory space, which wastes computing resources during access.

Description

Storage method and device supporting multi-value index file

[ technical field ]

The invention relates to the technical field of data storage and search, in particular to a storage method and a storage device supporting a multi-value index file.

[ background of the invention ]

To export and view the raw control-plane and service-plane data of a single user in the network through an expert subsystem of a telecom operator's OSS system, a system such as DPI must be built that supports storing and querying raw user signaling data by user number.

At present, a DPI system built at the provincial level carries more than 1000 thousand users, and raw signaling packets are generated in real time at up to 6,000,000 pps. An indexing scheme is therefore needed that is more efficient, and more economical in storage and hardware, than common distributed storage solutions such as Hadoop.

Currently, the most common index algorithms are hash and tree, and hash has the most stable performance. A good hash algorithm distributes keys well, but hash collisions can never be completely avoided, so any system that uses hashing as a fast indexing algorithm must resolve hash collisions.

Given memory overhead and index performance, an online system running 7x24 can tolerate only a bounded number of conflicts; that is, conflicts cannot be resolved without limit. The number of collisions handled under one hash value is usually limited to N, and collisions beyond this value are discarded without being stored, as shown in fig. 1.

Another conflict resolution method is also common in in-memory index implementations: a pool of M data blocks for conflict resolution is allocated in advance; conflicts are resolved as long as the total number of conflicts in the whole index system does not exceed M, and are discarded beyond that. This scheme improves on resolving only N collisions under each hash, and increases the utilization of the data blocks reserved for collision resolution, as shown in fig. 2.

However, this scheme is only suitable for in-memory indexing. The data blocks allocated in batches end up scattered across discontiguous memory space; for an index file they are scattered across discontiguous file space, so the file is read and written randomly, which greatly degrades performance.

In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.

[ summary of the invention ]

The technical problem to be solved by the invention is that the current scheme is only suitable for in-memory indexing: the data blocks allocated in batches are scattered across discontiguous memory space, and for an index file they are scattered across discontiguous file space, so the file is read and written randomly, which greatly degrades performance.

The invention adopts the following technical scheme:

In a first aspect, the invention provides a storage method supporting a multi-value index file. A collision array occupying contiguous global memory is allocated, and N data blocks for storing conflicting data are assigned to each hash bucket; a maximum number X of data blocks that may be accessed is set for each hash bucket, where X satisfies X ≥ N and X is always counted from the first data block of the corresponding hash bucket; the data block identifier Index within the own range of the i-th hash bucket satisfies (N × i) ≤ Index < (N × (i + 1)); a field ArrayCount is set in the data management structure of each hash bucket to record the number of data blocks actually used in that hash bucket. When a record of the keyword K is to be stored, the method comprises the following steps:

calculating the serial number value of the hash bucket to be j through a preset hash algorithm;

when the hash bucket j conflicts, matching is started from the first data block in the hash bucket j, and matching of ArrayCount data blocks recorded by a data management structure in the hash bucket j is completed;

if a consistent data block is found during matching, the record item of that data block is used directly and the matching process ends; if not, a new data block must be allocated, and the data block allocation flow is entered:

in the data block allocation flow, a free data block is searched for in the range from ArrayCount to X; if N × j + ArrayCount > N × (j + 1), a block is borrowed from the next adjacent hash bucket j + 1 of hash bucket j for storage, and the ArrayCount value in the data management structure of hash bucket j + 1 is updated synchronously.

Preferably, the data stored in the data blocks establishes a data index chain in a reverse index manner, wherein each data block comprises an address pointer pPrev pointing to the previously stored conflicting data and a Value storing the content of its own conflicting data.

Preferably, in a hash bucket, an address pointer of a data block at the end of a data index chain is stored in a data management structure of the hash bucket, and, each time collision data is newly added to a corresponding hash bucket, pPrev in the data block for carrying the newly added collision data is assigned as the address pointer stored in the data management structure, and Value in the data block for carrying the newly added collision data is assigned as the content of the collision data; and updating the pointer stored in the data management structure to be the address pointer of the data block for bearing the newly increased conflict data.

Preferably, the collision rate and the memory occupation can be guaranteed to be optimal when the total number H of the hash buckets is 2-5 times of the total number of the stored keywords.

Preferably, the value of N is generally associated with the maximum number of collisions C, and N satisfies the following condition: N = MAX(C/5, 2).

Preferably, the value of X is set to N × 2 to N × 3.

Preferably, when a record of the keyword K is to be stored, if the ArrayCount value of the j-th hash bucket is found to equal X, and matching against every piece of collision data recorded in the j-th hash bucket yields no consistent result, the keyword K is discarded directly.

Preferably, the conflict array is also recorded in a designated index file, and the designated index files are divided among file handles by time interval, specifically:

index files containing different conflict arrays are loaded into memory when the current time matches the time interval associated with their file handle;

and when the current time reaches the far end of the time interval associated with the index file currently loaded in memory, the file handle matching the current time is looked up, and the index file corresponding to that file handle is loaded.

In a second aspect, the present invention further provides a storage apparatus supporting a multi-valued index file, for implementing the storage method supporting the multi-valued index file in the first aspect, where the apparatus includes:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the processor for performing the storing method supporting a multi-valued index file of the first aspect.

In a third aspect, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, which are executed by one or more processors, for implementing the storage method for supporting a multi-valued index file according to the first aspect.

The invention provides a method of borrowing from the hash bucket adjacent to a given hash bucket. It solves two problems of the prior art: data blocks allocated in batches form scattered, discontiguous memory space, which wastes computing resources during access; and memory space for many hash buckets is allocated at one time, so the allocated space may not match actual demand, which wastes storage resources.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 is a schematic diagram illustrating a memory overhead and index performance presentation architecture according to an embodiment of the present invention;

Fig. 2 is a schematic diagram illustrating an effect of resolving a usage rate of a data block reserved by a conflict according to an embodiment of the present invention;

Fig. 3 is a flowchart illustrating a storage method supporting a multi-valued index file according to an embodiment of the present invention;

Fig. 4 is a block diagram of an inverted index scheme according to an embodiment of the present invention;

Fig. 5 is a schematic diagram illustrating an effect of a hash bucket data structure according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a storage device supporting a multi-valued index file according to an embodiment of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1:

Embodiment 1 of the invention provides a storage method supporting a multi-value index file. A collision array occupying contiguous global memory is allocated, and N data blocks for storing conflicting data are assigned to each hash bucket; a maximum number X of data blocks that may be accessed is set for each hash bucket, where X satisfies X ≥ N and X is always counted from the first data block of the corresponding hash bucket; the data block identifier Index within the own range of the i-th hash bucket satisfies (N × i) ≤ Index < (N × (i + 1)); a field ArrayCount is set in the data management structure of each hash bucket to record the number of data blocks actually used in that hash bucket, where this number covers both the data blocks the bucket uses in its own range and the data blocks it occupies by borrowing. To store a record of the keyword K, as shown in fig. 3, the method includes:

In step 201, the number value j of the hash bucket is calculated by a preset hash algorithm.

In step 202, when a hash bucket j conflicts, matching is started from the first data block in the hash bucket j, and matching of ArrayCount data blocks recorded by the data management structure in the hash bucket j is completed.

In step 203, if a consistent data block is found during matching, the record item of that data block is used directly and the matching process ends; if not, a new data block must be allocated, and the data block allocation flow is entered:

In step 204, in the data block allocation flow, a free data block is searched for in the range from ArrayCount to X; if N × j + ArrayCount > N × (j + 1), a block is borrowed from the next adjacent hash bucket j + 1 of hash bucket j for storage, and the ArrayCount value in the data management structure of hash bucket j + 1 is updated synchronously.
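The flow of steps 201 to 204 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name CollisionIndex, the use of key % H as the "preset hash algorithm", storing the bare key in each block, and treating ArrayCount as the number of leading blocks to scan are all hypothetical choices. The borrowing test is taken to mean that the chosen block's global index falls outside the bucket's own range [N × j, N × (j + 1)).

```python
from typing import List, Optional


class CollisionIndex:
    """Toy model of a contiguous collision array with neighbour borrowing."""

    def __init__(self, buckets: int, n: int, x: int) -> None:
        assert x >= n, "X must be at least N"
        self.h, self.n, self.x = buckets, n, x
        # One contiguous array; bucket i owns indices [n*i, n*(i+1)).
        self.blocks: List[Optional[int]] = [None] * (buckets * n)
        # ArrayCount per bucket: how many leading blocks must be scanned.
        self.array_count = [0] * buckets

    def store(self, key: int) -> Optional[int]:
        """Return the global block index holding `key`, or None if discarded."""
        j = key % self.h                  # assumed stand-in for the preset hash
        base = self.n * j
        # Steps 202/203: match against the ArrayCount blocks already recorded.
        for off in range(self.array_count[j]):
            if self.blocks[base + off] == key:
                return base + off
        # Step 204: look for a free block at offsets [ArrayCount, X).
        for off in range(self.array_count[j], self.x):
            g = base + off
            if g >= len(self.blocks):
                break                     # the last bucket has no neighbour
            if self.blocks[g] is None:
                self.blocks[g] = key
                self.array_count[j] = off + 1
                owner = g // self.n
                if owner != j:            # block borrowed from a neighbour
                    used = g - self.n * owner + 1
                    self.array_count[owner] = max(self.array_count[owner], used)
                return g
        return None                       # no free block within X: discard
```

With four buckets, N = 2 and X = 4, three distinct keys that hash to bucket 1 occupy global indices 2, 3 and 4; the last block is borrowed from bucket 2, whose ArrayCount becomes 1 so that its own allocations start after the borrowed block.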

In this embodiment, conflicting data in the data blocks of a hash bucket is preferably stored in a reverse index manner. As shown in fig. 4, the data stored in the data blocks forms a data index chain in reverse index order, where each data block includes an address pointer pPrev pointing to the previously stored conflicting data and a Value storing the content of its own conflicting data.

To avoid the random read/write problem caused by a forward index, the embodiment of the invention proposes a reverse file index. The core idea of the reverse index is that the last node does not point to the newly inserted node; instead, the newly inserted node points to the last node. As shown in fig. 4, pTail stores the file offset of the last node, and the following operations are performed when the node pNode4 needs to be inserted:

1) The pPrev field of pNode4 is set to the value of pTail.

2) The pNode4 node is written to the tail of the index file. (It can be written in a batch with other pending nodes through a buffer mechanism to achieve the best I/O performance.)

3) The file offset at which pNode4 is located is assigned to the pTail field.

With this method, historical data in the index file never needs to be modified; it is only necessary to set pPrev of pNode4 correctly before pNode4 is written, and then to update the pTail value.
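The three insertion operations can be sketched as an append-only chain. The following Python sketch is illustrative only and rests on assumptions: the node layout (two little-endian 64-bit integers for pPrev and Value), the sentinel NIL and the function names append_node and read_chain are hypothetical, and an in-memory io.BytesIO stands in for the index file.

```python
import io
import struct

NODE = struct.Struct("<qq")  # (pPrev file offset, Value); layout is an assumption
NIL = -1                     # hypothetical sentinel meaning "no previous node"


def append_node(f, p_tail: int, value: int) -> int:
    """Append a node whose pPrev is the current pTail; return the new pTail."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(NODE.pack(p_tail, value))  # bytes already in the file are untouched
    return offset


def read_chain(f, p_tail: int) -> list:
    """Walk from pTail back through pPrev, returning values newest first."""
    values = []
    while p_tail != NIL:
        f.seek(p_tail)
        p_tail, value = NODE.unpack(f.read(NODE.size))
        values.append(value)
    return values
```

Because every write lands at the end of the file, insertion is purely sequential; only the pTail value held outside the chain changes, which is why it has to be persisted separately.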

Every linked list in the index file has its own pTail value, and pTail is the access entry of the linked list, so the pTail values must be stored. They can be synchronized periodically to an additional Entry area or file; because the pTail values of many linked lists are written to the Entry file in batches by sequential overwriting, optimal I/O performance is still achieved. The hash bucket data structure formed by combining the method of this embodiment with the reverse file index is shown in fig. 5.

In a hash bucket, the address pointer of the data block at the tail of the data index chain is stored in the data management structure of the hash bucket. Each time conflicting data is added to the hash bucket, pPrev in the data block carrying the new conflicting data is assigned the address pointer stored in the data management structure, and Value in that data block is assigned the content of the conflicting data; the pointer stored in the data management structure is then updated to the address pointer of the data block carrying the new conflicting data.

When the total number H of hash buckets is 2 to 5 times the total number of stored keywords, the collision rate and the memory occupation can be guaranteed to be optimal. Note that fig. 5 shows at least 4 linked list spaces (i.e. the total number N of data blocks) in one hash bucket only to make the linked list relationship easy to present; in practical application scenarios of the present invention, the value of N is usually related to the maximum number of collisions C and satisfies N = MAX(C/5, 2), i.e. the larger of C/5 and 2. In the embodiment of the present invention, the value of X is usually set to N × 2 to N × 3.
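As a worked illustration of these sizing rules, the short Python helper below derives N, the range of X and the range of H from a maximum collision count C and a keyword count. The function name size_parameters is hypothetical, and rounding C/5 up to an integer is an assumption, since the text does not say how a fractional C/5 is handled.

```python
import math


def size_parameters(max_collisions: int, total_keys: int) -> dict:
    """Apply N = MAX(C/5, 2), X in [2N, 3N] and H in [2, 5] x keys."""
    n = max(math.ceil(max_collisions / 5), 2)  # ceiling is an assumption
    return {
        "N": n,
        "X_range": (2 * n, 3 * n),
        "H_range": (2 * total_keys, 5 * total_keys),
    }
```

For example, with C = 20 and 1000 keywords this gives N = 4, X between 8 and 12, and H between 2000 and 5000 buckets; with C = 3 the lower bound applies and N = 2.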

When a record of the keyword K is to be stored, if the ArrayCount value of the j-th hash bucket is found to equal X, and matching against every piece of conflicting data recorded in the j-th hash bucket yields no consistent result, the keyword K is discarded directly.

The embodiments of the present invention also have a preferred implementation that can further improve data indexing efficiency. It generally applies where the analysis of conflicting data has a clear time limit. The conflict array is also recorded in a designated index file, and the designated index files are divided among file handles by time interval, specifically:

Index files are stored in physical partitions by time, so the system needs to manage the index file handles of different partitions.

The system of the invention places a requirement on the time order of the raw data to be indexed: out-of-order packets whose time deviation exceeds "time partition size / 2" are not allowed, and such data is discarded if it occurs.

On this premise, the system only needs to keep 2 index file handles. The specific execution logic is:

1) Assume that the time partition used by the system spans 1 hour.

2) When the current time is 1:00:00, the system maintains the file handle for 0:00 and the file handle for 1:00.

3) At this time, raw data with timestamps in the range [0:00:00, 2:00:00) may be indexed and stored.

4) When the current time is 1:30:00, the file handle for 0:00 is closed and the file handle for 2:00 is created. The system now maintains the file handles for 1:00 and 2:00.

5) At this time, raw data with timestamps in the range [1:00:00, 3:00:00) may be indexed and stored.

6) And so on. The file handle validity sample is as follows:

Under this logic, the system only needs to keep the handles of 2 time partitions open, which reduces system resource usage while still providing a high tolerance for out-of-order timestamps. Managing the 2 handles with two threads or processes also allows index file initialization to proceed concurrently when the handle for the next time partition is created. (File initialization is generally time-consuming; if initialization of the current partition's index file were triggered only after data for that partition arrived, a large instantaneous backlog and loss of index data could result.)
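The two-handle window can be sketched as follows. This Python sketch is an illustration under assumptions: times are plain seconds counted from an arbitrary origin, partitions are identified by integer number rather than by real file handles, the function names open_partitions and accepts are hypothetical, and the switch at half a partition span is inferred from the 1:00:00 and 1:30:00 examples in the steps above.

```python
def open_partitions(now_s: int, span_s: int = 3600) -> tuple:
    """Return the numbers of the two partitions whose handles are kept open."""
    first = (now_s - span_s // 2) // span_s  # window slides at the half-span mark
    return first, first + 1


def accepts(now_s: int, ts_s: int, span_s: int = 3600) -> bool:
    """True if a record stamped ts_s may be indexed at current time now_s."""
    first, last = open_partitions(now_s, span_s)
    return first * span_s <= ts_s < (last + 1) * span_s
```

At 1:00:00 (3600 s) partitions 0 and 1 are open and timestamps in [0:00:00, 2:00:00) are accepted; at 1:30:00 (5400 s) the window moves to partitions 1 and 2, so a record stamped 0:59:59 is rejected.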

Example 2:

fig. 6 is a schematic diagram of an architecture of a storage device supporting a multi-value index file according to an embodiment of the present invention. The storage device supporting the multi-value index file of the present embodiment includes one or more processors 21 and a memory 22. In fig. 6, one processor 21 is taken as an example.

The processor 21 and the memory 22 may be connected by a bus or other means, such as the bus connection in fig. 6.

The memory 22, which is a nonvolatile computer-readable storage medium, may be used to store a nonvolatile software program and a nonvolatile computer-executable program, such as the storage method supporting the multi-value index file in embodiment 1. The processor 21 executes the storage method supporting the multi-value index file by executing the nonvolatile software program and instructions stored in the memory 22.

The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the storage method supporting the multi-value index file in the above-described embodiment 1, for example, perform the respective steps shown in fig. 3 described above.

It should be noted that, because the information interaction and execution processes between the modules and units of the apparatus and system are based on the same concept as the method embodiment of the present invention, the details can be found in the description of the method embodiment and are not repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A storage method supporting multi-valued index files is characterized in that a conflict array with continuous global memory is applied, and N data blocks for storing data conflicts are distributed to each hash bucket; setting the maximum number of data blocks allowed to be accessed as X for each hash bucket; wherein, the value of X satisfies: X ≥ N, and the statistics of X all start with the initial data block of the corresponding hash bucket; the data block identification Index in the corresponding use range corresponding to the ith hash bucket satisfies the following conditions: (N × i) ≤ Index < (N × (i + 1)); setting a field ArrayCount in the data management structure corresponding to each hash bucket, wherein the field ArrayCount is used for recording the number of data blocks actually used in the hash bucket; when a record of the keyword K is to be stored, the method comprises the following steps:

2. The storage method supporting the multi-Value index file according to claim 1, wherein the data stored in the data blocks establishes a data index chain in a reverse index manner, wherein each data block comprises an address pointer pPrev pointing to the previously stored conflict data and a Value storing the content of the conflict data of the data block.

3. The storage method supporting the multi-Value index file according to claim 2, wherein an address pointer of a data block at the end of a data index chain is stored in a hash bucket, and, each time collision data is newly added to the corresponding hash bucket, pPrev in the data block for carrying newly added collision data is assigned as the address pointer stored in the data management structure, and Value in the data block for carrying newly added collision data is assigned as the collision data content; and updating the pointer stored in the data management structure to be the address pointer of the data block for bearing the newly increased conflict data.

4. The storage method supporting the multi-valued index file according to any of claims 1 to 3, wherein the collision rate and the memory occupation can be guaranteed to be optimal when the total number H of the hash buckets is 2 to 5 times of the total number of the storage keywords.

5. The storage method supporting the multi-value index file according to any one of claims 1 to 3, wherein the value of N is generally related to the maximum conflict number C, and N satisfies the following condition: N = MAX(C/5, 2).

6. The storage method supporting the multi-value index file according to any one of claims 1 to 3, wherein the value of X is set to N × 2 to N × 3.

7. The method according to claim 1, wherein when a record of a keyword K is to be stored, if it is determined that the size of the ArrayCount value in the jth hash bucket is equal to X, after the matching of each content of conflicting data recorded in the jth hash bucket is completed, if no matching result is obtained, the corresponding keyword K is directly discarded.

8. The storage method supporting the multi-value index file according to claim 1, wherein the conflict array is further recorded in a designated index document, wherein the designated index document performs file handle division according to time intervals, specifically:

9. A storage apparatus supporting a multi-valued index file, the apparatus comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the method of supporting multi-valued index file storage of any of claims 1-8.