CN112199333B

CN112199333B - Storage method and device supporting multi-valued index file

Info

Publication number: CN112199333B
Application number: CN202011014922.XA
Authority: CN
Inventors: 牛晨光; 王梦来; 李竞
Original assignee: Wuhan Greenet Information Service Co Ltd
Current assignee: Wuhan Greenet Information Service Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2022-11-22
Anticipated expiration: 2040-09-24
Also published as: CN112199333A

Abstract

The invention relates to the technical field of data storage and search, and provides a storage method and a storage device supporting a multi-valued index file. The method comprises: calculating the number j of the hash bucket through a preset hash algorithm; when a collision occurs in hash bucket j, matching from the first data block in hash bucket j until the ArrayCount data blocks recorded by the data management structure of hash bucket j have all been matched; if a consistent data block is found during matching, directly using the record item of that data block and ending the matching process; if not, a new data block must be applied for, and a data block allocation flow is entered. The invention provides a method of borrowing from the adjacent hash bucket, and solves the prior-art problem that data blocks applied for in batches are scattered across non-contiguous memory space, which wastes computing resources during access.

Description

Storage method and device supporting multi-valued index file

[ technical field ]

The invention relates to the technical field of data storage and search, in particular to a storage method and a storage device supporting a multi-value index file.

[ background ]

In order to export and view the raw control-plane and service-plane data of a single user in the network through an expert subsystem of a telecom operator's OSS system, a system such as DPI must be built to support storing and querying raw user signaling data by user number.

At present, a DPI system built on a per-province basis carries more than 1000 thousand users, and raw signaling packets are generated in real time at rates of up to 6,000,000 pps. An indexing scheme that is more efficient, and that saves more storage and hardware, than common distributed storage solutions such as Hadoop is therefore needed.

Currently, the most common indexing algorithms are hash-based and tree-based, and hashing offers the most stable performance. A good hash algorithm distributes keys well, but hash collisions can never be completely avoided, so any system that uses hashing as a fast indexing algorithm needs to resolve hash collisions.

Considering memory overhead and index performance, a 7x24 online system can tolerate only a limited number of collisions; that is, collisions cannot be resolved without limit. The number of collisions resolved under one hash value is usually limited to N, and collisions beyond this value are discarded directly without being stored, as shown in fig. 1.

Another collision resolution method is also common in in-memory index implementations: a pool of M data blocks for collision resolution is applied for in advance; collisions are resolved as long as the total number of collisions in the whole index system does not exceed M, and are discarded once the total exceeds that value. Compared with resolving only N collisions under each hash value, this scheme increases the utilization of the data blocks reserved for collision resolution, as shown in fig. 2.

However, this scheme is only suitable for in-memory indexes: data blocks applied for in batches are scattered across non-contiguous memory space, and for an index file they are scattered across non-contiguous file space, so the file is read and written randomly, which greatly degrades performance.

In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.

[ summary of the invention ]

The technical problem to be solved by the invention is that the current scheme is only suitable for in-memory indexes: data blocks applied for in batches are scattered across non-contiguous memory space, and for an index file they are scattered across non-contiguous file space, so the file is read and written randomly, which greatly degrades performance.

The invention adopts the following technical scheme:

In a first aspect, the invention provides a storage method supporting a multi-valued index file. A collision array that is contiguous in global memory is applied for, and N data blocks for storing colliding data are allocated to each hash bucket; the maximum number of data blocks that each hash bucket is allowed to access is set to X, where X satisfies X >= N and X is always counted from the initial data block of the corresponding hash bucket; the data block identifier Index within the range belonging to the i-th hash bucket satisfies (N × i) <= Index < (N × (i + 1)); a field ArrayCount is set in the data management structure corresponding to each hash bucket to record the number of data blocks actually used in the hash bucket. When a record of the keyword K is to be stored, the method comprises the following steps:

calculating the serial number value of the hash bucket as j through a preset hash algorithm;

when the hash bucket j conflicts, matching is started from the first data block in the hash bucket j, and matching of ArrayCount data blocks recorded by a data management structure in the hash bucket j is completed;

if a consistent data block is found in the matching process, directly using the record item of the data block, and ending the matching process; if not, a new data block is required to be applied, and then a data block allocation flow is entered:

in the data block allocation flow, searching for an idle data block from ArrayCount to X; and if N + j + ArrayCount > N, borrowing from the next adjacent hash bucket j+1 of hash bucket j for storage, and synchronously updating the ArrayCount value in the data management structure of hash bucket j+1.
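The index ranges defined above can be illustrated with a minimal sketch. The function names and the clipping at the end of the array (for the last buckets) are assumptions made for illustration and do not appear in the source.

```python
def bucket_block_range(i: int, n: int) -> range:
    """Global block indices owned by hash bucket i: N*i <= Index < N*(i+1)."""
    return range(n * i, n * (i + 1))


def accessible_block_range(i: int, n: int, x: int, total_blocks: int) -> range:
    """Blocks bucket i may access: X blocks counted from its first block.

    Because X >= N, the range can extend past the bucket's own N blocks.
    Clipping at the end of the array is an assumption of this sketch.
    """
    if x < n:
        raise ValueError("X must satisfy X >= N")
    start = n * i
    return range(start, min(start + x, total_blocks))
```

With N = 4 and X = 8, bucket 2 owns blocks 8 to 11 and may access blocks 8 to 15, so its last four accessible blocks lie in the range owned by bucket 3.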

Preferably, the data stored in the data blocks forms a data index chain in a reverse index manner, wherein each data block includes an address pointer pPrev pointing to the previously stored conflicting data, and a Value storing the content of its own conflicting data.

Preferably, in a hash bucket, the address pointer of the data block at the tail of the data index chain is stored in the data management structure of the hash bucket. Each time collision data is newly added to the corresponding hash bucket, pPrev in the data block that carries the new collision data is assigned the address pointer stored in the data management structure, and Value in that data block is assigned the content of the collision data; the pointer stored in the data management structure is then updated to the address pointer of the data block that carries the new collision data.

Preferably, the collision rate and the memory occupation can be guaranteed to be optimal when the total number H of the hash buckets is 2-5 times of the total number of the stored keywords.

Preferably, the value of N is generally associated with a maximum number of collisions C, N satisfying the following condition: N = MAX(C/5, 2).

Preferably, the value of X is set to N × 2 to N × 3.

Preferably, when a record of the keyword K is to be stored, if it is determined that the size of the ArrayCount value in the jth hash bucket is equal to X, after the content of each piece of collision data recorded in the jth hash bucket is matched, if a consistent result is not matched, the corresponding keyword K is directly discarded.

Preferably, the conflict array is further recorded in an assigned index document, where the assigned index document performs file handle division according to a time interval, specifically:

index documents containing different conflict arrays are loaded into the memory when the current time is matched with the time interval associated in the file handle;

and if the current time reaches the other end of the association time of the index document loaded in the memory at present, searching the file handle matched with the current time, and then loading the index document corresponding to the corresponding file handle.

In a second aspect, the present invention further provides a storage apparatus supporting a multi-valued index file, for implementing the storage method supporting the multi-valued index file in the first aspect, where the apparatus includes:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the processor to perform the method for supporting multi-valued index file storage of the first aspect.

In a third aspect, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions are executed by one or more processors, and are used for completing the storage method supporting the multi-value index file according to the first aspect.

The invention provides a method of borrowing from the adjacent hash bucket. It solves the prior-art problem that data blocks applied for in batches are scattered across non-contiguous memory space, which wastes computing resources during access, as well as the problem that, when memory space for many hash buckets is applied for at one time, the applied space may not match actual demand, which wastes storage resources.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 is a schematic diagram illustrating a memory overhead and index performance presentation architecture according to an embodiment of the present invention;

Fig. 2 is a schematic diagram illustrating an effect of resolving a utilization rate of a data block reserved by a conflict according to an embodiment of the present invention;

Fig. 3 is a flowchart illustrating a storage method supporting a multi-valued index file according to an embodiment of the present invention;

Fig. 4 is a block diagram of an inverted index scheme according to an embodiment of the present invention;

Fig. 5 is a schematic diagram illustrating an effect of a hash bucket data structure according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a storage device supporting a multi-valued index file according to an embodiment of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings, and are for convenience in describing the present invention only and do not require that the present invention be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present invention.

In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Example 1:

Embodiment 1 of the invention provides a storage method supporting a multi-valued index file. A collision array that is contiguous in global memory is applied for, and N data blocks for storing colliding data are allocated to each hash bucket; the maximum number of data blocks that each hash bucket is allowed to access is set to X, where X satisfies X >= N and X is always counted from the initial data block of the corresponding hash bucket; the data block identifier Index within the range belonging to the i-th hash bucket satisfies (N × i) <= Index < (N × (i + 1)); a field ArrayCount is set in the data management structure corresponding to each hash bucket to record the number of data blocks actually used in the hash bucket, where the number recorded by ArrayCount includes both the data blocks used by the hash bucket itself and the data blocks occupied through borrowing. When a record of the keyword K is to be stored, as shown in fig. 3, the method includes:

In step 201, the number j of the hash bucket is calculated by a preset hash algorithm.

In step 202, when a hash bucket j conflicts, matching is started from the first data block in the hash bucket j, and matching of ArrayCount data blocks recorded by the data management structure in the hash bucket j is completed.

In step 203, if a consistent data block is found in the matching process, the record item of the data block is directly used, and the matching process is ended; if not, a new data block needs to be applied, and then a data block allocation flow is entered:

In step 204, in the data block allocation flow, an idle data block is searched for from ArrayCount to X; and if N + j + ArrayCount > N, the next adjacent hash bucket j+1 of hash bucket j is borrowed for storage, and the ArrayCount value in the data management structure of hash bucket j+1 is synchronously updated.
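Steps 201 to 204 can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the patented implementation: the class and attribute names are hypothetical, a block is treated as idle when it holds `None`, and borrowing is assumed to begin once the candidate position falls outside the bucket's own N blocks (the borrowing condition as printed in the text, "N + j + ArrayCount > N", appears to be garbled, so that reading is an interpretation). A key that finds no idle block within X positions is discarded.

```python
from typing import Callable, Hashable, List, Optional


class CollisionArray:
    """Contiguous collision array: H buckets, N blocks each, X accessible."""

    def __init__(self, h: int, n: int, x: int,
                 hash_fn: Callable[[Hashable], int] = hash) -> None:
        if x < n:
            raise ValueError("X must satisfy X >= N")
        self.h, self.n, self.x = h, n, x
        self.hash_fn = hash_fn
        self.blocks: List[Optional[Hashable]] = [None] * (h * n)
        self.array_count = [0] * h  # ArrayCount per bucket

    def store(self, key: Hashable) -> Optional[int]:
        """Return the block index used for key, or None if it is discarded."""
        j = self.hash_fn(key) % self.h          # step 201: bucket number j
        base = self.n * j
        # Steps 202-203: match the ArrayCount blocks from the first block.
        for p in range(self.array_count[j]):
            if self.blocks[base + p] == key:
                return base + p
        # Step 204: look for an idle block from ArrayCount up to X.
        for p in range(self.array_count[j], self.x):
            idx = base + p
            if idx >= len(self.blocks):
                break                            # end of the array (assumption)
            if self.blocks[idx] is None:
                self.blocks[idx] = key
                self.array_count[j] = p + 1
                owner = idx // self.n
                if owner != j:
                    # Borrowed block: keep the owning bucket's ArrayCount in sync.
                    used = idx - owner * self.n + 1
                    self.array_count[owner] = max(self.array_count[owner], used)
                return idx
        return None                              # no idle block within X
```

For example, with H = 4, N = 2, X = 4 and an identity hash, the keys 0, 4 and 8 all map to bucket 0; the third key lands in block 2, the first block of bucket 1, whose ArrayCount becomes 1 so that bucket 1 later skips that block when allocating.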

In combination with the embodiment of the present invention, a storage manner of conflicting data in the data block in the hash bucket is preferably performed in a reverse index manner, as shown in fig. 4, the data stored in the data block establishes a data index chain in the reverse index manner, where each data block includes an address pointer pPrev pointing to previously stored conflicting data and a Value storing content of conflicting data of its own.

In order to avoid the random read/write problem caused by a forward index, the embodiment of the invention provides a reverse file index. The core idea of the reverse index is that, instead of the last node pointing to the newly inserted node, the newly inserted node points to the last node. As shown in fig. 4, where pTail stores the file offset of the last node, the following operations are performed when a pNode4 node needs to be inserted:

1) Set the pPrev field of pNode4 to the value of pTail.

2) Write the pNode4 node to the tail of the index file. (A buffer mechanism can batch this write with other nodes waiting to be written, to achieve optimal I/O performance.)

3) Assign the file offset at which pNode4 is located to the pTail field.

With this method, the historical data in the current index file never needs to be modified; it is only necessary to set pPrev of pNode4 correctly before pNode4 is written, and to update the pTail value afterwards.
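The three insertion operations can be sketched against an append-only file. The fixed 16-byte node layout (an 8-byte pPrev offset followed by an 8-byte integer Value), the `NO_PREV` sentinel and the function names are assumptions made for illustration; the source does not specify an on-disk format.

```python
import io
import struct

NODE = struct.Struct("<qq")  # (pPrev file offset, Value): hypothetical layout
NO_PREV = -1                 # hypothetical sentinel marking the chain start


def append_node(f, p_tail: int, value: int) -> int:
    """Append a node whose pPrev is the old pTail; return the new pTail."""
    offset = f.seek(0, io.SEEK_END)     # 2) the node goes to the file tail
    f.write(NODE.pack(p_tail, value))   # 1) pPrev is set to the old pTail
    return offset                       # 3) the caller keeps this as pTail


def read_chain(f, p_tail: int) -> list:
    """Walk the chain backwards from pTail, newest value first."""
    values = []
    offset = p_tail
    while offset != NO_PREV:
        f.seek(offset)
        offset, value = NODE.unpack(f.read(NODE.size))
        values.append(value)
    return values
```

Appending the values 10, 20 and 30 to an empty `io.BytesIO()` touches only the end of the file each time, and `read_chain` then yields them in reverse insertion order, which is why existing nodes never have to be rewritten.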

Every linked list in the index file has its own pTail value, and pTail is the access entry of the linked list, so the pTail values must themselves be stored. They can be synchronized periodically to an additional Entry area or file; because the pTail values of many linked lists are written to the Entry file in batches by sequential overwriting, optimal I/O performance can be achieved here as well. The hash bucket data structure formed by combining the method of the embodiment of the present invention with the reverse file index described above is shown in fig. 5.

In a hash bucket, the address pointer of the data block at the tail of the data index chain is stored in the data management structure of the hash bucket. Each time collision data is newly added to the corresponding hash bucket, pPrev in the data block that carries the new collision data is assigned the address pointer stored in the data management structure, and Value in that data block is assigned the content of the collision data; the pointer stored in the data management structure is then updated to the address pointer of the data block that carries the new collision data.
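The periodic synchronization of pTail values to an Entry area could look like the sketch below. It assumes one 8-byte offset per linked list, stored in list order and rewritten from the start of the Entry file in a single sequential write; the layout and the function names are hypothetical.

```python
import io
import struct


def sync_entry_area(entry_file, p_tails) -> None:
    """Overwrite the Entry area with every pTail in one sequential write."""
    payload = struct.pack(f"<{len(p_tails)}q", *p_tails)
    entry_file.seek(0)
    entry_file.write(payload)


def load_entry_area(entry_file, list_count: int) -> list:
    """Read back list_count pTail values, for example after a restart."""
    entry_file.seek(0)
    data = entry_file.read(8 * list_count)
    return list(struct.unpack(f"<{list_count}q", data))
```

Because the Entry area has a fixed size for a fixed number of linked lists, each synchronization overwrites the same region rather than growing the file.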

When the total number H of hash buckets is 2 to 5 times the total number of stored keywords, both the collision rate and the memory occupation can be kept optimal. It should be noted that fig. 5 shows at least 4 linked list spaces (i.e. the total number N of data blocks) in one hash bucket only for convenience in presenting the linked list relationship; in a practical application scenario of the present invention, the value of N is usually related to the maximum number of collisions C, and N satisfies the following condition: N = MAX(C/5, 2), i.e. the larger of C/5 and 2. In the present embodiment, the value of X is typically set to N × 2 to N × 3.
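As a worked illustration of these sizing rules, the sketch below derives H, N and X from a keyword count and a maximum collision number C. The function name, the integer multipliers and the rounding of C/5 upward are assumptions; the source does not say how a fractional C/5 is handled.

```python
import math


def size_parameters(key_count: int, max_collisions: int,
                    bucket_factor: int = 2, x_factor: int = 2):
    """Return (H, N, X) following the sizing guidance in the text."""
    if not 2 <= bucket_factor <= 5:
        raise ValueError("H should be 2 to 5 times the number of keywords")
    if not 2 <= x_factor <= 3:
        raise ValueError("X should be N*2 to N*3")
    h = key_count * bucket_factor
    # N = MAX(C/5, 2); rounding C/5 up is an assumption of this sketch.
    n = max(math.ceil(max_collisions / 5), 2)
    x = n * x_factor
    return h, n, x
```

For 1000 keywords and C = 20 with the default factors this gives H = 2000, N = 4 and X = 8; with C = 4 the lower bound applies and N = 2.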

When the record of the key word K is to be stored, if the ArrayCount value in the jth Hash bucket is determined to be equal to X, after the contents of each conflict data recorded in the jth Hash bucket are matched, if a consistent result is not matched, directly discarding the corresponding key word K.

In combination with the embodiments of the present invention, there is a further preferred implementation that can improve data indexing efficiency. It generally applies where the analysis of conflict data has a clear time limit. The conflict array is also recorded in a designated index document, and the designated index documents are divided among file handles by time interval, specifically:

Index files are physically partitioned by time, so the system needs to manage the index file handles of the different partitions.

The system of the invention places a requirement on the time ordering of the raw data to be indexed: packets that are out of order by more than "time zone size/2" are not allowed, and such data is discarded if it occurs.

Based on the premise, the system only needs to keep 2 index file handles. The specific execution logic:

1) Assume that the time zone in which the system is implemented spans 1 hour.

2) When the current time is 1.

3) This time, it is allowed to index and store the original data whose time stamp is in the range of [0, 00, 2.

4) When the current time is 1. At this point the system maintains file handles of 1 and 2 points.

5) This time, it is allowed to index and store the original data whose time stamp is in the range of [ 100, 3.

6) And so on. The file handle validity sample is as follows:

Under this logic, the system only ever keeps the handles of 2 time zones open, which reduces the occupation of system resources while still providing a high tolerance for out-of-order timestamps. Managing the 2 handles in separate threads or processes ensures that the index file can be initialized at the same time as the handle for the next time zone is created. (File initialization is generally relatively time-consuming; if initialization of the current time zone's index file were triggered only after data for the current time zone arrived, a large instantaneous backlog of data to be indexed, and loss of data, could result.)
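The two-handle rule can be sketched as below. Because the numbered steps above are truncated in the source, the switchover point is an assumption: the sketch keeps the previous and current time zones open during the first half of a zone, and the current and next zones open during the second half, which is consistent with the stated "time zone size/2" limit on out-of-order data. Times are plain integer seconds and the function names are hypothetical.

```python
def open_zones(now: int, zone: int = 3600) -> tuple:
    """Return the start times of the two time zones whose handles stay open."""
    current = now - now % zone
    if now % zone < zone // 2:
        return (current - zone, current)  # first half: previous and current
    return (current, current + zone)      # second half: current and next


def accepts(timestamp: int, now: int, zone: int = 3600) -> bool:
    """True if the record's time zone currently has an open handle."""
    return timestamp - timestamp % zone in open_zones(now, zone)
```

With a 1-hour zone, at 1:00 (3600 s) the open zones start at 0:00 and 1:00, and from 1:30 (5400 s) they start at 1:00 and 2:00; a record stamped 0:01:40 is accepted at 1:00 but rejected at 1:30.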

Example 2:

fig. 6 is a schematic diagram of an architecture of a storage apparatus supporting a multi-valued index file according to an embodiment of the present invention. The storage device supporting the multi-value index file of the present embodiment includes one or more processors 21 and a memory 22. In fig. 6, one processor 21 is taken as an example.

The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.

The memory 22, which is a nonvolatile computer-readable storage medium, may be used to store a nonvolatile software program and a nonvolatile computer-executable program, such as the storage method supporting the multi-value index file in embodiment 1. The processor 21 executes the storage method supporting the multi-value index file by executing the nonvolatile software program and instructions stored in the memory 22.

The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the storage method supporting the multi-value index file in the above-described embodiment 1, for example, perform the respective steps shown in fig. 3 described above.

It should be noted that, for the information interaction, execution process and other contents between the modules and units in the apparatus and system, the specific contents may refer to the description in the embodiment of the method of the present invention because the same concept is used as the embodiment of the processing method of the present invention, and are not described herein again.

Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: Read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A storage method supporting a multi-valued index file, characterized in that a collision array that is contiguous in global memory is applied for, and N data blocks for storing colliding data are allocated to each hash bucket; the maximum number of data blocks that each hash bucket is allowed to access is set to X, where X satisfies X >= N and X is always counted from the initial data block of the corresponding hash bucket; the data block identifier Index within the range belonging to the i-th hash bucket satisfies (N × i) <= Index < (N × (i + 1)); a field ArrayCount is set in the data management structure corresponding to each hash bucket to record the number of data blocks actually used in the hash bucket; when a record of the keyword K is to be stored, the method comprises the following steps:

calculating the serial number value of the hash bucket to be j through a preset hash algorithm;

2. The storage method supporting the multi-valued index file according to claim 1, wherein the data stored in the data blocks forms a data index chain in a reverse index manner, wherein each data block comprises an address pointer pPrev pointing to the previously stored conflicting data and a Value storing the content of its own conflicting data.

3. The storage method supporting multi-valued index file according to claim 2, characterized in that in a hash bucket, the address pointer of the data block at the end of the data index chain is stored in its data management structure, and every time collision data is newly added in the corresponding hash bucket, the pPrev in the data block for bearing newly added collision data is assigned as the address pointer stored in the data management structure, and the Value in the data block for bearing newly added collision data is assigned as the collision data content; and updating the pointer stored in the data management structure to be the address pointer of the data block for bearing the newly increased conflict data.

4. The storage method supporting the multi-value index file according to any one of claims 1 to 3, wherein the total number H of the hash buckets is 2 to 5 times of the total number of the storage keywords, so as to ensure that a collision rate and memory occupation are optimal.

5. The storage method supporting the multi-value index file according to any one of claims 1 to 3, wherein the value of N is related to a maximum conflict number C, and N satisfies the following condition: N = MAX(C/5, 2).

6. The storage method supporting the multi-valued index file according to any of claims 1-3, wherein the value of X is set to N × 2 to N × 3.

7. The method according to claim 1, wherein when a record of a keyword K is to be stored, if it is determined that the size of the ArrayCount value in the jth hash bucket is equal to X, after the matching of each content of conflicting data recorded in the jth hash bucket is completed, if no matching result is obtained, the corresponding keyword K is directly discarded.

8. The storage method supporting the multi-value index file according to claim 1, wherein the conflict array is further recorded in a designated index document, wherein the designated index document performs file handle division according to time intervals, specifically:

9. A storage apparatus supporting a multi-valued index file, the apparatus comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the method of supporting multi-valued index file storage of any of claims 1-8.