CN113655968B

CN113655968B - Unstructured data storage method

Info

Publication number: CN113655968B
Application number: CN202110971730.6A
Authority: CN
Inventors: 郭殿勇
Original assignee: Shanghai Jinshuo Information Technology Co ltd
Current assignee: Shanghai Jinshuo Information Technology Co ltd
Priority date: 2021-08-24
Filing date: 2021-08-24
Publication date: 2024-06-18
Anticipated expiration: 2041-08-24
Also published as: CN113655968A

Abstract

The invention relates to the field of electronic digital data processing, and in particular to an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on mining features; searching each storage area based on the secondary label and generating a mapping; and storing the mapping relationship in a second storage area. The unstructured data is thus stored in blocks according to the categories of the main label, and a mapping is then established between the secondary label and the data in all storage areas. This improves the retrieval speed and the read/write speed of the stored data, keeps the unstructured data usable, and shortens the response time.

Description

Unstructured data storage method

Technical Field

The invention relates to the field of electronic digital data processing, in particular to an unstructured data storage method.

Background

The continued development of computer applications has caused data volumes to grow dramatically. Because structuring data is limited by the speed of manual processing, unstructured data grows far faster than structured data. For large-scale data that has now reached the TB and PB level, better tools or techniques are needed to organize and manage files; an efficient data organization method helps users quickly retrieve the data they want from large-scale back-end data when needed.

Existing unstructured data is generally stored in storage sequentially, so the data items have no relationship to one another, searches take a long time, and the data is difficult to use.

Disclosure of Invention

The invention aims to provide an unstructured data storage method that improves the retrieval speed and the read/write speed of stored data and keeps unstructured data usable, so that responses are faster.

To achieve the above object, the present invention provides an unstructured data storage method, comprising:

acquiring unstructured data;

identifying the unstructured data and generating a main label;

storing the unstructured data in blocks based on the main label;

generating a secondary label based on mining features;

searching each storage area based on the secondary label and generating a mapping;

storing the mapping relationship in a second storage area.

The unstructured data is acquired from data channels such as websites, computer programs, and mobile phone apps.

The main label may be a file type or a data source channel.

The specific steps of storing the unstructured data in blocks based on the main label are as follows:

generating partitioned storage areas based on the main label;

identifying the unstructured data based on the main label;

storing the unstructured data into the corresponding partitioned storage areas;

when the capacity of a partitioned storage area is used up, adding a new storage area and establishing a mapping.

The specific steps of generating the partitioned storage areas based on the main label are as follows:

acquiring the address of the total storage area;

acquiring the number of main labels;

dividing the address of the total storage area based on the number of main labels to generate partitioned storage areas corresponding to the number of main labels.

The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.

The specific steps of storing the unstructured data into the corresponding partitioned storage areas are as follows:

setting a standard capacity value;

comparing the capacity value of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the capacity value of the current file unit is larger than the standard capacity value, and then generating a storage unit;

generating an index of the storage unit.

The specific steps of adding a new storage area and establishing a mapping when the capacity of a partitioned storage area is used up are as follows:

detecting the capacity of the partitioned storage area;

searching for a new storage area when the capacity is below a threshold;

establishing an address mapping when a new storage area is found.

The invention discloses an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a main label; storing the unstructured data in blocks based on the main label; generating a secondary label based on mining features; searching each storage area based on the secondary label and generating a mapping; and storing the mapping relationship in a second storage area. The unstructured data is thus stored in blocks according to the categories of the main label, and a mapping is then established between the secondary label and the data in all storage areas. This improves the retrieval speed and the read/write speed of the stored data, keeps the unstructured data usable, and shortens the response time.

Drawings

To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of the unstructured data storage method of the present invention;

FIG. 2 is a flow chart of storing unstructured data in blocks based on the main label;

FIG. 3 is a flow chart of generating partitioned storage areas based on the main label;

FIG. 4 is a flow chart of storing unstructured data into the corresponding partitioned storage areas;

FIG. 5 is a flow chart of adding a new storage area and establishing a mapping when the capacity of a partitioned storage area is used up;

FIG. 6 is a flow chart of searching each storage area based on the secondary label and generating a mapping.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

Referring to FIG. 1 to FIG. 6, the present invention provides an unstructured data storage method, which includes:

S101, acquiring unstructured data;

Unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in the two-dimensional logical tables of a database. It includes office documents in all formats, text, pictures, XML, HTML, various types of reports, images, and audio/video information.

Data in a computer information system is divided into structured data and unstructured data. Unstructured data comes in many formats and follows many different standards, and technically unstructured information is harder to standardize and understand than structured information.

The required unstructured data can be obtained from data channels such as websites, computer programs, and mobile phone apps, and may take the form of documents, text, pictures, and the like.

S102, identifying unstructured data and generating a main label;

The main label may be a file type or a data source channel. File types include documents, text, pictures, and the like. A main label based on file type classifies the various files better, so that files of the same kind can be searched in the same way, which improves search efficiency. Classification by data source channel makes it easy to update and delete data at any time, so maintenance is more convenient.
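As an illustration of S102, the following is a minimal sketch of main label generation. It assumes that the file type is inferred from the file extension; the extension table, the label names, and the function main_label are hypothetical and are not taken from the embodiment.

```python
from pathlib import PurePosixPath

# Hypothetical table mapping file extensions to file-type main labels.
EXTENSION_LABELS = {
    ".doc": "document", ".docx": "document", ".pdf": "document",
    ".txt": "text", ".xml": "text", ".html": "text",
    ".jpg": "picture", ".png": "picture",
    ".mp3": "audio", ".mp4": "video",
}


def main_label(filename: str, source_channel: str, by: str = "file_type") -> str:
    """Return the main label of one item of unstructured data.

    by="file_type" classifies by extension; by="source_channel" uses the
    channel the data was acquired from (website, program, mobile app, ...).
    """
    if by == "source_channel":
        return source_channel
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_LABELS.get(suffix, "other")
```

Either classification can be selected per deployment: main_label("report.PDF", "website") classifies by file type, while passing by="source_channel" returns the acquisition channel instead.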

S103, storing unstructured data in blocks based on the main label;

The method comprises the following specific steps:

S201, generating partitioned storage areas based on the main label;

The method comprises the following specific steps:

S301, acquiring an address of a total storage area;

By reading the address of the total storage area, the capacity of the entire storage can be obtained, which facilitates the division.

S302, acquiring the number of main labels;

The main label has a plurality of categories, and each category needs to be assigned a partitioned storage area.

S303, dividing the addresses of the total storage areas based on the number of the main labels, and generating partitioned storage areas corresponding to the number of the main labels.

The address division can be adjusted according to the characteristics of each category. For example, documents and text generally occupy less space than pictures and video, so capacities can be graded in advance of the division, and categories in the same grade can be given address ranges of the same capacity. This makes fuller use of the space.

The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data, and the storage area is used for storing low-frequency access data.

While unstructured data is in use, new data is collected continuously, so old data becomes less relevant and is used less often. To improve retrieval and read/write efficiency, the partitioned storage area can be divided into a cache area and a storage area. The cache area has a higher read/write speed and can therefore respond quickly; the storage area has a lower read/write speed but a low storage cost, and is suitable for storing large amounts of old data. To distinguish new data from old data, a time node may be set, for example 7 days or 10 days.
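Steps S301 to S303, together with the capacity grading described above, can be sketched as follows. This is only an illustration under stated assumptions: the total storage area is modelled as a contiguous integer address range, and the capacity grades are expressed as relative integer weights. The function partition_addresses and the example weights are hypothetical.

```python
def partition_addresses(total_start: int, total_size: int, weights: dict) -> dict:
    """Split [total_start, total_start + total_size) among the main labels.

    weights maps each main label to a relative capacity grade. Labels in the
    same grade receive address ranges of the same size; equal weights give an
    even split by the number of main labels.
    """
    total_weight = sum(weights.values())
    labels = list(weights)
    partitions = {}
    cursor = total_start
    for i, label in enumerate(labels):
        if i == len(labels) - 1:
            # The last label takes whatever remains, so no address is lost
            # to integer rounding.
            size = total_start + total_size - cursor
        else:
            size = total_size * weights[label] // total_weight
        partitions[label] = (cursor, cursor + size)  # half-open range
        cursor += size
    return partitions


parts = partition_addresses(
    0, 1000, {"document": 1, "text": 1, "picture": 3, "video": 5}
)
```

With these example weights, documents and text fall in the same grade and receive ranges of equal size, while pictures and video receive proportionally larger ranges.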
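The time-node rule can be sketched as follows. The sketch assumes that each item carries a collection timestamp and that data newer than the time node goes to the cache area; the function choose_tier and its default of 7 days are illustrative only.

```python
from datetime import datetime, timedelta


def choose_tier(collected_at: datetime, now: datetime, time_node_days: int = 7) -> str:
    """Return "cache" for data newer than the time node, else "storage"."""
    age = now - collected_at
    return "cache" if age <= timedelta(days=time_node_days) else "storage"
```

A periodic job could call such a function for every item in the cache area and move the items that have aged past the time node into the storage area.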

S202, identifying unstructured data based on a main label;

The unstructured data is identified according to the categories in the main label, so that it can be stored in the corresponding area.

S203, storing unstructured data into corresponding partitioned storage areas;

The method comprises the following specific steps:

S401, setting a standard capacity value;

To reduce the number of small files and increase the processing rate, small files may be merged into one large file for storage. A standard capacity value is therefore set: a file smaller than the standard capacity value needs to be merged, and a file larger than the standard capacity value does not.

S402, comparing the capacity value of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the capacity value of the current file unit is larger than the standard capacity value, and then generating a storage unit;

Executing these steps in sequence merges the small files one after another into storage units, which facilitates fast reading and writing.

S403, generating an index of the storage unit.

All small files in the storage unit are indexed so that they can be found easily.
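Steps S401 to S403 can be sketched as follows. The sketch assumes that the files arrive as an ordered list of (name, size) pairs and that the index records each small file's offset and length inside its storage unit; the function build_storage_units and this index layout are assumptions made for illustration.

```python
def build_storage_units(files, standard_capacity):
    """Merge adjacent small files into storage units (S402) and index them (S403).

    files is an ordered list of (name, size). Files are appended to the
    current unit until its size exceeds standard_capacity, at which point the
    unit is closed. Each unit is a dict with its total "size" and an "index"
    mapping file name -> (offset, length) inside the unit.
    """
    units = []
    current = {"size": 0, "index": {}}
    for name, size in files:
        current["index"][name] = (current["size"], size)
        current["size"] += size
        if current["size"] > standard_capacity:
            units.append(current)
            current = {"size": 0, "index": {}}
    if current["index"]:
        # Trailing files that never reached the standard value still form a unit.
        units.append(current)
    return units


units = build_storage_units(
    [("a", 30), ("b", 50), ("c", 40), ("d", 200), ("e", 10)], standard_capacity=100
)
```

In this example, files a, b and c are merged into one unit of size 120, file d already exceeds the standard value and forms a unit by itself, and file e remains in a trailing unit.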

S204, when the capacity of the partitioned storage area is used up, a new storage area is added and a mapping is established.

The method comprises the following specific steps:

S501, detecting the capacity of the partitioned storage area;

S502, searching a new storage area when the capacity is lower than a threshold value;

A capacity threshold is set for the partitioned storage area, and when the capacity falls below this threshold, other new storage areas available in the network are searched for.

S503, establishing an address mapping when a new storage area is found.

The address of the new storage area is linked to the current partitioned storage area by establishing an address mapping, which expands the capacity and avoids overwriting other adjacent storage areas.
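Steps S501 to S503 can be sketched as follows. The sketch assumes that "capacity" means the remaining free capacity of the partitioned storage area and that the network search is supplied by the caller as a callback; the class PartitionedArea, the function extend_if_needed, and the callback find_new_area are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionedArea:
    label: str
    capacity: int
    used: int = 0
    # Address mapping: (area_id, capacity) of every new area linked to this one.
    mapped_areas: list = field(default_factory=list)

    def free(self) -> int:
        extra = sum(cap for _, cap in self.mapped_areas)
        return self.capacity + extra - self.used


def extend_if_needed(area: PartitionedArea, threshold: int, find_new_area) -> bool:
    """S501: detect capacity; S502: search when below threshold; S503: map.

    find_new_area is a caller-supplied function returning (area_id, capacity)
    or None when no storage area is available. Returns True if a new area
    was mapped.
    """
    if area.free() >= threshold:
        return False
    candidate = find_new_area()
    if candidate is None:
        return False
    area.mapped_areas.append(candidate)
    return True
```

Because the new area is attached through the mapping rather than by growing the address range in place, the ranges of neighbouring partitioned storage areas are left untouched.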

S104, generating a secondary label based on the mining features;

When data is to be used, corresponding mining features need to be set so that the matching data can be found. For more accurate retrieval, the secondary labels may be based on keywords in the mining features.
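One possible way to derive secondary labels from keywords is sketched below. It assumes that the mining features are given as free-text phrases and that keywords are obtained by whitespace splitting with a small stop-word list; the embodiment does not specify the extraction technique, so secondary_labels and STOP_WORDS are purely illustrative.

```python
# Hypothetical stop-word list; a real deployment would choose its own.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def secondary_labels(mining_features):
    """Return de-duplicated, lower-cased keywords in order of first appearance."""
    labels = []
    for feature in mining_features:
        for word in feature.lower().split():
            if word not in STOP_WORDS and word not in labels:
                labels.append(word)
    return labels
```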

S105, searching and generating a mapping based on the secondary label in each storage area;

The method comprises the following specific steps:

S601, scanning the plurality of partitioned storage areas simultaneously with each secondary label;

Because all data is stored in partitions, all of the partitioned storage areas can be scanned simultaneously for data matching a secondary label, which improves search efficiency.

S602, judging whether each partitioned storage area records content consistent with the secondary label, and marking the content for which the judgment is yes;

S603, establishing a mapping between the marked data in the partitioned storage areas and the secondary label.

In this way, a link between the secondary label and all partitioned storage areas is established, which further improves retrieval speed.
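Steps S601 to S603 can be sketched as follows. The sketch assumes that each partitioned storage area is held as a dictionary of item id to text content, that "consistent with" means a substring match, and that a thread pool provides the simultaneous scan; scan_partition and build_mapping are hypothetical names, and non-text data would need a different matching rule.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition_name, items, label):
    """S602: mark the items of one partition whose content contains the label."""
    return [
        (partition_name, item_id)
        for item_id, content in items.items()
        if label in content
    ]


def build_mapping(partitions, labels):
    """S601 and S603: scan all partitions concurrently and map label -> locations."""
    mapping = {label: [] for label in labels}
    with ThreadPoolExecutor() as pool:
        for label in labels:
            # All partitions are scanned in parallel for the current label;
            # pool.map yields the results in partition order.
            results = pool.map(
                lambda p: scan_partition(p[0], p[1], label), partitions.items()
            )
            for hits in results:
                mapping[label].extend(hits)
    return mapping


partitions = {
    "document": {"d1": "annual revenue report", "d2": "meeting notes"},
    "text": {"t1": "revenue forecast"},
}
mapping = build_mapping(partitions, ["revenue", "notes"])
```

The resulting mapping records, for each secondary label, the partitioned storage area and item in which matching content was found, so a later query on that label does not need to rescan the partitions.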

S106, storing the mapping relationship in the second storage area.

The second storage area stores the mapping relationship and is updated as required. When the mining features need to be updated, the method returns to step S104 to regenerate the secondary label and then performs the subsequent steps, so that the data in the partitioned storage areas can be used flexibly as needed and processing efficiency is improved.
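Step S106 can be sketched as follows, under the assumption that the second storage area is represented by a JSON file and that the mapping has the form secondary label -> list of (partition, item id) locations. The functions save_mapping and lookup and the file format are illustrative choices, not part of the embodiment.

```python
import json


def save_mapping(mapping, path):
    """Write the label -> locations mapping to the second storage area."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(mapping, fh, ensure_ascii=False, indent=2)


def lookup(path, label):
    """Read the stored mapping and return the locations recorded for a label."""
    with open(path, encoding="utf-8") as fh:
        mapping = json.load(fh)
    # JSON stores tuples as lists, so convert each location back to a tuple.
    return [tuple(location) for location in mapping.get(label, [])]
```

When the mining features change, the mapping would be rebuilt from step S104 onward and save_mapping called again, overwriting the previous contents of the second storage area.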

The above disclosure is only a preferred embodiment of the present invention and does not limit the scope of the invention. Those skilled in the art will appreciate that implementations of all or part of the procedures of the above embodiment, and equivalent changes made according to the claims, still fall within the scope of the present invention.

Claims

1. A method for unstructured data storage is characterized in that,

Comprising the following steps: obtaining unstructured data;

Identifying unstructured data and generating a main label;

Storing unstructured data in blocks based on the main label;

generating a secondary label based on the mining characteristics;

the mapping relationship is stored to the second storage area.

2. An unstructured data storage method according to claim 1, wherein,

The specific mode for acquiring the unstructured data is to acquire the unstructured data from a data channel comprising a website, a computer program and a mobile phone APP.

3. An unstructured data storage method according to claim 1, wherein,

The primary label is a file type or data source channel.

4. An unstructured data storage method according to claim 3, wherein,

generating a partitioned storage area based on the main label;

identifying unstructured data based on the main label;

storing unstructured data into corresponding partitioned storage areas;

5. An unstructured data storage method according to claim 4, wherein,

Acquiring the address of a total storage area;

Acquiring the number of main labels;

6. An unstructured data storage method according to claim 4, wherein,

7. An unstructured data storage method according to claim 4, wherein,

setting a standard capacity value;

An index of the storage unit is generated.

8. An unstructured data storage method according to claim 4, wherein,

detecting the capacity of the partitioned storage areas;

searching for a new storage area when the capacity is below a threshold;

When a new storage area is found, an address mapping is established.