CN113655968B - Unstructured data storage method - Google Patents

Publication number
CN113655968B
Authority
CN
China
Prior art keywords
unstructured data
storage area
data
generating
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110971730.6A
Other languages
Chinese (zh)
Other versions
CN113655968A (en)
Inventor
郭殿勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinshuo Information Technology Co ltd
Original Assignee
Shanghai Jinshuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinshuo Information Technology Co ltd
Priority to CN202110971730.6A
Publication of CN113655968A
Application granted
Publication of CN113655968B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of electronic digital data processing, and in particular to an unstructured data storage method comprising the following steps: acquiring unstructured data; identifying the unstructured data and generating a primary tag; storing the unstructured data in partitions based on the primary tag; generating a secondary tag based on mining features; searching each storage area based on the secondary tag and generating a mapping; and storing the mapping relationship in a second storage area. Unstructured data can thus be stored in partitions according to the categories of the primary tag, and a mapping can then be established between the secondary tag and the data in all storage areas. This improves the retrieval speed and the read-write speed of the stored data, keeps the unstructured data usable, and shortens the response time.

Description

Unstructured data storage method
Technical Field
The invention relates to the field of electronic digital data processing, in particular to an unstructured data storage method.
Background
The continued development of computer applications has caused data volumes to grow dramatically. Because structuring data is limited by the speed of manual processing, unstructured data is growing far faster than structured data. With data sets now reaching the TB and PB scale, better tools and techniques are needed to organize and manage files; an efficient data organization method helps people quickly retrieve the data they need from a large back-end data set.
Existing unstructured data is generally stored sequentially in storage, so the stored items have no relationship to one another, searches take a long time, and the data is difficult to use.
Disclosure of Invention
The invention aims to provide an unstructured data storage method that improves the retrieval speed and the read-write speed of stored data and keeps unstructured data usable, so that the response is faster.
To achieve the above object, the present invention provides an unstructured data storage method, comprising: acquiring unstructured data;
identifying the unstructured data and generating a primary tag;
storing the unstructured data in partitions based on the primary tag;
generating a secondary tag based on mining features;
searching each storage area based on the secondary tag and generating a mapping;
storing the mapping relationship in a second storage area.
The unstructured data is acquired from data channels such as websites, computer programs, and mobile phone apps.
The primary tag may be a file type or a data source channel.
The specific steps of storing the unstructured data in partitions based on the primary tag are as follows:
generating partitioned storage areas based on the primary tag;
identifying the unstructured data based on the primary tag;
storing the unstructured data into the corresponding partitioned storage areas;
when the capacity of a partitioned storage area is exhausted, adding a new storage area and establishing a mapping.
The specific steps of generating the partitioned storage areas based on the primary tag are as follows:
acquiring the address of the total storage area;
acquiring the number of primary tags;
dividing the address of the total storage area based on the number of primary tags to generate partitioned storage areas corresponding to the number of primary tags.
The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data and the storage area is used for storing low-frequency access data.
The specific steps of storing the unstructured data into the corresponding partitioned storage areas are as follows:
setting a standard capacity value;
comparing the size of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the size of the file unit exceeds the standard capacity value, and then generating a storage unit;
generating an index of the storage unit.
When the capacity of a partitioned storage area is exhausted, the specific steps of adding a new storage area and establishing a mapping are as follows:
detecting the capacity of the partitioned storage area;
searching for a new storage area when the capacity is below a threshold;
establishing an address mapping when a new storage area is found.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the unstructured data storage method of the present invention;
FIG. 2 is a flow chart of storing unstructured data in partitions based on the primary tag;
FIG. 3 is a flow chart of generating partitioned storage areas based on the primary tag;
FIG. 4 is a flow chart of storing unstructured data into the corresponding partitioned storage areas;
FIG. 5 is a flow chart of adding a new storage area and establishing a mapping when the capacity of a partitioned storage area is exhausted;
FIG. 6 is a flow chart of searching each storage area based on the secondary tag and generating a mapping.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to FIGS. 1 to 6, the present invention provides an unstructured data storage method, which includes:
S101, acquiring unstructured data;
Unstructured data is data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent with the two-dimensional logical tables of a database. It includes office documents in all formats, text, pictures, XML, HTML, reports of various types, and image and audio/video information.
Data in a computerized information system is divided into structured data and unstructured data. Unstructured data comes in many formats and follows many standards, and it is technically harder to standardize and understand than structured data.
The required unstructured data can be acquired from data channels such as websites, computer programs, and mobile phone apps, and may take the form of documents, text, pictures, and so on.
S102, identifying the unstructured data and generating a primary tag;
The primary tag may be a file type or a data source channel. File types include documents, text, pictures, and so on. Classifying files by a file-type primary tag means that files of the same kind can be searched in the same way, which improves search efficiency. Classifying by data source channel makes it easy to update and delete data at any time, which simplifies maintenance.
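As an illustration of step S102, the following minimal Python sketch derives a primary tag either from the file type or from the data source channel. The extension table, the category names, and the function name `primary_tag` are assumptions made for this example; the patent does not specify how the file type is recognized.

```python
from pathlib import PurePath

# Hypothetical extension-to-category table (not given in the patent).
FILE_TYPE_BY_EXTENSION = {
    ".doc": "document", ".docx": "document", ".pdf": "document",
    ".txt": "text", ".xml": "text", ".html": "text",
    ".jpg": "picture", ".png": "picture",
    ".mp3": "audio", ".mp4": "video",
}


def primary_tag(filename: str, source_channel: str, by: str = "file_type") -> str:
    """Return the primary tag of one item, by file type or by source channel."""
    if by == "source_channel":
        return source_channel
    if by != "file_type":
        raise ValueError("by must be 'file_type' or 'source_channel'")
    suffix = PurePath(filename).suffix.lower()
    # Unknown extensions fall into a catch-all category.
    return FILE_TYPE_BY_EXTENSION.get(suffix, "other")
```

Tagging by file type groups similar files for uniform searching, while tagging by source channel groups items that are likely to be updated or deleted together.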
S103, storing the unstructured data in partitions based on the primary tag;
The specific steps are as follows:
S201, generating partitioned storage areas based on the primary tag;
The specific steps are as follows:
S301, acquiring the address of the total storage area;
Reading the address of the total storage area gives the capacity of the entire storage, which makes it easier to divide.
S302, acquiring the number of primary tags;
The primary tag has a plurality of categories, and each category needs to be assigned a storage area.
S303, dividing the address of the total storage area based on the number of primary tags, and generating partitioned storage areas corresponding to the number of primary tags.
The division can be adjusted according to the characteristics of each category. For example, documents and text generally occupy less space than pictures and videos, so the categories can be graded by capacity before the division, and categories of the same grade are allocated address ranges of the same capacity. This makes fuller use of the space.
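Steps S301 to S303 can be sketched as follows. This is only one possible reading: the total storage area is modeled as a contiguous address range, and the capacity grade of each category is expressed as an integer weight. The function name `partition_addresses` and the weights are assumptions for illustration.

```python
def partition_addresses(total_start: int, total_size: int, weights: dict) -> dict:
    """Split [total_start, total_start + total_size) into one range per primary tag.

    `weights` maps each primary tag to its capacity grade; tags with the same
    grade receive address ranges of the same size.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("every primary tag needs a positive weight")
    total_weight = sum(weights.values())
    end = total_start + total_size
    partitions = {}
    cursor = total_start
    tags = list(weights)
    for position, tag in enumerate(tags):
        if position == len(tags) - 1:
            size = end - cursor  # the last partition absorbs rounding remainders
        else:
            size = total_size * weights[tag] // total_weight
        partitions[tag] = (cursor, cursor + size)  # half-open range [start, stop)
        cursor += size
    return partitions
```

With weights of 1 for documents and text and 3 for pictures and videos, a 1000-unit area is split into ranges of 125, 125, 375, and 375 units, so the larger media categories receive proportionally more space.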
The partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data and the storage area is used for storing low-frequency access data.
As new unstructured data is continuously collected, old data becomes less relevant and is used less often. To improve retrieval and read-write efficiency, the partitioned storage area can be divided into a cache area and a storage area. The cache area has a higher read-write speed and can respond quickly; the storage area has a lower read-write speed but is cheap, which makes it suitable for holding large amounts of old data. To distinguish new data from old data, a time node can be set, for example 7 days or 10 days.
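The time-node rule described above can be sketched in a few lines. The sketch uses the age of an item as a stand-in for its access frequency, as the text suggests; the function name `choose_area` and the default of 7 days are illustrative assumptions.

```python
from datetime import datetime, timedelta


def choose_area(collected_at: datetime, now: datetime,
                time_node: timedelta = timedelta(days=7)) -> str:
    """Place data newer than the time node in the cache area, older data in the storage area."""
    return "cache" if now - collected_at <= time_node else "storage"
```

A periodic job could apply the same rule to move items from the cache area to the storage area once they pass the time node.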
S202, identifying the unstructured data based on the primary tag;
The unstructured data is identified by the categories of the primary tag, so that it can be stored in the corresponding area.
S203, storing the unstructured data into the corresponding partitioned storage areas;
The specific steps are as follows:
S401, setting a standard capacity value;
To reduce the number of small files and increase the processing rate, small files can be merged into one large file for storage. A standard capacity value is therefore set: a file smaller than the standard capacity value needs to be merged, and a file larger than the standard capacity value does not.
S402, comparing the size of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the size of the file unit exceeds the standard capacity value, and then generating a storage unit;
Executing this step repeatedly merges the small files, in order, into storage units, which facilitates fast reading and writing.
S403, generating an index of the storage unit.
All small files in the storage unit are indexed so that each of them can be found easily.
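Steps S401 to S403 can be illustrated with the sketch below, which packs files in order into storage units and records an offset index for each unit. The class `StorageUnit`, the functions `pack_files` and `read_file`, and the in-memory byte buffers are assumptions for the example; the patent does not prescribe a data layout. A trailing group of files that never exceeds the standard capacity value is kept as a final, smaller unit, which is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    data: bytearray = field(default_factory=bytearray)
    index: dict = field(default_factory=dict)  # file name -> (offset, length)


def pack_files(files: list, standard_capacity: int) -> list:
    """Merge adjacent files until a unit exceeds the standard capacity value.

    `files` is an ordered list of (name, content_bytes) pairs.
    """
    units = []
    current = StorageUnit()
    for name, content in files:
        # S403: record where this file starts inside the storage unit.
        current.index[name] = (len(current.data), len(content))
        current.data.extend(content)
        # S402: stop merging once the unit is larger than the standard value.
        if len(current.data) > standard_capacity:
            units.append(current)
            current = StorageUnit()
    if current.index:
        units.append(current)  # leftover files form a last, smaller unit
    return units


def read_file(unit: StorageUnit, name: str) -> bytes:
    """Use the unit's index to read one small file back out."""
    offset, length = unit.index[name]
    return bytes(unit.data[offset:offset + length])
```

A file that is already larger than the standard capacity value closes its unit immediately, so it is not merged with any file that follows it.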
S204, when the capacity of a partitioned storage area is exhausted, adding a new storage area and establishing a mapping.
The specific steps are as follows:
S501, detecting the capacity of the partitioned storage area;
S502, searching for a new storage area when the capacity is below a threshold;
A capacity threshold is set for the partitioned storage area; when the remaining capacity falls below this threshold, the network is searched for other new storage areas that can be used.
S503, establishing an address mapping when a new storage area is found.
The address of the new storage area is linked to the current partitioned storage area by establishing an address mapping, so that the capacity is expanded without overwriting other adjacent storage areas.
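Steps S501 to S503 might look like the following sketch. The address mapping is modeled as an ordered list of extents belonging to a partition, and the search for a new storage area is reduced to taking the first entry from a pool of free extents. The names `Extent`, `Partition`, `ensure_capacity`, and `resolve` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    start: int  # physical start address
    size: int


@dataclass
class Partition:
    tag: str
    extents: list  # ordered extents; this list is the address mapping
    used: int = 0

    @property
    def capacity(self) -> int:
        return sum(extent.size for extent in self.extents)

    @property
    def free(self) -> int:
        return self.capacity - self.used


def ensure_capacity(partition: Partition, free_extents: list, threshold: int) -> bool:
    """S501-S503: map a new extent when remaining capacity drops below the threshold."""
    if partition.free >= threshold:
        return False  # enough space left, nothing to do
    if not free_extents:
        return False  # no new storage area was found
    partition.extents.append(free_extents.pop(0))
    return True


def resolve(partition: Partition, logical_offset: int) -> int:
    """Translate an offset inside the partition into a physical address."""
    for extent in partition.extents:
        if logical_offset < extent.size:
            return extent.start + logical_offset
        logical_offset -= extent.size
    raise IndexError("offset beyond partition capacity")
```

Because the new extent is reached only through the mapping, the partition grows without writing into the address range of a neighboring partition.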
S104, generating a secondary tag based on mining features;
When the data is to be used, the corresponding mining features are set so that the relevant data can be found. For more accurate retrieval, the secondary tags can be taken from the keywords in the mining features.
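The patent does not say how keywords are extracted from the mining features. Purely as an assumption, the sketch below treats a mining feature as a short English phrase, lowercases it, and keeps every distinct word that is not in a small stop-word list; the function name `secondary_tags` and the stop-word list are illustrative.

```python
import re

# Hypothetical stop-word list; a real system would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def secondary_tags(mining_feature: str) -> list:
    """Return the distinct keywords of a mining feature, in order of appearance."""
    tags = []
    for word in re.findall(r"[a-z0-9]+", mining_feature.lower()):
        if word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags
```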
S105, searching each storage area based on the secondary tag and generating a mapping;
The specific steps are as follows:
S601, scanning the plurality of partitioned storage areas simultaneously with each secondary tag;
Because all data is stored in partitions, all of the separate storage areas can be scanned for data matching the secondary tag at the same time, which improves search efficiency.
S602, judging whether each partitioned storage area contains content consistent with the secondary tag, and marking the content for which the result is yes.
S603, establishing, by means of the marks, a mapping between the data in the partitioned storage areas and the secondary tag.
In this way a connection is established between the secondary tag and all partitioned storage areas, which further improves retrieval speed.
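Steps S601 to S603 can be sketched with a thread pool that scans every partitioned storage area concurrently. Each partition is modeled here as a dictionary from item identifier to text content, and "content consistent with the secondary tag" is reduced to a substring match; both are assumptions, as are the names `scan_partition` and `build_mapping`.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition_name: str, items: dict, tags: list) -> dict:
    """S602: mark every item in one partition whose content contains a secondary tag."""
    hits = {tag: [] for tag in tags}
    for item_id, text in items.items():
        for tag in tags:
            if tag in text:
                hits[tag].append((partition_name, item_id))
    return hits


def build_mapping(partitions: dict, tags: list) -> dict:
    """S601 and S603: scan all partitions concurrently and merge the marks into a mapping."""
    mapping = {tag: [] for tag in tags}
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(scan_partition, name, items, tags)
            for name, items in partitions.items()
        ]
        # Results are merged in submission order, so the mapping is deterministic.
        for future in futures:
            for tag, locations in future.result().items():
                mapping[tag].extend(locations)
    return mapping
```

In CPython, threads mainly pay off when each scan waits on disk or network I/O; for purely in-memory matching a process pool may be the better fit.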
S106, storing the mapping relationship in the second storage area.
The second storage area stores the mapping relationship and is updated as required. When the mining features need to be updated, the method returns to step S104 to regenerate the secondary tag and then performs the subsequent steps again, so that the data in the partitioned storage areas can be used flexibly as needed and processing efficiency is improved.
The above discloses only a preferred embodiment of the present invention and does not limit the scope of the invention. Those skilled in the art will understand that implementations of all or part of the procedures described above, and equivalent changes made in accordance with the claims, still fall within the scope of the present invention.

Claims (8)

1. An unstructured data storage method, characterized in that
it comprises the following steps: acquiring unstructured data;
identifying the unstructured data and generating a primary tag;
storing the unstructured data in partitions based on the primary tag;
generating a secondary tag based on mining features;
searching each storage area based on the secondary tag and generating a mapping;
storing the mapping relationship in a second storage area.
2. The unstructured data storage method according to claim 1, wherein
the unstructured data is acquired from data channels comprising a website, a computer program, and a mobile phone app.
3. The unstructured data storage method according to claim 1, wherein
the primary tag is a file type or a data source channel.
4. The unstructured data storage method according to claim 3, wherein
the specific steps of storing the unstructured data in partitions based on the primary tag are as follows:
generating partitioned storage areas based on the primary tag;
identifying the unstructured data based on the primary tag;
storing the unstructured data into the corresponding partitioned storage areas;
when the capacity of a partitioned storage area is exhausted, adding a new storage area and establishing a mapping.
5. The unstructured data storage method according to claim 4, wherein
the specific steps of generating the partitioned storage areas based on the primary tag are as follows:
acquiring the address of the total storage area;
acquiring the number of primary tags;
dividing the address of the total storage area based on the number of primary tags to generate partitioned storage areas corresponding to the number of primary tags.
6. The unstructured data storage method according to claim 4, wherein
the partitioned storage area comprises a cache area and a storage area, wherein the cache area is used for storing high-frequency access data and the storage area is used for storing low-frequency access data.
7. The unstructured data storage method according to claim 4, wherein
the specific steps of storing the unstructured data into the corresponding partitioned storage areas are as follows:
setting a standard capacity value;
comparing the size of the current file unit with the standard capacity value; if the standard capacity value is larger, merging the current file unit with the next adjacent file unit and comparing the merged file unit with the standard capacity value again, until the size of the file unit exceeds the standard capacity value, and then generating a storage unit;
generating an index of the storage unit.
8. The unstructured data storage method according to claim 4, wherein
when the capacity of a partitioned storage area is exhausted, the specific steps of adding a new storage area and establishing a mapping are as follows:
detecting the capacity of the partitioned storage area;
searching for a new storage area when the capacity is below a threshold;
establishing an address mapping when a new storage area is found.
CN202110971730.6A 2021-08-24 2021-08-24 Unstructured data storage method Active CN113655968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110971730.6A CN113655968B (en) 2021-08-24 2021-08-24 Unstructured data storage method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110971730.6A CN113655968B (en) 2021-08-24 2021-08-24 Unstructured data storage method

Publications (2)

Publication Number Publication Date
CN113655968A (en) 2021-11-16
CN113655968B (en) 2024-06-18

Family

ID=78481689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110971730.6A Active CN113655968B (en) 2021-08-24 2021-08-24 Unstructured data storage method

Country Status (1)

Country Link
CN (1) CN113655968B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024021107A1 (en) * 2022-07-29 2024-02-01 西门子股份公司 Industrial data storage method and apparatus
CN116719822B (en) * 2023-08-10 2023-12-22 深圳市连用科技有限公司 Method and system for storing massive structured data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915373A (en) * 2012-11-06 2013-02-06 无锡江南计算技术研究所 Data storage method and device
CN107315798A (en) * 2017-06-19 2017-11-03 北京神州泰岳软件股份有限公司 Structuring processing method and processing device based on multi-threaded semantic label information MAP

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140584B2 (en) * 2007-12-10 2012-03-20 Aloke Guha Adaptive data classification for data mining
CN101369275A (en) * 2008-09-10 2009-02-18 浙江大学 Product attribute excavation method of non-structured text

Also Published As

Publication number Publication date
CN113655968A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
US11226976B2 (en) Systems and methods for graphical exploration of forensic data
CN102110146B (en) Key-value storage-based distributed file system metadata management method
CN113655968B (en) Unstructured data storage method
US10698937B2 (en) Split mapping for dynamic rendering and maintaining consistency of data processed by applications
US20100169326A1 (en) Method, apparatus and computer program product for providing analysis and visualization of content items association
EP2985694A1 (en) Application program management method and apparatus, server, and terminal device
CN110888837B (en) Object storage small file merging method and device
US20230281377A1 (en) Systems and methods for displaying digital forensic evidence
CN110851663B (en) Method and device for managing metadata
CN113282799B (en) Node operation method, node operation device, computer equipment and storage medium
CN110941629A (en) Metadata processing method, device, equipment and computer readable storage medium
KR20090037704A (en) Meta data generation method for intutive image search
CN116821133A (en) Data processing method and device
CN100407204C (en) Method for labeling computer resource and system therefor
CN111045994A (en) KV database-based file classification retrieval method and system
US20190005112A1 (en) Method and system for creating entity records using existing data sources
US20170262439A1 (en) Information processing apparatus and non-transitory computer readable medium
CN101281524A (en) Method and apparatus for acquiring material
CN104424334A (en) Method and device for constructing nodes of XML (eXtensible Markup Language) documents
CN104965929B (en) A kind of data processing method and device
CN112379891B (en) Data processing method and device
CN115248803B (en) Collection method and device suitable for network disk file, network disk and storage medium
CN112100469B (en) Information data storage and integration system and method based on big data
CN117194587A (en) Label management method and device for data warehouse
Mestl et al. Time Challenges-Challenging Times for Future Information Search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant