CN111680198B - File management system and method based on file segmentation and feature extraction - Google Patents

File management system and method based on file segmentation and feature extraction

Info

Publication number
CN111680198B
Authority
CN
China
Prior art keywords
file
archive
feature
subfile
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010355696.5A
Other languages
Chinese (zh)
Other versions
CN111680198A (en
Inventor
车晓轩
童晓风
吴高峰
林曾丰
周雅琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Zhiguang Hengsheng Technology Co.,Ltd.
Original Assignee
Zhejiang Ocean University ZJOU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Ocean University ZJOU filed Critical Zhejiang Ocean University ZJOU
Priority to CN202010355696.5A priority Critical patent/CN111680198B/en
Publication of CN111680198A publication Critical patent/CN111680198A/en
Application granted granted Critical
Publication of CN111680198B publication Critical patent/CN111680198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • G06F16/137Hash-based
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of file processing and in particular relates to a file management system and method based on file segmentation and feature extraction. The system comprises: a file segmentation unit, which divides each digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles of that archive file based on their sequence; a feature extraction unit, which extracts features from each archive file to obtain archive file features, extracts features from each subfile to obtain subfile features, and associates each subfile feature with the archive file features to form a feature tree with the archive file features as the starting node and the subfile features as branch nodes; and a file retrieval unit, which searches the feature tree according to keywords provided by the user and sends the retrieved files to the user. The system improves resource space utilization and retrieval efficiency at the same time.

Description

File management system and method based on file segmentation and feature extraction
Technical Field
The invention belongs to the technical field of file management, and particularly relates to a file management system and method based on file segmentation and feature extraction.
Background
Archive management covers the activities of collecting, arranging, storing, appraising, compiling statistics on, and providing access to archives. It comprises: archive collection, archive arrangement, archive value appraisal, archive storage, archive cataloguing and retrieval, archive statistics, archive compilation and research (see archival documentation), and archive utilization. This division into 8 tasks is relatively stable but not absolute; the work is also divided into 6 links, or into two major parts, basic work and utilization work. Since modern archive management has become a complex system, there are also multi-level ways of partitioning it. At the first level it is divided into two subsystems, archive entity management and archive information development, and each subsystem is further divided into several smaller systems.
Archive entity management comprises the working links of collection, arrangement, appraisal, storage, statistics and the like. Archive information development is divided into information processing and information output: information processing comprises cataloguing, document compilation and the compilation of reference materials, while information output comprises service activities such as reading, copying, consultation, transfer, lending, publication and exhibition. The whole archive management system and its subsystems form a feedback mechanism in operation. As archive management modernizes, the structure of archive management work is affected in new ways. The ultimate purpose of archive management is to provide archive information in the service of social practice, and the structure of the archive management system is set according to this purpose. Each link is indispensable and follows a certain procedure. Together the links form an organic whole: each plays its own role in realizing the overall function of the archive management system, and the links are related to and constrain one another. For example, value appraisal is sometimes carried out together with collection and arrangement, and a preliminary appraisal may even be made when a document is filed.
Social modernization, office automation, paperless work and similar developments have greatly changed how archives are generated. Documents are handled within the system: drafting, issuing, following up and filing are all carried out on computers and over communication lines. The predecessors of archives therefore mainly take machine-readable form, archives naturally exist in machine-readable form, and the way they are used differs greatly from that of paper archives. This change means that archives will increasingly be carried on machine-readable media such as diskettes. What interests an information retriever is the content of the information, which may come from archives in different machine-readable forms. It is the responsibility of archive workers to provide this archive information comprehensively, systematically and in a timely manner. Careful processing is necessary to make machine-readable archive information systematic, authentic and valuable, and to give users a better service. The digitization of archive information is therefore an inevitable trend in the development of archive utilization.
Disclosure of Invention
The main purpose of the invention is to provide a file management system and method based on file segmentation and feature extraction that improve both resource space utilization and retrieval efficiency.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
an archive management system based on file segmentation and feature extraction, the system comprising: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; the file retrieval unit is used for retrieving based on the feature tree according to the keywords provided by the user and sending the retrieved file to the user; and the file splicing unit is used for responding to a file acquisition command of a user, splicing all files in the index association family corresponding to the target file in the sequence of the sub-files according to the index association family, and sending the spliced files to the user.
Further, the file segmentation unit divides the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles of the archive file based on their sequence by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, associates the archive file blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; after the file has been written, a message indicating that the write succeeded is issued; when a new file is written, multiple file segmentation units carry out the write task synchronously, each unit executing the write task slices the file with the same algorithm and then stores it, and the write task is terminated once it is verified that N copies of the archive file exist in the network; each file segmentation unit creates a distributed hash table, which contains the information of the file segmentation unit, all archive files stored under that unit together with their structural relations, and the information of the file segmentation units in which each archive file is stored; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after file segmentation is complete, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
Further, the feature extraction unit extracts features from each archive file to obtain archive file features, extracts features from each subfile to obtain subfile features, and associates each subfile feature with the archive file features to form a feature tree with the archive file features as the starting node and the subfile features as branch nodes, by performing the following steps: determining a feature magnitude G and a fuzzy weight M; randomly initializing the feature prototypes, each of which represents an intelligent agent; determining the population size; setting the evolution generation counter E to 0; and updating the degree of membership using the following formula:
Figure BDA0002473355500000031
where i, j and s each denote a feature category; v is a feature center, with v_i, v_j and v_s denoting the feature centers of classes i, j and s respectively; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray-scale information; and x_k is the standard label; calculating the individual energy in the feature population according to the updated degree of membership to obtain a feature label, and characterizing the subfile according to the obtained feature label; and, based on the file features, establishing a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
Further, the method for calculating the individual energy in the feature population according to the updated degree of membership, and thereby obtaining the feature label, performs the following steps: the individual energy in the population is calculated using the following formula:
Figure BDA0002473355500000032
where zeta is an adjustment constant with a value in the range 30 to 50; according to the individual energy values obtained, the individuals in the feature population are divided evenly into three classes, and a different feature label is set for each class.
Further, the file retrieval unit performs retrieval based on the feature tree according to the keywords provided by the user, and the method for sending the retrieved file to the user performs the following steps: acquiring a complete feature tree, labeling all nodes in the feature tree, and connecting the labels to form map information, wherein the map information comprises information of a starting point, information of a target point and information of a blocking node, and the information of the blocking node comprises a plurality of intermediate nodes; obtaining a plurality of first paths according to the starting point and the intermediate node; determining a first distance node which is closest to the starting point in the intermediate nodes according to the first path; obtaining a plurality of second paths according to the target points and the intermediate nodes; determining a second distance node which is closest to the target point in the intermediate nodes according to the second path; all other intermediate nodes between the first distance node and the second distance node are obtained by matching in sequence according to an information comparison table; and obtaining an optimal path from the starting point to the target point according to the first distance node, all other intermediate nodes obtained by matching and the second distance node, and retrieving by using the optimal path.
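As a rough illustration of searching for a path between a starting point and a target point over labeled nodes, the following sketch uses plain breadth-first search. This is a swapped-in, standard technique, not the matching against an information comparison table described above; the function name `shortest_path`, the integer node labels and the edge list are all hypothetical.

```python
from collections import deque


def shortest_path(edges, start, target):
    """Breadth-first search for a fewest-hops path in an undirected graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # target not reachable from start


# Hypothetical labels: node 0 is a file node, the others are intermediate nodes.
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
path = shortest_path(edges, 1, 4)
```

Here `path` lists the labeled nodes visited from the starting point to the target point, with the intermediate nodes in between.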
A file management method based on file segmentation and feature extraction is disclosed, and the method comprises the following steps: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; and the file searching unit searches based on the feature tree according to the keywords provided by the user and sends the searched files to the user.
Further, the method further comprises: using a file splicing unit to respond to a file acquisition command from the user by splicing all subfiles in the index association family corresponding to the target file, in subfile order according to the index association family, and sending the spliced file to the user.
Further, the file segmentation unit divides the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles of the archive file based on their sequence by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, associates the archive file blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; after the file has been written, a message indicating that the write succeeded is issued; when a new file is written, multiple file segmentation units carry out the write task synchronously, each unit executing the write task slices the file with the same algorithm and then stores it, and the write task is terminated once it is verified that N copies of the archive file exist in the network; each file segmentation unit creates a distributed hash table, which contains the information of the file segmentation unit, all archive files stored under that unit together with their structural relations, and the information of the file segmentation units in which each archive file is stored; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after file segmentation is complete, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
Further, the feature extraction unit extracts features from each archive file to obtain archive file features, extracts features from each subfile to obtain subfile features, and associates each subfile feature with the archive file features to form a feature tree with the archive file features as the starting node and the subfile features as branch nodes, by performing the following steps: determining a feature magnitude G and a fuzzy weight M; randomly initializing the feature prototypes, each of which represents an intelligent agent; determining the population size; setting the evolution generation counter E to 0; and updating the degree of membership using the following formula:
Figure BDA0002473355500000041
where i, j and s each denote a feature category; v is a feature center, with v_i, v_j and v_s denoting the feature centers of classes i, j and s respectively; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray-scale information; and x_k is the standard label; calculating the individual energy in the feature population according to the updated degree of membership to obtain a feature label, and characterizing the subfile according to the obtained feature label; and, based on the file features, establishing a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
Further, the method for calculating the individual energy in the feature population according to the updated degree of membership, and thereby obtaining the feature label, performs the following steps: the individual energy in the population is calculated using the following formula:
Figure BDA0002473355500000051
where zeta is an adjustment constant with a value in the range 30 to 50; according to the individual energy values obtained, the individuals in the feature population are divided evenly into three classes, and a different feature label is set for each class.
The file management system and method based on file segmentation and feature extraction have the following beneficial effects. During archive management, the archive file is divided and features are extracted from the divided archive file. In the prior art, storing a file usually requires a whole block of disk space to be set aside, and because of the sizes of the many files involved, the block set aside often cannot be fully used; storage space is wasted, and the larger the file, the greater the waste. Once the archive file has been divided, the resulting subfiles use storage space far more efficiently. During segmentation, an index association family is established for the subfiles; it records the relation between the subfiles divided from the same archive file, which facilitates subsequent file splicing and file searching without reducing the efficiency of file searching and use. During feature extraction, each subfile feature is associated with the archive file features to form a feature tree with the archive file features as the starting node and the subfile features as branch nodes, so that retrieval over the established feature tree is markedly more efficient.
Drawings
FIG. 1 is a system diagram of an archive management system based on document segmentation and feature extraction according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method of file segmentation and feature extraction-based archive management according to an embodiment of the present invention;
FIG. 3 compares the experimental curve of storage space utilization versus file size for the file segmentation and feature extraction-based archive management system and method of the present invention with the corresponding experimental curve of the prior art;
FIG. 4 compares the experimental curve of data retrieval efficiency for the file segmentation and feature extraction-based archive management system and method of the present invention with the corresponding experimental curve of the prior art;
FIG. 5 is a schematic diagram of a feature tree structure of a file management system and method based on file segmentation and feature extraction according to an embodiment of the present invention.
Wherein, 1-prior art experimental curve, 2-inventive experimental curve, 3-theoretical simulation experimental curve, and 4-reference point.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the following detailed description and the accompanying drawings:
Example 1
As shown in fig. 1, 3, 4 and 5, the archive management system based on file segmentation and feature extraction includes: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; the file retrieval unit is used for retrieving based on the feature tree according to the keywords provided by the user and sending the retrieved file to the user; and the file splicing unit is used for responding to a file acquisition command of a user, splicing all files in the index association family corresponding to the target file in the sequence of the sub-files according to the index association family, and sending the spliced files to the user.
Specifically, during archive management, the archive file is divided and features are extracted from the divided archive file. In the prior art, storing a file usually requires a whole block of disk space to be set aside, and because of the sizes of the many files involved, the block set aside often cannot be fully used; storage space is wasted, and the larger the file, the greater the waste. Once the archive file has been divided, the resulting subfiles use storage space far more efficiently. During segmentation, an index association family is established for the subfiles; it records the relation between the subfiles divided from the same archive file, which facilitates subsequent file splicing and file searching without reducing the efficiency of file searching and use. During feature extraction, each subfile feature is associated with the archive file features to form a feature tree with the archive file features as the starting node and the subfile features as branch nodes, so that retrieval over the established feature tree is markedly more efficient.
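The split-then-splice round trip described in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions: the names `IndexFamily`, `split_file` and `splice_file`, the round-robin placement, and the use of in-memory dictionaries as stand-ins for hard disk partitions are all hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class IndexFamily:
    """Hypothetical record tying ordered subfiles back to one archive file."""
    file_id: str
    subfile_keys: list  # (partition index, storage key), kept in subfile order


def split_file(file_id, data, chunk_size, partitions):
    """Split data into subfiles and spread them over the given partitions."""
    keys = []
    for seq, start in enumerate(range(0, len(data), chunk_size)):
        part_no = seq % len(partitions)  # round-robin placement (assumption)
        key = f"{file_id}.part{seq}"
        partitions[part_no][key] = data[start:start + chunk_size]
        keys.append((part_no, key))
    return IndexFamily(file_id, keys)


def splice_file(family, partitions):
    """Rebuild the archive file by reading subfiles in index-family order."""
    return b"".join(partitions[p][key] for p, key in family.subfile_keys)


partitions = [{}, {}, {}]  # stand-ins for three hard disk partitions
original = b"archive record 2020-04-29: personnel file, 14 pages"
family = split_file("doc-001", original, 16, partitions)
restored = splice_file(family, partitions)
```

Because the index family records the subfile order, splicing does not depend on which partition each subfile landed in.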
Example 2
On the basis of the above embodiment, the file segmentation unit divides the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles of the archive file based on their sequence by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, associates the archive file blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; after the file has been written, a message indicating that the write succeeded is issued; when a new file is written, multiple file segmentation units carry out the write task synchronously, each unit executing the write task slices the file with the same algorithm and then stores it, and the write task is terminated once it is verified that N copies of the archive file exist in the network; each file segmentation unit creates a distributed hash table, which contains the information of the file segmentation unit, all archive files stored under that unit together with their structural relations, and the information of the file segmentation units in which each archive file is stored; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after file segmentation is complete, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
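The fixed-size slicing and content-derived hashing in the steps above can be sketched as follows. This is a simplified illustration, not the patented implementation: SHA-256, the 256-byte block size and the function names `chunk_hashes` and `root_hash` are assumptions, and the graph is reduced to a single parent node linking all block hashes rather than a multi-level DAG.

```python
import hashlib


def chunk_hashes(data, chunk_size):
    """Hash each fixed-size block of the file content with SHA-256."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def root_hash(block_hashes):
    """Derive one identifier from the ordered block hashes.

    Simplification: a single parent node over all blocks.
    """
    parent = hashlib.sha256()
    for h in block_hashes:
        parent.update(bytes.fromhex(h))
    return parent.hexdigest()


file_a = b"A" * 1000
file_b = b"A" * 999 + b"B"  # differs from file_a only in the last byte
hashes_a = chunk_hashes(file_a, 256)
hashes_b = chunk_hashes(file_b, 256)
root_a = root_hash(hashes_a)
root_b = root_hash(hashes_b)
```

Because every hash is computed from content alone, two units slicing the same file with the same algorithm arrive at the same block hashes and root hash, while a one-byte change alters only the affected block hash and the root.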
Specifically, a hash table is a data structure that is accessed directly by key value. That is, it accesses a record by mapping its key to a location in the table, which speeds up lookup. This mapping function is called a hash function, and the array that stores the records is called a hash table.
Given a table M and a function f(key) such that, for any given key, substituting the key into the function yields the address in the table of the record containing that key, the table M is called a hash table and the function f(key) a hash function.
A hash function makes access to a data sequence faster and more efficient, because data elements can be located more quickly.
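A minimal sketch of this definition follows. The class name `HashTable`, the use of separate chaining for keys that map to the same slot, and the sample keys are illustrative assumptions; Python's built-in `hash` plays the role of f(key).

```python
class HashTable:
    """Minimal hash table using separate chaining (an illustrative choice)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _address(self, key):
        # f(key): map the key to a slot in the table.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite the existing record
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return default


table = HashTable()
table.put("doc-001.part0", "partition 0")
table.put("doc-001.part1", "partition 1")
table.put("doc-001.part0", "partition 2")  # same key: record is replaced
```

A lookup only scans the one bucket the key maps to, which is where the speed-up over a linear scan of all records comes from.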
In practice, different hash functions are adopted for different situations. The factors generally considered are: the time required to compute the hash function; the length of the key; the size of the hash table; the distribution of the keys; and the lookup frequency of the records.
1. Direct addressing method: the key, or a linear function of the key, is taken as the hash address, that is, H(key) = key or H(key) = a · key + b, where a and b are constants (this kind of hash function is called a self function). If the address H(key) is already occupied, the search moves on to the next address until a free one is found, and the record is placed there.
2. Digit analysis method: analyze a set of data, for example the dates of birth of a group of employees. The leading digits of the dates are largely the same, so using them gives a high chance of collisions, whereas the trailing digits for the month and day differ widely, so forming the hash address from those digits significantly reduces the chance of collisions. The digit analysis method therefore looks for regularities in the digits and uses them to construct hash addresses with a low collision probability.
3. Mid-square method: when it cannot be determined which digits of the key are evenly distributed, the key is first squared and the middle digits of the square are then taken as the hash address, as many as required. The middle digits of the square depend on every digit of the key, so different keys produce different hash addresses with higher probability.
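The three construction methods listed above can be sketched as small functions. The function names, the default constants, and the choice of which digit positions to keep are illustrative assumptions; the keys are treated as non-negative integers.

```python
def direct_address(key, a=1, b=0):
    """Direct addressing: H(key) = a * key + b."""
    return a * key + b


def digit_analysis(key, positions):
    """Digit analysis: keep only the digit positions that vary across keys."""
    digits = str(key)
    return int("".join(digits[p] for p in positions))


def mid_square(key, width=2):
    """Mid-square: square the key and take `width` digits from the middle."""
    squared = str(key * key)
    start = (len(squared) - width) // 2
    return int(squared[start:start + width])


h1 = direct_address(7, a=2, b=3)                     # 2 * 7 + 3
h2 = digit_analysis(19850412, positions=(4, 5, 6, 7))  # month and day digits
h3 = mid_square(1234)                                # 1234 ** 2 == 1522756
```

In the digit analysis call, the key is a date of birth written as YYYYMMDD, and only the month and day digits, which vary most between employees, are kept.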
Example 3
On the basis of the above embodiment, the feature extraction unit forms the feature tree, with the archive file features as the starting node and the subfile features as branch nodes, by performing the following steps: determining a feature magnitude G and a fuzzy weight M; randomly initializing the feature prototypes, each of which represents an intelligent agent; determining the population size; setting the evolution generation counter E to 0; and updating the degree of membership using the following formula:
Figure BDA0002473355500000071
where i, j and s each denote a feature category; v is a feature center, with v_i, v_j and v_s denoting the feature centers of classes i, j and s respectively; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray-scale information; and x_k is the standard label; calculating the individual energy in the feature population according to the updated degree of membership to obtain a feature label, and characterizing the subfile according to the obtained feature label; and, based on the file features, establishing a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
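The membership formula itself is only available as an image and is not reproduced here. As a stand-in, the sketch below shows the membership update of standard fuzzy c-means clustering, which likewise relates data points to feature centers through a fuzziness exponent. Treating the fuzzy weight M as the fuzzifier m, the function name `update_memberships`, and the three-dimensional sample points are all assumptions; the patent's own formula may differ.

```python
def update_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update (a stand-in, not the patent's).

    u[k][i] = 1 / sum_j (d(x_k, v_i) / d(x_k, v_j)) ** (2 / (m - 1))
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        d = [dist(x, v) for v in centers]
        if 0.0 in d:
            # Point sits exactly on a center: full membership there.
            row = [1.0 if di == 0.0 else 0.0 for di in d]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((di / dj) ** exponent for dj in d) for di in d]
        memberships.append(row)
    return memberships


centers = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
points = [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0), (5.0, 5.0, 5.0)]
u = update_memberships(points, centers)
```

Each row of `u` sums to 1, and a point closer to a center receives a higher degree of membership in that center's class.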
In particular, tree data structures are an important class of non-linear data structures. The tree data structure may represent a one-to-many relationship between data table elements. The tree and the binary tree are most commonly used, and the tree is a hierarchical structure defined by a branch relation in an intuitive view. Tree data structures are widely available in the objective world, for example, the genealogy of human society and various social organizations can be represented visually by tree data structures.
In computer science, a tree is an Abstract Data Type (ADT) or data structure that implements such an abstract data type to model a collection of data that has the nature of a tree structure. It is a set with hierarchy relationship composed of n (n >0) finite nodes. It is called a "tree" because it looks like an inverted tree, i.e., it is root up and leaf down. It has the following characteristics: each node has zero or more child nodes; nodes without parents are called root nodes; each non-root node has only one father node;
each child node, except the root node, may be divided into a plurality of disjoint sub-trees;
tree data structures are an important class of non-linear data structures. Tree data structures are widely used in the field of computers, for example, in a compiler, a syntax structure of a source program can be represented by a tree. As in database systems, a tree data structure is also one of the important organizational forms of information. And in file management, the multi-level directory structure adopts a tree data structure.
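By way of a hedged illustration, a feature tree with the file feature as the starting node and the subfile features as branch nodes may be sketched in Python as follows. The class name FeatureNode, the helper build_feature_tree and the use of plain strings as features are hypothetical choices made for this sketch; the disclosure does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FeatureNode:
    """One node of the feature tree; `feature` is a hypothetical string label."""
    feature: str
    children: List["FeatureNode"] = field(default_factory=list)

    def add_child(self, feature: str) -> "FeatureNode":
        child = FeatureNode(feature)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, feature) pairs in pre-order, root first."""
        yield depth, self.feature
        for child in self.children:
            yield from child.walk(depth + 1)


def build_feature_tree(file_feature: str, subfile_features: List[str]) -> FeatureNode:
    """Use the file feature as the starting node and attach each subfile feature as a branch."""
    root = FeatureNode(file_feature)
    for subfile_feature in subfile_features:
        root.add_child(subfile_feature)
    return root
```

Each non-root node in this sketch has exactly one parent, consistent with the tree characteristics listed above, and the order of the branch nodes follows the order in which the subfile features are supplied.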
Example 4
On the basis of the previous embodiment, the method for calculating the individual energy in the feature population according to the updated membership degree so as to obtain the feature labels performs the following steps: the individual energies in the population are calculated using the following formulas:
Figure BDA0002473355500000081
Figure BDA0002473355500000082
wherein zeta is an adjusting constant with a value range of 30 to 50; according to the obtained individual energy values, the individuals in the feature population are divided evenly into three classes, and a different feature label is assigned to each of the three classes.
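The three-way division described above may be sketched as follows. This Python sketch takes already-computed individual energy values as input, because the energy formulas themselves are given only in the figures; the function name, the ranking by ascending energy and the label names "low", "medium" and "high" are assumptions made for illustration.

```python
from typing import Dict, Sequence


def label_by_energy(
    energies: Dict[str, float],
    labels: Sequence[str] = ("low", "medium", "high"),
) -> Dict[str, str]:
    """Rank individuals by energy and split the ranking into three equal parts."""
    ranked = sorted(energies, key=energies.get)  # ascending energy
    count = len(ranked)
    result: Dict[str, str] = {}
    for rank, name in enumerate(ranked):
        # Integer arithmetic maps ranks 0..count-1 onto groups 0, 1 and 2.
        group = min(rank * 3 // count, 2)
        result[name] = labels[group]
    return result
```

When the population size is not a multiple of three, the groups in this sketch differ in size by at most one individual.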
Example 5
On the basis of the previous embodiment, the file retrieval unit performs retrieval based on the feature tree according to the keywords provided by the user, and the method for sending the retrieved file to the user performs the following steps: acquiring a complete feature tree, labeling all nodes in the feature tree, and connecting the labels to form map information, wherein the map information comprises information of a starting point, information of a target point and information of a blocking node, and the information of the blocking node comprises a plurality of intermediate nodes; obtaining a plurality of first paths according to the starting point and the intermediate node; determining a first distance node which is closest to the starting point in the intermediate nodes according to the first path; obtaining a plurality of second paths according to the target points and the intermediate nodes; determining a second distance node which is closest to the target point in the intermediate nodes according to the second path; all other intermediate nodes between the first distance node and the second distance node are obtained by matching in sequence according to an information comparison table; and obtaining an optimal path from the starting point to the target point according to the first distance node, all other intermediate nodes obtained by matching and the second distance node, and retrieving by using the optimal path.
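As a simplified, non-limiting sketch of retrieving along a path through the labeled nodes, the following Python code finds a shortest path from a starting point to a target point in map information represented as an adjacency list. Breadth-first search is used here as a stand-in; it is not the first-distance-node, second-distance-node and comparison-table matching procedure of this embodiment. The graph layout and node labels are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(
    graph: Dict[str, List[str]], start: str, target: str
) -> Optional[List[str]]:
    """Return the node labels on a shortest path from start to target, or None."""
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the recorded parents to rebuild the path.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None
```

In this sketch every edge has the same cost, so the path with the fewest intermediate nodes is treated as the optimal path; a weighted variant would be needed if the labels carried distances.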
Example 6
A file management method based on file segmentation and feature extraction is disclosed, and the method comprises the following steps: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; and the file searching unit searches based on the feature tree according to the keywords provided by the user and sends the searched files to the user.
Example 7
On the basis of the above embodiment, the method further includes: using a file splicing unit to respond to a file acquisition command of a user by splicing the subfiles in the index association family corresponding to the target file, in the subfile order recorded in the index association family, and sending the spliced file to the user.
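The splicing step may be illustrated by the following Python sketch, which assumes that the index association family is an ordered list of subfile keys and that the subfiles are held in a dictionary acting as the store. Both representations, and the function name splice_subfiles, are assumptions made for illustration.

```python
from typing import Dict, List


def splice_subfiles(index_family: List[str], store: Dict[str, bytes]) -> bytes:
    """Concatenate the subfiles in the order given by the index association family."""
    missing = [key for key in index_family if key not in store]
    if missing:
        # Refuse to return a partial file if any subfile cannot be found.
        raise KeyError(f"missing subfiles: {missing}")
    return b"".join(store[key] for key in index_family)
```

The order of the keys in the list, not the order of the entries in the store, determines the splicing sequence.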
Example 8
On the basis of the above embodiment, the file segmentation unit splits the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles of the archive file based on the subfile order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, links the archive file blocks in a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the archive file block hashes and the root hash are generated from the actual content of the archive file, so that different files produce different hash values; after the file is written, a write-success prompt is issued; when a new file is written, a plurality of file segmentation units perform the write task synchronously, each file segmentation unit executing the write task slices the file with the same algorithm before storing it, and the write task is terminated once it is verified that N copies of the archive file exist in the network; each file segmentation unit creates a distributed hash table, which contains the information of the file segmentation unit, all archive files stored under the file segmentation unit together with their structural relationships, and the file segmentation unit information recorded for the archive files; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after the file segmentation is completed, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
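The fixed-size slicing and content-based hashing described above may be sketched as follows. This Python sketch assumes SHA-256 as the hash function and derives the root hash from the ordered block hashes in a single level, which is a simplification of a Merkle directed acyclic graph; the disclosure does not name a hash function, a block size or a concrete graph layout, and the function name is hypothetical.

```python
import hashlib
from typing import List, Tuple


def split_and_hash(data: bytes, chunk_size: int = 4) -> Tuple[str, List[Tuple[str, bytes]]]:
    """Slice `data` into fixed-size blocks and hash each block and the whole file.

    Returns (root_hash, [(block_hash, block_bytes), ...]) in subfile order.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    blocks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hashed = [(hashlib.sha256(block).hexdigest(), block) for block in blocks]
    # Single-level simplification: the root hash covers the ordered block hashes.
    joined = "".join(block_hash for block_hash, _ in hashed)
    root_hash = hashlib.sha256(joined.encode("ascii")).hexdigest()
    return root_hash, hashed
```

The ordered list of block hashes returned by this sketch could serve as the index association family of the file, since it records both the identity and the sequence of the subfiles.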
Example 9
On the basis of the above embodiment, the method by which the feature extraction unit forms the feature tree, with the file feature as the starting node and the subfile features as branch nodes, includes the following steps: determining a feature magnitude G and a fuzzy weight M; randomly initializing feature prototypes, wherein each feature prototype represents one intelligent agent; determining the size of the population; and setting the evolution generation E to 0. The degree of membership is updated using the following formula:
Figure BDA0002473355500000101
wherein i, j and s each denote a feature category; V is the feature center, with v_i, v_j and v_s denoting the feature centers of categories i, j and s respectively; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray scale information; and x_k is a standard label. The individual energy in the feature population is calculated from the updated membership degree to obtain feature labels, and the features of the subfiles are extracted according to the obtained feature labels. Based on the file features, a feature tree is established in which the file feature of the archive is the starting node and the subfile features are the branch nodes.
Specifically, a parallel genetic algorithm is a genetic algorithm that has been redesigned for parallel execution; it is a multi-population, parallel-evolution genetic algorithm suited to complex optimization problems. Such an algorithm can effectively mitigate the premature convergence problem of the standard genetic algorithm and has a stronger global search capability.
Genetic algorithms are an effective search method based on the principles of natural selection and genetics, and have been applied successfully in many fields to obtain satisfactory solutions. Although a genetic algorithm can generally find a satisfactory solution within a reasonable time, the need to increase its running speed becomes more pressing as the problems to be solved grow in complexity and difficulty. Genetic algorithms are inherently parallel and well suited to implementation on large-scale parallel computers, and the increasing availability of such computers provides the material basis for parallel genetic algorithms. Implementing a parallel genetic algorithm involves not only converting the serial genetic algorithm into an equivalent parallel scheme but, more importantly, restructuring the genetic algorithm into a form that is easy to parallelize, so as to form a parallel population model.
Example 10
On the basis of the previous embodiment, the method for calculating the individual energy in the feature population according to the updated membership degree so as to obtain the feature labels performs the following steps: the individual energies in the population are calculated using the following formulas:
Figure BDA0002473355500000102
Figure BDA0002473355500000103
wherein zeta is an adjusting constant with a value range of 30 to 50; according to the obtained individual energy values, the individuals in the feature population are divided evenly into three classes, and a different feature label is assigned to each of the three classes.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the present invention; any structural changes made according to the present invention without departing from its spirit should be considered as falling within the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the system provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (4)

1. Archive management system based on file segmentation and feature extraction, characterized in that, the system includes: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; the file retrieval unit is used for retrieving based on the feature tree according to the keywords provided by the user and sending the retrieved file to the user; the file splicing unit is used for responding to a file acquisition command of a user, splicing the subfiles of all files in the index association family corresponding to the target file according to the index association family by taking the sequence of the subfiles as a splicing sequence, and sending the spliced files to the user;
the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating the characteristics of each subfile with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes, and the method comprises the following steps: determining a feature magnitude G and a fuzzy weight M, randomly initializing feature prototypes, wherein each feature prototype represents an intelligent agent, determining the size of a population, and setting the evolution generation E to 0; the degree of membership is updated using the following formula:
Figure FDA0002958508680000011
Figure FDA0002958508680000012
wherein i, j and s each denote a feature class; v_i, v_j and v_s denote the feature centers of classes i, j and s respectively; a, b and c denote the indices of the current feature data corresponding to each dimension of the three-dimensional gray scale information; x_k is a standard label; calculating individual energy in the feature population according to the updated membership degree so as to obtain a feature label, and extracting the features of the subfiles according to the obtained feature label; based on the file characteristics, establishing a characteristic tree in which the file characteristics of the file are initial nodes and the subfile characteristics are branch nodes;
the method for calculating the individual energy in the characteristic population according to the updated membership degree so as to obtain the characteristic label executes the following steps: the individual energies in the population were calculated using the following formula:
Figure FDA0002958508680000013
Figure FDA0002958508680000014
wherein V is the feature center, and zeta is an adjusting constant with a value range of 30 to 50; according to the obtained individual energy values, the individuals in the feature population are divided evenly into three classes, and a different feature label is assigned to each of the three classes.
2. The system of claim 1, wherein the file splitting unit splits the digitized archive file, stores the split subfiles in different hard disk partitions, and wherein the method for creating an index association family for the split subfiles of the archive file based on subfile order performs the following steps: the file segmentation unit performs fixed-size slicing processing on the received archive files, generates a unique hash value for each archive file block, simultaneously associates the archive file blocks by an archive file structure of a Merkle directed acyclic graph, and generates a root hash as a hash identifier of the file; generating an archive file block hash and a root hash according to the actual content of the archive file, wherein different files generate different hash values; after the file is written, the success of writing the file is prompted; when a new file is written, a plurality of file segmentation units are arranged and synchronously perform the writing task, the file segmentation units executing the writing task slice the file by the same algorithm and then store the file, and when it is verified that N archive file copies exist in the network, the task of writing the archive file is terminated; each file segmentation unit creates a distributed hash table, and the distributed hash table comprises file segmentation unit information, all archive files and archive file structure relations stored under the file segmentation unit, and file segmentation unit information stored in the archive files; when a new archive file is written, updating the hash table and synchronizing information with other file segmentation units; and after the file segmentation is completed, establishing an index association family by using the address of the file in the hash table based on the hash table corresponding to each subfile.
3. The system of claim 1, wherein the document retrieval unit performs a retrieval based on the feature tree based on a keyword provided by the user, and the method of transmitting the retrieved document to the user performs the steps of: acquiring a complete feature tree, labeling all nodes in the feature tree, and connecting the labels to form map information, wherein the map information comprises information of a starting point, information of a target point and information of a blocking node, and the information of the blocking node comprises a plurality of intermediate nodes; obtaining a plurality of first paths according to the starting point and the intermediate node; determining a first distance node which is closest to the starting point in the intermediate nodes according to the first path; obtaining a plurality of second paths according to the target points and the intermediate nodes; determining a second distance node which is closest to the target point in the intermediate nodes according to the second path; all other intermediate nodes between the first distance node and the second distance node are obtained by matching in sequence according to an information comparison table; and obtaining an optimal path from the starting point to the target point according to the first distance node, all other intermediate nodes obtained by matching and the second distance node, and retrieving by using the optimal path.
4. A method of archive management using the system of any of claims 1-3.
CN202010355696.5A 2020-04-29 2020-04-29 File management system and method based on file segmentation and feature extraction Active CN111680198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010355696.5A CN111680198B (en) 2020-04-29 2020-04-29 File management system and method based on file segmentation and feature extraction

Publications (2)

Publication Number Publication Date
CN111680198A CN111680198A (en) 2020-09-18
CN111680198B true CN111680198B (en) 2021-05-11

Family

ID=72452655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010355696.5A Active CN111680198B (en) 2020-04-29 2020-04-29 File management system and method based on file segmentation and feature extraction

Country Status (1)

Country Link
CN (1) CN111680198B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632075A (en) * 2020-12-25 2021-04-09 创新科技术有限公司 Storage and reading method and device of cluster metadata
CN112597422A (en) * 2020-12-30 2021-04-02 深圳市世强元件网络有限公司 PDF file segmentation method and PDF file loading method in webpage
CN113127421A (en) * 2021-04-01 2021-07-16 山东英信计算机技术有限公司 Method and equipment for searching file content in storage system
CN114896210A (en) * 2022-04-27 2022-08-12 中国航空工业集团公司沈阳飞机设计研究所 Airplane test flight test data processing method and system, electronic equipment and medium thereof
CN114943067B (en) * 2022-07-13 2022-11-08 河北汇金集团股份有限公司 File archive anti-counterfeiting identification method based on OCR technology
CN115794496A (en) * 2023-02-07 2023-03-14 中信天津金融科技服务有限公司 Archive storage method and system based on information extraction

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102117324B (en) * 2011-02-24 2012-09-05 上海北大方正科技电脑***有限公司 File management method and management system applying fuzzy matrice
CN103914497A (en) * 2013-01-08 2014-07-09 仁宝电脑工业股份有限公司 Management method and system for quick access files
CN104657513A (en) * 2015-03-20 2015-05-27 烟台威尔数据***有限公司 File operation and rapid retrieval method in embedded system
CN105205581A (en) * 2014-06-30 2015-12-30 国网上海市电力公司 Power-supply-enterprise electronic file safety risk evaluation system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10303363B2 (en) * 2016-10-19 2019-05-28 Acronis International Gmbh System and method for data storage using log-structured merge trees

Non-Patent Citations (1)

Title
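Practice and reflections on archival compilation and research work in universities (高校档案编研工作的实践与思考); Zhou Yaqin (周雅琴); 《档案时空》; 20161231 (No. 12); pp. 41-43 *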

Also Published As

Publication number Publication date
CN111680198A (en) 2020-09-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220111

Address after: 130000 room 524, building 1, radio Street relocation community (Golden coordinate), Chaoyang District, Changchun City, Jilin Province

Patentee after: Jilin Zhiguang Hengsheng Technology Co.,Ltd.

Address before: 316002 No.1 Haida South Road, Changzhi Island, Lincheng street, Dinghai District, Zhoushan City, Zhejiang Province

Patentee before: Zhejiang Ocean University