CN111680198A - File management system and method based on file segmentation and feature extraction - Google Patents
File management system and method based on file segmentation and feature extraction
- Publication number
- CN111680198A CN202010355696.5A
- Authority
- CN
- China
- Prior art keywords
- file
- feature
- archive
- subfile
- files
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/137—Hash-based
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of file processing, and in particular relates to an archive management system and method based on file segmentation and feature extraction. The system comprises: a file segmentation unit, configured to segment a digitized archive file, store the resulting subfiles in different hard disk partitions, and build an index association family for the subfiles of that archive file based on the order of the subfiles; a feature extraction unit, configured to extract features from each archive file to obtain archive file features, extract features from each subfile produced by segmentation to obtain subfile features, and associate each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features; and a file retrieval unit, configured to search the feature tree according to keywords provided by the user and send the retrieved file to the user. This improves both storage space utilization and retrieval efficiency.
Description
Technical Field
The invention belongs to the technical field of file management, and particularly relates to a file management system and method based on file segmentation and feature extraction.
Background
Archive management comprises the collection, arrangement, storage, appraisal, statistics and provision for use of archives. It covers the following tasks: archive collection, archive arrangement, archive value appraisal, archive storage, archive cataloguing and retrieval, archive statistics, archive compilation and research (see archive documentation), and archive utilization. This division into eight tasks is relatively stable rather than absolute; the work is also divided into six links, or into two major parts, basic work and utilization work. Because modern archive management has become a complex system, there are also multi-level divisions. At the first level the work is divided into two subsystems, archive entity management and archive information development, and each subsystem is further divided into several smaller systems.
Archive entity management comprises work links such as collection, arrangement, appraisal, storage and statistics. Archive information development is divided into information processing and information output: information processing comprises cataloguing, document compilation and reference material compilation, and information output comprises service activities such as reading, copying, consultation, transfer, lending, publication and exhibition. In operation, the archive management system and its subsystems form a feedback mechanism. The modernization of archive management is changing the structure of archive management work. The ultimate purpose of archive management is to provide archive information in the service of social practice, and the structure of the archive management system is set according to this purpose. Each link is indispensable and follows a certain order. Together the links form an organic whole: each plays its own role in realizing the overall function of the archive management system, and the links are interrelated and mutually constraining. For example, value appraisal is sometimes carried out together with collection and arrangement, and a preliminary appraisal may even be made when a document is archived.
Social modernization, office automation and paperless work have greatly changed how archives are generated. When documents are drafted, issued, followed up and filed within computers and communication lines, the predecessors of archives are mainly machine-readable files, so archives naturally exist in machine-readable form and are used very differently from paper archives. This change means that archive management will increasingly face machine-readable media, such as disks, as the carriers of archives. Information retrievers are interested in the content of the information, which may come from archives in different machine-readable forms. Archivists are responsible for providing this archive information comprehensively and systematically, and for providing valuable archive information without delay. Careful processing is needed to make machine-readable archive information systematic, authentic and valuable, and to give users better service. The digitization of archive information is therefore an inevitable trend in the development of archive utilization.
Disclosure of Invention
The main purpose of the invention is to provide an archive management system and method based on file segmentation and feature extraction that improve both storage space utilization and retrieval efficiency.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
An archive management system based on file segmentation and feature extraction, the system comprising: a file segmentation unit, configured to segment a digitized archive file, store the resulting subfiles in different hard disk partitions, and build an index association family for the subfiles of that archive file based on the order of the subfiles; a feature extraction unit, configured to extract features from each archive file to obtain archive file features, extract features from each subfile produced by segmentation to obtain subfile features, and associate each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features; a file retrieval unit, configured to search the feature tree according to keywords provided by the user and send the retrieved file to the user; and a file splicing unit, configured to respond to a user's file acquisition command by splicing all subfiles in the index association family of the target file in subfile order, and send the spliced file to the user.
Further, the file segmentation unit segments the digitized archive file, stores the resulting subfiles in different hard disk partitions, and builds an index association family for the subfiles based on their order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, links the blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash that serves as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; once the file has been written, a write-success message is issued; when a new file is written, multiple file segmentation units perform the write task synchronously, each slicing the file with the same algorithm before storing it, and the write task terminates once N copies of the archive file are verified to exist in the network; each file segmentation unit creates a distributed hash table containing the information of that file segmentation unit, all archive files stored under it together with their structural relationships, and the information of the file segmentation units on which each archive file is stored; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after segmentation is complete, the index association family is built from the addresses of the file in the hash table, based on the hash table entries corresponding to each subfile.
Further, the feature extraction unit extracts features from each archive file to obtain archive file features, extracts features from each subfile produced by segmentation to obtain subfile features, and associates each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features, by performing the following steps: determining a feature magnitude G and a fuzzy weight M, randomly initializing the feature prototypes, each of which represents one intelligent agent, determining the population size, and setting the evolution generation counter E to 0; updating the degree of membership using the following formula: [formula image not reproduced], where i, j and s denote feature categories, v is a feature center, $v_i$ denotes the feature center of class i, $v_j$ the feature center of class j, and $v_s$ the feature center of class s; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray scale information; and $x_k$ is a standard label; calculating the individual energies in the feature population from the updated membership degrees to obtain feature labels, and characterizing the subfiles according to the obtained feature labels; and, based on the file features, building a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
Further, the individual energies in the feature population are calculated from the updated membership degrees, and the feature labels obtained, by performing the following steps: calculating the individual energies in the population using the following formula: [formula image not reproduced], where zeta (ζ) is an adjustment constant with a value in the range 30 to 50; and, according to the resulting individual energy values, dividing the individuals in the feature population evenly into three classes and assigning a different feature label to each class.
Further, the file retrieval unit searches the feature tree according to the keywords provided by the user and sends the retrieved file to the user by performing the following steps: obtaining the complete feature tree, labeling every node in it, and connecting the labels to form map information, the map information comprising starting-point information, target-point information and blocking-node information, the blocking-node information comprising a plurality of intermediate nodes; obtaining a plurality of first paths from the starting point and the intermediate nodes; determining from the first paths the first distance node, namely the intermediate node closest to the starting point; obtaining a plurality of second paths from the target point and the intermediate nodes; determining from the second paths the second distance node, namely the intermediate node closest to the target point; obtaining all other intermediate nodes between the first distance node and the second distance node by matching them in sequence against an information comparison table; and obtaining the optimal path from the starting point to the target point from the first distance node, the matched intermediate nodes and the second distance node, and performing the retrieval along that optimal path.
An archive management method based on file segmentation and feature extraction, the method comprising the following steps: segmenting, by a file segmentation unit, a digitized archive file, storing the resulting subfiles in different hard disk partitions, and building an index association family for the subfiles of that archive file based on the order of the subfiles; extracting, by a feature extraction unit, features from each archive file to obtain archive file features, extracting features from each subfile produced by segmentation to obtain subfile features, and associating each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features; and searching, by a file retrieval unit, the feature tree according to keywords provided by the user and sending the retrieved file to the user.
Further, the method also comprises: responding, by a file splicing unit, to a user's file acquisition command by splicing all subfiles in the index association family of the target file in subfile order, and sending the spliced file to the user.
Further, the file segmentation unit segments the digitized archive file, stores the resulting subfiles in different hard disk partitions, and builds an index association family for the subfiles based on their order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, links the blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash that serves as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; once the file has been written, a write-success message is issued; when a new file is written, multiple file segmentation units perform the write task synchronously, each slicing the file with the same algorithm before storing it, and the write task terminates once N copies of the archive file are verified to exist in the network; each file segmentation unit creates a distributed hash table containing the information of that file segmentation unit, all archive files stored under it together with their structural relationships, and the information of the file segmentation units on which each archive file is stored; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after segmentation is complete, the index association family is built from the addresses of the file in the hash table, based on the hash table entries corresponding to each subfile.
Further, the feature extraction unit extracts features from each archive file to obtain archive file features, extracts features from each subfile produced by segmentation to obtain subfile features, and associates each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features, by performing the following steps: determining a feature magnitude G and a fuzzy weight M, randomly initializing the feature prototypes, each of which represents one intelligent agent, determining the population size, and setting the evolution generation counter E to 0; updating the degree of membership using the following formula: [formula image not reproduced], where i, j and s denote feature categories, v is a feature center, $v_i$ denotes the feature center of class i, $v_j$ the feature center of class j, and $v_s$ the feature center of class s; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray scale information; and $x_k$ is a standard label; calculating the individual energies in the feature population from the updated membership degrees to obtain feature labels, and characterizing the subfiles according to the obtained feature labels; and, based on the file features, building a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
Further, the individual energies in the feature population are calculated from the updated membership degrees, and the feature labels obtained, by performing the following steps: calculating the individual energies in the population using the following formula: [formula image not reproduced], where zeta (ζ) is an adjustment constant with a value in the range 30 to 50; and, according to the resulting individual energy values, dividing the individuals in the feature population evenly into three classes and assigning a different feature label to each class.
The archive management system and method based on file segmentation and feature extraction have the following beneficial effects. During archive management, each archive file is segmented, and features are extracted from the segmented archive file. In the prior art, storing a file often requires allocating one whole block of disk space; because of the sizes of the files stored in it, the allocated block often cannot be fully used, which wastes storage space, and the larger the file, the greater the waste. Once the archive file is segmented, the subfiles make much better use of the storage space. During segmentation, an index association family is built for the subfiles; it records the relationship among the subfiles cut from the same archive file, which facilitates later file splicing and file searching, so the efficiency of searching for and using files is not reduced. During feature extraction, each subfile feature is associated with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features, and this feature tree significantly improves retrieval efficiency.
Drawings
FIG. 1 is a system diagram of an archive management system based on file segmentation and feature extraction according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method of file segmentation and feature extraction-based archive management according to an embodiment of the present invention;
FIG. 3 compares the experimental curve of storage space utilization versus file size for the archive management system and method of the present invention with the experimental curve of the prior art;
FIG. 4 compares the experimental curve of data retrieval efficiency for the archive management system and method of the present invention with the experimental curve of the prior art;
FIG. 5 is a schematic diagram of a feature tree structure of a file management system and method based on file segmentation and feature extraction according to an embodiment of the present invention.
In the figures: 1, prior art experimental curve; 2, experimental curve of the invention; 3, theoretical simulation experimental curve; 4, reference point.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the following detailed description and the accompanying drawings:
Example 1
As shown in FIGS. 1, 3, 4 and 5, the archive management system based on file segmentation and feature extraction comprises: a file segmentation unit, configured to segment a digitized archive file, store the resulting subfiles in different hard disk partitions, and build an index association family for the subfiles of that archive file based on the order of the subfiles; a feature extraction unit, configured to extract features from each archive file to obtain archive file features, extract features from each subfile produced by segmentation to obtain subfile features, and associate each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features; a file retrieval unit, configured to search the feature tree according to keywords provided by the user and send the retrieved file to the user; and a file splicing unit, configured to respond to a user's file acquisition command by splicing all subfiles in the index association family of the target file in subfile order, and send the spliced file to the user.
Specifically, during archive management, each archive file is segmented, and features are extracted from the segmented archive file. In the prior art, storing a file often requires allocating one whole block of disk space; because of the sizes of the files stored in it, the allocated block often cannot be fully used, which wastes storage space, and the larger the file, the greater the waste. Once the archive file is segmented, the subfiles make much better use of the storage space. During segmentation, an index association family is built for the subfiles; it records the relationship among the subfiles cut from the same archive file, which facilitates later file splicing and file searching, so the efficiency of searching for and using files is not reduced. During feature extraction, each subfile feature is associated with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features, and this feature tree significantly improves retrieval efficiency.
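The interplay between segmentation, the index association family and splicing can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `IndexFamily`, `split_archive` and `splice_archive` are hypothetical, in-memory dictionaries stand in for hard disk partitions, and subfiles are spread round-robin, which the text above does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class IndexFamily:
    """Ordered record of the subfiles cut from one archive file (hypothetical structure)."""
    archive_id: str
    # (sequence number, partition name) pairs, one per subfile
    entries: list = field(default_factory=list)


def split_archive(archive_id, data, chunk_size, partitions):
    """Cut data into chunks, spread them over the partitions, and record their order."""
    names = list(partitions)
    family = IndexFamily(archive_id)
    for seq, start in enumerate(range(0, len(data), chunk_size)):
        name = names[seq % len(names)]  # round-robin placement (assumption)
        partitions[name][(archive_id, seq)] = data[start:start + chunk_size]
        family.entries.append((seq, name))
    return family


def splice_archive(family, partitions):
    """Reassemble the archive file by reading its subfiles back in sequence order."""
    return b"".join(
        partitions[name][(family.archive_id, seq)]
        for seq, name in sorted(family.entries)
    )


# Dictionaries stand in for three hard disk partitions.
partitions = {"part_a": {}, "part_b": {}, "part_c": {}}
original = b"digitized archive file contents" * 4
family = split_archive("doc-001", original, 16, partitions)
restored = splice_archive(family, partitions)
```

Because the index family stores the sequence number of every subfile, splicing does not depend on which partition a subfile landed in.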
Example 2
On the basis of the above embodiment, the file segmentation unit segments the digitized archive file, stores the resulting subfiles in different hard disk partitions, and builds an index association family for the subfiles based on their order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, links the blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash that serves as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; once the file has been written, a write-success message is issued; when a new file is written, multiple file segmentation units perform the write task synchronously, each slicing the file with the same algorithm before storing it, and the write task terminates once N copies of the archive file are verified to exist in the network; each file segmentation unit creates a distributed hash table containing the information of that file segmentation unit, all archive files stored under it together with their structural relationships, and the information of the file segmentation units on which each archive file is stored; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after segmentation is complete, the index association family is built from the addresses of the file in the hash table, based on the hash table entries corresponding to each subfile.
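The fixed-size slicing and root-hash steps can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 is assumed as the hash function, and the blocks are linked as a binary Merkle tree, which is one simple Merkle DAG layout; the embodiment does not name a hash algorithm or a concrete graph shape. The function names are hypothetical.

```python
import hashlib


def chunk_hashes(data, chunk_size):
    """Slice data into fixed-size blocks and hash each block (SHA-256 is an assumption)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def merkle_root(hashes):
    """Combine block hashes pairwise until a single root hash remains."""
    if not hashes:
        return hashlib.sha256(b"").hexdigest()
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


blocks = chunk_hashes(b"archive file body " * 10, 32)
root = merkle_root(blocks)
```

The root hash depends only on the file content and the block size, so two units that slice the same file with the same algorithm arrive at the same identifier.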
Specifically, a hash table is a data structure that is accessed directly by key value. That is, it accesses a record by mapping the key value to a location in the table, which speeds up lookup. The mapping function is called a hash function, and the array that stores the records is called a hash table.
Given a table M and a function f(key) such that substituting any given key value key into the function yields the address of the record in the table that contains that key, the table M is called a hash table and the function f(key) is called a hash function.
A hash function makes access to a data sequence faster and more efficient, because data elements can be located more quickly.
In practice, different hash functions are chosen for different situations. The factors usually considered are: the time needed to compute the hash function; the length of the keys; the size of the hash table; the distribution of the keys; and the lookup frequency of the records.
1. Direct addressing method: the key, or a linear function of the key, is taken as the hash address, that is, H(key) = key or H(key) = a·key + b, where a and b are constants (this kind of hash function is called a self function). If the address H(key) is already occupied, the next address is tried, and so on until an unoccupied address is found, where the record is stored.
2. Digit analysis method: analyze a set of data, for example the dates of birth of a group of employees. The leading digits of the dates are largely the same, so using them would make collisions likely, whereas the trailing digits, which encode the month and the day, differ widely; forming the hash address from those trailing digits significantly reduces the chance of collisions. The digit analysis method therefore looks for regularities in the digits and uses them to construct hash addresses with a low collision probability.
3. Mid-square method: when it cannot be determined which digits of the key are evenly distributed, the key is first squared, and the middle digits of the square are then taken as the hash address, as many as are needed. The middle digits of the square depend on every digit of the key, so different keys are more likely to produce different hash addresses.
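The three constructions listed above can be sketched in Python as follows. The function names, the choice of decimal digits and the table size are illustrative assumptions; `insert_with_probing` shows the "try the next address" rule mentioned under direct addressing.

```python
def direct_address(key, a=1, b=0):
    """Direct addressing: H(key) = a * key + b (a = 1, b = 0 gives H(key) = key)."""
    return a * key + b


def digit_analysis(key, positions):
    """Digit analysis: build the address from the digit positions that vary most."""
    return int("".join(key[p] for p in positions))


def mid_square(key, digits=2):
    """Mid-square: square the key and keep the middle decimal digits of the result."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])


def insert_with_probing(table, key, address):
    """Store key at address, moving to the next slot while the slot is occupied."""
    size = len(table)
    for step in range(size):
        slot = (address + step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 5
first = insert_with_probing(table, 12, direct_address(12) % 5)
second = insert_with_probing(table, 17, direct_address(17) % 5)  # collides, moves on
```

Here `digit_analysis("19900315", (5, 6, 7))` keeps only the trailing digits of a birth date written as year, month, day, which is the situation described in item 2.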
Example 3
On the basis of the above embodiment, the feature extraction unit extracts features from each archive file to obtain archive file features, extracts features from each subfile produced by segmentation to obtain subfile features, and associates each subfile feature with the archive file features to form a feature tree whose starting node is the archive file features and whose branch nodes are the subfile features, by performing the following steps: determining a feature magnitude G and a fuzzy weight M, randomly initializing the feature prototypes, each of which represents one intelligent agent, determining the population size, and setting the evolution generation counter E to 0; updating the degree of membership using the following formula: [formula image not reproduced], where i, j and s denote feature categories, v is a feature center, $v_i$ denotes the feature center of class i, $v_j$ the feature center of class j, and $v_s$ the feature center of class s; a, b and c denote the indices of the current data to be characterized along each dimension of the three-dimensional gray scale information; and $x_k$ is a standard label; calculating the individual energies in the feature population from the updated membership degrees to obtain feature labels, and characterizing the subfiles according to the obtained feature labels; and, based on the file features, building a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
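The membership formula of this embodiment is not reproduced in the text, so it cannot be illustrated directly. As a stand-in, the sketch below uses the standard fuzzy c-means membership update, $u_{ik} = 1 / \sum_j (\lVert x_k - v_i \rVert / \lVert x_k - v_j \rVert)^{2/(m-1)}$, which may differ from the patent's own formula. Treating the fuzzy weight M as the exponent `m`, using three-dimensional points, and the function name `update_membership` are all assumptions.

```python
import math


def update_membership(points, centers, m=2.0):
    """Standard fuzzy c-means update (a stand-in, not the patent's formula).

    Returns rows u[k][i]: the membership of point k in the class with center i.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, v) for v in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share membership among them.
            zero = [d == 0.0 for d in dists]
            row = [1.0 / sum(zero) if z else 0.0 for z in zero]
        else:
            row = [
                1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                for d_i in dists
            ]
        memberships.append(row)
    return memberships


centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
points = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (0.9, 1.0, 1.0)]
u = update_membership(points, centers)
```

Each row of `u` sums to 1, and a point lying midway between two centers receives equal membership in both classes.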
Specifically, tree data structures are an important class of non-linear data structures. A tree data structure can represent one-to-many relationships between data elements. Trees and binary trees are the most commonly used; intuitively, a tree is a hierarchical structure defined by branching relationships. Tree structures are common in the real world: for example, genealogies and the structures of social organizations can be represented as trees.
In computer science, a tree is an abstract data type (ADT), or a data structure that implements this abstract data type, used to model a collection of data with a hierarchical tree structure. It is a hierarchical set composed of n (n > 0) nodes. It is called a "tree" because it looks like an inverted tree, with the root at the top and the leaves at the bottom. It has the following characteristics: each node has zero or more child nodes; a node without a parent is called the root node; each non-root node has exactly one parent node; and, apart from the root node, the remaining nodes can be divided into a plurality of disjoint subtrees.
Tree data structures are widely used in computing. For example, in a compiler, the syntax structure of a source program can be represented by a tree; in database systems, the tree is one of the important forms of organizing information; and in file management, a multi-level directory structure is a tree data structure.
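As an illustration of the tree properties just listed, the following minimal Python sketch builds a feature tree with one archive file as the starting node and its subfiles as branch nodes. The class name `FeatureNode`, the helper names, the naming scheme for subfiles, and the numeric feature values are all hypothetical; the source does not specify how features are represented.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class FeatureNode:
    """One node of a feature tree: a name plus its feature values."""
    name: str
    features: List[float]
    # The parent link is excluded from repr/compare to avoid cyclic traversal.
    parent: Optional["FeatureNode"] = field(default=None, repr=False, compare=False)
    children: List["FeatureNode"] = field(default_factory=list)

    def add_child(self, child: "FeatureNode") -> "FeatureNode":
        """Attach a child; every non-root node has exactly one parent."""
        child.parent = self
        self.children.append(child)
        return child


def build_feature_tree(archive_name: str,
                       archive_features: List[float],
                       subfile_features: List[List[float]]) -> FeatureNode:
    """Archive features form the starting node, subfile features the branches."""
    root = FeatureNode(archive_name, archive_features)
    for index, feats in enumerate(subfile_features):
        root.add_child(FeatureNode(f"{archive_name}/part{index}", feats))
    return root


def depth_first(node: FeatureNode) -> Iterator[FeatureNode]:
    """Visit a node, then each of its disjoint subtrees in order."""
    yield node
    for child in node.children:
        yield from depth_first(child)


tree = build_feature_tree("archive-001", [0.4, 0.7],
                          [[0.1, 0.2], [0.5, 0.9], [0.3, 0.3]])
names = [n.name for n in depth_first(tree)]
```

Here `names` lists the root first and then the three subfile nodes in segmentation order, which is the one-to-many relationship the text describes.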
Example 4
On the basis of the previous embodiment, the method for calculating the individual energy in the feature population according to the updated membership degree so as to obtain the feature label performs the following steps: calculating the individual energies in the population using the following formula: [formula not reproduced in the source text], wherein ζ is an adjusting constant with a value in the range 30 to 50; and, according to the obtained individual energy values, dividing the individuals in the feature population evenly into three classes and setting a different feature label for each of the three classes.
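The final labeling step can be sketched as follows. Because the energy formula is not reproduced in the text, the sketch takes the energy values as given input. Ranking individuals by energy before splitting them into three equal groups is an assumption about what "evenly divided" means, and the label names are hypothetical.

```python
from typing import List, Sequence


def label_by_energy(energies: Sequence[float],
                    labels: Sequence[str] = ("low", "medium", "high")) -> List[str]:
    """Split individuals into three near-equal groups by energy rank.

    Assumption: groups are formed by sorting on energy; the source only
    says that the population is divided evenly into three classes.
    """
    n = len(energies)
    # Indices of individuals ordered from lowest to highest energy.
    order = sorted(range(n), key=lambda i: energies[i])
    result: List[str] = [""] * n
    for rank, idx in enumerate(order):
        group = min(rank * 3 // n, 2)  # 0, 1 or 2
        result[idx] = labels[group]
    return result


energies = [42.0, 3.5, 17.0, 8.0, 30.0, 25.0]  # illustrative values only
assigned = label_by_energy(energies)
```

With six individuals, each of the three labels is assigned to exactly two of them.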
Example 5
On the basis of the previous embodiment, the file retrieval unit performs retrieval based on the feature tree according to the keywords provided by the user, and the method for sending the retrieved file to the user performs the following steps: acquiring a complete feature tree, labeling all nodes in the feature tree, and connecting the labels to form map information, wherein the map information comprises information of a starting point, information of a target point and information of a blocking node, and the information of the blocking node comprises a plurality of intermediate nodes; obtaining a plurality of first paths according to the starting point and the intermediate node; determining a first distance node which is closest to the starting point in the intermediate nodes according to the first path; obtaining a plurality of second paths according to the target points and the intermediate nodes; determining a second distance node which is closest to the target point in the intermediate nodes according to the second path; all other intermediate nodes between the first distance node and the second distance node are obtained by matching in sequence according to an information comparison table; and obtaining an optimal path from the starting point to the target point according to the first distance node, all other intermediate nodes obtained by matching and the second distance node, and retrieving by using the optimal path.
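The path-based retrieval can be sketched on a small labeled graph. The text does not name a search algorithm or define the information comparison table, so this sketch substitutes plain breadth-first search: it finds the intermediate node nearest the starting point, the one nearest the target point, and a shortest path between the two endpoints. The node names and the graph itself are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def bfs_distances(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def nearest_intermediate(graph: Dict[str, List[str]], endpoint: str,
                         intermediates: List[str]) -> Optional[str]:
    """Intermediate node closest to the given endpoint, if any is reachable."""
    dist = bfs_distances(graph, endpoint)
    reachable = [n for n in intermediates if n in dist]
    return min(reachable, key=lambda n: dist[n], default=None)


def shortest_path(graph: Dict[str, List[str]], start: str,
                  target: str) -> Optional[List[str]]:
    """Shortest path by hop count (BFS stand-in for the 'optimal path')."""
    prev: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    if target not in prev:
        return None
    path: List[str] = []
    cursor: Optional[str] = target
    while cursor is not None:
        path.append(cursor)
        cursor = prev[cursor]
    return path[::-1]


# Hypothetical map information: a start, a target and three intermediate nodes.
graph = {
    "start": ["n1"],
    "n1": ["start", "n2"],
    "n2": ["n1", "n3"],
    "n3": ["n2", "target"],
    "target": ["n3"],
}
intermediates = ["n1", "n2", "n3"]
first_node = nearest_intermediate(graph, "start", intermediates)
second_node = nearest_intermediate(graph, "target", intermediates)
path = shortest_path(graph, "start", "target")
```

In this example the first distance node is `n1`, the second distance node is `n3`, and the path passes through the remaining intermediate node between them.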
Example 6
A file management method based on file segmentation and feature extraction is disclosed, and the method comprises the following steps: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; and the file searching unit searches based on the feature tree according to the keywords provided by the user and sends the searched files to the user.
Example 7
On the basis of the above embodiment, the method further includes: using a file splicing unit to respond to a file acquisition command from a user by splicing, according to the index association family, all subfiles in the index association family corresponding to the target file in subfile order, and sending the spliced file to the user.
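A minimal sketch of the splicing step is shown below. It assumes, hypothetically, that each entry of the index association family records a sequence number, a partition name, and a lookup key, and that partitions are modeled as in-memory dictionaries. None of these field names come from the source.

```python
from typing import Dict, List


def splice(index_family: List[dict], storage: Dict[str, Dict[str, bytes]]) -> bytes:
    """Reassemble a file by concatenating its subfiles in subfile order."""
    ordered = sorted(index_family, key=lambda entry: entry["order"])
    return b"".join(storage[entry["partition"]][entry["key"]] for entry in ordered)


# Hypothetical partitions, each holding one subfile of the target file.
storage = {
    "disk0": {"k0": b"Hello, "},
    "disk1": {"k1": b"archive "},
    "disk2": {"k2": b"world"},
}
# The family may be stored in any order; the sequence number restores it.
index_family = [
    {"order": 2, "partition": "disk2", "key": "k2"},
    {"order": 0, "partition": "disk0", "key": "k0"},
    {"order": 1, "partition": "disk1", "key": "k1"},
]
restored = splice(index_family, storage)
```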
Example 8
On the basis of the above embodiment, the file segmentation unit segments the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles based on subfile order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, associates the blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; after the file is written, a message indicating that the write succeeded is issued; when a new file is written, a plurality of file segmentation units perform the write task synchronously, each unit executing the write task slices the file with the same algorithm and then stores it, and the write task is terminated once N copies of the archive file are verified to exist in the network; each file segmentation unit creates a distributed hash table containing the information of that file segmentation unit, all archive files stored under it together with their structural relationships, and the file segmentation unit information recorded for the archive files; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after file segmentation is completed, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
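The slicing and hashing steps can be sketched as follows. This is a simplified illustration, not the patented procedure: it uses SHA-256 (the source does not name a hash function), models the Merkle DAG as a single root whose hash covers the ordered block hashes, and uses a local dictionary in place of a distributed hash table. All function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def split_fixed(data: bytes, chunk_size: int) -> List[bytes]:
    """Slice data into fixed-size blocks; the last block may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def root_hash(block_hashes: List[str]) -> str:
    """Root of a two-level Merkle DAG: hash of the ordered child hashes."""
    digest = hashlib.sha256()
    for block_hash in block_hashes:
        digest.update(bytes.fromhex(block_hash))
    return digest.hexdigest()


def build_index_family(data: bytes, chunk_size: int) -> Dict[str, object]:
    """Return the root hash, the ordered block hashes, and a hash table."""
    blocks = split_fixed(data, chunk_size)
    hashes = [hashlib.sha256(block).hexdigest() for block in blocks]
    table = dict(zip(hashes, blocks))  # identical blocks share one entry
    return {"root": root_hash(hashes), "order": hashes, "table": table}


family = build_index_family(b"abcdefghij", 4)
rebuilt = b"".join(family["table"][h] for h in family["order"])
```

Because every hash is derived from content, two files that differ in any block yield different block hashes and therefore a different root, and the ordered hash list is enough to reassemble the original bytes.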
Example 9
On the basis of the above embodiment, the method performed by the feature extraction unit for forming the feature tree with the file features as the starting node and the subfile features as the branch nodes includes the following steps: determining a feature magnitude G and a fuzzy weight M, and randomly initializing feature prototypes, wherein each feature prototype represents an intelligent agent; determining the population size and setting the evolution generation E to 0; updating the degree of membership using the following formula: [formula not reproduced in the source text], wherein i, j and s respectively denote feature categories, v is a feature center, v_i denotes the feature center of class i, v_j denotes the feature center of class j, and v_s denotes the feature center of class s; a, b and c denote the indices of the current data to be characterized in each dimension of the three-dimensional gray-scale information, and x_k are standard labels; calculating the individual energy in the feature population according to the updated membership degree so as to obtain a feature label, and extracting the subfile features according to the obtained feature label; and, based on the file features, establishing a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
Specifically, the Parallel Genetic Algorithm (Parallel Genetic Algorithm) refers to an Algorithm after a Genetic Algorithm is designed in Parallel, and is a multi-population Parallel evolutionary Genetic Algorithm suitable for complex optimization problems. The algorithm can effectively overcome the premature convergence problem of the standard genetic algorithm and has stronger global search capability.
Genetic algorithms are an effective search method based on the principles of natural selection and genetics, and they have been applied successfully in many fields to obtain satisfactory solutions. Although a genetic algorithm can generally find a satisfactory solution within a reasonable time, the need to increase its operating speed becomes more pressing as the problems to be solved grow in complexity and difficulty. Genetic algorithms are naturally parallel and well suited to implementation on large-scale parallel computers, whose increasing availability has laid the material foundation for parallel genetic algorithms. A parallel genetic algorithm is realized not only by converting the serial genetic algorithm into an equivalent parallel scheme but, more importantly, by restructuring the genetic algorithm into a form that is easy to parallelize, so as to form a parallel population model.
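One common parallel population model is the island model, in which several subpopulations evolve independently and periodically exchange their best individuals. The sketch below illustrates that idea on a toy bit-counting objective. The source does not say which parallel model or fitness function it uses, so every name and parameter here is an assumption, and the islands are stepped sequentially for simplicity rather than on separate workers.

```python
import random
from typing import List

Individual = List[int]


def fitness(bits: Individual) -> int:
    """Toy objective: number of 1 bits (stand-in for a real fitness)."""
    return sum(bits)


def evolve(pop: List[Individual], rng: random.Random,
           mutation_rate: float = 0.05) -> List[Individual]:
    """One generation: elitism, one-point crossover, bit-flip mutation."""
    ranked = sorted(pop, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]
    children = [ranked[0][:]]  # keep the best individual unchanged
    while len(children) < len(pop):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]
        child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
        children.append(child)
    return children


def island_ga(n_islands: int = 4, pop_size: int = 10, length: int = 16,
              generations: int = 30, migrate_every: int = 5,
              seed: int = 0) -> Individual:
    """Evolve several subpopulations with periodic ring migration."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(length)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        # In a truly parallel run, each island would evolve on its own worker.
        islands = [evolve(pop, rng) for pop in islands]
        if gen % migrate_every == 0:
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(range(len(pop)), key=lambda k: fitness(pop[k]))
                pop[worst] = bests[i - 1][:]  # receive the neighbor's best
    return max((ind for pop in islands for ind in pop), key=fitness)


best = island_ga()
```

Migration is what distinguishes this from running several independent serial searches: it lets a good solution found on one island spread while the islands otherwise keep their diversity.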
Example 10
On the basis of the previous embodiment, the method for calculating the individual energy in the feature population according to the updated membership degree so as to obtain the feature label performs the following steps: calculating the individual energies in the population using the following formula: [formula not reproduced in the source text], wherein ζ is an adjusting constant with a value in the range 30 to 50; and, according to the obtained individual energy values, dividing the individuals in the feature population evenly into three classes and setting a different feature label for each of the three classes.
The above description is only an embodiment of the present invention and is not intended to limit its scope; any structural change made according to the present invention without departing from its spirit should be considered as falling within the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the system provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (10)
1. Archive management system based on file segmentation and feature extraction, characterized in that, the system includes: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; the file retrieval unit is used for retrieving based on the feature tree according to the keywords provided by the user and sending the retrieved file to the user; and the file splicing unit is used for responding to a file acquisition command of a user, splicing all files in the index association family corresponding to the target file in the sequence of the sub-files according to the index association family, and sending the spliced files to the user.
2. The system of claim 1, wherein the file segmentation unit segments the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles based on subfile order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, associates the blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; after the file is written, a message indicating that the write succeeded is issued; when a new file is written, a plurality of file segmentation units perform the write task synchronously, each unit executing the write task slices the file with the same algorithm and then stores it, and the write task is terminated once N copies of the archive file are verified to exist in the network; each file segmentation unit creates a distributed hash table containing the information of that file segmentation unit, all archive files stored under it together with their structural relationships, and the file segmentation unit information recorded for the archive files; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after file segmentation is completed, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
3. The system of claim 2, wherein the feature extraction unit performs feature extraction on each file to obtain file features, performs feature extraction on each subfile after the file is divided to obtain subfile features, and associates each subfile feature with the file features to form a feature tree with the file features as the starting node and the subfile features as the branch nodes by performing the following steps: determining a feature magnitude G and a fuzzy weight M, and randomly initializing feature prototypes, wherein each feature prototype represents an intelligent agent; determining the population size and setting the evolution generation E to 0; updating the degree of membership using the following formula: [formula not reproduced in the source text], wherein i, j and s respectively denote feature categories, v is a feature center, v_i denotes the feature center of class i, v_j denotes the feature center of class j, and v_s denotes the feature center of class s; a, b and c denote the indices of the current data to be characterized in each dimension of the three-dimensional gray-scale information, and x_k are standard labels; calculating the individual energy in the feature population according to the updated membership degree so as to obtain a feature label, and extracting the subfile features according to the obtained feature label; and, based on the file features, establishing a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
4. The apparatus of claim 1, wherein the method of calculating the individual energy in the feature population according to the updated membership degree so as to obtain the feature label performs the following steps: calculating the individual energies in the population using the following formula: [formula not reproduced in the source text], wherein ζ is an adjusting constant with a value in the range 30 to 50; and, according to the obtained individual energy values, dividing the individuals in the feature population evenly into three classes and setting a different feature label for each of the three classes.
5. The system of claim 4, wherein the document retrieval unit performs a retrieval based on the feature tree based on a keyword provided by the user, and the method of transmitting the retrieved document to the user performs the steps of: acquiring a complete feature tree, labeling all nodes in the feature tree, and connecting the labels to form map information, wherein the map information comprises information of a starting point, information of a target point and information of a blocking node, and the information of the blocking node comprises a plurality of intermediate nodes; obtaining a plurality of first paths according to the starting point and the intermediate node; determining a first distance node which is closest to the starting point in the intermediate nodes according to the first path; obtaining a plurality of second paths according to the target points and the intermediate nodes; determining a second distance node which is closest to the target point in the intermediate nodes according to the second path; all other intermediate nodes between the first distance node and the second distance node are obtained by matching in sequence according to an information comparison table; and obtaining an optimal path from the starting point to the target point according to the first distance node, all other intermediate nodes obtained by matching and the second distance node, and retrieving by using the optimal path.
6. Archive management method based on document segmentation and feature extraction based on the system of one of claims 1 to 5, characterized in that it performs the following steps: the file dividing unit is used for dividing the digitized archive file, respectively storing the divided subfiles in different hard disk partitions, and meanwhile establishing an index association family aiming at the divided subfiles of the archive file based on the sequence of the subfiles; the characteristic extraction unit is used for respectively extracting the characteristics of each file to obtain the characteristics of the file, meanwhile, extracting the characteristics of each subfile after the file is divided to obtain the characteristics of the subfile, and associating each subfile characteristic with the characteristics of the file to form a characteristic tree which takes the characteristics of the file as an initial node and the characteristics of the subfile as branch nodes; and the file searching unit searches based on the feature tree according to the keywords provided by the user and sends the searched files to the user.
7. The method of claim 6, wherein the method further comprises: and responding to a file acquisition command of a user by using a file splicing unit, splicing all files in the index association family corresponding to the target file in the sequence of the sub-files according to the index association family, splicing the sub-files, and sending the spliced files to the user.
8. The method of claim 7, wherein the file segmentation unit segments the digitized archive file, stores the resulting subfiles in different hard disk partitions, and establishes an index association family for the subfiles based on subfile order by performing the following steps: the file segmentation unit slices each received archive file into fixed-size blocks, generates a unique hash value for each archive file block, associates the blocks using a Merkle directed acyclic graph (Merkle DAG) structure, and generates a root hash as the hash identifier of the file; the block hashes and the root hash are generated from the actual content of the archive file, so different files produce different hash values; after the file is written, a message indicating that the write succeeded is issued; when a new file is written, a plurality of file segmentation units perform the write task synchronously, each unit executing the write task slices the file with the same algorithm and then stores it, and the write task is terminated once N copies of the archive file are verified to exist in the network; each file segmentation unit creates a distributed hash table containing the information of that file segmentation unit, all archive files stored under it together with their structural relationships, and the file segmentation unit information recorded for the archive files; when a new archive file is written, the hash table is updated and the information is synchronized with the other file segmentation units; and after file segmentation is completed, the index association family is established from the addresses of the subfiles in the hash table, based on the hash table corresponding to each subfile.
9. The method as claimed in claim 2, wherein the feature extraction unit performs feature extraction on each file to obtain file features, performs feature extraction on each subfile after the file is divided to obtain subfile features, and associates each subfile feature with the file features to form a feature tree with the file features as the starting node and the subfile features as the branch nodes by performing the following steps: determining a feature magnitude G and a fuzzy weight M, and randomly initializing feature prototypes, wherein each feature prototype represents an intelligent agent; determining the population size and setting the evolution generation E to 0; updating the degree of membership using the following formula: [formula not reproduced in the source text], wherein i, j and s respectively denote feature categories, v is a feature center, v_i denotes the feature center of class i, v_j denotes the feature center of class j, and v_s denotes the feature center of class s; a, b and c denote the indices of the current data to be characterized in each dimension of the three-dimensional gray-scale information, and x_k are standard labels; calculating the individual energy in the feature population according to the updated membership degree so as to obtain a feature label, and extracting the subfile features according to the obtained feature label; and, based on the file features, establishing a feature tree in which the archive file features are the starting node and the subfile features are the branch nodes.
10. The apparatus of claim 1, wherein the method of calculating the individual energy in the feature population according to the updated membership degree so as to obtain the feature label performs the following steps: calculating the individual energies in the population using the following formula: [formula not reproduced in the source text], wherein ζ is an adjusting constant with a value in the range 30 to 50; and, according to the obtained individual energy values, dividing the individuals in the feature population evenly into three classes and setting a different feature label for each of the three classes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010355696.5A CN111680198B (en) | 2020-04-29 | 2020-04-29 | File management system and method based on file segmentation and feature extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680198A | 2020-09-18 |
CN111680198B | 2021-05-11 |
Family
ID=72452655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010355696.5A (Active, CN111680198B) | File management system and method based on file segmentation and feature extraction | 2020-04-29 | 2020-04-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680198B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102117324B (en) * | 2011-02-24 | 2012-09-05 | 上海北大方正科技电脑***有限公司 | File management method and management system applying fuzzy matrice |
CN103914497A (en) * | 2013-01-08 | 2014-07-09 | 仁宝电脑工业股份有限公司 | Management method and system for quick access files |
CN105205581A (en) * | 2014-06-30 | 2015-12-30 | 国网上海市电力公司 | Power-supply-enterprise electronic file safety risk evaluation system |
CN104657513A (en) * | 2015-03-20 | 2015-05-27 | 烟台威尔数据***有限公司 | File operation and rapid retrieval method in embedded system |
US20180107402A1 (en) * | 2016-10-19 | 2018-04-19 | Acronis International Gmbh | System and method for data storage using log-structured merge trees |
Non-Patent Citations (1)
Title |
---|
周雅琴: "高校档案编研工作的实践与思考", 《档案时空》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112632075A (en) * | 2020-12-25 | 2021-04-09 | 创新科技术有限公司 | Storage and reading method and device of cluster metadata |
CN112597422A (en) * | 2020-12-30 | 2021-04-02 | 深圳市世强元件网络有限公司 | PDF file segmentation method and PDF file loading method in webpage |
CN113127421A (en) * | 2021-04-01 | 2021-07-16 | 山东英信计算机技术有限公司 | Method and equipment for searching file content in storage system |
CN114896210A (en) * | 2022-04-27 | 2022-08-12 | 中国航空工业集团公司沈阳飞机设计研究所 | Airplane test flight test data processing method and system, electronic equipment and medium thereof |
CN114943067A (en) * | 2022-07-13 | 2022-08-26 | 河北汇金集团股份有限公司 | Anti-counterfeiting identification method for file archive based on OCR technology |
CN114943067B (en) * | 2022-07-13 | 2022-11-08 | 河北汇金集团股份有限公司 | File archive anti-counterfeiting identification method based on OCR technology |
CN115794496A (en) * | 2023-02-07 | 2023-03-14 | 中信天津金融科技服务有限公司 | Archive storage method and system based on information extraction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111680198B (en) | File management system and method based on file segmentation and feature extraction | |
US20200210460A1 (en) | Systems and methods for probabilistic data classification | |
CN100418092C (en) | Grid and T-tree index method for rapid positioning in main memory database | |
US7739288B2 (en) | Systems and methods of directory entry encodings | |
KR100798609B1 (en) | Data sort method, data sort apparatus, and storage medium storing data sort program | |
Yagoubi et al. | Massively distributed time series indexing and querying | |
CN106233259A (en) | The many storage data from generation to generation of retrieval in decentralized storage networks | |
CN109284273B (en) | Massive small file query method and system adopting suffix array index | |
CN112765405B (en) | Method and system for clustering and inquiring spatial data search results | |
CN111708895B (en) | Knowledge graph system construction method and device | |
CN102207935A (en) | Method and system for establishing index | |
CN114691721A (en) | Graph data query method and device, electronic equipment and storage medium | |
CN101963993B (en) | Method for fast searching database sheet table record | |
CN115238153A (en) | Document management method and system based on virtual simulation | |
AL-Msie'deen et al. | Detecting commonality and variability in use-case diagram variants | |
CN108984626A (en) | Data processing method, device and server | |
CN110389939A (en) | Internet of Things storage system based on NoSQL and distributed file system | |
CN113688257B (en) | Author name identity judging method based on large-scale literature data | |
CN108280176A (en) | Data mining optimization method based on MapReduce | |
Chukhray et al. | Proximate objects probabilistic searching method | |
CN108256086A (en) | Data characteristics statistical analysis technique | |
JPH09305622A (en) | Method and system for managing data base having document retrieval function | |
KR101846347B1 (en) | Method and apparatus for managing massive documents | |
Luo | Learning Augmented Binary Search Trees | |
CN117540056B (en) | Method, device, computer equipment and storage medium for data query |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220111. Address after: 130000 room 524, building 1, radio Street relocation community (Golden coordinate), Chaoyang District, Changchun City, Jilin Province. Patentee after: Jilin Zhiguang Hengsheng Technology Co.,Ltd. Address before: 316002 No.1 Haida South Road, Changzhi Island, Lincheng street, Dinghai District, Zhoushan City, Zhejiang Province. Patentee before: Zhejiang Ocean University |