CN109634914B - Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files


Info

Publication number
CN109634914B
Authority
CN
China
Prior art keywords
files
talkback voice
talkback
small
small files
Prior art date
Legal status
Active
Application number
CN201811390509.6A
Other languages
Chinese (zh)
Other versions
CN109634914A (en)
Inventor
方国栋
张育钊
袁科
刘昊天
张鑫
Current Assignee
Huaqiao University
Original Assignee
Huaqiao University
Priority date
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN201811390509.6A priority Critical patent/CN109634914B/en
Publication of CN109634914A publication Critical patent/CN109634914A/en
Application granted granted Critical
Publication of CN109634914B publication Critical patent/CN109634914B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files, which comprises: classifying the talkback voice small files; sorting each class of talkback voice small files by file size; for each class, selecting, in the sorted order, the maximum number of files that can be accommodated by an integral multiple of the HDFS block space, and merging and storing the selected talkback voice small files; setting classification levels for the talkback voice small files remaining after selection, classifying them according to the set levels, and merging and storing the classified files; and establishing a bifurcation index mechanism for the stored merged files that records the information of each talkback voice small file within them. The invention has the advantages that the number of occupied block spaces is reduced, which lowers the excessive memory usage of the NameNode when maintaining metadata, and that the size of the metadata information is reduced, which accelerates reading.

Description

Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files
Technical Field
The invention relates to the field of performance optimization of distributed file systems, in particular to an optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files.
Background
With the rapid development of Internet technology, the communication industry has also changed dramatically. Network intercoms based on the Internet Protocol (IP) are used more and more widely, and frequent use of the intercom by users produces an ever larger volume of talkback voice small files, so how to manage these small files effectively has become an urgent problem for Internet intercom service providers.
The Hadoop Distributed File System (HDFS) is a core component of the Hadoop distributed computing framework of the Apache open-source organization. Modeled on Google's GFS (Google File System) and implemented as open-source Java, it serves as a reference for institutions and companies building cloud storage solutions. Since its release, HDFS has been widely used to store massive data at Internet companies such as Facebook, Yahoo, Alibaba, Tencent, and Baidu. It is designed to run stably on low-cost commodity servers and also offers high fault tolerance, good scalability, and other advantages.
HDFS adopts a master-slave architecture consisting of a NameNode and a large number of DataNode nodes. The NameNode is the core of HDFS: it maintains the metadata information of files and coordinates and manages all DataNode nodes, while the DataNodes store the actual file data. After the Hadoop cluster starts, all metadata information is loaded into the memory of the NameNode. When a client accesses HDFS, it first obtains the metadata information of the relevant file from the NameNode, then finds the DataNode that actually stores the file according to that metadata, and finally retrieves the requested file from the DataNode.
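For context, a minimal sketch of this read flow using the standard HDFS Java client API is given below; the NameNode address and file path are placeholders, not values taken from the invention:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address

        // FileSystem.get contacts the NameNode; fs.open fetches the file's block
        // metadata from the NameNode, and reads then stream from the DataNodes.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/talkback/example.amr"))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                // process n bytes of voice data
            }
        }
    }
}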
This master-slave architecture has several problems. First, each file corresponds to a piece of metadata information occupying about 150 bytes; as the number of small files stored in HDFS grows, the metadata the NameNode must maintain also grows sharply and consumes large amounts of its memory, but the NameNode's memory is limited, so it eventually becomes a performance bottleneck. Second, every write of a small file requires requesting block allocation from the NameNode, and every read requires requesting metadata from it, so frequent reads and writes of small files degrade the NameNode's performance and may even cause network congestion. Third, each small file is small, yet transferring an actual file always involves the three steps of requesting file metadata, locating the data block, and establishing a connection between the client and the DataNode, so the time spent reading or writing a small file may be shorter than the time spent establishing the network connection, which reduces the efficiency of HDFS.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files, so as to solve the problems caused when the conventional HDFS stores a large number of small files, such as excessive memory usage of the NameNode and performance degradation from reading and writing small files.
The invention is realized by the following steps: a method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files comprises the following steps:
step S1, classifying the talkback voice small files;
step S2, after the classification is finished, sequencing each class of talkback voice small files in sequence according to the size of the files;
step S3, selecting the maximum number of files which can be accommodated by the integral multiple of the HDFS block space according to the sorting sequence for each type of small talkback voice files, and merging and storing the selected small talkback voice files into the HDFS block space;
step S4, setting classification levels for the remaining talkback voice small files after selection, classifying the remaining talkback voice small files according to the set classification levels, and merging and storing the classified talkback voice small files;
and step S5, establishing a bifurcation index mechanism for the stored merged file, and recording the information of each talkback voice small file in the merged file.
Further, the step S1 specifically includes:
step S11, when the talkback voice initiator uploads the talkback voice small files, the talkback voice server places all talkback voice small files belonging to the initiator under a specified folder according to the initiator information;
and step S12, marking the appointed folders of each initiator as a type separately according to the initiator information.
Further, the designated folders are each named by an initiator name.
Further, the step S2 is specifically:
after the classification is finished, for each designated folder, all the talkback voice small files under it are traversed, and all the talkback voice small files under the same designated folder are sorted in order from largest to smallest.
Further, in step S4, the setting of the classification level for the remaining talkback voice small files after selection is specifically:
according to the relations among the initiators, three classification levels are set for the talkback voice small files remaining after selection, and each classification level is assigned a priority; from highest to lowest priority, the three levels are: talkback voice small files with a group relationship, talkback voice small files from the same time period, and the other talkback voice small files.
Further, in step S4, the merging and storing of the classified talkback voice small files specifically includes:
step B11, creating a buffer area;
step B12, filling the talkback voice small files with the group relationship into a cache region, judging whether the cache region can not be filled with the next talkback voice small files with the group relationship, if so, directly storing the talkback voice small files filled into the cache region into an HDFS block space, emptying the cache region, and entering step B13; if not, go directly to step B13;
step B13, judging whether the talkback voice small files with group relation are filled, if yes, entering step B14; if not, return to step B12;
step B14, filling talkback voice small files with the same time period into the cache region, judging whether the cache region can not be refilled with talkback voice small files with the same time period in the next time period, if so, directly storing the talkback voice small files filled into the cache region into an HDFS block space, emptying the cache region, and entering step B15; if not, go directly to step B15;
step B15, judging whether all the talkback voice small files in the same time period have been filled in, if so, entering step B16; if not, return to step B14;
step B16, filling other talkback voice small files into the cache region, judging whether the cache region can not be filled with the next other talkback voice small files or not, if so, directly storing the talkback voice small files filled into the cache region into the HDFS block space, emptying the cache region, and entering step B17; if not, go directly to step B17;
step B17, judging whether other talkback voice small files are filled up, if so, entering step B18; if not, return to step B16;
and step B18, storing the small talkback voice files filled in the cache area into the HDFS block space, and emptying the cache area.
Further, the step S5 is specifically:
storing the metadata information of each talkback voice small file by using a hash table structure, wherein in the hash table structure, a key is a hash value of the name information of the talkback voice small file, and the structure of the key is < user name | file name >; the value is metadata information of the talkback voice small file; for the talkback voice small files selected according to the sorting sequence, the metadata information comprises the range, the initial position and the length of the small files; for the remaining talkback voice small files after selection, the metadata information comprises the range of the small files, the file names after scattered combination, the initial positions and the lengths.
The invention has the following advantages:
1. In the original HDFS block management mode, the waste of block space is reduced by merging small files, and an integration-and-dispersion strategy is added so that edge files do not waste block space. The specific implementation is as follows: the small files are integrated and dispersed according to the initiators and the relations among them, that is, files belonging to the same initiator are merged first, and the small files exceeding an integral multiple of the block space are then classified and merged according to the relations among the initiators. In this way, the number of occupied block spaces can be reduced, lowering the excessive memory usage of the NameNode when maintaining metadata.
2. By establishing a bifurcation index mechanism and using different classification methods for different files, the size of the metadata information can be reduced and reading can be accelerated. The specific implementation is as follows: the range to which a small file belongs, its start position, and its length are recorded. The method is convenient and simple for users, requires only deploying the software to complete the related functions, and is easy to popularize and use.
Drawings
The invention will be further described with reference to the following examples with reference to the accompanying drawings.
FIG. 1 is a functional block diagram of an implementation of the present invention.
FIG. 2 is a flowchart illustrating an optimization method for whole storage, dispersion and bifurcation retrieval of a small talk-back voice file according to the present invention.
Fig. 3 is a schematic diagram of classifying the talkback voice small files according to the present invention.
Fig. 4 is a schematic diagram of sorting the talkback voice small files according to the present invention.
Fig. 5 is a schematic diagram of the correspondence between the metadata information structure of a talkback voice small file and the talkback voice small file itself according to the present invention.
FIG. 6 is a diagram illustrating an integrated storage structure according to the present invention.
FIG. 7 is a diagram of a distributed storage structure according to the present invention.
Detailed Description
Referring to fig. 1 to 7, the present invention provides a method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files, which comprises the following steps:
step S1, classifying the talkback voice small files;
in this embodiment, the step S1 specifically includes:
step S11, when the talkback voice initiator uploads the talkback voice small files, the talkback voice server places all talkback voice small files belonging to the initiator under a specified folder according to the initiator information;
step S12, according to the initiator information, the designated folder of each initiator is individually marked as one class; that is, the initiator is used as the classification criterion.
As shown in fig. 3, suppose initiator A and initiator B each upload 6 talkback voice small files. After receiving them, the talkback voice server stores the 6 talkback voice small files of initiator A under the designated folder corresponding to initiator A, and stores the 6 talkback voice small files of initiator B under the designated folder corresponding to initiator B.
The designated folders are each named after the initiator. A voice talkback session mainly involves an initiator and a receiver, so the initiator is used for naming and as the classification criterion: on the one hand, this keeps the classification rule clear; on the other hand, the location of the small file to be accessed can be found quickly from the initiator information, which improves the efficiency of accessing the talkback voice small files.
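For illustration only, a minimal sketch of this classification step on the talkback voice server's local staging directory; the class name, root directory and method signature are assumptions, not part of the invention:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TalkbackClassifier {
    private final Path rootDir; // staging directory on the talkback voice server

    public TalkbackClassifier(Path rootDir) {
        this.rootDir = rootDir;
    }

    /** Moves an uploaded talkback voice small file into the folder named after its initiator. */
    public Path classify(String initiatorName, Path uploadedFile) throws IOException {
        Path folder = rootDir.resolve(initiatorName);   // one designated folder per initiator
        Files.createDirectories(folder);                // created on the initiator's first upload
        Path target = folder.resolve(uploadedFile.getFileName());
        return Files.move(uploadedFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}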
Step S2, after the classification is finished, sequencing each class of talkback voice small files in sequence according to the size of the files;
in this embodiment, the step S2 specifically includes:
after the classification is finished, for each designated folder, all the talkback voice small files under it are traversed, and all the talkback voice small files under the same designated folder are sorted in order from largest to smallest. In a specific implementation, all the talkback voice small files in the same designated folder can be sorted uniformly by a single sorting module.
As shown in fig. 4, suppose 6 talkback voice small files are stored in the designated folder of initiator A, and their sizes from largest to smallest are talkback file 1, talkback file 2, talkback file 3, talkback file 4, talkback file 5 and talkback file 6; the sorting module then sorts them in the order talkback file 1, talkback file 2, talkback file 3, talkback file 4, talkback file 5, talkback file 6.
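A minimal sketch of what such a sorting module might do for one designated folder; the names are illustrative:

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class SortingModule {
    /** Returns the small files of one designated folder ordered from largest to smallest. */
    public static File[] sortBySizeDescending(File designatedFolder) {
        File[] files = designatedFolder.listFiles(File::isFile);
        if (files == null) {
            return new File[0];                 // folder missing or not a directory
        }
        Arrays.sort(files, Comparator.comparingLong(File::length).reversed());
        return files;
    }
}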
Step S3, selecting the maximum number of files which can be accommodated by the integral multiple of the HDFS block space according to the sorting sequence for each type of talkback voice small files, and merging and storing the selected talkback voice small files into the HDFS block space. That is, each type of talkback voice small files is stored in an integrated manner: the talkback voice small files whose total size is closest to an integral multiple of the HDFS block space are selected from largest to smallest and merged into the HDFS block space (the size of each block space in HDFS is fixed, and since several block spaces are usually occupied during storage, an integral multiple of the block space is used).
For example, as shown in fig. 4, only talkback files 1 to 5 can be accommodated by the integral multiple of the HDFS block space; talkback files 1 to 5 are therefore stored in the HDFS block space in an integrated manner, and talkback file 6 is handed to the scatter module for processing. If talkback file 1 were handed to the scatter module instead, the utilization of the block space would be reduced, because the talkback voice small files are sorted from largest to smallest, which gives:
File1.Size≥File6.Size (1)
this can be further derived from equation (1):
File2.Size+File3.Size+File4.Size+File5.Size+File6.Size ≤ File1.Size+File2.Size+File3.Size+File4.Size+File5.Size (2)
in specific implementation, since there is very large randomness in the user's talk-back time, the probability of being equal to true in equations (1) and (2) is very low, and therefore, storing the talk-back voice small file in this way can maximize the use of the block space.
Step S4, setting classification levels for the remaining talkback voice small files after selection, classifying the remaining talkback voice small files according to the set classification levels, and merging and storing the classified talkback voice small files; in a specific implementation, the remaining talkback voice small files can be processed uniformly by a single scatter module.
In step S4, the setting of the classification levels for the remaining talkback voice small files after selection is specifically:
according to the relations among the initiators, three classification levels are set for the talkback voice small files remaining after selection, and each classification level is assigned a priority; from highest to lowest priority, the three levels are: talkback voice small files with a group relationship, talkback voice small files from the same time period, and the other talkback voice small files.
In step S4, the merging and storing of the classified talkback voice small files specifically includes:
step B11, creating a buffer area;
step B12, filling the talkback voice small files with the group relationship into a cache region, judging whether the cache region can not be filled with the next talkback voice small files with the group relationship, if so, directly storing the talkback voice small files filled into the cache region into an HDFS block space, emptying the cache region, and entering step B13; if not, go directly to step B13;
step B13, judging whether the talkback voice small files with group relation are filled, if yes, entering step B14; if not, return to step B12;
step B14, filling talkback voice small files with the same time period into the cache region, judging whether the cache region can not be refilled with talkback voice small files with the same time period in the next time period, if so, directly storing the talkback voice small files filled into the cache region into an HDFS block space, emptying the cache region, and entering step B15; if not, go directly to step B15;
step B15, judging whether all the talkback voice small files in the same time period have been filled in, if so, entering step B16; if not, return to step B14;
step B16, filling other talkback voice small files into the cache region, judging whether the cache region can not be filled with the next other talkback voice small files or not, if so, directly storing the talkback voice small files filled into the cache region into the HDFS block space, emptying the cache region, and entering step B17; if not, go directly to step B17;
step B17, judging whether other talkback voice small files are filled up, if so, entering step B18; if not, return to step B16;
and step B18, storing the small talkback voice files filled in the cache area into the HDFS block space, and emptying the cache area.
The invention merges the talkback voice small files remaining after selection in this scattered manner to reduce the waste of block space, thereby reducing the number of occupied blocks and, in turn, the memory consumed by the NameNode.
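A sketch of steps B11 to B18 follows, assuming the cache region holds at most one HDFS block and that the three priority groups have already been separated; the class name, the one-block assumption and the abstract flush step are illustrative rather than prescribed by the invention.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ScatterModule {
    private final long blockSize;                              // HDFS block size, e.g. 128 MB
    private final ByteArrayOutputStream cacheRegion = new ByteArrayOutputStream();

    public ScatterModule(long blockSize) {
        this.blockSize = blockSize;
    }

    /**
     * Fills the cache region with the three priority levels in order: group-related
     * files, same-time-period files, then the other files (steps B12-B17), and
     * finally flushes whatever is left (step B18).
     */
    public void merge(List<Path> groupRelated, List<Path> sameTimePeriod,
                      List<Path> others) throws IOException {
        fill(groupRelated);
        fill(sameTimePeriod);
        fill(others);
        flushToHdfsBlock();
    }

    private void fill(List<Path> files) throws IOException {
        for (Path file : files) {
            if (cacheRegion.size() + Files.size(file) > blockSize) {
                flushToHdfsBlock();                            // next file no longer fits
            }
            cacheRegion.write(Files.readAllBytes(file));
        }
    }

    private void flushToHdfsBlock() throws IOException {
        if (cacheRegion.size() == 0) {
            return;
        }
        // Here the buffered bytes would be written into an HDFS block space,
        // e.g. via FileSystem.create(...); the actual write is omitted in this sketch.
        cacheRegion.reset();                                   // empty the cache region
    }
}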
And step S5, establishing a bifurcation index mechanism for the stored merged file, and recording the information of each talkback voice small file in the merged file.
The step S5 specifically includes:
a hash table (HashMap) of Key-Value pairs is used to store the metadata information of each talkback voice small file. The Key is the hash code (HashCode) of the small file's name information and has the structure <user name | file name>; the Value is the metadata information of the talkback voice small file;
for the talkback voice small files selected according to the sorting order (i.e. the talkback voice small files stored in an integrated manner, whose Scope is Whole), the metadata information includes the scope (Scope) to which the small file belongs, its start position (Offset) and its length (Length), as shown in fig. 6; for the talkback voice small files remaining after selection (i.e. those stored in a scattered manner, whose Scope is Apart), the metadata information includes the scope (Scope) to which the small file belongs, the file name after scattered merging (MergeFileName), the start position (Offset) and the length (Length), as shown in fig. 7.
Because the talkback voice small files are stored with both the integrated storage structure and the scattered storage structure, when a client (i.e. an accessor) accesses a talkback voice small file, it first obtains the Scope (the range to which the small file belongs) from the hash table (HashMap) according to the Key. After the return value is obtained, if the small file was merged in an integrated manner, the corresponding small file is read directly from the file designated by the user name in the Key, using the start position (Offset) and length (Length) in the Value; if the small file was merged in a scattered manner, the corresponding small file is read from the designated file according to the merged file name (MergeFileName), the start position (Offset) and the length (Length), as shown in fig. 5. The small files are read directly through the HDFS API.
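A sketch of this bifurcation index and read path is given below; the field names follow fig. 6 and fig. 7, while the directory layout of the merged files is an assumption made only for illustration.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BifurcationIndex {
    /** Value of the hash table: the metadata of one talkback voice small file. */
    static class FileMeta {
        String scope;          // "Whole" for integrated storage, "Apart" for scattered storage
        String mergeFileName;  // used only when scope is "Apart"
        long offset;           // start position inside the merged file
        long length;           // length of the small file
    }

    private final Map<Integer, FileMeta> index = new HashMap<>();

    /** The Key is the hash code of "<user name|file name>", as described above. */
    public void put(String userName, String fileName, FileMeta meta) {
        index.put((userName + "|" + fileName).hashCode(), meta);
    }

    /** Looks up the metadata and reads the small file out of its merged file via the HDFS API. */
    public byte[] read(FileSystem fs, String userName, String fileName) throws Exception {
        FileMeta meta = index.get((userName + "|" + fileName).hashCode());
        // Whole: the merged file of this initiator; Apart: the recorded scattered merge file.
        // The directory layout below is an assumption for the sketch.
        Path mergedFile = "Whole".equals(meta.scope)
                ? new Path("/talkback/whole/" + userName)
                : new Path("/talkback/apart/" + meta.mergeFileName);
        byte[] data = new byte[(int) meta.length];
        try (FSDataInputStream in = fs.open(mergedFile)) {
            in.readFully(meta.offset, data);   // jump to Offset and read Length bytes
        }
        return data;
    }
}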
In summary, the invention has the following advantages:
1. In the original HDFS block management mode, the waste of block space is reduced by merging small files, and an integration-and-dispersion strategy is added so that edge files do not waste block space. The specific implementation is as follows: the small files are integrated and dispersed according to the initiators and the relations among them, that is, files belonging to the same initiator are merged first, and the small files exceeding an integral multiple of the block space are then classified and merged according to the relations among the initiators. In this way, the number of occupied block spaces can be reduced, lowering the excessive memory usage of the NameNode when maintaining metadata.
2. By establishing a bifurcation index mechanism and using different classification methods for different files, the size of the metadata information can be reduced and reading can be accelerated. The specific implementation is as follows: the range to which a small file belongs, its start position, and its length are recorded. The method is convenient and simple for users, requires only deploying the software to complete the related functions, and is easy to popularize and use.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (6)

1. An optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files, characterized by comprising the following steps:
step S1, classifying the talkback voice small files;
step S2, after the classification is finished, sequencing each class of talkback voice small files in sequence according to the size of the files;
step S3, selecting the maximum number of files which can be accommodated by the integral multiple of the HDFS block space according to the sorting sequence for each type of small talkback voice files, and merging and storing the selected small talkback voice files into the HDFS block space;
step S4, setting classification levels for the remaining talkback voice small files after selection, classifying the remaining talkback voice small files according to the set classification levels, and merging and storing the classified talkback voice small files;
step S5, establishing a bifurcation index mechanism for the stored merged file, and recording the information of each talkback voice small file in the merged file;
in step S4, the merging and storing of the classified talkback voice small files specifically includes:
step B11, creating a buffer area;
step B12, filling the talkback voice small files with the group relationship into a cache region, judging whether the cache region can not be filled with the next talkback voice small files with the group relationship, if so, directly storing the talkback voice small files filled into the cache region into an HDFS block space, emptying the cache region, and entering step B13; if not, go directly to step B13;
step B13, judging whether the talkback voice small files with group relation are filled, if yes, entering step B14; if not, return to step B12;
step B14, filling talkback voice small files with the same time period into the cache region, judging whether the cache region can not be refilled with talkback voice small files with the same time period in the next time period, if so, directly storing the talkback voice small files filled into the cache region into an HDFS block space, emptying the cache region, and entering step B15; if not, go directly to step B15;
step B15, judging whether all the talkback voice small files in the same time period have been filled in, if so, entering step B16; if not, return to step B14;
step B16, filling other talkback voice small files into the cache region, judging whether the cache region can not be filled with the next other talkback voice small files or not, if so, directly storing the talkback voice small files filled into the cache region into the HDFS block space, emptying the cache region, and entering step B17; if not, go directly to step B17;
step B17, judging whether other talkback voice small files are filled up, if so, entering step B18; if not, return to step B16;
and step B18, storing the small talkback voice files filled in the cache area into the HDFS block space, and emptying the cache area.
2. The method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files according to claim 1, wherein: the step S1 specifically includes:
step S11, when the talkback voice initiator uploads the talkback voice small files, the talkback voice server places all talkback voice small files belonging to the initiator under a specified folder according to the initiator information;
and step S12, marking the appointed folders of each initiator as a type separately according to the initiator information.
3. The method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files according to claim 2, wherein: the designated folders are each named after the initiator.
4. The method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files according to claim 2, wherein: the step S2 specifically includes:
after the classification is finished, for each designated folder, all the talkback voice small files under it are traversed, and all the talkback voice small files under the same designated folder are sorted in order from largest to smallest.
5. The method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files according to claim 1, wherein: in step S4, the setting of the classification levels for the remaining talkback voice small files after selection is specifically:
according to the relations among the initiators, three classification levels are set for the talkback voice small files remaining after selection, and each classification level is assigned a priority; from highest to lowest priority, the three levels are: talkback voice small files with a group relationship, talkback voice small files from the same time period, and the other talkback voice small files.
6. The method for optimizing whole storage, dispersion and bifurcation retrieval of talkback voice small files according to claim 1, wherein: the step S5 specifically includes:
storing the metadata information of each talkback voice small file by using a hash table structure, wherein in the hash table structure, a key is a hash value of the name information of the talkback voice small file, and the structure of the key is < user name | file name >; the value is metadata information of the talkback voice small file; for the talkback voice small files selected according to the sorting sequence, the metadata information comprises the range, the initial position and the length of the small files; for the remaining talkback voice small files after selection, the metadata information comprises the range of the small files, the file names after scattered combination, the initial positions and the lengths.
CN201811390509.6A 2018-11-21 2018-11-21 Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files Active CN109634914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811390509.6A CN109634914B (en) 2018-11-21 2018-11-21 Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811390509.6A CN109634914B (en) 2018-11-21 2018-11-21 Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files

Publications (2)

Publication Number Publication Date
CN109634914A CN109634914A (en) 2019-04-16
CN109634914B true CN109634914B (en) 2021-11-30

Family

ID=66068643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811390509.6A Active CN109634914B (en) 2018-11-21 2018-11-21 Optimization method for whole storage, dispersion and bifurcation retrieval of talkback voice small files

Country Status (1)

Country Link
CN (1) CN109634914B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112235422B (en) * 2020-12-11 2021-03-30 浙江大华技术股份有限公司 Data processing method and device, computer readable storage medium and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820717A (en) * 2015-05-22 2015-08-05 国网智能电网研究院 Massive small file storage and management method and system
CN105631010A (en) * 2015-12-29 2016-06-01 成都康赛信息技术有限公司 Optimization method based on HDFS small file storage
CN107103095A (en) * 2017-05-19 2017-08-29 成都四象联创科技有限公司 Method for computing data based on high performance network framework
CN108710639A (en) * 2018-04-17 2018-10-26 桂林电子科技大学 A kind of mass small documents access optimization method based on Ceph

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160232457A1 (en) * 2015-02-11 2016-08-11 Skytree, Inc. User Interface for Unified Data Science Platform Including Management of Models, Experiments, Data Sets, Projects, Actions and Features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820717A (en) * 2015-05-22 2015-08-05 国网智能电网研究院 Massive small file storage and management method and system
CN105631010A (en) * 2015-12-29 2016-06-01 成都康赛信息技术有限公司 Optimization method based on HDFS small file storage
CN107103095A (en) * 2017-05-19 2017-08-29 成都四象联创科技有限公司 Method for computing data based on high performance network framework
CN108710639A (en) * 2018-04-17 2018-10-26 桂林电子科技大学 A kind of mass small documents access optimization method based on Ceph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An access optimization method for massive small files in HDFS; Gu Yuwan et al.; Application Research of Computers; 2017-08-31; Vol. 34, No. 8; full text *

Also Published As

Publication number Publication date
CN109634914A (en) 2019-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant