CN112182094A - Big data distributed storage method in voice data character text form

Info

Publication number: CN112182094A
Application number: CN201910586613.0A
Authority: CN
Inventors: 游萌; 何云鹏; 高君效; 许兵
Original assignee: Chipintelli Technology Co Ltd
Current assignee: Chipintelli Technology Co Ltd
Priority date: 2019-07-01
Filing date: 2019-07-01
Publication date: 2021-01-05

Abstract

A big data distributed storage method for voice data in text form has the following features: the stored data takes the form of a key-value pair combining two parts, a voice text ID and the text content, and the storage database is a REDIS database under a Linux architecture; an epoll data structure and a non-blocking IO mechanism are used to handle large batches of files; the storage servers form a distributed master-slave structure in which data storage and backup take place on the slave server nodes and the master server schedules the servers. Tailored to the characteristics of big data training for speech, the method uses the epoll data structure under the Linux architecture to reduce the complexity of data processing transactions and to improve data read-write speed and the stability of data access.

Description

Big data distributed storage method in voice data character text form

Technical Field

The invention belongs to the technical field of software and relates to a data storage method, in particular to a big data distributed storage method for voice data in text form.

Background

As big data technology is applied more widely, technical development centered on artificial intelligence places high demands on the use and storage of data. Distributed data storage and use, fast scheduled access, and highly concurrent read-write requests have become research hotspots in big data processing and attract heavy investment from industry. The storage security of distributed data is also a key indicator that deserves attention.

Recognition training on voice data is a big data processing task: massive volumes of training text are fed into an artificial neural network and trained on repeatedly to obtain a more realistic speech model. The training process therefore places high demands on the storage and read-write speed of the massive voice training text.

Disclosure of Invention

The invention discloses a big data distributed storage method for voice data in text form, which aims to store and retrieve voice data more effectively and to improve read-write speed.

The big data distributed storage method in the form of the voice data text comprises the following characteristics:

the stored data adopts a combined form of two key value pairs formed by a voice text ID and text content;

the storage database adopts a REDIS database under a Linux architecture;

using an epoll data structure and adopting a non-blocking IO mechanism for large-batch files;

the storage server adopts a distributed master-slave structure, data storage and backup are performed at slave server nodes, and the master server is used for scheduling the servers.

Preferably, the data is organized using a mixture of a scattered design and a hash design, where the scattered design is used for modification and maintenance while data is being updated, and the hash design is used for data storage and archiving at a later stage.

Furthermore, when the storage upper limit of the database cannot accommodate the data flow, the excess data is stored by extending the original hash table backward.

Preferably, during memory reclamation for the distributed storage data, if the data in memory follows a power-law distribution, memory is managed in allkeys-lru mode; if the data is accessed uniformly, memory is managed in allkeys-random mode.

Preferably, the data storage medium uses a serial singly linked list structure.

Tailored to the characteristics of big data training for speech, the big data distributed storage method for voice data in text form uses the epoll data structure under the Linux architecture to reduce the complexity of data processing transactions and to improve data read-write speed and the stability of data access.

Drawings

FIG. 1 is a diagram illustrating an embodiment of a terminal server process in an application process according to the present invention;

FIG. 2 is a diagram illustrating an embodiment of a server architecture according to the present invention;

FIG. 3 is a diagram illustrating an integrated process from a server group to a terminal server according to an embodiment of the present invention.

Detailed Description

The following provides a more detailed description of the present invention.

The invention relates to a novel big data distributed storage method in a text form, which is characterized by comprising the following steps:

the stored data adopts a combined form of two key value pairs formed by a voice text ID and text content;

the storage database adopts a REDIS database under a Linux architecture;

The voice data takes the form of a key-value pair made up of a voice text ID and the text content, so that calling the voice text ID calls the text itself. For efficient data storage, the invention uses a reliable and stable Linux architecture, which by design suits applications with high stability requirements.

The storage database is a REDIS database, an in-memory database system that provides string, queue, and set structures. The supported data types include string, list, set, zset, hash, and others; the REDIS database also supports several different sorting methods.

The REDIS database is also single-threaded, so thread-safety problems under concurrent multithreading do not need to be considered, and context switching and contention for other resources are avoided. Because REDIS is an in-memory database system, all read-write requests operate on memory, which improves data read-write speed. The IO multiplexing used by the REDIS database also creates favorable conditions for fast access and fast reads and writes.

The invention uses the Hash data structure of the REDIS database, a string-typed mapping table of fields and values. A Hash stores mappings between strings and string values, and this mapping matches the structure of the voice data: the actual text of the voice data is a key-value pair that combines two parts, the ID (data name) of the voice text and the text body content. The key contains a database field name with a verification function, and the value contains all the constraint rules applied to that field. Key-value pairs suit objects stored in one-to-one correspondence, and this correspondence fully meets the storage and management requirements of voice text data.
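The ID-to-text mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dictionary stands in for a REDIS Hash, and the class name VoiceTextStore, its methods, and the sample IDs are hypothetical names that do not appear in the invention.

```python
# Illustrative stand-in for a Redis hash: one field (voice text ID) maps to
# one string value (the text content). All names here are hypothetical.
from typing import Dict, Optional


class VoiceTextStore:
    """In-memory mapping that mimics the field -> value layout of a hash."""

    def __init__(self) -> None:
        self._texts: Dict[str, str] = {}

    def put(self, text_id: str, content: str) -> None:
        # Comparable in spirit to storing one field of a hash.
        self._texts[text_id] = content

    def get(self, text_id: str) -> Optional[str]:
        # Calling the ID returns the text; None if the ID is unknown.
        return self._texts.get(text_id)


store = VoiceTextStore()
store.put("utt-0001", "turn on the air conditioner")
store.put("utt-0002", "set the temperature to twenty six degrees")
print(store.get("utt-0001"))
```

Looking a text up by its ID is a single dictionary access here, which mirrors the one-to-one correspondence the description relies on.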

The epoll data structure is the Linux kernel's improved version of poll for handling large batches of file descriptors. It is an enhanced version of the multiplexed IO interfaces select/poll under Linux, and it can markedly improve system CPU utilization when a program holds a large number of concurrent connections of which only a few are active.

When events are fetched through the epoll data structure, the entire set of monitored descriptors does not need to be traversed; only the descriptors that the kernel has asynchronously woken on an IO event and added to the ready queue are traversed. Besides the level-triggered (Level Triggered) IO events of select/poll, epoll also provides edge-triggered (Edge Triggered) events, which allows a user-space program to cache IO state, reduce calls to functions such as epoll_wait/epoll_pwait, and improve application efficiency.
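The point that only ready descriptors are returned can be illustrated with a short Python sketch. It assumes Python's standard selectors module, whose DefaultSelector is backed by epoll on Linux; the three socket pairs and the payload are hypothetical stand-ins for client connections, and the sketch uses the default level-triggered behavior rather than edge triggering.

```python
import selectors
import socket

# DefaultSelector picks the most efficient mechanism available on the
# platform; on Linux this is epoll.
sel = selectors.DefaultSelector()

# Three connected socket pairs stand in for many mostly idle connections.
pairs = [socket.socketpair() for _ in range(3)]
for index, (reader, _writer) in enumerate(pairs):
    reader.setblocking(False)
    sel.register(reader, selectors.EVENT_READ, data=index)

# Only connection 1 becomes active.
pairs[1][1].sendall(b"utt-0002")

# select() returns only the ready descriptors, not the whole watched set.
ready = sel.select(timeout=1.0)
ready_ids = [key.data for key, _events in ready]
payload = ready[0][0].fileobj.recv(1024)
print(ready_ids, payload)

# Clean up the registrations and sockets.
for reader, writer in pairs:
    sel.unregister(reader)
    reader.close()
    writer.close()
sel.close()
```

Although three connections are registered, the loop over the result touches one entry, which is the saving the paragraph above describes.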

Within the REDIS environment, the epoll data structure is used internally to simplify every event: operations such as read, write, store, and connect are converted into independent simple events. Thanks to the multiplexing and mutual-exclusion characteristics of epoll, the IO interface does not become a bottleneck, which reduces wasted operation time.

With the non-blocking IO mechanism, a user thread that initiates a read operation gets a result immediately instead of waiting. If the result is an error (ERROR), the thread immediately issues the read operation again and repeats this until the result is valid. Once the data in the kernel is ready and the kernel receives another read request from the user thread, it immediately copies the data to the user thread and returns. In the non-blocking IO model, the user thread must therefore keep asking whether the kernel data is ready; non-blocking IO does not yield the CPU but keeps occupying it until the data is obtained.
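The retry behavior of a non-blocking read can be sketched as follows. This is a minimal illustration using Python's standard socket module; the helper name poll_read and its parameters are hypothetical, and the short sleep between attempts is an addition of this sketch (a pure non-blocking loop as described above would retry without pausing).

```python
import socket
import time


def poll_read(sock: socket.socket, size: int = 1024, max_tries: int = 1000):
    """Retry a non-blocking recv until data arrives.

    Returns (data, failed_tries). Hypothetical helper for illustration.
    """
    sock.setblocking(False)
    failed = 0
    for _ in range(max_tries):
        try:
            return sock.recv(size), failed
        except BlockingIOError:
            # The kernel has no data yet: the call returned at once with
            # an error instead of waiting.
            failed += 1
            time.sleep(0.001)  # sketch-only pause to avoid a hard spin
    raise TimeoutError("no data became ready")


reader, writer = socket.socketpair()
reader.setblocking(False)

# First attempt: nothing has been written, so the read fails immediately.
try:
    reader.recv(1024)
    first_attempt_blocked = False
except BlockingIOError:
    first_attempt_blocked = True

# Once data is ready in the kernel, the next read returns it at once.
writer.sendall(b"hello")
data, failed = poll_read(reader)
print(first_attempt_blocked, data, failed)

reader.close()
writer.close()
```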

By combining the epoll data structure with the non-blocking IO mechanism, the trigger level of the IO event is reduced, and the running efficiency of the program is improved.

The master server is mainly used to schedule the operation of the master and slave servers. The master server does not store data, and the slave server (Slave) side takes memory snapshots. The main control software is placed on the master server, while important data such as the voice data used for training is placed on the slave servers. Backups of the voice data are kept in multiple backup mirrors on the slave servers, and the backup strategy can be set to synchronize once per second, which improves reliability. For greater security, the master server and all slave servers can be placed in one local area network, which improves the replication speed of the master-slave structure and the stability of read-write connections.

The voice text data can be organized with a mixture of a scattered design and a Hash design, where the scattered design is used for early-stage data organization and the Hash design for later-stage data organization.

Text stored under the scattered design is easy to modify, so that design suits the early maintenance stage. The Hash design is used for archiving and storage once modification is largely complete, and it supports quick access in the later fast-access stage. An upper-level index of text data file names can also be built so that the text data structure under the Hash design can be read quickly and selectively.

For access control of voice data, the invention avoids as far as possible any design that requires adding a slave database at the data storage end of a slave server when read-write connections come under heavy pressure. The invention preferably uses an extended hash table: when the storage upper limit of the database cannot accommodate the data flow, the excess data is stored by extending the original hash table backward, which avoids adding a new secondary database. This keeps access fast and robust and keeps the distributed storage secure.

During memory reclamation for the distributed storage data, if the data in memory follows a power-law distribution, that is, some data is accessed frequently while other data is accessed rarely, allkeys-lru is used (across the whole key space, the least recently used keys are removed first). If the data is accessed uniformly, that is, all data has the same access frequency, allkeys-random is used (across the whole key space, a key is removed at random). Memory reclamation means that Redis can actively delete some of the keys stored by the user from an instance, and the speed of deletion from the instance is the computation processing speed. With these two memory reclamation mechanisms, the response mechanism for the data can be optimized more quickly, and managed text data is processed faster.
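The difference between the two reclamation modes can be sketched with a small bounded store in Python. This is an illustration under stated assumptions, not Redis's implementation: the class BoundedStore and its methods are hypothetical, eviction here uses exact LRU whereas Redis uses an approximated LRU, and in a real deployment the mode would be chosen through the maxmemory-policy configuration directive.

```python
import random
from collections import OrderedDict
from typing import List, Optional


class BoundedStore:
    """Tiny key-value store that evicts one key when it is full.

    Hypothetical illustration of the allkeys-lru and allkeys-random ideas.
    """

    def __init__(self, capacity: int, policy: str, seed: int = 0) -> None:
        if policy not in ("allkeys-lru", "allkeys-random"):
            raise ValueError(f"unsupported policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self._data: "OrderedDict[str, str]" = OrderedDict()
        self._rng = random.Random(seed)

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key: str, value: str) -> None:
        if key not in self._data and len(self._data) >= self.capacity:
            self._evict()
        self._data[key] = value
        self._data.move_to_end(key)

    def _evict(self) -> None:
        if self.policy == "allkeys-lru":
            # Remove the key whose last access is oldest.
            self._data.popitem(last=False)
        else:
            # Remove any key, chosen at random over the whole key space.
            victim = self._rng.choice(list(self._data))
            del self._data[victim]

    def keys(self) -> List[str]:
        return list(self._data)


lru = BoundedStore(capacity=2, policy="allkeys-lru")
lru.set("a", "1")
lru.set("b", "2")
lru.get("a")       # "a" is now more recently used than "b"
lru.set("c", "3")  # the store is full, so the LRU key "b" is evicted
print(lru.keys())

rnd = BoundedStore(capacity=2, policy="allkeys-random")
for name in ("a", "b", "c"):
    rnd.set(name, name)
print(len(rnd.keys()))
```

Under skewed access, LRU tends to keep the frequently read keys in memory; when every key is equally likely to be read, no key is a better eviction candidate than another, so a random choice avoids the bookkeeping.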

High-speed network hard disk drives and redundant disk arrays typically hold large volumes of data; the voice data and text transcription results of online data usually occupy space at the TB level and are maintained without shutdown through hot mirroring. To guarantee access and execution speed at the storage-medium level, a serial singly linked list structure is preferred. Comparing access speed, reliability, maintenance cost, and ease of maintenance, this structure proved the best choice during research and development. The serial singly linked list structure is: a Master followed in series by Slave 1, Slave 2, Slave 3, or more. The nodes also keep a backup policy setting, for example synchronizing once per second, to maintain high reliability.
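The serial Master, Slave 1, Slave 2, Slave 3 arrangement can be pictured as a singly linked list in which each node hands writes to its successor. The following Python sketch is one possible reading, not the invention's implementation: the Node class, build_chain, and the node names are hypothetical, the master is modeled as a scheduler that stores nothing, and forwarding happens immediately instead of on a once-per-second schedule.

```python
from typing import Dict, List, Optional


class Node:
    """One server in a serial chain; it forwards writes to its successor."""

    def __init__(self, name: str, stores_data: bool = True) -> None:
        self.name = name
        self.stores_data = stores_data
        self.data: Dict[str, str] = {}
        self.next: Optional["Node"] = None

    def write(self, key: str, value: str) -> None:
        if self.stores_data:
            self.data[key] = value
        if self.next is not None:
            # Pass the write down the chain to the next node.
            self.next.write(key, value)


def build_chain(names: List[str]) -> Node:
    """Link the nodes in order; the first name is the non-storing master."""
    nodes = [Node(names[0], stores_data=False)]
    nodes += [Node(name) for name in names[1:]]
    for current, successor in zip(nodes, nodes[1:]):
        current.next = successor
    return nodes[0]


head = build_chain(["master", "slave1", "slave2", "slave3"])
head.write("utt-0001", "hello")

# Walk the chain and collect what each slave holds for the key.
copies = []
node = head.next
while node is not None:
    copies.append((node.name, node.data.get("utt-0001")))
    node = node.next
print(copies)
```

Each node needs to know only its successor, so adding a further slave means linking one more node to the tail.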

The foregoing describes preferred embodiments of the present invention. Unless they are obviously contradictory or one is a prerequisite of another, the preferred embodiments may be combined in any way. The specific parameters in the examples and embodiments serve only to illustrate clearly the inventors' verification process and are not intended to limit the scope of patent protection of the present invention, which is defined by the claims; equivalent structural changes made on the basis of the description of the present invention are likewise included in its scope of protection.

Claims

1. A big data distributed storage method in a voice data character text form is characterized by comprising the following characteristics:

the storage database adopts a REDIS database under a Linux architecture;

2. The big data distributed storage method in a voice data character text form as claimed in claim 1, wherein a mixture of a scattered design and a hash design is adopted for the organization of the data, wherein the scattered design is used for modification and maintenance during data updating, and the hash design is used for data storage and archiving at a later stage.

3. The big data distributed storage method in a voice data character text form as claimed in claim 2, wherein when the storage upper limit of the database cannot accommodate the data flow, the excess data is stored by extending the original hash table backward.

4. The big data distributed storage method in a voice data character text form as claimed in claim 1, wherein in the memory reclamation process of the distributed storage data, if the data in the memory follows a power-law distribution, memory management is performed in allkeys-lru mode; and if the data is accessed uniformly, memory management is performed in allkeys-random mode.

5. The big data distributed storage method in a voice data character text form as claimed in claim 1, wherein the data storage medium uses a serial singly linked list structure.