CN112182094A - Big data distributed storage method in voice data character text form

Info

Publication number
CN112182094A
CN112182094A CN201910586613.0A CN201910586613A CN112182094A CN 112182094 A CN112182094 A CN 112182094A CN 201910586613 A CN201910586613 A CN 201910586613A CN 112182094 A CN112182094 A CN 112182094A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910586613.0A
Other languages
Chinese (zh)
Inventor
游萌
何云鹏
高君效
许兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-07-01
Filing date
2019-07-01
Publication date
2021-01-05
Application filed by Chipintelli Technology Co Ltd
Priority to CN201910586613.0A
Publication of CN112182094A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/0223: User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023: Free address space management
    • G06F12/0238: Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246: Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big data distributed storage method for voice data in text form, comprising the following features: the stored data takes the form of a combination of two key-value pairs formed by a voice text ID and the text content; the storage database is a Redis database under a Linux architecture; the epoll data structure and a non-blocking IO mechanism are used for large batches of files; the storage servers adopt a distributed master-slave structure, in which data storage and backup are performed at the slave server nodes and the master server schedules the servers. By combining the characteristics of voice big data training with the epoll data structure under the Linux architecture, the method reduces the complexity of data processing transactions and improves data read-write speed and data access stability.

Description

Big data distributed storage method in voice data character text form
Technical Field
The invention belongs to the technical field of software and relates to a data storage method, in particular to a big data distributed storage method for voice data in text form.
Background
As big data technology is applied more widely, technical development centered on artificial intelligence places high demands on the use and storage of data. Distributed data storage and use, fast scheduled access, and highly concurrent read-write requests have become research hotspots in the field of big data processing and have attracted heavy investment from industry; the storage security of distributed data is likewise a key indicator that deserves consideration.
Recognition training on voice data is a big data processing task: massive training texts are fed into an artificial neural network and trained repeatedly to obtain a more realistic voice model, which places high demands on the storage and read-write speed of the massive voice training texts during training.
Disclosure of Invention
The invention discloses a big data distributed storage method for voice data in text form, which aims to store and retrieve voice data more effectively and to improve read-write speed.
The big data distributed storage method for voice data in text form comprises the following features:
the stored data takes the form of a combination of two key-value pairs formed by a voice text ID and the text content;
the storage database is a Redis database under a Linux architecture;
the epoll data structure and a non-blocking IO mechanism are used for large batches of files;
the storage servers adopt a distributed master-slave structure, in which data storage and backup are performed at the slave server nodes and the master server schedules the servers.
Preferably, the data is organized using a mixture of a scattered design and a hash design, wherein the scattered design is used for modification and maintenance while data is being updated, and the hash design is used for data storage and archiving at a later stage.
Furthermore, when the storage upper limit of the database cannot accommodate the data traffic, the excess data is stored by extending the original hash table backward.
Preferably, during memory reclamation for the distributed storage data, if the data in memory follows a power-law distribution, memory is managed using the allkeys-lru mode; if the data is uniformly distributed, memory is managed using the allkeys-random mode.
Preferably, the data storage medium uses a serial singly linked list structure.
By combining the characteristics of voice big data training with the epoll data structure under the Linux architecture, the big data distributed storage method for voice data in text form reduces the complexity of data processing transactions and improves data read-write speed and data access stability.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a terminal server process in an application process according to the present invention;
FIG. 2 is a diagram illustrating an embodiment of a server architecture according to the present invention;
FIG. 3 is a diagram illustrating an integrated process from a server group to a terminal server according to an embodiment of the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The invention relates to a novel big data distributed storage method in text form, characterized by the following features:
the stored data takes the form of a combination of two key-value pairs formed by a voice text ID and the text content;
the storage database is a Redis database under a Linux architecture;
the epoll data structure and a non-blocking IO mechanism are used for large batches of files;
the storage servers adopt a distributed master-slave structure, in which data storage and backup are performed at the slave server nodes and the master server schedules the servers.
The voice data takes the form of a key-value pair combination formed by a voice text ID and the text content, so that calling the voice text ID calls the text. For efficient data storage, the invention uses a reliable and stable Linux architecture, which is suited to designs with high stability requirements.
The storage database is a Redis database, an in-memory database system that provides string, queue, and set structures. The supported data types include string, list, set, zset, and hash; the Redis database also supports several different sorting methods.
Meanwhile, the Redis database uses a single-threaded design, so thread safety under concurrent multithreading need not be considered, and problems such as context switching and contention for other resources are avoided. Because the Redis database is an in-memory database system, all read and write requests operate on memory, which improves data read-write speed. The IO multiplexing used by the Redis database also creates favorable conditions for fast access and fast reads and writes.
The key of each key-value pair contains a database field name with a verification function, and the value contains all the constraint rules applied to that field. The Hash data structure of the Redis database is used: a Hash is a mapping table between string fields and string values, and this mapping matches the structural design of the voice data. The actual text of the voice data is formed by combining two key-value pairs, the ID (data name) of the voice text and the text body content. Key-value pairs are suited to storing objects in one-to-one correspondence, and this correspondence fully meets the storage and management requirements of voice data.
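As a minimal sketch of the ID-to-text mapping described above, the following Python uses a plain dictionary as a stand-in for a Redis hash (on a real server the corresponding commands would be HSET and HGET). The class name `VoiceTextStore` and the ID format `utt-0001` are hypothetical and chosen only for illustration; they do not appear in the patent.

```python
from typing import Dict, Optional


class VoiceTextStore:
    """In-memory stand-in for a Redis hash mapping voice text IDs to text content."""

    def __init__(self) -> None:
        # field (voice text ID) -> value (text content), both strings as in a Redis hash
        self._fields: Dict[str, str] = {}

    def put(self, text_id: str, content: str) -> bool:
        """Store one text; return True if the ID was not present before."""
        is_new = text_id not in self._fields
        self._fields[text_id] = content
        return is_new

    def get(self, text_id: str) -> Optional[str]:
        """Calling the ID returns the text, or None when the ID is unknown."""
        return self._fields.get(text_id)


store = VoiceTextStore()
store.put("utt-0001", "turn on the air conditioner")
store.put("utt-0002", "set the temperature to twenty six degrees")
print(store.get("utt-0001"))
```

Because each ID maps to exactly one text, a lookup by ID is a single dictionary (or hash field) access, which is the one-to-one correspondence the description relies on.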
The epoll data structure is the Linux kernel's improved version of poll for handling large batches of file descriptors. It is an enhanced version of the select/poll multiplexed IO interface under Linux, and it can markedly improve system CPU utilization when only a few of a large number of concurrent connections are active.
When epoll is used to obtain events, the entire set of monitored descriptors does not need to be traversed; only the descriptors that have been asynchronously woken by kernel IO events and added to the ready queue are traversed. Besides the level-triggered mode (Level trigger) of select/poll-style IO events, epoll also provides an edge-triggered mode (Edge trigger), which allows a user-space program to cache IO state, reduces calls to functions such as epoll_wait/epoll_pwait, and improves application efficiency.
In the Redis environment, epoll is used internally to simplify each event: operations such as read, write, store, and connect are converted into independent simple events, and, owing to the multiplexing and mutual-exclusion characteristics of epoll, the IO interface does not become a bottleneck, which reduces wasted operation time.
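The readiness-based model described above can be sketched with Python's standard `selectors` module, whose `DefaultSelector` uses epoll on Linux. This is an illustration under assumptions, not the patent's implementation: the socket pair stands in for a client connection, and the label `voice-text-channel` is hypothetical. Note that `selectors` exposes only level-triggered behaviour; edge-triggered mode would require `select.epoll` with the `EPOLLET` flag.

```python
import selectors
import socket

# DefaultSelector picks the most efficient mechanism available (epoll on Linux).
sel = selectors.DefaultSelector()

# A connected socket pair stands in for a client connection (illustrative only).
reader, writer = socket.socketpair()
reader.setblocking(False)
sel.register(reader, selectors.EVENT_READ, data="voice-text-channel")

writer.sendall(b"utt-0001")

received = []
# select() returns only the descriptors that are ready; idle ones are not scanned.
for key, events in sel.select(timeout=1.0):
    if events & selectors.EVENT_READ:
        received.append((key.data, key.fileobj.recv(1024)))

sel.unregister(reader)
sel.close()
reader.close()
writer.close()
print(received)
```

With many registered connections, the loop body still runs only for the ready ones, which is why the cost does not grow with the number of idle descriptors.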
In the non-blocking IO mechanism, after a user thread initiates a read operation it obtains a result immediately, without waiting. If the result is an error (ERROR), the thread immediately issues the read operation again and repeats until the result is correct; that is, once the data in the kernel is ready and a read request from the user thread arrives again, the data is immediately copied to the user thread and the call returns. In the non-blocking IO model the user thread must keep asking whether the kernel data is ready, so non-blocking IO does not give up the CPU but keeps occupying it until the data is obtained.
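The retry behaviour of a non-blocking read can be sketched as follows. This is a minimal illustration, assuming a local socket pair in place of a real data source; the point at which data is sent (after the third failed attempt) is arbitrary and only simulates data arriving later.

```python
import socket

reader, writer = socket.socketpair()
reader.setblocking(False)  # reads now return immediately instead of waiting

attempts = 0
data = None
while data is None:
    attempts += 1
    try:
        data = reader.recv(1024)  # succeeds only once the kernel has data ready
    except BlockingIOError:
        # Nothing ready yet: the call returned an error at once, so retry.
        if attempts == 3:
            writer.sendall(b"training text")  # simulate data arriving later

reader.close()
writer.close()
print(attempts, data)
```

The loop never sleeps, so it keeps the CPU busy while it waits. That cost is what a readiness notification mechanism such as epoll removes, since the thread is told when a read will succeed.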
Combining the epoll data structure with the non-blocking IO mechanism lowers the trigger level of IO events and improves the running efficiency of the program.
The storage servers adopt a distributed master-slave structure, in which data storage and backup are performed at the slave server nodes and the master server schedules the servers.
The master server is mainly used to schedule the operation of the master and slave servers; the master server does not store data, and memory snapshots are taken at the slave server (Slave) side. The main control software is deployed on the master server, while important data such as the voice data used for training is placed on the slave servers. Backups of the voice data are kept in several backup mirrors on the slave servers, and the backup policy can be set to synchronize once per second to improve reliability. For greater security, the master server and all slave servers can be placed in one local area network, which improves the replication speed of the master-slave structure and the stability of read-write connections.
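The division of roles described above (a master that only schedules, slaves that store and mirror) can be sketched in Python. The patent does not say how the master chooses a slave, so routing by a CRC32 hash of the ID is an assumption made here for illustration; the class names `SlaveNode` and `MasterScheduler` and the mirror count are likewise hypothetical.

```python
import zlib
from typing import Dict, List, Optional


class SlaveNode:
    """Holds the data and a list of backup mirrors (plain dicts in this sketch)."""

    def __init__(self, name: str, mirror_count: int = 2) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.mirrors: List[Dict[str, str]] = [{} for _ in range(mirror_count)]

    def write(self, text_id: str, content: str) -> None:
        self.data[text_id] = content

    def sync_mirrors(self) -> None:
        """Copy the current data into every mirror (e.g. run once per second)."""
        for mirror in self.mirrors:
            mirror.clear()
            mirror.update(self.data)


class MasterScheduler:
    """Routes requests to slave nodes; stores no voice data itself."""

    def __init__(self, slaves: List[SlaveNode]) -> None:
        self.slaves = slaves

    def pick(self, text_id: str) -> SlaveNode:
        # Assumed routing rule: crc32 gives a hash that is stable across runs.
        return self.slaves[zlib.crc32(text_id.encode("utf-8")) % len(self.slaves)]

    def write(self, text_id: str, content: str) -> str:
        node = self.pick(text_id)
        node.write(text_id, content)
        return node.name

    def read(self, text_id: str) -> Optional[str]:
        return self.pick(text_id).data.get(text_id)


master = MasterScheduler([SlaveNode("slave-1"), SlaveNode("slave-2")])
stored_on = master.write("utt-0001", "turn on the light")
for slave in master.slaves:
    slave.sync_mirrors()
print(stored_on, master.read("utt-0001"))
```

In a real deployment the periodic `sync_mirrors` call would be driven by a timer or by the database's own persistence settings rather than called by hand.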
For the organization of the voice text data, a scattered design and a Hash design can be mixed, with the scattered design used for early-stage data organization and the Hash design for later-stage data organization.
Text stored under the scattered design is easy to modify, which suits the early maintenance stage. The Hash design is used for archiving once modification is largely complete, and supports fast access in the later stage; an upper-layer index of text data file names can also be established, so that the text data under the Hash design can be read quickly and in a targeted manner.
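One way to picture the two stages is the sketch below, in which a dictionary stands in for the key space of a database instance. The key prefixes `text:` and `archive:`, the batch name, and the `archive` helper are all hypothetical; the patent does not specify a key naming scheme or how the upper-layer index is built.

```python
from typing import Dict

# Stand-in for the key space of one database instance (hypothetical layout).
db: Dict[str, object] = {}

# Early stage, scattered design: one top-level string key per text, easy to edit.
db["text:utt-0001"] = "turn on the light"
db["text:utt-0002"] = "turn of the light"
db["text:utt-0002"] = "turn off the light"  # in-place correction during maintenance


def archive(db: Dict[str, object], prefix: str, archive_key: str) -> Dict[str, str]:
    """Move every scattered key under `prefix` into one hash and return a name index."""
    archived: Dict[str, str] = {}
    for key in [k for k in db if k.startswith(prefix)]:
        archived[key[len(prefix):]] = str(db.pop(key))
    db[archive_key] = archived
    # Upper-layer index: text name -> hash that holds it, for targeted reads.
    return {name: archive_key for name in archived}


index = archive(db, "text:", "archive:batch-1")
hash_key = index["utt-0002"]
print(db[hash_key]["utt-0002"])
```

After archiving, a reader consults the index to find which hash holds a given text and then performs a single field lookup in that hash.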
For access control of voice data, the design in which a new slave library has to be added at the data storage end of a slave server when read-write connection access comes under heavy pressure should be avoided as far as possible. The invention preferably uses an extended hash table: when the storage upper limit of the database cannot accommodate the data traffic, the excess data is stored by extending the original hash table backward, which avoids adding a new slave library. This keeps access fast and robust and keeps the distributed storage secure.
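The patent does not define "backward extension" precisely. One possible reading, sketched below purely as an assumption, is a table that appends a new segment to itself when the current one is full, so that callers keep using the same table instead of being pointed at a newly added slave library. The class name and the segment capacity are hypothetical.

```python
from typing import Dict, List, Optional


class ExtendableHashTable:
    """Hash table that grows by appending segments instead of adding a new database."""

    def __init__(self, segment_capacity: int) -> None:
        self.segment_capacity = segment_capacity
        self.segments: List[Dict[str, str]] = [{}]

    def put(self, key: str, value: str) -> None:
        # Update in place if the key already lives in some segment.
        for segment in self.segments:
            if key in segment:
                segment[key] = value
                return
        # Otherwise extend the table backward when the last segment is full.
        if len(self.segments[-1]) >= self.segment_capacity:
            self.segments.append({})
        self.segments[-1][key] = value

    def get(self, key: str) -> Optional[str]:
        for segment in self.segments:
            if key in segment:
                return segment[key]
        return None


table = ExtendableHashTable(segment_capacity=2)
for i in range(3):
    table.put(f"utt-{i:04d}", f"text {i}")
print(len(table.segments), table.get("utt-0002"))
```

Lookups in this sketch scan the segments in order, which is acceptable for a handful of segments; a production design would need a faster way to locate the right segment.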
During memory reclamation for the distributed storage data, if the data in memory follows a power-law distribution, that is, some data is accessed frequently while other data is accessed rarely, allkeys-lru is used (within the whole key space, the least recently used keys are removed first). If the data is uniformly distributed, that is, all data is accessed with the same frequency, allkeys-random is used (within the whole key space, a key is removed at random). During memory reclamation, Redis can actively delete some of the keys stored by the user from the instance, and the speed of deletion from the instance is the computation processing speed. With these two memory reclamation mechanisms, the response to data requests can be optimized more quickly, and the managed text data is processed faster.
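The choice between the two policies can be sketched as a small decision function. The rule used here (treat access as skewed when the hottest 20% of keys receive at least half of all accesses) and the names `choose_policy`, `skew_threshold`, and `LruCache` are illustrative assumptions, not part of the patent. The returned strings match the values accepted by the Redis `maxmemory-policy` setting. The `LruCache` class shows exact LRU eviction for clarity, whereas Redis itself uses an approximated LRU.

```python
from collections import OrderedDict
from typing import Dict, Optional


def choose_policy(access_counts: Dict[str, int], skew_threshold: float = 0.5) -> str:
    """Pick a maxmemory-policy name from observed per-key access counts."""
    total = sum(access_counts.values())
    if total == 0:
        return "allkeys-random"
    ranked = sorted(access_counts.values(), reverse=True)
    top = ranked[: max(1, len(ranked) // 5)]  # the hottest 20% of keys
    # Skewed (power-law-like) access: a few keys take most of the traffic.
    return "allkeys-lru" if sum(top) / total >= skew_threshold else "allkeys-random"


class LruCache:
    """Tiny LRU store showing what 'evict the least recently used key' means."""

    def __init__(self, max_keys: int) -> None:
        self.max_keys = max_keys
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_keys:
            self._items.popitem(last=False)  # drop the least recently used key


print(choose_policy({"a": 90, "b": 3, "c": 3, "d": 2, "e": 2}))
print(choose_policy({"a": 10, "b": 10, "c": 10, "d": 10, "e": 10}))
```

The first call describes skewed access and selects `allkeys-lru`; the second describes uniform access and selects `allkeys-random`.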
For high-speed network hard disk drives and redundant disk arrays, the data volume is typically large: the voice and text transcription results of online data usually occupy space at the TB level and are maintained without shutdown in the form of hot mirroring. To ensure access and execution speed, the storage medium preferably uses a serial singly linked list structure, which research and development showed to be the best choice in terms of access speed, reliability, maintenance cost, and ease of maintenance. The serial singly linked list structure is: Master followed in series by Slave 1, Slave 2, Slave 3, or more. The nodes also keep a backup policy setting, for example synchronizing once per second, to maintain high reliability.
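The serial arrangement (Master, then Slave 1, Slave 2, Slave 3) can be sketched as a singly linked list in which each node knows only its successor. The names `ChainNode` and `build_chain` are hypothetical. Treating the head as a pure forwarder that stores nothing is an assumption made to stay consistent with the statement that the master server does not store data.

```python
from typing import Dict, List, Optional


class ChainNode:
    """One server in the serial chain; `next` points to the following slave."""

    def __init__(self, name: str, stores_data: bool = True) -> None:
        self.name = name
        self.stores_data = stores_data
        self.data: Dict[str, str] = {}
        self.next: Optional["ChainNode"] = None

    def replicate(self, key: str, value: str) -> List[str]:
        """Pass a write down the chain from this node; return the visit order."""
        visited: List[str] = []
        node: Optional["ChainNode"] = self
        while node is not None:
            if node.stores_data:
                node.data[key] = value
            visited.append(node.name)
            node = node.next
        return visited


def build_chain(names: List[str]) -> ChainNode:
    """Link the servers in series; the first name is the non-storing master."""
    head = ChainNode(names[0], stores_data=False)
    tail = head
    for name in names[1:]:
        tail.next = ChainNode(name)
        tail = tail.next
    return head


master = build_chain(["master", "slave-1", "slave-2", "slave-3"])
order = master.replicate("utt-0001", "turn on the light")
print(order)
```

Adding another slave only requires linking it to the current tail, which is one reason a serial chain is simple to maintain.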
By combining the characteristics of voice big data training with the epoll data structure under the Linux architecture, the big data distributed storage method for voice data in text form reduces the complexity of data processing transactions and improves data read-write speed and data access stability.
The foregoing describes preferred embodiments of the present invention. Unless obviously contradictory or premised on another preferred embodiment, the preferred embodiments may be combined in any manner. The specific parameters in the examples and embodiments serve only to illustrate clearly the inventor's verification process and are not intended to limit the scope of patent protection of the invention, which is defined by the claims; equivalent structural changes made on the basis of the description of the invention also fall within the scope of protection.

Claims (5)

1. A big data distributed storage method for voice data in text form, characterized by comprising the following features:
the stored data takes the form of a combination of two key-value pairs formed by a voice text ID and the text content;
the storage database is a Redis database under a Linux architecture;
the epoll data structure and a non-blocking IO mechanism are used for large batches of files;
the storage servers adopt a distributed master-slave structure, in which data storage and backup are performed at the slave server nodes and the master server schedules the servers.
2. The big data distributed storage method for voice data in text form as claimed in claim 1, wherein the data is organized using a mixture of a scattered design and a hash design, wherein the scattered design is used for modification and maintenance while data is being updated, and the hash design is used for data storage and archiving at a later stage.
3. The big data distributed storage method for voice data in text form as claimed in claim 2, wherein when the storage upper limit of the database cannot accommodate the data traffic, the excess data is stored by extending the original hash table backward.
4. The big data distributed storage method for voice data in text form as claimed in claim 1, wherein during memory reclamation for the distributed storage data, if the data in memory follows a power-law distribution, memory is managed using the allkeys-lru mode; and if the data is uniformly distributed, memory is managed using the allkeys-random mode.
5. The big data distributed storage method for voice data in text form as claimed in claim 1, wherein the data storage medium uses a serial singly linked list structure.
CN201910586613.0A 2019-07-01 2019-07-01 Big data distributed storage method in voice data character text form Pending CN112182094A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910586613.0A CN112182094A (en) 2019-07-01 2019-07-01 Big data distributed storage method in voice data character text form

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910586613.0A CN112182094A (en) 2019-07-01 2019-07-01 Big data distributed storage method in voice data character text form

Publications (1)

Publication Number Publication Date
CN112182094A true CN112182094A (en) 2021-01-05

Family

ID=73915567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910586613.0A Pending CN112182094A (en) 2019-07-01 2019-07-01 Big data distributed storage method in voice data character text form

Country Status (1)

Country Link
CN (1) CN112182094A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051530A (en) * 2024-04-16 2024-05-17 福建时代星云科技有限公司 Data transmission storage and read-write method and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103945167A (en) * 2014-03-27 2014-07-23 东莞中山大学研究院 Digital family video conferencing system based on p2p
CN107229695A (en) * 2017-05-23 2017-10-03 深圳大学 Multi-platform aviation electronics big data system and method
CN108429789A (en) * 2018-02-02 2018-08-21 广州云印信息科技有限公司 A kind of mobile wireless network communication system and method based on automatic vending machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
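Cnblogs (博客园): "Redis memory eviction mechanism" (Redis内存淘汰机制), HTTPS://WWW.CNBLOGS.COM/TV151579/P/7582100.HTML *
姜惠友 et al.: "Research on the compatibility of high-performance network protocol stacks" (高性能网络协议栈兼容性研究), Telecommunications Science (电信科学) *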


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210105