CN107391544B - Processing method, device and equipment of column type storage data and computer storage medium

Info

Publication number: CN107391544B
Application number: CN201710374036.XA
Authority: CN
Inventors: 孙垚光
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2020-06-30
Anticipated expiration: 2037-05-24
Also published as: CN107391544A

Abstract

The application provides a method and a device for processing columnar storage data. The method comprises: receiving new data for an original data file, wherein the original data file is stored in a columnar storage format and the Footer information of the original data file is recorded in a Footer file independent of the original data file; and writing the new data to the tail of the original data file in the columnar storage format, and adding Footer information for the new data to the Footer file, to obtain an updated original data file and an updated Footer file. According to the embodiments of the application, data can be appended to columnar storage data efficiently in a streaming manner: the new data is appended to the original data file itself rather than recorded in a newly created file, so processing is more efficient, fewer resources are occupied, and data queries are faster.

Description

Processing method, device and equipment of column type storage data and computer storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a computer storage medium for processing column-type stored data.

Background

Traditional row-based storage stores each row of data contiguously. A columnar storage format instead serializes some or all of the data values of one column of a data file and stores them contiguously, then stores some or all of the data values of the next column, and so on. Footer information, that is, storage format description information of the data file, is written at the tail of the data; it includes metadata of the file, the number of columns, relative positions, data type information, or statistical information.

In practical applications, new data often needs to be added to an original data file. In the related art, newly obtained data is usually stored by creating a new data file. However, the new data file occupies resources, which affects data processing efficiency, and a data query must search the original data file and the new data file separately, which lowers query efficiency.

Disclosure of Invention

To overcome the problems in the related art, the present application provides a method, an apparatus, a device, and a computer storage medium for processing columnar storage data.

According to a first aspect of embodiments of the present application, there is provided a method for processing columnar storage data, the method including:

receiving new data for an original data file, wherein the original data file is stored in a columnar storage format, and the Footer information of the original data file is recorded in a Footer file independent of the original data file;

and writing the new data to the tail of the original data file in the columnar storage format, and adding Footer information for the new data to the Footer file, to obtain an updated original data file and an updated Footer file.

In an optional implementation, after receiving the new data for the original data file, the method includes:

loading the received new data into a high-speed storage space;

and the writing of the new data to the tail of the original data file in the columnar storage format includes:

when the new data loaded in the high-speed storage space meets a preset storage condition, writing the loaded new data to the tail of the original data file in the columnar storage format.

In an optional implementation manner, the preset storage condition includes one or more of the following conditions:

the data volume of the new data reaches a preset data volume threshold; or

the loading duration of the new data in the high-speed storage space reaches a preset duration threshold.

In an optional implementation, the method further includes:

when a data query request for the original data file is obtained, reading first data satisfying the request from the original data file according to the Footer file before updating, and reading second data satisfying the request from the new data loaded in the high-speed storage space;

and merging the first data and the second data, and outputting the merged data.

In an optional implementation, the method further includes:

generating a copy file of the loaded new data, and storing that copy file in the same directory as the copy file of the original data file.

According to a second aspect of embodiments of the present application, there is provided a processing apparatus for columnar storage data, comprising:

a data receiving module, configured to: receive new data for an original data file, wherein the original data file is stored in a columnar storage format, and the Footer information of the original data file is recorded in a Footer file independent of the original data file;

a data writing module, configured to: write the new data to the tail of the original data file in the columnar storage format, and add Footer information for the new data to the Footer file, to obtain an updated original data file and an updated Footer file.

In an optional implementation manner, the data receiving module is further configured to:

after receiving the new data for the original data file, load the received new data into a high-speed storage space;

the data writing module is specifically configured to:

In an optional implementation manner, the apparatus further includes a reading module, configured to:

In an optional implementation manner, the apparatus further includes a copy processing module, configured to:

generate a copy file of the loaded new data, and store that copy file in the same directory as the copy file of the original data file.

According to a third aspect of embodiments of the present application, there is provided a computer apparatus comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

According to a fourth aspect of embodiments herein, there is provided a computer storage medium having stored therein program instructions, the program instructions comprising:

The technical solutions provided by the embodiments of the application can have the following beneficial effects:

In the application, unlike the related art in which the Footer information of the original data file is stored at the tail of the file, an independent Footer file is used to record it, so that new data can be appended directly to the tail of the original data file and the Footer information of the new data is recorded in the Footer file. According to the embodiments of the application, data can be appended to columnar storage data efficiently in a streaming manner: the new data is appended to the original data file itself rather than recorded in a newly created file, so processing is more efficient, fewer resources are occupied, and data queries are faster.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Fig. 1A is a schematic diagram of a columnar storage format in the related art.

Fig. 1B is a schematic diagram of another columnar storage format in the related art.

Fig. 2 is a flowchart illustrating a method for processing columnar storage data according to an exemplary embodiment of the present application.

Fig. 3A is an application scenario diagram illustrating a processing method of columnar storage data according to an exemplary embodiment of the present application.

Fig. 3B is a schematic diagram illustrating that new data is loaded in a memory according to an exemplary embodiment of the present application.

Fig. 4 is a hardware configuration diagram of a computer device in which the processing apparatus for storing data in a column format according to the present application is located.

Fig. 5 is a block diagram of a processing apparatus for columnar storage data according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.

First, the columnar storage format of a file is described. Fig. 1A is a schematic diagram of a columnar storage format in the related art. In Fig. 1A, all data values of one column of a data file are serialized together and stored contiguously on disk, followed by all data values of the next column. A columnar storage format may also store only part of the data values of a column contiguously. Fig. 1B is a schematic diagram of another columnar storage format in the related art. In Fig. 1B, the whole data set is first divided by rows into two blocks (for example, the RowGroup concept in Parquet; the number of blocks may also be another value), and the data in each block is then stored column by column.

In addition, Footer information is written at the tail of the data, including metadata of the data file, the number of columns, relative positions, data type information, statistical information, and the like. Because a columnar storage format serializes all or part of the data of each column of a data file for storage, the Footer information records the relevant layout information at storage time so that it can be used when the data is read.
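The block-then-column layout of Fig. 1B can be illustrated with a minimal in-memory sketch. The function name `to_row_groups` and the sample rows are assumptions made for illustration only; they are not taken from the application or from any real columnar format library.

```python
from typing import Any


def to_row_groups(
    rows: list[dict[str, Any]], columns: list[str], rows_per_group: int
) -> list[dict[str, list[Any]]]:
    """Divide rows into blocks by row, then lay out each block column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        block = rows[start:start + rows_per_group]
        # Within one block, all values of a column are kept together.
        groups.append({name: [row[name] for row in block] for name in columns})
    return groups


rows = [
    {"id": 1, "city": "Hangzhou"},
    {"id": 2, "city": "Beijing"},
    {"id": 3, "city": "Shanghai"},
]
layout = to_row_groups(rows, ["id", "city"], rows_per_group=2)
```

With `rows_per_group` set to the total number of rows, the same function produces the single-block layout of Fig. 1A.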

In practical applications, new data often needs to be added to an original data file. In the related art, because Footer information is written at the tail of the data, newly acquired data is usually stored by creating a new data file.

Unlike the related art, in which the Footer information of an original data file is stored at the tail of the file, the embodiments of the application record the Footer information in an independent Footer file, so that new data can be appended directly to the tail of the original data file and the Footer information of the new data is recorded in the Footer file. According to the embodiments of the application, data can be appended to columnar storage data efficiently in a streaming manner: the new data is appended to the original data file itself rather than recorded in a newly created file, so processing is more efficient, fewer resources are occupied, and data queries are faster. Next, the embodiments of the present application are described in detail.

As shown in fig. 2, fig. 2 is a flowchart illustrating a processing method of columnar storage data according to an exemplary embodiment of the present application, which is applicable to a database adopting a columnar storage format, and includes the following steps 201 to 202:

In step 201, new data for an original data file is received, where the original data file is stored in a columnar storage format and the Footer information of the original data file is recorded in a Footer file independent of the original data file.

In step 202, the new data is written to the tail of the original data file in the columnar storage format, and Footer information for the new data is added to the Footer file, to obtain an updated original data file and an updated Footer file.

For an original data file stored in columnar form, this embodiment creates a separate Footer file to record the Footer information of the original data file. With this arrangement, when new data arrives it can be appended directly to the tail of the original data file as new data blocks, so the new data is still written into the original data file and efficient streaming append is achieved. The Footer information is then updated in the Footer file.
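Steps 201 and 202 can be sketched as follows, under stated assumptions: column values are serialized as JSON, the Footer file holds one JSON line per appended block, and the names `append_row_group`, `read_column`, `table.data`, and `table.footer` are hypothetical. The application does not prescribe any of these encodings or names.

```python
import json
import os
import tempfile


def append_row_group(data_path, footer_path, columns):
    """Append one block to the tail of the data file and record it in the Footer file.

    `columns` maps a column name to the list of values for this block.
    """
    entry = {"num_rows": len(next(iter(columns.values()))), "columns": {}}
    with open(data_path, "ab") as data:
        data.seek(0, os.SEEK_END)  # make the tail position explicit
        for name, values in columns.items():
            payload = json.dumps(values).encode("utf-8")
            entry["columns"][name] = {"offset": data.tell(), "length": len(payload)}
            data.write(payload)
    # The Footer lives in its own file, so existing data bytes are never rewritten.
    with open(footer_path, "a", encoding="utf-8") as footer:
        footer.write(json.dumps(entry) + "\n")
    return entry


def read_column(data_path, footer_path, name):
    """Read one column across all blocks by following the Footer entries."""
    values = []
    with open(footer_path, encoding="utf-8") as footer, open(data_path, "rb") as data:
        for line in footer:
            meta = json.loads(line)["columns"][name]
            data.seek(meta["offset"])
            values.extend(json.loads(data.read(meta["length"])))
    return values


workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "table.data")
footer_path = os.path.join(workdir, "table.footer")
append_row_group(data_path, footer_path, {"id": [1, 2], "city": ["Hangzhou", "Beijing"]})
append_row_group(data_path, footer_path, {"id": [3], "city": ["Shanghai"]})
ids = read_column(data_path, footer_path, "id")
```

Because the data file never ends with a Footer, the second call only adds bytes after the first block; a reader that holds an older copy of the Footer file still sees a consistent prefix of the data.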

In practical applications, an original data file that has been stored in columnar form already resides on a disk. In some examples, the new data may be obtained in real time and written into the original data file in real time. In other examples, the volume of new data may be large and the data may arrive continuously; to improve data processing efficiency, after receiving the new data for the original data file, the method may further include:

loading the received new data into the high-speed storage space.

The high-speed storage space may include a space for temporarily storing data, such as a memory or a cache, or a buffer space for data exchange, and may be determined according to the hardware environment of the actual database system, which is not limited in this embodiment.

In this way, the new data can be temporarily loaded into a high-speed storage space such as a memory; specifically, the new data may be held in the memory in a row-based layout. The new data is then written to the original data file together, which can improve data processing efficiency. The preset storage condition determines when the new data is written from the high-speed storage space to the original data file, and it can be configured flexibly in practical applications. For example, the loaded new data may be appended to the original data file when the current utilization of the high-speed storage space becomes high, to prevent data loss caused by overflow of the memory or cache; or the loaded new data may be appended when the current utilization of the high-speed storage space is low, so that the data is processed while the hardware is idle.

In an optional implementation manner, the preset storage condition may include one or more of the following conditions:

First, the data volume of the new data reaches a preset data volume threshold. Here the preset storage condition takes data volume into account, so that the new data is appended to the original data file promptly when its volume becomes large, preventing problems such as data loss caused by an excessive data volume. The data volume threshold may be configured flexibly as needed, which is not limited in this embodiment.

Second, the loading duration of the new data in the memory reaches a preset duration threshold. Here the preset storage condition takes the loading duration into account, so that the new data is appended to the original data file promptly after it has been held in the memory for a certain time, preventing problems such as data loss caused by an overly long loading time. The duration threshold may be configured flexibly as needed, which is not limited in this embodiment.

With these two conditions, the new data loaded in the high-speed storage space is appended to the original data file at a reasonable time. Taking data volume and/or duration into account avoids the resource consumption of appending new data to the original data file too frequently, while still appending it in time to prevent problems such as data loss.
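The two preset storage conditions can be sketched as a small buffer that flushes on whichever threshold is reached first. The class name `AppendBuffer`, its parameters, and the injected `clock` are illustrative assumptions; the row count stands in for the data volume, and `flush` stands in for the write to the tail of the original data file.

```python
import time
from typing import Any, Callable


class AppendBuffer:
    """Hold new rows in memory until a size or age threshold is reached."""

    def __init__(
        self,
        flush: Callable[[list[Any]], None],
        max_rows: int,
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_rows = max_rows
        self._max_age = max_age_seconds
        self._clock = clock
        self._rows: list[Any] = []
        self._first_loaded_at = 0.0

    def add(self, row: Any) -> None:
        if not self._rows:
            # The loading duration is measured from the oldest buffered row.
            self._first_loaded_at = self._clock()
        self._rows.append(row)
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Flush when either condition holds; return True if a flush happened."""
        if not self._rows:
            return False
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_loaded_at >= self._max_age
        if too_many or too_old:
            self._flush(self._rows)
            self._rows = []
            return True
        return False


flushed: list[list[str]] = []
now = [0.0]  # fake clock so the example is deterministic
buf = AppendBuffer(flushed.append, max_rows=3, max_age_seconds=10.0, clock=lambda: now[0])
buf.add("a")
buf.add("b")          # below both thresholds, nothing is written yet
now[0] = 11.0
buf.flush_if_due()    # duration threshold reached
for row in ("c", "d", "e"):
    buf.add(row)      # data volume threshold reached on the third row
```

In a real system `flush_if_due` would also need to be called periodically by a timer, since the duration condition can become true while no new rows are arriving.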

In the related art, some hardware such as a memory or a cache may require data to be stored in blocks when it is loaded. For this case, in this embodiment, loading the new data into the memory may include: splitting the new data into one or more data blocks, in units of a preset data block size, and loading the data blocks into the memory.
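Splitting by a preset block size can be sketched in a few lines. The helper name `split_into_blocks` and the 4-byte block size are assumptions chosen only to keep the example small.

```python
def split_into_blocks(payload: bytes, block_size: int) -> list[bytes]:
    """Split payload into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]


blocks = split_into_blocks(b"abcdefghij", 4)
```

The last block may be shorter than the preset size; concatenating the blocks in order restores the original payload.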

In a database system, new data usually raises the question of how soon it becomes visible. For example, after new data is loaded into the memory, a user may need to query some data, and the related art generally handles this by querying only the original data file. At that point the new data may be held in the memory and not yet stored on the disk, so if some of the data the user needs is still in the memory, it is not returned: the data output to the user is incomplete, and the timeliness of visibility for the loaded new data is poor. To address this issue, the method of the embodiments of the present application may further include:

when a data query request for the original data file is obtained, reading first data satisfying the request from the original data file according to the Footer file before updating, and reading second data satisfying the request from the new data loaded in the high-speed storage space.

In this embodiment, when a data query request for the original data file is obtained, first data satisfying the request is read from the original data file stored on the disk, and second data satisfying the request is read from the new data loaded in the high-speed storage space. The first data and the second data are then merged, and the merged data is output as the response to the data query request. In this way the new data loaded in the high-speed storage space can also be queried, which solves the problem of incomplete output and gives the data high timeliness of visibility.
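The read path described above can be sketched with two in-memory lists standing in for the on-disk file and the high-speed storage space. The function name `query`, the `predicate` parameter, and the sample rows are hypothetical; a real reader would locate the first data through the Footer file rather than scanning a list.

```python
from typing import Any, Callable

Row = dict[str, Any]


def query(
    persisted_rows: list[Row],
    buffered_rows: list[Row],
    predicate: Callable[[Row], bool],
) -> list[Row]:
    """Answer one request from both the persisted data and the buffered new data."""
    first = [row for row in persisted_rows if predicate(row)]   # from the data file
    second = [row for row in buffered_rows if predicate(row)]   # from memory
    return first + second                                       # merged response


on_disk = [{"id": 1, "city": "Hangzhou"}, {"id": 2, "city": "Beijing"}]
in_memory = [{"id": 3, "city": "Hangzhou"}]
result = query(on_disk, in_memory, lambda row: row["city"] == "Hangzhou")
```

The row with `id` 3 is returned even though it has not yet been written to the data file, which is the visibility behavior the embodiment aims for.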

In practical applications, to prevent data loss, a copy file is usually generated for the original data file as a backup. If the received new data is loaded into the high-speed storage space, it is temporarily not written into the original data file, so data loss may occur. In this embodiment, the method therefore further includes:

generating a copy file of the loaded new data, and storing that copy file in the same directory as the copy file of the original data file.

Storing the copy file of the loaded new data in the same directory as the copy file of the original data file keeps the two copies associated with each other, which facilitates data recovery.
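A minimal sketch of keeping a copy of the buffered data next to the copy of the original data file follows. The `.pending` suffix, the JSON encoding, and the function names `write_buffer_replica` and `recover_buffer` are assumptions for illustration; the application only requires that the two copy files share a directory.

```python
import json
import os
import tempfile


def write_buffer_replica(replica_dir, data_file_name, buffered_rows):
    """Persist a copy of the buffered rows beside the replica of the data file."""
    path = os.path.join(replica_dir, data_file_name + ".pending")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(buffered_rows, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # swap in the new copy in one step
    return path


def recover_buffer(replica_dir, data_file_name):
    """Reload buffered rows after a failure; return an empty list if none exist."""
    path = os.path.join(replica_dir, data_file_name + ".pending")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


replica_dir = tempfile.mkdtemp()
pending = [{"id": 3, "city": "Shanghai"}]
replica_path = write_buffer_replica(replica_dir, "table.data", pending)
recovered = recover_buffer(replica_dir, "table.data")
```

Deriving the copy's name from the data file's name makes it straightforward to pair the two files during recovery.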

The solution provided in the present application is described in detail again by a specific example.

Fig. 3A is a diagram illustrating an application scenario of a processing method for columnar storage data according to an exemplary embodiment of the present application. Fig. 3A includes a database management system, which may be a database management system supporting a columnar storage format such as Parquet or ORCFile. The processing scheme for columnar storage data provided by the embodiments of the application can run in the database management system as an independent module or process to process the columnar storage data.

As shown in Fig. 3A, an original data file is maintained in the database management system and is stored in a columnar storage format at a location on the disk. The original data file is stored as in Fig. 1B: the whole data set is divided by rows into a plurality of blocks (RowGroup), the data in each block is stored column by column, and the Footer information of the original data file is recorded in a Footer file independent of the original data file.

New data for the original data file is continuously input into the database management system over a period of time. In this embodiment, the new data may be continuously received and stored in the memory. Fig. 3B is a schematic diagram of loading new data in the memory according to an exemplary embodiment of the present application. As shown in Fig. 3B, following the loading mechanism of the memory, the new data is written into the memory in a streaming append (append record) manner and is held in the memory as data blocks (DataBlock). When the total data volume of the DataBlocks corresponding to the new data reaches a certain size, or the loading duration of the DataBlocks reaches a certain duration, an underlying interface of the database management system may be called to append the DataBlocks corresponding to the new data to the tail of the original data file, forming a new RowGroup. When the new data corresponds to many DataBlocks, or when other conditions are met, multiple DataBlocks may be merged (compacted) before being appended to the tail of the original data file, which reduces the number of DataBlocks. Whether to use DataBlocks can be configured flexibly in practical applications, which is not limited in this embodiment. In addition, the Footer information for the new data is added to the Footer file, yielding the updated original data file and the updated Footer file. Because the Footer information is recorded, the received new data can be appended directly to the original file without sorting.

After the data is written into the memory, data loss must be considered, so the data can be kept in multiple copies. For the new data loaded in the memory, a corresponding copy can be generated and kept at the same storage location as the copy of the original data file; that is, the copy file of the new data is stored in the same directory as the copy file of the original data file.

In the data reading process, in addition to the normal reading of the disk file, this embodiment also takes into account the new data loaded in the memory, which is the key to improving the timeliness of data visibility. When a data query request is received, a read operation (Reader) may be performed as shown in Fig. 3B, and the request may be split into two sub-requests: one reads the data satisfying the request from the original data file on the disk, and the other reads the data satisfying the request from the new data loaded in the memory. The two results are merged and returned as the result of the request. Fig. 3B takes a query engine (Query Engine) as an example, in which a table scan (TableScan) is performed to obtain the query result.
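The merge (compact) step can be sketched as follows, assuming each DataBlock holds rows in a row-based layout in memory and the resulting RowGroup is column-based. The function name `compact_to_row_group` and the sample blocks are hypothetical.

```python
from typing import Any


def compact_to_row_group(
    blocks: list[list[dict[str, Any]]], columns: list[str]
) -> dict[str, list[Any]]:
    """Merge several row-based DataBlocks into one column-based RowGroup."""
    row_group: dict[str, list[Any]] = {name: [] for name in columns}
    for block in blocks:            # blocks are kept in arrival order
        for row in block:
            for name in columns:
                row_group[name].append(row[name])
    return row_group


data_blocks = [
    [{"id": 3, "city": "Shanghai"}],
    [{"id": 4, "city": "Shenzhen"}, {"id": 5, "city": "Chengdu"}],
]
row_group = compact_to_row_group(data_blocks, ["id", "city"])
```

Appending one merged RowGroup instead of one RowGroup per DataBlock keeps the number of Footer entries, and therefore the number of seeks per column read, smaller.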

Corresponding to the embodiments of the processing method of the column type stored data, the application also provides embodiments of a processing device of the column type stored data and a computer device applied by the processing device.

The embodiments of the processing apparatus for columnar storage data can be applied to a computer device. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus, as a logical device, is formed by the processor of the computer device in which it is located reading the corresponding computer program instructions from the nonvolatile memory into the memory and running them. From a hardware perspective, Fig. 4 shows the hardware structure of the computer device in which the processing apparatus for columnar storage data of the present application is located. In addition to the processor 410, the memory 430, the network interface 420, and the nonvolatile memory 440 shown in Fig. 4, the computer device in which the apparatus 431 is located may also include other hardware according to the actual function of the computer device, which is not described again here.

As shown in fig. 5, fig. 5 is a block diagram of a processing apparatus for processing columnar stored data according to an exemplary embodiment of the present application, including:

a data receiving module 51, configured to: receive new data for an original data file, wherein the original data file is stored in a columnar storage format, and the Footer information of the original data file is recorded in a Footer file independent of the original data file.

a data writing module 52, configured to: write the new data to the tail of the original data file in the columnar storage format, and add Footer information for the new data to the Footer file, to obtain an updated original data file and an updated Footer file.

In an optional implementation manner, the data receiving module 51 is further configured to:

the data writing module 52 is specifically configured to:

According to a third aspect of embodiments of the present application, there is provided a processing apparatus for columnar storage data, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to:

receive new data for an original data file, wherein the original data file is stored in a columnar storage format, and the Footer information of the original data file is recorded in a Footer file independent of the original data file.

For the implementation of the functions and roles of each module in the processing apparatus for columnar storage data, refer to the implementation of the corresponding steps in the processing method for columnar storage data; details are not repeated here.

Accordingly, the present application also provides a computer storage medium having stored therein program instructions, the program instructions comprising:

Embodiments of the present application may take the form of a computer program product embodied on one or more storage media containing program code, including but not limited to disk storage, CD-ROM, and optical storage. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

Because the apparatus embodiments substantially correspond to the method embodiments, refer to the relevant parts of the description of the method embodiments for details. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the application. One of ordinary skill in the art can understand and implement it without inventive effort.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims

1. A method of processing columnar stored data, the method comprising:

2. The method of claim 1, after the receiving new data for an original data file, the method comprising:

loading the received new data into a high-speed storage space;

3. The method of claim 2, the preset storage condition comprising one or more of:

4. The method of claim 2, further comprising:

5. The method of claim 2, further comprising:

6. A processing apparatus for columnar storage of data, the apparatus comprising:

7. The apparatus of claim 6, the data receiving module further configured to:

the data writing module is specifically configured to:

8. The apparatus of claim 7, the preset storage condition comprising one or more of:

9. The apparatus of claim 7, further comprising a reading module to:

10. The apparatus of claim 7, the apparatus further comprising a duplicate processing module to:

11. A computer device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

12. A computer storage medium having stored therein program instructions, the program instructions comprising: