WO2024082693A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2024082693A1
WO2024082693A1 PCT/CN2023/103426 CN2023103426W WO2024082693A1 WO 2024082693 A1 WO2024082693 A1 WO 2024082693A1 CN 2023103426 W CN2023103426 W CN 2023103426W WO 2024082693 A1 WO2024082693 A1 WO 2024082693A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information column
indication information
indication
row
Prior art date
Application number
PCT/CN2023/103426
Other languages
English (en)
French (fr)
Inventor
康祥
宋立勇
蔺若林
巴肯斯蒂格
洪博斯塔德符文
Original Assignee
华为云计算技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2024082693A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/466 Transaction processing

Definitions

  • the present application relates to the field of database technology, and in particular to a data processing method and device.
  • transactional processing (TP)
  • analytical processing (AP)
  • TP systems are used to manage and process transactions.
  • TP systems are used for sales order entry or bank transaction processing.
  • AP systems are used to analyze data and generate reports for business analysts.
  • reports generated by AP systems include summary sales statistics by geographic region, product category, or customer classification.
  • the hybrid transactional and analytical processing (HTAP) mode is a common hybrid load mode.
  • the HTAP mode can support a large number of concurrent updates, and the data synchronization delay is usually in the second or millisecond level.
  • the HTAP system includes TP loads and AP loads.
  • the TP load uses a row storage format
  • the AP load uses a column storage format.
  • the data on the TP load is synchronized to the AP load in real time. Since the AP load uses a column storage format, the AP load usually uses an append method to append data when updating data, which will cause a large amount of duplicate data in the AP load, and the data needs to be deduplicated when processing read requests.
  • the AP load periodically creates a full data snapshot of the latest version of the TP load, and processes read requests based on the full data snapshot.
  • the full data snapshot essentially contains the data of the AP load after data deduplication under the corresponding version.
  • the data used by the AP load when processing read requests is less real-time due to the length of the snapshot creation interval.
  • the storage resources occupied by the AP load to store the full data snapshot are relatively large.
  • the present application provides a data processing method and device, which can solve the problems that the data used by the current AP engine to process AP requests has poor real-time performance and that storing full data snapshots occupies more storage resources.
  • a data processing method includes: an analysis processing engine obtains incremental data from a transaction processing engine.
  • the analysis processing engine updates a storage table according to the incremental data.
  • the storage table adopts a column storage format.
  • the storage table includes a data information column and a validity indication information column.
  • the incremental data is stored in the data information column in the form of data rows.
  • the validity indication information column is used to indicate whether the data row in the data information column is valid or invalid.
  • the analysis and processing engine indicates whether the data rows in the data information column are valid or invalid by setting a validity indication information column in the storage table.
  • the analysis and processing engine can filter invalid data rows in the data information column based on the validity indication information column and read valid data rows in the data information column, thereby achieving fast data deduplication and improving data query efficiency. Since the analysis and processing engine does not need to create and store a full data snapshot of the transaction processing engine, the amount of data storage is reduced, thereby saving storage resources.
  • after the analysis and processing engine synchronizes the incremental data from the transaction processing engine in real time, it can quickly update the latest status of the data through the validity indication information column, so that the analysis and processing engine can use the most recently updated data as far as possible when processing data query requests, thereby improving data timeliness.
  • in response to receiving a data query request, the analysis processing engine outputs the valid data rows indicated by the validity indication information column in the storage table.
  • the analysis and processing engine can filter invalid data rows in the data information column based on the validity indication information column and read valid data rows in the data information column, thereby achieving rapid data deduplication and improving data query efficiency.
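  • The storage table described above can be illustrated with a minimal sketch. The class name `StorageTable`, the method names, and the encodings `VALID = 1` and `INVALID = 0` are assumptions made for illustration; the application does not prescribe them.

```python
from dataclasses import dataclass, field

VALID = 1    # hypothetical encoding of the "first indication"
INVALID = 0  # hypothetical encoding of the "second indication"


@dataclass
class StorageTable:
    """Column-format table: one list per column, aligned by row position."""
    data: list = field(default_factory=list)      # data information column
    validity: list = field(default_factory=list)  # validity indication information column

    def append(self, row, indication=VALID):
        """Append a data row and its indication row; return the row position."""
        self.data.append(row)
        self.validity.append(indication)
        return len(self.data) - 1

    def query(self):
        """Return only the data rows whose indication marks them valid."""
        return [row for row, flag in zip(self.data, self.validity) if flag == VALID]


table = StorageTable()
table.append({"id": 1, "name": "old"})
table.append({"id": 1, "name": "new"})
table.validity[0] = INVALID  # the first row has been superseded
result = table.query()
```

  • In this sketch both versions of the row with ID 1 stay in the data information column, and the query skips the superseded one by consulting the validity column only, without a separate deduplication pass or a full snapshot.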
  • the incremental data includes newly added data
  • the implementation method of the analysis and processing engine updating the storage table according to the incremental data includes: the analysis and processing engine adds the newly added data to the storage table by adding data rows, and adds a first indication in the indication information row corresponding to the data row where the newly added data is located in the validity indication information column, and the first indication is used to indicate that the data row is valid.
  • the analysis and processing engine only needs to add a data row in the data information column to hold the newly added data, and add a corresponding indication information row in the validity indication information column to hold the first indication. This enables rapid updating of the newly added data and its validity, so the data is highly real-time.
  • the incremental data includes modified data
  • the implementation method of the analysis and processing engine updating the storage table according to the incremental data includes: the analysis and processing engine adds the modified data to the storage table by adding a data row, adds a first indication to the indication information row in the validity indication information column corresponding to the data row where the modified data is located, and changes the first indication in the indication information row corresponding to the data row where the data before modification is located to a second indication, wherein the first indication is used to indicate that the data row is valid, and the second indication is used to indicate that the data row is invalid.
  • the analysis and processing engine only needs to add a data row in the data information column to hold the modified data, add a corresponding indication information row in the validity indication information column to hold the first indication, and change the indication in the indication information row corresponding to the data row where the data before modification is located. This enables rapid updating of the modified data and its validity, so the data is highly real-time.
  • the incremental data includes deleted data
  • the analysis and processing engine updates the storage table according to the incremental data, including: the analysis and processing engine modifies the first indication in the indication information row corresponding to the data row where the deleted data is located in the validity indication information column to a second indication, where the second indication is used to indicate that the data row is invalid.
  • the analysis and processing engine only needs to modify the indication in the indication information row in the validity indication information column corresponding to the data row where the deleted data is located, so as to achieve a rapid update of the validity of the deleted data and achieve high data real-time performance.
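  • The three update cases above (newly added, modified, and deleted data) can be sketched as follows. This is a hedged illustration, not the applicant's implementation: the change tuple format, the `live` dictionary used to locate the currently valid row for a key, and the 1/0 encodings are all assumptions.

```python
VALID, INVALID = 1, 0  # hypothetical encodings of the first and second indication


class StorageTable:
    def __init__(self):
        self.data = []       # data information column
        self.validity = []   # validity indication information column
        self.live = {}       # assumed helper: key -> position of the valid row

    def _append(self, key, row):
        self.data.append(row)
        self.validity.append(VALID)
        self.live[key] = len(self.data) - 1

    def _invalidate(self, key):
        pos = self.live.pop(key)
        self.validity[pos] = INVALID

    def apply(self, change):
        """Apply one incremental change: ("insert" | "update" | "delete", key, row)."""
        op, key, row = change
        if op == "insert":
            self._append(key, row)
        elif op == "update":
            self._invalidate(key)    # the old version becomes invalid
            self._append(key, row)   # the new version is appended as valid
        elif op == "delete":
            self._invalidate(key)    # only the indication changes
        else:
            raise ValueError(f"unknown operation: {op}")

    def valid_rows(self):
        return [r for r, f in zip(self.data, self.validity) if f == VALID]


table = StorageTable()
for change in [
    ("insert", 1, {"id": 1, "qty": 5}),
    ("insert", 2, {"id": 2, "qty": 7}),
    ("update", 1, {"id": 1, "qty": 9}),
    ("delete", 2, None),
]:
    table.apply(change)
```

  • Note that no data row is ever rewritten or removed in this sketch; a modification costs one append plus one flag change, and a deletion costs one flag change.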
  • the storage table also includes a readability identification information column, which is used to identify whether a data row in the data information column can be read or cannot be read, and the incremental data includes deleted data.
  • the implementation method of the analysis and processing engine updating the storage table according to the incremental data includes: the analysis and processing engine adds the deleted data to the storage table by adding data rows, and sets a first identifier in the identification information row corresponding to the data row where the newly added deleted data is located in the readability identification information column, and modifies the first indication in the indication information row corresponding to the data row where the original deleted data is located in the validity indication information column to a second indication, the first indication is used to indicate that the data row is valid, the second indication is used to indicate that the data row is invalid, and the first identifier is used to indicate that the data row cannot be read.
  • the analysis and processing engine adds a data row in the data information column to record the deleted data, and sets an identifier in the identification information row corresponding to that newly added data row in the readability identification information column to indicate that the data row cannot be read. This enables rapid updating of the validity of the deleted data, so the data is highly real-time.
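  • The deletion variant that uses a readability identification information column might look like the sketch below. The encodings and function names are hypothetical. In particular, the text does not state which validity indication the appended deletion record receives; the sketch assumes it is marked valid and hidden from queries by the readability column.

```python
VALID, INVALID = 1, 0          # hypothetical first / second indication
UNREADABLE, READABLE = 1, 0    # hypothetical first identifier / its complement

data, validity, readability = [], [], []


def append_row(row, valid, readable):
    data.append(row)
    validity.append(valid)
    readability.append(readable)
    return len(data) - 1


def delete_row(original_pos):
    """Record a deletion as a new, unreadable row and invalidate the original."""
    # Assumption: the source does not state the validity of the appended
    # deletion record; it is marked VALID here and hidden by readability.
    append_row(data[original_pos], VALID, UNREADABLE)
    validity[original_pos] = INVALID


def query():
    """Return rows that are both valid and readable."""
    return [
        row
        for row, v, r in zip(data, validity, readability)
        if v == VALID and r == READABLE
    ]


p0 = append_row({"id": 1, "name": "a"}, VALID, READABLE)
p1 = append_row({"id": 2, "name": "b"}, VALID, READABLE)
delete_row(p0)
visible = query()
```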
  • the validity indication information column includes a first indication information column and a second indication information column.
  • the first indication information column and the second indication information column are updated in turn (by polling) with the indications of the validity of the data rows in the data information column, and they meet the following conditions: at any given time, at least one of the first indication information column and the second indication information column supports the data query function; when both support the data query function, the more recently updated of the two is used by the analysis and processing engine to perform data query.
  • two indication information columns are set in the storage table and updated in turn with the indications of the validity of the data rows in the data information column, so that the analysis and processing engine always has an indication information column available to assist in processing data query requests, thereby improving data query efficiency.
  • since the amount of data in an indication information column is small, it can basically achieve rapid synchronization with the data information column.
  • the analysis and processing engine can use the latest data as much as possible when processing data query requests, thereby improving data timeliness.
  • since multiple indication information columns share the same data information column, there is no redundant data storage, and the data storage cost is low.
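  • One way to picture the two indication information columns is a pair of flag lists refreshed alternately from a shared change log, as sketched below. The change log, the `applied` counter used as a stand-in for "most recently updated", and all class and method names are assumptions for illustration; the application does not describe the update mechanism at this level of detail.

```python
class IndicationColumn:
    def __init__(self):
        self.flags = []        # one indication per data row (1 valid, 0 invalid)
        self.applied = 0       # number of log entries already applied
        self.queryable = True  # False while the column is being rewritten


class DualValidity:
    """Two indication columns updated in turn; one is always queryable."""

    def __init__(self):
        self.columns = [IndicationColumn(), IndicationColumn()]
        self.log = []          # ordered (position, flag) changes
        self.turn = 0          # index of the column to update next

    def record(self, position, flag):
        self.log.append((position, flag))

    def refresh(self):
        """Bring the column whose turn it is up to date, then switch turns."""
        col = self.columns[self.turn]
        col.queryable = False              # the other column keeps serving reads
        for position, flag in self.log[col.applied:]:
            if position == len(col.flags):
                col.flags.append(flag)     # indication for a newly added row
            else:
                col.flags[position] = flag # changed indication for an old row
        col.applied = len(self.log)
        col.queryable = True
        self.turn = 1 - self.turn

    def column_for_query(self):
        """Pick the most recently updated column among the queryable ones."""
        ready = [c for c in self.columns if c.queryable]
        return max(ready, key=lambda c: c.applied)


dv = DualValidity()
dv.record(0, 1)      # row 0 added, valid
dv.refresh()         # column 0 catches up
dv.record(1, 1)      # row 1 added, valid
dv.record(0, 0)      # row 0 invalidated
dv.refresh()         # column 1 catches up
current = dv.column_for_query().flags
```

  • Both flag lists refer to the same data information column, so only the small indication columns are duplicated in this sketch.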
  • the storage table further includes a version information column, and the version information column is used to indicate the time sequence in which the data rows in the data information column are added to the data information column.
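  • A version information column can be sketched as a sequence number stored alongside each data row. The use of a simple counter and the helper `rows_added_after` are assumptions; the text only states that the column records the order in which data rows were added.

```python
import itertools

_next_version = itertools.count(1)  # hypothetical monotonically increasing counter

data, version = [], []   # data information column, version information column


def append_row(row):
    """Append a data row and record the order in which it was added."""
    data.append(row)
    version.append(next(_next_version))


def rows_added_after(v):
    """Return the data rows whose version is greater than v."""
    return [row for row, ver in zip(data, version) if ver > v]


append_row({"id": 1, "name": "old"})
append_row({"id": 2, "name": "other"})
append_row({"id": 1, "name": "new"})
recent = rows_added_after(1)
```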
  • a data processing device which is applied to an analysis processing engine.
  • the device includes: an acquisition module, which is used to acquire incremental data from a transaction processing engine.
  • An update module which is used to update a storage table according to the incremental data, wherein the storage table adopts a column storage format, and the storage table includes a data information column and a validity indication information column.
  • the incremental data is stored in the data information column in the form of a data row, and the validity indication information column is used to indicate whether the data row in the data information column is valid or invalid.
  • the device further comprises: a query module, configured to output valid data rows indicated by a validity indication information column in the storage table in response to receiving a data query request.
  • the incremental data includes new data
  • the update module is specifically used to: add the new data to the storage table by adding data rows, and add a first indication in the indication information row corresponding to the data row where the new data is located in the validity indication information column, and the first indication is used to indicate that the data row is valid.
  • the incremental data includes modified data
  • the update module is specifically used to: add the modified data to the storage table by adding a data row, add a first indication to the indication information row in the validity indication information column corresponding to the data row where the modified data is located, and change the first indication in the indication information row corresponding to the data row where the data before modification is located to a second indication, wherein the first indication is used to indicate that the data row is valid, and the second indication is used to indicate that the data row is invalid.
  • the incremental data includes deleted data
  • the update module is specifically used to: modify the first indication in the indication information row corresponding to the data row where the deleted data is located in the validity indication information column to a second indication, and the second indication is used to indicate that the data row is invalid.
  • the storage table also includes a readability identification information column, and the readability identification information column is used to identify that the data row in the data information column can be read or cannot be read.
  • the incremental data includes deleted data
  • the update module is specifically used to: add the deleted data to the storage table by adding data rows, and set a first identifier in the identification information row corresponding to the data row where the newly added deleted data is located in the readability identification information column, and modify the first indication in the indication information row corresponding to the data row where the original deleted data is located in the validity indication information column to a second indication, the first indication is used to indicate that the data row is valid, the second indication is used to indicate that the data row is invalid, and the first identifier is used to indicate that the data row cannot be read.
  • the validity indication information column includes a first indication information column and a second indication information column
  • the first indication information column and the second indication information column are updated in turn (by polling) with the indications of the validity of the data rows in the data information column
  • the first indication information column and the second indication information column meet the following conditions: at any given time, at least one of the first indication information column and the second indication information column supports the data query function; when both support the data query function, the more recently updated of the two is used by the analysis and processing engine to perform data query.
  • the storage table further includes a version information column, and the version information column is used to indicate the time sequence in which the data rows in the data information column are added to the data information column.
  • a data processing device which may be an analysis and processing engine, comprising a memory and a processor, wherein the memory stores program instructions, and the processor runs the program instructions to execute the method in the first aspect and its various embodiments.
  • a computer-readable storage medium comprising program instructions, which, when executed on a computer device, causes the computer device to execute the method in the above-mentioned first aspect and its various embodiments.
  • a computer program product is provided.
  • when the computer program product runs on a computer, the computer executes the method in the above-mentioned first aspect and its various embodiments.
  • FIG1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG2 is a flow chart of a data processing method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the structure of a data processing device provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of another data processing device provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the architecture of a data processing device provided in an embodiment of the present application.
  • Snapshot refers to a completely available copy of a data set.
  • a snapshot includes an image of the corresponding data set at a certain point in time (the start time of the copy). Based on the snapshot of the data set at a certain time, the data corresponding to the data set at that time can be queried.
  • a snapshot can be regarded as a copy of data or a replica of data.
  • a transaction in a database is a series of operations performed as a single logical unit of work, which is either executed completely or not executed at all. Transaction processing ensures that data-oriented resources are not permanently updated unless all operations within the transactional unit are completed successfully. By combining a set of related operations into a unit that either all succeeds or all fails, error recovery can be simplified and applications can be made more reliable.
  • atomicity, consistency, isolation, and durability (ACID for short).
  • transaction processing (TP) engine: a database engine used to process transactions. User data is submitted and persisted in the form of transactions through the TP engine.
  • analytical processing (AP) engine: a database engine used to analyze transactions.
  • the AP engine synchronizes data on the TP engine in real time and provides data analysis capabilities.
  • HTAP is an emerging application architecture that breaks the "wall" between transaction processing and analytical processing and can achieve mixed-load data processing, that is, the HTAP mode is a common mixed-load mode.
  • the HTAP mode can support a large number of concurrent updates, and the data synchronization delay is usually in the second or millisecond level.
  • since the AP load adopts a columnar storage format, the AP load usually appends data when updating data, which causes a large amount of duplicate data in the AP load. When processing read requests, the data needs to be deduplicated.
  • a data row is added to the storage table using the columnar storage format to record the data.
  • after the AP load synchronizes the modified data locally, it adds another data row to the storage table to record the modified data.
  • there are two pieces of data with ID 1 in the storage table.
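  • The duplicate-data problem described here can be illustrated with a small sketch of an append-only table and a read-time deduplication pass. The row contents and the function `read_deduplicated` are hypothetical; the point is only that every read has to scan all rows to find the latest one for each ID.

```python
# Append-only column store: an update to ID 1 leaves two rows with the same ID.
rows = []
rows.append({"id": 1, "value": "original"})   # synchronized when first written
rows.append({"id": 1, "value": "modified"})   # appended after the modification


def read_deduplicated(rows, key="id"):
    """Scan every row and keep only the last one seen for each key."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


result = read_deduplicated(rows)
```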
  • in order to improve data reading efficiency, the AP load currently periodically creates a full data snapshot of the latest version of the TP load, and processes read requests based on the full data snapshot.
  • the full data snapshot can also be called a global consistency snapshot.
  • the full data snapshot essentially contains the data of the AP load after data deduplication under the corresponding version.
  • the AP load creates several hidden tables under the tablespace. These hidden tables are shadow tables, obtained by deduplicating the data of a certain version of the original table.
  • the original table is the table used by the AP load to store data from the TP load
  • a shadow table is a full data snapshot of a version of the TP load created by the AP load.
  • the snapshot creation interval will also be long. Limited by the snapshot creation interval, the real-time performance of the data used by the AP load to process read requests is poor. For example, the data updated by the AP load within the creation interval of two adjacent snapshots can only be used to process read requests after the next snapshot is created. On the other hand, the storage resources occupied by the AP load to store full data snapshots are relatively large.
  • an embodiment of the present application provides a data processing method.
  • after the analysis and processing engine obtains the incremental data from the transaction processing engine, it updates the storage table according to the incremental data.
  • the storage table adopts a column storage format, and the storage table includes a data information column and a validity indication information column.
  • the incremental data is stored in the data information column in the form of data rows.
  • the validity indication information column is used to indicate whether the data row in the data information column is valid or invalid.
  • when the validity indication information column indicates that a certain data row in the data information column is invalid, the data row has expired (for example, the data row has been deleted or the information in the data row has been modified), and the data row will not be read and output by the analysis and processing engine.
  • when the validity indication information column indicates that a certain data row in the data information column is valid, the data row has not expired (is in effect), and the data row can be read and output by the analysis and processing engine.
  • the validity indication information column can be used to indicate that each data row in the data information column is valid or invalid.
  • the indication information row in the validity indication information column can correspond one-to-one to the data row in the data information column.
  • the analysis and processing engine indicates whether the data rows in the data information column are valid or invalid by setting a validity indication information column in the storage table.
  • the analysis and processing engine can filter the invalid data rows in the data information column based on the validity indication information column and read the valid data rows in the data information column, thereby achieving rapid data deduplication and improving data query efficiency. Since the analysis and processing engine does not need to create and store a full data snapshot of the transaction processing engine, the amount of data storage is reduced, thereby saving storage resources. In addition, after the analysis and processing engine synchronizes the incremental data from the transaction processing engine in real time, it can quickly update the latest status of the data through the validity indication information column, so that the analysis and processing engine can use the most recently updated data as far as possible when processing data query requests, thereby improving data timeliness.
  • the data processing method provided in the embodiment of the present application can be applied to HTAP solutions, data warehouses, and the like. It can also be applied to big data services and various data analysis systems, including but not limited to user data analysis and settlement services.
  • FIG1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • the application scenario includes a database system 101 and a terminal 102.
  • the database system 101 and the terminal 102 can communicate.
  • a user can access the database system 101 through the terminal 102, including writing data to the database system 101 or reading data from the database system 101.
  • the database system 101 includes a database engine and a database.
  • the database engine is the interface through which users access and operate on data in the database.
  • the database engine can be a storage engine implemented based on a log structured merge tree (LSM-Tree).
  • the database engine includes a TP engine and an AP engine.
  • the TP engine provides a write interface to the user for responding to data write requests.
  • User data is submitted in the form of a transaction through the TP engine.
  • the AP engine synchronizes the data on the TP engine in real time and provides a read-only interface to the user for responding to data query requests.
  • the database includes an online transactional processing (OLTP) database and an online analytical processing (OLAP) database.
  • the OLTP database uses row storage (i.e., row storage format) to support transactional loads (TP loads), and the OLAP database uses column storage (i.e., column storage format) to support analytical loads (AP loads).
  • the data in the transaction processing engine described in the following embodiments of the present application can be stored in the OLTP database, and the storage table in the analytical processing engine can be stored in the OLAP database.
  • a user can send a data write request to the database system 101 through the terminal 102, and the TP engine responds to the data write request and stores the data persistently in the OLTP database in the form of a transaction.
  • the AP engine synchronizes the data in the OLTP database with the OLAP database through data synchronization.
  • a user can send a data query request to the database system 101 through the terminal 102, and the AP engine responds to the data query request, obtains the corresponding data from the OLAP database and outputs it to the terminal 102.
  • Figure 2 is a flow chart of a data processing method provided in an embodiment of the present application. As shown in Figure 2, the method includes:
  • Step 201: The analysis processing engine obtains incremental data from the transaction processing engine.
  • the incremental data includes one or more of modified data, deleted data, or newly added data.
  • Modified data on the transaction processing engine includes data rewritten at a storage address where data has already been written.
  • the modified data may be data written at a storage address by overwriting.
  • Deleted data on the transaction processing engine includes data deleted at a storage address where data has already been written.
  • Newly added data on the transaction processing engine includes data newly written at a storage address where data has not been written.
  • the analysis and processing engine synchronizes the incremental data in the transaction processing engine in real time.
  • the analysis and processing engine can replay the incrementally updated data in the transaction processing engine based on the logical log and perform incremental data synchronization.
  • the analysis and processing engine first determines whether there is existing data in the transaction processing engine, and the existing data can also be called historical data. If there is existing data in the transaction processing engine, the analysis and processing engine obtains a consistent view of the transaction processing engine and performs full data synchronization. If there is no existing data in the transaction processing engine, the analysis and processing engine performs incremental data synchronization.
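  • The choice between full and incremental synchronization in step 201 can be sketched as follows. The function signature, the change tuple format, and the callback are hypothetical. The sketch also assumes that, after a full synchronization of the consistent view, the logical log is replayed for later changes; the text describes log replay for incremental synchronization but does not spell out this ordering.

```python
def synchronize(tp_rows, logical_log, apply_change):
    """Choose full or incremental synchronization, then replay the log.

    tp_rows: existing (historical) rows on the TP side, possibly empty.
    logical_log: ordered list of changes made after the consistent view.
    apply_change: callback that applies one change on the AP side.
    """
    mode = "full" if tp_rows else "incremental"
    if mode == "full":
        # Copy the consistent view of the existing data first.
        for row in list(tp_rows):
            apply_change(("insert", row))
    # Replay incremental changes in log order (assumed to follow a full sync too).
    for change in logical_log:
        apply_change(change)
    return mode


applied = []
mode = synchronize(
    tp_rows=[{"id": 1}],
    logical_log=[("insert", {"id": 2}), ("delete", {"id": 1})],
    apply_change=applied.append,
)
```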
  • Step 202: The analysis processing engine updates a storage table according to the incremental data.
  • the storage table adopts a column storage format and includes a data information column and a validity indication information column.
  • the data information column is used to store the data synchronized from the transaction processing engine by the analysis processing engine in the form of data rows.
  • the incremental data from the transaction processing engine obtained by the analysis processing engine is stored in the data information column in the form of data rows.
  • the validity indication information column is used to indicate whether the data rows in the data information column are valid or invalid.
  • if the validity indication information column indicates that a data row in the data information column is invalid, the data row has expired, for example because the data row has been deleted or the information in it has been modified; in this case the data row is not read or output by the analysis processing engine.
  • if the validity indication information column indicates that a data row in the data information column is valid, the data row has not expired (is in effect) and can be read and output by the analysis processing engine.
  • the rows under the data information column in the storage table are called data rows, and the rows under the validity indication information column are called indication information rows.
  • the data rows correspond one-to-one with the indication information rows, and each indication information row holds an indication of whether the corresponding data row is valid.
  • in the text description, the embodiments of the present application use a first indication to indicate that a data row is valid and a second indication to indicate that a data row is invalid.
  • the first indication and the second indication can be represented by different numbers, letters or characters.
  • Table 1 shows the structure of a storage table in an analysis and processing engine.
  • the storage table in the analysis processing engine includes at least a data information column and a validity indication information column. The example here integrates the validity indication information column into the same table as the data information column. In this implementation, the analysis processing engine does not need to create a separate table to indicate the validity of the data information, which reduces storage costs while enabling rapid data deduplication. In practical applications, the data information column and the validity indication information column can also be two independent tables connected by a join. In this case, the storage table in the embodiments of the present application can be understood as a collection of multiple tables: the data information column can be regarded as a data table, and the validity indication information column can be regarded as a column-based snapshot indicating the validity of the data information in the data table. Because the amount of data in the validity indication information column is small, it takes less time to create and occupies fewer storage resources than a full data snapshot, which likewise improves data timeliness and saves storage resources.
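  • The single-table layout described above can be sketched as follows. This is a minimal illustration, not the engine's actual structure: the class name `StorageTable`, the use of one Python list per column, and the sample values are assumptions, and the encodings 1 and 0 for the first and second indications follow the worked example given later in the description.

```python
from dataclasses import dataclass, field

VALID, INVALID = 1, 0  # assumed encodings of the first and second indications


@dataclass
class StorageTable:
    """Toy column-oriented table: one Python list per column."""
    ids: list = field(default_factory=list)     # data information column "ID"
    names: list = field(default_factory=list)   # data information column "Name"
    valid: list = field(default_factory=list)   # validity indication information column

    def append(self, row_id, name, indication=VALID):
        """Append one data row together with its indication information row."""
        self.ids.append(row_id)
        self.names.append(name)
        self.valid.append(indication)

    def valid_rows(self):
        """Return only the data rows whose indication is VALID."""
        return [
            (row_id, name)
            for row_id, name, flag in zip(self.ids, self.names, self.valid)
            if flag == VALID
        ]


table = StorageTable()
table.append(1, "alice", INVALID)  # superseded version of ID 1
table.append(1, "alicia")          # current version of ID 1
table.append(2, "bob")
print(table.valid_rows())
```

  • Because the indication lives in its own column, filtering touches only that column plus the columns actually read, which is the property the text relies on for fast deduplication.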
  • the storage table in the analysis and processing engine also includes a readability identification information column and/or a version information column.
  • the readability identification information column is used to identify whether the data row in the data information column can be read or cannot be read.
  • the version information column is used to indicate the time sequence in which the data row in the data information column is added to the data information column.
  • the rows under the readability identification information column in the storage table are referred to as identification information rows, and the rows under the version information column are referred to as version information rows.
  • the identification information rows correspond one-to-one with the data rows, and each identification information row holds an identification of whether the corresponding data row can be read.
  • in the text description, the embodiments of the present application use a first identification to identify that a data row cannot be read and a second identification to identify that a data row can be read.
  • the first identification and the second identification can be represented by different numbers, letters or characters.
  • the version information rows in the storage table correspond one-to-one with the data rows, and each version information row holds the version number of the corresponding data row.
  • the version number can be represented by a number, for example, the larger the value of the version number, the newer the version.
  • Table 2 shows the structure of another storage table in the analysis and processing engine.
  • the storage table in the analysis and processing engine includes a data information column, a readability identification information column, a version information column, and a validity indication information column.
  • the storage table may also include more data attributes, and the structure of the storage table may be designed according to actual needs, which is not limited in the present application embodiment.
  • the validity indication information column includes a first indication information column and a second indication information column.
  • the first indication information column and the second indication information column are updated in rotation (polling) to indicate the validity of the data rows in the data information column.
  • the first indication information column and the second indication information column meet the following conditions: at any given moment, at least one of them supports the data query function; when both support the data query function, the more recently updated of the two is used by the analysis processing engine for data queries.
  • Table 3 shows the structure of another storage table in the analysis and processing engine.
  • the storage table in the analysis and processing engine includes a data information column, a readability identification information column, a version information column, a first indication information column, and a second indication information column.
  • Table 3 takes as an example a validity indication information column that includes a first indication information column and a second indication information column.
  • the validity indication information column may also include three, four, or more indication information columns, which are updated in rotation to indicate the validity of the data rows in the data information column.
  • multiple indication information columns may also be used to indicate the validity of the data rows in the data information column under multiple versions, to meet the needs of users in multi-version concurrent access scenarios.
  • the polling cycle can be in the order of seconds or minutes, for example, the polling cycle is 5 minutes.
  • for the incremental data synchronized from the transaction processing engine in the i-th polling cycle, the analysis processing engine uses the first indication information column to update the indication of the validity of the data rows in the data information column. While the first indication information column is in the updating state during the i-th polling cycle, the data query function is provided by the second indication information column.
  • for the incremental data synchronized from the transaction processing engine in the (i+1)-th polling cycle, the analysis processing engine uses the second indication information column to update the indication of the validity of the data rows in the data information column. While the second indication information column is in the updating state during the (i+1)-th polling cycle, the data query function is provided by the first indication information column.
  • the first indication information column and the second indication information column are both in the update completion state in the (i+2)-th polling cycle. Since the version of the second indication information column is newer than that of the first indication information column, the second indication information column provides the data query function in the (i+2)-th polling cycle. For the incremental data synchronized from the transaction processing engine in the (i+3)-th polling cycle, the analysis processing engine uses the first indication information column, whose version is older, to update the indication of the validity of the data rows in the data information column.
  • the second indication information column provides the data query function during the (i+3)-th polling cycle. If the indication information column being updated in the current polling cycle finishes updating at some moment before the end of the polling cycle, the analysis processing engine can switch to the newly updated indication information column for data queries from that moment on.
  • an indication information column in the updating state is marked with an error flag and cannot provide the data query function.
  • an indication information column in the update completion state is marked with a ready flag and can provide the data query function. If an indication information column remains in the updating state, then even if the analysis processing engine synchronizes incremental data from the transaction processing engine in a new polling cycle, it will not use the other indication information column to update the indication of the validity of the data rows in the data information column. That is, in this case the analysis processing engine suspends updating data validity based on the new incremental data. This ensures that at least one indication information column can provide the data query function at any given moment, thereby improving data query efficiency.
  • two indication information columns are set in the storage table and updated in rotation to indicate the validity of the data rows in the data information column, so that the analysis processing engine always has usable indication information to assist in processing data query requests, thereby improving data query efficiency.
  • because the amount of data in an indication information column is small, it can be updated almost in step with the data information column. By setting the polling cycle to a short duration, the analysis processing engine can use the latest data as far as possible when processing data query requests, thereby improving data timeliness.
  • because multiple indication information columns share the same data information column, there is no redundant data storage, and the data storage cost is low.
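  • The selection rules for the two indication information columns can be sketched as below. This is an illustrative reading of the description rather than the engine's implementation: the `IndicationColumn` class, the boolean `ready` field standing in for the ready and error flags, and the integer `version` counter are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class IndicationColumn:
    name: str
    ready: bool    # True: update completed; False: still being updated
    version: int   # hypothetical counter, larger means more recently updated


def column_for_query(columns):
    """Among columns that finished updating, pick the most recently updated."""
    ready = [c for c in columns if c.ready]
    if not ready:
        raise RuntimeError("no indication column can serve queries")
    return max(ready, key=lambda c: c.version)


def column_for_update(columns):
    """Pick the column that the next batch of incremental data should update."""
    updating = [c for c in columns if not c.ready]
    if updating:
        # Keep working on the column already being updated and never start on
        # the other one, so that it stays available for queries.
        return updating[0]
    return min(columns, key=lambda c: c.version)


first = IndicationColumn("first", ready=True, version=1)
second = IndicationColumn("second", ready=True, version=2)
cols = [first, second]

both_ready_query = column_for_query(cols).name     # newer column serves queries
both_ready_update = column_for_update(cols).name   # older column gets updated

first.ready = False                                # first column starts updating
during_update_query = column_for_query(cols).name
during_update_target = column_for_update(cols).name

first.ready, first.version = True, 3               # update finished mid-cycle
after_update_query = column_for_query(cols).name
```

  • Returning the column that is already updating, instead of moving on to the other one, mirrors the rule that validity updates are suspended rather than allowed to leave both columns unavailable for queries.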
  • the following embodiments of the present application respectively illustrate the implementation method of the above step 202 for three possible situations in which the incremental data includes modified data, deleted data or newly added data.
  • the incremental data includes newly added data.
  • the implementation of step 202 includes: the analysis and processing engine adds the newly added data to the storage table by adding a data row, and adds a first indication in the indication information row corresponding to the data row where the newly added data is located in the validity indication information column. The first indication is used to indicate that the data row is valid.
  • the analysis and processing engine only needs to add data rows in the data information column to add the new data, and add corresponding indication information rows in the validity indication information column to add the first indication. This can achieve rapid update of the new data and its validity, and the data is highly real-time.
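  • As a minimal sketch of the newly-added-data case, assuming a toy table held as a dictionary of column lists and the value 1 for the first indication (both assumptions for illustration only):

```python
VALID = 1  # assumed encoding of the first indication


def apply_new_data(table, row_id, name):
    """Append a newly added row and mark it valid; return its position."""
    table["ID"].append(row_id)
    table["Name"].append(name)
    table["valid"].append(VALID)   # matching indication information row
    return len(table["ID"]) - 1


table = {"ID": [], "Name": [], "valid": []}
pos = apply_new_data(table, 1, "alice")
```

  • Nothing already in the table is touched, so the cost of this case is a pure append to each column.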
  • the incremental data includes the modified data.
  • the implementation of step 202 includes: the analysis processing engine adds the modified data to the storage table by adding a data row, adds a first indication in the indication information row of the validity indication information column that corresponds to the data row holding the modified data, and changes the first indication to a second indication in the indication information row that corresponds to the data row holding the original data that was modified (the superseded data). The first indication is used to indicate that the data row is valid, and the second indication is used to indicate that the data row is invalid.
  • the analysis processing engine only needs to add a data row in the data information column for the modified data, add the corresponding indication information row carrying the first indication in the validity indication information column, and change the indication in the indication information row corresponding to the data row holding the superseded data. This achieves rapid updating of the modified data and its validity as well as of the superseded data and its validity, and the data is highly real-time.
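  • The modified-data case can be sketched in the same toy setting. The dictionary-of-lists table, the linear scan used to find the superseded row, and the encodings 1 and 0 are assumptions; the description does not say how the engine locates the row being superseded.

```python
VALID, INVALID = 1, 0  # assumed encodings of the first and second indications


def apply_modified_data(table, row_id, new_name):
    """Append the new version of a row and invalidate the version it replaces."""
    # Locate the currently valid row for this ID with a linear scan
    # (an assumption made for brevity).
    for pos, (existing_id, flag) in enumerate(zip(table["ID"], table["valid"])):
        if existing_id == row_id and flag == VALID:
            table["valid"][pos] = INVALID
    # The modified data is appended, never written in place.
    table["ID"].append(row_id)
    table["Name"].append(new_name)
    table["valid"].append(VALID)


table = {"ID": [1, 2], "Name": ["alice", "bob"], "valid": [VALID, VALID]}
apply_modified_data(table, 1, "alicia")
```

  • After the call both versions of ID 1 are still stored, but only the appended one carries the first indication.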
  • the incremental data includes deleted data.
  • a first implementation of step 202 includes: the analysis and processing engine modifies the first indication in the indication information row corresponding to the data row where the deleted data is located in the validity indication information column to a second indication. The second indication is used to indicate that the data row is invalid.
  • a second implementation of step 202 includes: the analysis and processing engine adds the deleted data to the storage table by adding data rows, and sets a first identifier in the identification information row corresponding to the newly added data row where the deleted data is located in the readability identification information column, and modifies the first indication in the indication information row corresponding to the data row where the original deleted data is located in the validity indication information column to a second indication. The first indication is used to indicate that the data row is valid, the second indication is used to indicate that the data row is invalid, and the first identifier is used to indicate that the data row cannot be read.
  • the analysis and processing engine only needs to modify the indication in the indication information row corresponding to the data row where the deleted data is located in the validity indication information column, which can realize the rapid update of the validity of the deleted data, and the data real-time performance is high.
  • the analysis processing engine needs to add a data row in the data information column to hold the deleted data, and set an identification in the identification information row corresponding to that newly added data row in the readability identification information column to indicate that the data row cannot be read; this also achieves rapid updating of the validity of the deleted data, and the data real-time performance is high.
  • the indication in the indication information row corresponding to the data row where the newly added deleted data is located in the validity indication information column is no longer effective and can be set to empty or any value.
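  • The two implementations for deleted data can be compared side by side in a short sketch. The table layout, the function names, the encodings (1 and 0 for the indications, -1 and 1 for the identifications, as in the worked example later in the description), and the use of `None` for the indication that is no longer meaningful are assumptions for illustration.

```python
VALID, INVALID = 1, 0          # assumed first / second indication
UNREADABLE, READABLE = -1, 1   # assumed first / second identification


def _invalidate(table, row_id):
    """Flip the indication of the valid row for row_id; return its position."""
    for pos, (rid, flag) in enumerate(zip(table["ID"], table["valid"])):
        if rid == row_id and flag == VALID:
            table["valid"][pos] = INVALID
            return pos
    return None


def delete_by_flag(table, row_id):
    """First implementation: only the validity indication changes."""
    _invalidate(table, row_id)


def delete_by_marker_row(table, row_id):
    """Second implementation: also append an unreadable copy of the deleted row."""
    pos = _invalidate(table, row_id)
    if pos is None:
        return
    table["ID"].append(row_id)
    table["Name"].append(table["Name"][pos])
    table["readable"].append(UNREADABLE)
    table["valid"].append(None)  # indication is no longer meaningful here


def visible_rows(table):
    """Rows a query may return: readable and valid."""
    columns = zip(table["ID"], table["Name"], table["readable"], table["valid"])
    return [(rid, name) for rid, name, r, v in columns
            if r == READABLE and v == VALID]


def make_table():
    return {"ID": [1, 2], "Name": ["alice", "bob"],
            "readable": [READABLE, READABLE], "valid": [VALID, VALID]}


t1 = make_table()
delete_by_flag(t1, 1)

t2 = make_table()
delete_by_marker_row(t2, 2)
```

  • The first variant leaves the row count unchanged, while the second keeps an append-only record of the deletion at the cost of one extra row.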
  • the analysis processing engine can specifically adopt the following approach to update the indication of the validity of the data rows in the data information column.
  • if the first indication information column and the second indication information column are both in the update completion state, the one with the older version is selected for update. If the first indication information column or the second indication information column is in the updating state, the one in the updating state is selected for update.
  • when the analysis processing engine updates an indication information column, it first uniformly sets the first indication in the indication information rows newly added since the last update of that indication information column, and then, according to the version information column, changes the indication in the indication information rows corresponding to old-version data to the second indication.
  • the old version data may be modified data or deleted data.
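  • The two-step refresh of an indication information column can be sketched as follows. This is a simplified reading: it assumes that rows belonging to the same record share an ID, that a larger version number means a newer row, and that the whole table is rescanned, whereas a real engine would presumably examine only the rows added since the last refresh.

```python
VALID, INVALID = 1, 0  # assumed encodings of the first and second indications


def refresh_indication_column(ids, versions, indication):
    """Bring one indication column up to date with the data rows."""
    # Step 1: rows appended since the last refresh all start out valid.
    indication.extend([VALID] * (len(ids) - len(indication)))
    # Step 2: use the version column to find the newest version of each ID...
    newest = {}
    for row_id, version in zip(ids, versions):
        newest[row_id] = max(version, newest.get(row_id, version))
    # ...and mark every older version invalid.
    for pos, (row_id, version) in enumerate(zip(ids, versions)):
        if version < newest[row_id]:
            indication[pos] = INVALID
    return indication


ids = [1, 2, 1]                 # ID 1 was modified, producing a second row
versions = [1, 1, 2]
first_column = [VALID, VALID]   # last refreshed before the third row arrived
refresh_indication_column(ids, versions, first_column)
```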
  • Step 203: In response to receiving the data query request, the analysis processing engine outputs the valid data row indicated by the validity indication information column in the storage table.
  • the data query request includes a data identifier. If multiple versions of data corresponding to the data identifier are stored in the storage table, the analysis and processing engine outputs the valid version of the data indicated by the validity indication information in the storage table, and the valid version is generally the latest version.
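  • A lookup by data identifier might look like the sketch below, where the function name `query_by_id`, the column names, and the sample values are hypothetical. It shows that when several versions of the same identifier are stored, only the version marked valid is returned.

```python
VALID = 1  # assumed encoding of the first indication


def query_by_id(table, data_id):
    """Return (name, version) for every valid row whose ID matches data_id."""
    columns = zip(table["ID"], table["Name"], table["version"], table["valid"])
    return [(name, version) for rid, name, version, flag in columns
            if rid == data_id and flag == VALID]


table = {
    "ID": [1, 1, 2],
    "Name": ["alice", "alicia", "bob"],
    "version": [1, 2, 1],
    "valid": [0, 1, 1],
}
result = query_by_id(table, 1)
```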
  • the two columns “ID” and “Name” are data information columns.
  • the first indication information column and the second indication information column are both in the update completion state.
  • in the readability identification information column, "-1" indicates that the data row cannot be read, and "1" indicates that the data row can be read.
  • in the version information column, Arabic numerals indicate the version number, and the larger the number, the newer the version.
  • in the indication information columns, "1" indicates that the data row is valid, and "0" indicates that the data row is invalid.
  • the second indication information column is polled for update.
  • in Table 6, both the first indication information column and the second indication information column are in the update completion state.
  • the analysis processing engine synchronizes the incremental data (deleted data) on the transaction processing engine, and the storage table obtained is shown in Table 7 or Table 8.
  • the first indication information column is polled for update.
  • both the first indication information column and the second indication information column are in the update completion state.
  • both the first indication information column and the second indication information column are in the update completion state.
  • if the analysis processing engine receives a data query request, it may output the data shown in Table 11 based on Table 9 or Table 10.
  • the analysis and processing engine indicates whether the data rows in the data information column are valid or invalid by setting a validity indication information column in the storage table.
  • the analysis and processing engine can filter the invalid data rows in the data information column based on the validity indication information column and read the valid data rows in the data information column, thereby realizing fast data deduplication.
  • the entire data deduplication process is transparent to the business layer and improves data query efficiency. Since the analysis and processing engine does not need to create and store a full data snapshot of the transaction processing engine, the amount of data storage is reduced, thereby saving storage resources.
  • after the analysis processing engine synchronizes the incremental data from the transaction processing engine in real time, it can quickly update the latest status of the data through the validity indication information column, so that it uses the most recently updated data as far as possible when processing data query requests, thereby improving data timeliness.
  • the sequence of the steps of the data processing method provided in the embodiments of the present application can be adjusted appropriately, and steps can be added or removed as the situation requires. Any variation that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application.
  • in addition to being used by the analysis processing engine for data deduplication, the validity indication information column provided in the embodiments of the present application can be used for data backup and archiving of historical data, and can also be used for lightweight multi-version control in the analysis processing engine to provide lightweight transaction support. These uses are not described one by one here.
  • FIG. 3 is a schematic diagram of the structure of a data processing device provided in an embodiment of the present application.
  • the data processing device is applied to an analysis processing engine.
  • the data processing device 300 includes: an acquisition module 301 and an update module 302.
  • the acquisition module 301 is used to acquire the incremental data from the transaction processing engine.
  • the acquisition module 301 is specifically used to execute the above step 201.
  • the update module 302 is used to update the storage table according to the incremental data.
  • the storage table adopts a column storage format.
  • the storage table includes a data information column and a validity indication information column.
  • the incremental data is stored in the data information column in the form of data rows.
  • the validity indication information column is used to indicate whether the data row in the data information column is valid or invalid.
  • the update module 302 is specifically used to perform the above step 202.
  • the data processing device 300 further includes: a query module 303, which is used to output the valid data row indicated by the validity indication information column in the storage table in response to receiving the data query request.
  • the query module 303 is specifically used to execute the above step 203.
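  • How the three modules fit together can be sketched end to end. Everything named here is hypothetical: the class `DataProcessingDevice`, the injected `fetch_increment` callable standing in for the link to the transaction processing engine, and the `(operation, id, name)` tuple format of an increment are illustrative assumptions, since the description does not specify them.

```python
VALID, INVALID = 1, 0  # assumed encodings of the first and second indications


class DataProcessingDevice:
    def __init__(self, fetch_increment):
        # Hypothetical callable that returns a list of (op, id, name) tuples.
        self.fetch_increment = fetch_increment
        self.table = {"ID": [], "Name": [], "valid": []}

    def acquire(self):
        """Acquisition module 301 (step 201)."""
        return self.fetch_increment()

    def update(self, increment):
        """Update module 302 (step 202)."""
        for op, row_id, name in increment:
            if op in ("modify", "delete"):
                # Invalidate the stored version(s) of this row.
                for pos, existing_id in enumerate(self.table["ID"]):
                    if existing_id == row_id:
                        self.table["valid"][pos] = INVALID
            if op in ("add", "modify"):
                # New and modified data are appended as valid rows.
                self.table["ID"].append(row_id)
                self.table["Name"].append(name)
                self.table["valid"].append(VALID)

    def query(self):
        """Query module 303 (step 203): output only the valid rows."""
        columns = zip(self.table["ID"], self.table["Name"], self.table["valid"])
        return [(rid, name) for rid, name, flag in columns if flag == VALID]


changes = [("add", 1, "alice"), ("add", 2, "bob"),
           ("modify", 1, "alicia"), ("delete", 2, None)]
device = DataProcessingDevice(lambda: changes)
device.update(device.acquire())
rows = device.query()
```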
  • the incremental data includes new data
  • the update module 302 is specifically used to: add the new data to the storage table by adding data rows, and add a first indication in the indication information row corresponding to the data row where the new data is located in the validity indication information column, and the first indication is used to indicate that the data row is valid.
  • the incremental data includes modified data
  • the update module 302 is specifically used to: add the modified data to the storage table by adding a data row, add a first indication in the indication information row of the validity indication information column that corresponds to the data row holding the modified data, and change the first indication to a second indication in the indication information row that corresponds to the data row holding the original data that was modified, the first indication being used to indicate that the data row is valid, and the second indication being used to indicate that the data row is invalid.
  • the incremental data includes deleted data
  • the update module 302 is specifically used to: modify the first indication in the indication information row corresponding to the data row where the deleted data is located in the validity indication information column to a second indication, and the second indication is used to indicate that the data row is invalid.
  • the storage table also includes a readability identification information column, and the readability identification information column is used to identify that the data row in the data information column can be read or cannot be read.
  • the incremental data includes deleted data
  • the update module 302 is specifically used to: add the deleted data to the storage table by adding data rows, and set a first identifier in the identification information row corresponding to the data row where the newly added deleted data is located in the readability identification information column, and modify the first indication in the indication information row corresponding to the data row where the original deleted data is located in the validity indication information column to a second indication, the first indication is used to indicate that the data row is valid, the second indication is used to indicate that the data row is invalid, and the first identifier is used to indicate that the data row cannot be read.
  • the validity indication information column includes a first indication information column and a second indication information column
  • the first indication information column and the second indication information column are used to poll and update the indication of the validity of the data rows in the data information column
  • the first indication information column and the second indication information column meet the following conditions: at the same time, at least one of the first indication information column and the second indication information column supports the data query function; when both the first indication information column and the second indication information column support the data query function, the most recently updated indication information column of the first indication information column and the second indication information column is used for the analysis and processing engine to perform data query.
  • the storage table further includes a version information column, and the version information column is used to indicate the time sequence in which the data rows in the data information column are added to the data information column.
  • FIG. 5 exemplarily provides a possible architecture diagram of a data processing device.
  • a data processing device 500 may include a processor 501, a memory 502, a communication interface 503, and a bus 504.
  • the number of processors 501 may be one or more, and FIG. 5 only illustrates one of the processors 501.
  • the processor 501 may be a central processing unit (CPU). If the data processing device has multiple processors 501, the types of the multiple processors 501 may be different, or may be the same. Optionally, the multiple processors of the data processing device may also be integrated into a multi-core processor.
  • the memory 502 is used to store computer instructions and data.
  • the memory 502 can store computer instructions and data required to implement the data processing method provided in the present application.
  • the memory 502 can be any one or any combination of the following storage media: non-volatile memory (such as read-only memory (ROM), solid state disk (SSD), or hard disk drive (HDD)), optical disk, or volatile memory.
  • the communication interface 503 may be any one or any combination of the following devices: a network interface (such as an Ethernet interface), a wireless network card, or other device with a network access function.
  • the communication interface 503 is used for the data processing device 500 to perform data communication with other devices or components.
  • the bus 504 can connect the processor 501 with the memory 502 and the communication interface 503.
  • the processor 501 can access the memory 502, and can also use the communication interface 503 to exchange data with other devices or components.
  • the data processing device 500 executes the computer instructions in the memory 502 to implement the data processing method provided by the present application. For example, incremental data from the transaction processing engine is obtained.
  • the storage table is updated according to the incremental data, and the storage table adopts a column storage format, and the storage table includes a data information column and a validity indication information column.
  • the incremental data is stored in the data information column in the form of data rows, and the validity indication information column is used to indicate whether the data rows in the data information column are valid or invalid.
  • for the process by which the data processing device 500 executes the computer instructions in the memory 502 to implement the steps of the data processing method provided by the present application, refer to the corresponding description in the above method embodiments.
  • the embodiment of the present application also provides a computer-readable storage medium, which is a non-volatile computer-readable storage medium, and the computer-readable storage medium includes program instructions.
  • when the program instructions are executed on a computer device, the computer device executes the data processing method provided in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product including instructions.
  • when the computer program product is run on a computer, the computer executes the data processing method provided by the embodiments of the present application.
  • "A and/or B" can represent three cases: A exists alone, A and B exist at the same time, or B exists alone.
  • the character "/" in this document generally indicates that the associated objects before and after it are in an "or" relationship.
  • the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

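This application discloses a data processing method and device, belonging to the field of database technology. An analysis processing engine obtains incremental data from a transaction processing engine. The analysis processing engine updates a storage table according to the incremental data. The storage table adopts a column storage format. The storage table includes a data information column and a validity indication information column. The incremental data is stored in the data information column in the form of data rows. The validity indication information column is used to indicate whether the data rows in the data information column are valid or invalid. The analysis processing engine can filter out the invalid data rows in the data information column based on the validity indication information column and read the valid data rows, thereby achieving fast data deduplication. The validity indication information column holds a small amount of data and allows data validity to be updated quickly, which saves storage resources and improves data timeliness.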

Description

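Data processing method and device
This application claims priority to Chinese patent application No. 202211298222.7, filed on October 21, 2022 and entitled "Data processing method and device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of database technology, and in particular to a data processing method and device.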
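Background
Transactional processing (TP) and analytical processing (AP) are two types of database system applications. A TP system is used to manage and process transactions; for example, a TP system is used for sales order entry or banking transaction processing. An AP system is used to analyze data and generate reports for business analysts; for example, reports generated by an AP system include aggregated sales statistics broken down by geographic region, product category, or customer classification.
As the boundary between data services becomes increasingly blurred, with AP services taking on TP characteristics and TP services taking on AP characteristics, mixed-workload data processing has become a market trend. Hybrid transactional and analytical processing (HTAP) is a common mixed-workload mode. The HTAP mode can support a large number of concurrent updates, and its data synchronization latency is usually on the order of seconds or milliseconds. An HTAP system includes a TP load and an AP load; the TP load adopts a row storage format, and the AP load adopts a column storage format. Data on the TP load is synchronized to the AP load in real time. Because the AP load adopts a column storage format, it usually appends data when updating, which results in a large amount of duplicate data in the AP load, so the data must be deduplicated when read requests are processed.
At present, the AP load periodically creates a full data snapshot of the latest version of the TP load and processes read requests based on that snapshot. The full data snapshot essentially contains the data of the AP load after deduplication under the corresponding version. However, on the one hand, because the AP load creates the full data snapshot periodically, the data used by the AP load when processing read requests is limited by the snapshot creation interval and has poor real-time performance. On the other hand, storing the full data snapshot occupies considerable storage resources on the AP load.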
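Summary
This application provides a data processing method and device, which can solve the problems that the data currently used by an AP engine to process AP requests has poor real-time performance and that storing full data snapshots occupies considerable storage resources.
In a first aspect, a data processing method is provided. The method includes: an analysis processing engine obtains incremental data from a transaction processing engine. The analysis processing engine updates a storage table according to the incremental data. The storage table adopts a column storage format. The storage table includes a data information column and a validity indication information column. The incremental data is stored in the data information column in the form of data rows. The validity indication information column is used to indicate whether the data rows in the data information column are valid or invalid.
In this application, the analysis processing engine sets a validity indication information column in the storage table to indicate whether the data rows in the data information column are valid or invalid. When processing a data query request, the analysis processing engine can filter out the invalid data rows in the data information column based on the validity indication information column and read the valid data rows, thereby achieving fast data deduplication and improving data query efficiency. Because the analysis processing engine does not need to create and store a full data snapshot of the transaction processing engine, the amount of stored data is reduced, which saves storage resources. In addition, after synchronizing the incremental data from the transaction processing engine in real time, the analysis processing engine can quickly update the latest status of the data through the validity indication information column, so that it uses the most recently updated data as far as possible when processing data query requests, which improves data timeliness.
In a possible implementation, in response to receiving a data query request, the analysis processing engine outputs the valid data rows indicated by the validity indication information column in the storage table.
In this application, the analysis processing engine can filter out the invalid data rows in the data information column based on the validity indication information column and read the valid data rows, thereby achieving fast data deduplication and improving data query efficiency.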
在一种可能的实现方式中,增量数据包括新增数据,分析处理引擎根据增量数据更新存储表的实现方式,包括:分析处理引擎采用增加数据行的方式在存储表中添加新增数据,并在有效性指示信息列中与该新增数据所在数据行对应的指示信息行添加第一指示,第一指示用于指示数据行有效。
这种实现方式下,分析处理引擎只需在数据信息列中增加数据行以添加新增数据,并在有效性指示信 息列中增加对应的指示信息行以添加第一指示即可,可实现对新增数据及其有效性的快速更新,数据实时性较高。
在一种可能的实现方式中,增量数据包括修改数据,分析处理引擎根据增量数据更新存储表的实现方式,包括:分析处理引擎采用增加数据行的方式在存储表中添加修改数据,并在有效性指示信息列中与该修改数据所在数据行对应的指示信息行添加第一指示,以及,将有效性指示信息列中与被修改数据所在数据行对应的指示信息行中的第一指示修改为第二指示,第一指示用于指示数据行有效,第二指示用于指示数据行无效。
这种实现方式下,分析处理引擎只需在数据信息列中增加数据行以添加修改数据,并在有效性指示信息列中增加对应的指示信息行以添加第一指示,并修改有效性指示信息列中与被修改数据所在数据行对应的指示信息行中的指示即可,可实现对修改数据及其有效性、被修改数据及其有效性的快速更新,数据实时性较高。
在一种可能的实现方式中,增量数据包括删减数据,分析处理引擎根据增量数据更新存储表的实现方式,包括:分析处理引擎将有效性指示信息列中与删减数据所在数据行对应的指示信息行中的第一指示修改为第二指示,第二指示用于指示数据行无效。
这种实现方式下,分析处理引擎只需修改有效性指示信息列中与删除数据所在数据行对应的指示信息行中的指示即可,可实现对删除数据的有效性的快速更新,数据实时性较高。
在另一种可能的实现方式中,存储表还包括可读性标识信息列,可读性标识信息列用于标识数据信息列中的数据行可被读取或不可被读取,增量数据包括删减数据,分析处理引擎根据增量数据更新存储表的实现方式,包括:分析处理引擎采用增加数据行的方式在存储表中添加删减数据,且在可读性标识信息列中与新增的删减数据所在数据行对应的标识信息行中设置第一标识,并将有效性指示信息列中与原有的删减数据所在数据行对应的指示信息行中的第一指示修改为第二指示,第一指示用于指示数据行有效,第二指示用于指示数据行无效,第一标识用于标识数据行不可被读取。
这种实现方式下,分析处理引擎需在数据信息列中增加数据行以添加删除数据,并在可读性标识信息列中与新增的删减数据所在数据行对应的标识信息行中设置标识以指示数据行不可被读取,可实现对删除数据的有效性的快速更新,数据实时性较高。
在一种可能的实现方式中,有效性指示信息列包括第一指示信息列和第二指示信息列。第一指示信息列和第二指示信息列用于轮询更新对数据信息列中的数据行的有效性的指示,且第一指示信息列和第二指示信息列满足以下条件:同一时刻,第一指示信息列和第二指示信息列中的至少一个支持数据查询功能;在第一指示信息列和第二指示信息列都支持数据查询功能的情况下,第一指示信息列和第二指示信息列中最近更新的指示信息列用于分析处理引擎进行数据查询。
本申请中,通过在存储表中设置两个指示信息列来轮询更新对数据信息列中的数据行的有效性的指示,使得分析处理引擎在任意时刻总是有可用的指示信息来辅助处理数据查询请求,从而提高数据查询效率。另外,由于指示信息列的数据量较小,可基本实现与数据信息列的同步快速更新,通过将轮询周期设置成较短时长,可以使分析处理引擎在处理数据查询请求时尽可能使用最新的数据,从而提高了数据时效性。另外,由于多个指示信息列共享同一数据信息列,因此不存在数据冗余存储的问题,数据存储成本较低。
在一种可能的实现方式中,存储表还包括版本信息列,版本信息列用于指示数据信息列中的数据行被添加至数据信息列的时间先后顺序。
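According to a second aspect, a data processing apparatus is provided and is applied to an analytical processing engine. The apparatus includes: an obtaining module, configured to obtain incremental data from a transactional processing engine; and an update module, configured to update a storage table based on the incremental data, where the storage table uses a columnar storage format and includes a data information column and a validity indication information column, the incremental data is stored in the data information column in the form of data rows, and the validity indication information column indicates whether a data row in the data information column is valid or invalid.
Optionally, the apparatus further includes a query module, configured to output, in response to receiving a data query request, the valid data rows indicated by the validity indication information column in the storage table.
Optionally, the incremental data includes newly added data, and the update module is specifically configured to: add the newly added data to the storage table by adding a data row, and add a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the newly added data is located, where the first indication indicates that a data row is valid.
Optionally, the incremental data includes modified data, and the update module is specifically configured to: add the modified data to the storage table by adding a data row, add a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the modified data is located, and change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the data being modified is located, where the first indication indicates that a data row is valid and the second indication indicates that a data row is invalid.
Optionally, the incremental data includes deleted data, and the update module is specifically configured to: change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the deleted data is located, where the second indication indicates that a data row is invalid. Alternatively, the storage table further includes a readability identifier information column, which identifies whether a data row in the data information column can or cannot be read, the incremental data includes deleted data, and the update module is specifically configured to: add the deleted data to the storage table by adding a data row, set a first identifier in the identifier information row that corresponds, in the readability identifier information column, to the data row in which the newly added deleted data is located, and change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the original deleted data is located, where the first indication indicates that a data row is valid, the second indication indicates that a data row is invalid, and the first identifier identifies that a data row cannot be read.
Optionally, the validity indication information column includes a first indication information column and a second indication information column, which are used to update, in a polling manner, the indications of the validity of the data rows in the data information column and which meet the following conditions: at any given moment, at least one of the first indication information column and the second indication information column supports the data query function; and when both support the data query function, the more recently updated of the two is used by the analytical processing engine for data queries.
Optionally, the storage table further includes a version information column, which indicates the chronological order in which the data rows in the data information column were added to the data information column.
According to a third aspect, a data processing apparatus is provided. The data processing apparatus may be an analytical processing engine and includes a memory and a processor. The memory stores program instructions, and the processor runs the program instructions to perform the method according to the first aspect and its implementations.
According to a fourth aspect, a computer-readable storage medium is provided, including program instructions. When the program instructions are run on a computer device, the computer device is enabled to perform the method according to the first aspect and its implementations.
According to a fifth aspect, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform the method according to the first aspect and its implementations.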
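Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of another data processing apparatus according to an embodiment of this application;
FIG. 5 is a schematic architectural diagram of a data processing apparatus according to an embodiment of this application.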
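Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
To help the reader understand the solutions of this application, some terms used in the embodiments of this application are explained first.
Snapshot: a fully usable copy of a data set. A snapshot includes an image of the corresponding data set at a point in time (the moment at which the copy starts). Based on a snapshot of a data set at a given moment, the data of that data set at that moment can be queried. A snapshot can be regarded as a copy or a replica of the data.
Transaction: a transaction in a database is a series of operations performed as a single logical unit of work, which is either performed in full or not performed at all. Transaction processing ensures that data-oriented resources are not permanently updated unless all operations in the transactional unit complete successfully. Combining a group of related operations into a unit that either entirely succeeds or entirely fails simplifies error recovery and makes applications more reliable. To qualify as a transaction, a logical unit of work must satisfy the atomicity, consistency, isolation, and durability (ACID) properties. Satisfying the ACID properties is particularly important when transactions are processed in a database, for example during a purchase in online shopping. Under normal circumstances, the payment operation proceeds smoothly, the purchase succeeds, and all database information related to the purchase is updated successfully. However, if an error occurs at any step in this series of operations, for example an exception when the product inventory is updated or insufficient funds in the customer's bank account, the purchase fails. Once the purchase fails, all information in the database must remain in its pre-purchase state. Otherwise, the information in the database becomes chaotic and unpredictable.
Transactional processing engine (TP engine): a database engine used to process transactions. User data is committed and persisted in the form of transactions through the TP engine.
Analytical processing engine (AP engine): a database engine used to analyze transactions. The AP engine synchronizes the data on the TP engine in real time and provides data analysis functions.
As the boundary between data services becomes increasingly blurred, with AP services taking on TP characteristics and TP services taking on AP characteristics, mixed-workload data processing has become a market trend. HTAP is an emerging application architecture that breaks down the "wall" between transactional processing and analytical processing and enables mixed-workload data processing. In other words, the HTAP mode is a common mixed-workload mode. The HTAP mode can support a large number of concurrent updates, and its data synchronization latency is typically at the level of seconds or milliseconds. In an HTAP system, because the AP workload uses a columnar storage format, it usually adds data in append mode when updating data. As a result, a large amount of duplicate data exists in the AP workload, and the data must be deduplicated when a read request is processed. For example, a record "ID=1, name=Tom" is written to the TP workload. After synchronizing this record locally, the AP workload adds a data row to the storage table, which uses a columnar storage format, to record it. The TP workload then modifies the record "ID=1, name=Tom" to "ID=1, name=Tom-1". After synchronizing the modified data locally, the AP workload appends another data row to the storage table to record the modified data, and the storage table now contains two records with ID=1. In this case, when processing a read request, the AP workload needs to deduplicate the two records with ID=1 in the storage table, for example by reading only the latest data for a given ID and filtering out the old data.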
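The duplicate-on-append behavior described above can be illustrated with a minimal Python sketch. It is not taken from this application: the names `rows`, `append_row`, and `read_deduplicated` are hypothetical, and an in-memory list stands in for the columnar storage table.

```python
# Hypothetical append-only store: every change synchronized from the
# TP workload is appended as a new row; nothing is updated in place.
rows = []  # (id, name) tuples in arrival order


def append_row(row_id, name):
    """Append one synchronized record to the store."""
    rows.append((row_id, name))


def read_deduplicated():
    """Read-time deduplication: keep only the last appended row per ID."""
    latest = {}
    for row_id, name in rows:
        latest[row_id] = name  # a later row overwrites an earlier one
    return sorted(latest.items())


append_row(1, "Tom")
append_row(1, "Tom-1")  # the modification arrives as a second row with ID=1
result = read_deduplicated()
```

Every read has to scan all rows to find the newest one per ID, which is the cost that the snapshot-based approach and the validity indication column both try to avoid.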
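To improve data read efficiency, the AP workload currently creates, at regular intervals, a full data snapshot of the latest version of the TP workload and processes read requests based on that snapshot. The full data snapshot may also be called a globally consistent snapshot. It essentially contains the deduplicated data of the AP workload for the corresponding version. For example, the AP workload creates several hidden tables in a tablespace, and each hidden table is a shadow table holding the deduplicated data of one version of the original table. The original table is the table that the AP workload uses to store data from the TP workload, and one shadow table is a full data snapshot, created by the AP workload, of one version of the TP workload.
This solution has limitations. First, because the AP workload periodically creates a full data snapshot of the latest version of the TP workload and creating a full data snapshot takes a long time, the snapshot creation interval is also long. Limited by this interval, the data that the AP workload uses to process read requests has poor real-time performance. For example, data that the AP workload updates between the creation of two adjacent snapshots can be used to process read requests only after the next snapshot is created. Second, storing full data snapshots occupies a large amount of storage resources on the AP workload.
In view of this, an embodiment of this application provides a data processing method. After obtaining incremental data from a transactional processing engine, an analytical processing engine updates a storage table based on the incremental data. The storage table uses a columnar storage format and includes a data information column and a validity indication information column. The incremental data is stored in the data information column in the form of data rows. The validity indication information column indicates whether a data row in the data information column is valid or invalid. Specifically, when the validity indication information column indicates that a data row is invalid, the data row has expired, for example because it has been deleted or the information in it has been modified, and the analytical processing engine does not read or output it. When the validity indication information column indicates that a data row is valid, the data row has not expired (it is in effect), and the analytical processing engine can read and output it. The validity indication information column may separately indicate whether each data row in the data information column is valid or invalid, and the indication information rows in the validity indication information column may be in one-to-one correspondence with the data rows in the data information column. In this embodiment of this application, the analytical processing engine sets a validity indication information column in the storage table to indicate whether the data rows in the data information column are valid or invalid. When processing a data query request, the analytical processing engine can use the validity indication information column to filter out the invalid data rows and read the valid data rows, which achieves fast data deduplication and improves data query efficiency. Because the analytical processing engine does not need to create and store a full data snapshot of the transactional processing engine, the amount of stored data is reduced and storage resources are saved. In addition, after synchronizing incremental data from the transactional processing engine in real time, the analytical processing engine can quickly update the latest state of the data through the validity indication information column, so that it uses the most recently updated data as far as possible when processing data query requests, which improves data timeliness.
The following describes the technical solutions of this application in detail from several perspectives, including the application scenario, the method procedure, the virtual apparatus, and the hardware apparatus.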
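The following describes, by way of example, an application scenario of this application.
The data processing method provided in the embodiments of this application can be applied to HTAP solutions, data warehouses, and the like. It can also be applied to big data services and various data analysis systems, including but not limited to user data analysis and settlement services.
For example, FIG. 1 is a schematic diagram of an application scenario according to an embodiment of this application. As shown in FIG. 1, the application scenario includes a database system 101 and a terminal 102, which can communicate with each other. A user can access the database system 101 through the terminal 102, including writing data to the database system 101 and reading data from it.
The database system 101 includes a database engine and a database. The database engine is the interface through which a user accesses and operates on the data in the database. The database engine may be a storage engine implemented based on a log-structured merge tree (LSM-Tree). In this embodiment of this application, the database engine includes a TP engine and an AP engine. The TP engine provides a write interface to users and responds to data write requests. User data is committed in the form of transactions through the TP engine. The AP engine synchronizes the data on the TP engine in real time and provides a read-only interface to users to respond to data query requests. The database includes an online transactional processing (OLTP) database and an online analytical processing (OLAP) database. The OLTP database uses row storage (that is, a row-based storage format) to support transactional workloads (TP workloads), and the OLAP database uses column storage (that is, a columnar storage format) to support analytical workloads (AP workloads). In the following embodiments of this application, the data in the transactional processing engine may be stored in the OLTP database, and the storage table in the analytical processing engine may be stored in the OLAP database.
In the application scenario shown in FIG. 1, a user can send a data write request to the database system 101 through the terminal 102. The TP engine responds to the data write request and persistently stores the data in the OLTP database in the form of a transaction. The AP engine synchronizes the data in the OLTP database to the OLAP database through data synchronization. A user can send a data query request to the database system 101 through the terminal 102. The AP engine responds to the data query request, obtains the corresponding data from the OLAP database, and outputs it to the terminal 102.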
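The following describes, by way of example, the method procedure of this application.
For example, FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of this application. As shown in FIG. 2, the method includes the following steps.
Step 201: An analytical processing engine obtains incremental data from a transactional processing engine.
Optionally, the incremental data includes one or more of modified data, deleted data, and newly added data. Modified data on the transactional processing engine includes data rewritten at a storage address to which data has already been written. For example, modified data may be data written to a storage address in overwrite mode. Deleted data on the transactional processing engine includes data deleted from a storage address to which data has already been written. Newly added data on the transactional processing engine includes data newly written to a storage address to which no data has been written.
Optionally, the analytical processing engine synchronizes the incremental data in the transactional processing engine in real time. In a specific implementation, the analytical processing engine may replay the incrementally updated data in the transactional processing engine based on logical logs and perform incremental data synchronization. After determining that the transactional processing engine has incrementally updated data, the analytical processing engine first determines whether the transactional processing engine holds existing data, which may also be called historical data. If it does, the analytical processing engine obtains a consistent view of the transactional processing engine and performs full data synchronization. If it does not, the analytical processing engine performs incremental data synchronization.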
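Step 202: The analytical processing engine updates a storage table based on the incremental data. The storage table uses a columnar storage format and includes a data information column and a validity indication information column.
The data information column stores, in the form of data rows, the data that the analytical processing engine synchronizes from the transactional processing engine. Accordingly, the incremental data that the analytical processing engine obtains from the transactional processing engine is stored in the data information column in the form of data rows. The validity indication information column indicates whether a data row in the data information column is valid or invalid. When the validity indication information column indicates that a data row is invalid, the data row has expired, for example because it has been deleted or the information in it has been modified, and the analytical processing engine does not read or output it. When the validity indication information column indicates that a data row is valid, the data row has not expired (it is in effect), and the analytical processing engine can read and output it.
In this embodiment of this application, a row under the data information column of the storage table is called a data row, and a row under the validity indication information column is called an indication information row. Optionally, the data rows in the storage table are in one-to-one correspondence with the indication information rows, and each indication information row holds an indication of whether the corresponding data row is valid. For ease of description, this embodiment uses a first indication to indicate that a data row is valid and a second indication to indicate that a data row is invalid. The first indication and the second indication may be represented by different numbers, letters, characters, or the like. For example, Table 1 shows the structure of one storage table in the analytical processing engine.
Table 1
Referring to Table 1, the storage table in the analytical processing engine includes at least a data information column and a validity indication information column. Table 1 uses an example in which the data information column and the validity indication information column are integrated into a single table. In this implementation, the analytical processing engine does not need to create a separate table to indicate the validity of the data information, which enables fast data deduplication while reducing storage cost. In practice, the data information column and the validity indication information column may also be held in two independent tables that are connected by a join. In this case, the storage table in this embodiment of this application can be understood as a set of tables: the data information column can be regarded as a data table, and the validity indication information column can be regarded as a columnar snapshot that indicates the validity of the data information in the data table. Because the validity indication information column holds little data, it takes less time to create and occupies fewer storage resources than a full data snapshot, and therefore also improves data timeliness and saves storage resources.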
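As a rough illustration of the single-table layout described for Table 1, the following Python sketch stores each column as its own list, aligned by row index. The class name `StorageTable`, the field names, and the choice of 1 and 0 for the first and second indications are assumptions made for this example, not definitions from this application.

```python
from dataclasses import dataclass, field


@dataclass
class StorageTable:
    """Hypothetical column-oriented table: one list per column, aligned by row index."""
    ids: list = field(default_factory=list)    # data information column
    names: list = field(default_factory=list)  # data information column
    valid: list = field(default_factory=list)  # validity indication column (1 valid, 0 invalid)

    def append(self, row_id, name):
        """Add a data row and its matching indication information row."""
        self.ids.append(row_id)
        self.names.append(name)
        self.valid.append(1)  # a newly added row starts out valid
        return len(self.ids) - 1  # index of the new row


table = StorageTable()
idx = table.append(1, "Tom")
```

Because the validity column is a separate list, it can be rewritten without touching the data columns, which is what makes it much cheaper to refresh than a full snapshot.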
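Optionally, the storage table in the analytical processing engine further includes a readability identifier information column and/or a version information column. The readability identifier information column identifies whether a data row in the data information column can or cannot be read. The version information column indicates the chronological order in which the data rows in the data information column were added to the data information column. In this embodiment of this application, a row under the readability identifier information column of the storage table is called an identifier information row, and a row under the version information column is called a version information row. Optionally, the identifier information rows in the storage table are in one-to-one correspondence with the data rows, and each identifier information row holds an identifier of whether the corresponding data row can be read. For ease of description, this embodiment uses a first identifier to identify that a data row cannot be read and a second identifier to identify that a data row can be read. The first identifier and the second identifier may be represented by different numbers, letters, characters, or the like. Optionally, the version information rows in the storage table are in one-to-one correspondence with the data rows, and each version information row holds the version number of the corresponding data row. The version number may be represented by a number, where, for example, a larger value indicates a newer version. For example, Table 2 shows the structure of another storage table in the analytical processing engine.
Table 2
Referring to Table 2, the storage table in the analytical processing engine includes a data information column, a readability identifier information column, a version information column, and a validity indication information column. In practice, the storage table may include more data attributes, and its structure can be designed according to actual requirements. This is not limited in the embodiments of this application.
Optionally, the validity indication information column includes a first indication information column and a second indication information column, which are used to update, in a polling manner, the indications of the validity of the data rows in the data information column. The first indication information column and the second indication information column meet the following conditions: at any given moment, at least one of them supports the data query function; and when both support the data query function, the more recently updated of the two is used by the analytical processing engine for data queries. For example, Table 3 shows the structure of yet another storage table in the analytical processing engine.
Table 3
Referring to Table 3, the storage table in the analytical processing engine includes a data information column, a readability identifier information column, a version information column, a first indication information column, and a second indication information column. Table 3 uses an example in which the validity indication information column includes the first indication information column and the second indication information column. In practice, the validity indication information column may include three, four, or more indication information columns, which are used to update, in a polling manner, the indications of the validity of the data rows in the data information column. In this embodiment of this application, multiple indication information columns may also be used to indicate the validity of the data rows in the data information column under multiple versions, to meet the needs of scenarios in which users access multiple versions concurrently.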
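Optionally, the polling period may be at the level of seconds or minutes, for example 5 minutes. For incremental data that the analytical processing engine synchronizes from the transactional processing engine in the i-th polling period, the analytical processing engine uses the first indication information column to update the indications of the validity of the data rows in the data information column. If the first indication information column remains in the updating state throughout the i-th polling period, the second indication information column provides the data query function during the i-th polling period. If the first indication information column is in the update-complete state at the end of the i-th polling period, then for incremental data synchronized in the (i+1)-th polling period, the analytical processing engine uses the second indication information column to update the indications of validity. If the second indication information column remains in the updating state throughout the (i+1)-th polling period, the first indication information column provides the data query function during the (i+1)-th polling period. If the second indication information column is in the update-complete state at the end of the (i+1)-th polling period and the analytical processing engine synchronizes no data from the transactional processing engine in the (i+2)-th polling period, both indication information columns are in the update-complete state during the (i+2)-th polling period. Because the version of the second indication information column is newer than that of the first indication information column, the second indication information column provides the data query function during the (i+2)-th polling period. For incremental data synchronized in the (i+3)-th polling period, the analytical processing engine uses the first indication information column, whose version is older, to update the indications of validity. If the first indication information column remains in the updating state throughout the (i+3)-th polling period, the second indication information column provides the data query function during the (i+3)-th polling period. If the indication information column being updated in the current polling period finishes updating at some moment before the period ends, the analytical processing engine can switch, from that moment on, to using the most recently updated indication information column for data queries.
In a possible implementation, an indication information column in the updating state is marked with an error flag and cannot provide the data query function, and an indication information column in the update-complete state is marked with a ready flag and can provide the data query function. If one indication information column remains in the updating state, then even if the analytical processing engine synchronizes incremental data from the transactional processing engine in a new polling period, it does not use the other indication information column to update the indications of validity. In other words, in this case the analytical processing engine suspends validity updates based on the new incremental data. This ensures that at least one indication information column can provide the data query function at any given moment, which improves data query efficiency.
In this embodiment of this application, two indication information columns are set in the storage table to update, in a polling manner, the indications of the validity of the data rows in the data information column. The analytical processing engine therefore always has usable indication information to assist in processing data query requests, which improves data query efficiency. In addition, because an indication information column holds little data, it can be updated quickly and essentially in step with the data information column. By setting the polling period to a short duration, the analytical processing engine can use the latest data as far as possible when processing data query requests, which improves data timeliness. Furthermore, because the indication information columns share the same data information column, no data is stored redundantly, and the data storage cost is low.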
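The selection rules for the two indication information columns can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the class `IndicatorColumn` and the functions `column_for_query` and `column_for_update` are hypothetical names, and the `status` strings mirror the ready and error flags mentioned above.

```python
from dataclasses import dataclass


@dataclass
class IndicatorColumn:
    """Hypothetical bookkeeping for one indication information column."""
    name: str
    status: str = "ready"  # "ready": update complete; "error": update in progress
    version: int = 0       # newest data version this column reflects


def column_for_query(a, b):
    """Serve queries from a ready column; if both are ready, use the newer one."""
    ready = [c for c in (a, b) if c.status == "ready"]
    if not ready:
        raise RuntimeError("invariant violated: no indication column is queryable")
    return max(ready, key=lambda c: c.version)


def column_for_update(a, b):
    """Continue a column that is already updating; otherwise refresh the older one."""
    updating = [c for c in (a, b) if c.status == "error"]
    if updating:
        return updating[0]
    return min((a, b), key=lambda c: c.version)


first = IndicatorColumn("first", "ready", 5)
second = IndicatorColumn("second", "ready", 6)
```

With both columns ready, `column_for_query` returns the second column (version 6) and `column_for_update` returns the first (version 5). Once the first column is marked as updating, queries keep going to the second column, which matches the condition that at least one column is always queryable.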
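The following embodiments of this application describe the implementation of step 202 for three possible cases, in which the incremental data includes newly added data, modified data, or deleted data.
In the first possible case, the incremental data includes newly added data. The implementation of step 202 includes: The analytical processing engine adds the newly added data to the storage table by adding a data row, and adds a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the newly added data is located. The first indication indicates that a data row is valid.
In this implementation, the analytical processing engine only needs to add a data row to the data information column for the newly added data and add a corresponding indication information row carrying the first indication to the validity indication information column. The newly added data and its validity can therefore be updated quickly, and data real-time performance is high.
In the second possible case, the incremental data includes modified data. The implementation of step 202 includes: The analytical processing engine adds the modified data to the storage table by adding a data row, adds a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the modified data is located, and changes the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the data being modified is located. The first indication indicates that a data row is valid, and the second indication indicates that a data row is invalid.
In this implementation, the analytical processing engine only needs to add a data row to the data information column for the modified data, add a corresponding indication information row carrying the first indication to the validity indication information column, and change the indication in the indication information row that corresponds to the data row in which the data being modified is located. The modified data and the data being modified, together with their validity, can therefore be updated quickly, and data real-time performance is high.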
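The first two cases, insertion and modification, can be sketched in a few lines of Python. The column lists `ids`, `names`, and `valid`, the function names, and the values 1 and 0 for the first and second indications are assumptions chosen for this illustration.

```python
VALID, INVALID = 1, 0  # stand-ins for the first and second indications

# Hypothetical columns of the storage table, aligned by row index.
ids, names, valid = [], [], []


def insert_row(row_id, name):
    """Case 1: append a data row and mark it valid."""
    ids.append(row_id)
    names.append(name)
    valid.append(VALID)


def modify_row(row_id, new_name):
    """Case 2: invalidate the row being modified, then append the modified data."""
    for i, rid in enumerate(ids):
        if rid == row_id and valid[i] == VALID:
            valid[i] = INVALID  # the old row stays in place but is now invalid
    insert_row(row_id, new_name)


insert_row(2, "Mike")
modify_row(2, "Mike-2")
```

After these two calls the data columns hold both versions of ID=2, and only the validity column distinguishes them: the old row is marked 0 and the new row 1.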
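In the third possible case, the incremental data includes deleted data. A first implementation of step 202 includes: The analytical processing engine changes the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the deleted data is located. The second indication indicates that a data row is invalid. When the storage table includes a readability identifier information column, a second implementation of step 202 includes: The analytical processing engine adds the deleted data to the storage table by adding a data row, sets a first identifier in the identifier information row that corresponds, in the readability identifier information column, to the data row in which the newly added deleted data is located, and changes the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the original deleted data is located. The first indication indicates that a data row is valid, the second indication indicates that a data row is invalid, and the first identifier identifies that a data row cannot be read.
In the first implementation, the analytical processing engine only needs to change the indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the deleted data is located. The validity of the deleted data can therefore be updated quickly, and data real-time performance is high. In the second implementation, the analytical processing engine adds a data row to the data information column for the deleted data, and sets an identifier in the identifier information row that corresponds, in the readability identifier information column, to the data row in which the newly added deleted data is located, to indicate that the data row cannot be read. The validity of the deleted data can likewise be updated quickly, and data real-time performance is high. In the second implementation, the indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the newly added deleted data is located no longer has any effect, and may be left empty or set to any value.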
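The two deletion strategies can be compared side by side in a short Python sketch. All names here are hypothetical, the values 1, 0, and -1 follow the conventions used later in the worked example, and `None` is used for the indication of the appended deletion row, which the text says may be empty or arbitrary.

```python
VALID, INVALID = 1, 0
READABLE, UNREADABLE = 1, -1


def new_table(rows):
    """Build a hypothetical columnar table from (id, name) pairs."""
    t = {"id": [], "name": [], "readable": [], "valid": []}
    for row_id, name in rows:
        t["id"].append(row_id)
        t["name"].append(name)
        t["readable"].append(READABLE)
        t["valid"].append(VALID)
    return t


def delete_by_flag(t, row_id):
    """First implementation: only flip the validity indication of the deleted row."""
    for i, rid in enumerate(t["id"]):
        if rid == row_id and t["valid"][i] == VALID:
            t["valid"][i] = INVALID


def delete_by_appended_row(t, row_id):
    """Second implementation: also append an unreadable row recording the deletion."""
    name = None
    for i, rid in enumerate(t["id"]):
        if rid == row_id and t["valid"][i] == VALID:
            t["valid"][i] = INVALID
            name = t["name"][i]
    t["id"].append(row_id)
    t["name"].append(name)
    t["readable"].append(UNREADABLE)
    t["valid"].append(None)  # this indication has no effect for an unreadable row


def visible_rows(t):
    """Rows a query would return: readable and valid."""
    cols = zip(t["id"], t["name"], t["readable"], t["valid"])
    return [(rid, name) for rid, name, r, v in cols if r == READABLE and v == VALID]


a = new_table([(3, "Tony"), (4, "Jim")])
b = new_table([(3, "Tony"), (4, "Jim")])
delete_by_flag(a, 3)
delete_by_appended_row(b, 3)
```

Both tables expose the same rows to a query. The second keeps an explicit record of the deletion in the data columns, at the cost of one extra row.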
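In the first possible case, the second possible case, and the second implementation of the third possible case, if the storage table includes a version information column and the validity indication information column includes a first indication information column and a second indication information column, the analytical processing engine may update the validity indications for the data rows in the indication information columns as follows.
First, if both the first indication information column and the second indication information column are in the update-complete state, the one with the older version is selected for updating. If the first indication information column or the second indication information column is in the updating state, the one in the updating state is selected for updating.
Second, when updating an indication information column, the analytical processing engine first sets the first indication in all indication information rows added to that column since its last update, and then, based on the version information column, changes to the second indication the indications in the indication information rows that correspond to old-version data. The old-version data may be data that has been modified or data that has been deleted.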
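The two-step refresh of an indication information column can be sketched as follows. This Python example assumes, for illustration only, that "old-version data" means every row whose version is lower than the newest version recorded for the same ID. The function name `refresh_indicator` is hypothetical, and deletions recorded through appended unreadable rows are handled by the readability column and are not modeled here.

```python
def refresh_indicator(ids, versions, indicator):
    """Bring one indication column up to date with the data columns.

    Step 1: rows appended since the last refresh all receive the first
            indication (1, valid).
    Step 2: using the version column, rows superseded by a newer version
            of the same ID are switched to the second indication (0, invalid).
    """
    indicator.extend([1] * (len(ids) - len(indicator)))

    newest = {}
    for rid, ver in zip(ids, versions):
        newest[rid] = max(ver, newest.get(rid, ver))

    for i, (rid, ver) in enumerate(zip(ids, versions)):
        if ver < newest[rid]:
            indicator[i] = 0
    return indicator


# State after "ID=2, Mike" (version 2) was modified and re-appended as version 6.
ids = [1, 2, 3, 4, 5, 2]
versions = [1, 2, 3, 4, 5, 6]
indicator = [1, 1, 1, 1, 1]  # last refreshed when version 5 was the newest
refreshed = refresh_indicator(ids, versions, indicator)
```

Only the indication column is rewritten. The data and version columns are read but never modified.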
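Step 203: In response to receiving a data query request, the analytical processing engine outputs the valid data rows indicated by the validity indication information column in the storage table.
Optionally, the data query request includes a data identifier. If the storage table stores multiple versions of the data corresponding to the data identifier, the analytical processing engine outputs the valid version of the data indicated by the validity indication information column in the storage table. The valid version is generally the latest version.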
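Step 203 amounts to a filter over the validity indication column. The following Python sketch shows one way to express it. The function name `query`, the dictionary layout, and the use of 1 for a valid row are assumptions made for this example.

```python
def query(table, row_id=None):
    """Return valid rows, optionally restricted to one data identifier."""
    cols = zip(table["id"], table["name"], table["valid"])
    return [
        (rid, name)
        for rid, name, ok in cols
        if ok == 1 and (row_id is None or rid == row_id)
    ]


# Two versions of ID=2 are stored; only the newer one is marked valid.
table = {
    "id": [1, 2, 2],
    "name": ["Tom", "Mike", "Mike-2"],
    "valid": [1, 0, 1],
}
by_id = query(table, 2)
everything = query(table)
```

The query never compares versions at read time. It relies entirely on the indications written during step 202.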
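The following embodiments of this application illustrate a specific implementation process of the foregoing data processing method.
In the first stage, the existing data "ID=1, Name=Tom" is present on the transactional processing engine. The analytical processing engine performs full data synchronization on the existing data on the transactional processing engine, and the resulting storage table is shown in Table 4.
Table 4
Referring to Table 4, the "ID" and "Name" columns are the data information columns. In Table 4, both the first indication information column and the second indication information column are in the update-complete state. In the readability identifier information column, "-1" identifies that a data row cannot be read, and "1" identifies that a data row can be read. In the version information column, the version number is represented by a number, and a larger number indicates a newer version. In the indication information columns, "1" indicates that a data row is valid, and "0" indicates that a data row is invalid.
In the second stage, four new records are inserted into the transactional processing engine: "ID=2, Name=Mike", "ID=3, Name=Tony", "ID=4, Name=Jim", and "ID=5, Name=Ben". The analytical processing engine synchronizes the incremental data (newly added data) on the transactional processing engine, and the resulting storage table is shown in Table 5.
Table 5
In Table 5, the first indication information column is in the update-complete state. Because no new data update follows the record "ID=5, Name=Ben", the analytical processing engine has recognized that the latest data version corresponding to the first indication information column is 5 and considers the first indication information column to be in the update-complete state, that is, able to provide the data query function. There is therefore no need to update the second indication information column, and the indication information row in the second indication information column that corresponds to the data row with ID=5 is set to 0.
In the third stage, the record "ID=2, Name=Mike" on the transactional processing engine is modified to "ID=2, Name=Mike-2". Here, "Mike" is the data being modified, and "Mike-2" is the modified data. The analytical processing engine synchronizes the incremental data (modified data) on the transactional processing engine, and the resulting storage table is shown in Table 6.
Table 6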

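In the third stage, polling reaches the second indication information column, which is updated. In Table 6, both the first indication information column and the second indication information column are in the update-complete state. Of the two records with ID=2, the first indication information column indicates that the old-version record with version 2 is valid and the new-version record with version 6 is invalid, whereas the second indication information column indicates that the old-version record with version 2 is invalid and the new-version record with version 6 is valid. Because the version of the second indication information column is newer than that of the first indication information column, the second indication information column provides the data query function. The possibility that old-version data needs to be queried is not excluded, however. If old-version data is to be queried, the first indication information column provides the data query function.
In the fourth stage, the record "ID=3, Name=Tony" on the transactional processing engine is deleted. This record is the deleted data. The analytical processing engine synchronizes the incremental data (deleted data) on the transactional processing engine, and the resulting storage table is shown in Table 7 or Table 8.
Table 7
Table 8
In the fourth stage, polling reaches the first indication information column, which is updated. In Table 7, both the first indication information column and the second indication information column are in the update-complete state. The first indication information column indicates that the record with ID=3 is invalid, and the second indication information column indicates that the record with ID=3 is valid. Because the version of the first indication information column is newer than that of the second indication information column, the first indication information column provides the data query function. In Table 8, both indication information columns are likewise in the update-complete state. Of the two records with ID=3, the first indication information column indicates that the old-version record with version 3 is invalid, and the readability identifier information column indicates that the new-version record with ID=3 and version 7 cannot be read. Of the two records with ID=3, the second indication information column indicates that the old-version record with version 3 is valid. Because the version of the first indication information column is newer than that of the second indication information column, the first indication information column provides the data query function.
Afterwards, when the next version is created, the indication information row in the second indication information column that corresponds to the data row with ID=3 and version 3 is also set to 0. Table 7 is further updated to Table 9, and Table 8 is further updated to Table 10.
Table 9
Table 10

At this point, if the analytical processing engine receives a data query request, it can output the data shown in Table 11 based on Table 9 or Table 10.
Table 11
Based on Table 9, the data rows with version 2 and version 3 are invalid, so the analytical processing engine filters them out and reads and outputs the data rows with versions 1, 6, 4, and 5. Based on Table 10, the data rows with version 2 and version 3 are invalid and the data row with version 7 cannot be read, so the analytical processing engine filters out the data rows with version 2 and version 3 and likewise reads and outputs the data rows with versions 1, 6, 4, and 5. A full data query based on either Table 9 or Table 10 therefore yields the query result shown in Table 11.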
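The final query in this worked example can be replayed in Python. Because the tables themselves are not reproduced in this text, the row values below are reconstructed from the prose of the four stages and should be read as an assumption. The tuple layout and the function `full_query` are hypothetical, and the output is sorted by ID only to match the order in which the versions are listed above (1, 6, 4, 5).

```python
# Each row: (id, name, readable, version, valid)
# readable: 1 readable, -1 unreadable; valid: 1 valid, 0 invalid, None unused
table9 = [
    (1, "Tom", 1, 1, 1),
    (2, "Mike", 1, 2, 0),    # superseded by version 6
    (3, "Tony", 1, 3, 0),    # deleted in the fourth stage
    (4, "Jim", 1, 4, 1),
    (5, "Ben", 1, 5, 1),
    (2, "Mike-2", 1, 6, 1),
]
# Table 10 additionally holds the appended, unreadable deletion row.
table10 = table9 + [(3, "Tony", -1, 7, None)]


def full_query(table):
    """Keep rows that are both readable and valid, ordered by ID."""
    hits = [row for row in table if row[2] == 1 and row[4] == 1]
    return sorted(hits, key=lambda row: row[0])


versions9 = [row[3] for row in full_query(table9)]
versions10 = [row[3] for row in full_query(table10)]
```

Both variants return the same four rows, which is consistent with the statement that a full query on Table 9 or Table 10 produces the same result.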
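In the data processing method provided in this embodiment of this application, the analytical processing engine sets a validity indication information column in the storage table to indicate whether the data rows in the data information column are valid or invalid. When processing a data query request, the analytical processing engine can use the validity indication information column to filter out the invalid data rows and read the valid data rows in the data information column, which achieves fast data deduplication. The entire deduplication process is transparent to the service layer, and data query efficiency is improved. Because the analytical processing engine does not need to create and store a full data snapshot of the transactional processing engine, the amount of stored data is reduced and storage resources are saved. In addition, after synchronizing incremental data from the transactional processing engine in real time, the analytical processing engine can quickly update the latest state of the data through the validity indication information column, so that it uses the most recently updated data as far as possible when processing data query requests, which improves data timeliness.
It should be noted that the order of the steps of the data processing method provided in this embodiment of this application may be adjusted as appropriate, and steps may be added or removed as required. Any variation that a person skilled in the art can readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. For example, in addition to being used by the analytical processing engine for data deduplication, the validity indication information column provided in this embodiment of this application may also be used for archiving historical data during data backup, or for lightweight multi-version control within the analytical processing engine to provide lightweight transaction support. Details are not described here one by one.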
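The following describes, by way of example, the virtual apparatus of this application.
For example, FIG. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of this application. The data processing apparatus is applied to an analytical processing engine. As shown in FIG. 3, the data processing apparatus 300 includes an obtaining module 301 and an update module 302.
The obtaining module 301 is configured to obtain incremental data from a transactional processing engine. Here, the obtaining module 301 is specifically configured to perform step 201.
The update module 302 is configured to update a storage table based on the incremental data. The storage table uses a columnar storage format and includes a data information column and a validity indication information column. The incremental data is stored in the data information column in the form of data rows, and the validity indication information column indicates whether a data row in the data information column is valid or invalid. Here, the update module 302 is specifically configured to perform step 202.
Optionally, as shown in FIG. 4, the data processing apparatus 300 further includes a query module 303, configured to output, in response to receiving a data query request, the valid data rows indicated by the validity indication information column in the storage table. Here, the query module 303 is specifically configured to perform step 203.
Optionally, the incremental data includes newly added data, and the update module 302 is specifically configured to: add the newly added data to the storage table by adding a data row, and add a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the newly added data is located, where the first indication indicates that a data row is valid.
Optionally, the incremental data includes modified data, and the update module 302 is specifically configured to: add the modified data to the storage table by adding a data row, add a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the modified data is located, and change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the data being modified is located, where the first indication indicates that a data row is valid and the second indication indicates that a data row is invalid.
Optionally, the incremental data includes deleted data, and the update module 302 is specifically configured to: change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the deleted data is located, where the second indication indicates that a data row is invalid. Alternatively, the storage table further includes a readability identifier information column, which identifies whether a data row in the data information column can or cannot be read, the incremental data includes deleted data, and the update module 302 is specifically configured to: add the deleted data to the storage table by adding a data row, set a first identifier in the identifier information row that corresponds, in the readability identifier information column, to the data row in which the newly added deleted data is located, and change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the original deleted data is located, where the first indication indicates that a data row is valid, the second indication indicates that a data row is invalid, and the first identifier identifies that a data row cannot be read.
Optionally, the validity indication information column includes a first indication information column and a second indication information column, which are used to update, in a polling manner, the indications of the validity of the data rows in the data information column and which meet the following conditions: at any given moment, at least one of the first indication information column and the second indication information column supports the data query function; and when both support the data query function, the more recently updated of the two is used by the analytical processing engine for data queries.
Optionally, the storage table further includes a version information column, which indicates the chronological order in which the data rows in the data information column were added to the data information column.
For the apparatus in the foregoing embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not described again here.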
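The following describes, by way of example, the hardware apparatus in the embodiments of this application.
An embodiment of this application provides a data processing apparatus, which may be an analytical processing engine. FIG. 5 shows, by way of example, a possible architecture of the data processing apparatus. As shown in FIG. 5, the data processing apparatus 500 may include a processor 501, a memory 502, a communication interface 503, and a bus 504. The data processing apparatus may have one or more processors 501, and FIG. 5 shows only one of them. Optionally, the processor 501 may be a central processing unit (CPU). If the data processing apparatus has multiple processors 501, they may be of different types or of the same type. Optionally, the multiple processors of the data processing apparatus may also be integrated into a multi-core processor.
The memory 502 is configured to store computer instructions and data, and may store the computer instructions and data required to implement the data processing method provided in this application. The memory 502 may be any one or any combination of the following storage media: a non-volatile memory (for example, a read-only memory (ROM), a solid-state disk (SSD), or a hard disk drive (HDD)), an optical disc, or a volatile memory.
The communication interface 503 may be any one or any combination of the following devices with a network access function: a network interface (for example, an Ethernet interface), a wireless network interface card, and the like.
The communication interface 503 is used by the data processing apparatus 500 to exchange data with other devices or components.
FIG. 5 also shows, by way of example, the bus 504. The bus 504 can connect the processor 501 to the memory 502 and the communication interface 503. Through the bus 504, the processor 501 can access the memory 502 and can also use the communication interface 503 to exchange data with other devices or components.
In this application, the data processing apparatus 500 executes the computer instructions in the memory 502 to implement the data processing method provided in this application. For example, it obtains incremental data from a transactional processing engine and updates a storage table based on the incremental data, where the storage table uses a columnar storage format and includes a data information column and a validity indication information column, the incremental data is stored in the data information column in the form of data rows, and the validity indication information column indicates whether a data row in the data information column is valid or invalid. For the implementation process in which the data processing apparatus 500 performs the steps of the data processing method provided in this application by executing the computer instructions in the memory 502, refer to the corresponding descriptions in the foregoing method embodiments.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium is a non-volatile computer-readable storage medium and includes program instructions. When the program instructions are run on a computer device, the computer device is enabled to perform the data processing method provided in the embodiments of this application.
An embodiment of this application further provides a computer program product including instructions. When the computer program product runs on a computer, the computer is enabled to perform the data processing method provided in the embodiments of this application.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the embodiments of this application, the terms "first", "second", and "third" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
The term "and/or" in this application describes only an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects.
It should be noted that the information (including but not limited to user equipment information and personal information of users), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The foregoing descriptions are merely optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the concept and principles of this application shall fall within the protection scope of this application.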

Claims (19)

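  1. A data processing method, wherein the method comprises:
    obtaining, by an analytical processing engine, incremental data from a transactional processing engine; and
    updating, by the analytical processing engine, a storage table based on the incremental data, wherein the storage table uses a columnar storage format, the storage table comprises a data information column and a validity indication information column, the incremental data is stored in the data information column in the form of data rows, and the validity indication information column indicates whether a data row in the data information column is valid or invalid.
  2. The method according to claim 1, wherein the method further comprises:
    in response to receiving a data query request, outputting, by the analytical processing engine, the valid data rows indicated by the validity indication information column in the storage table.
  3. The method according to claim 1 or 2, wherein the incremental data comprises newly added data, and the updating, by the analytical processing engine, of the storage table based on the incremental data comprises:
    adding, by the analytical processing engine, the newly added data to the storage table by adding a data row, and adding a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the newly added data is located, wherein the first indication indicates that a data row is valid.
  4. The method according to any one of claims 1 to 3, wherein the incremental data comprises modified data, and the updating, by the analytical processing engine, of the storage table based on the incremental data comprises:
    adding, by the analytical processing engine, the modified data to the storage table by adding a data row, adding a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the modified data is located, and changing the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the data being modified is located, wherein the first indication indicates that a data row is valid, and the second indication indicates that a data row is invalid.
  5. The method according to any one of claims 1 to 4, wherein the incremental data comprises deleted data, and the updating, by the analytical processing engine, of the storage table based on the incremental data comprises:
    changing, by the analytical processing engine, the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the deleted data is located, wherein the second indication indicates that a data row is invalid.
  6. The method according to any one of claims 1 to 4, wherein the storage table further comprises a readability identifier information column, the readability identifier information column identifies whether a data row in the data information column can or cannot be read, the incremental data comprises deleted data, and the updating, by the analytical processing engine, of the storage table based on the incremental data comprises:
    adding, by the analytical processing engine, the deleted data to the storage table by adding a data row, setting a first identifier in the identifier information row that corresponds, in the readability identifier information column, to the data row in which the newly added deleted data is located, and changing the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the original deleted data is located, wherein the first indication indicates that a data row is valid, the second indication indicates that a data row is invalid, and the first identifier identifies that a data row cannot be read.
  7. The method according to any one of claims 1 to 6, wherein the validity indication information column comprises a first indication information column and a second indication information column, the first indication information column and the second indication information column are used to update, in a polling manner, the indications of the validity of the data rows in the data information column, and the first indication information column and the second indication information column meet the following conditions:
    at any given moment, at least one of the first indication information column and the second indication information column supports a data query function; and
    when both the first indication information column and the second indication information column support the data query function, the more recently updated of the first indication information column and the second indication information column is used by the analytical processing engine for data queries.
  8. The method according to any one of claims 1 to 7, wherein the storage table further comprises a version information column, and the version information column indicates the chronological order in which the data rows in the data information column were added to the data information column.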
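  9. A data processing apparatus, applied to an analytical processing engine, wherein the apparatus comprises:
    an obtaining module, configured to obtain incremental data from a transactional processing engine; and
    an update module, configured to update a storage table based on the incremental data, wherein the storage table uses a columnar storage format, the storage table comprises a data information column and a validity indication information column, the incremental data is stored in the data information column in the form of data rows, and the validity indication information column indicates whether a data row in the data information column is valid or invalid.
  10. The apparatus according to claim 9, wherein the apparatus further comprises:
    a query module, configured to output, in response to receiving a data query request, the valid data rows indicated by the validity indication information column in the storage table.
  11. The apparatus according to claim 9 or 10, wherein the incremental data comprises newly added data, and the update module is configured to:
    add the newly added data to the storage table by adding a data row, and add a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the newly added data is located, wherein the first indication indicates that a data row is valid.
  12. The apparatus according to any one of claims 9 to 11, wherein the incremental data comprises modified data, and the update module is configured to:
    add the modified data to the storage table by adding a data row, add a first indication to the indication information row that corresponds, in the validity indication information column, to the data row in which the modified data is located, and change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the data being modified is located, wherein the first indication indicates that a data row is valid, and the second indication indicates that a data row is invalid.
  13. The apparatus according to any one of claims 9 to 12, wherein the incremental data comprises deleted data, and the update module is configured to:
    change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the deleted data is located, wherein the second indication indicates that a data row is invalid.
  14. The apparatus according to any one of claims 9 to 12, wherein the storage table further comprises a readability identifier information column, the readability identifier information column identifies whether a data row in the data information column can or cannot be read, the incremental data comprises deleted data, and the update module is configured to:
    add the deleted data to the storage table by adding a data row, set a first identifier in the identifier information row that corresponds, in the readability identifier information column, to the data row in which the newly added deleted data is located, and change the first indication to a second indication in the indication information row that corresponds, in the validity indication information column, to the data row in which the original deleted data is located, wherein the first indication indicates that a data row is valid, the second indication indicates that a data row is invalid, and the first identifier identifies that a data row cannot be read.
  15. The apparatus according to any one of claims 9 to 14, wherein the validity indication information column comprises a first indication information column and a second indication information column, the first indication information column and the second indication information column are used to update, in a polling manner, the indications of the validity of the data rows in the data information column, and the first indication information column and the second indication information column meet the following conditions:
    at any given moment, at least one of the first indication information column and the second indication information column supports a data query function; and
    when both the first indication information column and the second indication information column support the data query function, the more recently updated of the first indication information column and the second indication information column is used by the analytical processing engine for data queries.
  16. The apparatus according to any one of claims 9 to 15, wherein the storage table further comprises a version information column, and the version information column indicates the chronological order in which the data rows in the data information column were added to the data information column.
  17. A data processing apparatus, comprising a memory and a processor, wherein the memory stores program instructions, and the processor runs the program instructions to perform the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, comprising program instructions, wherein when the program instructions are run on a computer device, the computer device is enabled to perform the method according to any one of claims 1 to 8.
  19. A computer program product, wherein when the computer program product runs on a computer, the computer is enabled to perform the method according to any one of claims 1 to 8.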
PCT/CN2023/103426 2022-10-21 2023-06-28 Data processing method and apparatus WO2024082693A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211298222.7A CN117951141A (zh) 2022-10-21 2022-10-21 Data processing method and apparatus
CN202211298222.7 2022-10-21

Publications (1)

Publication Number Publication Date
WO2024082693A1 true WO2024082693A1 (zh) 2024-04-25

Family

ID=90736823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103426 WO2024082693A1 (zh) 2022-10-21 2023-06-28 Data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN117951141A (zh)
WO (1) WO2024082693A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
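CN110019251A (zh) * 2019-03-22 2019-07-16 Shenzhen Tencent Computer Systems Co., Ltd. Data processing system, method, and device
CN113010608A (zh) * 2021-04-07 2021-06-22 亿企赢网络科技有限公司 Real-time data synchronization method, apparatus, and computer-readable storage medium
CN113874852A (zh) * 2019-05-23 2021-12-31 International Business Machines Corporation Indexing for evolving large-scale datasets in multi-master hybrid transactional and analytical processing systems
CN115114344A (zh) * 2021-11-05 2022-09-27 Tencent Technology (Shenzhen) Co., Ltd. Transaction processing method, apparatus, computing device, and storage medium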

Also Published As

Publication number Publication date
CN117951141A (zh) 2024-04-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878690

Country of ref document: EP

Kind code of ref document: A1