WO2023082992A1 - Data processing method and *** - Google Patents

Data processing method and ***

Info

Publication number
WO2023082992A1
WO2023082992A1 (PCT/CN2022/127511)
Authority
WO
WIPO (PCT)
Prior art keywords
data processing
request
processing
target
requests
Prior art date
Application number
PCT/CN2022/127511
Other languages
English (en)
French (fr)
Inventor
刘显
郑方
罗从难
郭援非
朱澄
朱潇威
Original Assignee
阿里巴巴(中国)有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司
Publication of WO2023082992A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • the present application relates to the field of computer technology, in particular to a data processing method.
  • the present application also relates to a data processing system, a computing device, and a computer-readable storage medium.
  • the database adopts a multi-master MPP (Massive Parallel Processing) architecture.
  • the system consists of two groups of computer nodes: master nodes and data nodes.
  • the database relies on a component called the Global Transaction Manager (GTM) to support snapshot isolation.
  • although the GTM component can be implemented with multiple processes or threads to improve its parallelism, it is centralized in nature, and the number of connections to the GTM component grows significantly as the number of concurrent transactions increases; this not only puts great pressure on the operation of the GTM component, but also makes it a serious bottleneck of the entire distributed database system.
  • the present application provides a data processing method, a data processing system, a computing device, and a computer-readable storage medium, so as to solve the technical defects existing in the prior art.
  • a data processing method applied to a data processing node of a distributed data processing system, including: receiving multiple data processing requests sent by clients; determining a target processing quantity based on the number of data processing requests, and performing flow-limiting processing on the requests according to the target processing quantity to obtain target data processing requests; forwarding the target data processing requests to the global transaction manager; receiving from the global transaction manager the processing results of the target data processing requests; and returning each processing result to the client corresponding to the data processing request.
  • a data processing system includes a data processing node, and the data processing node includes: a request receiving module configured to receive multiple data processing requests sent by clients, determine a target processing quantity based on the number of data processing requests, and perform flow-limiting processing on the requests according to the target processing quantity to obtain target data processing requests; and a proxy module configured to forward the target data processing requests to the global transaction manager, receive the processing results of the target data processing requests from the global transaction manager, and return each processing result to the client corresponding to the data processing request.
  • a computing device includes a memory and a processor, where the memory stores computer-executable instructions and the processor executes the computer-executable instructions to implement the steps of the above data processing method.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, the steps of the above-mentioned data processing method are implemented.
  • Fig. 1 is a multi-master distributed database architecture of a data processing method provided by an embodiment of the present application
  • FIG. 2 is an overall architecture diagram of a data processing system provided by an embodiment of the present application
  • Fig. 3 is a flow chart of a data processing method provided by an embodiment of the present application applied to a data processing node of a distributed data processing system;
  • FIG. 4 is a schematic structural diagram of a shared memory area of a data processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of various state transitions of processing slots in a shared memory area of a data processing method provided by an embodiment of the present application;
  • Fig. 6 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Fig. 7 is a structural block diagram of a computing device provided by an embodiment of the present application.
  • although the terms first, second, etc. may be used to describe various information in one or more embodiments of the present application, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, first may also be referred to as second, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
  • Transaction: in a database management system, a transaction is a single logical unit of work, sometimes made up of multiple operations. Any logical computation done in a consistent mode in a database is called a transaction. For example, a transfer from one bank account to another is a complete transaction: it requires subtracting the amount to be transferred from one account and adding the same amount to the other account.
  • Database transaction: by definition, a database transaction must be atomic (it must complete in its entirety or have no effect), consistent (it must conform to existing constraints in the database), isolated (it must not affect other transactions), and durable (it must be written to persistent storage). Database practitioners often use the acronym ACID to refer to these properties of database transactions.
  • Distributed database: a database that stores data in different physical locations. The data may be stored on multiple computers located in the same physical location (such as a data center), or distributed across an interconnected network of computers.
  • Multi-master architecture: a popular way to build a distributed database system.
  • a distributed database system consists of two groups of computer nodes: master nodes and data nodes.
  • Each master node maintains an up-to-date copy of system catalogs and metadata (for example, table and index definitions).
  • the database's data is stored in multiple data nodes, subject to user-specified partitioning and/or replication strategies.
  • Database session: represents a connection between an application (or client) and a database that stores its persistent objects.
  • the connection is usually established through the TCP network protocol.
  • the client application can query and manipulate the data in the database by sending SQL statements through the connection with the database.
  • the database can use processes or threads to receive and service all SQL statements from the session.
  • when the session ends, the database system deallocates any resources (such as processing processes or threads) associated with the session.
  • Multi-version concurrency control (MCC or MVCC): a concurrency control mechanism used by database systems to manage concurrent transactions.
  • Snapshot isolation: the guarantee that all reads made within a transaction see a consistent snapshot of the database (in practice, the transaction reads the last committed values that existed when it started), and that a transaction's updates are committed only if they do not conflict with updates made by other transactions since the snapshot was taken.
  • Proxy: in computer networking, a proxy server is a server application or device that acts as an intermediary for client requests seeking resources from the servers that provide them. The proxy server thus operates on behalf of the client when requesting services, potentially masking the true origin of the request to the resource server. Rather than connecting directly to a server that can fulfill a requested resource (such as a file or web page), the client directs the request to the proxy server, which evaluates the request and performs the required network transaction. This is a way to simplify or control the complexity of requests, or to provide additional benefits such as load balancing, privacy, or security.
  • the database adopts a multi-master MPP (Massive Parallel Processing) architecture.
  • the system consists of two groups of computer nodes: master nodes and data nodes.
  • Each master node maintains an up-to-date copy of system catalogs and metadata (for example, table and index definitions).
  • the database's data is stored in multiple data nodes, subject to user-specified partitioning and/or replication strategies.
  • a client connects to one of the master nodes (e.g., via the TCP network protocol) to establish a database session, and can then submit SQL statements over that connection.
  • the corresponding master node will parse the SQL statement, generate an optimized query plan, and dispatch the query plan to the data nodes for execution.
  • Each data node executes the query plan sent by the master node with the data stored locally, exchanges intermediate data with each other when necessary, and finally sends the query result back to the master node.
  • the master node merges and assembles the final query results and sends them back to the client.
  • FIG. 1 shows a multi-master distributed database architecture 100 of a data processing method provided in some embodiments of the present application.
  • the distributed database architecture 100 in FIG. 1 is a distributed database with two master nodes, master node 1 and master node 2, and each master node corresponds to two clients: the clients corresponding to master node 1 are client 1 and client 2, and the clients corresponding to master node 2 are client 3 and client 4.
  • the data nodes corresponding to the two master nodes are data node 1, data node 2, data node 3, and data node 4, and the multi-master distributed database architecture also has a global transaction manager (GTM).
  • each set of rectangles in Figure 1 represents a process or thread for running transactions in a database session.
  • the database supports database transactions that conform to the ACID properties of the SQL standard.
  • any SQL statement submitted to the database is executed within a database transaction.
  • Such transactions are either explicitly specified by the client via BEGIN/COMMIT/ABORT statements, or created implicitly and internally by the database system for individual SQL statements when the client does not explicitly specify a transaction scope.
  • a transaction in a database typically involves the stages of creating and scheduling a query plan for the transaction's SQL statements and of multiple data nodes executing the query. If the transaction contains DDL (data definition language) statements that modify system catalogs and metadata (for example, CREATE TABLE), the transaction will also span all other master nodes in the system. Since transactions involve multiple distributed computer nodes, a distributed transaction protocol is used to ensure that transactions satisfy the ACID properties. For example, the database uses a standard two-phase commit protocol: the master node acts as the coordinator of the distributed transaction, and the involved data nodes (sub-nodes) and other master nodes act as participants.
  • the coordinator starts the first phase, called "prepare" (or "voting"): it asks each participant to vote on whether the transaction should be committed, and each participant replies with its vote (commit or abort). If the coordinator receives commit votes from all participants, it initiates the second phase, called "commit": it asks all participants to commit the transaction locally, and once all participants have confirmed that they are done, the coordinator commits the distributed transaction. If the coordinator receives any abort vote from a participant during the prepare phase, it asks all participants to abort the transaction locally and aborts the distributed transaction.
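  • the two-phase commit flow described above can be sketched as follows; this is a minimal illustration only, not the database's actual implementation, and all type and function names (participant_t, participant_prepare, etc.) are illustrative assumptions.

```c
/* Minimal sketch of the two-phase commit flow described above.
 * All types and function names are illustrative assumptions. */
typedef enum { VOTE_COMMIT, VOTE_ABORT } vote_t;

typedef struct participant participant_t;      /* opaque handle for a data node / master node */
vote_t participant_prepare(participant_t *p);  /* phase 1: ask the participant to vote        */
void   participant_commit(participant_t *p);   /* phase 2: commit locally                     */
void   participant_abort(participant_t *p);    /* phase 2: abort locally                      */

/* Coordinator (master node) side: returns 1 if the distributed
 * transaction committed, 0 if it aborted. */
int coordinator_run_2pc(participant_t **parts, int n)
{
    /* Phase 1: "prepare" (voting). */
    for (int i = 0; i < n; i++) {
        if (participant_prepare(parts[i]) == VOTE_ABORT) {
            /* Any abort vote aborts the whole distributed transaction. */
            for (int j = 0; j < n; j++)
                participant_abort(parts[j]);
            return 0;
        }
    }
    /* Phase 2: "commit": all participants voted to commit. */
    for (int i = 0; i < n; i++)
        participant_commit(parts[i]);
    return 1;
}
```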
  • the database relies on a component called the Global Transaction Manager (GTM) to support snapshot isolation, a popular variant of the multiversion concurrency control mechanism used to manage concurrent transactions.
  • GTM is a centralized service in a distributed database responsible for assigning unique identifiers to transactions, tracking the status of distributed transactions (status is in progress, committed or aborted), and generating distributed snapshots.
  • when a master node starts a distributed transaction, it sends a request to GTM to register the new transaction, and GTM assigns the transaction a unique identifier (called the Global Transaction ID, or GXID for short).
  • the global transaction ID uniquely identifies a transaction. Whenever a transaction inserts or modifies a row in a database table, a version of the row's data is stored in the table with the data payload and transaction ID.
  • the transaction ID is implemented internally as a hidden column that is transparent to the database user.
  • after a transaction is registered with GTM, GTM sets its status to In Progress. Later, when the corresponding master node commits or aborts the transaction, the master node notifies GTM of the change and GTM sets the transaction state accordingly.
  • when the master node dispatches a query in a transaction to a data node, it sends a distributed snapshot request to GTM.
  • the content of the distributed snapshot indicates which transactions (according to their global transaction IDs) were in progress at the time.
  • GTM checks its transaction state trace and returns the IDs of all currently active transactions.
  • This distributed snapshot is sent to the data nodes along with the query plan.
  • when a data node executes a query and needs to access a row in a table, it uses the distributed snapshot to determine whether a given version of the row is visible to the current transaction. For example, if the transaction ID recorded in that version is listed in the set of in-progress transactions in the distributed snapshot, the current transaction should not read this version; otherwise it would read data from an uncommitted transaction and violate the isolation property.
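  • as a minimal illustration of the visibility check described above (the snapshot layout and all names are assumptions, not the actual data structures of the application), a data node could apply a distributed snapshot to a row version as follows:

```c
/* Minimal sketch of snapshot-based visibility: a row version is invisible
 * if its writing transaction started after the snapshot or was still in
 * progress when the snapshot was taken. Names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gxid_t;                 /* global transaction ID (GXID) */

typedef struct {
    gxid_t        xmax;                  /* first GXID not yet assigned at snapshot time */
    const gxid_t *in_progress;           /* GXIDs that were in progress at snapshot time */
    int           n_in_progress;
} dist_snapshot_t;

static bool row_version_visible(const dist_snapshot_t *snap, gxid_t writer_xid)
{
    if (writer_xid >= snap->xmax)        /* writer started after the snapshot was taken */
        return false;
    for (int i = 0; i < snap->n_in_progress; i++)
        if (snap->in_progress[i] == writer_xid)
            return false;                /* writer was still uncommitted at snapshot time */
    return true;                         /* writer committed before the snapshot */
}
```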
  • although the GTM component can be implemented with multiple processes or threads to improve its parallelism, it is centralized in nature and becomes a serious system bottleneck as the number of concurrent transactions increases. There are two main obstacles to scaling GTM for high concurrency:
  • first, if each master node can accept at most N client sessions and there are M master nodes, then the system can have at most N x M concurrent sessions at any time. If each session establishes its own TCP connection to GTM, GTM will need to handle up to N x M connections. In a large deployment with high concurrency, this may exceed the TCP connection limit of a single machine. Since the whole benefit of having multiple master nodes is to provide clients with scalable connection points beyond the capacity of a single machine, connecting each session directly to a centralized GTM would defeat the fundamental purpose of a multi-master architecture.
  • second, even if GTM could serve this many connections, one per session, it would lose many opportunities for efficiency. For example, suppose there are K concurrent sessions on the same master node and they are all executing read-only transactions (a read-only transaction is one that does not modify or insert any data). If each session establishes its own connection with GTM and sends a distributed snapshot request over its own connection, GTM will receive K such requests, calculate K distributed snapshots, and send back K results. This is redundant and unnecessary: GTM could use the first computed distributed snapshot as the response to all K concurrent snapshot requests from these read-only transactions.
  • if the master node can combine the K concurrent requests into one request, have GTM calculate the distributed snapshot only once, and then distribute the returned result back to the K transactions, the number of network messages and the workload performed by GTM are reduced by a factor of K. Even for different concurrent requests from the same master node (for example, one session requesting a global transaction ID while another session requests a distributed snapshot), sending these requests in batches is usually more efficient than sending them individually over separate connections.
  • using connectionless network protocols such as UDP can overcome the connection limitation, but at the cost of increased implementation complexity to ensure reliable communication with GTM. Moreover, using UDP does not help with the second problem at all.
  • Postgres-XL is a multi-master distributed database that also uses a centralized global transaction manager (GTM).
  • GTM in Postgres-XL is implemented as a multi-threaded independent process responsible for transaction ID allocation, transaction state tracking and distributed snapshot calculation.
  • User database sessions on the master node can connect directly to GTM to request transaction IDs, notify of state changes, or request distributed snapshots.
  • the database can deploy multiple GTM agent processes and have the user's database session connect to one of the GTM agents.
  • the GTM proxy forwards the request to the GTM server and sends the response back to the user session.
  • Postgres-XL uses proxy modules to improve the scalability of the centralized GTM, but it has four obvious defects. First, the GTM proxy in Postgres-XL only supports TCP connections for communication with user database sessions on the master node. Even if the GTM proxy and the user's database session are running on the same master node, they communicate with each other via TCP; this is less efficient than shared-memory based communication mechanisms because of the extra memory copies and unnecessary network stack overhead. Second, when there are multiple GTM proxies, Postgres-XL does not specify how to distribute the user's database sessions among these GTM proxies in a load-balanced manner.
  • third, when the GTM proxy in Postgres-XL receives multiple concurrent requests at the same time, it packages these requests into one message, sends it to the GTM server, and unpacks all the responses sent back by the GTM server. However, the GTM proxy in Postgres-XL does not detect or eliminate redundant concurrent requests (such as multiple distributed snapshot requests from concurrent read-only transactions). Fourth, Postgres-XL does not allow individual database sessions to choose whether to connect to GTM directly or through the GTM proxy; this setting is system-wide and cannot be changed dynamically without restarting the database system. This is inflexible and cannot support situations where multiple users need to use different connection modes to GTM at the same time.
  • a database system might restrict ordinary users' database sessions to use the GTM proxy, but allow high-priority users or system administrators to use GTM dedicated connections for emergency or maintenance tasks.
  • the data processing method provided herein allows a single user session to dynamically choose whether to connect to GTM directly or use a GTM proxy. Compared with Postgres-XL, this method is more flexible and can better serve mixed usage scenarios.
  • the data processing method provided by the embodiments of the present application proposes a proxy-based approach to solve the scalability challenge of the centralized GTM in a multi-master distributed database. It should be noted that the method proposed in the present application is not limited to the specific distributed transaction protocol (such as two-phase commit) or concurrency control mechanism (such as snapshot isolation) used by the database, and can be widely applied to any multi-master distributed database that adopts centralized transaction management.
  • a data processing method is provided, and the present application also relates to a data processing system, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
  • FIG. 2 shows a schematic diagram of an overall architecture 200 of a data processing system provided according to an embodiment of the present application.
  • the data processing method proposes to use a proxy module to improve the durability and scalability of a centralized global transaction manager (GTM) in a multi-master distributed database.
  • the data processing system includes a data processing node, where the data processing node includes: a request receiving module configured to receive multiple data processing requests sent by clients, determine a target processing quantity based on the number of data processing requests, and perform flow-limiting processing on the requests according to the target processing quantity to obtain target data processing requests; and a proxy module configured to forward the target data processing requests to the global transaction manager, receive the processing results of the target data processing requests from the global transaction manager, and return each processing result to the client corresponding to the data processing request.
  • the overall architecture 200 of the distributed data processing system in FIG. 2 is a distributed database with two master database nodes, respectively master node 1 (master1) and master node 2 (master2), and each master node corresponds to two clients.
  • the clients corresponding to master node 1 are client 1 and client 2 respectively
  • the clients corresponding to master node 2 are client 3 and client 4 respectively
  • the data nodes corresponding to the two master nodes are data node 1, data node 2, data node 3, and data node 4, and the multi-master distributed database architecture also has a global transaction manager (GTM).
  • master node 1 and master node 2 are each configured with a proxy module, and the module in the master node that receives the data processing requests sent by the clients communicates with the proxy module through the shared memory area.
  • each master node in the distributed database is equipped with a set of agent module processes or threads.
  • Each agent module serves one or more user database sessions on the primary database node where it is deployed.
  • Each user database session can dynamically choose to connect to GTM directly or indirectly through a proxy module on the same master database node.
  • when a database session chooses to use a proxy and needs to send a request to GTM (for example, to assign a global transaction ID or to obtain a distributed snapshot), it hands the request to the proxy module instead of connecting to GTM directly.
  • after receiving at least one request from a user database session, the proxy module establishes a connection with GTM if one has not already been established, and sends the request issued by the user database session.
  • GTM processes each request and sends a response back to the proxy module.
  • the proxy module determines which database session it belongs to and sends the response back to the database session.
  • Agent modules and user database sessions communicate through an efficient shared memory mechanism. For example, when there are multiple proxy modules, our method distributes user sessions to these GTM's proxy modules in a load-balanced manner. The proxy module automatically detects and eliminates redundant requests to GTM by concurrent transactions to reduce the network traffic to GTM and the workload performed by GTM. Alternatively, this method allows a single user session to dynamically choose whether to directly connect to the GTM or use the GTM proxy module, which can more flexibly serve such mixed usage scenarios.
  • in the multi-master distributed database managed by the centralized global transaction manager, the data processing system uses the proxy module configured in the master database node to perform flow-limiting processing on the received data requests based on an efficient shared memory mechanism; when multiple proxy modules are configured, data processing requests are distributed to the proxy modules in a load-balanced manner, and the proxy modules then help GTM by eliminating redundant requests, reducing the network traffic to the global transaction manager and the workload performed by GTM. In some embodiments, individual user sessions are also allowed to dynamically choose whether to connect directly to GTM or use the GTM proxy module, which can serve such mixed usage scenarios more flexibly.
  • FIG. 3 shows a schematic diagram of a process 300 in which a data processing method provided by an embodiment of the present application is applied to a data processing node of a distributed data processing system, specifically including the following steps:
  • the data processing method provided in this embodiment is applicable to any multi-master distributed database that adopts centralized transaction management, and there is no excessive limitation here.
  • Step 302 Receive multiple data processing requests sent by the client, determine the target processing quantity based on the number of multiple data processing requests, and perform flow-limiting processing on the multiple data processing requests according to the target processing quantity to obtain the target data processing request.
  • the data processing node of the distributed data processing system can be understood as a master node of the distributed database, that is, a data writing node that can read and write data. In the prior art, after receiving multiple data processing requests sent by multiple clients, the master node directly forwards the multiple data processing requests to the global transaction manager for centralized transaction management and then executes the subsequent data transactions.
  • because all data processing requests connect directly to the global transaction manager, the global transaction manager may end up with too many connections and the system may crash; therefore, the multiple data processing requests are flow-limited in the master node, so as to reduce the number of data processing requests handled by the global transaction manager.
  • after the master node of the distributed database receives multiple data processing requests sent by clients, it determines, according to the number of data processing requests, the target processing quantity that can be processed at the same time, and performs flow-limiting processing on the data processing requests using the target processing quantity, thereby screening out the target data processing requests. For example, if the master node of the distributed database receives 1000 data processing requests and determines, based on those 1000 requests, that it can only process 100 data processing requests at the same time, the 1000 requests are flow-limited and only 100 of them, taken in request order, are obtained as the target data processing requests. A minimal sketch of this step follows.
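  • the following is a minimal sketch of the flow-limiting step under the assumption that the target processing quantity is bounded by the number of currently free processing slots; the function and parameter names are illustrative, not part of the application.

```c
/* Minimal sketch of flow-limiting: admit at most `free_slots` of the pending
 * requests, in arrival order; the rest wait for a later round. Returns the
 * number of requests admitted (the target processing quantity). */
static int admit_requests(int n_pending, int free_slots,
                          const int *pending_ids, int *admitted_ids)
{
    int target = (n_pending < free_slots) ? n_pending : free_slots;
    for (int i = 0; i < target; i++)
        admitted_ids[i] = pending_ids[i];   /* e.g. 1000 pending, 100 slots -> 100 admitted */
    return target;
}
```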
  • before the master node of the distributed database receives multiple data processing requests from clients, the database system allows each individual user session to dynamically specify whether to use a proxy module or a dedicated connection to GTM to perform the transactions in that session. For example, before receiving the multiple data processing requests sent by a client, the method further includes: the request receiving module judges, based on preset project requirements, whether to start the proxy module in the data processing node, and if so, sends a data processing instruction to the client, where the data processing instruction instructs that requests sent by the client be forwarded to the proxy module for processing.
  • a proxy module can be configured in the master node of the distributed database to process multiple data processing requests on its behalf, so as to reduce the network traffic to the global transaction manager and the workload it performs.
  • the request receiving module in the master node of the distributed database can also determine, according to different project requirements, whether to start the proxy module configured in the master node. If it is determined from the project requirements that a large number of data processing requests need to be processed, it may agree to start the proxy module, and the master node may send a data processing instruction to the client, where the data processing instruction is an instruction to forward the requests sent by the client to the proxy module for processing.
  • when a database session dynamically changes its connection mode, it follows one of two procedures (a minimal sketch follows below): if it currently has a direct connection to GTM and has now chosen to use a proxy, it closes the connection to GTM and obtains a shared memory communication slot to interact with the proxy; if it is currently using a shared memory communication slot and now chooses to connect directly to GTM, it gives up the shared memory communication slot and establishes a new direct connection to GTM. Database systems may set quotas or limits on which users or sessions can use direct connections to GTM.
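  • a minimal sketch of the two switch procedures, assuming a per-session structure that tracks a direct GTM connection and a communication slot index; all names and helpers are illustrative assumptions.

```c
/* Minimal sketch of a session switching between a direct GTM connection
 * and the proxy path, following the two cases above. All names are
 * illustrative assumptions. */
typedef struct {
    int gtm_fd;     /* direct connection to GTM, -1 if none               */
    int slot_idx;   /* shared-memory communication slot index, -1 if none */
} session_conn_t;

int  gtm_connect(void);                 /* open a direct connection to GTM        */
void gtm_disconnect(int fd);
int  acquire_comm_slot(void);           /* grab a FREE communication slot, or -1  */
void release_comm_slot(int slot_idx);   /* mark the slot FREEING for reclamation  */

void switch_to_proxy(session_conn_t *s)
{
    if (s->gtm_fd >= 0) {               /* case 1: drop the direct GTM connection */
        gtm_disconnect(s->gtm_fd);
        s->gtm_fd = -1;
    }
    if (s->slot_idx < 0)
        s->slot_idx = acquire_comm_slot();
}

void switch_to_direct(session_conn_t *s)
{
    if (s->slot_idx >= 0) {             /* case 2: give the communication slot up */
        release_comm_slot(s->slot_idx);
        s->slot_idx = -1;
    }
    if (s->gtm_fd < 0)
        s->gtm_fd = gtm_connect();
}
```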
  • the distributed database processing system can also dynamically determine, based on actual project requirements, whether to use the proxy module configured in the master node, thereby improving the processing efficiency of the distributed data processing system.
  • the data processing node includes a request receiving module, a shared memory area, and an agent module; before determining the target processing quantity based on the number of multiple data processing requests, it further includes: the request receiving module establishes a communication connection with the agent module based on the shared memory area.
  • the master node of the distributed database includes a request receiving module, a proxy module, and a shared memory area.
  • the proxy module is configured in the master node and can perform flow-limiting processing of data processing requests on behalf of the master node.
  • the request receiving module in the master node can establish a communication connection with the agent module configured on the master node through the shared memory area, so that the subsequent agent module can obtain the data processing request sent by the request receiving module from the shared memory area .
  • in a multi-master distributed database using a centralized global transaction manager, each master node can be configured with a proxy module for collecting, combining and forwarding the GTM requests from user sessions on that master node and returning the responses to the clients; communication between the proxy module and the user sessions runs through an efficient shared memory mechanism on the master node to improve data processing efficiency.
  • the target processing quantity, i.e. the number of data processing requests processed by the proxy module at the same time, can be determined. For example, determining the target processing quantity based on the number of data processing requests includes: the request receiving module determines the number of processing slots to be allocated in the shared memory area based on the number of data processing requests, and takes the number of processing slots to be allocated as the target processing quantity.
  • the distributed database system can allocate a shared memory area on each master node dedicated to data exchange between user database sessions and proxies. If the database system is implemented with multiple threads in a single process, this shared memory area can be allocated on heap memory accessible to all threads in the database system, including user database session threads and proxy threads. If the database system is implemented as multiple processes, this area can be allocated in the shared memory provided by the underlying operating system and made available to those processes.
  • the request receiving module in the master node can determine the number of slots to be processed that can be allocated for multiple data processing requests in the shared memory area based on the number of multiple data processing requests received.
  • the number of processing slots to be allocated means the number of data processing requests processed by the proxy module at the same time.
  • the target processing quantity is determined by determining the number of processing slots to be allocated in the shared memory area, so that the subsequent agent module can perform flow-limiting processing on multiple data processing requests based on the target processing quantity.
  • the data processing requests can be placed in the processing slots of the shared memory area. For example, performing flow-limiting processing on the data processing requests includes: the request receiving module places, according to the target processing quantity, a number of data processing requests equal to the target processing quantity into the processing slots to be allocated in the shared memory area; the proxy module then obtains these data processing requests from the processing slots and treats them as the target data processing requests.
  • the request receiving module in the master node of the distributed database places, based on the determined target processing quantity, the data processing requests equal in number to the target processing quantity into the processing slots to be allocated in the shared memory area; these represent the data processing requests that the proxy module can process at the same time, which implements the flow-limiting of the multiple data processing requests. The proxy module can then obtain, from the processing slots, the same number of data processing requests as the target processing quantity and use them as the target data processing requests.
  • the data processing method provided by the embodiment of the present application implements flow-limiting of multiple data processing requests by placing them in the processing slots to be allocated in the shared memory area and having the proxy module obtain them from those slots, thereby improving the data processing efficiency of the distributed data processing system.
  • the embodiment of the present application provides a structural diagram of the shared memory area used by the data processing method to communicate with the proxy module, as shown in Figure 4.
  • a schematic diagram of a structure 400 of a shared memory area of a data processing method provided by an embodiment of the present application is shown.
  • the shared memory area consists of two parts. One part is a set of communication slots: each slot is allocated to a user database session and is used by that session to exchange GTM requests and responses with the proxy.
  • the other part is the proxy bookkeeping area, an array in which each element is used by one proxy module to keep track of its various states.
  • the number of communication slots is set based on the maximum number of user database sessions on a single master node, which guarantees that each user database session has a dedicated shared memory communication slot.
  • there may be many processing slots in the shared memory area, but a processing slot must be configured before it can be used, and the configuration process has some resource cost, so efficient reuse of configured processing slots is an effective way to save resources.
  • the method adopted in the embodiment of the present application is to determine the state of a processing slot, i.e. whether it is occupied, based on the semaphore in the slot; after the proxy module obtains the data processing request from a processing slot, the slot returns to the idle state and the next data processing request can be allocated to it. For example, the request receiving module placing, according to the target processing quantity, the data processing requests equal to the target processing quantity into the processing slots to be allocated in the shared memory area includes: the request receiving module determines the semaphore of each processing slot in the shared memory area based on the target processing quantity, places the data processing requests into the processing slots to be allocated based on the semaphores, and modifies the communication state recorded in the processing slots to be allocated.
  • the proxy module uses the semaphore to notify the user database session about changes of the slot state; for example, the proxy module uses the semaphore to wake up a user session waiting for its response to become ready.
  • the implementation of the semaphore may differ for each data exchange scenario: for example, if the proxy module and the user database session are separate processes, it can be implemented with POSIX semaphore system calls; if the proxy module and the user database session are different threads in the same process, it can be implemented as a pthread condition variable.
  • the request receiving module of the master node of the distributed database can determine, based on the target processing quantity, the semaphore of each processing slot to be allocated in the shared memory area, and place the data processing requests into the processing slots to be allocated in the shared memory area based on the semaphores.
  • the communication state of a processing slot to be allocated needs to be modified, so that it can subsequently be determined from the communication state whether a data processing request can be placed in that slot.
  • the slot state represents the current state of the slot; during the communication between the user database session and the proxy module, the slot transitions between various states, see Figure 5, which shows a schematic diagram of the various state transitions 500 of the processing slots in the shared memory area of the data processing method provided in the embodiment of the present application (a minimal sketch of a slot and its states follows below).
  • Fig. 5 shows the schematic diagram of the conversion 500 of various slot states, respectively FREE (the slot is not used by any user database session), EMPTY (the slot has been allocated to the database user session, but has not yet stored any request), REQUEST_AVAIL (the request has been stored in the slot), WAIT_RESPONSE (the proxy has sent a request to GTM and is waiting for a response from GTM), RESPONSE_AVAIL (the response to the outstanding request has been stored in the slot and is ready to be received by the database user session), ERROR ( An error occurred while servicing the current request), FREEING (the slot has been abandoned by the database session and is ready to be recycled).
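  • a minimal sketch of how such a communication slot and its states could be laid out in the shared memory area; the buffer sizes, field names and the use of C11 atomics are illustrative assumptions, not the application's actual layout.

```c
/* Minimal sketch of a shared-memory communication slot (Figures 4 and 5).
 * Sizes and names are illustrative assumptions. */
#include <semaphore.h>
#include <stdatomic.h>

enum slot_state {
    SLOT_FREE,            /* not used by any user database session           */
    SLOT_EMPTY,           /* allocated to a session, no request stored yet   */
    SLOT_REQUEST_AVAIL,   /* a request has been stored in the slot           */
    SLOT_WAIT_RESPONSE,   /* proxy forwarded the request, waiting for GTM    */
    SLOT_RESPONSE_AVAIL,  /* response stored, ready for the session to read  */
    SLOT_ERROR,           /* an error occurred while servicing the request   */
    SLOT_FREEING          /* abandoned by the session, ready to be reclaimed */
};

#define REQ_BUF_SIZE  4096
#define RESP_BUF_SIZE 4096

typedef struct {
    _Atomic int state;                   /* holds an enum slot_state value   */
    sem_t       sem;                     /* proxy -> session notification    */
    char        request[REQ_BUF_SIZE];   /* serialized request to GTM        */
    char        response[RESP_BUF_SIZE]; /* serialized response from GTM     */
} comm_slot_t;
```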
  • the process of a user database session acquiring a shared memory communication slot, the process of a user database session giving up a communication slot, and the process of a user database session sending a request and receiving a response through a shared memory communication slot can illustrate the process of slot state transition:
  • the process for a user database session to acquire a shared memory communication slot is as follows (a minimal sketch follows after step 7 below): Step 1: obtain an exclusive lock on the communication slot array; Step 2: traverse the communication slot array to find a slot whose state is FREE; if one is found, go to step 3, otherwise go to step 7; Step 3: change the state of the slot to EMPTY; Step 4: assign a proxy to serve the slot according to the load balancing strategy described later, and add the slot's array index to that proxy's array of slot indices.
  • Step 5 Release the exclusive lock on the communication slot array;
  • Step 6 Arrange a callback function that will be executed at the end of this session to discard the acquired slots;
  • Step 7 Return whether the communication slot was successfully acquired.
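  • a minimal sketch of the slot-acquisition steps above, reusing the comm_slot_t sketch from earlier; the round-robin proxy assignment stands in for the load balancing strategy described later, and all other names and limits are assumptions.

```c
/* Minimal sketch of acquiring a shared-memory communication slot
 * (steps 1-7 above); reuses the comm_slot_t sketch from earlier. */
#include <pthread.h>
#include <stdatomic.h>

#define N_SLOTS   1024
#define N_PROXIES 4

extern pthread_rwlock_t slot_array_lock;          /* lives in the shared memory area        */
extern comm_slot_t      slots[N_SLOTS];
void proxy_add_slot(int proxy_id, int slot_idx);  /* append to the proxy's slot index array */

int acquire_comm_slot(void)
{
    static int next_proxy = 0;
    int found = -1;

    pthread_rwlock_wrlock(&slot_array_lock);       /* step 1: exclusive lock          */
    for (int i = 0; i < N_SLOTS; i++) {            /* step 2: look for a FREE slot    */
        int expected = SLOT_FREE;
        if (atomic_compare_exchange_strong(&slots[i].state, &expected, SLOT_EMPTY)) {
            /* step 3 done: state is now EMPTY */
            proxy_add_slot(next_proxy, i);         /* step 4: assign a serving proxy  */
            next_proxy = (next_proxy + 1) % N_PROXIES;
            found = i;
            break;
        }
    }
    pthread_rwlock_unlock(&slot_array_lock);       /* step 5: release the lock        */
    /* step 6 (not shown): register an end-of-session callback that gives the slot up. */
    return found;                                  /* step 7: slot index, or -1       */
}
```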
  • the process for a user database session to give up a communication slot is as follows: Step 1: change the state of the slot to FREEING;
  • Step 2: set the semaphore (Semaphore) in the bookkeeping data of the corresponding proxy to notify the proxy of the change in the slot's state; the proxy will then reclaim the slot.
  • the process for a user database session to send a request and receive a response through a shared memory communication slot is as follows: Step 1: store the request in the Request buffer of the communication slot; the request is formatted and serialized into a contiguous byte sequence, and the index of the communication slot is stored as part of the request; Step 2: change the state of the slot to REQUEST_AVAIL; Step 3: set the Semaphore in the bookkeeping data of the corresponding proxy to notify it of the arrival of a new request; Step 4: wait on the semaphore in the communication slot for a notification from the proxy; Step 5: after receiving the Semaphore notification from the proxy module, check the slot state: (1) if the state is RESPONSE_AVAIL, a response is present in the slot's Response buffer; in this case, change the slot state to EMPTY and return the response; (2) if the state is ERROR, change the slot state to EMPTY and return an error; (3) if the state is still REQUEST_AVAIL or WAIT_RESPONSE, the response is not ready yet, so return to step 4 and continue waiting.
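  • a minimal sketch of the session-side exchange in steps 1-5 above, reusing the comm_slot_t sketch from earlier; notify_proxy() stands in for posting the proxy's bookkeeping semaphore, and all names are assumptions.

```c
/* Minimal sketch of the session-side request/response exchange (steps 1-5). */
#include <semaphore.h>
#include <string.h>

void notify_proxy(int proxy_id);              /* post the serving proxy's semaphore */

/* Returns 0 on success (response copied into `resp`), -1 on error.
 * `req_len` is assumed to be no larger than the slot's request buffer. */
int session_gtm_call(comm_slot_t *slot, int proxy_id,
                     const char *req, size_t req_len,
                     char *resp, size_t resp_cap)
{
    memcpy(slot->request, req, req_len);       /* step 1: store the serialized request      */
    slot->state = SLOT_REQUEST_AVAIL;          /* step 2                                    */
    notify_proxy(proxy_id);                    /* step 3: tell the proxy a request arrived  */

    for (;;) {
        sem_wait(&slot->sem);                  /* step 4: wait for the proxy's notification */
        int st = slot->state;                  /* step 5: check the slot state              */
        if (st == SLOT_RESPONSE_AVAIL) {       /* (1) response is ready                     */
            size_t n = resp_cap < RESP_BUF_SIZE ? resp_cap : RESP_BUF_SIZE;
            memcpy(resp, slot->response, n);
            slot->state = SLOT_EMPTY;
            return 0;
        }
        if (st == SLOT_ERROR) {                /* (2) servicing the request failed          */
            slot->state = SLOT_EMPTY;
            return -1;
        }
        /* (3) still REQUEST_AVAIL / WAIT_RESPONSE: keep waiting. */
    }
}
```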
  • the Request buffer in the slot is a buffer large enough to store a single request to GTM.
  • when a user database session needs to send a request to GTM, it constructs the request and stores it in the Request buffer, and the proxy module then reads the request from the buffer and sends it to GTM for processing; the user database session can format and serialize the request into a specific format.
  • the other buffer, the Response buffer, can store a single response from GTM; when the proxy module receives a response from GTM, it stores the response in this buffer and notifies the user database session of its availability through the semaphore of the slot.
  • the array of shared memory communication slots is protected by a read-write lock, which itself also resides in the shared memory region.
  • a user database session When a user database session needs to acquire a slot, it acquires this lock in write (or exclusive) mode and selects a free slot for use.
  • each proxy repeatedly scans the communication slots for outstanding requests and for slots that have been abandoned by their database sessions and should therefore be reclaimed; the proxy acquires this lock in read (or shared) mode before scanning.
  • the data processing method provided by the embodiment of the present application can determine the current status of the slot through the semaphore, and then record the communication status of the slot to realize efficient allocation of data processing requests, which facilitates the subsequent improvement of the efficiency of processing data processing requests.
  • the bookkeeping data records the state of the corresponding proxy module, such as whether its connection is established or disconnected, and also records the slot index data, i.e. how many slots are in use and how many are idle; in other words, the bookkeeping data in the shared memory area records the state of the proxy module itself. For example, after the request receiving module places, according to the target processing quantity, the data processing requests equal to the target processing quantity into the processing slots to be allocated in the shared memory area, the method further includes: the request receiving module records, in the proxy bookkeeping area of the shared memory, the states of the processing slots to be allocated based on those data processing requests, and records the state of the connection with the proxy module.
  • the proxy bookkeeping can be understood as per-proxy bookkeeping data, and the number of elements in the bookkeeping data is set according to the number of proxy modules on the master node; the bookkeeping data of each proxy module includes the following fields (a minimal sketch follows below): a semaphore, which the user database session uses to notify the proxy module of the arrival of a new data processing request or the abandonment of a communication slot, and upon whose notification the proxy module starts to scan the communication slots to perform work; a proxy identifier (proxy_id), the unique identifier of the proxy that owns this bookkeeping data; and the slot index data, an index array of the communication slots served by this proxy module; when there are multiple proxy modules on the master node, each proxy module serves a subset of the slots, and its array of slot indices identifies those slots.
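  • a minimal sketch of one per-proxy bookkeeping element; the field names and the size limit are illustrative assumptions rather than the application's actual layout.

```c
/* Minimal sketch of the per-proxy bookkeeping element described above. */
#include <semaphore.h>

#define MAX_SLOTS_PER_PROXY 256

typedef struct {
    sem_t sem;                  /* posted by sessions: new request or abandoned slot */
    int   proxy_id;             /* unique identifier of the proxy owning this entry  */
    int   n_slots;              /* how many communication slots this proxy serves    */
    int   slot_index[MAX_SLOTS_PER_PROXY];   /* indices of the slots it serves       */
    int   gtm_connected;        /* whether a connection to GTM is currently open     */
} proxy_bookkeeping_t;
```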
  • the request receiving module records, in the proxy bookkeeping area of the shared memory, the states of the processing slots to be allocated based on the data processing requests equal to the target processing quantity, and also records the state of the proxy module.
  • the data processing method provided by the embodiment of the present application records the slot states and the proxy module states in the proxy bookkeeping area of the shared memory, and based on this state transition and recording, it not only reduces conflicts from concurrent access but also reduces resource overhead.
  • Step 304 Forward the target data processing request to the global transaction manager, receive the processing result of the target data processing request processed by the global transaction manager, and return the processing result to the client corresponding to each data processing request.
  • the distributed data processing node forwards the determined target data processing requests to the global transaction manager; after the global transaction manager processes the requests, it returns the processing results of the target data processing requests to the distributed data processing node, and the processing results are returned to the client corresponding to each data processing request.
  • forwarding the target data processing request to the global transaction manager includes: when the proxy module determines to establish a communication connection with the global transaction manager, forwarding the target data processing request to the global transaction manager based on the communication connection.
  • when the proxy module in the master node of the distributed database determines that a communication connection with the global transaction manager has been established, it forwards the target data processing request to the global transaction manager through that communication connection.
  • the request receiving module can also receive the result of a distributed snapshot request computed by the global transaction manager. For example, forwarding the target data processing request to the global transaction manager and receiving the processing result of the target data processing request processed by the global transaction manager includes: when the request receiving module determines that the target data processing request is a distributed snapshot request, forwarding the distributed snapshot request to the global transaction manager; and the request receiving module receiving the result calculated by the global transaction manager for each distributed snapshot request.
  • when the request receiving module determines that the target processing request is a distributed snapshot request, it sends the distributed snapshot request to GTM; GTM processes the request by calculating the distributed snapshot and replies to the proxy module with a response that combines the distributed snapshot with the slot index list contained in the request. After receiving the response, the proxy module stores a copy of the distributed snapshot into the response buffer of each communication slot listed in the response.
  • this embodiment also supports forwarding distributed snapshot requests to the global transaction manager in a hybrid communication mode. Step 1: given a set of new requests in the communication slots of the waiting queue, the proxy checks them and finds all requests to obtain distributed snapshots initiated by read-only transactions, as well as requests to obtain transaction IDs initiated by any type of transaction.
  • Step 2: if the currently processed set of transactions contains multiple requests for distributed snapshots initiated by read-only transactions, the proxy merges them and constructs a new joint request. It is still of the distributed snapshot request type and contains a list of slot indices indicating which communication slot each original request originated from. The proxy sends this new joint request to GTM. After GTM calculates the distributed snapshot, it returns the snapshot together with the received slot index list as the response to the proxy.
  • Step 3: similarly, if the currently processed group of transactions contains multiple requests for obtaining transaction IDs, the proxy also merges them and constructs a new joint request, whose type is still a global transaction ID request and which contains a list of slot indices indicating which communication slots they originate from.
  • GTM allocates a group of continuous transaction IDs, and returns the range information and slot index list of the continuous transaction IDs to the agent.
  • the agent writes each transaction ID in the range to the data buffer of each index pointed to by the slot index list.
  • multiple proxy modules can also be configured in the master node of the distributed database. For example, the request receiving module establishing a communication connection with the proxy module based on the shared memory area includes: the request receiving module determines, based on a preset configured quantity, two or more proxy modules equal in number to the preset configuration, and establishes communication connections with the two or more proxy modules based on the shared memory area.
  • the number of proxy modules on the master node is configurable.
  • our method distributes user sessions to these GTM proxies in a load-balanced manner.
  • this assignment is simple and deterministic, and usually results in an even distribution of work among the proxies. More sophisticated allocation strategies are possible: for example, a session might check and compare the lengths of the proxies' slot index arrays and choose the proxy that serves the fewest slots, as sketched below.
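  • a minimal sketch of that alternative strategy, reusing the proxy_bookkeeping_t sketch from earlier; the proxy count and names are illustrative assumptions.

```c
/* Minimal sketch: pick the proxy currently serving the fewest slots. */
#define N_PROXIES 4

extern proxy_bookkeeping_t proxies[N_PROXIES];   /* per-proxy bookkeeping array */

int pick_least_loaded_proxy(void)
{
    int best = 0;
    for (int i = 1; i < N_PROXIES; i++)
        if (proxies[i].n_slots < proxies[best].n_slots)
            best = i;
    return best;
}
```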
  • this embodiment proposes a load balancing method to distribute user sessions to multiple proxy modules. For example, forwarding the target data processing request to the global transaction manager includes: the request receiving module distributes the target data processing requests to two or more proxy modules based on a preset load balancing mechanism, and the proxy modules forward the target data processing requests to the global transaction manager.
  • after the request receiving module distributes the target data processing requests to the two or more proxy modules based on the preset load balancing mechanism, the proxy modules automatically detect and eliminate redundant requests to GTM from concurrent transactions, so as to reduce the network traffic to GTM and the workload performed by GTM.
  • the proxy checks for two types of concurrent requests that can be serviced together and can therefore be considered redundant.
  • the first type is distributed snapshot requests from concurrent read-only transactions. GTM can compute a single distributed snapshot and return it as the response to all of these requests. Therefore, the proxy first combines multiple concurrent distributed snapshot requests into a single message, sends it to GTM, and, when it receives the response, returns a copy of the response to each requesting transaction.
  • the second type is global transaction ID requests from concurrent transactions. GTM can return a contiguous range of global transaction IDs instead of assigning each transaction a separate ID, and the proxy then hands out one ID to each requesting transaction.
  • Step 1: given a set of new requests in the communication slots of the waiting queue, the proxy checks them and finds all requests for distributed snapshots initiated by read-only transactions, as well as requests to obtain a transaction ID for any kind of transaction.
  • Step 2: if the currently processed set of transactions contains multiple requests for distributed snapshots initiated by read-only transactions, the proxy merges them and constructs a new joint request. It is still of the distributed snapshot request type and contains a list of slot indices indicating which communication slot each original request originated from. The proxy sends this new joint request to GTM. After GTM calculates the distributed snapshot, it returns the snapshot together with the received slot index list as the response to the proxy.
  • Step 3: similarly, if the currently processed group of transactions contains multiple requests for obtaining transaction IDs, the proxy also merges them and constructs a new joint request, whose type is still a global transaction ID request and which contains a list of slot indices indicating which communication slots they originate from (a minimal sketch of this merging is given after the ID distribution below).
  • GTM allocates a group of continuous transaction IDs, and returns the range information and slot index list of the continuous transaction IDs to the agent.
  • the agent writes each transaction ID in the range to the data buffer of each index pointed to by the slot index list.
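  • The following sketch illustrates, under assumptions, how a proxy might merge redundant requests and fan the joint responses back out to the originating slots; the request/response dictionary layout, the field names (type, slots, snapshot, first_xid), and the slot attributes (response, state) are hypothetical stand-ins for message formats the text does not specify.

```python
def build_joint_requests(pending):
    """pending: list of (slot_index, request_dict) pairs collected from slots
    whose state is REQUEST_AVAIL. Returns one or more combined requests."""
    snapshot_slots = [i for i, r in pending
                      if r["type"] == "DISTRIBUTED_SNAPSHOT" and r.get("read_only")]
    gxid_slots = [i for i, r in pending if r["type"] == "GLOBAL_XID"]
    joint = []
    if snapshot_slots:
        joint.append({"type": "DISTRIBUTED_SNAPSHOT", "slots": snapshot_slots})
    if gxid_slots:
        joint.append({"type": "GLOBAL_XID", "slots": gxid_slots,
                      "count": len(gxid_slots)})
    return joint

def fan_out_response(response, slots):
    """Copy a (possibly joint) GTM response back into each originating slot."""
    if response["type"] == "DISTRIBUTED_SNAPSHOT":
        for i in response["slots"]:
            slots[i].response = response["snapshot"]             # same snapshot for every requester
    elif response["type"] == "GLOBAL_XID":
        for offset, i in enumerate(response["slots"]):
            slots[i].response = response["first_xid"] + offset   # one ID from the contiguous range
    for i in response["slots"]:
        slots[i].state = "RESPONSE_AVAIL"
```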
  • In this embodiment of the present application, each proxy module serves one or more user database sessions through their communication slots. The workflow of a proxy module is as follows. Step 1: wait on the semaphore in the proxy bookkeeping data for notifications from user database sessions; after receiving a notification, go to Step 2. Step 2: acquire a shared lock on the communication slot array. Step 3: scan each slot whose index is recorded in the proxy's slot index array (these are the slots served by this proxy) and check its state: if the state is FREEING, add the slot to a list named freeing_slots; if the state is REQUEST_AVAIL, add the slot to another list named pending_requests.
  • Step 4: for each slot in the freeing_slots list, reclaim it by changing its state to FREE, or remove this slot from the proxy's slot index array. Step 5: release the lock on the communication slot array. Step 6: send the requests held in the slots on the pending_requests list. Step 6.1: if there are multiple pending requests, merge them into one or more combined requests and send the combined message to the GTM; otherwise, send the single request to the GTM (the construction of combined requests is described above). After sending, change the state of each of these slots to WAIT_RESPONSE; if the proxy fails to send a request, change the state of the corresponding slots to ERROR, remove them from the pending_requests list, re-establish the connection to the GTM, and go to Step 1.
  • Step 6.2: receive the responses from the GTM. The GTM may return combined responses corresponding to multiple requests, or single responses; from a combined response the proxy reconstructs the individual responses. For each response, store it into the response buffer of the corresponding communication slot and change the slot state to RESPONSE_AVAIL. If the proxy does not receive a response from the GTM (for example, because of a network disconnection or a GTM failure), set the state of the corresponding slots to ERROR. Once a slot's state has changed to RESPONSE_AVAIL or ERROR, set the slot's semaphore to notify the database session and remove the slot from the pending_requests list. When the pending_requests list is empty, go to Step 1.
  • Based on this, both database sessions and proxies modify slot states through atomic instructions (e.g., compare-and-swap). Note that reclaiming a communication slot is done in two steps: the session sets the slot state to FREEING, and the proxy then reclaims it and sets its state to FREE. This arrangement reduces conflicts on concurrent access to the communication slot array, thereby reducing locking overhead.
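  • The sketch below strings the workflow steps above into a single proxy loop. It reuses build_joint_requests and fan_out_response from the earlier sketch; gtm_send and gtm_recv are hypothetical helpers for the proxy's GTM connection, and a plain lock with in-place state writes stands in for the shared read lock and compare-and-swap instructions described above.

```python
def proxy_loop(bookkeeping, slots, array_lock):
    """One proxy's main loop over the slots it serves (see Steps 1-6 above)."""
    while True:
        bookkeeping.semaphore.acquire()                  # Step 1: wait for a session notification
        freeing, pending = [], []
        with array_lock:                                 # Steps 2-5: lock, scan, reclaim, unlock
            for i in list(bookkeeping.slot_indexes):     # only slots served by this proxy
                state = slots[i].state
                if state == "FREEING":
                    freeing.append(i)
                elif state == "REQUEST_AVAIL":
                    pending.append(i)
            for i in freeing:                            # Step 4: reclaim abandoned slots
                slots[i].state = "FREE"
                bookkeeping.slot_indexes.remove(i)
        if not pending:
            continue
        requests = [(i, slots[i].request) for i in pending]
        for i in pending:
            slots[i].state = "WAIT_RESPONSE"             # Step 6.1: requests are in flight
        try:
            gtm_send(build_joint_requests(requests))     # hypothetical GTM connection helper
            for response in gtm_recv():                  # Step 6.2: combined or single responses
                fan_out_response(response, slots)
        except ConnectionError:
            for i in pending:
                slots[i].state = "ERROR"                 # signal failure to the sessions
        for i in pending:
            slots[i].semaphore.release()                 # wake the waiting database sessions
```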
  • In summary, in a multi-master distributed database that uses a centralized global transaction manager, each master node can be configured with a number of proxy modules that collect, combine, and forward the GTM requests issued by user sessions on that node and route the returned responses back to those sessions.
  • Communication between the proxy modules and the user sessions goes through an efficient shared memory mechanism on the master server, and the number of proxy modules on one master is configurable.
  • When there are multiple proxy modules, this method distributes user sessions to them in a load-balanced manner; the proxies automatically detect and eliminate redundant requests to the GTM from concurrent transactions, reducing both the network traffic to the GTM and the work the GTM performs; and the database system allows each individual user session to dynamically specify whether to use a proxy or a dedicated connection to the GTM to execute the transactions in that session.
  • FIG. 6 shows a schematic structural diagram of a data processing device 600 provided by an embodiment of the present application.
  • The device 600 is applied to a data processing node of a distributed data processing system and includes: a request receiving module 602 configured to receive multiple data processing requests sent by clients, determine a target processing quantity based on the number of the multiple data processing requests, and perform flow-limiting processing on the multiple data processing requests according to the target processing quantity to obtain target data processing requests; and a proxy module 604 configured to forward the target data processing requests to the global transaction manager, receive the processing results of the global transaction manager processing the target data processing requests, and return the processing results to the client corresponding to each data processing request.
  • the request receiving module 602 is further configured to establish a communication connection with the proxy module based on the shared memory area.
  • the request receiving module 602 is further configured to determine the number of processing slots to be allocated in the shared memory area based on the number of the multiple data processing requests, and to use the number of processing slots to be allocated as the target processing quantity.
  • the request receiving module 602 is further configured to place the data processing request equal to the target processing quantity in the processing slot to be allocated in the shared memory area according to the target processing quantity;
  • the proxy module 604 is further configured to acquire data processing requests equal to the target processing quantity from the processing slots to be allocated, and use the data processing requests as the target data processing requests.
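  • A minimal sketch of this flow-limiting step follows, assuming slots expose a state field and a request buffer as in the earlier sketches; admit_batch and its return convention are illustrative names only, not the patent's interface.

```python
def admit_batch(slots, incoming_requests):
    """Admit at most as many requests as there are free slots; the rest wait
    for a later batch. The number of free slots plays the role of the
    target processing quantity in this embodiment."""
    free = [i for i, s in enumerate(slots) if s.state == "FREE"]
    target = min(len(free), len(incoming_requests))       # target processing quantity
    for slot_index, request in zip(free[:target], incoming_requests[:target]):
        slots[slot_index].request = request                # place the request in the slot
        slots[slot_index].state = "REQUEST_AVAIL"          # mark it ready for the proxy
    return incoming_requests[target:]                      # requests deferred to a later batch
```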
  • the proxy module 604 is further configured to forward the target data processing request to the global transaction manager based on the communication connection if it is determined to establish a communication connection with the global transaction manager.
  • the request receiving module 602 is further configured to determine, based on the target processing quantity, the semaphore of each processing slot to be allocated in the shared memory area; to place data processing requests equal in number to the target processing quantity into the processing slots to be allocated in the shared memory area; and to modify the communication state in the processing slots to be allocated.
  • the request receiving module 602 is further configured to record, based on the data processing requests equal in number to the target processing quantity, the state of the processing slots to be allocated in the proxy bookkeeping area of the shared memory, and to record the connection state with the proxy module.
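  • For illustration, a minimal session-side sketch of this exchange is given below, assuming each slot carries request/response buffers, a state field, and a semaphore, and that the proxy bookkeeping area exposes a semaphore; these names are assumptions for the example, not the patent's API.

```python
def session_send_request(slot, proxy_bookkeeping, request_bytes):
    """Store a serialized GTM request in the session's communication slot,
    notify the serving proxy, then wait until a response (or error) arrives."""
    slot.request = request_bytes
    slot.state = "REQUEST_AVAIL"             # the slot now holds a new request
    proxy_bookkeeping.semaphore.release()    # wake the proxy: work is available
    while True:
        slot.semaphore.acquire()             # wait for the proxy's notification
        if slot.state == "RESPONSE_AVAIL":
            slot.state = "EMPTY"
            return slot.response
        if slot.state == "ERROR":
            slot.state = "EMPTY"
            raise RuntimeError("GTM request failed")
        # otherwise (REQUEST_AVAIL / WAIT_RESPONSE): keep waiting
```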
  • the request receiving module 602 is further configured to forward the distributed snapshot request to the global transaction manager when it is determined that the target data processing request is a distributed snapshot request;
  • the request receiving module 602 is further configured to receive the result of calculating each distributed snapshot request by the global transaction manager.
  • the request receiving module 602 is further configured to determine, based on a preset configuration quantity, two or more proxy modules matching the preset configuration quantity, and to establish communication connections with the two or more proxy modules based on the shared memory area.
  • the request receiving module 602 is further configured to distribute the target data processing requests to the two or more proxy modules based on a preset load balancing mechanism, and the proxy modules forward the target data processing requests to the global transaction manager.
  • the request receiving module 602 is further configured to judge, based on preset project requirements, whether to start the proxy module in the data processing node, and if so, to send a data processing instruction to the client, where the data processing instruction instructs that requests sent by the client be forwarded to the proxy module for processing.
  • The data processing device provided in this embodiment of the present application performs flow-limiting processing, within the data processing node, on the multiple data processing requests received from clients and handles all of the requests sent by the clients in batches of a fixed size. This reduces the request-processing pressure on the data processing node; in addition, forwarding only the target data processing requests to the global transaction manager improves the processing efficiency of the global transaction manager and the processing performance of the entire distributed data processing system.
  • FIG. 7 shows a structural block diagram of a computing device 700 provided according to an embodiment of the present application.
  • Components of the computing device 700 include, but are not limited to, memory 710 and processor 720 .
  • the processor 720 is connected to the memory 710 through the bus 730, and the database 750 is used for storing data.
  • Computing device 700 also includes an access device 740 that enables computing device 700 to communicate via one or more networks 760 .
  • networks include the Public Switched Telephone Network (PSTN), Local Area Network (LAN), Wide Area Network (WAN), Personal Area Network (PAN), or a combination of communication networks such as the Internet.
  • Access device 740 may include one or more of any type of network interface (e.g., a network interface card (NIC)), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, Worldwide Interoperability for Microwave Access ( Wi-MAX) interface, Ethernet interface, Universal Serial Bus (USB) interface, cellular network interface, Bluetooth interface, Near Field Communication (NFC) interface, etc.
  • the above-mentioned components of the computing device 700 and other components not shown in FIG. 7 may also be connected to each other, for example, through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is only for the purpose of illustration, rather than limiting the scope of the application. Those skilled in the art can add or replace other components as needed.
  • Computing device 700 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile telephones (e.g., smartphones), wearable computing devices (e.g., smart watches, smart glasses, etc.), or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 700 may also be a mobile or stationary server.
  • the processor 720 is configured to execute the following computer-executable instructions, wherein the processor 720 implements the steps of the above data processing method when executing the computer-executable instructions.
  • An embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions, and implements the steps of the above data processing method when the computer-executable instructions are executed by a processor.
  • The data processing method provided in an embodiment of the present application is applied to a data processing node of a distributed data processing system and includes: receiving multiple data processing requests sent by clients, determining a target processing quantity based on the number of the multiple data processing requests, and performing flow-limiting processing on the multiple data processing requests according to the target processing quantity to obtain target data processing requests; forwarding the target data processing requests to the global transaction manager, receiving the processing results of the global transaction manager processing the target data processing requests, and returning the processing results to the client corresponding to each data processing request.
  • By performing flow-limiting processing in the data processing node and handling all requests sent by the clients in batches of a certain size, the method reduces the request-processing pressure on the data processing node; subsequently forwarding the target data processing requests to the global transaction manager also improves the processing efficiency of the global transaction manager and the processing performance of the entire distributed data processing system.
  • the above-mentioned computer instructions include computer program codes, and the computer program codes may be in the form of source code, object code, executable file or some intermediate form.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.


Abstract

本申请提供了一种数据处理方法、***、计算设备、计算机可读存储介质,其中,数据处理方法应用于分布式数据处理***的数据处理节点,包括:接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求;将目标数据处理请求转发至全局事务管理器,接收全局事务管理器对目标数据处理请求进行处理的处理结果,并将处理结果返回至每个数据处理请求对应的客户端。

Description

数据处理方法以及***
本申请要求于2021年11月12日提交中国专利局、申请号为202111339770.5、发明名称为“数据处理方法以及***”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别涉及一种数据处理方法。本申请同时涉及一种数据处理***,一种计算设备,以及一种计算机可读存储介质。
背景技术
数据库采用多主MPP(Massive Parallel Processing)架构。该***由两组计算机节点组成:主节点和数据节点。数据库依赖于一个称为全局事务管理器(GTM)的组件来支持快照隔离。尽管GTM组件可以通过多进程或多线程实现以提高其并行性,但它本质上是中心化的,并且随着并发事务数量的增加与GTM组件的连接数量会大幅度地增加,不仅会对GTM组件的运行造成较大的压力,也会成为整个分布式数据库***严重的***瓶颈。
发明内容
有鉴于此,本申请提供了一种数据处理方法,一种数据处理***,一种计算设备,以及一种计算机可读存储介质,以解决现有技术中存在的技术缺陷。
根据本申请实施例的第一方面,提供了一种数据处理方法,应用于分布式数据处理***的数据处理节点,包括:接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求;将目标数据处理请求转发至全局事务管理器,接收全局事务管理器对目标数据处理请求进行处理的处理结果,并将处理结果返回至每个数据处理请求对应的客户端。
根据本申请实施例的第二方面,提供了一种数据处理***,此数据处理***包括数据处理节点,该数据处理节点包括:请求接收模块,被配置为接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求;代理模块,被配置为将目标数据处理请求转发至全局事务管理器,接收全局事务管理器对目标数据处理请求进行处理的处理结果,并将处理结果返回至每个数据处理请求对应的客户端。
根据本申请实施例的第三方面,提供了一种计算设备,包括:存储器和处理器;其中,存储器用于存储计算机可执行指令,处理器用于执行计算机可执行指令,处理器执行计算机可执行指令时实现上述数据处理方法的步骤。
根据本申请实施例的第四方面,提供了一种计算机可读存储介质,其存储有计算机可执行指令,该指令被处理器执行时实现上述数据处理方法的步骤。
上述概述仅仅是为了说明书的目的,并不意图以任何方式进行限制。除上述描述的示 意性的方面、实施方式和特征之外,通过参考附图和以下的详细描述,本申请进一步的方面、实施方式和特征将会是容易明白的。
附图说明
在附图中,除非另外规定,否则贯穿多个附图相同的附图标记表示相同或相似的部件或元素。这些附图不一定是按照比例绘制的。应该理解,这些附图仅描绘了根据本申请公开的一些实施方式,而不应将其视为是对本申请范围的限制。
图1是本申请一个实施例提供的一种数据处理方法的多主分布式数据库架构;
图2是本申请一个实施例提供的一种数据处理***的整体架构图;
图3是本申请一个实施例提供的一种数据处理方法应用于分布式数据处理***的数据处理节点的流程图;
图4是本申请一个实施例提供的一种数据处理方法的共享内存区域的结构示意图;
图5是本申请一个实施例提供的一种数据处理方法的共享内存区域的处理插槽各种状态转换示意图;
图6是本申请一个实施例提供的一种数据处理装置的结构示意图;以及
图7是本申请一个实施例提供的一种计算设备的结构框图。
具体实施方式
在下面的描述中阐述了很多具体细节以便于充分理解本申请。但是本申请能够以很多不同于在此描述的其它方式来实施,本领域技术人员可以在不违背本申请内涵的情况下做类似推广,因此本申请不受下面公开的具体实施的限制。
在本申请一个或多个实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请一个或多个实施例。在本申请一个或多个实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义,“多种”一般包含至少两种,但是不排除包含至少一种的情况。还应当理解,本申请一个或多个实施例中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解,尽管在本申请一个或多个实施例中可能采用术语第一、第二等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本申请一个或多个实施例范围的情况下,第一也可以被称为第二,类似地,第二也可以被称为第一。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。
首先,对本申请一个或多个实施例涉及的名词术语进行解释。
事务:在数据库管理***中,事务是单个逻辑或工作单元,有时由多个操作组成。在数据库中以一致模式完成的任何逻辑计算都称为事务。例如,从一个银行账户到另一个银行账户的转账:完整的交易需要从一个账户中减去要转账的金额,然后将相同的金额添加到另一个账户中。
数据库事务:根据定义,数据库事务必须是原子的(它必须完整地完成或没有任何影 响)、一致的(它必须符合数据库中现有的约束)、隔离的(它不能影响其他事务)和持久的(它必须写入持久存储)。数据库从业者经常使用首字母缩写词ACID来指代数据库事务的这些属性。
分布式数据库:是将数据存储在不同物理位置的数据库。它可能存储在位于同一物理位置(例如数据中心)的多台计算机中;或者可能分散在互连的计算机网络上。
多主架构:是构建分布式数据库***的流行方式。在这种架构中,分布式数据库***由两组计算机节点组成:主节点和数据节点。每个主节点都保留了***目录和元数据(例如,表和索引定义)的最新副本。数据库的数据存储在多个数据节点中,受用户指定的分区和/或复制策略的约束。
数据库会话:表示应用程序(或客户端)与存储其持久对象的数据库之间的连接。通常通过TCP网络协议建立连接。一旦建立了会话,客户端应用程序就可以通过与数据库的连接发送SQL语句来查询和操作数据库中的数据。在数据库***方面,数据库可以使用进程或线程来接收和服务来自会话的所有SQL语句。一旦客户端应用程序与会话断开连接,数据库***就会取消分配与会话关联的任何资源(例如处理进程或线程)。
多版本并发控制(MVCC):多版本并发控制(MCC或MVCC),是数据库管理***常用的一种并发控制方法,用于提供对数据库的并发访问。当MVCC数据库需要更新一条数据时,它不会用新数据覆盖原始数据项,而是创建数据项的更新版本。因此存储了多个版本。每个事务看到的版本取决于实现的隔离级别。MVCC实现的最常见的隔离级别是快照隔离。使用快照隔离,事务观察事务开始时的数据状态。
快照隔离:快照隔离是保证在一个事务中进行的所有读取都将看到数据库的一致快照(实际上它读取的是在它开始时存在的最后提交的值),并且只有在自己所做的更新与自快照开始以后来自其他事物所做的更新没有冲突的情况下,该事务的更新才能成功提交。
代理(代理服务器):在计算机网络中,代理服务器是一种服务器应用程序或设备,它充当客户端请求的中介,从提供这些资源的服务器寻求资源。因此,代理服务器在请求服务时代表客户端运行,可能会掩盖对资源服务器的请求的真实来源。客户端不是直接连接到可以满足请求资源(例如文件或网页)的服务器,而是将请求定向到代理服务器,代理服务器评估请求并执行所需的网络事务。这是一种简化或控制请求复杂性的方法,或提供额外的好处,例如负载平衡、隐私或安全性。
数据库采用多主MPP(Massive Parallel Processing)架构。该***由两组计算机节点组成:主节点和数据节点。每个主节点都保留了***目录和元数据(例如,表和索引定义)的最新副本。数据库的数据存储在多个数据节点中,受用户指定的分区和/或复制策略的约束。
客户端连接到其中一个主机(例如,通过TCP网络协议)以建立数据库会话,然后可以通过该连接提交SQL语句。对于来自客户端的每条SQL语句,相应的主节点都会解析该SQL语句,生成优化的查询计划,并将查询计划分派到数据节点执行。每个数据节点用本地存储的数据执行主节点发送的查询计划,必要时相互交换中间数据,最后将查询结果发送回主节点。主节点合并组装最终查询结果并发回客户端。
具体可参见图1,图1示出了本申请在一些实施方式中提供的一种数据处理方法的多 主分布式数据库架构100。
图1中分布式数据库架构100为具有两个主数据库节点的分布式数据库,分别为主节点1和主节点2,每个主节点对应具有两个客户端,主节点1对应的客户端分别为客户端1和客户端2,主节点2对应的客户端分别为客户端3和客户端4,同时,两个主节点所对应的数据节点分别为数据节点1、数据节点2、数据节点3、数据节点4,且该多主分布式数据库架构中还具有全局事务管理器(GTM)。
在一些实施方式中,图1中每组矩形代表用于在一个数据库会话中运行事务的进程或线程。数据库支持符合SQL标准的ACID属性的数据库事务。事实上,任何提交给数据库的SQL语句都是在一个数据库事务中执行的。此类事务要么由客户端通过BEGIN/COMMIT/ABORT语句显式指定,要么由数据库***在客户端未显示指定事务范围时为单个SQL语句隐式和内部创建。
数据库中的事务通常涉及为事务的SQL语句创建和调度查询计划的阶段,以及执行查询的多个数据节点。如果事务包含修改***目录和元数据的DDL(数据定义语言)语句(例如,CREATE TABLE),则事务也将跨越***中的所有其他主数据。由于事务涉及多个分布式计算机节点,因此使用分布式事务协议来确保事务满足ACID属性。例如,数据库采用标准的两阶段提交协议。主节点充当分布式事务的协调器,所涉及的数据节点(分节点)和其他主节点充当参与者。当事务中的工作完成时,协调器启动称为“准备”(或“投票”)的第一阶段:它要求每个参与者投票是否应该提交事务;每个参与者根据其本地执行结果以自己的投票(提交或中止)进行回复。如果协调器收到所有参与者的提交投票,那么它启动称为“提交”的第二阶段:它要求所有参与者在本地提交事务,一旦所有参与者都确认他们已完成,协调器提交分布式事务。如果协调器在准备阶段收到参与者的任何中止投票,它会要求所有参与者在本地中止事务并中止分布式事务。
在一些实施方式中,数据库依赖于一个称为全局事务管理器(GTM)的组件来支持快照隔离,这是用于管理并发事务的多版本并发控制机制的流行变体。GTM是分布式数据库中的中心化服务,负责为事务分配唯一标识符,跟踪分布式事务的状态(状态为进行中、已提交或已中止),并生成分布式快照。
当一个主节点开始一个分布式事务时,它会向GTM发送一个请求,要求将新事务注册到GTM并为这个事务分配一个唯一的标识符(称为全局事务ID或简称GXID)。全局事务ID唯一标识一个事务。每当事务在数据库表中***或修改行时,行数据的一个版本就会存储在表中,其中包含数据有效负载和事务ID。事务ID在内部实现为对数据库用户透明的隐藏列。
交易在GTM注册后,GTM将其状态设置为进行中。稍后当相应的主节点提交或中止事务时,主节点通知GTM更改,GTM相应地设置事务状态。
当主节点将事务中的查询分派到数据节点(Remove)时,它会向GTM发送一个分布式快照请求。分布式快照的内容指明了当时正在进行哪些事务(根据其全局事务ID)。为了计算分布式快照,GTM检查其事务状态的跟踪记录并返回当前所有的活动事务的ID。这种分布式快照与查询计划一起发送到数据节点。当一个数据节点执行查询并需要访问表中的行时,它使用分布式快照来确定该行的某个版本对于当前事务是否可见。例如,如果 执行中的事务ID列在分布式快照中的进行中事务集中,则当前事务不应读取此版本,否则将从未提交的事务中读取数据并违反隔离属性。
尽管GTM组件可以通过多进程或多线程实现以提高其并行性,但它本质上是中心化的,并且随着并发事务数量的增加会成为严重的***瓶颈。为高并发扩展GTM的主要障碍有两个:
第一,在多主分布式数据库中,假设每个主节点最多可以接受N个客户端会话,并且有M个主节点,那么***在任何时候最多可以有N x M个并发会话。如果每个会话都与GTM建立自己的TCP连接,则GTM将需要处理最多N x M个连接。在高并发的大型部署中,这可能会超过单台机器的TCP连接限制。由于拥有多个主节点的全部好处是为客户端提供超出单台机器容量的可扩展连接点,因此将每个会话连接到集中式GTM将违背多主架构的基本目的。
第二,即使GTM可以从单个会话中为这么多连接提供服务,它也会失去提高效率的大量机会。比如,假设在同一个主节点上有K个并发会话,并且它们都在执行只读事务。只读事务是不修改或不***任何数据的事务。如果每个会话与GTM建立自己的连接,并从自己的连接发送分布式快照请求,GTM将收到K个这样的请求,计算K次分布式快照并将K个结果发回。然而,这是多余且不必要的,GTM可以使用计算的第一个分布式快照作为对来自这些只读事务的所有K个并发快照请求的响应。如果主节点可以将K个并发请求合并为一个请求,让GTM只计算一次分布式快照,然后将返回的结果取出,我们可以将网络消息的数量和GTM完成的工作量减少K倍,并将它们返回给K个事务。即使对于来自同一个主节点的不同并发请求(例如,一个session正在请求全局事务ID,另一个session正在请求分布式快照),批量发送这些请求通常比从单独的连接单独发送它们更有效。
然而,对于上述第一个问题,在现有技术中可能会认为UDP等无连接网络协议可以克服连接限制,但结果是增加了***实现的复杂性,以确保与GTM的可靠通信。此外,使用UDP根本无助于解决第二个问题。
在后续的分布式数据处理***的发展中,Postgres-XL是一个多主分布式数据库,也采用了集中式全局事务管理器(GTM)。Postgres-XL中的GTM作为一个多线程的独立进程实现,负责事务ID分配、事务状态跟踪和分布式快照计算。主节点上的用户数据库会话可以直接连接到GTM以请求事务ID、通知状态更改或请求分布式快照。或者,数据库可以部署多个GTM代理进程,并让用户的数据库会话连接到其中一个GTM代理。GTM代理将请求转发到GTM服务器并将响应发送回用户会话。
由此可知,Postgres-XL是使用代理模块来提高中心化GTM的可扩展性,但还会具有四个明显的缺陷:第一,Postgres-XL中的GTM代理仅支持TCP连接和与主节点上用户数据库会话的通信。即使GTM代理和用户的数据库会话运行在同一个主节点上,它们也通过TCP相互通信。由于额外的内存副本和不必要的网络堆栈开销,这种通信的效率低于其他基于共享内存的通信机制。第二,当有多个GTM代理时,Postgres-XL没有指定如何以负载平衡的方式将用户的数据库会话分配给这些GTM代理。第三,当Postgres-XL中的GTM代理同时收到多个并发请求时,它会将这些请求打包成一条消息 并发送给GTM服务器,并解压GTM服务器发回的所有响应。但是,Postgres-XL中的GTM代理不会检测或消除冗余并发请求(例如来自并发只读事务的多个分布式快照请求)。第四,Postgres-XL不允许单个数据库会话选择是直接连接还是通过GTM代理连接到GTM。该设置是***范围的,不能在不重新启动数据库***的情况下动态更改。这不灵活,不能支持多个用户需要同时使用不同的连接方式到GTM的情况。例如,数据库***可能会限制普通用户的数据库会话使用GTM代理,但允许高优先级用户或***管理员使用GTM专用连接来处理紧急或维护任务。而本提供的数据处理方法将允许单个用户会话动态选择是直接连接GTM还是使用GTM代理,相比较于Postgres-XL,该方法将会更加灵活运用,可以更好地服务于混合使用的场景。
综上,本申请实施例提供的数据处理方法,提出了一种基于代理的方法,以解决多主分布式数据库中集中式GTM的可扩展性挑战,需要说明的是,本申请提出的方法并不限于数据库中使用的特定分布式事务协议(如Two-Phase Commit)和并发控制机制(如Snapshot Isolation),可广泛适用于任何采用集中事务管理的多主分布式数据库。
在本申请中,提供了一种数据处理方法,本申请同时涉及一种数据处理***,一种计算设备,以及一种计算机可读存储介质,在下面的实施例中逐一进行详细说明。
图2示出了根据本申请一个实施例提供的一种数据处理***的整体架构200的示意图。
需要说明的是,本申请实施例提供的数据处理方法提出使用代理模块,来提高多主分布式数据库中集中式全局事务管理器(GTM)的持久性和可扩展性。在一些实施方式中,该数据处理***包括数据处理节点,其中,数据处理节点包括:请求接收模块,被配置为接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求;代理模块,被配置为将目标数据处理请求转发至全局事务管理器,接收全局事务管理器对目标数据处理请求进行处理的处理结果,并将处理结果返回至每个数据处理请求对应的客户端。
图2中分布式数据处理***的整体架构200为具有两个主数据库节点的分布式数据库,分别为主节点1(master1)和主节点2(master2),每个主节点均对应具有两个客户端,主节点1对应的客户端分别为客户端1和客户端2,主节点2对应的客户端分别为客户端3和客户端4,同时,两个主节点所对应的数据节点分别为数据节点1、数据节点2、数据节点3、数据节点4,且该多主分布式数据库架构中还具有全局事务管理器(GTM),需要说明的是,在主节点1和主节点2中分别配置了代理模块,同时,主节点中接收客户端发送的数据处理请求的模块与代理模块之间,通过共享内存区域进行通信连接。
在一些实施方式中,分布式数据库中的每个主节点都配备了一组代理模块的进程或线程。每个代理模块在部署它的主数据库节点上为一个或多个用户数据库会话提供服务。每个用户数据库会话都可以动态选择直接或间接通过同一主数据库节点上的代理模块连接到GTM。当数据库会话选择使用代理并需要向GTM发出请求(例如,分配全局事务ID或获取分布式快照)时,它会将请求处理给相应的代理模块并等待其返回响应。收到来自用户数据库会话的至少一个请求后,如果还没有建立连接,代理模块会与GTM建立连接,并发送用户数据库会话发出的请求。GTM处理每个请求,并将响应发送回代理模块。对 于从GTM收到的每个响应,代理模块确定它属于哪个数据库会话并将响应发送回数据库会话。代理模块和用户数据库会话通过高效的共享内存机制进行通信。例如,当有多个代理模块时,我们的方法以负载平衡的方式将用户会话分配给这些GTM的代理模块。代理模块会自动检测并消除并发事务对GTM的冗余请求,以减少到GTM的网络流量和GTM执行的工作量。又或者,本方法允许单个用户会话动态选择是直接连接到GTM还是使用GTM代理模块,能够更加灵活地服务于这种混合使用场景。
本申请实施例提供的数据处理***,在集中式全局事务管理器管理的多主分布式数据库中,利用配置在主数据库节点中的代理模块,基于高效的共享内存机制对接收到的数据处理请求进行限流处理,且在配置了多个代理模块的情况下,以负载均衡的方式将数据处理请求分配至代理模块,进而代理模块帮助GTM减少了数据冗余请求,也减少了全局事务管理的网络流量和GTM执行的工作量;在一些实施方式中,也允许单个用户会话动态选择是直接连接到GTM还是使用GTM代理模块,能够更加灵活地服务于这种混合使用场景。
参见图3,图3示出了本申请一个实施例提供的一种数据处理方法应用于分布式数据处理***的数据处理节点的流程300的示意图,具体包括以下步骤:
需要说明的是,本实施例提供的数据处理方法可适用于任何采用集中事务管理的多主分布式数据库,在此不做过多限定。
步骤302:接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求。
在一些实施方式中,分布式数据处理***的数据处理节点可以理解为分布式数据库的主节点,即数据写入节点,能够进行数据的读取、数据的写入等,而现有技术中主节点在接收多个客户端发送的多个数据处理请求之后,将该多个数据处理请求会直接转发至集中事务管理的全局事务管理器中,进而执行后续的数据事务的执行,但由于多个数据处理请求都直接与全局事务管理相连接,可能会导致由于全局事务管理器的连接数量过多而出现的***崩溃的状态;由此,本申请实施例提供的数据处理方法,在分布式数据库的主节点中就对多个数据处理请求进行限流处理,以便于能够减少全局事务管理器执行数据处理请求的数量。
在一些实施方式中,分布式数据库的主节点接收到客户端发送的多个数据处理请求之后,将根据多个数据处理请求的数量确定同一时间处理的目标处理数量,并通过该目标处理数量对上述多个数据处理请求进行限流处理,进而筛选出目标数据处理请求;例如,分布式数据库的主节点接收了1000个数据处理请求,根据该1000个数据处理请求确定主节点在同一时间只能处理100个数据处理请求,因此,对该1000个数据处理请求进行限流处理,仅从1000个数据处理请求中按照请求的顺序获取100个数据处理请求作为目标数据处理请求。
而在分布式数据库的主节点接收到客户端的多个数据处理请求之前,数据库***允许每个单独的用户会话动态指定是否使用代理模块或专用链接到GTM来执行会话中的事务;例如,接收客户端发送的多个数据处理请求之前,还包括:请求接收模块基于预设项目需 求判断是否启动数据处理节点中的代理模块,若是,则向客户端发送数据处理指令,其中,数据处理指令为将客户端发送的请求转发至代理模块处理的指令。
需要说明的是,分布式数据库***为了提高数据处理能力,以及减少全局事务管理器的事务处理压力,可在分布式数据库的主节点中配置代理模块,代理处理多个数据处理请求,以减少全局事务管理器的网络流量和执行的工作量。
在一些实施方式中,分布式数据库主节点中的请求接收模块还可根据不同的项目需求,确定是否启动该主节点中配置的代理模块,若确定基于该项目需求,需要大量的数据处理请求进行处理的情况下,则可同意启动代理模块,该主节点可向客户端发送数据处理指令,其中,该数据处理指令为将客户端发送的请求转发至代理模块处理的指令。
在一些实施方式中,当数据库会话动态更改其连接模式时,它将遵循以下两个步骤之一:如果它当前与GTM有直接连接并且现在选择使用代理,则它关闭与GTM的连接,并获取共享内存通信槽以与代理交互。如果它当前使用共享内存通信槽,现在选择直接连接到GTM,那么它放弃共享内存通信槽,并建立到GTM的新的直接连接。数据库***可能会设置配额或限制来限制哪些用户或会话可以使用直接连接到GTM。
在一些实施方式中,本申请实施例提供的数据处理方法,分布式数据库处理***还可基于实际的项目需求,动态确定是否要使用配置在主节点中的代理模块,进而提高分布式数据处理***的处理效率。例如,数据处理节点包括请求接收模块、共享内存区域以及代理模块;基于多个数据处理请求的数量确定目标处理数量之前,还包括:请求接收模块基于共享内存区域与代理模块建立通信连接。
需要说明的是,在分布式数据库的主节点中包括请求接收模块、代理模块以及共享内存区域,代理模块是配置在主节点中,且代理模块可以代理主节点对数据处理请求进行限流处理的模块。
在一实施例中,主节点中的请求接收模块可通过共享内存区域,与配置在主节点的代理模块建立通信连接,便于后续代理模块可从共享内存区域中获取请求接收模块发送的数据处理请求。
本申请实施例提供的数据处理方法,在采用集中式全局事务管理器的多主分布式数据库中,每个主节点可配置代理模块,用于收集、组合和转发来自该主节点上用户会话的GTM请求,并将返回的响应返回至客户端,且代理模块与用户会话之间的通信是通过主节点上的高效共享内存机制运行的,以提高数据处理的效率。
在一些实施方式中,可基于共享内存区域中的待分配处理插槽的数量,进而确定代理模块同一时间内处理数据处理请求的目标处理数量;例如,基于多个数据处理请求的数量确定目标处理数量,包括:请求接收模块基于多个数据处理请求的数量在共享内存区域中确定待分配处理插槽的数量,并将待分配处理插槽的数量作为目标处理数量。
在一实施例中,分布式数据库***可为每个主节点上分配一个共享内存区域,专用于用户的数据库会话和代理之间的数据交换,如果数据库***是在单个进程中使用多个线程来实现的,那么这个共享内存区域可以分配在数据库***中所有线程都可以访问的堆内存上,包括用户数据库会话线程和代理线程。如果数据库***实现为多个进程,则可以在底层操作***提供的共享内存区域上分配此共享内存区域,并使其可用到多个进程。
例如,主节点中的请求接收模块可基于接收到的多个数据处理请求的数量,在共享内存区域中确定能够为多个数据处理请求分配的待处理插槽的数量,在初始化的状态下,待分配处理插槽的数量意味着同一时间内代理模块处理数据处理请求的数量。
本申请实施例提供的数据处理方法,通过在共享内存区域中确定待分配处理插槽的数量,确定目标处理数量,以便于后续代理模块基于目标处理数量对多个数据处理请求进行限流处理。
为了使得代理模块对多个数据处理请求进行限流处理,可将数据处理请求放置于共享内存区域的处理插槽中;例如,根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求,包括:请求接收模块根据目标处理数量将与目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中;代理模块从待分配处理插槽中获取与目标处理数量相同的数据处理请求,并将数据处理请求作为目标数据处理请求。
在一些实施方式中,分布式数据库主节点中的请求接收模块基于确定的目标处理数量,将与该目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中,表示代理模块可在同一时间处理的数据处理请求在待分配的处理插槽中,进而实现对多个数据处理请求的限流处理,然后,代理模块可从待处理插槽中获取与该目标处理数量相同的数据处理请求,并将获取到的数据处理请求作为目标数据处理请求。
本申请实施例提供的数据处理请求,通过将数据处理请求放置于共享内存区域的待分配处理插槽,且代理模块从待分配处理插槽中获取数据处理请求,实现对多个数据处理请求的限流处理,以提高分布式数据处理***的数据处理效率。
基于上述代理模块对多个数据处理请求详细的限流处理过程,本申请实施例提供了数据处理方法用于与代理模块进行通信的共享内存区域的结构图,可参见图4,图4示出了本申请一实施例提供的一种数据处理方法的共享内存区域的结构400的示意图。
图4中,共享内存区域包括两个部分,其中,一个部分是一组通信槽,通信槽中的每个插槽分配给一个用户数据库会话,并用于该用户数据库会话与代理交换GTM请求和响应。另一个部分为代理薄记,一个代理模块使用阵列的每个单元来跟踪各种状态。通信槽的数量是根据单个主节点上的最大用户数据库会话数设置的,这样可以保证每个用户数据库会话都有一个专用的共享内存通信槽。
在一些实施方式中,共享内存区域中可能会有较多的处理插槽,但是需要对处理插槽进行配置后,才可使用,且在配置的过程中,还需要花费一些资源成本,因此如何高效地利用已经配置好的处理插槽,是节省资源成本的有效方式,进而,本申请实施例采用的方式为依据处理插槽中的信号量,确定该处理插槽的状态,是否是被占用的状态,在代理模块从处理插槽中获取数据处理请求之后,则该处理插槽即恢复为空闲状态,可为该处理插槽分配下一个数据处理请求;例如,请求接收模块根据目标处理数量将与目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中,包括:请求接收模块基于目标处理数量确定共享内存区域的每个待分配处理插槽的信号量;基于所述信号量将与目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中,并修改待分配处理插槽中的通信状态。
需要说明的是,代理模块使用信号量,是来通知用户数据库会话有关插槽状态更改的 信息,例如,代理模块将使用信用量唤醒等待响应准备好的用户会话,应该注意的是,信号量在每次数据交换的过程中可能会不同,例如,代理模块和用户数据库会话是单独的进程,它可以用POSIX信号量***调用来实现,或者如果代理模块和用户数据库会话是同一进程中的不同线程,它可以作为pthead条件变量来实现。
在一实施例中,分布式数据库主节点的请求接收模块可基于目标处理数量确定出共享内存区域的每个待分配处理插槽的信号量,并基于信号量将与所述目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中,同时,还需修改待分配处理插槽的通信状态,便于后续根据待分配处理插槽的通信状态确定处理插槽是否能够放置数据处理请求。
在一些实施方式中,插槽状态表示插槽的当前状态,在用户数据库会话和代理模块之间的通信期间,插槽会在各种状态之间转换,参见图5,图5示出了本申请实施例提供的数据处理方法的共享内存区域的处理插槽各种状态转换500的示意图。
图5中表示各种插槽状态的转换500的示意图,分别有FREE(该插槽未被任何用户数据库会话使用)、EMPTY(插槽已分配给数据库用户会话,但尚未存放任何请求)、REQUEST_AVAIL(请求已存入插槽)、WAIT_RESPONSE(代理已经向GTM发送了请求,正在等待GTM的响应)、RESPONSE_AVAIL(未完成请求的响应已存储在槽中并准备好供数据库用户会话接收)、ERROR(服务当前请求时出错)、FREEING(插槽已被数据库会话放弃,并准备好回收)。
例如,用户数据库会话获取共享内存通信槽的过程、用户数据库会话放弃通信槽的过程、用户数据库会话通过共享内存通信槽发送请求并接收响应的过程,可以说明插槽状态转换的过程:
在一实施例中,用户数据库会话获取共享内存通信槽的过程如下:步骤1:获取通信槽阵列的排他锁;步骤2:遍历通信槽数组,找到状态为FREE的槽。若找到,则执行步骤3,若找不到该插槽,则执行步骤7;步骤3:将插槽的状态更改为EMPTY;步骤4:根据下一节描述的负载均衡策略分配一个代理来服务这个槽,并将这个槽的数组索引添加到代理的槽索引数组中。步骤5:解除对通讯槽阵列的排他锁;步骤6:安排一个回调函数,该函数将在此会话结束时执行以放弃获取的插槽;步骤7:返回是否成功获取通信槽。
在一实施例中,一旦获得了通信槽,用户数据库会话就会继续使用它,直到会话结束。用户数据库会话放弃通信槽的过程如下:步骤1:将插槽的状态更改为FREEING;步骤2:在对应代理的记账数据中设置信号量(Semaphore),通知代理槽状态的变化;代理将处理插槽的回收。
在一实施例中,用户数据库会话通过共享内存通信槽发送请求并接收响应的过程如下:步骤1:将请求存入通信槽的Request buffer中。请求被格式化并序列化为连续的字节序列,通信槽的索引作为请求的一部分存储;步骤2:将插槽的状态更改为REQUEST_AVAIL;步骤3:在对应代理的记账数据中设置Semaphore,通知代理新请求的到来;步骤4:等待通信槽中的信号量以获取来自代理的通知;步骤5:收到代理模块(proxy)的Semaphore通知后,检查插槽状态:(1)如果状态为RESPONSE_AVAIL,表示该插槽的Response buffer中有响应;在这种情况下,将插槽状态更改为EMPTY并返回响应。(2) 如果状态为ERROR,则将插槽状态更改为EMPTY并返回错误。(3)如果状态为REQUEST_AVAIL或WAIT_RESPONSE,则返回步骤4继续等待。
需要说明的是,数据处理请求是一个较大的缓冲区,可以存储对GTM的单个请求,当用户互数据库会话需要向GTM发出请求时,他会构造请求并将其存入请求缓冲区,然后代理模块从缓冲区读取请求并将其发送到GTM进行处理,用户数据库会话可以将请求格式化并序列化为特定的格式。而数据的响应的另一个缓冲区,能够存储来自GTM的单个响应,当代理模块收到来自GTM的响应时,他将响应存储到缓冲区,并通过插槽的信号量通知用户数据库会话其可用性。共享内存通信槽阵列受读写锁保护,该锁本身也驻留在共享内存区域中。当用户数据库会话需要获取一个槽时,它以写(或说独占)模式获取此锁,并选择一个空闲槽供使用。每个代理重复扫描通信槽以查找未完成的请求和已被数据库会话放弃并因此应回收的时隙;代理在扫描之前以读取(或说共享)模式获取此锁。
本申请实施例提供的数据处理方法,通过信号量可以确定插槽的当前状态,且后续通过记录插槽的通信状态,能够实现高效地分配数据处理请求,便于后续提升处理数据处理请求的效率。
此外,在共享内存区域中还具有薄记数据的部分,该薄记数据是记录对应的代理模块的状态,比如和代理模块是连接状态还是断开状态,同时还会记录插槽的索引数据,记录有多少个插槽被使用,有多少个插槽被闲置,也即是说,共享内存区域中的薄记数据是记录代理模块本身的状态;例如,请求接收模块根据目标处理数量将与目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中之后,还包括:请求接收模块基于与目标处理数量相同的数据处理请求,在共享内存区域中的代理薄记中记录待分配处理插槽的状态,并记录与代理模块的连接状态。
在一些实施方式中,代理薄记可以理解为薄记数据(per-proxy Data),且在薄记数据中的单元数是根据主节点上的代理模块的数量设置的,每个代理模块的薄记数据包括以下字段:用户数据库会话使用信号量来通知代理模块新数据处理请求的到达或者通信槽的放弃,在收到通知时候,代理模块开始扫描通信时序以进行工作;代理标识(proxy_id)是拥有此薄记数据的代理的唯一标识;插槽的索引数据是由该代理模块服务的通信槽的索引数组,当主节点上有多个代理模块时,每个代理模块将服务一个插槽子集,其插槽索引数组标识这些插槽。
在一实施例中,请求接收模块基于与目标处理数量相同的数据处理请求,在共享内存区域的代理薄记中记录待分配处理插槽的状态,同时,也可记录代理模块的状态。
本申请实施例提供的数据处理方法,通过在共享内存区域的代理薄记中记录插槽的状态以及代理模块的状态,基于状态的转换与记录,不仅能够减少并发访问的冲突,也能减少资源开销。
步骤304:将目标数据处理请求转发至全局事务管理器,接收全局事务管理器对目标数据处理请求进行处理的处理结果,并将处理结果返回至每个数据处理请求对应的客户端。
在一些实施方式中,分布式数据处理节点将确定的目标数据处理请求转发至全局事务管理器,在全局事务管理器对该数据处理请求进行处理之后,再向分布式数据处理节点返 回对目标数据处理请求进行处理的处理结果,并将该处理结果返回至每个数据处理请求对应的客户端。
例如,将目标数据处理请求转发至全局事务管理器,包括:代理模块在确定与全局事务管理器建立通信连接的情况下,基于通信连接将目标数据处理请求转发至全局事务管理器。
在一实施方式中,分布式数据库的主节点中的代理模块在确定与全局事务管理器建立通信连接的情况下,将目标数据处理请求通过该通信连接转发至全局事务管理器。
在确定目标数据处理请求的类型为分布式快照请求后,请求接收模块还可接收全局事务管理器返回的计算分布式快照请求的结果;例如,将目标数据处理请求转发至全局事务管理器,并接收全局事务管理器对目标数据处理请求进行处理的处理结果,包括:请求接收模块在确定目标数据处理请求为分布式快照请求的情况下,将分布式快照请求转发至所述全局事务管理器;请求接收模块接收全局事务管理器计算每个分布式快照请求的结果。
在一实施例中,在请求接收模块确定目标处理处理请求为分布式快照请求之后,将该分布式快照请求发送到GTM,而GTM通过计算分布式快照来处理请求,并使用包含分布式快照和组合请求中包含的槽索引列表的响应来回复代理模块,在收到响应后,代理模块将分布式快照的副本存储到响应中包含的每个通信槽的响应缓冲区中。
在一实施例中,本实施例还支持利用混合通信模式将分布式快照请求转发至全局事务管理器,步骤1:给定一组在等待队列里的通信槽里的新请求,代理将检查它们,并找到所有为只读事务发起的获取分布式快照的请求,以及为任何一种事务发起的获取事务ID的请求。步骤2:如果当前处理的这组事务确实包含多个为只读事务发起的获取分布式快照的请求,代理将合并它们,并构造一个新的联合请求。它的类型仍旧为分布式快照请求,并包含了一个槽索引列表,指明每个原始请求的来源是哪个通信槽。代理将这个新的联合请求发送给GTM。GTM计算出分布式快照后,把它以及收到的槽索引列表一起作为响应返回给代理。在收到GTM的响应后,代理拷贝消息中的分布式快照到每一个槽索引指向的通信槽的结果数据缓冲区。步骤3:同样的,如果当前处理的这组事务确实包含了多个获取事务ID的请求,代理也将合并它们,并构造一个新的联合请求,其类型仍旧为全局事务ID请求,同时包含了一个槽索引列表表明它们都是来源于哪个通信槽。这个联合请求发给GTM后,GTM分配一组连续的事务ID,把这一段连续事务ID的范围信息以及槽索引列表返回给代理。收到响应后,代理把该范围内的每个事务ID逐个写入到槽索引列表指向的每个索引的数据缓冲区。
在另一实施方式中,分布式数据库中的主节点中还可配置多个代理模块;例如,请求接收模块基于共享内存区域与所述代理模块建立通信连接,包括:请求接收模块基于预设配置数量确定与预设配置数量相同的两个或两个以上的代理模块,并基于共享内存区域与两个或两个以上的代理模块建立通信连接。
在一实施例中,主节点上的代理模块数量是可配置的,当有多个代理时,我们的方法以负载平衡的方式将用户会话分配给这些GTM代理。假设每个主服务器上有K个代理。每当用户数据库会话需要获取通信槽时,它从头开始扫描通信槽数组(即数组中的第一个槽位)以找到状态为空闲的时隙。然后它通过计算确定哪个代理服务这个插槽:proxy id= slot index modulo K(即代理数量);换句话说,插槽以静态循环方式分配给代理。这种分配是简单的、确定性的,并且通常会导致将工作平均分配给代理。更复杂的替代分配策略是可能的。例如,会话可能会检查和比较代理的插槽索引数组的长度,并选择服务最少插槽数的那个。
在一实施例中,当有多个代理模块的时,本实施例提出一种负载均衡的方式将用户会话分配给多个代理模块;例如,将目标数据处理请求转发至全局事务管理器,包括:请求接收模块基于预设负载平衡机制将目标数据处理请求分配至两个或两个以上的代理模块,代理模块将目标数据处理请求转发至全局事务管理器。
在一实施例中,在请求接收模块基于预设负载平衡机制将目标数据处理请求分配至两个或两个以上的代理模块后,代理模块会自动检测并消除并发事务对GTM的冗余请求,以减少到GTM的网络流量和GTM执行的工作量。代理检查两种类型的并发请求,它们可以一起提供服务,因此可以被视为冗余。第一种类型是来自并发只读事务的分布式快照请求。GTM可以计算单个分布式快照并将其作为对所有这些请求的响应返回。因此,代理首先将多个并发的分布式快照请求呈现为单个组合消息,发送到GTM,并在收到响应后,将响应的副本返回给每个请求事务。第二种类型是从并发事务请求全局事务ID。GTM可以返回一个连续范围的全局事务ID,而不是为每个事务分配单独的ID,并让代理为每个请求事务处理一个。
通过代理模块进行消息缩减和分组的过程如下:步骤1:给定一组在等待队列里的通信槽里的新请求,代理将检查它们,并找到所有为只读事务发起的获取分布式快照的请求,以及为任何一种事务发起的获取事务ID的请求。步骤2:如果当前处理的这组事务确实包含多个为只读事务发起的获取分布式快照的请求,代理将合并它们,并构造一个新的联合请求。它的类型仍旧为分布式快照请求,并包含了一个槽索引列表,指明每个原始请求的来源是哪个通信槽。代理将这个新的联合请求发送给GTM。GTM计算出分布式快照后,把它以及收到的槽索引列表一起作为响应返回给代理。在收到GTM的响应后,代理拷贝消息中的分布式快照到每一个槽索引指向的通信槽的结果数据缓冲区。步骤3:同样的,如果当前处理的这组事务确实包含了多个获取事务ID的请求,代理也将合并它们,并构造一个新的联合请求,其类型仍旧为全局事务ID请求,同时包含了一个槽索引列表表明它们都是来源于哪个通信槽。这个联合请求发给GTM后,GTM分配一组连续的事务ID,把这一段连续事务ID的范围信息以及槽索引列表返回给代理。收到响应后,代理把该范围内的每个事务ID逐个写入到槽索引列表指向的每个索引的数据缓冲区。
在一实施例中,本申请实施例中每个代理模块通过他们的通信槽为一个或多个用户数据库会话提供服务,其中,代理模块的工作流程如下:步骤1:等待代理簿记数据的信号量以获取来自用户数据库会话的通知。收到通知后,转到步骤2;步骤2:获取通信槽阵列上的共享锁;步骤3:扫描每个插槽,其索引记录在代理的插槽索引数组中。这些是为此代理提供服务的插槽。对于每个这样的插槽,检查其状态。如果状态为FREEING,则将该插槽添加到名为freeing_slots的列表中;如果状态为REQUEST_AVAIL,则将该槽添加到另一个名为pending_requests的列表中;步骤4:对于freeing_slots列表中的每个槽,通过将其状态更改为FREE来回收它,或者从代理的插槽索引数组中删除此插槽;步骤5: 解除对通讯槽阵列的锁定;步骤6:从pending_requests列表中的插槽发送请求:步骤6.1:如果有多个待处理请求,将它们合并为一个或多个组合请求,并将组合消息发送给GTM。否则,将单个请求发送到GTM。构建组合请求的细节将在后面描述。发送请求后,将插槽状态更改为WAIT_RESPONSE。如果代理发送请求失败,则将相应槽的状态更改为ERROR,将这些槽从pending_requests列表中删除,重新建立与GTM的连接,然后转到步骤1。步骤6.2:接收来自GTM的响应。GTM可能会返回对应于多个请求的组合响应,或单个响应。对于组合响应,从中构建单个响应。对于每个响应,将其存储到相应通信槽的响应缓冲区中,并将槽状态更改为RESPONSE_AVAIL。如果代理没有收到GTM的响应(例如,由于网络断开或GTM故障),将相应的插槽状态设置为ERROR。一旦插槽的状态更改为RESPONSE_AVAIL或ERROR,设置插槽的信号量以通知数据库会话,并从pending_requests列表中删除插槽。当pending_requests列表为空时,转到步骤1。
基于此,数据库会话和代理都通过原子指令(例如,比较和交换)修改插槽状态。需要注意的是,一个通信槽的回收分两步完成:会话将槽状态设置为FREEING,然后代理回收它并将其状态设置为FREE。这种安排减少了对通信槽阵列的并发访问的冲突,从而减少了锁定开销。
综上,本申请实施例提供的数据处理方法,在采用集中式全局事务管理器的多主分布式数据库中,每个主节点可配置数量的代理模块,用于收集、组合和转发来自该用户会话的GTM请求,并将返回的响应路由回用户会话。代理模块和用户会话之间的通信是通过主服务器上的高效共享内存机制进行的;一个master上的代理模块数量是可配置的。当有多个代理模块时,我们的方法以负载平衡的方式将用户会话分配给这些代理模块;代理自动检测并消除并发事务对GTM的冗余请求,以减少到GTM的网络流量和GTM执行的工作量;数据库***允许每个单独的用户会话动态指定是否使用代理或专用连接到GTM来执行会话中的事务。
与上述方法实施例相对应,本申请还提供了数据处理装置实施例,图6示出了本申请一个实施例提供的一种数据处理装置600的结构示意图。如图6所示,该装置600应用于分布式数据处理***的数据处理节点,包括:请求接收模块602,被配置为接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求;代理模块604,被配置为将目标数据处理请求转发至全局事务管理器,并接收全局事务管理器对目标数据处理请求进行处理的处理结果,且将处理结果返回至每个数据处理请求对应的客户端。
在一实施例中,请求接收模块602,进一步被配置为基于共享内存区域与代理模块建立通信连接。
在一实施例中,请求接收模块602,进一步被配置为基于多个数据处理请求的数量在共享内存区域中确定待分配处理插槽的数量,并将待分配处理插槽的数量作为目标处理数量。
在一实施例中,请求接收模块602,进一步被配置为根据目标处理数量将与目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中;
在一实施例中,代理模块604,进一步被配置为从待分配处理插槽中获取与目标处理 数量相同的数据处理请求,并将数据处理请求作为目标数据处理请求。
在一实施例中,代理模块604,进一步被配置为在确定与全局事务管理器建立通信连接的情况下,基于通信连接将目标数据处理请求转发至全局事务管理器。
在一实施例中,请求接收模块602,进一步被配置为基于目标处理数量确定共享内存区域的每个待分配处理插槽的信号量;基于信号量将与目标处理数量相同的数据处理请求放置于共享内存区域的待分配处理插槽中,并修改待分配处理插槽中的通信状态。
在一实施例中,请求接收模块602,进一步被配置为基于与目标处理数量相同的数据处理请求,在共享内存区域中的代理薄记中记录待分配处理插槽的状态,并记录与代理模块的连接状态。
在一实施例中,请求接收模块602,进一步被配置为在确定目标数据处理请求为分布式快照请求的情况下,将分布式快照请求转发至全局事务管理器;
在一实施例中,请求接收模块602,进一步被配置为接收全局事务管理器计算每个分布式快照请求的结果。
在一实施例中,请求接收模块602,进一步被配置为基于预设配置数量确定与预设配置数量相同的两个或两个以上的代理模块,并基于共享内存区域与两个或两个以上的代理模块建立通信连接。
在一实施例中,请求接收模块602,进一步被配置为基于预设负载平衡机制将目标数据处理请求分配至两个或两个以上的代理模块,代理模块将目标数据处理请求转发至全局事务管理器。
在一实施例中,请求接收模块602,进一步被配置为基于预设项目需求判断是否启动数据处理节点中的代理模块,若是,则向客户端发送数据处理指令,其中,数据处理指令为将客户端发送的请求转发至代理模块处理的指令。
本申请实施例提供的数据处理装置,通过在数据处理节点中对接收到的客户端发送的多个数据处理请求进行限流处理,按照一定数量的数据处理请求,分批次地处理由客户端发送的所有数据处理请求,能够减少数据处理节点的请求处理的压力,并且,后续将目标数据处理请求转发至全局事务管理器中也能提高全局事务管理器的处理效率,提升整个分布式数据处理***的处理性能。
上述为本申请实施例的一种数据处理装置的示意性方案。需要说明的是,该数据处理装置的技术方案与上述的数据处理方法的技术方案属于同一构思,数据处理装置的技术方案未详细描述的细节内容,均可以参见上述数据处理方法的技术方案的描述。
图7示出了根据本申请一个实施例提供的一种计算设备700的结构框图。该计算设备700的部件包括但不限于存储器710和处理器720。处理器720与存储器710通过总线730相连接,数据库750用于保存数据。
计算设备700还包括接入设备740,接入设备740使得计算设备700能够经由一个或多个网络760通信。这些网络的示例包括公用交换电话网(PSTN)、局域网(LAN)、广域网(WAN)、个域网(PAN)或诸如因特网的通信网络的组合。接入设备740可以包括有线或无线的任何类型的网络接口(例如,网络接口卡(NIC))中的一个或多个,诸如IEEE802.11无线局域网(WLAN)无线接口、全球微波互联接入(Wi-MAX)接口、以太 网接口、通用串行总线(USB)接口、蜂窝网络接口、蓝牙接口、近场通信(NFC)接口,等等。
在本申请的一个实施例中,计算设备700的上述部件以及图7中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图7所示的计算设备结构框图仅仅是出于示例的目的,而不是对本申请范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。
计算设备700可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或PC的静止计算设备。计算设备700还可以是移动式或静止式的服务器。
处理器720用于执行如下计算机可执行指令,其中,处理器720执行计算机可执行指令时实现上述数据处理方法的步骤。
上述为本申请实施例的一种计算设备的示意性方案。需要说明的是,该计算设备的技术方案与上述的数据处理方法的技术方案属于同一构思,计算设备的技术方案未详细描述的细节内容,均可以参见上述数据处理方法的技术方案的描述。
本申请一实施例还提供一种计算机可读存储介质,其存储有计算机可执行指令,该计算机可执行指令被处理器执行时实现上述数据处理方法的步骤。
上述为本申请实施例的一种计算机可读存储介质的示意性方案。需要说明的是,该存储介质的技术方案与上述的数据处理方法的技术方案属于同一构思,存储介质的技术方案未详细描述的细节内容,均可以参见上述数据处理方法的技术方案的描述。
本申请一个实施例中提供的数据处理方法,应用于分布式数据处理***的数据处理节点,包括:接收客户端发送的多个数据处理请求,基于多个数据处理请求的数量确定目标处理数量,并根据目标处理数量对多个数据处理请求进行限流处理,获得目标数据处理请求;将目标数据处理请求转发至全局事务管理器,接收全局事务管理器对目标数据处理请求进行处理的处理结果,并将处理结果返回至每个数据处理请求对应的客户端。例如,通过在数据处理节点中对接收到的客户端发送的多个数据处理请求进行限流处理,按照一定数量的数据处理请求,分批次地处理由客户端发送的所有数据处理请求,能够减少数据处理节点的请求处理的压力,并且,后续将目标数据处理请求转发至全局事务管理器中也能提高全局事务管理器的处理效率,提升整个分布式数据处理***的处理性能。
上述对本申请特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
上述计算机指令包括计算机程序代码,计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只 读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,本申请中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本申请实施例所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
以上公开的本申请实施例只是用于帮助阐述本申请。可选实施例并没有详尽叙述所有的细节,也不限制该发明仅为本申请具体实施方式。显然,根据本申请实施例的内容,可作很多的修改和变化。本申请选取并具体描述这些实施例,是为了更好地解释本申请实施例的原理和实际应用,从而使所属技术领域技术人员能很好地理解和利用本申请。本申请仅受权利要求书及其全部范围和等效物的限制。

Claims (14)

  1. 一种数据处理方法,应用于分布式数据处理***的数据处理节点,包括:
    接收客户端发送的多个数据处理请求,基于所述多个数据处理请求的数量确定目标处理数量,并根据所述目标处理数量对所述多个数据处理请求进行限流处理,获得目标数据处理请求;
    将所述目标数据处理请求转发至全局事务管理器,并接收所述全局事务管理器对所述目标数据处理请求进行处理的处理结果,且将所述处理结果返回至每个数据处理请求对应的客户端。
  2. 根据权利要求1所述的数据处理方法,所述数据处理节点包括请求接收模块、共享内存区域以及代理模块;
    所述基于所述多个数据处理请求的数量确定目标处理数量之前,还包括:
    所述请求接收模块基于所述共享内存区域与所述代理模块建立通信连接。
  3. 根据权利要求2所述的数据处理方法,所述基于所述多个数据处理请求的数量确定目标处理数量,包括:
    所述请求接收模块基于所述多个数据处理请求的数量在所述共享内存区域中确定待分配处理插槽的数量,并将所述待分配处理插槽的数量作为目标处理数量。
  4. 根据权利要求3所述的数据处理方法,所述根据所述目标处理数量对所述多个数据处理请求进行限流处理,获得目标数据处理请求,包括:
    所述请求接收模块根据所述目标处理数量将与所述目标处理数量相同的数据处理请求放置于所述共享内存区域的待分配处理插槽中;
    所述代理模块从所述待分配处理插槽中获取与所述目标处理数量相同的数据处理请求,并将所述数据处理请求作为目标数据处理请求。
  5. 根据权利要求2或4所述的数据处理方法,所述将所述目标数据处理请求转发至全局事务管理器,包括:
    所述代理模块在确定与所述全局事务管理器建立通信连接的情况下,基于所述通信连接将所述目标数据处理请求转发至所述全局事务管理器。
  6. 根据权利要求4所述的数据处理方法,所述请求接收模块根据所述目标处理数量将与所述目标处理数量相同的数据处理请求放置于所述共享内存区域的待分配处理插槽中,包括:
    所述请求接收模块基于所述目标处理数量确定所述共享内存区域的每个待分配处理插槽的信号量;
    基于所述信号量将与所述目标处理数量相同的数据处理请求放置于所述共享内存区域的待分配处理插槽中,并修改所述待分配处理插槽中的通信状态。
  7. 根据权利要求6所述的数据处理方法,所述请求接收模块根据所述目标处理数量将与所述目标处理数量相同的数据处理请求放置于所述共享内存区域的待分配处理插槽中之后,还包括:
    所述请求接收模块基于所述与目标处理数量相同的数据处理请求,在所述共享内存区域中的代理薄记中记录所述待分配处理插槽的状态,并记录与所述代理模块的连接状态。
  8. 根据权利要求7所述的数据处理方法,所述将所述目标数据处理请求转发至全局事务管理器,并接收所述全局事务管理器对所述目标数据处理请求进行处理的处理结果,包括:
    所述请求接收模块在确定所述目标数据处理请求为分布式快照请求的情况下,将所述分布式快照请求转发至所述全局事务管理器;
    所述请求接收模块接收所述全局事务管理器计算所述每个分布式快照请求的结果。
  9. 根据权利要求2所述的数据处理方法,所述请求接收模块基于共享内存区域与所述代理模块建立通信连接,包括:
    所述请求接收模块基于预设配置数量确定与所述预设配置数量相同的两个或两个以上的代理模块,并基于共享内存区域与所述两个或两个以上的代理模块建立通信连接。
  10. 根据权利要求9所述的数据处理方法,所述将所述目标数据处理请求转发至全局事务管理器,包括:
    所述请求接收模块基于预设负载平衡机制将所述目标数据处理请求分配至所述两个或两个以上的代理模块,所述代理模块将所述目标数据处理请求转发至全局事务管理器。
  11. 根据权利要求10所述的数据处理方法,所述接收客户端发送的多个数据处理请求之前,还包括:
    所述请求接收模块基于预设项目需求判断是否启动所述数据处理节点中的代理模块,若是,则向客户端发送数据处理指令,其中,所述数据处理指令为将所述客户端发送的请求转发至代理模块处理的指令。
  12. 一种数据处理***,所述数据处理***包括数据处理节点,所述数据处理节点包括:
    请求接收模块,被配置为接收客户端发送的多个数据处理请求,基于所述多个数据处理请求的数量确定目标处理数量,并根据所述目标处理数量对所述多个数据处理请求进行限流处理,获得目标数据处理请求;
    代理模块,被配置为将所述目标数据处理请求转发至全局事务管理器,并接收所述全局事务管理器对所述目标数据处理请求进行处理的处理结果,且将所述处理结果返回至每个数据处理请求对应的客户端。
  13. 一种计算设备,包括:
    存储器和处理器;
    所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,其中,所述处理器执行所述计算机可执行指令时实现权利要求1-11任意一项所述数据处理方法的步骤。
  14. 一种计算机可读存储介质,其存储有计算机可执行指令,该计算机可执行指令被处理器执行时实现权利要求1-11任意一项所述数据处理方法的步骤。
PCT/CN2022/127511 2021-11-12 2022-10-26 数据处理方法以及*** WO2023082992A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111339770.5A CN114254036A (zh) 2021-11-12 2021-11-12 数据处理方法以及***
CN202111339770.5 2021-11-12

Publications (1)

Publication Number Publication Date
WO2023082992A1 true WO2023082992A1 (zh) 2023-05-19

Family

ID=80792453

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127511 WO2023082992A1 (zh) 2021-11-12 2022-10-26 数据处理方法以及***

Country Status (2)

Country Link
CN (1) CN114254036A (zh)
WO (1) WO2023082992A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254036A (zh) * 2021-11-12 2022-03-29 阿里巴巴(中国)有限公司 数据处理方法以及***
CN114691051B (zh) * 2022-05-30 2022-10-04 恒生电子股份有限公司 数据处理方法以及装置
CN117971506B (zh) * 2024-03-29 2024-06-18 天津南大通用数据技术股份有限公司 Mpp数据库查询任务均衡的方法、***、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462594A (zh) * 2014-04-10 2017-02-22 华为技术有限公司 一种大规模并行处理数据库的***和方法
CN108984571A (zh) * 2017-06-05 2018-12-11 中兴通讯股份有限公司 事务标识操作方法、***和计算机可读存储介质
CN111427966A (zh) * 2020-06-10 2020-07-17 腾讯科技(深圳)有限公司 数据库事务处理方法、装置及服务器
CN111680050A (zh) * 2020-05-25 2020-09-18 杭州趣链科技有限公司 一种联盟链数据的分片处理方法、设备和存储介质
CN112685142A (zh) * 2020-12-30 2021-04-20 北京明朝万达科技股份有限公司 分布式数据处理***
CN114254036A (zh) * 2021-11-12 2022-03-29 阿里巴巴(中国)有限公司 数据处理方法以及***

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070079072A1 (en) * 2005-09-30 2007-04-05 Collier Josh D Preemptive eviction of cache lines from a directory
US10885023B1 (en) * 2014-09-08 2021-01-05 Amazon Technologies, Inc. Asynchronous processing for synchronous requests in a database
RU2016123959A (ru) * 2016-06-16 2017-12-21 Общество С Ограниченной Ответственностью "Яндекс" Способ и система для обработки запроса на транзакцию в распределенных системах обработки данных
US10810268B2 (en) * 2017-12-06 2020-10-20 Futurewei Technologies, Inc. High-throughput distributed transaction management for globally consistent sharded OLTP system and method of implementing
CN111078147B (zh) * 2019-12-16 2022-06-28 南京领行科技股份有限公司 一种缓存数据的处理方法、装置、设备及存储介质
CN112988883B (zh) * 2019-12-16 2023-03-10 金篆信科有限责任公司 数据库的数据同步方法、装置以及存储介质
CN113032410B (zh) * 2019-12-25 2024-05-03 阿里巴巴集团控股有限公司 数据处理方法、装置、电子设备及计算机存储介质
CN111143389B (zh) * 2019-12-27 2022-08-05 腾讯科技(深圳)有限公司 事务执行方法、装置、计算机设备及存储介质
CN113495872A (zh) * 2020-04-08 2021-10-12 北京万里开源软件有限公司 分布式数据库中的事务处理方法及***


Also Published As

Publication number Publication date
CN114254036A (zh) 2022-03-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891788

Country of ref document: EP

Kind code of ref document: A1