WO2021143010A1 - Method and device for responding to distributed computing tasks - Google Patents

Method and device for responding to distributed computing tasks

Info

Publication number
WO2021143010A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
data
distributed computing
target
historical
Prior art date
Application number
PCT/CN2020/092723
Other languages
English (en)
French (fr)
Inventor
吴昌远
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021143010A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Definitions

  • This application belongs to the field of data processing technology, and in particular relates to a method and device for responding to distributed computing tasks.
  • The inventor realized that when existing distributed computing technology uses a distributed computing engine, data changes often make it necessary to integrate multiple query tables in the database and to compute responses from the integrated data tables. Integrating the query tables, however, requires reorganizing the data in the tables. When a data table holds a large amount of data, reorganizing it consumes more hardware resources, which increases the processing time and reduces the efficiency of distributed computing.
  • The embodiments of the present application provide a method and device for responding to distributed computing tasks, to solve the problem that, in existing distributed computing technology, integrating query tables requires reorganizing the data in the tables, and that when a data table holds a large amount of data, more hardware resources must be consumed to reorganize it, which increases the processing time and lowers the efficiency of distributed computing.
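  • The first aspect of the embodiments of the present application provides a method for responding to distributed computing tasks, including:
  • if a distributed computing task is received, determining the target field of the distributed computing task;
  • extracting the target data of the target field from the benchmark query table to generate a field query table;
  • counting the field data amount of the field query table;
  • if the field data amount is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table based on the broadcast merge method; and
  • executing the distributed computing task based on the merged target configuration table of each distributed node.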
  • The second aspect of the embodiments of the present application provides a device for responding to distributed computing tasks, including:
  • a target field identification unit, configured to determine the target field of a distributed computing task if the distributed computing task is received;
  • a field query table generating unit, configured to extract the target data of the target field from the benchmark query table to generate a field query table;
  • a field data amount statistics unit, configured to count the field data amount of the field query table;
  • a broadcast merge trigger unit, configured to broadcast the field query table to each distributed node if the field data amount is less than the preset broadcast trigger threshold, so that the distributed node merges the field query table with its local target configuration table based on the broadcast merge method; and
  • a distributed computing task response unit, configured to execute the distributed computing task based on the merged target configuration table of each distributed node.
  • The third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When the processor executes the computer program, it implements the following:
  • if a distributed computing task is received, determining the target field of the distributed computing task;
  • extracting the target data of the target field from the benchmark query table to generate a field query table;
  • counting the field data amount of the field query table;
  • if the field data amount is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table based on the broadcast merge method; and
  • executing the distributed computing task based on the merged target configuration table of each distributed node.
  • The fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the first aspect.
  • By analyzing distributed computing tasks, the embodiments of the present application reduce the amount of data in the data table that must be sent and relieve the data read and write pressure on the distributed nodes, thereby improving the efficiency and response rate of distributed computing.
  • FIG. 1 is an implementation flowchart of a method for responding to distributed computing tasks provided by the first embodiment of the present application;
  • FIG. 2 is a specific implementation flowchart of a method for responding to distributed computing tasks provided by the second embodiment of the present application;
  • FIG. 3 is a specific implementation flowchart of S202 of a method for responding to distributed computing tasks provided by the third embodiment of the present application;
  • FIG. 4 is a specific implementation flowchart of S101 of a method for responding to distributed computing tasks provided by the fourth embodiment of the present application;
  • FIG. 5 is a specific implementation flowchart of S101 of a method for responding to distributed computing tasks provided by the fifth embodiment of the present application;
  • FIG. 6 is a specific implementation flowchart of a method for responding to distributed computing tasks provided by the sixth embodiment of the present application;
  • FIG. 7 is a specific implementation flowchart of S102 of a method for responding to distributed computing tasks provided by the seventh embodiment of the present application;
  • FIG. 8 is a structural block diagram of a device for responding to distributed computing tasks provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a terminal device provided by another embodiment of the present application.
  • The embodiments of the application analyze a distributed computing task to determine the target field required for the calculation, and extract the target data related to the target field from the benchmark query table to generate a field query table. The total data table is thereby split into a task-related sub-table, a large amount of invalid data unrelated to this calculation is removed, and the data volume of the sub-table is reduced. If the field data amount of the field query table is less than the broadcast trigger threshold, the field query table is sent to each distributed storage node by broadcast, so that the distributed storage node merges the field query table with the target configuration table using the Broadcast Join (broadcast merge) method. Because Broadcast Join does not need to reorganize the entire data table according to the key values of the associated fields, it is a fast way to merge data tables and improves the efficiency of distributed computing. This solves the problem that, in existing distributed computing technology, integrating query tables requires reorganizing the data in the tables, and that when a data table holds a large amount of data, more hardware resources must be consumed to reorganize it, which increases the processing time and lowers the efficiency of distributed computing.
  • In the embodiments of the present application, the execution subject of the process is a terminal device. The terminal device includes, but is not limited to, devices that can respond to distributed computing tasks, such as servers, computers, smartphones, and tablet computers. In particular, the terminal device may be a server in a distributed computing system deployed on the Spark engine. The server and a plurality of distributed storage nodes form a distributed computing system that stores the data uploaded by each user terminal and responds to distributed computing tasks.
  • The embodiments of the application can be applied to data processing for distributed computing tasks in artificial intelligence, such as machine learning and program logic design.
  • The embodiments of the present application can also be applied to data processing for distributed computing tasks in big data, such as data integration and data mining. The specific field can be determined by the actual application scenario and is not limited here.
  • Fig. 1 shows an implementation flowchart of the method for responding to distributed computing tasks provided by the first embodiment of the present application, which is described in detail as follows:
  • The user can generate a distributed computing task on a local user terminal and send it to the terminal device through the client corresponding to the distributed computing system. The distributed computing task carries the program identifier of the client. The terminal device can check the program identifier to determine whether the user terminal is a legitimate terminal; if so, it executes S101; otherwise, it treats the task as an invalid task.
  • The terminal device can also be configured with timed tasks: when a preset calculation trigger condition is met, a distributed computing task is created automatically and S101 is performed. An example is a periodically triggered calculation that compiles statistics on the current month's sales records on the last day of each month. A trigger script can be configured for this type of distributed computing task. When the terminal device detects that the current moment satisfies the preset trigger period, it executes the corresponding trigger script to generate the corresponding distributed computing task.
  • The distributed computing system is specifically a computing system built on the Spark engine. The basic principle of building a stream-processing framework on the Spark distributed computing system is to divide the stream data into multiple small data fragments and to process those fragments in a manner similar to batch processing. Spark Streaming is built on the Spark distributed computing system because, on the one hand, Spark's low-latency execution can be used for real-time computing; on the other hand, compared with record-based processing frameworks (such as Storm), resilient distributed datasets (RDDs) with narrow dependencies can be recomputed from the source data, which provides fault tolerance.
  • The Spark distributed computing system can be divided into at least one driver (the computing driver device, that is, the terminal device in this embodiment) and several executors. The executors run on the nodes across which the RDD is distributed, that is, the distributed nodes in this embodiment. The driver connects to the Spark cluster through SparkContext and creates RDDs, accumulators, and broadcast variables. The driver divides the computing task into a series of small fragments, namely tasks, and sends them to the distributed nodes for execution. The distributed nodes can communicate with one another. After each distributed node completes its own fragment, it sends its results to the driver, and the driver sends the response result to the user terminal.
  • Because the distributed computing system is built on the Spark engine, the aforementioned computing tasks may be tasks written in the Spark SQL language.
  • The distributed computing task may include calculation content. The terminal device analyzes the calculation content and determines the target field of the distributed computing task from the calculation type of that content and the target object for which the calculation is requested.
  • The target data of the target field is extracted from the benchmark query table to generate a field query table.
  • The terminal device stores a benchmark query table, which records the existing fields of all objects; that is, it is the total data table. A distributed computing task may involve only some of the fields in the benchmark query table. Therefore, to reduce the amount of data sent, a sub-table can be split off from the benchmark query table: it is sufficient to extract the target fields used for this calculation together with the data that each record holds for those fields, and there is no need to send the entire benchmark query table to the distributed nodes.
  • For example, the existing fields of the benchmark query table include "user number", "user age", "user address", "associated user list", and "contact information", and the calculation content of the distributed computing task received by the terminal device is to compute the average age of users, so the target fields are "user number" and "user age". The average age can be calculated from the "user number" and "user age" data in the benchmark query table alone. The terminal device therefore generates the field query table from the target data of these two fields and sends only that table to the distributed nodes, rather than the benchmark query table made up of all existing fields. Sending two of the five fields reduces the data transmission volume by about 60%, assuming fields of similar size.
  • The benchmark query table may contain the object information of multiple objects, and each piece of object information contains the parameter values of all existing fields of the benchmark query table. Extracting the target data of the target field therefore means extracting each object's parameter value for the target field; these values together form the target data.
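  • The extraction described above amounts to a column projection. The following is a minimal sketch in plain Python, assuming the benchmark query table is held as a list of per-object dictionaries; the sample rows and the function name `build_field_query_table` are hypothetical and only illustrate the "user number" / "user age" example.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]

# Hypothetical benchmark query table: one dict per object, one key per existing field.
BENCHMARK_TABLE: List[Row] = [
    {"user number": 1, "user age": 30, "user address": "A",
     "associated user list": [2], "contact information": "x"},
    {"user number": 2, "user age": 40, "user address": "B",
     "associated user list": [1], "contact information": "y"},
]


def build_field_query_table(benchmark_table: List[Row],
                            target_fields: Sequence[str]) -> List[Row]:
    """Keep only each object's parameter values for the target fields."""
    return [{field: row[field] for field in target_fields}
            for row in benchmark_table]


field_query_table = build_field_query_table(
    BENCHMARK_TABLE, ["user number", "user age"])

# The average age can be computed from the projected table alone.
average_age = sum(row["user age"] for row in field_query_table) / len(field_query_table)
```

  • Only the projected table would need to leave the terminal device; the three unused fields never reach the distributed nodes.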
  • After the terminal device extracts the field query table from the benchmark query table, it needs to determine the field data amount of the field query table, because the distributed computing system uses different data table merging methods depending on that amount. Therefore, before sending the field query table to each distributed node, the terminal device identifies the data amount of the target data corresponding to each target field and accumulates these amounts to obtain the field data amount of the entire field query table.
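  • A minimal sketch of the accumulation step, assuming the data amount of a value is measured as the byte length of its JSON serialization. That measure, the function name `field_data_amount`, and the sample table are assumptions for illustration; the text does not say how the amount is measured.

```python
import json
from typing import Dict, List, Sequence

Row = Dict[str, object]


def field_data_amount(field_query_table: List[Row],
                      target_fields: Sequence[str]) -> int:
    """Sum the per-field data amounts of the field query table, in bytes.

    Assumption: a value's size is the UTF-8 length of its JSON encoding.
    """
    per_field_bytes = {
        field: sum(len(json.dumps(row[field]).encode("utf-8"))
                   for row in field_query_table)
        for field in target_fields
    }
    # Accumulate the amounts of all target fields.
    return sum(per_field_bytes.values())


example_table = [
    {"user number": 1, "user age": 30},
    {"user number": 2, "user age": 40},
]
example_amount = field_data_amount(example_table, ["user number", "user age"])
```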
  • The data table merging methods of the distributed computing system include at least Broadcast Join (broadcast merge), Shuffle Hash Join (hash exchange merge), and Sort Merge Join (sort merge). Shuffle Hash Join and Sort Merge Join both need to restructure the data tables: the two tables are repartitioned according to the key of each record, and the records with the same key value in each partition are then joined, so that the data with the same key value in the two tables is combined and the two tables are merged. These methods transfer a large amount of data between distributed nodes and occupy the network input/output (I/O) interfaces and network resources of the distributed nodes. The broadcast merge method should therefore be used to merge data tables whenever possible. The trigger condition of the broadcast merge method is that the amount of data in the data table is less than the broadcast trigger threshold. If the amount of data is greater than or equal to the broadcast trigger threshold, the distributed computing system merges the data tables by Shuffle Hash Join or Sort Merge Join.
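  • The trigger condition can be sketched as a single comparison. The enum and function names below are hypothetical, and the two shuffle-based methods are grouped into one outcome because the text does not say how the system chooses between them.

```python
from enum import Enum


class MergeMethod(Enum):
    BROADCAST_JOIN = "broadcast join"
    SHUFFLE_BASED_JOIN = "shuffle hash join or sort merge join"


def choose_merge_method(field_data_amount: int,
                        broadcast_trigger_threshold: int) -> MergeMethod:
    """Pick the broadcast merge only when the table is below the threshold."""
    if field_data_amount < broadcast_trigger_threshold:
        return MergeMethod.BROADCAST_JOIN
    # At or above the threshold, fall back to a method that repartitions by key.
    return MergeMethod.SHUFFLE_BASED_JOIN
```

  • For comparison, Spark SQL exposes a similar size limit through its `spark.sql.autoBroadcastJoinThreshold` setting, which caps the size of a table that is broadcast to all worker nodes for a join.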
  • If the field data amount is less than the preset broadcast trigger threshold, the field query table is broadcast to each distributed node, so that the distributed node merges the field query table with its local target configuration table based on the broadcast merge method.
  • When the field data amount is less than the broadcast trigger threshold, the data tables can be merged using the broadcast merge method, so the field query table is sent to each distributed node by broadcast. After a distributed node receives the field query table sent by broadcast, it determines that this merge uses the Broadcast Join method and merges the field query table with the locally stored target configuration table.
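  • A minimal single-process sketch of what each node does after receiving the broadcast table, assuming an inner join on a shared key column. The node names, the sample rows, and the choice of "user number" as the join key are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def broadcast_join(field_query_table: List[Row],
                   local_config_table: List[Row],
                   key: str) -> List[Row]:
    """Merge a small broadcast table into a node's local table.

    The broadcast table is hashed once per node; the local rows are only
    probed against it, so the local table is never repartitioned or sorted.
    """
    lookup = {row[key]: row for row in field_query_table}
    merged = []
    for local_row in local_config_table:
        match = lookup.get(local_row[key])
        if match is not None:
            merged.append({**local_row, **match})
    return merged


field_query_table = [
    {"user number": 1, "user age": 30},
    {"user number": 2, "user age": 40},
]
# Hypothetical local target configuration tables, one per distributed node.
node_tables = {
    "node-a": [{"user number": 1, "region": "north"}],
    "node-b": [{"user number": 2, "region": "south"},
               {"user number": 3, "region": "east"}],
}
# Every node receives the same full copy of the small table.
merged_by_node = {
    node: broadcast_join(field_query_table, rows, "user number")
    for node, rows in node_tables.items()
}
```

  • Because every node holds the full small table, no rows of the larger local tables move between nodes, which is the saving the broadcast merge is meant to deliver.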
  • Optionally, the broadcast trigger threshold can be adjusted dynamically according to the current number of tasks. If many distributed computing tasks are currently being processed, each task is allocated fewer network resources; in this case the broadcast trigger threshold can be raised, which increases the probability of using the broadcast merge method. In idle time, that is, when few tasks are being processed, the broadcast trigger threshold can be lowered, and the merge is performed by Shuffle Hash Join or Sort Merge Join. In this case, before performing S104, the terminal device obtains the number of distributed computing tasks currently being processed, calculates the broadcast trigger threshold for that number of tasks through a preset broadcast trigger threshold conversion algorithm, and compares the broadcast trigger threshold with the field data amount.
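  • The conversion algorithm itself is not given in the text. The sketch below assumes a capped linear scaling purely for illustration; the function name and the parameters `step_ratio` and `max_ratio` are hypothetical.

```python
def threshold_for_task_count(base_threshold: int,
                             task_count: int,
                             step_ratio: float = 0.1,
                             max_ratio: float = 2.0) -> int:
    """Raise the broadcast trigger threshold as more tasks share the network.

    Assumption: the threshold grows linearly with the number of tasks
    currently being processed and is capped at max_ratio times the base.
    """
    ratio = min(1.0 + step_ratio * task_count, max_ratio)
    return int(base_threshold * ratio)
```

  • With these assumed defaults the threshold stays at its base value when the system is idle and at most doubles under load; any monotonically increasing mapping would match the behaviour described above.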
  • The distributed computing task is executed based on the merged target configuration table of each distributed node.
  • The terminal device sends the field query table to the multiple distributed nodes connected downstream, each distributed node merges the field query table with its local target configuration table, and after the merge each distributed node returns the merged target configuration table to the terminal device.
  • As the management device for multiple distributed nodes, the terminal device analyzes the computing tasks and determines the computing operations contained in each distributed computing task. The computing operations include, for example, target field extraction, data extraction, data merging, and calculation on the merged data, and different types of operation are assigned to different distributed nodes. Therefore, after obtaining the merged target configuration tables, the terminal device can determine the distributed nodes associated with the target fields required by the distributed computing task and send a data query task to each of them. The distributed nodes feed the queried data back to the terminal device. The terminal device then sends the received data to the distributed nodes that perform data merging and data calculation, which carry out the subsequent calculation tasks and feed the calculation results back to the terminal device. The distributed computing task is answered through this process.
  • As can be seen from the above, the method for responding to distributed computing tasks provided by this embodiment analyzes the distributed computing task, determines the target field required for the calculation, and extracts the target data related to the target field from the benchmark query table to generate a field query table. The total data table is thereby split into a task-related sub-table, a large amount of invalid data unrelated to this calculation is removed, and the data volume of the sub-table is reduced. When the field data amount of the field query table is less than the broadcast trigger threshold, the field query table is sent to each distributed storage node by broadcast, so that the distributed storage node merges the field query table with the target configuration table using the Broadcast Join method. Because Broadcast Join does not need to reorganize the entire data table according to the key values of the associated fields, it is a fast way to merge data tables and improves the efficiency of distributed computing.
  • Compared with existing distributed computing technology, this embodiment filters out the valid data when generating the data table, that is, the target data of the target fields related to this calculation, which reduces the amount of table data that needs to be sent. In addition, the field query table and the target configuration table are merged through the simple Broadcast Join method, which greatly reduces the number of data reorganization operations and the data read and write pressure on the distributed nodes, thereby improving the response rate of distributed computing.
  • Fig. 2 shows a specific implementation flowchart of the method for responding to distributed computing tasks provided by the second embodiment of the present application. In this embodiment, before the field query table is broadcast to each distributed node, the method further includes S201 to S205, detailed as follows:
  • Further, before the field query table is broadcast to each distributed node, the method further includes:
  • The terminal device can adjust the broadcast trigger threshold dynamically according to the current network conditions; specifically, the threshold can be determined from the historical broadcast thresholds configured in the past and the current network conditions. The terminal device therefore obtains the network resource parameters at the current moment and the associated historical operation records, and calculates the broadcast trigger threshold for the current moment from these two inputs. The current moment is specifically the moment at which the distributed computing task is received. Multiple network resource parameters may be obtained, including but not limited to the network packet loss rate, network transmission rate, bit error rate, and network delay.
  • Optionally, different characteristic time periods can be defined in advance on the terminal device. The terminal device determines the characteristic time period into which the current moment falls and takes the historical operation records whose creation time lies within that period as the historical operation records associated with the current moment.
  • The network resource parameters are imported into a preset threshold factor conversion model, and a first threshold factor is calculated.
  • The terminal device may be provided with a threshold factor conversion model. The terminal device imports the network resource parameters into the model, which outputs the first threshold factor corresponding to the network resource parameters at the current moment. The larger the value of a network resource parameter, the more network resources are currently available, the larger the corresponding first threshold factor, and the higher the probability of using the broadcast merge method. Conversely, the smaller the value of a network resource parameter, the fewer network resources are currently available, the smaller the corresponding first threshold factor, and the lower the probability of using the broadcast merge method. Optionally, the threshold factor conversion model may be a hash function.
  • The weight value of each historical operation record is configured based on its creation time.
  • Each historical operation record includes its creation time, and the terminal device can configure the weight value of each record according to the difference between its creation time and the current moment. The smaller the difference between the current moment and the creation time, the higher the weight value of the record; conversely, the greater the difference, the lower the weight value. The reason is that the smaller the difference, the less the system structure of the distributed computing system and the total amount of data in the database at the time of the historical record differ from the system structure and total amount of data at the current moment. The corresponding historical broadcast threshold therefore has a higher reference value and receives a larger weight value, which improves the accuracy of the current broadcast trigger threshold.
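  • One way to realize "more recent records weigh more" is an inverse-age weight. The form below, including the normalization so that the weights sum to 1, is an assumption for illustration; the text only fixes the direction of the relationship. The function name and sample timestamps are hypothetical.

```python
from typing import List, Sequence


def recency_weights(creation_times: Sequence[float], now: float) -> List[float]:
    """Weight each historical operation record by the inverse of its age.

    Assumption: weight is proportional to 1 / (1 + age), then normalized.
    Times are plain numbers in any consistent unit (for example, hours).
    """
    raw = [1.0 / (1.0 + (now - created)) for created in creation_times]
    total = sum(raw)
    return [weight / total for weight in raw]


# Two records created 10 and 1 time units before the current moment.
example_weights = recency_weights([90.0, 99.0], now=100.0)
```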
  • A second threshold factor is calculated according to the historical broadcast threshold and the weight value of each historical operation record.
  • Each historical operation record contains the historical broadcast threshold that was used for comparison when the historical computing task was answered. The terminal device can compute the weighted sum of the historical broadcast thresholds, using the weight value of each record, to obtain the second threshold factor.
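  • The weighted accumulation can be sketched as follows; the function name and the sample thresholds and weights are hypothetical.

```python
from typing import Sequence


def second_threshold_factor(historical_thresholds: Sequence[float],
                            weights: Sequence[float]) -> float:
    """Weighted sum of the historical broadcast thresholds."""
    if len(historical_thresholds) != len(weights):
        raise ValueError("each historical record needs exactly one weight")
    return sum(threshold * weight
               for threshold, weight in zip(historical_thresholds, weights))


# The more recent record (threshold 2000) carries the larger weight.
example_factor = second_threshold_factor([1000.0, 2000.0], [0.25, 0.75])
```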
  • The broadcast trigger threshold is calculated according to the first threshold factor and the second threshold factor.
  • After the terminal device has calculated the first threshold factor, which is related to the network resource parameters, and the second threshold factor, which is related to the historical broadcast thresholds, it can calculate the broadcast trigger threshold for the current moment from these two factors, thereby adjusting the broadcast trigger threshold dynamically.
  • In this embodiment, the current network resource parameters and the historical operation records are obtained to calculate the broadcast trigger threshold at the current moment, so that the threshold compared with the field data amount matches the current network load, which improves the accuracy of the broadcast trigger threshold.
  • FIG. 3 shows a specific implementation flowchart of S202 of the method for responding to distributed computing tasks provided by the third embodiment of the present application. In this embodiment, S202 includes S2021 to S2022, detailed as follows:
  • Further, importing the network resource parameters into a preset threshold factor conversion model to calculate the first threshold factor includes:
  • The terminal device obtains the maximum available resource parameters of the network it is currently connected to, that is, the upper limit of each network resource parameter. For example, if the network resource parameters include an uplink rate and a downlink rate, the maximum available resource parameters include the maximum uplink rate and the maximum downlink rate. Network resource parameters such as the bit error rate and the packet loss rate are converted into positively oriented parameters: for example, the data transmission accuracy rate, whose maximum corresponds to the minimum bit error rate, and the packet delivery success rate, whose maximum corresponds to the minimum packet loss rate.
  • The maximum available resource parameters and the network resource parameters are imported into the preset threshold factor conversion model to calculate the first threshold factor.
  • The threshold factor conversion model is specifically:
  • where FirstBrdcst is the first threshold factor;
  • CurrentResource_i is the i-th network resource parameter;
  • MaxWebResource_i is the i-th maximum available resource parameter;
  • BaseLv is the preset benchmark coefficient; and
  • n is the total number of network resource parameters.
  • The terminal device can calculate the ratio between each current network resource parameter and the corresponding maximum available resource parameter. The closer a network resource parameter is to its maximum available value, the better the current network environment and the larger the data table that can be transmitted, so the first threshold factor is larger. Conversely, the greater the gap between a network resource parameter and its maximum available value, the poorer the current network environment, and the smaller the first threshold factor.
  • In this embodiment, the terminal device obtains the maximum available resource parameters of the network it is currently on and compares the network resource parameters with them to calculate the first threshold factor. Each network resource parameter is thereby normalized, which improves the accuracy of the first threshold factor.
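  • The model's formula does not survive in this text, so the sketch below does not reproduce it. It assumes one plausible form that is consistent with the symbol definitions and the ratio-based description: BaseLv multiplied by the mean of the n ratios CurrentResource_i / MaxWebResource_i. The function name, the dictionary layout, and the sample values are hypothetical.

```python
from typing import Dict


def first_threshold_factor(current: Dict[str, float],
                           maximum: Dict[str, float],
                           base_lv: float) -> float:
    """Assumed form: BaseLv * (1/n) * sum(CurrentResource_i / MaxWebResource_i).

    All parameters are expected to be positively oriented (larger is better),
    so error-type metrics should be converted first, for example
    success_rate = 1 - packet_loss_rate.
    """
    n = len(current)
    if n == 0:
        raise ValueError("at least one network resource parameter is required")
    # Each ratio normalizes a parameter against its upper limit.
    ratios = [current[name] / maximum[name] for name in current]
    return base_lv * sum(ratios) / n


example_first_factor = first_threshold_factor(
    current={"uplink rate": 50.0, "downlink rate": 100.0},
    maximum={"uplink rate": 100.0, "downlink rate": 100.0},
    base_lv=2.0,
)
```

  • Dividing by the maximum puts rates, success ratios, and other parameters on the same 0 to 1 scale before they are combined, which is the normalization effect described above.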
  • FIG. 4 shows a specific implementation flowchart of S101 of the method for responding to distributed computing tasks provided by the fourth embodiment of the present application. In this embodiment, S101 includes S1011 to S1014, detailed as follows:
  • Further, determining the target field of the distributed computing task includes:
  • The distributed computing task is analyzed to obtain the task type of the distributed computing task.
  • In this embodiment, the terminal device can determine the target field associated with the distributed computing task by automatic identification, without the user setting it manually. Different computing tasks correspond to different task types, and different task types call for different data when they are answered. The terminal device can therefore analyze the distributed computing task and identify its task type. Specifically, the terminal device extracts the calculation content of the distributed computing task, extracts the calculation keywords from that content, and determines the task type associated with those keywords.
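  • A minimal sketch of keyword-based task type identification. The keyword table, the task type names, and the whitespace tokenization are all hypothetical; the text does not specify how keywords are extracted or mapped.

```python
from typing import Dict, Optional

# Hypothetical mapping from calculation keywords to task types.
KEYWORD_TO_TASK_TYPE: Dict[str, str] = {
    "average": "aggregation",
    "count": "aggregation",
    "join": "table merge",
}


def identify_task_type(calculation_content: str) -> Optional[str]:
    """Return the task type of the first known keyword in the content."""
    for word in calculation_content.lower().split():
        task_type = KEYWORD_TO_TASK_TYPE.get(word.strip(".,;"))
        if task_type is not None:
            return task_type
    return None  # no known calculation keyword found
```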
  • all historical response results of the terminal device's response are stored in the calculation response database.
  • the historical response result includes the target field associated when responding to the historical calculation task, that is, the aforementioned historical field.
  • the terminal device can extract, from the calculation response database, the historical response results that match the task type of the distributed computing task. Because the task type of each extracted historical response result is the same as that of the distributed computing task currently to be executed, the target fields associated with those historical response results are likely to be the target fields of the current computing task as well.
  • all the historical response results of the aforementioned terminal device responses can also be stored in the blockchain, that is, all historical response results can be stored based on the distributed storage of the blockchain.
  • the above-mentioned calculation response database may be implemented as an integrated database or as a data warehouse, which may be determined based on the actual application scenario and is not limited here.
  • the terminal device can count the number of occurrences of each historical field in all historical response results. The larger the number of occurrences, the higher the degree of association between the historical field and the task type; conversely, the smaller the number of occurrences of the historical field in all historical response results, the lower its degree of association with the task type.
  • the terminal device can calculate the appearance frequency according to the appearance time of the historical field in each historical response result, and based on the number of appearances and the appearance frequency, the correlation between each historical field and the task type can be identified.
  • the history field whose correlation degree is greater than a preset correlation threshold is selected as the target field.
  • after calculating the correlation degree between each historical field and the task type, the terminal device can select the historical fields whose correlation degree is greater than the correlation threshold as the target fields, thereby automatically identifying the target fields of the distributed computing task.
  • the historical response results matching the task type of the current computing task are obtained, and the target fields corresponding to the computing task are extracted automatically from the historical fields contained in those results, which reduces user operations and improves the response efficiency of the distributed computing task.
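Steps S1011 to S1014 can be sketched as below. The text only says that the correlation degree is derived from the number of occurrences and the appearance time of each historical field; the exponential recency weighting, the `half_life` parameter, and all function and field names here are assumptions made for illustration.

```python
from collections import defaultdict


def select_target_fields(history, now, threshold=0.5, half_life=30.0):
    """Pick fields whose recency-weighted share of past responses exceeds threshold.

    history: list of (timestamp_in_days, [field names]) for ONE task type.
    """
    weights = defaultdict(float)
    total = 0.0
    for timestamp, fields in history:
        # ASSUMPTION: a response loses half its weight every `half_life` days.
        w = 0.5 ** ((now - timestamp) / half_life)
        total += w
        for field in set(fields):  # count a field once per response
            weights[field] += w
    if total == 0.0:
        return []
    return sorted(f for f, w in weights.items() if w / total > threshold)


# Hypothetical history for the task type "average user age".
history = [
    (100, ["user_id", "user_age"]),
    (70, ["user_id", "user_address"]),
    (40, ["user_id"]),
]
targets = select_target_fields(history, now=100)
```

With these sample values `user_id` appears in every response and `user_age` in the most recent one, so both pass the 0.5 threshold, while `user_address` does not.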
  • FIG. 5 shows a specific implementation flow chart of a method S101 for responding to a distributed computing task provided by the fifth embodiment of the present application.
  • a response method S101 of a distributed computing task provided by this embodiment includes: S1015 to S1017, which are detailed as follows:
  • determining the target field of the distributed computing task includes:
  • the distributed computing task is based on a computing task generated using the Spark-SQL framework.
  • the terminal device can parse the distributed computing task and extract the SQL statement it carries. Because the SQL statement is used to query target data in the database, it carries the target field information for the computing task, so parsing the SQL statement allows the target field to be determined automatically.
  • semantic analysis is performed on the SQL statement to obtain query keywords corresponding to the SQL statement.
  • the terminal device can load a SQL statement library that records a number of standard segments. Based on the standard segments, the terminal device can extract from the SQL statement the feature statements that define the association of the queried data, and obtain the query keywords contained in those feature statements.
  • an existing field matching each of the query keywords is searched in the reference query table, and the existing field matching the query keyword is identified as the target field.
  • the terminal device can recognize whether each query keyword exists in an existing field in the reference query table, and if so, recognize the existing field as a target field.
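A minimal sketch of steps S1015 to S1017 follows. It uses a naive identifier tokenizer rather than a real SQL parser or the statement library described above, so identifiers inside string literals or comments would be picked up by mistake; the reserved-word set, function name, and field names are assumptions.

```python
import re

# ASSUMPTION: a small, incomplete set of SQL words to ignore.
SQL_RESERVED = {
    "select", "from", "where", "group", "by", "order", "and", "or", "as",
    "avg", "sum", "count", "min", "max", "join", "on", "having", "limit",
}


def target_fields_from_sql(sql, existing_fields):
    """Return the existing fields of the reference query table named in the SQL."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql)
    keywords = {t.lower() for t in tokens} - SQL_RESERVED
    # Keep the reference table's own field order.
    return [f for f in existing_fields if f.lower() in keywords]


existing = ["user_id", "user_age", "user_address", "contact"]
fields = target_fields_from_sql(
    "SELECT AVG(user_age) FROM users WHERE user_id > 100", existing
)
```

Here the table name `users` is also extracted as a keyword but is dropped because it matches no existing field.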
  • Fig. 6 shows a specific implementation flow chart of a method for responding to a distributed computing task provided by the sixth embodiment of the present application.
  • the method for responding to a distributed computing task provided in this embodiment, after the counting of the field data volume of the field query table, further includes S601 to S602, detailed as follows:
  • the method further includes:
  • if the terminal device detects that the field data volume of the field query table is greater than or equal to the broadcast trigger threshold, it recognizes that the two data tables need to be merged by Shuffle Hash Join. Shuffle Hash Join needs the data number of each piece of target data, that is, its key value, so that the distributed nodes can determine the associated data based on the key value. Therefore, the terminal device needs to add each data number to the field query table.
  • the field query table with the data numbers added is sent to each distributed node, so that each distributed node can, based on the data numbers, look up in the local data of its local target configuration table the associated data corresponding to each piece of field data, and reconstruct the target configuration table.
  • the terminal device sends the field query table with the data numbers added to each distributed node.
  • the distributed node can compare the key value of each piece of target data with the key values of the local data in its local target configuration table, recognize local data and target data with the same key value as mutually associated data, and reconstruct the target configuration table according to these associations, so as to merge the target data of the field query table into the target configuration table.
  • when the data volume is large, the two data tables are merged by Shuffle Hash Join, which reorganizes the data tables and facilitates the management of distributed data.
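The key-based merge in S601 and S602 can be illustrated with a single-process sketch of a shuffle hash join: both tables are partitioned by the hash of the key, and rows with equal keys are merged inside each partition. In a real cluster each partition would live on a different node; the function name, the row layout, and `num_partitions` are assumptions.

```python
from collections import defaultdict


def shuffle_hash_join(field_rows, local_rows, num_partitions=4):
    """Join two lists of (key, row_dict) pairs on key, partition by partition."""

    def partition(rows):
        parts = defaultdict(list)
        for key, row in rows:
            # "Shuffle": rows with the same key always land in the same partition.
            parts[hash(key) % num_partitions].append((key, row))
        return parts

    field_parts = partition(field_rows)
    local_parts = partition(local_rows)

    merged = []
    for p in range(num_partitions):
        # Build a hash table from the field query table side of this partition.
        table = defaultdict(list)
        for key, row in field_parts.get(p, []):
            table[key].append(row)
        # Probe it with the local target configuration table rows.
        for key, row in local_parts.get(p, []):
            for match in table.get(key, []):
                merged.append((key, {**row, **match}))
    return merged


field_rows = [(1, {"age": 30}), (2, {"age": 41})]
local_rows = [(2, {"city": "Shenzhen"}), (3, {"city": "Beijing"}), (1, {"city": "Shanghai"})]
joined = sorted(shuffle_hash_join(field_rows, local_rows), key=lambda r: r[0])
```

The partitioning step is what makes this approach costly on a network, since every row of both tables may have to move to another node.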
  • FIG. 7 shows a specific implementation flowchart of a method S102 for responding to a distributed computing task provided by the seventh embodiment of the present application.
  • a distributed computing task response method S102 provided in this embodiment includes: S1021 to S1022, and the details are as follows:
  • the terminal device can separate the field query table required for this calculation from the reference query table, that is, separate the small table from the large table. Specifically, the terminal device can determine each existing field in the reference query table, that is, whether the aforementioned reference field matches the target field, and if so, identify the reference field as a valid field.
  • after the terminal device has identified all valid fields in the reference query table, the data under each valid field is the target data to be extracted for the target field, and the field query table is separated from the reference query table based on the target data and the valid fields.
  • the amount of data in the data table to be sent can be reduced, and the consumption of network resources can be reduced.
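Steps S1021 and S1022 amount to a column projection of the reference query table. The sketch below assumes the table is a list of row dictionaries; the function name and the sample field names are hypothetical.

```python
def build_field_query_table(reference_table, target_fields):
    """Keep only the valid fields (reference fields that match a target field)."""
    if not reference_table:
        return []
    wanted = set(target_fields)
    # A reference field is "valid" when it matches one of the target fields.
    valid_fields = [f for f in reference_table[0] if f in wanted]
    return [{f: row[f] for f in valid_fields} for row in reference_table]


reference_table = [
    {"user_id": 1, "user_age": 30, "user_address": "A", "linked_users": [2], "contact": "x"},
    {"user_id": 2, "user_age": 41, "user_address": "B", "linked_users": [1], "contact": "y"},
]
field_query_table = build_field_query_table(reference_table, ["user_id", "user_age"])
dropped_share = 1 - len(field_query_table[0]) / len(reference_table[0])
```

In this example two of five fields survive, so three fifths of the columns no longer need to be sent to the distributed nodes.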
  • FIG. 8 shows a structural block diagram of a device for responding to a distributed computing task provided by an embodiment of the present application.
  • the units included in the device for responding to a distributed computing task are used to perform steps in the embodiment corresponding to FIG. 1 .
  • only the parts related to this embodiment are shown.
  • the response device for the distributed computing task includes:
  • the target field identification unit 81 is configured to determine the target field of the distributed computing task if the distributed computing task is received;
  • the field query table generating unit 82 is configured to extract the target data of the target field from the reference query table to generate a field query table
  • the field data volume statistics unit 83 is configured to count the field data volume of the field query table
  • the broadcast merging trigger unit 84 is configured to, if the field data volume is less than a preset broadcast trigger threshold, broadcast the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging;
  • the distributed computing task response unit 85 is configured to execute the distributed computing task based on the target configuration table after each distributed node is merged.
  • the response device for the distributed computing task further includes:
  • the network resource determining unit is configured to obtain network resource parameters at the current moment and historical operation records associated with the preset time period at the current moment;
  • the first threshold factor calculation unit is configured to import the network resource parameters into a preset threshold factor conversion model to calculate the first threshold factor
  • the weight value determining unit is configured to configure the weight value of the historical operation record based on the creation time of each historical operation record;
  • the second threshold factor calculation unit is configured to calculate the second threshold factor according to the historical broadcast threshold of each historical operation record and the weight value
  • the broadcast trigger threshold calculation unit is configured to calculate the broadcast trigger threshold according to the first threshold factor and the second threshold factor.
  • the first threshold factor calculation unit includes:
  • the maximum available resource parameter obtaining unit is used to obtain the maximum available resource parameter of the network where it is located;
  • the first threshold factor conversion unit is configured to import the maximum available resource parameter and the network resource parameter into a preset threshold factor conversion model to calculate the first threshold factor;
  • the threshold factor conversion model is specifically the formula shown in Figure PCTCN2020092723-appb-000002, where:
  • FirstBrdcst is the first threshold factor;
  • CurrentResource_i is the i-th network resource parameter;
  • MaxWebResource_i is the i-th maximum available resource parameter;
  • BaseLv is the preset benchmark coefficient;
  • n is the total number of network resource parameters.
  • the target field identifying unit 81 includes:
  • a task type identification unit for analyzing the distributed computing task to obtain the task type of the distributed computing task
  • the historical field obtaining unit is configured to extract historical response results matching the task type from the calculated response database, and identify the historical fields contained in each historical response result;
  • An association degree calculation unit configured to calculate the association degree of each historical field based on the number of occurrences and the appearance time of each historical field in all the historical response results
  • the target field selecting unit is configured to select the historical field whose correlation degree is greater than a preset correlation threshold as the target field.
  • the target field identifying unit 81 includes:
  • the SQL sentence extraction unit is used to extract the structured query language SQL sentence contained in the distributed computing task
  • the query keyword obtaining unit is used to perform semantic analysis on the SQL statement to obtain the query keyword corresponding to the SQL statement;
  • the query keyword screening unit is configured to query the existing field matching each of the query keywords in the reference query table, and identify the existing field matching the query keyword as the target field .
  • the response device for the distributed computing task further includes:
  • a field lookup table adjustment unit configured to add the data number of each target data to the field lookup table if the amount of field data is greater than or equal to the broadcast trigger threshold
  • the hash sorting and merging unit is configured to send the field query table with the data numbers added to each distributed node, so that each distributed node can, based on the data numbers, look up in the local data of its local target configuration table the associated data corresponding to each piece of field data, and reconstruct the target configuration table.
  • the field query table generating unit 82 includes:
  • a valid field identification unit configured to identify the reference field as a valid field if any reference field in the reference query table matches the target field of the distributed computing task
  • the target data selection unit is used to identify all the data associated with the valid fields as target data, and generate a field query table according to all the target data and the valid fields.
  • the distributed computing task response device can also filter out valid data when generating the data table, that is, the target data of the target field related to this calculation, thereby reducing the data in the data table that needs to be sent.
  • when the data volume of the data table is small, the field query table and the target configuration table are merged through the simple Broadcast Join method, which greatly reduces the number of data reorganization operations and the data read and write pressure on the distributed nodes, thereby improving the response rate of distributed computing.
  • FIG. 9 is a schematic diagram of a terminal device provided by another embodiment of the present application.
  • the terminal device 9 of this embodiment includes: a processor 90, a memory 91, and a computer program 92 that is stored in the memory 91 and can run on the processor 90, such as a program for responding to a distributed computing task.
  • when the processor 90 executes the computer program 92, the steps in the foregoing embodiments of the method for responding to a distributed computing task are implemented, for example, S101 to S105 shown in FIG. 1.
  • alternatively, when the processor 90 executes the computer program 92, the functions of the units in the foregoing device embodiments are implemented, for example, the functions of the units 81 to 85 shown in FIG. 8.
  • the computer program 92 may be divided into one or more units, and the one or more units are stored in the memory 91 and executed by the processor 90 to carry out the present application.
  • the one or more units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 92 in the terminal device 9.
  • the computer program 92 may be divided into a target field identification unit, a field look-up table generation unit, a field data volume statistics unit, a broadcast merge trigger unit, and a distributed computing task response unit. The specific functions of each unit are as described above.
  • the terminal device 9 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 90 and a memory 91.
  • FIG. 9 is only an example of the terminal device 9 and does not constitute a limitation on the terminal device 9. It may include more or fewer components than shown in the figure, a combination of certain components, or different components.
  • the terminal device may also include input and output devices, network access devices, buses, and so on.
  • the processor 90 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 91 may be an internal storage unit of the terminal device 9, for example, a hard disk or a memory of the terminal device 9.
  • the memory 91 may also be an external storage device of the terminal device 9, such as a plug-in hard disk equipped on the terminal device 9, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash card (Flash Card), etc.
  • the memory 91 may also include both an internal storage unit of the terminal device 9 and an external storage device.
  • the memory 91 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 91 can also be used to temporarily store data that has been output or will be output.
  • another embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the method provided in the embodiment of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

Abstract

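A method and device for responding to a distributed computing task. The method includes: if a distributed computing task is received, determining the target field of the distributed computing task; extracting the target data of the target field from a reference query table to generate a field query table; counting the field data volume of the field query table; if the field data volume is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging; and executing the distributed computing task based on the target configuration tables merged at the distributed nodes. The method filters out the valid data when the data table is generated, which reduces the volume of table data that must be sent, the number of data reorganization operations, and the data read and write pressure on the distributed nodes, and improves the response rate of distributed computing.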

Description

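Method and Device for Responding to a Distributed Computing Task
This application claims priority to Chinese patent application No. 2020100547822, entitled "Method and Device for Responding to a Distributed Computing Task", filed with the China Patent Office on January 17, 2020, the entire contents of which are incorporated herein by reference.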
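Technical Field
This application belongs to the field of data processing technology, and in particular relates to a method and device for responding to a distributed computing task.
Background
As digitization continues to advance, most documents can be digital files stored in cloud databases. To keep database access efficient, most data storage schemes are distributed: a single electronic file is split across multiple databases and handed to individual distributed nodes for storage. Responding to a data computing task therefore requires a distributed computing framework, such as the Spark engine, which extracts data from each distributed node based on query tables.
The inventor realized that, with existing distributed computing technology, because data may change, a distributed computing engine often has to integrate multiple query tables in the database and execute the computation against the integrated data table. Integrating the query tables, however, requires reorganizing every piece of data in them. When the data tables are large, this reorganization consumes considerable hardware resources, lengthens processing time, and lowers the efficiency of distributed computing.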
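Summary
In view of this, the embodiments of the present application provide a method and device for responding to a distributed computing task, to solve the problem in existing distributed computing technology that integrating query tables requires reorganizing every piece of data in the tables, which, when the data tables are large, consumes considerable hardware resources, lengthens processing time, and makes distributed computing inefficient.
A first aspect of the embodiments of the present application provides a method for responding to a distributed computing task, including:
if a distributed computing task is received, determining the target field of the distributed computing task;
extracting the target data of the target field from a reference query table to generate a field query table;
counting the field data volume of the field query table;
if the field data volume is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging;
executing the distributed computing task based on the target configuration tables merged at the distributed nodes.
A second aspect of the embodiments of the present application provides a device for responding to a distributed computing task, including:
a target field identification unit, configured to determine the target field of the distributed computing task if the distributed computing task is received;
a field query table generating unit, configured to extract the target data of the target field from a reference query table to generate a field query table;
a field data volume statistics unit, configured to count the field data volume of the field query table;
a broadcast merging trigger unit, configured to, if the field data volume is less than a preset broadcast trigger threshold, broadcast the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging;
a distributed computing task response unit, configured to execute the distributed computing task based on the target configuration tables merged at the distributed nodes.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements:
if a distributed computing task is received, determining the target field of the distributed computing task;
extracting the target data of the target field from a reference query table to generate a field query table;
counting the field data volume of the field query table;
if the field data volume is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging;
executing the distributed computing task based on the target configuration tables merged at the distributed nodes.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the first aspect.
By parsing the distributed computing task, the embodiments of the present application can improve the efficiency of distributed computing: the volume of table data that must be sent is reduced, as is the data read and write pressure on the distributed nodes, which improves the response rate of distributed computing.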
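Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an implementation flowchart of a method for responding to a distributed computing task provided by the first embodiment of the present application;
FIG. 2 is a specific implementation flowchart of a method for responding to a distributed computing task provided by the second embodiment of the present application;
FIG. 3 is a specific implementation flowchart of step S202 of a method for responding to a distributed computing task provided by the third embodiment of the present application;
FIG. 4 is a specific implementation flowchart of step S101 of a method for responding to a distributed computing task provided by the fourth embodiment of the present application;
FIG. 5 is a specific implementation flowchart of step S101 of a method for responding to a distributed computing task provided by the fifth embodiment of the present application;
FIG. 6 is a specific implementation flowchart of a method for responding to a distributed computing task provided by the sixth embodiment of the present application;
FIG. 7 is a specific implementation flowchart of step S102 of a method for responding to a distributed computing task provided by the seventh embodiment of the present application;
FIG. 8 is a structural block diagram of a device for responding to a distributed computing task provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a terminal device provided by another embodiment of the present application.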
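Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it.
The embodiments of the present application parse the distributed computing task to determine the target fields required by the computation, and extract the target data related to the target fields from the reference query table to generate a field query table. A single overall data table is thereby split into a task-specific sub-table, a large amount of invalid data unrelated to the current computation is discarded, and the data volume of the sub-table is reduced. When the field data volume of the field query table is less than the broadcast trigger threshold, the field query table is broadcast to each distributed storage node, so that the distributed storage nodes merge the field query table with the target configuration table by Broadcast Join. Because Broadcast Join does not need to associate rows by field number (key value) or reorganize the whole data table, it is a fast way to merge data tables and improves the efficiency of distributed computing. This solves the problem in existing distributed computing technology that integrating query tables requires reorganizing every piece of data in the tables, which, when the data tables are large, consumes considerable hardware resources, lengthens processing time, and makes distributed computing inefficient.
In the embodiments of the present application, the process is executed by a terminal device. The terminal device includes, but is not limited to, a server, a computer, a smartphone, a tablet computer, or another device capable of responding to distributed computing tasks. Specifically, the terminal device may be a server in a distributed computing system deployed on the Spark engine. The server and multiple distributed storage nodes together form the distributed computing system, which stores the data uploaded by user terminals and responds to distributed computing tasks.
The embodiments of the present application are applicable to data processing fields in artificial intelligence that involve distributed computing tasks, such as machine learning and program logic design. They are also applicable to data processing fields in big data that involve distributed computing tasks, such as data integration and data mining. The specific field may be determined based on the actual application scenario and is not limited here.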
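FIG. 1 shows an implementation flowchart of the method for responding to a distributed computing task provided by the first embodiment of the present application, detailed as follows:
In S101, if a distributed computing task is received, the target field of the distributed computing task is determined.
In this embodiment, a user can generate a distributed computing task on a local terminal and send it to the terminal device through the client corresponding to the distributed computing system. In this case, the distributed computing task carries the program identifier of the client. After receiving the distributed computing task, the terminal device can check the program identifier to determine whether the user terminal is a legitimate terminal; if so, it performs the operation of S101; otherwise, it identifies the task as invalid. The terminal device can also be configured with timed tasks, automatically creating a distributed computing task and performing the operation of S101 when a preset computation trigger condition is met. For example, for a periodically triggered computing task such as totaling the month's sales records on the last day of each month, a trigger script can be configured for that type of distributed computing task; when the current time is detected to satisfy the preset trigger period, the corresponding trigger script is executed and the corresponding distributed computing task is generated.
Specifically, in this embodiment, the distributed computing task runs on a computing system built on the Spark engine. Spark Streaming is a framework built on the Spark distributed computing system for processing stream data. Its basic principle is to divide the stream data into many small data segments and process the segments in a batch-like manner. Spark Streaming is built on the Spark distributed computing system for two reasons. On the one hand, Spark's low-latency execution can be used for real-time computing. On the other hand, compared with other record-based processing frameworks (such as Storm), resilient distributed datasets (RDDs) with narrow dependencies can be recomputed from the source data to achieve fault tolerance. In addition, because the data is divided into small segments, micro-batch processing lets the Spark distributed computing system support the logic and algorithms of both batch and real-time data processing, which suits applications that need joint analysis of historical and real-time data. A Spark distributed computing system can be divided into at least one driver, which is the terminal device in this embodiment, and several executors, which run on the nodes where the RDDs are distributed, that is, the distributed nodes in this embodiment. The SparkContext is used to connect to the Spark cluster and to create RDDs, accumulators, and broadcast variables. The driver divides the computing task into a series of small pieces, namely tasks, and sends them to the distributed nodes for execution. The distributed nodes can communicate with each other. After each distributed node completes its own tasks, it sends all the information to the driver, and the driver sends the response result to the user terminal. Optionally, if the distributed computing system is built on the Spark engine, the computing task may be a task written in Spark SQL.
In this embodiment, the distributed computing task may contain computing content, which the terminal device parses. Optionally, the terminal device determines the target field corresponding to the distributed computing task by determining the computation type corresponding to the computing content and the target object on which computation is requested.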
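In S102, the target data of the target field is extracted from the reference query table to generate a field query table.
In this embodiment, the terminal device stores a reference query table, which records the existing fields of all objects and is therefore the overall data table. A single distributed computing task may involve only some of the fields in the reference query table. To reduce the data volume when the table is sent, a sub-table can be derived from the reference query table: only the target fields needed for the current computation and the data of each record under those fields are extracted, and the entire reference query table does not need to be sent to the distributed nodes.
For example, the existing fields of the reference query table include "user ID", "user age", "user address", "associated user list", and "contact information". If the computing content of the received distributed computing task is to compute the average age of users, the target fields are "user ID" and "user age". The terminal device then needs only the "user ID" and "user age" data in the reference query table to compute the average age. It generates the field query table from the target data of these two fields and sends that table to the distributed nodes, rather than sending the reference query table made up of all existing fields. Only two of the five fields are sent, reducing the transmitted data by 60%.
In this embodiment, the reference query table may contain the object information of multiple different objects, and each piece of object information contains parameter values for all existing fields of the reference query table. Extracting the target data of a target field therefore means extracting each object's parameter value for that target field; together these values form the target data.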
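In S103, the field data volume of the field query table is counted.
In this embodiment, after extracting the field query table from the reference query table, the terminal device needs to determine the field data volume of the field query table, because the distributed computing system uses different table merging methods depending on the data volume of the table. Therefore, before sending the field query table to the distributed nodes, the terminal device identifies the data volume of the target data of each target field and sums these volumes to obtain the field data volume of the whole table.
In this embodiment, the table merging methods of the distributed computing system can be divided into at least Broadcast Join, Shuffle Hash Join, and Sort Merge Join. Both Shuffle Hash Join and Sort Merge Join need to restructure the data tables: the two tables are partitioned by the key of each piece of data, and the data with the same key value in each partition is then joined, so that data with the same key in the two tables is merged and the two tables are combined. This approach, however, involves transferring large amounts of data between distributed nodes and occupies the network I/O resources of the distributed nodes. Merging data tables with Shuffle Hash Join or Sort Merge Join therefore lowers the response efficiency of distributed computing and lengthens computation time. To reduce network resource consumption and improve the response efficiency of computing tasks, Broadcast Join should be used more often. Broadcast Join is triggered when the data volume of the table is less than the broadcast trigger threshold; when the data volume is greater than or equal to the broadcast trigger threshold, the distributed computing system merges the tables with Shuffle Hash Join or Sort Merge Join. Extracting the field query table from the reference query table filters out a large amount of invalid data, which reduces the data volume of the table and raises the probability that Broadcast Join is used.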
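In S104, if the field data volume is less than the preset broadcast trigger threshold, the field query table is broadcast to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging.
In this embodiment, if the field data volume of the field query table is detected to be less than the broadcast trigger threshold, the tables can be merged by broadcast merging, so the field query table can be sent to each distributed node. After a distributed node receives the field query table sent by broadcast, it determines that this merge uses Broadcast Join to combine the field query table with the locally stored target configuration table.
Preferably, in this embodiment, the broadcast trigger threshold can be adjusted dynamically according to the current number of tasks. Specifically, if there are currently many distributed computing tasks, each computing task is allocated fewer network resources; in this case, the broadcast trigger threshold can be raised to increase the probability that broadcast merging is used. When the system is idle, that is, when there are currently few tasks, the broadcast trigger threshold can be lowered and the tables merged by Shuffle Hash Join or Sort Merge Join. In this case, before performing the operation of S104, the terminal device can obtain the number of distributed computing tasks currently being processed, compute the broadcast trigger threshold corresponding to that number with a preset broadcast trigger threshold conversion algorithm, and compare the broadcast trigger threshold with the field data volume.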
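The decision in S104 and the merge it triggers can be sketched in a single process as follows. The sketch still matches rows on a join column, as Spark's broadcast hash join does; what it avoids is repartitioning the local tables, because every node receives a full copy of the small table. The function names, the node names, and the use of a plain number for the data volume are assumptions.

```python
def choose_join_strategy(field_data_volume, broadcast_threshold):
    """Broadcast only when the field query table is below the trigger threshold."""
    return "broadcast" if field_data_volume < broadcast_threshold else "shuffle_hash"


def broadcast_join(field_table, node_tables, key):
    """Merge a full copy of the small table into each node's local table."""
    # The lookup stands for the copy that is broadcast to every node.
    lookup = {row[key]: row for row in field_table}
    merged = {}
    for node, local_rows in node_tables.items():
        # Each node joins locally; no rows move between nodes.
        merged[node] = [
            {**row, **lookup[row[key]]} for row in local_rows if row[key] in lookup
        ]
    return merged


field_table = [{"user_id": 1, "user_age": 30}, {"user_id": 2, "user_age": 41}]
node_tables = {
    "node_a": [{"user_id": 1, "city": "Shenzhen"}],
    "node_b": [{"user_id": 2, "city": "Beijing"}, {"user_id": 3, "city": "Shanghai"}],
}
strategy = choose_join_strategy(field_data_volume=10, broadcast_threshold=100)
result = broadcast_join(field_table, node_tables, key="user_id")
```

The cost of this approach grows with the size of the broadcast table multiplied by the number of nodes, which is why it is reserved for small tables.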
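In S105, the distributed computing task is executed based on the target configuration tables merged at the distributed nodes.
In this embodiment, the terminal device can send each field query table to the multiple distributed nodes under it. The distributed nodes then merge the field query table with their local target configuration tables and, after the merge, return the merged target configuration tables to the terminal device. The terminal device, as the management device for the distributed nodes, parses computing tasks and determines the computing operations contained in the distributed computing tasks. The computing operations include extracting target fields, merging the extracted data, and computing on the merged data, and different types of computation are handed to different distributed nodes. After obtaining the merged target configuration tables, the terminal device can therefore determine the distributed nodes associated with the target fields required by the distributed computing task and send data query tasks to those nodes. The distributed nodes feed the queried data back to the terminal device, which then delivers the received data to the distributed nodes that perform data merging and data computation for the subsequent computing tasks. The computation result is fed back to the terminal device, and the distributed computing task is answered through this process.
As can be seen from the above, the method for responding to a distributed computing task provided by the embodiments of the present application parses the distributed computing task to determine the target fields required by the computation, and extracts the target data related to the target fields from the reference query table to generate a field query table. A single overall data table is thereby split into a task-specific sub-table, a large amount of invalid data unrelated to the current computation is discarded, and the data volume of the sub-table is reduced. When the field data volume of the field query table is less than the broadcast trigger threshold, the field query table is broadcast to each distributed storage node, so that the distributed storage nodes merge the field query table with the target configuration table by Broadcast Join. Because Broadcast Join does not need to associate rows by field number (key value) or reorganize the whole data table, it is a fast way to merge data tables and improves the efficiency of distributed computing. Compared with existing techniques for responding to distributed computing, this embodiment filters out the valid data when the data table is generated, that is, the target data of the target fields related to the current computation, which reduces the volume of table data that must be sent. When the data volume of the table is small, the field query table and the target configuration table are merged by the simple Broadcast Join method, which greatly reduces the number of data reorganization operations and the data read and write pressure on the distributed nodes, thereby improving the response rate of distributed computing.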
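FIG. 2 shows a specific implementation flowchart of a method for responding to a distributed computing task provided by the second embodiment of the present application. Referring to FIG. 2, compared with the embodiment of FIG. 1, the method provided in this embodiment further includes S201 to S205 before the field query table is broadcast to each distributed node when the field data volume is less than the preset broadcast trigger threshold, detailed as follows:
Further, before the field query table is broadcast to each distributed node when the field data volume is less than the preset broadcast trigger threshold, the method further includes:
In S201, the network resource parameters at the current moment are obtained, together with the historical operation records associated with the preset time period in which the current moment falls.
In this embodiment, the terminal device can adjust the broadcast trigger threshold dynamically according to the current network conditions; specifically, the threshold can be determined from the historically configured historical broadcast thresholds and the current network conditions. The terminal device can therefore obtain the network resource parameters at the current moment and the historical operation records, and compute the broadcast trigger threshold for the current moment from these two inputs. The current moment is the moment at which the distributed computing task is received. More than one network resource parameter may be obtained. The network resource parameters include, but are not limited to, the network packet loss rate, network transmission rate, bit error rate, and network delay.
In this embodiment, the terminal device may be preconfigured with different characteristic time periods. The terminal device can determine the characteristic time period into which the current moment falls and take the historical operation records created within that period as the historical operation records associated with the current moment.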
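In S202, the network resource parameters are imported into a preset threshold factor conversion model to calculate the first threshold factor.
In this embodiment, the terminal device may be configured with a threshold factor conversion model. The terminal device imports the network resource parameters into the model, which outputs the first threshold factor corresponding to the network resource parameters at the current moment. Specifically, the larger the value of a network resource parameter, the more network resources are currently available, and the larger the corresponding first threshold factor, which raises the probability that broadcast merging is used. Conversely, the smaller the value of the network resource parameter, the fewer network resources are currently available, and the smaller the corresponding first threshold factor, which lowers the probability that broadcast merging is used. The threshold factor conversion model may be a hash function.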
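In S203, the weight value of each historical operation record is configured based on the creation time of that record.
In this embodiment, each historical operation record contains its creation time, and the terminal device can configure the weight value of the record according to the difference between the creation time and the current moment. The smaller the difference between the current moment and the creation time, the higher the weight value of the corresponding historical operation record; conversely, the larger the difference, the lower the weight value. The smaller the difference between the creation time and the current moment, the less the system structure of the distributed computing system and the total data volume of the database at the time of the historical operation record differ from those at the current moment. The corresponding historical broadcast threshold is therefore a better reference and receives a larger weight value, which improves the accuracy of the current broadcast trigger threshold.
In S204, the second threshold factor is calculated according to the historical broadcast threshold and the weight value of each historical operation record.
In this embodiment, each historical operation record contains the historical broadcast threshold that was used for comparison when the historical computing task was answered. The terminal device can compute a weighted sum of the historical broadcast thresholds of the historical operation records using their weight values to obtain the second threshold factor.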
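In S205, the broadcast trigger threshold is calculated according to the first threshold factor and the second threshold factor.
In this embodiment, after calculating the first threshold factor, which is related to the network resource parameters, and the second threshold factor, which is related to the historical broadcast thresholds, the terminal device can compute the broadcast trigger threshold for the current moment from these two values, thereby adjusting the broadcast trigger threshold dynamically.
In the embodiments of the present application, the broadcast trigger threshold at the current moment is calculated from the current network resource parameters and the historical operation records, so that the broadcast trigger threshold compared with the field data volume matches the current network load, which improves the accuracy of the broadcast trigger threshold.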
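Steps S203 to S205 can be sketched as below. The text fixes only the direction of the weighting (newer records weigh more) and that the second factor is a weighted sum; the inverse-age weight, the normalization of the weights, and the product used to combine the two factors are assumptions, as are all names and sample values.

```python
def recency_weights(creation_times, now):
    """Give newer historical operation records a larger weight; weights sum to 1."""
    # ASSUMPTION: weight is inversely proportional to (1 + age).
    raw = [1.0 / (1.0 + (now - t)) for t in creation_times]
    total = sum(raw)
    return [r / total for r in raw]


def second_threshold_factor(records, now):
    """records: list of (creation_time, historical_broadcast_threshold)."""
    weights = recency_weights([t for t, _ in records], now)
    return sum(w * threshold for w, (_, threshold) in zip(weights, records))


def broadcast_trigger_threshold(first_factor, second_factor):
    # ASSUMPTION: the source does not state how the two factors are combined;
    # a product lets the network-based factor scale the historical baseline.
    return first_factor * second_factor


# Hypothetical records: (creation time in hours, threshold in MB).
records = [(9, 10.0), (7, 20.0)]
second = second_threshold_factor(records, now=9)
threshold = broadcast_trigger_threshold(0.76, second)
```

With these sample values the newer record receives three quarters of the total weight, so the second factor sits closer to its 10 MB threshold than to the older 20 MB one.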
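FIG. 3 shows a specific implementation flowchart of step S202 of a method for responding to a distributed computing task provided by the third embodiment of the present application. Referring to FIG. 3, compared with the embodiment of FIG. 2, step S202 in this embodiment includes S2021 to S2022, detailed as follows:
Further, importing the network resource parameters into the preset threshold factor conversion model to calculate the first threshold factor includes:
In S2021, the maximum available resource parameter of the network where the terminal device is located is obtained.
In this embodiment, the terminal device can obtain the maximum available resource parameters of the network it is currently connected to, that is, the upper limit of each network resource parameter. For example, if the network resource parameters include the uplink rate and the downlink rate, the maximum available resource parameters include the maximum uplink rate and the maximum downlink rate. Network resource parameters such as the bit error rate and the packet loss rate are converted into positive parameters, for example the maximum transmission accuracy, which corresponds to the minimum bit error rate, and the packet delivery success rate, which corresponds to the minimum packet loss rate.
In S2022, the maximum available resource parameter and the network resource parameter are imported into the preset threshold factor conversion model to calculate the first threshold factor; the threshold factor conversion model is specifically: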
Figure PCTCN2020092723-appb-000001
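Here, FirstBrdcst is the first threshold factor; CurrentResource_i is the i-th network resource parameter; MaxWebResource_i is the i-th maximum available resource parameter; BaseLv is the preset benchmark coefficient; and n is the total number of network resource parameters.
In this embodiment, the terminal device can calculate the ratio of each current network resource parameter to the corresponding maximum available resource parameter. The closer a network resource parameter is to its maximum, the better the current network environment and the larger the data table that can be transmitted, so the value of the first threshold factor is larger; conversely, the larger the gap between the network resource parameter and the maximum available resource parameter, the poorer the current network environment and the smaller the value of the first threshold factor.
In the embodiments of the present application, the terminal device obtains the maximum available resource parameter of the network where it is currently located and compares each network resource parameter with it to calculate the first threshold factor. This normalizes the network resource parameters and improves the accuracy of the first threshold factor.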
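FIG. 4 shows a specific implementation flowchart of step S101 of a method for responding to a distributed computing task provided by the fourth embodiment of the present application. Referring to FIG. 4, compared with the embodiment of FIG. 1, step S101 in this embodiment includes S1011 to S1014, detailed as follows:
Further, determining the target field of the distributed computing task if the distributed computing task is received includes:
In S1011, the distributed computing task is parsed to obtain the task type of the distributed computing task.
In this embodiment, the terminal device can determine the target fields associated with the distributed computing task by itself, without the user setting them manually. Different computing tasks correspond to different task types, and different task types call for different data when answered. The terminal device can therefore parse the distributed computing task and identify its task type. Specifically, the terminal device can extract the computing content of the distributed computing task, extract the computation keywords from that content, and determine the task type associated with the keywords.
In S1012, the historical response results matching the task type are extracted from the calculation response database, and the historical fields contained in each historical response result are identified.
In this embodiment, the calculation response database stores all historical response results of the terminal device. A historical response result contains the target fields associated with answering a historical computing task, that is, the historical fields mentioned above. The terminal device can extract, from the calculation response database, the historical response results that match the task type of the distributed computing task. Because the task type of each extracted historical response result is the same as that of the distributed computing task currently to be executed, the target fields associated with those results are likely to be the target fields of the current computing task as well.
All historical response results of the terminal device may also be stored in a blockchain, that is, stored using the distributed storage of the blockchain. Alternatively, the calculation response database may be implemented as an integrated database or as a data warehouse, which may be determined based on the actual application scenario and is not limited here.
In S1013, the correlation degree of each historical field is calculated based on the number of occurrences and the appearance time of that historical field in all the historical response results.
In this embodiment, the terminal device can count the number of occurrences of each historical field in all historical response results. The larger the number of occurrences, the higher the degree of association between the historical field and the task type; conversely, the smaller the number of occurrences, the lower the degree of association. In addition, the terminal device can calculate the appearance frequency from the appearance time of the historical field in each historical response result, and identify the correlation degree between each historical field and the task type based on the number of occurrences and the appearance frequency.
In S1014, the historical fields whose correlation degree is greater than a preset correlation threshold are selected as the target fields.
In this embodiment, after calculating the correlation degree between each historical field and the task type, the terminal device can select the historical fields whose correlation degree is greater than the correlation threshold as the target fields, thereby automatically identifying the target fields of the distributed computing task.
In the embodiments of the present application, the historical response results matching the task type of the current computing task are obtained, and the target fields corresponding to the computing task are extracted automatically from the historical fields contained in those results, which reduces user operations and improves the response efficiency of the distributed computing task.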
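FIG. 5 shows a specific implementation flowchart of step S101 of a method for responding to a distributed computing task provided by the fifth embodiment of the present application. Referring to FIG. 5, compared with the embodiment of FIG. 1, step S101 in this embodiment includes S1015 to S1017, detailed as follows:
Further, determining the target field of the distributed computing task if the distributed computing task is received includes:
In S1015, the Structured Query Language (SQL) statement contained in the distributed computing task is extracted.
In this embodiment, the distributed computing task is a computing task generated with the Spark SQL framework. The terminal device can parse the distributed computing task and extract the SQL statement it carries. Because the SQL statement is used to query target data in the database, it carries the target field information for the computing task, so parsing the SQL statement allows the target field to be determined automatically.
In S1016, semantic analysis is performed on the SQL statement to obtain the query keywords corresponding to the SQL statement.
In this embodiment, the terminal device can load a SQL statement library that records a number of standard segments. Based on the standard segments, the terminal device can extract from the SQL statement the feature statements that define the association of the queried data, and obtain the query keywords contained in those feature statements.
In S1017, the existing fields matching each query keyword are looked up in the reference query table, and the existing fields that match a query keyword are identified as the target fields.
In this embodiment, the terminal device can check whether each query keyword exists among the existing fields of the reference query table, and if so, identify that existing field as a target field.
In the embodiments of the present application, the SQL statement in the computing task is semantically parsed to extract the query keywords, and the matching target fields are extracted from the reference query table based on the query keywords. This identifies the target fields automatically, reduces user operations, and improves the response efficiency of the distributed computing task.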
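FIG. 6 shows a specific implementation flowchart of a method for responding to a distributed computing task provided by the sixth embodiment of the present application. Referring to FIG. 6, compared with any of the embodiments of FIG. 1 to FIG. 5, the method provided in this embodiment further includes S601 to S602 after the field data volume of the field query table is counted, detailed as follows:
Further, after the field data volume of the field query table is counted, the method further includes:
In S601, if the field data volume is greater than or equal to the broadcast trigger threshold, the data number of each piece of target data is added to the field query table.
In this embodiment, if the terminal device detects that the field data volume of the field query table is greater than or equal to the broadcast trigger threshold, it recognizes that the two data tables need to be merged by Shuffle Hash Join. Shuffle Hash Join needs the data number of each piece of target data, that is, its key value, so that the distributed nodes can determine the associated data based on the key value. Therefore, the terminal device needs to add each data number to the field query table.
In S602, the field query table with the data numbers added is sent to each distributed node, so that each distributed node can, based on the data numbers, look up in the local data of its local target configuration table the associated data corresponding to each piece of field data, and reconstruct the target configuration table.
In this embodiment, the terminal device sends the field query table with the data numbers added to each distributed node. The distributed node can compare the key value of each piece of target data with the key values of the local data in its local target configuration table, recognize local data and target data with the same key value as mutually associated data, and reconstruct the target configuration table according to these associations, so as to merge the target data of the field query table into the target configuration table.
In the embodiments of the present application, when the data volume is large, the two data tables are merged by Shuffle Hash Join, which reorganizes the data tables and facilitates the management of distributed data.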
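FIG. 7 shows a specific implementation flowchart of step S102 of a method for responding to a distributed computing task provided by the seventh embodiment of the present application. Referring to FIG. 7, compared with any of the embodiments of FIG. 1 to FIG. 5, step S102 in this embodiment includes S1021 to S1022, detailed as follows:
In S1021, if any reference field in the reference query table matches the target field of the distributed computing task, the reference field is identified as a valid field.
In this embodiment, after determining the target fields of the distributed computing task, the terminal device can separate the field query table required for the current computation from the reference query table, that is, separate a small table from the large table. Specifically, the terminal device can check whether each existing field in the reference query table, that is, each reference field, matches a target field, and if so, identify the reference field as a valid field.
In S1022, the data associated with all the valid fields is identified as target data, and a field query table is generated from all the target data and the valid fields.
In this embodiment, after the terminal device has identified all valid fields in the reference query table, the data under each valid field is the target data to be extracted for the target field, and the field query table is separated from the reference query table based on the target data and the valid fields.
In the embodiments of the present application, generating the field query table by identifying the valid fields in the reference query table reduces the volume of table data that must be sent and the consumption of network resources.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present application in any way.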
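FIG. 8 shows a structural block diagram of a device for responding to a distributed computing task provided by an embodiment of the present application. The units included in the device are used to perform the steps in the embodiment corresponding to FIG. 1. For details, refer to FIG. 1 and the related description of the embodiment corresponding to FIG. 1. For ease of description, only the parts related to this embodiment are shown.
Referring to FIG. 8, the device for responding to a distributed computing task includes:
a target field identification unit 81, configured to determine the target field of the distributed computing task if the distributed computing task is received;
a field query table generating unit 82, configured to extract the target data of the target field from the reference query table to generate a field query table;
a field data volume statistics unit 83, configured to count the field data volume of the field query table;
a broadcast merging trigger unit 84, configured to, if the field data volume is less than a preset broadcast trigger threshold, broadcast the field query table to each distributed node, so that the distributed node merges the field query table with its local target configuration table by broadcast merging;
a distributed computing task response unit 85, configured to execute the distributed computing task based on the target configuration tables merged at the distributed nodes.
Optionally, the device for responding to a distributed computing task further includes:
a network resource determining unit, configured to obtain the network resource parameters at the current moment and the historical operation records associated with the preset time period in which the current moment falls;
a first threshold factor calculation unit, configured to import the network resource parameters into a preset threshold factor conversion model to calculate the first threshold factor;
a weight value determining unit, configured to configure the weight value of each historical operation record based on the creation time of that record;
a second threshold factor calculation unit, configured to calculate the second threshold factor according to the historical broadcast threshold and the weight value of each historical operation record;
a broadcast trigger threshold calculation unit, configured to calculate the broadcast trigger threshold according to the first threshold factor and the second threshold factor.
Optionally, the first threshold factor calculation unit includes:
a maximum available resource parameter obtaining unit, configured to obtain the maximum available resource parameter of the network where the device is located;
a first threshold factor conversion unit, configured to import the maximum available resource parameter and the network resource parameter into the preset threshold factor conversion model to calculate the first threshold factor; the threshold factor conversion model is specifically: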
Figure PCTCN2020092723-appb-000002
其中,FirstBrdcst为所述第一阈值因子;CurrentResource i为第i个所述网络资源参量;MaxWebResource i为第i个所述最大可用资源参量;BaseLv为预设的基准系数;n为网络 资源参量的总数。
可选地,所述目标字段识别单元81包括:
任务类型识别单元,用于解析所述分布式计算任务,得到所述分布式计算任务的任务类型;
历史字段获取单元,用于从计算响应数据库中提取与所述任务类型匹配的历史响应结果,并识别各个所述历史响应结果包含的历史字段;
关联度计算单元,用于基于各个所述历史字段在所有所述历史响应结果中的出现次数以及出现时间,分别计算各个所述历史字段的关联度;
目标字段选取单元,用于选取所述关联度大于预设的关联阈值的所述历史字段作为所述目标字段。
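Optionally, the target field identification unit 81 includes:
an SQL statement extraction unit, configured to extract the Structured Query Language (SQL) statement contained in the distributed computing task;
a query keyword acquisition unit, configured to perform semantic analysis on the SQL statement and obtain the query keywords corresponding to the SQL statement;
a query keyword filtering unit, configured to look up, in the base query table, the existing fields that match each query keyword, and to identify the existing fields that match the query keywords as the target fields.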
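The keyword-to-field matching can be sketched as follows. Note that this sketch replaces the semantic analysis described above with plain identifier tokenization, which is a much cruder technique: it can produce false matches when a field name also appears inside a string literal or as a table name. The function name `target_fields_from_sql` and the case-insensitive comparison are assumptions.

```python
import re


def target_fields_from_sql(sql, existing_fields):
    """Return the existing fields of the base query table mentioned in `sql`."""
    # Crude stand-in for semantic analysis: collect identifier-like tokens.
    tokens = {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql)}
    # An existing field that matches a query keyword becomes a target field.
    return [field for field in existing_fields if field.lower() in tokens]
```

A production system would more likely walk the parse tree of the statement so that only column references are considered.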
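Optionally, the device for responding to a distributed computing task further includes:
a field query table adjustment unit, configured to add the data number of each piece of target data to the field query table if the field data volume is greater than or equal to the broadcast trigger threshold;
a hash sort-merge unit, configured to send the field query table, with the data numbers added, to each distributed node, so that each distributed node, based on the data numbers, looks up in the local data of its local target configuration table the associated data corresponding to each piece of field data, and reconstructs the target configuration table.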
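Optionally, the field query table generation unit 82 includes:
a valid field identification unit, configured to identify a base field as a valid field if any base field in the base query table matches the target field of the distributed computing task;
a target data selection unit, configured to identify the data associated with all the valid fields as target data, and to generate a field query table from all the target data and the valid fields.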
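Therefore, the device for responding to a distributed computing task provided by the embodiments of the present application can likewise filter out the valid data (the target data of the target fields relevant to the current computation) when generating the data table, reducing the volume of table data to be sent. When the data volume of the table is small, the field query table and the target configuration table are merged by the simple Broadcast Join, which greatly reduces the number of data-organizing operations and the read/write pressure on the distributed nodes, thereby improving the response rate of distributed computing.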
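FIG. 9 is a schematic diagram of a terminal device according to another embodiment of the present application. As shown in FIG. 9, the terminal device 9 of this embodiment includes a processor 90, a memory 91, and a computer program 92 stored in the memory 91 and executable on the processor 90, for example a program for responding to a distributed computing task. When executing the computer program 92, the processor 90 implements the steps of the above method embodiments, for example S101 to S105 shown in FIG. 1. Alternatively, when executing the computer program 92, the processor 90 implements the functions of the units in the above apparatus embodiments, for example the functions of units 81 to 85 shown in FIG. 8.
By way of example, the computer program 92 may be divided into one or more units, which are stored in the memory 91 and executed by the processor 90 to carry out the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions; the instruction segments describe the execution of the computer program 92 in the terminal device 9. For example, the computer program 92 may be divided into a target field identification unit, a field query table generation unit, a field data volume counting unit, a broadcast join triggering unit, and a distributed computing task response unit, whose specific functions are as described above.
The terminal device 9 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 90 and the memory 91. Those skilled in the art will understand that FIG. 9 is merely an example of the terminal device 9 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, it may also include input/output devices, network access devices, buses, and the like.
The processor 90 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 91 may be an internal storage unit of the terminal device 9, such as a hard disk or main memory of the terminal device 9. The memory 91 may also be an external storage device of the terminal device 9, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal device 9. Further, the memory 91 may include both an internal storage unit and an external storage device of the terminal device 9. The memory 91 is used to store the computer program and other programs and data required by the terminal device. The memory 91 may also be used to temporarily store data that has been output or is about to be output.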
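Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. It stores a computer program which, when executed by a processor, implements the method provided by the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The embodiments described above are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.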

Claims (20)

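  1. A method for responding to a distributed computing task, comprising:
    if a distributed computing task is received, determining a target field of the distributed computing task;
    extracting target data of the target field from a base query table to generate a field query table;
    counting a field data volume of the field query table;
    if the field data volume is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that each distributed node merges the field query table with its local target configuration table by broadcast join;
    executing the distributed computing task based on the target configuration table merged at each distributed node.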
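  2. The method according to claim 1, wherein, before the broadcasting of the field query table to each distributed node if the field data volume is less than the preset broadcast trigger threshold, the method further comprises:
    obtaining network resource parameters at the current moment, and historical run records associated with the preset time period in which the current moment falls;
    importing the network resource parameters into a preset threshold factor conversion model to calculate a first threshold factor;
    configuring a weight for each historical run record based on the creation time of that historical run record;
    calculating a second threshold factor from the historical broadcast threshold and the weight of each historical run record;
    calculating the broadcast trigger threshold from the first threshold factor and the second threshold factor.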
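  3. The method according to claim 2, wherein the importing of the network resource parameters into the preset threshold factor conversion model to calculate the first threshold factor comprises:
    obtaining maximum available resource parameters of the network;
    importing the maximum available resource parameters and the network resource parameters into the preset threshold factor conversion model to calculate the first threshold factor, the threshold factor conversion model being specifically:
    Figure PCTCN2020092723-appb-100001
    where FirstBrdcst is the first threshold factor; CurrentResource_i is the i-th network resource parameter; MaxWebResource_i is the i-th maximum available resource parameter; BaseLv is a preset base coefficient; and n is the total number of network resource parameters.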
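  4. The method according to claim 1, wherein the determining of the target field of the distributed computing task if a distributed computing task is received comprises:
    parsing the distributed computing task to obtain a task type of the distributed computing task;
    extracting, from a computation response database, historical response results that match the task type, and identifying the historical fields contained in each historical response result;
    calculating a degree of association of each historical field based on the number of occurrences and the occurrence times of that historical field across all the historical response results;
    selecting the historical fields whose degree of association is greater than a preset association threshold as the target fields.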
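  5. The method according to claim 1, wherein the determining of the target field of the distributed computing task if a distributed computing task is received comprises:
    extracting the Structured Query Language (SQL) statement contained in the distributed computing task;
    performing semantic analysis on the SQL statement to obtain query keywords corresponding to the SQL statement;
    looking up, in the base query table, existing fields that match each query keyword, and identifying the existing fields that match the query keywords as the target fields.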
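  6. The method according to any one of claims 1 to 5, wherein, after the counting of the field data volume of the field query table, the method further comprises:
    if the field data volume is greater than or equal to the broadcast trigger threshold, adding the data number of each piece of target data to the field query table;
    sending the field query table, with the data numbers added, to each distributed node, so that each distributed node, based on the data numbers, looks up in the local data of its local target configuration table the associated data corresponding to each piece of field data, and reconstructs the target configuration table.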
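  7. The method according to any one of claims 1 to 5, wherein the extracting of the target data of the target field from the base query table to generate the field query table comprises:
    if any base field in the base query table matches the target field of the distributed computing task, identifying the base field as a valid field;
    identifying the data associated with all the valid fields as target data, and generating the field query table from all the target data and the valid fields.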
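  8. The method according to claim 4, wherein the parsing of the distributed computing task to obtain the task type of the distributed computing task comprises:
    extracting the computation content of the distributed computing task, and extracting computation keywords from the computation content;
    determining the task type associated with the computation keywords, and taking that task type as the task type of the distributed computing task.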
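  9. The method according to claim 4, wherein the calculating of the degree of association of each historical field based on the number of occurrences and the occurrence times of that historical field across all the historical response results comprises:
    determining the number of occurrences and the occurrence times of each historical field across all the historical response results;
    determining, based on the occurrence times, the occurrence frequency of each historical field across all the historical response results;
    determining the degree of association of each historical field based on the number of occurrences and the occurrence frequency.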
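  10. A device for responding to a distributed computing task, comprising:
    a target field identification unit, configured to determine a target field of a distributed computing task if the distributed computing task is received;
    a field query table generation unit, configured to extract target data of the target field from a base query table and generate a field query table;
    a field data volume counting unit, configured to count a field data volume of the field query table;
    a broadcast join triggering unit, configured to broadcast the field query table to each distributed node if the field data volume is less than a preset broadcast trigger threshold, so that each distributed node merges the field query table with its local target configuration table by broadcast join;
    a distributed computing task response unit, configured to execute the distributed computing task based on the target configuration table merged at each distributed node.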
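  11. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    if a distributed computing task is received, determining a target field of the distributed computing task;
    extracting target data of the target field from a base query table to generate a field query table;
    counting a field data volume of the field query table;
    if the field data volume is less than a preset broadcast trigger threshold, broadcasting the field query table to each distributed node, so that each distributed node merges the field query table with its local target configuration table by broadcast join;
    executing the distributed computing task based on the target configuration table merged at each distributed node.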
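  12. The terminal device according to claim 11, wherein the processor, when executing the computer program, implements:
    obtaining network resource parameters at the current moment, and historical run records associated with the preset time period in which the current moment falls;
    importing the network resource parameters into a preset threshold factor conversion model to calculate a first threshold factor;
    configuring a weight for each historical run record based on the creation time of that historical run record;
    calculating a second threshold factor from the historical broadcast threshold and the weight of each historical run record;
    calculating the broadcast trigger threshold from the first threshold factor and the second threshold factor.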
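  13. The terminal device according to claim 12, wherein the processor, when executing the computer program, implements:
    obtaining maximum available resource parameters of the network;
    importing the maximum available resource parameters and the network resource parameters into the preset threshold factor conversion model to calculate the first threshold factor, the threshold factor conversion model being specifically:
    Figure PCTCN2020092723-appb-100002
    where FirstBrdcst is the first threshold factor; CurrentResource_i is the i-th network resource parameter; MaxWebResource_i is the i-th maximum available resource parameter; BaseLv is a preset base coefficient; and n is the total number of network resource parameters.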
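  14. The terminal device according to claim 11, wherein the processor, when executing the computer program, implements:
    parsing the distributed computing task to obtain a task type of the distributed computing task;
    extracting, from a computation response database, historical response results that match the task type, and identifying the historical fields contained in each historical response result;
    calculating a degree of association of each historical field based on the number of occurrences and the occurrence times of that historical field across all the historical response results;
    selecting the historical fields whose degree of association is greater than a preset association threshold as the target fields.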
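  15. The terminal device according to claim 11, wherein the processor, when executing the computer program, implements:
    extracting the Structured Query Language (SQL) statement contained in the distributed computing task;
    performing semantic analysis on the SQL statement to obtain query keywords corresponding to the SQL statement;
    looking up, in the base query table, existing fields that match each query keyword, and identifying the existing fields that match the query keywords as the target fields.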
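  16. The terminal device according to any one of claims 11 to 15, wherein the processor, when executing the computer program, implements:
    if the field data volume is greater than or equal to the broadcast trigger threshold, adding the data number of each piece of target data to the field query table;
    sending the field query table, with the data numbers added, to each distributed node, so that each distributed node, based on the data numbers, looks up in the local data of its local target configuration table the associated data corresponding to each piece of field data, and reconstructs the target configuration table.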
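  17. The terminal device according to any one of claims 11 to 15, wherein the processor, when executing the computer program, implements:
    if any base field in the base query table matches the target field of the distributed computing task, identifying the base field as a valid field;
    identifying the data associated with all the valid fields as target data, and generating the field query table from all the target data and the valid fields.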
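  18. The terminal device according to claim 14, wherein the processor, when executing the computer program, implements:
    extracting the computation content of the distributed computing task, and extracting computation keywords from the computation content;
    determining the task type associated with the computation keywords, and taking that task type as the task type of the distributed computing task.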
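  19. The terminal device according to claim 14, wherein the processor, when executing the computer program, implements:
    determining the number of occurrences and the occurrence times of each historical field across all the historical response results;
    determining, based on the occurrence times, the occurrence frequency of each historical field across all the historical response results;
    determining the degree of association of each historical field based on the number of occurrences and the occurrence frequency.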
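  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.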
PCT/CN2020/092723 2020-01-17 2020-05-27 Method and device for responding to a distributed computing task WO2021143010A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010054782.2 2020-01-17
CN202010054782.2A CN111241163A (zh) 2020-01-17 2020-01-17 Method and device for responding to a distributed computing task

Publications (1)

Publication Number Publication Date
WO2021143010A1 (zh) 2021-07-22

Family

ID=70872766

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092723 WO2021143010A1 (zh) 2020-01-17 2020-05-27 Method and device for responding to a distributed computing task

Country Status (2)

Country Link
CN (1) CN111241163A (zh)
WO (1) WO2021143010A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597148A (zh) * 2020-11-25 2021-04-02 联想(北京)有限公司 Method and apparatus for joining data tables

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372190A (zh) * 2016-08-31 2017-02-01 华北电力大学(保定) Real-time OLAP query method and apparatus
CN109408711A (zh) * 2018-09-29 2019-03-01 北京三快在线科技有限公司 Data filtering method and apparatus, electronic device, and storage medium
KR20190092901A (ko) * 2018-01-31 2019-08-08 주식회사 데이터스트림즈 SparkSQL-based data federation apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372190A (zh) * 2016-08-31 2017-02-01 华北电力大学(保定) Real-time OLAP query method and apparatus
KR20190092901A (ko) * 2018-01-31 2019-08-08 주식회사 데이터스트림즈 SparkSQL-based data federation apparatus
CN109408711A (zh) * 2018-09-29 2019-03-01 北京三快在线科技有限公司 Data filtering method and apparatus, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAO, SHUAI: "Research and Application of SQL Join Optimization Based on Spark", MASTER THESIS, 3 June 2017 (2017-06-03), CN, pages 1 - 77, XP009529319 *

Also Published As

Publication number Publication date
CN111241163A (zh) 2020-06-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20914490

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20914490

Country of ref document: EP

Kind code of ref document: A1