CN116069817A - Task processing method, device, equipment and storage medium - Google Patents

Task processing method, device, equipment and storage medium

Info

Publication number
CN116069817A
CN116069817A
Authority
CN
China
Prior art keywords
information
script
index
task
optimization
Prior art date
Legal status
Pending
Application number
CN202310140043.9A
Other languages
Chinese (zh)
Inventor
张晓泽
刘业辉
安金龙
杨尚昂
张如飞
赵东旭
田恒宇
姬浩然
李瑾
姜乐
张宁
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202310140043.9A priority Critical patent/CN116069817A/en
Publication of CN116069817A publication Critical patent/CN116069817A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24564Applying rules; Deductive queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a task processing method, device, equipment and storage medium, and relates to the technical field of computers. The method comprises the following steps: acquiring a script to be processed; analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result; obtaining first prompt information according to the keyword matching result so as to perform first optimization processing on the script to be processed according to the first prompt information; acquiring log information of a task for executing a script to be processed; scanning and analyzing the log information to obtain task execution information corresponding to the script to be processed; determining an index corresponding to the task execution information as an index to be optimized according to an optimization rule in an optimization rule base; and obtaining second prompt information according to the index to be optimized, and performing second optimization processing on the task execution parameters according to the second prompt information. The method improves the utilization rate of the computing resources.

Description

Task processing method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a task processing method, a task processing device, electronic equipment and a readable storage medium.
Background
As data volumes grow, the demands placed on big data processing also rise. To ensure timely data output, the timeliness of task execution can be improved by adding computing resources and similar means. But as computing resources increase, so does the cost. Therefore, tasks need to be optimized so that resources are saved and utilized to the greatest extent while timeliness is still guaranteed.
Because of the complexity of business logic and computational logic, optimizing tasks requires a great deal of theoretical knowledge and practical experience, and developers with such expertise are relatively scarce. Most developers remain at the level of merely using the tools; their optimizations only scratch the surface and do not achieve reasonable utilization of resources.
As described above, how to perform task optimization to improve the utilization of resources remains a problem to be solved.
The above information disclosed in the background section is only for enhancement of understanding of the background of the disclosure, and therefore it may contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The disclosure aims to provide a task processing method, a task processing device, electronic equipment and a readable storage medium, which can improve the utilization rate of resources at least to a certain extent.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to an aspect of the present disclosure, there is provided a task processing method including: acquiring a script to be processed; analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result; obtaining first prompt information according to the keyword matching result so as to perform first optimization processing on the script to be processed according to the first prompt information; acquiring log information of a task for executing the script to be processed; scanning and analyzing the log information to obtain task execution information corresponding to the script to be processed; determining an index corresponding to the task execution information as an index to be optimized according to an optimization rule in an optimization rule base; and obtaining second prompt information according to the index to be optimized, so as to perform second optimization processing on the task execution parameters according to the second prompt information.
According to an embodiment of the disclosure, the script to be processed includes a first script that first uploads a task execution platform; the log information of the task executing the script to be processed comprises log information of the task of the first script; obtaining first prompt information according to the keyword matching result, wherein the first prompt information comprises: if the keyword matching result is that the command to be optimized is not matched, the first script is run on the task execution platform; obtaining log information of a task executing the script to be processed, including: and acquiring log information of the task of the first script.
According to an embodiment of the present disclosure, the obtaining the first prompt information according to the keyword matching result further includes: and if the keyword matching result is that the command to be optimized is matched, acquiring the first prompt information, wherein the first prompt information comprises information for prompting an optimization scheme of the command to be optimized.
According to an embodiment of the present disclosure, the command to be optimized includes at least one of a global ordering command, a data deduplication command, a first data filtering command, a multi-table merge command without predicate pushdown, a create partition command without specified partition, and a select command without corresponding column clipping operation; the optimization scheme of the global ordering command is to replace the global ordering command with a local ordering command, the optimization scheme of the data deduplication command is to replace the data deduplication command with a data grouping command, the optimization scheme of the first data filtering command is to replace the first data filtering command with a second data filtering command, the optimization scheme of the multi-table merging command without predicate pushing is predicate pushing, the optimization scheme of the creating partition command without specifying the partition is a specified partition, and the optimization scheme of the selecting command without corresponding column clipping operation is column clipping.
According to an embodiment of the disclosure, analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result, including: analyzing the first script, and carrying out validity check on the first script; and if the validity check result of the first script is passed, matching the first script with the keyword library to obtain the keyword matching result.
According to an embodiment of the present disclosure, the task execution information includes size information of an input table; the optimization rule includes a table size threshold; determining the index corresponding to the task execution information as the index to be optimized according to the optimization rule in the optimization rule base, wherein the method comprises the following steps: judging whether the size of the input table is smaller than the table size threshold according to the size information of the input table; if the size of the input table is smaller than the table size threshold, determining that a small table exists for task execution, wherein the size of the input table is the index to be optimized; the second prompt information comprises information for prompting the addition of broadcast combination parameters; obtaining second prompt information according to the index to be optimized, including: and if the size of the input table is determined to be the index to be optimized, obtaining the information for prompting to add the broadcast combination parameters.
According to an embodiment of the present disclosure, the task execution information further includes a phase execution index value including a read data amount and a start time; the optimization rule further comprises a stage execution index threshold, wherein the stage execution index threshold comprises a read data volume threshold and a starting time threshold; determining the index corresponding to the task execution information as the index to be optimized according to the optimization rule in the optimization rule base, and further comprising: determining a corresponding execution index as the index to be optimized according to the read data quantity and a corresponding read data quantity threshold; determining a corresponding execution index as the index to be optimized according to the starting time and a corresponding starting time threshold; the second prompt information comprises information for prompting to add the first optimization parameters and information for prompting to adjust the task execution time; obtaining second prompt information according to the index to be optimized, including: if the execution index corresponding to the read data quantity is determined to be the index to be optimized, obtaining corresponding information for prompting to add a first optimization parameter; and if the execution index corresponding to the starting time is determined to be the index to be optimized, acquiring the information for prompting to adjust the task execution time.
According to an embodiment of the disclosure, the task execution information further includes execution exception information; the optimization rule further comprises preset abnormal information; determining the index corresponding to the task execution information as the index to be optimized according to the optimization rule in the optimization rule base, and further comprising: matching the execution anomaly information with preset anomaly information, and determining a corresponding index as the index to be optimized according to a matching result; the second prompt information comprises information for prompting to adjust a second optimization parameter; obtaining second prompt information according to the index to be optimized, including: and if the execution index corresponding to the execution abnormality information is determined to be the index to be optimized, obtaining corresponding information for prompting to adjust the second optimization parameter.
According to an embodiment of the disclosure, the script to be processed includes a second script that has been run on the task execution platform; the log information of the task executing the script to be processed comprises log information of the task executing the second script; obtaining first prompt information according to the keyword matching result, wherein the first prompt information comprises: if the keyword matching result is that the command to be optimized is not matched, obtaining first prompt information, wherein the first prompt information comprises information for prompting to obtain a platform log; obtaining log information of a task executing the script to be processed, including: and obtaining log information of the task of the test run second script according to the information for prompting to obtain the platform log.
According to still another aspect of the present disclosure, there is provided a task processing device including: the first acquisition module is used for acquiring a script to be processed; the first matching module is used for analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result; the first optimizing module is used for obtaining first prompt information according to the keyword matching result so as to perform first optimizing processing on the script to be processed according to the first prompt information; the second acquisition module is used for acquiring log information of the task for executing the script to be processed; the log scanning module is used for scanning and analyzing the log information to acquire task execution information corresponding to the script to be processed; the second matching module is used for determining that the index corresponding to the task execution information is the index to be optimized according to the optimization rules in the optimization rule base; and the second optimization module is used for obtaining second prompt information according to the index to be optimized so as to perform second optimization processing on the task execution parameters according to the second prompt information.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a memory, a processor, and executable instructions stored in the memory and executable in the processor, the processor implementing any of the methods described above when executing the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement any of the methods described above.
According to the task processing method provided by the embodiment of the disclosure, the acquired script to be processed is analyzed, the script to be processed is matched with the keyword library, a keyword matching result is obtained, then first prompt information is obtained according to the keyword matching result, first optimization processing is carried out on the script to be processed according to the first prompt information, log information of tasks executing the script to be processed is obtained, the log information is scanned and analyzed, task execution information corresponding to the script to be processed is obtained, then an index corresponding to the task execution information is determined to be the index to be optimized according to the optimization rule in the optimization rule library, second prompt information is obtained according to the index to be optimized, and second optimization processing is carried out on the task execution parameters according to the second prompt information, so that the utilization rate of resources can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
Fig. 1 shows a flowchart of a task processing method in an embodiment of the present disclosure.
FIG. 2 illustrates a flow chart of another task processing method in an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating the processing procedure of step S204 shown in fig. 2 in an embodiment.
Fig. 4 shows a flow chart of yet another task processing method in an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of the processing procedure of steps S110 to S114 shown in fig. 1 in an embodiment.
Fig. 6 shows a schematic diagram of the processing procedure of steps S110 to S114 shown in fig. 1 in another embodiment.
Fig. 7 shows a schematic diagram of the processing procedure of steps S110 to S114 shown in fig. 1 in still another embodiment.
Fig. 8 is a schematic diagram of a task optimization flow corresponding to the embodiments shown in fig. 1 to 7.
Fig. 9 shows a block diagram of a task processing device in an embodiment of the present disclosure.
Fig. 10 illustrates a block diagram of another task processing device in an embodiment of the present disclosure.
Fig. 11 shows a schematic structural diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, apparatus, steps, etc. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present disclosure, the meaning of "a plurality" is at least two, such as two, three, etc., unless explicitly specified otherwise. The symbol "/" generally indicates that the context-dependent object is an "or" relationship.
In the present disclosure, unless explicitly specified and limited otherwise, terms such as "connected" and the like are to be construed broadly and, for example, may be electrically connected or may communicate with each other; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the terms in this disclosure will be understood by those of ordinary skill in the art as the case may be.
Terms related to embodiments of the present disclosure are explained below.
Apache Spark: apache Spark is a fast and versatile computational engine designed for large-scale data processing. Spark is a generic parallel framework of Hadoop MapReduce-like origin, developed by UC Berkeley AMP lab (AMP laboratories, bokrill division, california). Spark has the advantages of Hadoop MapReduce, but different from MapReduce, the intermediate output result of the job (job) can be stored in a memory, so that HDFS does not need to be read and written, and therefore Spark can be better suitable for algorithms of MapReduce needing iteration such as data mining and machine learning.
Jobhistory: the Spark history log server (history server) can record the log of Spark operation, and the log after Spark operation can be recorded and saved through the history log server.
Exekutor: spark Executor is a Java virtual machine (Java Virtual Machine, JVM) process in a working node (Worker) in a cluster, and is responsible for running specific tasks in Spark jobs, where the tasks are independent of each other. When Spark applications are started, the Executor nodes are started simultaneously and exist along with the whole Spark application life cycle. If there is a failure or crash of the Executor node, the Spark application may also continue to execute, and may schedule the task on the faulty node to continue to run on other Executor nodes.
An Executor has two core functions: it runs the tasks that make up the Spark application and returns the results to the Driver process, and it provides in-memory storage, through its own Block Manager, for RDDs that the user program requires to be cached. RDDs are cached directly inside the Executor process, so tasks can make full use of the cached data to speed up computation at runtime.
With the advent of the big data era, technologies for big data have emerged one after another. Common offline computing engines for big data are MapReduce and Spark, whose basic principle is to divide the computation into a map phase and a reduce phase. The map phase can be understood as reading and cleansing the data and preparing it for the reduce phase, which performs aggregation computations and outputs the results. Apache Spark is currently one of the most popular offline computing engines, characterized by high speed, ease of development and strong generality. Spark SQL is the Spark module for processing structured data and provides interactive query capability. SQL statements can be used for big data development and processing, which greatly simplifies the work of big data developers and data analysts and improves working efficiency.
As described above, tasks are optimized in order to save and utilize resources to the greatest extent while ensuring the timeliness of big data processing. However, task optimization in the related art rarely achieves reasonable resource utilization. Moreover, developers differ in technical level and development habits, which can lead to problems such as long task execution time, high resource consumption, poor readability and high maintenance cost.
Therefore, the present disclosure provides a task processing method, which diagnoses the problem points existing in script and task execution by respectively performing rule comparison analysis on the script to be processed and the corresponding task execution log, and gives an optimization suggestion, so as to reduce the development threshold and improve the utilization rate of resources.
FIG. 1 is a flow chart illustrating a method of task processing according to an exemplary embodiment. The method as shown in fig. 1 may be applied, for example, to a task execution platform such as MapReduce or Spark.
Referring to fig. 1, a method 10 provided by an embodiment of the present disclosure may include the following steps.
In step S102, a script to be processed is acquired.
In some embodiments, the script to be processed may include a first script uploaded to the task execution platform for the first time. Embodiments of scanning and trial-running the first script may refer to fig. 2 and 3.
In some embodiments, the script to be processed may include a second script that has been run on the task execution platform. An embodiment of inspecting the second script and its logs may refer to fig. 4.
In step S104, the script to be processed is parsed, and the script to be processed is matched with the keyword library, so as to obtain a keyword matching result.
In some embodiments, if the script to be processed is the first script, the specific implementation of step S104 may refer to step S204.
In some embodiments, if the script to be processed is the second script, the specific implementation of step S104 may refer to step S404.
In step S106, a first prompt message is obtained according to the keyword matching result, so as to perform a first optimization process on the script to be processed according to the first prompt message.
In some embodiments, the specific implementation of step S106 may refer to step S206.
In step S108, log information of a task of executing a script to be processed is acquired.
In some embodiments, the log information of the task executing the script to be processed may include log information of the task of the first script.
In some embodiments, the log information of the task executing the script to be processed may include log information of the task executing the second script; for example, the latest execution log file of the task may be obtained from the Spark Jobhistory.
In step S110, the log information is scanned and parsed to obtain task execution information corresponding to the script to be processed.
In some embodiments, the task execution information may include size information of the input table.
In some embodiments, the task execution information may include a phase execution index value, which may include a read data amount and a start time.
For example, after the script to be processed is submitted to the task execution platform, the platform generates a corresponding number of jobs through analysis and splits each job into one or more stages, which are executed in sequence. Each stage is further divided into a number of task execution units (tasks). Execution indexes of the jobs may be obtained, such as the running duration of each job and the identifiers (IDs) of its corresponding stages. Execution indexes of the stages may also be obtained, for example the start and end time and running duration of each stage, the input or shuffle-read data amount of each stage, the task IDs corresponding to the stage, the number of tasks, and the start and end time, running duration and input or shuffle-read data amount of each task.
In some embodiments, the task execution information may include execution indexes of the execution units (executors) that run the task, such as the number of executors and the CPU and memory requested by each executor.
In some embodiments, the task execution information may include execution exception information, for example error (Error) and exception (Exception) information.
In step S112, the index corresponding to the task execution information is determined as the index to be optimized according to the optimization rule in the optimization rule base.
In some embodiments, the optimization rule for the size information of the input table may include a table size threshold, and a specific implementation of the corresponding determination of the index to be optimized may refer to fig. 5.
In some embodiments, the optimization rule for the phase execution index value may include a phase execution index threshold, where the phase execution index threshold includes a read data amount threshold and a start time threshold, and a corresponding specific implementation of determining the index to be optimized may refer to fig. 6.
In some embodiments, the optimization rule for executing the anomaly information further includes preset anomaly information, and a specific implementation of corresponding determination of the to-be-optimized index may refer to fig. 7.
In step S114, second prompt information is obtained according to the index to be optimized, so as to perform second optimization processing on the task execution parameters according to the second prompt information.
In some embodiments, when the index to be optimized is the presence of a small table, the second prompt information may include information for prompting to add the broadcast combining parameter, and the specific implementation may refer to fig. 5.
In some embodiments, when the index to be optimized is a stage execution index, the second prompt information may include information for prompting to add the first optimization parameter and information for prompting to adjust the task execution time, and the specific implementation may refer to fig. 6.
In some embodiments, when the index to be optimized is an index corresponding to the execution anomaly information, the second hint information may include information for hinting to adjust the second optimization parameter, and the specific embodiment may refer to fig. 7.
According to the task processing method provided by the embodiment of the disclosure, the acquired to-be-processed script is analyzed, the to-be-processed script is matched with the keyword library to obtain a keyword matching result, then first prompt information is obtained according to the keyword matching result, first optimization processing is carried out on the to-be-processed script according to the first prompt information, log information of tasks executing the to-be-processed script is acquired, the log information is scanned and analyzed to acquire task execution information corresponding to the to-be-processed script, then an index corresponding to the task execution information is determined to be the to-be-optimized index according to the optimization rule in the optimization rule library, second prompt information is obtained according to the to-be-optimized index, second optimization processing is carried out on task execution parameters according to the second prompt information, development thresholds and development costs are greatly reduced through diagnosis on the script and the tasks, and the utilization rate of cluster resources is improved.
FIG. 2 is a flow chart illustrating another task processing method according to an exemplary embodiment. Fig. 2 relates to fig. 1 in that fig. 2 is an embodiment in which the script to be processed in fig. 1 is a first script uploaded to the task execution platform for the first time.
Referring to fig. 2, a method 20 provided by an embodiment of the present disclosure may include the following steps.
In step S202, a first script uploaded to the task execution platform for the first time is acquired.
In step S204, the first script is parsed, and the first script is matched with the keyword library, so as to obtain a keyword matching result.
In some embodiments, after the first script is parsed by the SQL parser, a validity check is performed first and keyword matching is then performed; a specific embodiment may refer to fig. 3.
In some embodiments, keywords in the first script may be scanned and matched against a library of keywords whose use is not recommended; these may be keywords in command statements.
In step S206, if the keyword matching result is that the command to be optimized is matched, a first prompt message is obtained, where the first prompt message includes information for prompting an optimization scheme of the command to be optimized, so as to execute the optimization scheme of the command to be optimized.
In some embodiments, the command to be optimized may include at least one of a global ordering command, a data deduplication command, a first data filtering command, a multi-table merge command without predicate pushdown, a create partition command without specified partition, and a select command without corresponding column clipping operation.
In some embodiments, the optimization scheme of the global ordering command may be to replace the global ordering command with the local ordering command, the optimization scheme of the data deduplication command may be to replace the data deduplication command with the data grouping command, the optimization scheme of the first data filtering command may be to replace the first data filtering command with the second data filtering command, the optimization scheme of the multi-table merging command without predicate pushing may be predicate pushing, the optimization scheme of the create partition command without specifying a partition may be specifying a partition, and the optimization scheme of the select command without performing the corresponding column clipping operation may be column clipping.
For example, the global ordering command may be order by. This command sorts the input data globally and puts all the data into the same reduce for processing; no matter how much data or how many files there are, only one reduce task is started. Because all the data is processed on the same reducer, when the data volume is large it may exceed the disk and memory capacity of a single node, causing the task to fail or the execution time to become extremely long. The order by may be replaced with the local ordering command sort by, which sorts the data within each reduce; the output of each reduce is ordered, but the data is not necessarily globally ordered (unless there is only one reduce, in which case the result is globally ordered). In general, the data may be sorted locally first and then globally, which can greatly improve processing efficiency.
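A minimal sketch of this replacement, using hypothetical table and column names (the distribute by clause is a common companion of sort by and is an assumption here, not a requirement of the disclosure):
-- global sort: all rows are funneled into a single reduce task
select column_1 from t_table order by column_1;
-- local sort: rows are distributed by column_1 and sorted within each reduce
select column_1 from t_table distribute by column_1 sort by column_1;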
For another example, the data deduplication command may be distinct. When deduplicating, this command puts all the data into one reduce, which easily becomes a performance bottleneck and may even cause a memory overflow (Out of Memory, OOM). The distinct can be replaced by the data grouping command group by, so that the data can be distributed to different reduces for computation, and finally the count operator is used to count the number of rows. For large data volumes, the combination of group by and count is significantly more efficient than count distinct. An example is as follows:
select count(distinct column_1) from t_table;
after optimization:
select
  count(column_1)
from (
  select column_1 from t_table group by column_1
) t;
For another example, the first data filtering command may be not in, which filters data to obtain the rows that do not match the conditional data set. However, not in leads to a Cartesian-product computation, which slows the task down. The not in statement can be replaced by a second data filtering command such as left join / semi join / anti join, which can significantly improve execution efficiency. An example is as follows:
SELECT t1.column_1 FROM t_table_1 t1 WHERE t1.column_1 NOT IN (SELECT column_1 FROM t_table_2);
after optimization:
SELECT t1.column_1 FROM t_table_1 t1 LEFT JOIN t_table_2 t2 ON t1.column_1 = t2.column_1 WHERE COALESCE(t2.column_1, '') = '';
As another example, in the case of multi-table merging (join), the join is typically completed first and the result is then filtered with a where condition. Predicate pushdown may be performed instead, i.e., the filter expression is moved as close as possible to the data source, so that irrelevant data can be skipped directly during actual execution. In other words, in suitable scenarios the filtering condition is executed first. An example is as follows:
SELECT t1.column_1, t1.column_2, t2.column_3 FROM t_table_1 t1 LEFT JOIN t_table_2 t2 ON t1.column_1 = t2.column_1 WHERE t1.column_2 = '1';
After optimization:
SELECT t1.column_1, t1.column_2, t2.column_3 FROM (SELECT t.column_1, t.column_2 FROM t_table_1 t WHERE t.column_2 = '1') t1
LEFT JOIN t_table_2 t2
ON t1.column_1 = t2.column_1;
For another example, if a partition is not specified when creating or writing the partition table, a specific partition should be specified, as sketched below.
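One possible reading of this rule, sketched with hypothetical table, column and partition names (dt is an assumed partition field):
-- with the partition specified explicitly, only the named partition is written
insert overwrite table t_table_part partition (dt = '2023-01-01')
select column_1 from t_table where dt = '2023-01-01';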
For another example, query using select is performed, requiring column clipping.
In step S208, if the keyword matching result is that the command to be optimized is not matched, the first script is run on the task execution platform.
In step S210, log information of a task of the trial run first script is acquired.
In step S212, the log information is scanned and analyzed to obtain task execution information corresponding to the first script.
In step S214, the index corresponding to the task execution information is determined as the index to be optimized according to the optimization rule in the optimization rule base.
In step S216, second prompt information is obtained according to the index to be optimized, so as to perform second optimization processing on the task execution parameters according to the second prompt information.
In some embodiments, the specific implementation of step S210 to step S216 may refer to step S108 to step S114.
According to the method provided by the embodiment of the disclosure, by checking and trial-running the newly uploaded script and providing an optimization scheme, non-standard, low-quality scripts are largely prevented from being put into use.
Fig. 3 is a schematic diagram illustrating the processing procedure of step S204 shown in fig. 2 in an embodiment. As shown in fig. 3, in the embodiment of the present disclosure, the step S204 may further include the following steps.
Step S302, analyzing the first script and verifying the validity of the first script.
In some embodiments, the first script may be parsed by the SQL parser.
And step S304, if the validity check result of the first script is passed, matching the first script with a keyword library to obtain a keyword matching result.
Step S306, if the validity check result of the first script is not passed, ending the flow.
In some embodiments, the validity check on the first script verifies whether it contains a grammar problem; if so, the process may be ended directly and the grammar problem is prompted. If no grammar problem exists, keyword matching is continued.
According to the method provided by the embodiment of the disclosure, by performing a validity check on the newly uploaded script, scripts with grammar problems can be screened out before keyword matching, which reduces the processing load of script diagnosis and improves the efficiency of the task optimization flow.
FIG. 4 is a flowchart illustrating yet another task processing method according to an exemplary embodiment. FIG. 4 differs from FIG. 2 in that FIG. 4 is an embodiment in which the script to be processed in FIG. 1 is a second script that has been run on the task execution platform.
Referring to fig. 4, a method 40 provided by an embodiment of the present disclosure may include the following steps.
In step S402, a second script that has been run on the task execution platform is acquired.
In step S404, the second script is parsed, and the second script is matched with the keyword library, so as to obtain a keyword matching result.
In some embodiments, the second script may be parsed by the SQL parser, and keywords in the second script may be scanned and matched against a library of keywords whose use is not recommended; these may be keywords in command statements.
In step S406, if the keyword matching result is that the command to be optimized is matched, information for prompting the optimization scheme of the command to be optimized is obtained to execute the optimization scheme of the command to be optimized.
In some embodiments, the specific implementation of step S406 may refer to step S206.
In step S408, if the keyword matching result is that the command to be optimized is not matched, information for prompting to acquire the platform log is obtained.
In step S410, log information of the task of the trial run second script is obtained from the information for prompting the acquisition of the platform log.
In step S412, the log information is scanned and parsed to obtain task execution information corresponding to the second script.
In step S414, the index corresponding to the task execution information is determined as the index to be optimized according to the optimization rule in the optimization rule base.
In step S416, second prompt information is obtained according to the to-be-optimized index, so as to perform second optimization processing on the task execution parameters according to the second prompt information.
In some embodiments, the specific implementation of step S410 to step S416 may refer to step S108 to step S114.
According to the method provided by the embodiment of the disclosure, by inspecting the scripts and tasks already present on the cluster, problems in earlier scripts can be found and an optimization scheme provided, so that resources are utilized more reasonably.
Fig. 5 shows a schematic diagram of the processing procedure of steps S110 to S114 shown in fig. 1 in an embodiment. The task execution optimization flow shown in fig. 5 is directed to the case where a small table exists.
Step S502, scanning and analyzing the log information to obtain the size information of an input table in the task execution information corresponding to the script to be processed.
Step S504, judging whether the size of the input table is smaller than a table size threshold according to the size information of the input table.
Step S506, if the size of the input table is smaller than the table size threshold, it is determined that a small table exists in the task execution, and the size of the input table is the index to be optimized.
For example, the table size threshold may be 70MB, or 80MB, or 90MB, or the like.
Step S508, if the size of the input table is determined to be the index to be optimized, information prompting to add the broadcast combining parameters is obtained, so as to add the broadcast combining parameters according to the prompting.
In some embodiments, when a small table is present, a broadcast join may be used, and it is suggested to add the following parameter:
set spark.sql.autoBroadcastJoinThreshold=83886080;
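With the parameter above in effect, a hedged illustration of a join in which the small table can be broadcast (t_fact and t_dim are hypothetical table names; t_dim is assumed to be smaller than the 83886080-byte, i.e. 80 MB, threshold):
select f.column_1, d.column_2
from t_fact f
left join t_dim d on f.column_1 = d.column_1; -- t_dim is broadcast to every executor, avoiding a shuffle of the large table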
in step S510, if the size of the input table is not smaller than the table size threshold, the size of the input table is not determined as the index to be optimized.
Fig. 6 shows a schematic diagram of the processing procedure of steps S110 to S114 shown in fig. 1 in another embodiment. The task execution optimization flow shown in fig. 6 targets situations such as data skew, small files, unreasonable file-splitting strategies, unreasonable parallelism settings and shortage of cluster queue resources.
Step S602, scanning and analyzing the log information to obtain the read data quantity and the starting time in the stage execution index value in the task execution information corresponding to the script to be processed.
Step S6042, determining the corresponding execution index as the index to be optimized according to the read data amount and the corresponding read data amount threshold.
In some embodiments, the maximum, minimum and average values of the input or shuffle-read data amount of the tasks in each stage may be obtained and their difference calculated; if the difference > (average value * 75%) (75% is the corresponding read data amount threshold and may instead be set to 70%, 80%, etc. according to the actual situation), it may be determined that data skew exists, as sketched below.
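A minimal SQL sketch of this skew check, assuming the scanned log has already been parsed into a hypothetical table stage_task_metrics(stage_id, task_id, read_mb) holding each task's input or shuffle-read data amount in MB:
select
  stage_id,
  max(read_mb) - min(read_mb) as spread_mb,
  avg(read_mb) as avg_mb
from stage_task_metrics
group by stage_id
having max(read_mb) - min(read_mb) > avg(read_mb) * 0.75; -- the 75% read data amount threshold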
In some embodiments, if the average amount of input data read by each task in a stage is less than 50 MB (50 MB is the corresponding read data amount threshold and may be set to 40 MB or 60 MB according to the actual situation), it is determined that small files exist.
In some embodiments, if the average amount of input data read by each task in a stage is greater than 600 MB (600 MB is the corresponding read data amount threshold and may be set to 500 MB or 700 MB according to the actual situation), the file-splitting strategy is determined to be unreasonable.
In some embodiments, if the average shuffle-read data amount in a stage is smaller than 400 MB or larger than 700 MB (400 MB and 700 MB are the corresponding read data amount thresholds and may also be set to 300 MB and 600 MB, or 500 MB and 800 MB, etc. according to the actual situation), it is determined that the parallelism setting is unreasonable.
Step S6044, determining the corresponding execution index as the index to be optimized according to the starting time and the corresponding starting time threshold.
In some embodiments, if the startup interval between two adjacent stages exceeds 1 minute (1 minute is the corresponding startup time threshold and may be set to 0.5 minute, 1.5 minutes, etc. according to the actual situation), it may be determined that the cluster queue resources are tight.
In some embodiments, if the difference between the end time and the start time of a task exceeds 3 minutes (3 minutes is the corresponding startup time threshold and may be set to 2 minutes, 4 minutes, etc. according to the actual situation), it may be determined that the cluster queue resources are tight.
In step S6062, if it is determined that the execution index corresponding to the read data amount is the index to be optimized, the corresponding information for prompting to add the first optimization parameter is obtained.
In some embodiments, if it is determined that the data is skewed, it may be recommended to add the following parameters (if the parameters do not take effect, the hot-key problem needs to be handled by the developer):
set spark.sql.adaptive.enabled=true;
set spark.sql.adaptive.skewJoin.enabled=true;
set spark.sql.adaptive.skewJoin.enhance.enabled=true;
set spark.sql.adaptive.forceOptimizeSkewedJoin=true;
When spark.sql.adaptive.enabled is true, adaptive query execution is enabled and the query plan is re-optimized during query execution based on accurate runtime statistics. spark.sql.adaptive.skewJoin.enabled: when both this parameter and spark.sql.adaptive.enabled are true, Spark dynamically handles skew in shuffled joins (sort-merge and shuffled hash) by splitting (and, if needed, replicating) the skewed partitions. The spark.sql.adaptive.enabled parameter enables Spark's adaptive query feature, which can be understood as turning on automatic optimization. Once it is enabled, the Spark engine can collect runtime information about the task and optimize dynamically while the task is executing. spark.sql.adaptive.skewJoin.enabled is the skew-handling switch; once it is turned on, Spark splits the skewed portion so that it is processed by multiple tasks, and finally merges the results with a union.
In some embodiments, if it is determined that a doclet exists, it may be recommended to add the following parameters:
set spark.sql.files.openCostInBytes=4m;
set spark.sql.files.maxPartitionBytes=600m;
Here, spark.sql.files.openCostInBytes represents the estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time. It is used when placing multiple files into a single partition. It is better to overestimate it slightly, so that partitions containing small files are processed faster than partitions containing large files.
spark.sql.files.maxPartitionBytes represents the maximum number of bytes packed into a single partition when reading files. This configuration is effective when using file-based sources such as Parquet, JSON and ORC.
In some embodiments, if it is determined that the file splitting policy is not reasonable, it may be recommended to add the following parameters:
set hive.exec.orc.split.strategy=ETL;
set mapred.max.split.size=629145600;
set mapred.min.split.size.per.node=629145600;
set mapred.min.split.size.per.rack=629145600;
hive.exec.orc.split.strategy sets the split strategy used when reading ORC files. The BI strategy generates splits at file granularity; the ETL strategy can split a file so that several stripes form one split; the HYBRID strategy uses the ETL policy when the average file size is greater than the Hadoop maximum split value (256 MB by default), and otherwise uses the BI policy. mapred.max.split.size sets the maximum split size for files. mapred.min.split.size.per.node sets the minimum split size per machine node. mapred.min.split.size.per.rack sets the minimum split size per rack.
In some embodiments, if it is determined that the parallelism setting is not reasonable, the following parameters may be adjusted according to the data amount:
set spark.sql.shuffle.partitions=(total shuffle read / 600 MB); -- this is the number of partitions used by default when joining or aggregating shuffled data.
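A worked example under an assumed total shuffle-read volume of 120 GB (a hypothetical figure): 120 * 1024 MB / 600 MB ≈ 205, so roughly 205 partitions would be set:
set spark.sql.shuffle.partitions=205;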
In step S6064, if it is determined that the execution index corresponding to the start time is the index to be optimized, information prompting to adjust the task execution time is obtained.
In some embodiments, if it is determined that the queue resources are tight, a prompt may be given to check the queue monitoring and adjust the task execution time.
Fig. 7 shows a schematic diagram of the processing procedure of steps S110 to S114 shown in fig. 1 in still another embodiment. The task execution optimization flow shown in fig. 7 is directed to the case of log error reporting.
Step S702, scanning and analyzing the log information to obtain abnormal execution information in the task execution information corresponding to the script to be processed.
Step S704, matching the execution anomaly information with preset anomaly information, and determining the corresponding index as the index to be optimized according to the matching result.
In some embodiments, if errors such as java.lang.OutOfMemoryError: Java heap space, org.apache.spark.shuffle.FetchFailedException: Java heap space, or Container killed on request. Exit code is 143 are found in the log, the memory required to run the program is larger than spark.executor.memory. Typically the amount of data processed or cached is large, the available memory is insufficient, and the memory allocation rate exceeds the GC reclamation rate.
In some embodiments, if an error is found in the log stating that the threshold of physical resource utilization is exceeded (an Exposed_resource_type_mem message), then the memory usage of the physical machine on which the executor is located has exceeded the physical machine's resource utilization threshold, so the container is evicted by YARN; the current eviction policy is that containers scheduled earlier are evicted later.
In some embodiments, if a large number of fetch failure errors appear in the log, it may be determined that the disk is busy.
Step S706, if it is determined that the execution index corresponding to the execution anomaly information is the index to be optimized, the corresponding information for prompting to adjust the second optimization parameter is obtained.
In some embodiments, if the message Container killed on request. Exit code is 143 occurs, an adjustment may be prompted according to the following scheme: increase spark.executor.memory, reduce spark.executor.cores, reduce unnecessary cache operations, avoid broadcasting large data as much as possible, avoid shuffle operators as much as possible, or optimize the program logic / underlying data. Here, spark.executor.memory represents the amount of memory used by each executor process, and spark.executor.cores represents the number of cores used by each executor.
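A hedged sketch of the parameter side of this adjustment; the concrete values are assumptions for illustration, not values taken from the disclosure:
set spark.executor.memory=8g;  -- increase the memory available to each executor
set spark.executor.cores=2;    -- reduce the number of concurrent tasks per executor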
In some embodiments, if the extracted_reserved_type_mem information occurs, the adjustment may be made according to the following scheme: because the physical machine's memory usage has exceeded the threshold, this cannot be fully avoided at the task level; the cores and memory of the executor can be reduced in the same proportion (keeping the memory allocated per core unchanged), and requesting large amounts of memory should be avoided as much as possible, so that the memory utilization of the physical machine does not reach the threshold.
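A hedged numeric sketch of the same-proportion reduction, with assumed starting values of 4 cores and 8g per executor (memory per core stays at 2g, so each task's share is unchanged while the executor's total footprint on the physical machine is halved):
set spark.executor.cores=2;
set spark.executor.memory=4g;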
In some embodiments, if it is determined that the disk is busy, the adjustment may be performed according to the following scheme: based on the disk monitoring, choose to execute the task when the disk is not busy, or enable the RSS mechanism.
Fig. 8 is a schematic diagram of a task optimization flow corresponding to the embodiments shown in fig. 1 to 7. The task optimization flow shown in fig. 8 may be executed by the diagnostic optimizing apparatus 8002, and may include the following steps S802 to S824.
Step S802, acquire the newly uploaded script 8002 or the previously uploaded script 8006 found through periodic platform inspection, and parse the script with an SQL parser.
Step S804, a validity check is carried out on the newly uploaded script. If the script has a grammar problem, i.e. the validity check is not passed, the process ends directly (step S824) and the grammar problem is prompted. If the validity check is passed, step S806 is continued.
In step S806, the keywords in the script are scanned and matched against the library 8008 of keywords that are not recommended for use.
Step S808, it is determined whether a keyword whose use is not recommended has been matched. If such a keyword is matched, a corresponding optimization scheme 8010 is prompted; the specific embodiment can refer to fig. 2. If not, step S810 is continued.
In step S810, the latest execution log file of the task is obtained from the Spark Jobhistory 8012.
Step S812, it is determined whether the execution log of the task has been acquired. If the execution log has not been acquired, a trial run is performed through the scheduling platform (step S814), and during the trial run the Spark execution log information can be acquired through Spark monitoring. If the log has been acquired, step S816 continues.
Step S816, after the execution log of the Spark task is obtained, scan analysis is performed on the log file.
Step S818, obtain the task execution information in the log file. For the specific embodiment, reference is made to step S110.
Step S820, after the task execution information is obtained, the matching with the rules in the optimization rule base 8014 is started. For specific embodiments, reference may be made to step S112.
Step S822, it is determined whether the scanned log matches a corresponding rule. If a rule is matched, a corresponding specific optimization scheme 8010 is given; for a specific embodiment, reference may be made to step S114. If no corresponding rule is matched, the process goes to step S824.
In step S824, the flow ends.
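As a non-limiting illustrative sketch of the flow of steps S802 to S824, the diagnostic logic may be outlined as follows. All function, class, and attribute names below (sql_parser, keyword_library, job_history, and so on) are hypothetical and introduced only for illustration:

def diagnose_task(script, is_new_upload, sql_parser, keyword_library,
                  optimization_rules, job_history, scheduling_platform):
    # S802: parse the newly uploaded or previously uploaded script.
    parsed = sql_parser.parse(script)

    # S804: validity check for a newly uploaded script; end on syntax problems (S824).
    if is_new_upload and not parsed.is_valid:
        return {"result": "syntax problem", "detail": parsed.errors}

    # S806/S808: scan keywords and match them against the non-recommended keyword library.
    matched = [kw for kw in parsed.keywords if kw in keyword_library]
    if matched:
        return {"result": "first optimization prompt",
                "schemes": [keyword_library[kw] for kw in matched]}

    # S810/S812/S814: fetch the latest execution log, or trial-run the script to produce one.
    log = job_history.latest_log(script)
    if log is None:
        log = scheduling_platform.trial_run(script)

    # S816/S818: scan the log file and extract the task execution information.
    execution_info = log.extract_execution_info()

    # S820/S822: match the execution information against the optimization rule base.
    for rule in optimization_rules:
        if rule.matches(execution_info):
            return {"result": "second optimization prompt", "scheme": rule.scheme}

    # S824: no rule matched; the flow ends without a prompt.
    return {"result": "no optimization prompt"}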
Fig. 9 is a block diagram of a task processing device, according to an example embodiment. The apparatus as shown in fig. 9 may be applied to a task execution platform, for example.
Referring to fig. 9, an apparatus 90 provided by an embodiment of the present disclosure may include a first acquisition module 902, a first matching module 904, a first optimization module 906, a second acquisition module 908, a log scanning module 910, a second matching module 912, and a second optimization module 914.
The first obtaining module 902 may be configured to obtain a script to be processed.
The first matching module 904 may be configured to parse the script to be processed, and match the script to be processed with the keyword library, so as to obtain a keyword matching result.
The first optimization module 906 may be configured to obtain first prompt information according to the keyword matching result, so as to perform a first optimization process on the script to be processed according to the first prompt information.
The first optimization module 906 may be configured to try to run the first script on the task execution platform if the keyword matching result is that the command to be optimized is not matched.
The second acquisition module 908 may be used to acquire log information of tasks executing the script to be processed.
The log scanning module 910 may be configured to scan and parse the log information to obtain task execution information corresponding to the script to be processed.
The second matching module 912 may be configured to determine, according to an optimization rule in the optimization rule base, an index corresponding to the task execution information as an index to be optimized.
The second optimization module 914 may be configured to obtain second prompt information according to the index to be optimized, so as to perform a second optimization process on the task execution parameter according to the second prompt information.
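Purely as an illustrative sketch of how the modules of fig. 9 could cooperate (the class name and method names are hypothetical and do not form part of the disclosure), the device may be wired as follows:

class TaskProcessingDevice:
    def __init__(self, first_acquisition, first_matching, first_optimization,
                 second_acquisition, log_scanning, second_matching, second_optimization):
        # Each argument corresponds to one of the modules 902-914 described above.
        self.first_acquisition = first_acquisition
        self.first_matching = first_matching
        self.first_optimization = first_optimization
        self.second_acquisition = second_acquisition
        self.log_scanning = log_scanning
        self.second_matching = second_matching
        self.second_optimization = second_optimization

    def process(self):
        script = self.first_acquisition.get_script()                    # module 902
        keyword_result = self.first_matching.match(script)              # module 904
        first_hint = self.first_optimization.hint(keyword_result)       # module 906
        log_info = self.second_acquisition.get_log(script)              # module 908
        execution_info = self.log_scanning.scan(log_info)               # module 910
        index_to_optimize = self.second_matching.match(execution_info)  # module 912
        second_hint = self.second_optimization.hint(index_to_optimize)  # module 914
        return first_hint, second_hint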
FIG. 10 is a block diagram of another task processing device, according to an example embodiment. The apparatus shown in fig. 10 may be applied to a task execution platform, for example.
Referring to fig. 10, an apparatus 100 provided by an embodiment of the present disclosure may include a first acquisition module 1002, a validity check module 1003, a first matching module 1004, a commissioning module 1005, a first optimization module 1006, a second acquisition module 1008, a log scanning module 1010, a second matching module 1012, and a second optimization module 1014.
The first obtaining module 1002 may be configured to obtain a script to be processed.
The script to be processed may include a first script that is uploaded to the task execution platform for the first time.
The script to be processed may also include a second script that has been run on the task execution platform.
The validity checking module 1003 may be configured to parse the first script and perform validity checking on the first script.
The first matching module 1004 may be configured to parse the script to be processed, and match the script to be processed with the keyword library, so as to obtain a keyword matching result.
The first matching module 1004 may be further configured to match the first script with the keyword library if the validity check result of the first script is passed, so as to obtain a keyword matching result.
The commissioning module 1005 may be configured to commission the first script on the task execution platform if the keyword matching result is that the command to be optimized is not matched.
The command to be optimized may include at least one of a global ordering command, a data deduplication command, a first data filtering command, a multi-table merge without predicate pushdown command, a create partition command without specified partition, and a select command without corresponding column clipping operation.
The first optimization module 1006 may be configured to obtain first hint information according to the keyword matching result, so as to perform a first optimization process on the script to be processed according to the first hint information.
The first optimizing module 1006 may be further configured to obtain first prompting information if the keyword matching result is that the keyword matching result matches the command to be optimized, where the first prompting information includes information for prompting an optimization scheme of the command to be optimized.
The optimization scheme of the global ordering command can be that the global ordering command is replaced by the local ordering command, the optimization scheme of the data deduplication command can be that the data deduplication command is replaced by the data grouping command, the optimization scheme of the first data filtering command can be that the first data filtering command is replaced by the second data filtering command, the optimization scheme of the multi-table merging command without predicate pushing can be that the predicate pushing is performed, the optimization scheme of the creating partition command without specifying the partition can be that the partition is specified, and the optimization scheme of the selecting command without corresponding column clipping operation can be that the column clipping is performed.
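The keyword-to-scheme correspondence described above may be illustrated by the following sketch. The mapping structure is hypothetical, and the concrete SQL rewrites in the comments (for example, replacing ORDER BY with SORT BY, DISTINCT with GROUP BY, or SELECT * with an explicit column list) are only assumed examples of each scheme:

# Hedged sketch of a possible optimization-scheme mapping for non-recommended commands.
OPTIMIZATION_SCHEMES = {
    "global ordering command":        "replace with a local ordering command",           # e.g. ORDER BY -> SORT BY
    "data deduplication command":     "replace with a data grouping command",            # e.g. DISTINCT -> GROUP BY
    "first data filtering command":   "replace with the second data filtering command",
    "multi-table merge without predicate pushdown": "perform predicate pushdown before the join",
    "create partition without specified partition": "specify the partition explicitly",
    "select without column clipping": "select only the required columns",                # e.g. SELECT * -> SELECT col1, col2
}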
The first optimizing module 1006 may be further configured to obtain first prompt information if the keyword matching result is that the command to be optimized is not matched, where the first prompt information includes information for prompting to obtain a platform log.
The second acquisition module 1008 may be used to acquire log information of tasks executing scripts to be processed.
The log information of the task executing the script to be processed may include log information of the task of the first script to be tried.
The log information of the task executing the script to be processed may include log information of the task executing the second script.
The second obtaining module 1008 may be further configured to obtain log information of a task of the second script according to information for prompting to obtain a platform log.
The log scanning module 1010 may be configured to scan and parse the log information to obtain task execution information corresponding to the script to be processed.
The task execution information may include size information of the input table.
The task execution information may further include a phase execution index value including a read data amount and a start time.
The task execution information may also include execution abnormality information.
The second matching module 1012 may be configured to determine, according to an optimization rule in the optimization rule base, an index corresponding to the task execution information as an index to be optimized.
The optimization rule may include a table size threshold.
The optimization rules may also include phase execution index thresholds including a read data amount threshold and a start time threshold.
The optimization rules may also include preset anomaly information.
The second matching module 1012 may be further configured to determine whether the size of the input table is smaller than the table size threshold according to the size information of the input table; if the size of the input table is smaller than the table size threshold, determine that a small table exists in the task execution, where the size of the input table is an index to be optimized.
The second matching module 1012 may be further configured to determine, according to the read data amount and the corresponding read data amount threshold, that the corresponding execution index is an index to be optimized; and determining the corresponding execution index as an index to be optimized according to the starting time and the corresponding starting time threshold.
The second matching module 1012 may be further configured to match the execution anomaly information with preset anomaly information, and determine, according to a matching result, that the corresponding index is the index to be optimized.
The second optimization module 1014 may be configured to obtain second hint information according to the index to be optimized, so as to perform a second optimization process on the task execution parameter according to the second hint information.
The second hint information may include information that hints to add broadcast combining parameters.
The second prompt information may further include information prompting the addition of the first optimization parameter and information prompting the adjustment of the task execution time.
The second hint information may also include information that hints to adjust the second optimization parameter.
The second optimization module 1014 may be further configured to obtain information prompting to add the broadcast combining parameter if the size of the input table is determined to be the index to be optimized.
The second optimization module 1014 may be further configured to obtain corresponding information for prompting to add the first optimization parameter if it is determined that the execution index corresponding to the read data amount is an index to be optimized; if the execution index corresponding to the starting time is determined to be the index to be optimized, information prompting to adjust the task execution time is obtained.
The second optimization module 1014 may be further configured to obtain corresponding information for prompting to adjust the second optimization parameter if it is determined that the execution index corresponding to the execution exception information is an index to be optimized.
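As a non-limiting sketch of the matching performed by the second matching module 1012 and the prompts generated by the second optimization module 1014, the logic may be outlined as follows; the field names, threshold names, and comparison directions (other than the small-table comparison described above) are assumptions introduced only for illustration:

def second_optimization_hints(execution_info: dict, rules: dict) -> list:
    hints = []
    # Small input table: prompt adding the broadcast combining parameter.
    if execution_info["input_table_size"] < rules["table_size_threshold"]:
        hints.append("add the broadcast combining parameter")
    # Stage read data amount crosses its threshold: prompt the first optimization parameter.
    if execution_info["read_data_amount"] > rules["read_data_amount_threshold"]:
        hints.append("add the first optimization parameter")
    # Stage start time crosses its threshold: prompt adjusting the task execution time.
    if execution_info["start_time"] > rules["start_time_threshold"]:
        hints.append("adjust the task execution time")
    # Execution anomaly matches preset anomaly information: prompt the second optimization parameter.
    if execution_info["execution_anomaly"] in rules["preset_anomaly_information"]:
        hints.append("adjust the second optimization parameter")
    return hints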
Specific implementation of each module in the apparatus provided in the embodiments of the present disclosure may refer to the content in the foregoing method, which is not described herein again.
Fig. 11 shows a schematic structural diagram of an electronic device in an embodiment of the disclosure. It should be noted that the apparatus shown in fig. 11 is only an example of a computer system, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 11, the apparatus 1100 includes a Central Processing Unit (CPU) 1101 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the apparatus 1100 are also stored. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1110, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read therefrom is installed into the storage section 1108 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1110. The above-described functions defined in the system of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 1101.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The described modules may also be provided in a processor, for example, as: a processor includes a first acquisition module, a first matching module, a first optimization module, a second acquisition module, a log scanning module, a second matching module, and a second optimization module. The names of these modules do not in any way limit the module itself, for example, the first acquisition module may also be described as "module for acquiring a script to be processed".
As another aspect, the present disclosure also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by one of the devices, cause the device to implement:
acquiring a script to be processed; analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result; obtaining first prompt information according to the keyword matching result so as to perform first optimization processing on the script to be processed according to the first prompt information; acquiring log information of a task for executing a script to be processed; scanning and analyzing the log information to obtain task execution information corresponding to the script to be processed; determining an index corresponding to the task execution information as an index to be optimized according to an optimization rule in an optimization rule base; and obtaining second prompt information according to the index to be optimized, and performing second optimization processing on the task execution parameters according to the second prompt information.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that this disclosure is not limited to the particular arrangements, instrumentalities and methods of implementation described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (12)

1. A method of task processing, comprising:
acquiring a script to be processed;
analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result;
obtaining first prompt information according to the keyword matching result so as to perform first optimization processing on the script to be processed according to the first prompt information;
acquiring log information of a task for executing the script to be processed;
scanning and analyzing the log information to obtain task execution information corresponding to the script to be processed;
determining an index corresponding to the task execution information as an index to be optimized according to an optimization rule in an optimization rule base;
and obtaining second prompt information according to the index to be optimized, so as to perform second optimization processing on the task execution parameters according to the second prompt information.
2. The method of claim 1, wherein the script to be processed comprises a first script that is uploaded to a task execution platform for the first time;
the log information of the task executing the script to be processed comprises log information of the task of the first script;
obtaining first prompt information according to the keyword matching result, wherein the first prompt information comprises:
if the keyword matching result is that the command to be optimized is not matched, the first script is run on the task execution platform;
obtaining log information of a task executing the script to be processed, including:
and acquiring log information of the task of the first script.
3. The method of claim 2, wherein obtaining the first hint information based on the keyword matching result further comprises:
and if the keyword matching result is that the command to be optimized is matched, acquiring the first prompt information, wherein the first prompt information comprises information for prompting an optimization scheme of the command to be optimized.
4. The method of claim 3, wherein the command to be optimized comprises at least one of a global ordering command, a data deduplication command, a first data filtering command, a multi-table merge without predicate pushdown command, a create partition command without specified partitions, and a select command without corresponding column clipping operations;
the optimization scheme of the global ordering command is to replace the global ordering command with a local ordering command, the optimization scheme of the data deduplication command is to replace the data deduplication command with a data grouping command, the optimization scheme of the first data filtering command is to replace the first data filtering command with a second data filtering command, the optimization scheme of the multi-table merging command without predicate pushing is predicate pushing, the optimization scheme of the creating partition command without specifying the partition is a specified partition, and the optimization scheme of the selecting command without corresponding column clipping operation is column clipping.
5. The method of claim 2, wherein parsing the script to be processed, matching the script to be processed with a keyword library, and obtaining a keyword matching result, comprises:
analyzing the first script, and carrying out validity check on the first script;
and if the validity check result of the first script is passed, matching the first script with the keyword library to obtain the keyword matching result.
6. The method according to claim 1, wherein the task execution information includes size information of an input table;
the optimization rule includes a table size threshold;
determining the index corresponding to the task execution information as the index to be optimized according to the optimization rule in the optimization rule base, wherein the method comprises the following steps:
judging whether the size of the input table is smaller than the table size threshold according to the size information of the input table;
if the size of the input table is smaller than the table size threshold, determining that a small table exists for task execution, wherein the size of the input table is the index to be optimized;
the second prompt information comprises information for prompting the addition of broadcast combination parameters;
obtaining second prompt information according to the index to be optimized, including:
and if the size of the input table is determined to be the index to be optimized, obtaining the information for prompting to add the broadcast combination parameters.
7. The method of claim 6, wherein the task execution information further includes a phase execution index value, the phase execution index value including a read data amount and a start time;
the optimization rule further comprises a stage execution index threshold, wherein the stage execution index threshold comprises a read data volume threshold and a starting time threshold;
determining the index corresponding to the task execution information as the index to be optimized according to the optimization rule in the optimization rule base, and further comprising:
determining a corresponding execution index as the index to be optimized according to the read data quantity and a corresponding read data quantity threshold;
determining a corresponding execution index as the index to be optimized according to the starting time and a corresponding starting time threshold;
the second prompt information comprises information for prompting to add the first optimization parameters and information for prompting to adjust the task execution time;
obtaining second prompt information according to the index to be optimized, including:
if the execution index corresponding to the read data quantity is determined to be the index to be optimized, obtaining corresponding information for prompting to add a first optimization parameter;
and if the execution index corresponding to the starting time is determined to be the index to be optimized, acquiring the information for prompting to adjust the task execution time.
8. The method of claim 6, wherein the task execution information further includes execution exception information;
the optimization rule further comprises preset abnormal information;
determining the index corresponding to the task execution information as the index to be optimized according to the optimization rule in the optimization rule base, and further comprising:
matching the execution anomaly information with preset anomaly information, and determining a corresponding index as the index to be optimized according to a matching result;
the second prompt information comprises information for prompting to adjust a second optimization parameter;
obtaining second prompt information according to the index to be optimized, including:
and if the execution index corresponding to the execution abnormality information is determined to be the index to be optimized, obtaining corresponding information for prompting to adjust the second optimization parameter.
9. The method of claim 1, wherein the script to be processed comprises a second script that has been run on a task execution platform;
the log information of the task executing the script to be processed comprises log information of the task executing the second script;
Obtaining first prompt information according to the keyword matching result, wherein the first prompt information comprises:
if the keyword matching result is that the command to be optimized is not matched, obtaining first prompt information, wherein the first prompt information comprises information for prompting to obtain a platform log;
obtaining log information of a task executing the script to be processed, including:
and obtaining log information of the task of the second script according to the information for prompting to obtain the platform log.
10. A task processing device, comprising:
the first acquisition module is used for acquiring a script to be processed;
the first matching module is used for analyzing the script to be processed, and matching the script to be processed with a keyword library to obtain a keyword matching result;
the first optimizing module is used for obtaining first prompt information according to the keyword matching result so as to perform first optimizing processing on the script to be processed according to the first prompt information;
the second acquisition module is used for acquiring log information of the task for executing the script to be processed;
the log scanning module is used for scanning and analyzing the log information to acquire task execution information corresponding to the script to be processed;
the second matching module is used for determining that the index corresponding to the task execution information is the index to be optimized according to the optimization rules in the optimization rule base;
and the second optimization module is used for obtaining second prompt information according to the index to be optimized so as to perform second optimization processing on the task execution parameters according to the second prompt information.
11. An electronic device, comprising: memory, a processor and executable instructions stored in the memory and executable in the processor, wherein the processor implements the method of any of claims 1-9 when executing the executable instructions.
12. A computer readable storage medium having stored thereon computer executable instructions, which when executed by a processor implement the method of any of claims 1-9.