CN113419957A - Rule-based big data offline batch processing performance capacity scanning method and device - Google Patents

Info

Publication number
CN113419957A
Authority
CN
China
Prior art keywords
hql
script
sentences
statement
performance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110741372.XA
Other languages
Chinese (zh)
Inventor
赵吉昆
张世瑛
梁晔华
王泽普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110741372.XA
Publication of CN113419957A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3684 Test management for test design, e.g. generating new test cases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2372 Updates performed during offline database operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An embodiment of the application provides a rule-based method and device for scanning the performance capacity of big data offline batch processing, relating to the technical field of big data and also applicable to the financial field. The method comprises: splitting HQL script program code according to a preset delimiter to obtain at least one HQL statement; parsing each HQL statement in turn and judging whether the parsed HQL statement matches a preset performance-hazard syntax rule; and, if so, outputting the corresponding HQL statement, the HQL script program code, and the performance-hazard syntax to a specified summary file. The method and device make it possible to check HQL scripts for performance capacity hazards effectively, accurately, and conveniently.

Description

Rule-based big data offline batch processing performance capacity scanning method and device
Technical Field
The application relates to the technical field of big data, can also be used in the financial field, and particularly relates to a rule-based big data offline batch processing performance capacity scanning method and device.
Background
An enterprise's big data service cloud platform provides data access, storage, computation, security management, resource management, and similar services for a variety of specialized big data analysis applications. As the platform matures, its technical stack grows larger and the business functions it carries grow richer. While the platform's operation and maintenance system keeps expanding, the performance of the specific models and logic of each application scenario under the new technical architecture has gradually become one of the focal points of product development. This also places higher demands on the methods and tools used to test the performance capacity of a big data platform under a multi-application, multi-tenant architecture.
The inventors found that, in the prior art, once the number of offline batch processing service systems connected to the big data service cloud reaches a certain scale (for example, hundreds), developers in each phase of project development, launch, and iterative optimization must use Hive SQL (hereinafter, HQL) scripts to apply business logic and related operations to the base data or subject data in a data lake or data warehouse. Because developers differ in coding style and in their command of HQL syntax, various inefficient constructs creep into each project. These slow down execution and consume large amounts of distributed cluster resources, leaving considerable room to optimize such offline batch scripts. Performance hazards deserve particular attention for HQL scripts with complex logic and for scripts that join against large (TB-level) base tables. It is therefore clearly important to detect inefficient constructs in HQL scripts and feed them back for correction.
Disclosure of Invention
To address the problems in the prior art, the application provides a rule-based method and device for scanning the performance capacity of big data offline batch processing, which can check HQL scripts for performance capacity hazards effectively, accurately, and conveniently.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
In a first aspect, the present application provides a rule-based big data offline batch processing performance capacity scanning method, including:
splitting HQL script program code according to a preset delimiter to obtain at least one HQL statement; and
parsing each HQL statement in turn, judging whether the parsed HQL statement matches a preset performance-hazard syntax rule, and if so, outputting the corresponding HQL statement, the HQL script program code, and the performance-hazard syntax to a specified summary file.
Further, judging whether the parsed HQL statement matches a preset performance-hazard syntax rule includes:
extracting the specified source tables in the parsed HQL statement, and determining the specified conditional statements for each source table; and
judging whether the conditional statement contains a restriction on a partition field; if not, judging that the HQL statement matches a preset full-table-scan hazard syntax rule; otherwise, judging that the HQL statement is normal.
Further, judging whether the parsed HQL statement matches a preset performance-hazard syntax rule includes:
extracting the specified source table in the parsed HQL statement, and determining the specified insert statement for the source table; and
judging whether the insert statement contains a restriction on a partition field; if not, judging that the HQL statement matches a preset full-table-insertion hazard syntax rule; otherwise, judging that the HQL statement is normal.
Further, judging whether the parsed HQL statement matches a preset performance-hazard syntax rule includes:
judging whether the parsed HQL statement contains any one of a specified query statement, a specified Cartesian product query statement, a specified sorting statement, a specified statistical statement, and a specified record insertion function; if so, judging that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judging that the HQL statement is normal.
Further, judging whether the parsed HQL statement matches a preset performance-hazard syntax rule includes:
judging whether the number of specified record-merge statements in the parsed HQL statement exceeds a threshold; if so, judging that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judging that the HQL statement is normal.
Further, judging whether the parsed HQL statement contains a specified Cartesian product query statement, judging that the HQL statement matches a preset performance-hazard syntax rule if so, and judging that the HQL statement is normal otherwise, includes:
when the parsed HQL statement contains a join query between two tables, judging whether a specific conditional statement is absent; if it is absent, judging that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judging that the HQL statement is normal.
Further, judging whether the parsed HQL statement contains the specified sorting statement, judging that the HQL statement matches a preset performance-hazard syntax rule if so, and judging that the HQL statement is normal otherwise, includes:
judging whether the parsed HQL statement contains the specified sorting statement order by; if so, judging that the HQL statement matches a preset performance-hazard syntax rule and replacing order by with either of the statements sort by or distribute by; otherwise, judging that the HQL statement is normal.
In a second aspect, the present application provides a rule-based big data offline batch processing performance capacity scanning apparatus, including:
an HQL script splitting module, configured to split HQL script program code according to a preset delimiter to obtain at least one HQL statement; and
a performance-hazard rule judging module, configured to parse each HQL statement in turn, judge whether the parsed HQL statement matches a preset performance-hazard syntax rule, and if so, output the corresponding HQL statement, the HQL script program code, and the performance-hazard syntax to a specified summary file.
Further, the performance-hazard rule judging module includes:
a conditional statement determining unit, configured to extract the specified source tables in the parsed HQL statement and determine the specified conditional statements for each source table; and
a full-table-scan hazard judging unit, configured to judge whether the conditional statement contains a restriction on a partition field; if not, judge that the HQL statement matches a preset full-table-scan hazard syntax rule; otherwise, judge that the HQL statement is normal.
Further, the performance-hazard rule judging module includes:
an insert statement determining unit, configured to extract the specified source table in the parsed HQL statement and determine the specified insert statement for the source table; and
a full-table-insertion hazard judging unit, configured to judge whether the insert statement contains a restriction on a partition field; if not, judge that the HQL statement matches a preset full-table-insertion hazard syntax rule; otherwise, judge that the HQL statement is normal.
Further, the performance-hazard rule judging module includes:
a performance-hazard syntax judging unit, configured to judge whether the parsed HQL statement contains any one of a specified query statement, a specified Cartesian product query statement, a specified sorting statement, a specified statistical statement, and a specified record insertion function; if so, judge that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judge that the HQL statement is normal.
Further, the performance-hazard rule judging module includes:
a record-merge hazard judging unit, configured to judge whether the number of specified record-merge statements in the parsed HQL statement exceeds a threshold; if so, judge that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judge that the HQL statement is normal.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the rule-based big data offline batch performance capacity scanning method when executing the program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the rule-based big data offline batch performance capacity scanning method.
According to the above technical solutions, the application provides a rule-based method and device for scanning the performance capacity of big data offline batch processing. By scanning the HQL statements in the HQL script program code one by one for high-risk syntax, it effectively improves the execution timeliness of jobs in offline analysis and mining scenarios on the big data platform, and makes it accurate and convenient to check HQL scripts for performance capacity hazards.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first flowchart of the rule-based big data offline batch processing performance capacity scanning method in an embodiment of the present application;
FIG. 2 is a second flowchart of the rule-based big data offline batch processing performance capacity scanning method in an embodiment of the present application;
FIG. 3 is a third flowchart of the rule-based big data offline batch processing performance capacity scanning method in an embodiment of the present application;
FIG. 4 is a first block diagram of the rule-based big data offline batch processing performance capacity scanning apparatus in an embodiment of the present application;
FIG. 5 is a second block diagram of the rule-based big data offline batch processing performance capacity scanning apparatus in an embodiment of the present application;
FIG. 6 is a third block diagram of the rule-based big data offline batch processing performance capacity scanning apparatus in an embodiment of the present application;
FIG. 7 is a fourth block diagram of the rule-based big data offline batch processing performance capacity scanning apparatus in an embodiment of the present application;
FIG. 8 is a fifth block diagram of the rule-based big data offline batch processing performance capacity scanning apparatus in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, once the number of offline batch processing service systems connected to the big data service cloud reaches a certain scale (for example, hundreds), developers in each phase of project development, launch, and iterative optimization must use Hive SQL (hereinafter, HQL) scripts to apply business logic and related operations to the base data or subject data in a data lake or data warehouse. Because developers differ in coding style and in their command of HQL syntax, various inefficient constructs creep into each project, slowing execution and consuming large amounts of distributed cluster resources, so such offline batch scripts leave considerable room for optimization. In view of this, the application provides a rule-based method and device for scanning the performance capacity of big data offline batch processing. By scanning the HQL statements in the HQL script program code one by one for high-risk syntax, it effectively improves the execution timeliness of jobs in offline analysis and mining scenarios on the big data platform, and makes it accurate and convenient to check HQL scripts for performance capacity hazards.
In order to effectively, accurately and conveniently check the hidden danger of the performance capacity of the HQL script, the present application provides an embodiment of a rule-based big data offline batch processing performance capacity scanning method, and referring to fig. 1, the rule-based big data offline batch processing performance capacity scanning method specifically includes the following contents:
Step S101: split the HQL script program code according to a preset delimiter to obtain at least one HQL statement.
Optionally, in the present application, the HQL script may be split into multiple HQL statements using the semicolon as the delimiter, and each HQL statement is then checked in turn for performance problems. For example, the HQL script can be split by calling a library such as sqlparse in Python, and the relevant information extracted according to data type (sqlparse splits an HQL statement into tokens, each of which corresponds to one data type).
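The splitting step can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the source names sqlparse as one possible library, whereas the sketch uses only the Python standard library so that it runs without extra dependencies. The function name split_hql is hypothetical, and comments and escaped quotes are deliberately not handled.

```python
def split_hql(script, delimiter=";"):
    """Split an HQL script into statements on `delimiter`.

    Delimiters inside quoted literals ('...', "...", `...`) are ignored.
    Assumption: no escaped quotes and no comments containing delimiters.
    """
    statements = []
    current = []
    quote = None                      # the quote character we are inside, if any
    for ch in script:
        if quote is not None:         # inside a quoted literal: copy verbatim
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in ("'", '"', "`"):   # a quoted literal starts here
            quote = ch
            current.append(ch)
        elif ch == delimiter:         # statement boundary
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()   # last statement without trailing delimiter
    if tail:
        statements.append(tail)
    return statements
```

Each returned string can then be passed to the rule checks one at a time.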
Step S102: parse each HQL statement in turn, judge whether the parsed HQL statement matches a preset performance-hazard syntax rule, and if so, output the corresponding HQL statement, the HQL script program code, and the performance-hazard syntax to a specified summary file.
Optionally, to judge whether an HQL statement in the script contains a keyword from the preset performance-hazard syntax rules, the script can be scanned line by line to check whether each line of HQL contains a rule-related keyword. The judgment function can also be packaged as an interface, which is then called to analyze all HQL scripts; if a flagged construct is found, the script name, the HQL statement passage, and the performance-hazard syntax are output to a final summary file.
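One way to package the judgment as a reusable interface is sketched below. The rule table, the function names scan_statements and write_summary, and the CSV layout of the summary file are all assumptions made for illustration; the source only states that the script name, the statement, and the matched hazard syntax are written to a summary file.

```python
import csv
import re

# Hypothetical rule table: rule name -> regex for the risky keyword.
RULES = {
    "order by": re.compile(r"\border\s+by\b", re.IGNORECASE),
    "select *": re.compile(r"\bselect\s+\*", re.IGNORECASE),
}

def scan_statements(script_name, statements, rules=RULES):
    """Return (script name, statement, rule name) for every rule a statement hits."""
    hits = []
    for stmt in statements:
        for rule_name, pattern in rules.items():
            if pattern.search(stmt):
                hits.append((script_name, stmt, rule_name))
    return hits

def write_summary(hits, path):
    """Append the hits to a CSV summary file, one row per hit."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(hits)
```

Keeping the rules in a table means a new keyword rule can be added without touching the scanning loop.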
As can be seen from the above description, the rule-based big data offline batch processing performance capacity scanning method provided in the embodiment of the present application scans the HQL statements in the HQL script program code one by one for high-risk syntax. It thereby effectively improves the execution timeliness of jobs in offline analysis and mining scenarios on the big data platform, and makes it accurate and convenient to check HQL scripts for performance capacity hazards.
In order to accurately determine the hidden danger syntax of full-table scan, in an embodiment of the method for scanning performance capacity of big data offline batch processing based on rules according to the present application, referring to fig. 2, the step S102 may further specifically include the following steps:
Step S201: extract the specified source tables in the parsed HQL statement, and determine the specified conditional statements for each source table.
Step S202: judge whether the conditional statement contains a restriction on a partition field; if not, judge that the HQL statement matches a preset full-table-scan hazard syntax rule; otherwise, judge that the HQL statement is normal.
Specifically, all source tables following from in the script are extracted, together with the where or and conditional statements of each source table, and it is judged whether the conditions contain a restriction on the partition field (for example, where pt_dt = ..., where pt_dt between ..., where pt_dt in ..., etc.). If so, the statement complies with the full-table-scan rule; if no such condition exists, the statement is judged to violate the full-table-scan rule.
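The full-table-scan check can be approximated with a regular expression, as in the sketch below. It is a simplification under stated assumptions: the function name missing_partition_filter is hypothetical, the partition field list defaults to pt_dt (the field used in the source's example), and the check is applied to the statement as a whole rather than per source table.

```python
import re

def missing_partition_filter(statement, partition_fields=("pt_dt",)):
    """True if the statement reads from a table but never filters on a partition field."""
    if not re.search(r"\bfrom\b", statement, re.IGNORECASE):
        return False                       # nothing is being read
    where = re.search(r"\bwhere\b(.*)", statement, re.IGNORECASE | re.DOTALL)
    if where is None:
        return True                        # no where clause at all
    conditions = where.group(1)
    # Look for "<field> =", "<field> between" or "<field> in" in the conditions.
    return not any(
        re.search(rf"\b{re.escape(field)}\b\s*(=|between\b|in\b)", conditions, re.IGNORECASE)
        for field in partition_fields
    )
```

A per-table check, as the text describes, would need a real parser to attribute each condition to its source table.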
In order to accurately determine the full-table insertion hidden danger syntax, in an embodiment of the method for scanning performance capacity of big data offline batch processing based on rules according to the present application, referring to fig. 3, the step S102 may further specifically include the following steps:
Step S301: extract the specified source table in the parsed HQL statement, and determine the specified insert statement for the source table.
Step S302: judge whether the insert statement contains a restriction on a partition field; if not, judge that the HQL statement matches a preset full-table-insertion hazard syntax rule; otherwise, judge that the HQL statement is normal.
Specifically, all insert statements in the script (for example, insert into / insert overwrite statements) are extracted, and it is judged whether each statement is followed by a restriction to the corresponding partition of the target table into which the data is inserted (for example, partition (pt_dt = 'yyyy-mm-dd'), partition (end_dt = 'yyyy-mm-dd'), etc.).
In this way, unnecessary full-table scans and full-table insertions in the script can be found, and restricting the time partition reduces the job running time and the cluster computing resources consumed by reading or inserting a whole table.
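The full-table-insertion check can be sketched in the same spirit. The regular expression and the function name insert_without_partition are assumptions; the sketch only looks for a partition (...) clause directly after the target table name of an insert into or insert overwrite statement.

```python
import re

# "insert into|overwrite [table] <name>" followed by the rest of the statement.
INSERT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?[\w.]+\s*(?P<rest>.*)",
    re.IGNORECASE | re.DOTALL,
)

def insert_without_partition(statement):
    """True if an insert statement names a target table with no partition(...) clause."""
    match = INSERT_RE.search(statement)
    if match is None:
        return False                       # not an insert statement
    return re.match(r"partition\s*\(", match.group("rest"), re.IGNORECASE) is None
```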
In order to accurately determine other syntax of performance risk, in an embodiment of the method for scanning performance capacity of big data offline batch processing based on rules, the step S102 may further include the following steps:
judging whether the parsed HQL statement contains any one of a specified query statement, a specified Cartesian product query statement, a specified sorting statement, a specified statistical statement, and a specified record insertion function; if so, judging that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judging that the HQL statement is normal.
In a practical embodiment of the application, select * query statements should be avoided in the script as far as possible. For a large table with a huge data volume and many fields, querying in this way may fill the buffer when the result is output and thus affect normal processes. In subsequent revisions, the script can be changed to select only the fields that are actually used.
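A keyword check for this rule might look like the sketch below. It assumes that the discouraged query form is select * (including the alias.* variant), which is an interpretation of the advice to select only the needed fields; the name uses_select_star is hypothetical.

```python
import re

# Matches "select *", "select distinct *" and "select t.*", but not "count(*)".
SELECT_STAR = re.compile(r"\bselect\s+(?:distinct\s+)?(?:\w+\.)?\*", re.IGNORECASE)

def uses_select_star(statement):
    """True if the statement selects every column of a table."""
    return SELECT_STAR.search(statement) is not None
```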
In a possible embodiment of the present application, when a join query (e.g., a join) associates one table with another, a specific conditional statement (e.g., an on clause or a where condition) must be added, and a "join.
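A coarse version of the Cartesian product check is sketched below. It is an assumption-laden simplification: the function name join_without_condition is hypothetical, and a single on or where anywhere in the statement is taken to cover every join, which a parser-based check would not accept.

```python
import re

JOIN_RE = re.compile(r"\bjoin\b", re.IGNORECASE)
COND_RE = re.compile(r"\b(?:on|where)\b", re.IGNORECASE)

def join_without_condition(statement):
    """True if the statement joins tables but contains no on/where condition at all."""
    return bool(JOIN_RE.search(statement)) and not COND_RE.search(statement)
```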
In a practical embodiment of the present application, using the specified sorting statement (e.g., order by) in the script requires a global sort of the query result, which consumes a large amount of resources and time on a table with a large data volume. Therefore, when order by appears in the script and is not strictly necessary, other statements such as sort by or distribute by can be used instead, so that sorting is performed only within each reducer.
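The sorting rule can be sketched as a check that returns a suggestion when it fires. The function name order_by_advice and the wording of the message are assumptions. Note that this simple pattern also fires on order by inside a window specification such as over (partition by ... order by ...), which a real scanner would probably want to exclude.

```python
import re

ORDER_BY = re.compile(r"\border\s+by\b", re.IGNORECASE)

def order_by_advice(statement):
    """Return a suggestion if the statement uses order by, otherwise None."""
    if ORDER_BY.search(statement):
        return "order by forces a global sort; consider sort by or distribute by"
    return None
```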
In a practical embodiment of the present application, a statistical statement (e.g., count(distinct)) uses full aggregation: only one MapReduce (MR) job in Hadoop is started at run time, and with a large data volume data skew occurs easily, because count(distinct) groups by the group by field and sorts by the distinct field. In subsequent optimization, the calculation can instead group with group by first and then count, which starts two MR jobs and improves performance.
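The sketch below pairs a detection pattern with an example of the suggested rewrite. The table and column names (visits, user_id) and the function name uses_count_distinct are hypothetical. The two queries return the same number only if user_id contains no NULL values, since count(distinct) ignores NULL while group by produces a NULL group.

```python
import re

COUNT_DISTINCT = re.compile(r"\bcount\s*\(\s*distinct\b", re.IGNORECASE)

def uses_count_distinct(statement):
    """True if the statement contains count(distinct ...)."""
    return COUNT_DISTINCT.search(statement) is not None

# Flagged form: one global aggregation.
before = "select count(distinct user_id) from visits where pt_dt = '2021-06-30'"

# Rewritten form: deduplicate with group by first, then count the groups.
after = (
    "select count(1) from ("
    "select user_id from visits where pt_dt = '2021-06-30' group by user_id"
    ") t"
)
```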
In a practical embodiment of the present application, the record insertion function collect_set() in SQL returns a set, so duplicate records are not kept, and the first piece of data can be obtained from its result. After the HQL is submitted, it runs as an MR task, and the statistics must be computed on the reduce side in Hadoop. If the data is not grouped with group by, all of it flows to a single reducer, so that one reducer node occupies a large share of resources and data skew occurs easily. Likewise, when size is then applied to obtain the length, all data must be placed in the same reducer, as with count(distinct). When counting and deduplicating, multiple reducers cannot work at the same time, because each task would hold independent data and the count could not be computed accurately; a single node may then occupy a large amount of memory, so a group by operation is required when computing statistics. collect_list and collect_set both convert a column within a group into an array and return it; the difference is that collect_list does not deduplicate whereas collect_set does. collect_list is disabled for the same reason as collect_set.
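Two readings of this rule are possible: flag every use of the collect functions, or flag only uses that lack a group by. The sketch below offers a check for each; both function names are hypothetical and the choice between them is left open here.

```python
import re

COLLECT_RE = re.compile(r"\bcollect_(?:set|list)\s*\(", re.IGNORECASE)
GROUP_BY_RE = re.compile(r"\bgroup\s+by\b", re.IGNORECASE)

def uses_collect(statement):
    """True if the statement calls collect_set or collect_list."""
    return COLLECT_RE.search(statement) is not None

def collect_without_group_by(statement):
    """True if a collect function is used and the statement has no group by."""
    return uses_collect(statement) and not GROUP_BY_RE.search(statement)
```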
In order to accurately determine other syntax of performance risk, in an embodiment of the method for scanning performance capacity of big data offline batch processing based on rules, the step S102 may further include the following steps:
judging whether the number of specified record-merge statements in the parsed HQL statement exceeds a threshold; if so, judging that the HQL statement matches a preset performance-hazard syntax rule; otherwise, judging that the HQL statement is normal.
Specifically, when more than 2 record-merge statements (e.g., union all) are used in one HQL statement, or the data volume of each union all branch is too large, the large data volume makes the running time too long; in subsequent optimization, the HQL statement can be split into multiple insert into statements.
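The counting part of this rule can be sketched as follows, assuming that the record-merge statement is union all and using the threshold of 2 given in the text. The function name too_many_union_all is hypothetical, and the data volume of each branch is not something a syntax scan can see.

```python
import re

UNION_ALL = re.compile(r"\bunion\s+all\b", re.IGNORECASE)

def too_many_union_all(statement, threshold=2):
    """True if the statement contains more than `threshold` union all operators."""
    return len(UNION_ALL.findall(statement)) > threshold
```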
In order to effectively, accurately and conveniently check the hidden danger of the performance capacity of the HQL script, the present application provides an embodiment of a rule-based big data offline batch processing performance capacity scanning apparatus for implementing all or part of the content of the rule-based big data offline batch processing performance capacity scanning method, and referring to fig. 4, the rule-based big data offline batch processing performance capacity scanning apparatus specifically includes the following contents:
The HQL script splitting module 10 is configured to split HQL script program code according to a preset delimiter to obtain at least one HQL statement.
The performance-hazard rule judging module 20 is configured to parse each HQL statement in turn, judge whether the parsed HQL statement matches a preset performance-hazard syntax rule, and if so, output the corresponding HQL statement, the HQL script program code, and the performance-hazard syntax to a specified summary file.
As can be seen from the above description, the rule-based big data offline batch processing performance capacity scanning device provided in the embodiment of the present application scans the HQL statements in the HQL script program code one by one for high-risk syntax. It thereby effectively improves the execution timeliness of jobs in offline analysis and mining scenarios on the big data platform, and makes it accurate and convenient to check HQL scripts for performance capacity hazards.
In order to accurately determine the full-table scan hidden danger syntax, in an embodiment of the rule-based big data offline batch performance capacity scanning apparatus of the present application, referring to fig. 5, the performance hidden danger rule determining module 20 includes:
the conditional statement determining unit 21 is configured to extract a setting source table in the HQL statement after the script is parsed, and determine a setting conditional statement in the source table.
And the full-table scanning hidden danger judging unit 22 is configured to judge whether the conditional statement includes a partition defining field, determine that the HQL statement conforms to a preset full-table scanning hidden danger syntax rule if the conditional statement does not include the partition defining field, and otherwise determine that the HQL statement is normal.
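A hedged sketch of the check made by units 21 and 22 is given below. The function name and the `partition_fields` parameter are assumptions introduced for illustration; the source does not say how the partition fields are supplied. The sketch uses regular expressions and treats everything after the first where keyword as the condition, so it does not handle subqueries.

```python
import re


def has_full_table_scan_risk(statement: str, partition_fields: set[str]) -> bool:
    """Return True when a query has no where condition naming a partition field."""
    text = statement.lower()
    if not re.search(r"\bselect\b.*\bfrom\b", text, re.S):
        return False  # not a query over a source table
    where = re.search(r"\bwhere\b(.*)", text, re.S)
    if where is None:
        return True  # no conditional statement at all
    condition = where.group(1)
    return not any(
        re.search(rf"\b{re.escape(field.lower())}\b", condition)
        for field in partition_fields
    )
```

For example, a query filtered only on `id` is flagged, while the same query filtered on `pt_dt` is treated as normal.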
In order to accurately determine the full table insertion hidden danger syntax, in an embodiment of the rule-based big data offline batch performance capacity scanning apparatus of the present application, referring to fig. 6, the performance hidden danger rule determining module 20 includes:
and an insertion statement determining unit 23, configured to extract the setting source table in the HQL statement subjected to the script parsing, and determine a setting insertion statement in the source table.
And the full-table insertion hidden danger judging unit 24 is configured to judge whether the insertion statement includes a partition defining field, determine that the HQL statement conforms to a preset full-table insertion hidden danger syntax rule if the insertion statement does not include the partition defining field, and otherwise determine that the HQL statement is normal.
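The check made by units 23 and 24 might look like the sketch below. It assumes, for illustration only, that every insert target is a partitioned table, so any insert without a partition clause directly after the table name is flagged. The names `INSERT_RE` and `has_full_table_insert_risk` are hypothetical.

```python
import re

# Matches "insert into|overwrite [table] <name>" and captures what follows the name.
INSERT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?[\w.]+\s*(.*)",
    re.I | re.S,
)


def has_full_table_insert_risk(statement: str) -> bool:
    """Return True when an insert statement has no partition clause."""
    match = INSERT_RE.search(statement)
    if match is None:
        return False  # not an insert statement
    # The partition clause, if present, directly follows the table name.
    return re.match(r"partition\s*\(", match.group(1), re.I) is None
```

A real scanner would need table metadata to know whether the target is actually partitioned; without it, inserts into unpartitioned tables are reported as false positives.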
In order to accurately determine other performance risk grammars, in an embodiment of the rule-based off-line batch performance capacity scanning apparatus for big data of the present application, referring to fig. 7, the performance risk rule determining module 20 includes:
and the potential performance hazard grammar judging unit 25 is configured to judge whether the HQL sentences subjected to the script analysis include any one of a set query sentence, a set cartesian product query sentence, a set sorting sentence, a set statistical sentence, and a set record insertion function, determine that the HQL sentences conform to a preset potential performance hazard grammar rule if any one of them is included, and otherwise determine that the HQL sentences are normal.
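A minimal sketch of the keyword checks made by unit 25 follows. The pattern table, the rule names, and both function names are assumptions for illustration. The patterns are deliberately simple: the select pattern only catches a star placed directly after select, and the cartesian product check only looks for a join with no on or where keyword anywhere in the statement.

```python
import re

# Hypothetical rule table: rule name -> pattern that signals the risky syntax.
RISK_PATTERNS = {
    "select *": re.compile(r"\bselect\s+(?:\w+\.)?\*", re.I),
    "order by": re.compile(r"\border\s+by\b", re.I),
    "count(distinct)": re.compile(r"\bcount\s*\(\s*distinct\b", re.I),
    "collect_list/collect_set": re.compile(r"\bcollect_(?:list|set)\s*\(", re.I),
}


def has_cartesian_join_risk(statement: str) -> bool:
    """Return True when a join appears with neither an on nor a where keyword."""
    text = statement.lower()
    has_join = re.search(r"\bjoin\b", text) is not None
    has_condition = re.search(r"\b(?:on|where)\b", text) is not None
    return has_join and not has_condition


def find_risky_syntax(statement: str) -> list[str]:
    """Return the names of all rules that the statement matches."""
    hits = [name for name, pattern in RISK_PATTERNS.items() if pattern.search(statement)]
    if has_cartesian_join_risk(statement):
        hits.append("cartesian product")
    return hits
```

An empty result corresponds to the unit judging the statement as normal.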
In order to accurately determine other performance risk grammars, in an embodiment of the rule-based off-line batch performance capacity scanning apparatus for big data of the present application, referring to fig. 8, the performance risk rule determining module 20 includes:
and the recording and merging hidden danger judging unit 26 is configured to judge whether the number of the set recording and merging sentences (union all) in the HQL sentences after the script analysis exceeds a threshold value, determine that the HQL sentences meet a preset performance hidden danger grammar rule if the number exceeds the threshold value, and otherwise determine that the HQL sentences are normal.
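The threshold check made by unit 26 can be sketched as below. The function names are hypothetical; the default threshold of 2 follows the rule stated elsewhere in this description that more than two union all parts is a hazard.

```python
import re


def union_all_count(statement: str) -> int:
    """Count the union all keywords in one HQL statement."""
    return len(re.findall(r"\bunion\s+all\b", statement, re.I))


def has_union_all_risk(statement: str, threshold: int = 2) -> bool:
    """Return True when the number of union all keywords exceeds the threshold."""
    return union_all_count(statement) > threshold
```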
To further illustrate the present solution, the present application further provides a specific application example of implementing the rule-based big data offline batch performance capacity scanning method by using the rule-based big data offline batch performance capacity scanning apparatus, which specifically includes the following contents:
The HQL scripts under a given path are scanned automatically and statically once the path of the HQL scripts to be checked is entered, and each script is then compared against the established rules to judge whether it contains potential performance hazard syntax. If such a construct is found, the script name, the HQL statement paragraph, and the potential performance hazard syntax are output to a final summary file.
Specific established rules may include the following:
(1) full table scan and full table insertion rules
When data is inserted into a target partition table, or source data is selected from a base partition table, a partition restriction needs to be added (for example, select * from abc.table where pt_dt = 'yyyy-mm-dd', or insert into/overwrite table abc.table partition (pt_dt = 'yyyy-mm-dd'), where abc is the name of the library in which the table is located and pt_dt is the partition field of the source table; the actual field depends on the table). This rule aims to discover unnecessary full-table scans and full-table inserts in the script; restricting the time partition reduces the job running time and cluster computing resource consumption caused by reading or inserting a full table of data.
(2) SELECT * or SELECT t.*
select * query statements should be avoided in the script as far as possible. For a large table with a huge data volume and a large number of fields, such a query outputs so much data that the result occupies the buffer area and affects normal processes.
(3) Cartesian product query
When one table is joined to another, an on clause or a where condition must follow the join, and after the on condition there cannot be a join.
(4) Ordering Using order by
When order by is used in the script, the query result has to be sorted globally, which consumes a large amount of resources and time on a table with a large data volume. Therefore, when an order by statement appears in the script and global ordering is not strictly required by the scenario, it can be replaced by sort by or distribute by, so that sorting is performed only within each reducer.
(5) Using COUNT(DISTINCT) operations
When a count(distinct) statement is used, full aggregation is adopted and only one MR job is started at run time. Because count(distinct) groups by the group by field and sorts by the distinct field, data skew occurs easily when the data volume is large. In this case, subsequent optimization replaces it with group by followed by count, which starts two MR jobs and improves performance.
(6) The number of union all parts is more than 2
When union all is used more than twice in a single HQL statement, or when any one of the union all branches handles too much data, the large data volume makes the running time too long; during subsequent optimization the HQL statement can be split into several insert into statements.
(7) Using the collect_list and collect_set functions
In sql, collect_set() produces a set, so duplicate records are not allowed in it. It can be used to obtain the first item of data. After submission an HQL statement runs as an MR task, and the statistics have to be computed at the reduce end. When no group by is applied, all data flows to a single reducer for processing, so the resource usage of that reduce node becomes high and data skew occurs easily. Taking the length with size() after collect_set is equivalent to count(distinct): all data has to be placed in the same reducer, because deduplicated statistics cannot be processed by several reducers at once. With several reducers the task data are independent of each other and an accurate count cannot be obtained, so a single node ends up occupying a large amount of memory. A group by operation is therefore recommended when computing statistics. Both collect_list and collect_set convert a column within a group into an array and return it; the difference is that collect_list does not deduplicate and collect_set does. collect_list is therefore also disabled, for the same reason as collect_set.
The HQL script is split by calling a library such as sqlparse in Python, and the relevant information is extracted according to token type (sqlparse splits the HQL script into tokens, each of which has a type). The script is then scanned statement by statement to judge whether it contains the keywords associated with the rules. The judging function is wrapped in an interface, and all HQL scripts are analyzed by calling that interface.
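The overall flow described above (enter a path, scan each script, write hits to a summary file) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the present application: it uses the standard re module in place of sqlparse, splits naively on a semicolon, assumes scripts carry a `.hql` extension, writes the summary as CSV, and includes only three of the rules. The names `RULES` and `scan_scripts` are hypothetical.

```python
import csv
import re
from pathlib import Path

# Hypothetical, simplified rule table: rule name -> predicate over one statement.
RULES = {
    "select *": lambda s: re.search(r"\bselect\s+(?:\w+\.)?\*", s, re.I) is not None,
    "order by": lambda s: re.search(r"\border\s+by\b", s, re.I) is not None,
    "union all > 2": lambda s: len(re.findall(r"\bunion\s+all\b", s, re.I)) > 2,
}


def scan_scripts(script_dir: str, summary_path: str, spacer: str = ";") -> int:
    """Scan every *.hql file under script_dir and write one summary row per rule hit.

    Returns the number of hits written.
    """
    hits = 0
    with open(summary_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["script", "statement", "risky_syntax"])
        for path in sorted(Path(script_dir).rglob("*.hql")):
            text = path.read_text(encoding="utf-8")
            # Naive split: a spacer inside a quoted string would also split here.
            for statement in filter(None, (part.strip() for part in text.split(spacer))):
                for name, matches in RULES.items():
                    if matches(statement):
                        writer.writerow([path.name, statement, name])
                        hits += 1
    return hits
```

Keeping the rules in a table means a rule can be added or adjusted without touching the scanning loop, and the same function serves a directory holding one script or many.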
As can be seen from the above, the present application can achieve at least the following technical effects:
(1) offline batch processing operation efficiency improvement based on big data service cloud platform
The device scans and traverses low-efficiency and high-risk grammars in the HQL script, prompts are given to the grammars which accord with set rules, developers are fed back to adjust and modify the grammars, the batch operation performance after adjustment and modification can be optimized, and the operation timeliness of related application operation of an off-line analysis mining scene based on a big data platform is greatly improved.
(2) Big data distributed cluster computing resource intensive
The device can greatly reduce the computing resource consumption of the operation in unit time on the big data distributed cluster by optimizing and avoiding the low-efficiency grammar in the HQL statement, saves the resources which do not need to be occupied, enables the whole system to operate orderly, and improves the whole working efficiency of the big data platform.
(3) Script performance analysis automation
Scripts that do not meet the performance requirements can be screened out according to the rules by the big data performance capacity scanning device, and the specific rule type that is violated is marked, so that problems can be located quickly and accurately and the direction of subsequent performance optimization of the script is determined. The process is run entirely automatically by a program without human intervention, so that the rules and flow of script verification are more standardized, misjudgment of a script due to subjective human factors is avoided, the time required for manual script analysis is greatly reduced, and the efficiency of script analysis is improved.
(4) High flexibility, supporting single and batch analysis, and continuous and complete rule base
The large data performance capacity scanner supports large-batch script analysis and single script inspection, so that the large data performance capacity scanner can be used for finding out the performance problem existing in the script and can also be used for subsequently judging whether the performance problem of the modified script is solved or not: the modified script is analyzed by the large data performance capacity scanning device, whether the previous performance problem is improved or not is judged, whether the new performance problem exists or not is checked, the number of input scripts can be flexibly adjusted according to requirements, and the operation flexibility is high. Meanwhile, the method is suitable for analyzing all HQL scripts, rules in the device can be continuously modified, perfected and added, a flexible mechanism is set for the subsequent updating iteration of the whole device, the function can be updated through micro adjustment of codes, and the method is a very friendly device capable of continuously adjusting and optimizing. Other functions can be realized by modifying the internal rules, such as analysis of other types of grammar scripts, and the applicability is strong.
In order to effectively, accurately and conveniently check the hidden danger of the performance capacity of the HQL script on the hardware level, the present application provides an embodiment of an electronic device for implementing all or part of the contents in the rule-based big data offline batch performance capacity scanning method, where the electronic device specifically includes the following contents:
a processor, a memory, a communication interface, and a bus; the processor, the memory and the communication interface communicate with one another through the bus; the communication interface is used for transmitting information between the rule-based big data offline batch processing performance capacity scanning device and relevant equipment such as a core service system, a user terminal and a relevant database; the electronic device may be a desktop computer, a tablet computer, a mobile terminal, and the like, but the embodiment is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the embodiment of the rule-based big data offline batch processing performance capacity scanning method and the embodiment of the rule-based big data offline batch processing performance capacity scanning apparatus, the contents of which are incorporated herein; repeated details are not described again.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. Wherein, intelligence wearing equipment can include intelligent glasses, intelligent wrist-watch, intelligent bracelet etc..
In practical applications, part of the rule-based off-line batch performance capacity scanning method for big data can be executed on the electronic device side as described above, or all operations can be completed in the client device. The selection may be specifically performed according to the processing capability of the client device, the limitation of the user usage scenario, and the like. This is not a limitation of the present application. The client device may further include a processor if all operations are performed in the client device.
The client device may have a communication module (i.e., a communication unit), and may be communicatively connected to a remote server to implement data transmission with the server. The server may include a server on the task scheduling center side, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 9 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 9, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 9 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the functionality of the rule-based big data offline batch performance capacity scanning method may be integrated into the central processor 9100. The central processor 9100 may be configured to control as follows:
step S101: and splitting the HQL script program code according to a preset spacer to obtain at least one HQL statement.
Step S102: and sequentially carrying out script analysis on the HQL sentences, judging whether the HQL sentences subjected to the script analysis meet preset potential performance grammar rules, and if so, outputting the corresponding HQL sentences, HQL script program codes and potential performance grammar to a setting summary file.
As can be seen from the above description, according to the electronic device provided by the embodiment of the application, high-risk grammars are sequentially scanned and traversed on HQL sentences in HQL script program codes, so that the running timeliness of application operations of offline analysis and mining scenes of a large data platform is effectively improved, and hidden dangers of performance capacity of an HQL script are accurately and conveniently checked.
In another embodiment, the rule-based big data offline batch performance capacity scanning apparatus may be configured separately from the central processor 9100, for example, the rule-based big data offline batch performance capacity scanning apparatus may be configured as a chip connected to the central processor 9100, and the function of the rule-based big data offline batch performance capacity scanning method may be implemented by the control of the central processor.
As shown in fig. 9, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 also does not necessarily include all of the components shown in fig. 9; in addition, the electronic device 9600 may further include components not shown in fig. 9, which may be referred to in the prior art.
As shown in fig. 9, a central processor 9100, sometimes referred to as a controller or operational control, can include a microprocessor or other processor device and/or logic device, which central processor 9100 receives input and controls the operation of the various components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, or other suitable device. The information relating to the failure may be stored, and a program for executing the information may be stored. And the central processing unit 9100 can execute the program stored in the memory 9140 to realize information storage or processing, or the like.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid state memory, e.g., Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. There may also be a memory that holds information even when power is off, can be selectively erased, and is provided with more data, an example of which is sometimes called an EPROM or the like. The memory 9140 could also be some other type of device. Memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application/function storage portion 9142, the application/function storage portion 9142 being used for storing application programs and function programs or for executing a flow of operations of the electronic device 9600 by the central processor 9100.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all steps in the rule-based big data offline batch performance capacity scanning method with the execution subject being the server or the client in the foregoing embodiments, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements all steps of the rule-based big data offline batch performance capacity scanning method with the execution subject being the server or the client in the foregoing embodiments, for example, when the processor executes the computer program, the processor implements the following steps:
step S101: and splitting the HQL script program code according to a preset spacer to obtain at least one HQL statement.
Step S102: and sequentially carrying out script analysis on the HQL sentences, judging whether the HQL sentences subjected to the script analysis meet preset potential performance grammar rules, and if so, outputting the corresponding HQL sentences, HQL script program codes and potential performance grammar to a setting summary file.
As can be seen from the above description, according to the computer-readable storage medium provided in the embodiment of the present application, high-risk grammars are sequentially scanned and traversed on HQL statements in an HQL script program code, so that the running time of application operations in an offline analysis and mining scenario of a large data platform is effectively improved, and hidden danger troubleshooting can be accurately and conveniently performed on the performance capacity of an HQL script.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A big data offline batch processing performance capacity scanning method based on rules is characterized by comprising the following steps:
splitting the HQL script program code according to a preset spacer to obtain at least one HQL statement;
and sequentially carrying out script analysis on the HQL sentences, judging whether the HQL sentences subjected to the script analysis meet preset potential performance grammar rules, and if so, outputting the corresponding HQL sentences, HQL script program codes and potential performance grammar to a setting summary file.
2. The method as claimed in claim 1, wherein the step of determining whether the HQL statement parsed by the script meets a preset syntax rule of hidden performance danger includes:
extracting a set source table in the HQL sentences analyzed by the script, and determining set condition sentences in the source table;
and judging whether the conditional statement contains a partition limited field, if not, judging that the HQL statement conforms to a preset full-table scanning hidden danger grammar rule, and otherwise, judging that the HQL statement is normal.
3. The method as claimed in claim 1, wherein the step of determining whether the HQL statement parsed by the script meets a preset syntax rule of hidden performance danger includes:
extracting a set source table in the HQL statement after the script is analyzed, and determining a set insert statement in the source table;
and judging whether the inserted statement contains a partition limited field, if not, judging that the HQL statement conforms to a preset full-table insertion hidden danger grammar rule, and otherwise, judging that the HQL statement is normal.
4. The method as claimed in claim 1, wherein the step of determining whether the HQL statement parsed by the script meets a preset syntax rule of hidden performance danger includes:
and judging whether the HQL sentences analyzed by the script contain any one of set query sentences, set Cartesian product query sentences, set sequencing sentences, set statistical sentences and set record insertion functions, if so, judging that the HQL sentences accord with preset potential performance hazard grammar rules, and otherwise, judging that the HQL sentences are normal.
5. The method as claimed in claim 1, wherein the step of determining whether the HQL statement parsed by the script meets a preset syntax rule of hidden performance danger includes:
and judging whether the number of the set recorded merged sentences in the HQL sentences analyzed by the script exceeds a threshold value, if so, judging that the HQL sentences conform to a preset potential performance hazard grammar rule, and otherwise, judging that the HQL sentences are normal.
6. The method as claimed in claim 4, wherein the determining whether the HQL statement parsed by the script includes a query statement for setting cartesian product, if yes, determining that the HQL statement matches a preset syntax rule for hidden performance danger, otherwise, determining that the HQL statement is normal comprises:
and when judging that the two tables are connected and inquired in the HQL sentences analyzed by the script, judging whether specific condition sentences do not exist, if so, judging that the HQL sentences accord with preset potential performance hazard grammar rules, and otherwise, judging that the HQL sentences are normal.
7. The method as claimed in claim 4, wherein the determining whether the HQL statement parsed by the script includes the set sorting statement, if so, determining that the HQL statement matches a preset potential performance hazard grammar rule, otherwise, determining that the HQL statement is normal, includes:
judging whether the HQL sentences analyzed by the script contain set sorting statement order by, if so, judging that the HQL sentences accord with preset potential performance hazard grammar rules, replacing the set sorting statement order by with any one of sorting statements sort by or distribute by, and otherwise, judging that the HQL sentences are normal.
8. A big data offline batch processing performance capacity scanning device based on rules is characterized by comprising the following components:
the HQL script splitting module is used for splitting the HQL script program codes according to a preset spacer to obtain at least one HQL statement;
and the performance hidden danger rule judging module is used for sequentially carrying out script analysis on the HQL sentences, judging whether the HQL sentences subjected to the script analysis meet preset performance hidden danger grammar rules or not, and if so, outputting the corresponding HQL sentences, HQL script program codes and performance hidden danger grammars to a setting summary file.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the rule-based big data offline batch performance capacity scanning method of any one of claims 1 to 5.
10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the rule-based big data offline batch performance volume scanning method of any one of claims 1 to 5.
CN202110741372.XA 2021-06-30 2021-06-30 Rule-based big data offline batch processing performance capacity scanning method and device Pending CN113419957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110741372.XA CN113419957A (en) 2021-06-30 2021-06-30 Rule-based big data offline batch processing performance capacity scanning method and device

Publications (1)

Publication Number Publication Date
CN113419957A true CN113419957A (en) 2021-09-21

Family

ID=77717390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110741372.XA Pending CN113419957A (en) 2021-06-30 2021-06-30 Rule-based big data offline batch processing performance capacity scanning method and device

Country Status (1)

Country Link
CN (1) CN113419957A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563167A (en) * 2022-12-02 2023-01-03 浙江大华技术股份有限公司 Data query method, electronic device and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN112783793B (en) Automatic interface test system and method
CN111126019B (en) Report generation method and device based on mode customization and electronic equipment
CN113419789A (en) Method and device for generating data model script
CN106557307B (en) Service data processing method and system
CN111143390A (en) Method and device for updating metadata
CN112860264B (en) Method and device for reconstructing abstract syntax tree
CN113448869B (en) Method and device for generating test case, electronic equipment and computer readable medium
CN113419957A (en) Rule-based big data offline batch processing performance capacity scanning method and device
CN110297820B (en) Data processing method, device, equipment and storage medium
CN112988600A (en) Service scene testing method and device, electronic equipment and storage medium
CN115687050A (en) Performance analysis method and device of SQL (structured query language) statement
CN113515447B (en) Automatic testing method and device for system
CN114968917A (en) Method and device for rapidly importing file data
CN116150029A (en) Automatic overdue batch testing method and device for loan system
US8615744B2 (en) Methods and system for managing assets in programming code translation
CN114676113A (en) Heterogeneous database migration method and system based on task decomposition
CN114840421A (en) Log data processing method and device
CN113434423A (en) Interface test method and device
CN110334098A (en) A kind of database combining method and system based on script
CN113190236B (en) HQL script verification method and device
CN112905491B (en) Software test effectiveness analysis method and device
CN113722237B (en) Device testing method and electronic device
CN112988603B (en) Big data test case generation method and device
CN111339748B (en) Evaluation method, device, equipment and medium of analytical model
CN113688044A (en) Automatic testing method and device based on business scene library

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination