CN111309757B - SQL interpreter and optimization method of HBase - Google Patents

SQL interpreter and optimization method of HBase

Info

Publication number
CN111309757B
Authority
CN
China
Prior art keywords
operator
hbase
sql
lexical
analyzer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010405641.0A
Other languages
Chinese (zh)
Other versions
CN111309757A (en)
Inventor
赵欣
The other inventors have requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yingshisheng Information Technology Co Ltd
Original Assignee
Shenzhen Yingshisheng Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yingshisheng Information Technology Co Ltd
Priority to CN202010405641.0A
Publication of CN111309757A
Application granted
Publication of CN111309757B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G06F 16/24534 Query rewriting; Transformation
    • G06F 16/24547 Optimisations to support specific applications; Extensibility of optimisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An SQL interpreter for HBase and an optimization method therefor. By providing the combination of a lexical analyzer, a syntax analyzer, a semantic analyzer and an executor, the interpreter parses the logical operators in an SQL statement and converts them into physical operators executed directly in the HBase database, thereby lowering the barrier to using the HBase database for the large community of SQL users.

Description

SQL interpreter and optimization method of HBase
Technical Field
The invention relates to SQL-based data operation on the HBase database, and in particular to an SQL interpreter for HBase and an optimization method. By providing the combination of a lexical analyzer, a syntax analyzer, a semantic analyzer and an executor, the interpreter can parse the logical operators in an SQL statement and convert them into physical operators executed directly in the HBase database, thereby lowering the barrier to using the HBase database for the large community of SQL users.
Background
With the growth of business scale, and of data volume in particular, relational databases can no longer meet practical requirements such as mass data storage and highly concurrent access. For this reason many alternative systems have emerged in the big-data field, and HBase is a database quite different from relational databases. HBase is a distributed, column-oriented database whose storage model is multi-level key-value pairs plus columnar storage, and it solves well the problems of storing structured and semi-structured mass data and of efficient random reading and writing. However, HBase provides no SQL support, and such support is difficult precisely because its storage model is multi-level key-value pairs plus columnar storage: the basic operations of this storage model are not relational operations but key-value operations. The basic key-value operations are mainly: the Put operation, which adds a new key-value pair and, if the key already exists, updates (replaces) its value; the Get operation, which obtains the value by key; the Delete operation, which deletes the key-value pair by key; and the Scan operation, which scans and returns all key-value pairs satisfying a condition. These basic operations are also the data access interface (Client API) that HBase provides externally; data stored in HBase can be accessed and manipulated only through them. Compared with SQL, the HBase Client API has poor generality and flexibility, and its semantics are not as rich as SQL's: complex data operations require writing program code that combines these basic operations, which undoubtedly increases the workload and the difficulty of use and sacrifices the generality and extensibility of applications that depend on HBase. SQL is semantically a relational operation, so SQL cannot be mapped onto these key-value operations in the way a relational database would do it.
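For concreteness, the four basic key-value operations named above look roughly as follows through the standard HBase Java client (HBase 2.x API); this is a minimal sketch, and the table name "Orders", the column family "cf" and the values used are illustrative assumptions rather than details taken from the patent:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// A minimal sketch of the four basic HBase Client API operations described above:
// Put (insert/replace), Get (read by key), Scan (conditional/range read), Delete (remove by key).
// Table name, column family and values are illustrative only.
public class HBaseBasicOpsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("Orders"))) {

            byte[] cf = Bytes.toBytes("cf");
            byte[] rowKey = Bytes.toBytes("0001");

            // Put: add a new key-value; if the key already exists, its value is replaced.
            Put put = new Put(rowKey);
            put.addColumn(cf, Bytes.toBytes("Amount"), Bytes.toBytes("120.00"));
            table.put(put);

            // Get: obtain the value by key.
            Result row = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(row.getValue(cf, Bytes.toBytes("Amount"))));

            // Scan: return all key-values satisfying a condition (here, a row-key range).
            Scan scan = new Scan().withStartRow(Bytes.toBytes("0001")).withStopRow(Bytes.toBytes("0999"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }

            // Delete: delete the key-value by key.
            table.delete(new Delete(rowKey));
        }
    }
}
```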
The inventor, as a developer of database systems, has recognized that, as database systems have evolved to the present day, SQL (Structured Query Language) is the de facto standard for database access. All kinds of databases, including relational databases, non-relational databases and the various data processing frameworks, either provide or wish to provide SQL support, so that client programs and developers can operate on data in a uniform way, reducing the difficulty of using a database and improving development efficiency. However, the SQL language was born of and grew up with relational databases. SQL was originally a declarative programming language designed specifically for operating relational databases; its theoretical basis is relational algebra, although it has many features that relational algebra lacks, such as aggregation and database updates. The theoretical model of SQL and its origin make the SQL language a natural fit for the relational model and for relational databases. Consider, for example, the following SQL statement: SELECT A, B, C FROM T WHERE A > 0. Its semantics are to retrieve the data satisfying the condition A > 0 from the data table T and return the three columns A, B, C; its semantics are described with relational algebra as shown in fig. 1 (assuming table T has the four columns A, B, C, D), in which the projection is A, B, C; the selection is A > 0; and the relation is T[A, B, C, D].
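In standard relational-algebra notation (a restatement of the description of fig. 1, not a reproduction of the figure itself), this query is the projection over the selection over the relation:

```latex
\pi_{A,B,C}\bigl(\sigma_{A>0}(T[A,B,C,D])\bigr)
```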
The inventor recognizes that execution of an SQL statement is completed by the relational database as follows: the relational database parses and converts the SQL statement into a series of basic relational-algebra operations, and the generated series of operations forms an "execution plan". The lower layer of the relational database implements a number of "operators" for the basic operations, each operator corresponding to one relational operation. The database traverses every step in the execution plan, finds the operator corresponding to that step, and completes the execution of the SQL statement by invoking the operator. The specific flow is shown in fig. 2: step 1, generate the execution plan; step 2, traverse each step of the execution plan; step 3, return the result after the traversal is complete. Each step within step 2 comprises step 2.1, match the operator, and step 2.2, invoke the operator. Not all databases adopt the relational model. In the big-data field in particular there is essentially no database based on the relational model, because the innate properties of the relational model (centralized control of data, reduction of data redundancy, data structuring, and so on, which are at once the strengths and the limitations of relational databases) make relational databases unsuitable for the storage and computation of very large data volumes. As business scale, and data scale in particular, grows, relational databases cannot meet practical requirements such as mass data storage and highly concurrent access, and numerous alternative systems, for example the HBase database to which this invention relates, have therefore emerged in the big-data field. The inventor's idea is that if an SQL interpreter suited to the HBase database is provided on the basis of the relational algebra underlying SQL statements, comprising the combination of a lexical analyzer, a syntax analyzer, a semantic analyzer and an executor, then the logical operators in an SQL statement can be parsed and converted into physical operators executed directly in the HBase database, lowering the barrier to using the HBase database for the large community of SQL users.
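As a minimal sketch of the generic loop of fig. 2 (all class and method names here are hypothetical and not the patent's implementation):

```java
import java.util.List;
import java.util.Map;

// A minimal sketch of the fig. 2 flow: traverse the execution plan, match each
// step to an operator, and invoke it. Names and types are illustrative only;
// a real engine would pass row batches rather than plain Objects.
interface Operator {
    Object invoke(Object input);
}

class PlanInterpreter {
    private final Map<String, Operator> operators;   // operator registry, e.g. "projection" -> implementation

    PlanInterpreter(Map<String, Operator> operators) { this.operators = operators; }

    /** Step 2: traverse each plan step, match and call its operator; step 3: return the result. */
    Object run(List<String> executionPlan, Object initialInput) {
        Object result = initialInput;
        for (String step : executionPlan) {                         // step 2: traverse the plan
            Operator op = operators.get(step);                      // step 2.1: match the operator
            if (op == null) throw new IllegalStateException("no operator for step: " + step);
            result = op.invoke(result);                             // step 2.2: invoke the operator
        }
        return result;                                              // step 3: return the result
    }
}
```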
Regarding interpretation and interpreters in the software field, the inventor notes that the source code of the various programming languages (for example the C language, the Java language, the Python language) can be recognized and executed by a computer only after processing by a dedicated compiler or interpreter. As shown in fig. 3, there are three main processing modes. Mode a: the source code of a compiled language (such as C) is processed by a compiler to produce an executable program, which can run directly without the help of additional software. Mode b: an interpreted language (such as Java) is processed by an interpreter into intermediate code (an intermediate representation better suited to interpreted execution, with semantically irrelevant parts of the source, such as comments, removed; for Java the intermediate code is bytecode), and the intermediate code is then interpreted and executed by a dedicated virtual machine program. Mode c: the other way of implementing an interpreted language (such as Python) is for the source code to be interpreted directly by the interpreter, with no intermediate result and no dedicated virtual machine. Of the three, mode a performs best: the compiler compiles every statement of the source program into machine language and stores it as a binary file (the executable program), so at run time the computer runs the machine-language program directly and is fast. Modes b and c do not perform as well as mode a, because the intermediate code or source code is interpreted into machine language while the program is executing. SQL is also a programming language and likewise must be compiled or interpreted before it can execute. Most databases process SQL in mode c of fig. 3, that is, SQL statements are executed directly by an interpreter: SQL is a declarative programming language (it expresses only the computational logic and does not describe its control flow), and without control logic it cannot be compiled and can only be interpreted.
Disclosure of Invention
To address the defects and shortcomings of the prior art, the invention provides an SQL interpreter for HBase and an optimization method. By providing the combination of a lexical analyzer, a syntax analyzer, a semantic analyzer and an executor, the logical operators in an SQL statement can be parsed and converted into physical operators executed directly in the HBase database, lowering the barrier to using the HBase database for the large community of SQL users.
The technical scheme of the invention is as follows:
the SQL interpreter of the HBase is characterized by comprising an SQL statement input interface for receiving an SQL statement from an application program, an operator calling interface connected with the HBase and a transmission interface for transmitting an execution result of the SQL statement in the HBase between the HBase and the application program, wherein the SQL statement input interface is connected with a lexical analyzer, the lexical analyzer is connected with a syntax analyzer, the syntax analyzer is connected with a semantic analyzer, the semantic analyzer is connected with an actuator, the actuator is connected with the operator calling interface of the HBase, and the semantic analyzer is provided with an information interface connected with the HBase.
The lexical analyzer, the syntax analyzer, the semantic analyzer and the executor are each connected to a symbol table management module, which stores the lexical units whose word type is identifier.
The lexical analyzer, the syntax analyzer, the semantic analyzer and the executor are each connected to an error processing module, which handles errors according to their type; the handling strategies include attempting automatic correction, terminating SQL parsing, or ignoring the error.
The semantic analyzer is connected with the executor through an optimizer.
The optimizer is respectively connected with the symbol table management module and the error processing module.
The lexical analyzer parses the input SQL statement into a sequence of lexical units using a deterministic finite automaton (DFA) algorithm; the sequence of lexical units is output to the syntax analyzer, error information is output to the error processing module, and identifiers are output to the symbol table management module.
The deterministic finite automaton (DFA) algorithm is defined by a five-tuple M = (K, Σ, f, S, Z), where K is a finite set whose elements are called states; Σ is a finite alphabet whose elements are called input symbols; f is the transition function, a mapping K × Σ → K; S ∈ K is the unique initial state; and Z ⊆ K is the set of final states.
f is either a total transition function or a partial transition function; when f is partial, f(ki, a) = kj (ki ∈ K, kj ∈ K) means that when the current state is ki and the input symbol is a, the automaton moves to the next state kj, i.e. kj is a successor state of ki.
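As one concrete instance, the integer-recognizing automaton of fig. 7 (state 1 initial, state 2 accumulating digits, the final state reached at the first non-digit character, which is not consumed) could be driven by such a partial transition function along the following lines; this is an illustrative sketch only:

```java
// A minimal sketch of the integer-recognizing DFA of fig. 7: state 1 is the initial
// state, state 2 accumulates digits, and the token ends (the final state is reached)
// at the first non-digit character, which is not consumed. Names are illustrative.
final class IntegerDfaSketch {
    /** Returns the length of the integer literal starting at 'pos', or 0 if none. */
    static int matchInteger(String input, int pos) {
        int state = 1;                                        // initial state S
        int i = pos;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == 1) {
                if (Character.isDigit(c)) { state = 2; i++; } // f(1, digit) = 2
                else return 0;                                // f(1, non-digit) undefined: no integer here
            } else {                                          // state == 2
                if (Character.isDigit(c)) { i++; }            // f(2, digit) = 2
                else return i - pos;                          // final state: token ends before the non-digit
            }
        }
        return state == 2 ? i - pos : 0;                      // end of input while in an accepting state
    }
}
```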
Each lexical unit in the sequence comprises the word itself and its word type; the word type classifies a word as one of: keyword, identifier, delimiter, operator, literal. The sequence of lexical units contains no spaces, line breaks or comments.
A keyword is matched against one of the following words: SELECT, FROM, WHERE, CREATE, DELETE, INSERT, UPDATE; an identifier is matched as a character sequence made up of letters, digits and underscores that is not a keyword; the literals include integer literals, date literals and/or string literals.
The syntax analyzer receives the sequence of lexical units from the lexical analyzer, assembles it into a syntax analysis tree according to the configured grammar rules, outputs the syntax analysis tree to the semantic analyzer, outputs error information to the error processing module, and updates the identifiers in the symbol table management module according to the result of syntax analysis.
The grammar rules are expressed either as statement composition rules or as the grammar quadruple G = (N, E, P, B) used in syntax analysis of computer programming languages, where N is the set of non-terminal symbols; E is the set of terminal symbols, and E and N have no intersection; P is the set of productions, each of the form (E ∪ N)* → (E ∪ N)*, where the string on the left-hand side of a production must contain at least one non-terminal symbol; and B is the start symbol, B ∈ N.
In the structure of the syntax analysis tree, the SQL statement is the root node at the top, the leaf nodes in the bottom layer are lexical units, and the branch nodes lie between the root node and the leaf nodes; the root node corresponds to the start symbol B, the leaf nodes belong to the terminal symbol set E, and the branch nodes belong to the non-terminal symbol set N.
In analyzing the sequence of lexical units against the configured grammar rules, the syntax analyzer uses an adaptive LL(k) algorithm: the first L of LL means the sequence of lexical units is read from left to right, the second L means the leftmost derivation is used during analysis, and k ≥ 1 means that up to k lexical units of lookahead are matched against the grammar rules. The adaptive LL(k) algorithm can analyze the grammar dynamically during syntax analysis and automatically rewrites left recursion in the grammar into an equivalent non-left-recursive form.
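For illustration only, a toy recursive-descent parser for the tiny fragment SELECT <columns> FROM <table> [WHERE <comparison>] is sketched below; it shows plain left-to-right parsing with leftmost derivation over a token stream like the one produced by the lexical analyzer, not the adaptive LL(k) machinery itself, and all names are hypothetical:

```java
import java.util.List;

// A toy LL(1)-style recursive-descent sketch for: SELECT cols FROM table [WHERE id op literal].
// It only illustrates left-to-right scanning with leftmost derivation; the adaptive LL(k)
// algorithm named above is far more general. Names are illustrative only.
final class MiniSelectParser {
    private final List<String> tokens;
    private int pos;

    MiniSelectParser(List<String> tokens) { this.tokens = tokens; }

    void parseQuery() {                  // query := SELECT resultList FROM identifier [WHERE comparison]
        expect("SELECT");
        parseResultList();
        expect("FROM");
        expectIdentifier();
        if (peekIs("WHERE")) { next(); parseComparison(); }
    }

    private void parseResultList() {     // resultList := identifier (',' identifier)*
        expectIdentifier();
        while (peekIs(",")) { next(); expectIdentifier(); }
    }

    private void parseComparison() {     // comparison := identifier ('>' | '<' | '=') literal
        expectIdentifier();
        String op = next();
        if (!op.equals(">") && !op.equals("<") && !op.equals("=")) fail("comparison operator", op);
        expectIdentifier();              // the literal; a real parser would check the word type here
    }

    private boolean peekIs(String s) { return pos < tokens.size() && tokens.get(pos).equalsIgnoreCase(s); }
    private String next() { return tokens.get(pos++); }
    private void expect(String s) { if (!peekIs(s)) fail(s, pos < tokens.size() ? tokens.get(pos) : "<eof>"); pos++; }
    private void expectIdentifier() { if (pos >= tokens.size()) fail("identifier or literal", "<eof>"); pos++; }
    private void fail(String expected, String got) { throw new IllegalStateException("expected " + expected + " but got " + got); }
}
```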
The semantic analyzer receives the syntax analysis tree from the syntax analyzer, looks up table and column information from the HBase database through the information interface, and generates an execution plan on the basis of three tasks: constructing the AST (abstract syntax tree), performing static type checking, and updating the identifiers stored in the symbol table management module. The execution plan is output to the executor, either directly or after being optimized by the optimizer; error information is output to the error processing module, and the identifiers in the symbol table management module are updated according to the result of semantic analysis.
The AST construction process transforms the structure of the syntax analysis tree according to the principles of compactness, ease of use and expressiveness, removing the lexical units not directly associated with the semantics, to form an association structure of semantic tree nodes with table names, column names, operation logic and literals.
Static type checking includes checking whether the expressions in the SQL statement are logically well-formed; if an ill-formed expression exists, the semantic analyzer passes the specific error information to the error processing module, which handles the error further.
Generating the execution plan means converting the semantic tree nodes under the SQL statement root node of the AST into corresponding execution plan nodes according to the relational-algebra meaning they represent, forming a tree structure composed of operators, where the operators are basic operations of relational algebra.
When the SQL statement is an SQL query statement, the semantic tree nodes are the result list, the data source and the filter condition: the result list is converted into the projection operator at the top, the filter condition into the selection operator in the middle, and the data source into the relation operator at the bottom; the query is executed by performing the relation operation first, then the selection operation, and finally the projection operation.
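A minimal sketch, with illustrative class names only, of the three-level operator tree of fig. 14 (projection over selection over relation) represented as plain data objects:

```java
import java.util.List;

// A minimal sketch of the fig. 14 operator tree (projection -> selection -> relation).
// Class and field names are illustrative; the predicate is kept as a string for brevity.
abstract class PlanNode {
    final List<PlanNode> children;
    PlanNode(List<PlanNode> children) { this.children = children; }
}

class RelationNode extends PlanNode {        // bottom: the data source, e.g. Orders[ID, Amount, Date, Status, Buyer]
    final String table;
    RelationNode(String table) { super(List.of()); this.table = table; }
}

class SelectionNode extends PlanNode {       // middle: the filter condition
    String predicate;                        // e.g. "(Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100)"
    SelectionNode(String predicate, PlanNode child) { super(List.of(child)); this.predicate = predicate; }
}

class ProjectionNode extends PlanNode {      // top: the result list
    final List<String> columns;              // e.g. [ID, Buyer]
    ProjectionNode(List<String> columns, PlanNode child) { super(List.of(child)); this.columns = columns; }
}

class PlanExample {
    static PlanNode ordersQueryPlan() {      // execution order: relation first, then selection, then projection
        return new ProjectionNode(List.of("ID", "Buyer"),
                new SelectionNode("(Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100)",
                        new RelationNode("Orders")));
    }
}
```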
The optimization algorithm of the optimizer performs at least one traversal of the operator tree that constitutes the execution plan and adjusts the attributes of tree nodes during the traversal; attribute adjustment includes modifying a node's expression or reshaping the tree, where reshaping includes adjusting the parent-child relationships between nodes, thereby optimizing the execution plan. The traversal is a recursive operation and is either a pre-order traversal or a post-order traversal. Pre-order traversal means visiting a node first and then its child nodes, until all tree nodes have been visited; post-order traversal means visiting a node's child nodes first and then the node itself, until all tree nodes have been visited.
The optimization algorithm of the optimizer comprises a logic optimization algorithm, a physical optimization algorithm and/or a directional optimization algorithm.
The logic optimization algorithm is a combination of one or more of the following algorithms: join reordering, predicate push-down, column pruning, projection merging, selection merging, constraint derivation, constant propagation, IN expression optimization, constant folding, LIKE expression optimization, BETWEEN expression optimization and logical expression optimization.
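As one illustration of how such a rule can be applied during a post-order traversal (reusing the illustrative PlanNode classes sketched earlier, and the BETWEEN rewrite visible in fig. 16), the following toy pass rewrites "col BETWEEN lo AND hi" into "col >= lo AND col <= hi"; the string-based predicate handling is an assumption made purely for brevity:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// An illustrative sketch of one logical optimization: walk the plan tree in post-order
// (children before the node itself) and rewrite "X BETWEEN a AND b" into
// "X >= a AND X <= b", as in the selection node of fig. 16.
final class BetweenRewritePass {
    private static final Pattern BETWEEN =
            Pattern.compile("([\\w-]+) BETWEEN ([\\w-]+) AND ([\\w-]+)");

    static void optimize(PlanNode node) {
        for (PlanNode child : node.children) optimize(child);     // post-order: visit children first
        if (node instanceof SelectionNode) {                      // attribute adjustment on the node's expression
            SelectionNode sel = (SelectionNode) node;
            Matcher m = BETWEEN.matcher(sel.predicate);
            sel.predicate = m.replaceAll("$1 >= $2 AND $1 <= $3");
        }
    }
}
```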
The physical optimization algorithm is used for dynamically compiling the computational logic in the execution plan into a machine language through a JIT just-in-time compiler, so that the execution efficiency of the SQL statement in the HBase database is improved.
The directional optimization algorithm is to convert a logical operator in an execution plan into a physical operator in the HBase according to a mapping relation.
The physical operators are provided in the HBase operator implementation module and comprise one or more of the following in combination: a projection-selection-relation operator, an aggregation-selection-relation operator, a projection operator, a selection operator, a relation operator, a join operator, an aggregation operator, a sort operator, a pagination operator, a delete operator, an update operator, an insert operator and an instruction operator.
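As an illustration of the fused projection-selection-relation (PFR) operator of figs. 18 and 19 expressed against HBase's basic Scan operation, a sketch using the standard HBase 2.x Java client API might look as follows; the column family name "cf", the string encoding of values and the use of client-side filters are assumptions for illustration, since the patent's operator implementation module runs inside HBase itself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of a fused projection-selection-relation (PFR) operator executed as one HBase Scan:
// relation -> the Orders table, projection -> only the needed columns are requested,
// selection -> the Date/Amount predicates become a FilterList. The column family "cf"
// and the lexicographic comparison of string-encoded values are illustrative assumptions.
public class PfrOperatorSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table orders = conn.getTable(TableName.valueOf("Orders"))) {

            byte[] cf = Bytes.toBytes("cf");
            Scan scan = new Scan();
            scan.addColumn(cf, Bytes.toBytes("ID"));       // projection: result columns
            scan.addColumn(cf, Bytes.toBytes("Buyer"));
            scan.addColumn(cf, Bytes.toBytes("Amount"));   // also needed to evaluate the filter
            scan.addColumn(cf, Bytes.toBytes("Date"));

            FilterList filter = new FilterList(FilterList.Operator.MUST_PASS_ALL);   // selection
            filter.addFilter(new SingleColumnValueFilter(cf, Bytes.toBytes("Amount"),
                    CompareOperator.GREATER, Bytes.toBytes("100.00")));
            filter.addFilter(new SingleColumnValueFilter(cf, Bytes.toBytes("Date"),
                    CompareOperator.GREATER_OR_EQUAL, Bytes.toBytes("2019-04-01")));
            filter.addFilter(new SingleColumnValueFilter(cf, Bytes.toBytes("Date"),
                    CompareOperator.LESS_OR_EQUAL, Bytes.toBytes("2019-06-30")));
            scan.setFilter(filter);

            try (ResultScanner scanner = orders.getScanner(scan)) {   // relation: one Scan over the table
                for (Result row : scanner) {
                    String id = Bytes.toString(row.getValue(cf, Bytes.toBytes("ID")));
                    String buyer = Bytes.toString(row.getValue(cf, Bytes.toBytes("Buyer")));
                    System.out.println(id + "\t" + buyer);
                }
            }
        }
    }
}
```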
The optimizer receives the execution plan from the semantic analyzer, looks up in the HBase database the physical operators onto which the logical operators of the execution plan map, and then forms a physical execution plan, which is output to the executor; the optimizer outputs error information to the error processing module and obtains identifiers from the symbol table management module.
The executor receives the execution plan and drives it by running the root node of the plan; the root node in turn calls its lower-level child nodes, and the calls proceed layer by layer until the whole execution plan has been run.
The executor adopts a resource pool technique: when the executor starts, it applies to the computer in advance for a portion of spare resources as the executor's resource pool, and when the executor processes an execution plan the resources are no longer allocated by the computer; instead the executor itself is responsible for allocating and reclaiming them.
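A minimal sketch of such a pre-allocated pool, reduced here to a fixed number of abstract work slots handed out and reclaimed by the executor rather than requested from the operating system per plan (names and the slot abstraction are illustrative; cf. fig. 22):

```java
import java.util.concurrent.Semaphore;

// An illustrative sketch of the executor's resource pool: a fixed capacity is reserved
// when the executor starts, execution plans acquire slots from the pool, and a plan that
// cannot be satisfied waits until earlier plans release their slots (cf. fig. 22).
final class ExecutorResourcePool {
    private final Semaphore slots;

    ExecutorResourcePool(int capacity) {              // capacity reserved up front at executor start-up
        this.slots = new Semaphore(capacity, true);   // fair ordering: waiting plans are served in order
    }

    void runPlan(String planName, int slotsNeeded, Runnable plan) throws InterruptedException {
        slots.acquire(slotsNeeded);                   // allocate from the pool, waiting if necessary
        try {
            plan.run();                               // run the execution plan with its reserved resources
        } finally {
            slots.release(slotsNeeded);               // reclaim the resources for later plans
        }
    }
}
```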
The executor receives the physical execution plan from the optimizer and calls the operator implementation module in the HBase database according to the physical operator mapping in the physical execution plan; the operator implementation module runs the mapped physical operators and, after completing the reads and writes on the HBase data table through the basic operation module, the extended operation module and/or the transformation operation module, returns the execution result to the executor, and the executor outputs the final execution result of the SQL statement.
An optimization method used in the above SQL interpreter for HBase is characterized by comprising a logic optimization algorithm, a physical optimization algorithm and/or a directional optimization algorithm. The logic optimization algorithm is a combination of one or more of the following algorithms: join reordering, predicate push-down, column pruning, projection merging, selection merging, constraint derivation, constant propagation, IN expression optimization, constant folding, LIKE expression optimization, BETWEEN expression optimization and logical expression optimization. The physical optimization algorithm dynamically compiles the computational logic in the execution plan into machine language through a JIT just-in-time compiler, improving the execution efficiency of the SQL statement in the HBase database. The directional optimization algorithm converts the logical operators in the execution plan into the physical operators they map to in HBase according to the mapping relation; the physical operators are provided in the HBase operator implementation module and comprise one or more of the following in combination: a projection-selection-relation operator, an aggregation-selection-relation operator, a projection operator, a selection operator, a relation operator, a join operator, an aggregation operator, a sort operator, a pagination operator, a delete operator, an update operator, an insert operator and an instruction operator.
The invention has the following technical effects. The SQL interpreter for HBase and the optimization method of the invention, by providing the combination of a lexical analyzer, a syntax analyzer, a semantic analyzer and an executor, can parse the logical operators in an SQL statement and thereby provide SQL access support for HBase: HBase can be operated through SQL, with execution efficiency greatly improved over the native HBase Client API; the difficulty of using the HBase database is reduced for the large community of SQL users, and HBase can be used and accessed very conveniently through the database field's universal mode of operation, SQL. In addition, the SQL interpreter for HBase and the optimization method improve and extend the existing HBase database, implementing complete SQL semantics at the bottom layer of HBase (that is, the various operators that realize relational operations inside HBase), and the SQL optimization method provided by the invention can further greatly improve the execution performance of SQL on HBase.
Drawings
FIG. 1 is a diagram illustrating an equivalent description of SQL statements and relational algebra.
FIG. 2 is a schematic diagram of the structure and flow of an SQL statement and a relational database.
FIG. 3 is a flow and structure diagram of source code processing.
FIG. 4 is a diagram of the application structure of the SQL interpreter for implementing the HBase of the present invention.
FIG. 5 is a schematic diagram of the overall construction and flow of an SQL interpreter for implementing an HBase according to the invention.
Fig. 6 is a schematic diagram of the input/output structure of the lexical analyzer in fig. 5.
Fig. 7 is a diagram illustrating DFA state transitions of integers identified by the lexical analyzer of fig. 5 or 6.
Fig. 8 is a diagram illustrating a syntax analysis tree structure formed by the parser of fig. 5 according to the lexical units obtained by the lexical parser.
Fig. 9 is a schematic diagram of an input-output structure of the parser of fig. 5 or 6.
Fig. 10 is a schematic diagram of an input/output structure of the semantic analyzer in fig. 5 or fig. 9.
Fig. 11 is a diagram of an AST (Abstract Syntax Tree) structure and its representation.
Fig. 12 is a diagram illustrating an AST abstract syntax tree structure formed after the syntax analysis tree of fig. 8 passes through a semantic analyzer.
Fig. 13 is a schematic diagram of identifier change in the symbol table management module of each stage.
Fig. 14 is a schematic diagram of an execution plan node structure formed from the AST abstract syntax tree in fig. 12.
FIG. 15 is a process flow diagram of the semantic analysis phase.
FIG. 16 is a schematic diagram of an execution plan after logic optimization.
FIG. 17 is a schematic diagram of an optimizer performing physical optimization on an execution plan.
FIG. 18 is a schematic diagram of directional optimization of operator mapping between logical operators in an execution plan and physical operators in an HBase operator implementation module.
FIG. 19 is a schematic diagram of an execution plan optimized by an optimizer.
Fig. 20 is a schematic diagram of an input/output structure of the optimizer in fig. 5.
FIG. 21 is a schematic flow chart of the operation of the execution plan in the executor.
FIG. 22 is a diagram of an implementation architecture with resource pools.
Fig. 23 is a schematic diagram of an input/output structure of the actuator.
FIG. 24 is an example of a raw data table in HBase, Orders (product order Table, stored in HBase as key-value pair + columnar).
FIG. 25 is a schematic diagram of a data retrieval process for the raw data table of FIG. 24.
Fig. 26 is a diagram showing the result of data retrieval on the original data table in fig. 24.
Detailed Description
The invention is described below with reference to the accompanying drawings (fig. 1-26).
FIG. 1 is a diagram illustrating the equivalent description of an SQL statement in relational algebra. The SQL statement in fig. 1 is: SELECT A, B, C FROM T WHERE A > 0. Its semantics are to retrieve the data satisfying the condition A > 0 from the data table T and return the three columns A, B, C; the semantics are described with relational algebra, assuming that table T has column A, column B, column C and column D, four columns in total. In relational algebra, the projection is A, B, C; the selection is A > 0; and the relation is T[A, B, C, D]. T is the name of the data table.
FIG. 2 is a schematic diagram of the structure and flow of an SQL statement in a relational database; the flow is the internal execution flow in the relational database. The internal execution flow in fig. 2 comprises: step 1, generate the execution plan; step 2, traverse each step of the execution plan; step 3, return the result after the traversal is complete. Each step in step 2 comprises matching the operator and then invoking the operator.
FIG. 3 is a flow and structure diagram of source code processing. The flow includes, on the left, the processing flow of a compiled language involving a compiler and, on the right, the processing flows of interpreted languages involving an interpreter. Flow a is the compiled-language flow (e.g. the C language): source code - compiler - executable program, executed directly. Flow b is the first interpreted-language flow (e.g. the Java language): source code - interpreter - intermediate code - virtual machine, interpreted and executed. Flow c is the second interpreted-language flow (e.g. the Python language): source code - interpreter, interpreted and executed directly.
FIG. 4 is a diagram of the application structure of the SQL interpreter for HBase of the present invention. In fig. 4 the SQL interpreter is connected to HBase: the application program operates HBase through the SQL interpreter using SQL statements, and HBase returns the execution result to the application program through the SQL interpreter. The SQL interpreter includes the SQL optimization method.
FIG. 5 is a schematic diagram of the overall structure and flow of the SQL interpreter for HBase of the invention. The SQL interpreter structure comprises, connected in sequence for the SQL statement, a lexical analyzer, a syntax analyzer, a semantic analyzer, an optimizer and an executor; the lexical analyzer, syntax analyzer, semantic analyzer, optimizer and executor are each connected to a symbol table management module and each connected to an error processing module, and the executor calls operators from HBase and executes them to obtain the result. The execution result of the SQL statement is returned from HBase through the executor, and the semantic analyzer looks up table and column information from the HBase database.
FIG. 6 is a schematic diagram of the input/output structure of the lexical analyzer in fig. 5. After the SQL statement in fig. 6 is input to the lexical analyzer, the resulting sequence of lexical units is output to the syntax analyzer, error information is output to the error processing module, and identifiers are output to the symbol table management module.
For example, the sequence of lexical units obtained from the SQL statement SELECT ID, Buyer FROM Orders WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100) is: {keyword: SELECT}, {identifier: ID}, {delimiter: ,}, {identifier: Buyer}, {keyword: FROM}, {identifier: Orders}, {keyword: WHERE}, {operator: (}, {identifier: Date}, {keyword: BETWEEN}, {literal: date: 2019-04-01}, {keyword: AND}, {literal: date: 2019-06-30}, {operator: )}, {keyword: AND}, {operator: (}, {identifier: Amount}, {operator: >}, {literal: integer: 100}, {operator: )}.
FIG. 7 is a diagram illustrating the DFA state transitions for the integers recognized by the lexical analyzer of fig. 5 or 6. DFA (Deterministic Finite Automaton) means a deterministic finite automaton. Fig. 7 contains state 1 (the initial state) and states 2 and 3 (final state); the asterisk at the upper right of state 3 indicates that the non-digit character is not included. A digit accepted in state 1 leads to state 2, forming a digit sequence that begins with a digit and ends when a non-digit character is encountered.
FIG. 8 is a diagram illustrating the syntax analysis tree formed by the syntax analyzer of fig. 5 from the lexical units produced by the lexical analyzer. In fig. 8, the query statement ("query, from the Orders table, the product orders of the second quarter of 2019 with an amount greater than 100, and return the order number and the Buyer number"; SQL statement: SELECT ID, Buyer FROM Orders WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100)) is analyzed downward into the SELECT clause, the FROM clause and the WHERE clause; the SELECT clause is analyzed into {keyword: SELECT} and the result list; the FROM clause into {keyword: FROM} and the data source; the WHERE clause into {keyword: WHERE} and an expression; the result list is analyzed downward into {identifier: ID}, {delimiter: ,} and {identifier: Buyer}; the data source into {identifier: Orders}; the expression downward into an AND expression; the AND expression into a first parenthesized expression, {keyword: AND} and a second parenthesized expression; the first parenthesized expression into a BETWEEN expression (omitted below); the second parenthesized expression downward into {operator: (}, a greater-than expression and {operator: )}; and the greater-than expression downward into {identifier: Amount}, {operator: >} and {literal: integer: 100}.
FIG. 9 is a schematic diagram of the input/output structure of the syntax analyzer of fig. 5 or 6. After the sequence of lexical units in fig. 9 is input to the syntax analyzer, the syntax analysis tree formed is output to the semantic analyzer, error information is output to the error processing module, and the updated identifiers are output to the symbol table management module.
FIG. 10 is a schematic diagram of the input/output structure of the semantic analyzer in fig. 5 or fig. 9. After the syntax analysis tree in fig. 10 is input to the semantic analyzer, the semantic analyzer looks up the tables and columns from HBase, obtains the execution plan and outputs it to the optimizer; error information is output to the error processing module, and the updated identifiers are output to the symbol table management module.
FIG. 11 is a diagram of an AST structure and its representation; AST (Abstract Syntax Tree) means abstract syntax tree. In fig. 11, taking a classical Chinese sentence meaning "Chuzhou is surrounded by mountains" and its modern Chinese equivalent as examples, a classical-Chinese parse tree and a modern-Chinese parse tree are formed in parallel. Through their respective semantic analyses, the two parse trees yield the same AST node "#statement of fact": subject "Chuzhou", predicate "surround", object "mountain". The classical-Chinese parse tree has, from top to bottom, classical Chinese and a judgement sentence in the "all ... also" pattern; under that pattern are, respectively, the subject, the judgement word, the object and the auxiliary word; under the subject are a verb ("surround") and a noun ("Chu"); under the judgement word is "all"; under the object is a noun ("mountain"); under the auxiliary word is "also". The modern-Chinese parse tree has, from top to bottom, modern Chinese and a declarative sentence; under the sentence are the subject and the predicate; under the subject is a noun ("Chuzhou"); under the predicate are further constituents (omitted).
FIG. 12 is a diagram illustrating the AST formed after the syntax analysis tree of fig. 8 passes through the semantic analyzer. At the top of fig. 12 is the tree node "#query" (for the SQL statement SELECT ID, Buyer FROM Orders WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100)); under "#query" are "#result list", "#data source" and "#filter condition"; under "#result list" are "{column: ID, owning table: Orders}" and "{column: Buyer, owning table: Orders}"; under "#data source" is "{table: Orders}"; under "#filter condition" are, in order, "#expression" and "{logical AND}"; under "{logical AND}" are "(omitted)" and "{greater-than comparison}"; under "{greater-than comparison}" are "{column: Amount, owning table: Orders}" and "{literal: integer: 100}".
FIG. 13 is a schematic diagram of the identifier changes in the symbol table management module at each stage. In fig. 13, the initial state corresponds to "(empty symbol table)"; the lexical analysis stage corresponds to "{identifier: ID}, {identifier: Orders}, ..."; the syntax analysis stage corresponds to "{column: identifier: ID}, {source: identifier: Orders}, ..."; the semantic analysis stage corresponds to "{column: ID, owning table: Orders}, {data table: Orders}, ...".
FIG. 14 is a schematic diagram of the execution plan node structure formed from the AST of fig. 12. In fig. 14, from top to bottom, there are the projection node (projection: ID, Buyer), the selection node (selection: (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > CAST(100, DOUBLE))) and the relation node (relation: Orders[ID, Amount, Date, Status, Buyer]). The projection node comes from the "#result list" under "#query", the selection node from the "#filter condition" under "#query", and the relation node from the "#data source" under "#query".
FIG. 15 is a process flow diagram of the semantic analysis stage. Fig. 15 comprises: 1, construct the AST; 2, perform static type checking; 3, resolve the meanings of the identifiers; 4, generate the execution plan.
FIG. 16 is a schematic diagram of the execution plan after logical optimization. In fig. 16, compared with fig. 14, the selection node has changed while neither the projection node nor the relation node has changed. After algorithms such as constraint derivation, constant folding and BETWEEN expression optimization are applied, "selection: (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > CAST(100, DOUBLE))" becomes "selection: Amount IS NOT NULL AND Amount > 100.00 AND Date IS NOT NULL AND Date >= 2019-04-01 AND Date <= 2019-06-30".
FIG. 17 is a schematic diagram of the optimizer performing physical optimization on the execution plan. The physical optimization module in fig. 17 optimizes the expression to be optimized in the execution plan into a new expression and then feeds it to the JIT just-in-time compiler to be compiled into machine language.
FIG. 18 is a schematic diagram of the directional optimization that maps logical operators in the execution plan onto physical operators in the HBase operator implementation module. FIG. 19 is a schematic diagram of the execution plan after optimization by the optimizer. In fig. 19 the projection, selection and relation operators are mapped onto one PFR fused operator (PFR: target table -> Orders; result columns -> ID, Buyer; filter -> (expression abbreviated)). The expression is Amount IS NOT NULL AND Amount > 100.00 AND Date IS NOT NULL AND Date >= 2019-04-01 AND Date <= 2019-06-30. PFR is the English acronym for Projection (Project), selection (Filter) and Relation. When executed, the PFR fused operator obtains data by calling the basic HBase operation Scan.
FIG. 20 is a schematic diagram of the input/output structure of the optimizer in fig. 5. In fig. 20, after the logical operators of the execution plan are input to the optimizer, the optimizer obtains the identifiers from the symbol table management module, optimizes the input logical execution plan into a physical execution plan by combining it with the operator mapping of HBase, outputs the physical execution plan to the executor, and outputs error information to the error processing module.
FIG. 21 is a schematic flow chart of the running of the execution plan in the executor. In fig. 21, a is the root node, b and c are child nodes of a, and d is a child node of b; the solid arrows indicate the call flow and the dotted arrows indicate the result-return flow.
FIG. 22 is a diagram of the executor architecture with a resource pool. The executor in fig. 22 applies to the computer in advance for resources to establish a resource pool for execution plans; the pool contains the resources allocated to execution plan A and to execution plan B, the idle resources cannot satisfy execution plan C, and execution plan C can only wait for other execution plans to release resources. Resources here means CPU, memory and/or disk, etc.
FIG. 23 is a schematic diagram of the input/output structure of the executor. In fig. 23 the physical execution plan is input to the executor, the executor calls the operators in the operator implementation module of HBase, the operator implementation module returns the execution result to the executor after completing the reads and writes on the HBase data table through the basic operation module, the extended operation module and/or the transformation operation module, and the executor outputs the final execution result of the SQL statement.
FIG. 24 is an example of a raw data table in HBase, Orders (a product order table, stored in HBase as key-value pairs plus columnar storage). The table Orders comprises an ID (order number) column, an Amount (order amount) column, a Date (creation date) column, a Status (order status) column and a Buyer (buyer number) column.
FIG. 25 is a schematic diagram of the data retrieval process over the raw data table of fig. 24. The data retrieval requirement in fig. 25 is "query, from the Orders table, the orders of the second quarter of 2019 with an amount greater than 100, and return the order number and the buyer number"; the SQL statement is SELECT ID, Buyer FROM Orders WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100). After passing through the HBase interpreter of the present invention the SQL statement is executed in HBase, and the data marked with diagonal lines in the table are the data filtered out during execution.
FIG. 26 is a diagram of the result of the data retrieval over the raw data table of fig. 24. The SQL statement (SELECT ID, Buyer FROM Orders WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount > 100)) is executed in HBase after passing through the HBase interpreter of the invention, and the executor outputs the query result of the SQL statement: order number ID 0001 with its Buyer number B0001 is returned, and order number ID 0003 with its Buyer number B0002 is returned.
Referring to figs. 1 to 26, an SQL interpreter for HBase comprises an SQL statement input interface (e.g. the arrowed line under the SQL statement in fig. 4) for receiving SQL statements from an application program, an operator call interface (e.g. the interface line between the SQL interpreter and HBase in fig. 4) connected to HBase, and a transmission interface (e.g. the arrowed line under the execution result in fig. 4) for transmitting the execution result of the SQL statement in HBase between HBase and the application program. The SQL statement input interface is connected to a lexical analyzer, the lexical analyzer to a syntax analyzer, the syntax analyzer to a semantic analyzer, and the semantic analyzer to an executor; the executor is connected to the operator call interface of HBase, and the semantic analyzer has an information interface connected to HBase (see fig. 5). The lexical analyzer, the syntax analyzer, the semantic analyzer and the executor are each connected to a symbol table management module, which stores the lexical units whose word type is identifier (see fig. 5). The lexical analyzer, the syntax analyzer, the semantic analyzer and the executor are each connected to an error processing module, which handles errors according to their type; the handling strategies include attempting automatic correction, terminating SQL parsing, or ignoring the error (see fig. 5). The semantic analyzer is connected to the executor through an optimizer (see fig. 5). The optimizer is connected to the symbol table management module and to the error processing module (see fig. 5).
The lexical analyzer parses the input SQL statement into a sequence of lexical units using a deterministic finite automaton (DFA) algorithm; the sequence of lexical units is output to the syntax analyzer, error information is output to the error processing module, and identifiers are output to the symbol table management module (see fig. 6; fig. 7 helps in understanding the lexical rules). The deterministic finite automaton (DFA) algorithm is defined by a five-tuple M = (K, Σ, f, S, Z), where K is a finite set whose elements are called states; Σ is a finite alphabet whose elements are called input symbols; f is the transition function, a mapping K × Σ → K; S ∈ K is the unique initial state; and Z ⊆ K is the set of final states. f is either a total transition function or a partial transition function; when f is partial, f(ki, a) = kj (ki ∈ K, kj ∈ K) means that when the current state is ki and the input symbol is a, the automaton moves to the next state kj, i.e. kj is a successor state of ki. As shown in fig. 8, each lexical unit in the sequence comprises the word itself and its word type, the word type classifying a word as one of: keyword, identifier, delimiter, operator, literal; the sequence of lexical units contains no spaces, line breaks or comments. A keyword is matched against one of the following words: SELECT, FROM, WHERE, CREATE, DELETE, INSERT, UPDATE; an identifier is matched as a character sequence made up of letters, digits and underscores that is not a keyword; the literals include integer literals, date literals and/or string literals.
As shown in fig. 9, the syntax analyzer receives the sequence of lexical units from the lexical analyzer, assembles it into a syntax analysis tree according to the configured grammar rules, outputs the syntax analysis tree to the semantic analyzer, outputs error information to the error processing module, and updates the identifiers in the symbol table management module according to the result of syntax analysis. The grammar rules are expressed either as statement composition rules or as the grammar quadruple G = (N, E, P, B) used in syntax analysis of computer programming languages, where N is the set of non-terminal symbols; E is the set of terminal symbols, and E and N have no intersection; P is the set of productions, each of the form (E ∪ N)* → (E ∪ N)*, where the string on the left-hand side of a production must contain at least one non-terminal symbol; and B is the start symbol, B ∈ N. In the structure of the syntax analysis tree, the SQL statement is the root node at the top, the leaf nodes in the bottom layer are lexical units, and the branch nodes lie between the root node and the leaf nodes; the root node corresponds to the start symbol B, the leaf nodes belong to the terminal symbol set E, and the branch nodes belong to the non-terminal symbol set N. In analyzing the sequence of lexical units against the configured grammar rules, the syntax analyzer uses an adaptive LL(k) algorithm: the first L of LL means the sequence of lexical units is read from left to right, the second L means the leftmost derivation is used during analysis, and k ≥ 1 means that up to k lexical units of lookahead are matched against the grammar rules. The adaptive LL(k) algorithm can analyze the grammar dynamically during syntax analysis and automatically rewrites left recursion in the grammar into an equivalent non-left-recursive form.
As shown in fig. 10, the semantic analyzer receives the syntax analysis tree from the syntax analyzer, looks up table and column information from the HBase database through the information interface, and generates an execution plan (see fig. 15) on the basis of three tasks: constructing the AST, performing static type checking, and updating the identifiers stored in the symbol table management module (the identifier changes in the symbol table management module can be understood with reference to fig. 13). The execution plan is output to the executor, either directly or after being optimized by the optimizer; error information is output to the error processing module, and the identifiers in the symbol table management module are updated according to the result of semantic analysis. As shown in fig. 12, and as fig. 11 helps to illustrate, the AST construction process transforms the structure of the syntax analysis tree according to the principles of compactness, ease of use and expressiveness, removing the lexical units not directly associated with the semantics, to form an association structure of semantic tree nodes with table names, column names, operation logic and literals. Static type checking includes checking whether the expressions in the SQL statement are logically well-formed; if an ill-formed expression exists, the semantic analyzer passes the specific error information to the error processing module, which handles the error further. Referring to fig. 14, generating the execution plan means converting the semantic tree nodes under the SQL statement root node of the AST into corresponding execution plan nodes according to the relational-algebra meaning they represent, forming a tree structure composed of operators, where the operators are basic operations of relational algebra. As shown in fig. 14, when the SQL statement is an SQL query statement, the semantic tree nodes are the result list, the data source and the filter condition: the result list is converted into the projection operator at the top, the filter condition into the selection operator in the middle, and the data source into the relation operator at the bottom; the query is executed by performing the relation operation first, then the selection operation, and finally the projection operation.
The optimization algorithm of the optimizer performs at least one traversal of the operator tree that constitutes the execution plan and adjusts the attributes of tree nodes during the traversal; attribute adjustment includes modifying a node's expression or reshaping the tree, where reshaping includes adjusting the parent-child relationships between nodes, thereby optimizing the execution plan. The traversal is a recursive operation and is either a pre-order traversal or a post-order traversal: pre-order traversal means visiting a node first and then its child nodes, until all tree nodes have been visited; post-order traversal means visiting a node's child nodes first and then the node itself, until all tree nodes have been visited. The optimization algorithm of the optimizer comprises a logic optimization algorithm, a physical optimization algorithm and/or a directional optimization algorithm. Referring to fig. 16, the logic optimization algorithm is a combination of one or more of the following algorithms: join reordering, predicate push-down, column pruning, projection merging, selection merging, constraint derivation, constant propagation, IN expression optimization, constant folding, LIKE expression optimization, BETWEEN expression optimization and logical expression optimization. As shown in fig. 17, the physical optimization algorithm dynamically compiles the computational logic in the execution plan into machine language through a JIT just-in-time compiler, improving the execution efficiency of the SQL statement in the HBase database.
Referring to figs. 18 and 19, the directional optimization algorithm converts the logical operators in the execution plan into the physical operators in HBase according to the mapping relation. The physical operators are provided in the HBase operator implementation module and comprise one or more of the following in combination: a projection-selection-relation operator, an aggregation-selection-relation operator, a projection operator, a selection operator, a relation operator, a join operator, an aggregation operator, a sort operator, a pagination operator, a delete operator, an update operator, an insert operator and an instruction operator. As shown in fig. 20, the optimizer receives the execution plan from the semantic analyzer, looks up in the HBase database the physical operators onto which the logical operators of the execution plan map, and then forms a physical execution plan, which is output to the executor; the optimizer outputs error information to the error processing module and obtains identifiers from the symbol table management module.
Referring to fig. 21, 22 and 23, the executor receives the execution plan and drives it to run through the root node of the plan: the root node in turn calls its lower-layer child nodes, and the calls proceed layer by layer to complete the running of the entire execution plan. The executor adopts a resource pool technology: when the executor is started, a portion of additional idle resources is applied for from the computer in advance as the executor's resource pool, and when the executor processes an execution plan the resources are no longer allocated by the computer but are allocated and recycled by the executor itself. The executor receives the physical execution plan from the optimizer and calls the operator implementation module in the HBase database according to the physical operator mapping in the physical execution plan; the operator implementation module runs the mapped physical operators and, after completing the data reading and writing in the HBase data tables through the basic operation module, the extended operation module and/or the transformation operation module, returns the execution result to the executor, and the executor outputs the final execution result of the SQL statement.
An optimization method for use in the SQL interpreter of the HBase comprises a logical optimization algorithm, a physical optimization algorithm and/or a directional optimization algorithm. The logical optimization algorithm is a combination of one or more of the following algorithms: join reordering, predicate push-down, column pruning, projection merging, selection merging, constraint derivation, constant propagation, In expression optimization, constant folding, Like expression optimization, Between expression optimization and logical expression optimization. The physical optimization algorithm dynamically compiles the computational logic in the execution plan into machine language through a JIT just-in-time compiler, thereby improving the execution efficiency of SQL statements in the HBase database. The directional optimization algorithm converts the logical operators in the execution plan into the mapped physical operators in the HBase according to the mapping relationship; the physical operators are arranged in the HBase operator implementation module and comprise one or more of the following combinations: projection-selection-relational operator, aggregation-selection-relational operator, projection operator, selection operator, relational operator, join operator, aggregation operator, sort operator, pagination operator, deletion operator, update operator, insertion operator, instruction operator.
The SQL interpreter of the HBase, the SQL optimization method and the modification and extension of the HBase can provide SQL access support for the HBase, i.e. the HBase can be operated through SQL, and the SQL execution efficiency can be greatly improved compared with the native HBase Client API. By modifying and extending the HBase, the invention implements complete SQL semantics (i.e. the various operators that realize relational operations) at the bottom layer of the HBase, and the SQL optimization method provided by the invention can greatly improve the execution performance of SQL on the HBase, as shown in FIG. 4.
The overall structure of the SQL interpreter of HBase is shown in FIG. 5; the SQL interpreter is composed of the following modules: a lexical analyzer; a parser; a semantic analyzer; an optimizer; an executor; a symbol table management module; and an error processing module.
The following describes the internal processing flow of the SQL interpreter of the HBase and the relationship between the modules in conjunction with specific examples (see fig. 24 to fig. 26 for data tables), assuming the following data tables:
Product order table (Orders)
(Sample data of the Orders table is shown as an image in the original document; the columns referenced below are ID, Buyer, Date and Amount.)
To query the product order table (table name Orders) for all orders in the second quarter of 2019 with an amount greater than 100 dollars, returning the order number (ID) and the buyer number (Buyer), the corresponding SQL statement is as follows:
SELECT ID, Buyer -- Note: return only the ID and Buyer column data
FROM Orders
WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount>100)
First, lexical analysis stage
The essence of source code is a piece of text. The SQL statement is first processed by the lexical analyzer; lexical analysis parses the source code into a series of lexical units (Tokens) according to the given lexical rules and ignores lexical units with no actual meaning (for example, a comment has no effect at the SQL execution level and only serves as supplementary description at the source-code level, i.e. it is human-oriented and meaningless to the computer). The input and output of the lexical analyzer are shown in fig. 6.
The algorithm adopted by the lexical analyzer is the DFA algorithm; DFA is the abbreviation of Deterministic Finite Automaton. A deterministic finite automaton M is a five-tuple: M = (K, Σ, f, S, Z), where: 1) K is a finite set, each element of which is called a state; 2) Σ is a finite alphabet, each element of which is called an input symbol, so Σ is also called the input symbol alphabet; 3) f is a transition function, a mapping on K × Σ → K (possibly a partial function), i.e. if f(ki, a) = kj (ki ∈ K, kj ∈ K), then when the current state is ki and the input symbol is a, the automaton transitions to the next state kj, and kj is called a successor state of ki; 4) S ∈ K is the unique initial state; 5) Z ⊂ K is the set of final states, also referred to as acceptable states or end states.
A predetermined lexical rule is equivalent to a DFA. For example, the lexical rule for integers can be represented by the DFA state transition diagram shown in fig. 7: state 1 is the initial state and state 3 is the final state; state 1 moves to state 2 on a digit, state 2 stays in state 2 on further digits, and state 2 moves to state 3 on a non-digit character (the asterisk on state 3 indicates that this character is not included in the recognized word). In other words, the lexical rule for integers is: a sequence of digits beginning with a digit and continuing until a non-digit character is encountered.
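As a purely illustrative sketch (the class and method names below are invented here and are not the patent's implementation), the integer lexical rule just described can be written as a small DFA in Java, where reaching state 3 means stopping without consuming the non-digit character:

public final class IntegerDfa {
    /** Returns the length of the integer word starting at pos, or 0 if there is none. */
    public static int match(String input, int pos) {
        int state = 1;                                   // state 1: initial state
        int i = pos;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == 1) {
                if (Character.isDigit(c)) { state = 2; i++; } else { return 0; }
            } else {                                     // state 2: inside the number
                if (Character.isDigit(c)) { i++; } else { break; }  // to state 3; c is not consumed
            }
        }
        return (state == 2) ? i - pos : 0;
    }

    public static void main(String[] args) {
        System.out.println(match("100)", 0));            // 3: the integer word "100"
        System.out.println(match("Amount", 0));          // 0: not an integer
    }
}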
The lexical analyzer reads the characters of the SQL statement sequentially from left to right and recognizes the various lexical units according to the lexical rules of SQL (i.e. words are recognized by the DFA algorithm). The types of lexical units include: keywords (e.g., SELECT, FROM, WHERE, AND), identifiers (e.g., ID, Buyer, Orders), operators (e.g., +, -, >, <), word sizes, i.e. literals (e.g., the integer 100, the date 2019-04-01, the string 'abc'), and so on. Each type of lexical unit has its corresponding lexical rule; for example, the lexical rule for keywords is: a character sequence that is one of the following is a keyword: SELECT, FROM, WHERE, CREATE, DELETE, INSERT, UPDATE, etc.; the lexical rule for identifiers is: a character sequence that begins with a letter, consists of letters, digits and underscores, and is not a keyword.
The lexical analyzer reads the characters of the SQL statement in order. It first reads the character S; S is a letter, and according to the lexical rules a word beginning with a letter may be either a keyword or an identifier, so the lexical unit cannot yet be determined and the next character is read until a lexical rule can be matched. When the space after SELECT is encountered, SELECT matches the keyword rule and the lexical analyzer returns the first lexical unit { keyword: SELECT }. Reading continues: the space is a meaningless lexical unit and is ignored; the character I is a letter, so reading continues until the comma after ID is encountered, ID matches the identifier rule, and the lexical analyzer returns the second lexical unit { identifier: ID }. The lexical analysis stage is complete once the lexical analyzer has read the whole SQL statement, and its output is a sequence of lexical units. The example SQL statement finally produces the following lexical units: { keyword: SELECT }, { identifier: ID }, { delimiter: , }, { identifier: Buyer }, { keyword: FROM }, { identifier: Orders }, { keyword: WHERE }, { operator: ( }, { identifier: Date }, { keyword: BETWEEN }, { word size: date: 2019-04-01 }, { keyword: AND }, { word size: date: 2019-06-30 }, { operator: ) }, { keyword: AND }, { operator: ( }, { identifier: Amount }, { operator: > }, { word size: integer: 100 }, { operator: ) }.
As shown above, the lexical units contain all the essential elements of the SQL statement while the irrelevant parts (spaces, line breaks, comments) are ignored, and each lexical unit contains the word type in addition to the word itself. Lexical units whose type is identifier are stored in the symbol table management module for further processing in later stages. If an error occurs in the lexical analysis stage, for example an illegal character appears in the SQL statement or some word matches no lexical rule, specific error information (e.g., { illegal character: line 2: column 3: "@" }) is fed back to the error processing module, which handles the error according to its type; the handling methods include attempting automatic correction, terminating the SQL parsing, ignoring the error, and so on.
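To make the token walkthrough above concrete, the following minimal Java sketch scans the example statement and classifies each word as a keyword, identifier, delimiter, operator or word size; the keyword list, token types and class names are simplified assumptions made for this sketch, not the patent's actual lexical rules:

import java.util.*;

public final class MiniSqlLexer {
    enum Type { KEYWORD, IDENTIFIER, DELIMITER, OPERATOR, INTEGER, DATE }
    record Token(Type type, String text) { }

    private static final Set<String> KEYWORDS =
        Set.of("SELECT", "FROM", "WHERE", "AND", "BETWEEN");

    public static List<Token> lex(String sql) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }            // meaningless units are skipped
            if (Character.isLetter(c)) {                                 // keyword or identifier
                int j = i;
                while (j < sql.length() && (Character.isLetterOrDigit(sql.charAt(j)) || sql.charAt(j) == '_')) j++;
                String word = sql.substring(i, j);
                out.add(new Token(KEYWORDS.contains(word.toUpperCase()) ? Type.KEYWORD : Type.IDENTIFIER, word));
                i = j;
            } else if (Character.isDigit(c)) {                           // integer or date word size
                int j = i;
                while (j < sql.length() && (Character.isDigit(sql.charAt(j)) || sql.charAt(j) == '-')) j++;
                String num = sql.substring(i, j);
                out.add(new Token(num.contains("-") ? Type.DATE : Type.INTEGER, num));
                i = j;
            } else if (c == ',') { out.add(new Token(Type.DELIMITER, ",")); i++; }
            else { out.add(new Token(Type.OPERATOR, String.valueOf(c))); i++; }   // ( ) > etc.
        }
        return out;
    }

    public static void main(String[] args) {
        lex("SELECT ID, Buyer FROM Orders WHERE (Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount>100)")
            .forEach(System.out::println);
    }
}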
Second, syntax analysis stage
After being processed by the lexical analyzer, the SQL statement becomes a series of lexical units. In the syntax analysis stage these lexical units are the input of the syntax analyzer, whose task is to assemble the input lexical units into legal statements (Statements) according to the grammar rules; a statement is a tree structure formed from lexical units and is therefore also called a parse tree. The parse tree formed from the lexical units of the example SQL statement is shown in fig. 8. Viewed from top to bottom, the parse tree is split and refined layer by layer, and the root node (the query statement in the figure) represents the complete SQL statement; viewed from bottom to top, it is summarized layer by layer, the leaf nodes (the tree nodes of the last layer) are the individual lexical units, and the leaf nodes are summarized upward to finally form the complete SQL statement. The parse tree contains every detail of the SQL statement, i.e. it contains all the lexical units and their relationships.
The structure (i.e., shape) of the parse tree embodies the grammar rules, i.e. the syntax analyzer assembles a series of lexical units into a tree structure according to the established grammar rules. The assembly strategy is called a grammar; a grammar refers to the composition rules of sentences. For example, if a declarative sentence should consist of a subject, a predicate and an object in that order, then whether a sentence is a declarative sentence can be analyzed against this composition rule. In the field of syntax analysis for computer programming languages, a grammar G is a quadruple (N, E, P, B) composed of the following elements: a set N of non-terminal symbols; a set E of terminal symbols, which has no intersection with N; a set P of production rules of the form (E ∪ N)* → (E ∪ N)*, where the string on the left of a production must include at least one non-terminal symbol; and a start symbol B, with B ∈ N.
Taking the grammar of the query statement as an example, as shown in fig. 8, the query statement is composed of a SELECT clause, a FROM clause and a WHERE clause, and the SELECT clause is composed of the SELECT keyword and a result list; the rules derived in this way form the set of production rules P. The leaf nodes in the tree structure (i.e. the lexical units at the last layer) belong to the set E of terminal symbols, the other nodes apart from the leaf nodes (e.g. the SELECT clause and the FROM clause) belong to the set N of non-terminal symbols, and the root node (i.e. the query statement) is the start symbol B.
For the process in which the syntax analyzer analyzes the lexical unit sequence according to the grammar rules, a new algorithm named adaptive LL(k) was created; it is an improvement and enhancement of the LL(1) and LL(k) algorithms. The first L in LL indicates that the lexical unit sequence is analyzed from left to right, and the second L indicates that leftmost derivation is used during the analysis; both L's are the first letter of the English word Left. Compared with the LL(1) algorithm, the adaptive LL(k) algorithm can analyze the grammar dynamically during parsing rather than in the static manner of LL(1), and it can automatically rewrite left recursion in the grammar into an equivalent non-left-recursive form. The 1 and k in the parentheses of LL(1) and LL(k) indicate how many lexical units are matched ahead during grammar-rule matching; LL(k) can look ahead k (k ≥ 1) lexical units, and its parsing capability is stronger than that of LL(1).
For example, the parsing process is similar to a game of guessing a poem from its opening characters: LL(1) means guessing which poem it is as soon as one character is seen, which easily produces a wrong answer that must then be corrected according to the characters that appear later; LL(k) means not answering until several characters form a complete line of the poem, which greatly improves accuracy but slows down the answer; adaptive LL(k) means that in some cases the poem can be identified accurately without seeing a complete line, so the answer can be given earlier.
During the execution of the syntax analyzer, the powerful analysis capability of the adaptive LL(k) algorithm greatly improves the accuracy and performance of syntax analysis. The input and output of the parser, and its relationship to other modules, are shown in FIG. 9. The input of the syntax analyzer is the lexical unit sequence generated by the lexical analyzer; the syntax analysis process generates a parse tree according to the grammar rules and, based on the result of the syntax analysis, updates the identifiers in the symbol table management module, i.e. the meaning of each identifier is further clarified. For example, Orders in the SQL statement can only be determined to be an identifier in the lexical analysis stage, while the syntax analysis stage can further determine that Orders is a data table. If an error occurs in the syntax analysis stage (for example, the sequence of lexical units cannot form a legal statement, or the formed statement is ambiguous), the syntax analyzer feeds the error information back to the error processing module, and the error processing module performs the corresponding handling according to the error type (for example, attempting automatic correction, terminating the SQL parsing, ignoring the error, and so on).
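The invention's adaptive LL(k) parser is far more general, but the basic idea of the syntax analysis stage can be sketched, under strong simplifying assumptions, as a hand-written recursive-descent (LL(1)) parser for a toy "SELECT columns FROM table" grammar; all names below are illustrative only:

import java.util.*;

public final class MiniSelectParser {
    private final List<String> tokens;
    private int pos = 0;
    public MiniSelectParser(List<String> tokens) { this.tokens = tokens; }

    // query      := SELECT resultList FROM identifier
    // resultList := identifier (',' identifier)*
    public Map<String, Object> parseQuery() {
        expect("SELECT");
        List<String> resultList = new ArrayList<>();
        resultList.add(next());                          // first column
        while (peek().equals(",")) { next(); resultList.add(next()); }
        expect("FROM");
        String dataSource = next();                      // table name
        // The parse tree is represented here as nested maps instead of real tree nodes.
        return Map.of("query", Map.of("resultList", resultList, "dataSource", dataSource));
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }
    private String next() { return tokens.get(pos++); }
    private void expect(String kw) {
        if (!next().equalsIgnoreCase(kw))
            throw new IllegalStateException("syntax error: expected " + kw);  // would go to the error processing module
    }

    public static void main(String[] args) {
        System.out.println(new MiniSelectParser(
            List.of("SELECT", "ID", ",", "Buyer", "FROM", "Orders")).parseQuery());
    }
}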
Third, semantic analysis stage
An SQL statement is text, which is not a form suitable for a computer to analyze. After lexical analysis and syntax analysis, the SQL statement is converted into an equivalent parse tree, which is a structured representation of the SQL, i.e. a form suitable for computer analysis. The semantic analyzer is used to understand the meaning (semantics) represented by the parse tree and, once that meaning is understood, to form the corresponding execution plan. The execution plan represents the real intention of the SQL and the specific execution steps, and the subsequent stages perform the corresponding operations according to the execution plan. The input and output of the semantic analyzer and its relationship with the other modules are shown in fig. 10.
The 1st task of the semantic analysis stage is to construct an AST (Abstract Syntax Tree). The AST is also a tree structure; it is called an abstract syntax tree because, unlike the parse tree, it does not represent every detail that appears in the real syntax. For example, nested brackets are implied by the structure of the tree rather than represented as tree nodes, whereas the parse tree contains all the syntax details. The core logic of AST construction is to modify the structure of the parse tree following this construction scheme: 1) compact: contains no useless nodes; 2) easy to use: easy to traverse; 3) ideographic: highlights operands, operators and their relationships, and is not tied to the grammar. The first two points exist because, in order to recognize the patterns in the tree, the subsequent steps generally need to traverse the AST multiple times to analyze the true semantics and to construct other data structures on the basis of the AST, so the AST must be as simple as possible; for an AST the most important thing is its shape, and everything else can be reduced. The 3rd point is intended to prevent the AST structure (and the subsequent stages) from being affected by grammar changes: the grammar of the SQL language is almost certain to change as the database field develops, and a grammar change must not stop the other parts from running normally. The semantic analyzer therefore fully considers the development pattern of the database field and has good compatibility and extensibility, which broadens its application range and extends its life cycle.
To illustrate the significance of the three-point AST construction scheme, take natural language as an example: ancient and modern expressions of the same meaning differ. For example, the classical phrase "all around Chuzhou are mountains" would be expressed in modern Chinese as "Chuzhou is surrounded by mountains". Although the expressions differ across time, the core words are the same: one of the goals of constructing the AST is to extract the core words and their relationships, while the specific mode of expression is handled on the premise of understanding and preserving the exact meaning of those core words and relationships. After semantic analysis, the two expressions form the same AST (i.e. the two sentences mean the same thing), as shown in fig. 11: the final AST no longer depends on the specific mode of expression (i.e. it is decoupled from it), retains only the core tree nodes and the relationships between them, and still represents the exact meaning (semantics).
For the parse tree shown in fig. 8, after processing by the semantic analyzer the corresponding AST is constructed as shown in fig. 12. Lexical units that are not directly associated with semantics (e.g. { keyword: SELECT }, { delimiter: , }, etc.) have been removed from the AST, and nodes such as the SELECT clause and the FROM clause in the parse tree have been rewritten into equivalent nodes such as the result list and the data source. The rewritten tree nodes (i.e. the nodes marked with # in fig. 12) no longer depend on specific grammar rules, the number of nodes in the whole AST is significantly reduced, and the hierarchical relationships are greatly simplified. Clearly the AST is more suitable for subsequent analysis, and a grammar change (as long as the semantics are unchanged) does not affect the structure of the AST.
In addition to generating the AST, the 2nd task of the semantic analysis stage is the static type check, i.e. checking whether the expressions in the SQL statement are logical. In fig. 12, the { greater-than comparison } node has the column Amount as its left child and the integer 100 as its right child; these three nodes mean: compare the value of Amount with the integer 100 and return true if Amount is greater than 100 (the condition holds), otherwise return false (the condition does not hold).
For this expression, the work of the static type check is to determine whether the data types on the two sides of the greater-than sign are compatible, i.e. whether the two values being compared are comparable. In this example Amount is a floating point number and 100 is an integer; a floating point number and an integer are compatible and can be compared, but the integer must first be converted into a floating point number, so during the static type check Amount > 100 is rewritten as Amount > CAST(100, DOUBLE). The algorithm that makes the data types consistent is called type promotion (here, the integer type is promoted to the floating point type). If the expression were Amount > 2019-05-01, the data types on the two sides of the greater-than sign would be incompatible (the expression is illegal), because a floating point number and a date cannot be compared. The static type check examines all expressions; if an illegal expression exists, the semantic analyzer passes the specific error information to the error processing module, which handles the error further (including stopping the analysis, attempting a correction, ignoring the error, and so on).
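A minimal sketch of the type-compatibility check and type promotion described above might look as follows in Java; the type set and class names are assumptions made for illustration only:

public final class TypeCheck {
    enum SqlType { INTEGER, DOUBLE, DATE, STRING }

    /** Returns the promoted comparison type, or throws if the operands are incompatible. */
    static SqlType promote(SqlType left, SqlType right) {
        if (left == right) return left;
        // integer and floating point are compatible: promote the integer side
        if ((left == SqlType.INTEGER && right == SqlType.DOUBLE) ||
            (left == SqlType.DOUBLE && right == SqlType.INTEGER)) {
            return SqlType.DOUBLE;                       // e.g. Amount > CAST(100, DOUBLE)
        }
        // e.g. Amount > 2019-05-01: a floating point number and a date are not comparable
        throw new IllegalArgumentException(
            "illegal expression: cannot compare " + left + " with " + right);
    }

    public static void main(String[] args) {
        System.out.println(promote(SqlType.DOUBLE, SqlType.INTEGER)); // DOUBLE -> insert the CAST
        try {
            promote(SqlType.DOUBLE, SqlType.DATE);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());          // would be handed to the error processing module
        }
    }
}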
The 3rd task of the semantic analysis stage is to update the identifiers stored in the symbol table management module so that their meaning is finally confirmed, taking Orders and ID in the example SQL statement as an example:
1) in the lexical analysis stage, two words of Orders and ID in the SQL statement are identified as { identifier: Orders } and { identifier: ID }, and are stored in a symbol table management module;
2) in the parsing phase, the two identifiers are further identified as { data source: identifier: Orders } and { column: identifier: ID } and the corresponding identifiers in the symbol table management module are updated;
3) in the semantic analysis stage, they are analyzed and identified further. The analysis at this stage needs to interact with HBase: the semantic analyzer checks whether a table named Orders exists in HBase and, if it does, further checks whether a column named ID exists in the Orders table. Finally the two identifiers become { table: Orders } and { column: ID, belonging table: Orders }, and the corresponding identifiers in the symbol table management module are updated. At this point all identifiers in the symbol table management module have fully clarified semantics (i.e. it has been confirmed what Orders and ID specifically mean).
The identification of identifiers across the above 3 stages is a process of becoming explicit step by step; the changes in the symbol table management module over the whole process are shown in fig. 13, with the stored identifiers updated at each stage.
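The step-by-step refinement of identifiers can be sketched, again with invented class names and fields, as a small symbol table whose entries are updated once per stage:

import java.util.*;

public final class SymbolTable {
    static final class Entry {
        String name;      // e.g. "Orders", "ID"
        String kind;      // identifier -> data source / column -> table / column
        String owner;     // for a column: its table, once known
        Entry(String name, String kind) { this.name = name; this.kind = kind; }
        public String toString() { return "{" + kind + ": " + name + (owner != null ? ", table: " + owner : "") + "}"; }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    void register(String name)                { entries.put(name, new Entry(name, "identifier")); }  // lexical stage
    void refineKind(String name, String kind) { entries.get(name).kind = kind; }                     // syntax stage
    void resolve(String name, String kind, String owner) {                                           // semantic stage
        Entry e = entries.get(name); e.kind = kind; e.owner = owner;
    }

    public static void main(String[] args) {
        SymbolTable t = new SymbolTable();
        t.register("Orders");                  t.register("ID");
        t.refineKind("Orders", "data source"); t.refineKind("ID", "column");
        t.resolve("Orders", "table", null);    t.resolve("ID", "column", "Orders");
        System.out.println(t.entries.values());
    }
}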
The 4th task of the semantic analysis stage is to generate the execution plan on the basis of the first 3 tasks. The execution plan is a tree structure composed of operators (i.e. the basic operations of relational algebra) and represents the execution mode and execution steps of the SQL statement: the nodes of the tree represent the execution mode (i.e. the specific operations) and the relationships between the nodes (the shape of the tree) represent the execution steps. The execution plan generated from the AST is shown in fig. 14.
As shown in fig. 14, the three AST nodes (#result list, #data source, #filter condition) are converted into the corresponding execution plan nodes (projection, relation, selection) according to the relational algebra meaning they represent, and the shape of the tree changes: the three originally parallel AST nodes (sibling relationship) become nodes with an upper-lower hierarchical relationship (parent-child relationship). The shape, determined from the AST root node (#query), defines the execution steps: the "relation" operation is executed first, then the "selection" operation, and finally the "projection" operation.
Semantic analysis is carried out through the above 4 jobs (i.e. the processing flow) and finally forms the execution plan; the relationship between the 4 jobs is shown in fig. 15.
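A compact sketch of the 4th job, assuming illustrative node classes (Java 17 records) rather than the patent's own, shows how the three sibling AST nodes of fig. 14 become the nested projection-selection-relation chain of the execution plan:

import java.util.List;

public final class PlanBuilder {
    sealed interface PlanNode permits Projection, Selection, Relation { }
    record Relation(String table) implements PlanNode { }                           // data source
    record Selection(String predicate, PlanNode child) implements PlanNode { }      // filter condition
    record Projection(List<String> columns, PlanNode child) implements PlanNode { } // result list

    /** The parallel AST children become a parent-child chain: relation runs first, projection last. */
    static PlanNode build(List<String> resultList, String dataSource, String filter) {
        PlanNode relation  = new Relation(dataSource);
        PlanNode selection = new Selection(filter, relation);
        return new Projection(resultList, selection);
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("ID", "Buyer"), "Orders",
                "Amount > 100 AND Date BETWEEN 2019-04-01 AND 2019-06-30"));
    }
}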
Fourth, optimization stage
The output of the semantic analysis stage is an execution plan that represents the execution mode and execution steps of the SQL statement and can be understood and executed by a computer. By introducing an optimizer, the invention can further process the execution plan to optimize its execution performance. The optimization algorithms fall into 3 categories: 1) logical optimization; 2) physical optimization; 3) directional optimization. As described above, the execution plan is a tree structure composed of operators; the optimization algorithms traverse the tree one or more times according to a traversal strategy (pre-order or post-order) and, during the traversal, adjust the attributes of tree nodes (e.g. modify node expressions) or adjust the shape of the tree (e.g. adjust the parent-child relationships between nodes), thereby optimizing the execution plan. Tree traversal is a recursive operation: a pre-order traversal visits a node first and then its child nodes, until all tree nodes have been visited; a post-order traversal visits a node's children first and then the node itself, until all tree nodes have been visited.
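The optimizer's driver loop can be sketched as a post-order traversal that first optimizes the children of a node and then offers the node to each rewrite rule; the Operator interface and the do-nothing demo rule below are illustrative assumptions, not the patent's classes:

import java.util.List;
import java.util.function.UnaryOperator;

public final class PlanOptimizer {
    interface Operator {
        List<Operator> children();
        Operator withChildren(List<Operator> newChildren);
    }

    record Leaf(String name) implements Operator {
        public List<Operator> children() { return List.of(); }
        public Operator withChildren(List<Operator> c) { return this; }
    }

    /** Post-order traversal: children first, then the node itself is offered to every rule. */
    static Operator optimize(Operator node, List<UnaryOperator<Operator>> rules) {
        List<Operator> optimizedChildren = node.children().stream()
                .map(child -> optimize(child, rules))
                .toList();
        Operator rewritten = node.withChildren(optimizedChildren);
        for (UnaryOperator<Operator> rule : rules) {
            rewritten = rule.apply(rewritten);           // e.g. constant folding, predicate push-down
        }
        return rewritten;
    }

    public static void main(String[] args) {
        // A do-nothing rule keeps the demo honest: real rules would rewrite expressions or reshape the tree.
        System.out.println(optimize(new Leaf("relation: Orders"), List.of(op -> op)));
    }
}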
1. Logical optimization: logical optimization performs equivalent transformations (i.e. adjustments at the logical level) of the execution plan through a group of algorithms. The specific algorithms include: join reordering, predicate push-down, column pruning, projection merging, selection merging, constraint derivation, constant propagation, In expression optimization, constant folding, Like expression optimization, Between expression optimization, logical expression optimization, and the like.
The new execution plan produced by applying logical optimization to the execution plan of the SQL example (fig. 14) is shown in fig. 16.
Compared with the plan before optimization (shown in FIG. 14), the change in the execution plan shown in FIG. 16 lies in the "selection" node, whose expression is rewritten from the original:
(Date BETWEEN 2019-04-01 AND 2019-06-30) AND (Amount>CAST(100,DOUBLE))
the optimization is as follows:
Amount IS NOT NULL AND Amount > 100.00 AND Date IS NOT NULL AND Date >= 2019-04-01 AND Date <= 2019-06-30
The three algorithms through which the actual optimization takes place are constraint derivation, constant folding and BETWEEN optimization.
Constraint derivation: Amount IS NOT NULL is derived from Amount > 100, because if Amount is greater than 100 it is certainly not null; Date IS NOT NULL is derived in the same way.
Constant folding: CAST(100, DOUBLE) can be computed directly, the result is 100.00, and the expression is optimized to Amount > 100.00.
BETWEEN optimization: the BETWEEN ... AND ... expression is rewritten into the equivalent >= and <= comparisons.
After optimization, the expression is more precise (filter conditions are added), more concise (pre-computed), and easier for the computer to process (BETWEEN ... AND is removed).
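Two of these rewrites, BETWEEN optimization and constant folding, can be sketched on a toy expression tree as follows; the Expr classes are invented for illustration and do not reflect the patent's expression model:

public final class LogicalRewrites {
    sealed interface Expr permits Col, Lit, Cast, Between, Cmp, And { }
    record Col(String name) implements Expr { }
    record Lit(Object value) implements Expr { }
    record Cast(Expr inner, String type) implements Expr { }
    record Between(Expr target, Expr low, Expr high) implements Expr { }
    record Cmp(String op, Expr left, Expr right) implements Expr { }
    record And(Expr left, Expr right) implements Expr { }

    static Expr rewrite(Expr e) {
        if (e instanceof Between b) {                    // BETWEEN optimization: a BETWEEN x AND y -> a >= x AND a <= y
            return new And(new Cmp(">=", rewrite(b.target()), rewrite(b.low())),
                           new Cmp("<=", rewrite(b.target()), rewrite(b.high())));
        }
        if (e instanceof Cast c && c.inner() instanceof Lit l && l.value() instanceof Integer i) {
            return new Lit(i.doubleValue());             // constant folding: CAST(100, DOUBLE) -> 100.0
        }
        if (e instanceof Cmp c) return new Cmp(c.op(), rewrite(c.left()), rewrite(c.right()));
        if (e instanceof And a) return new And(rewrite(a.left()), rewrite(a.right()));
        return e;
    }

    public static void main(String[] args) {
        Expr filter = new And(
                new Between(new Col("Date"), new Lit("2019-04-01"), new Lit("2019-06-30")),
                new Cmp(">", new Col("Amount"), new Cast(new Lit(100), "DOUBLE")));
        System.out.println(rewrite(filter));
    }
}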
2. Physical optimization: the invention innovates on the SQL optimization approach by combining mode a and mode b shown in fig. 3 in the optimization process, and adopts just-in-time compilation (JIT) technology to dynamically compile the computational logic (i.e. the expressions) in the execution plan into machine language, further accelerating SQL execution, as shown in fig. 17.
Through physical optimization, the change to the above example execution plan is again mainly in the "selection" node, whose expression is: Amount IS NOT NULL AND Amount > 100.00 AND Date IS NOT NULL AND Date >= 2019-04-01 AND Date <= 2019-06-30
After being compiled into machine language by the JIT compiler, this expression is executed as machine language during the execution of the "selection" node, which effectively improves the overall execution efficiency of the SQL.
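The real physical optimization compiles the expression to machine code with a JIT compiler; as a rough stand-in for the idea (not the actual mechanism), the following sketch "compiles" the example filter once into a reusable java.util.function.Predicate, so that per-row evaluation no longer interprets an expression tree:

import java.util.Map;
import java.util.function.Predicate;

public final class CompiledFilter {
    /** Build the predicate once per execution plan, then apply it to every row. */
    static Predicate<Map<String, Object>> compileExampleFilter() {
        return row -> {
            Object amount = row.get("Amount");
            Object date   = row.get("Date");
            return amount != null && date != null
                    && ((Number) amount).doubleValue() > 100.00
                    && ((String) date).compareTo("2019-04-01") >= 0
                    && ((String) date).compareTo("2019-06-30") <= 0;
        };
    }

    public static void main(String[] args) {
        Predicate<Map<String, Object>> filter = compileExampleFilter();
        Map<String, Object> kept    = Map.of("Amount", 120.0, "Date", "2019-05-20");
        Map<String, Object> dropped = Map.of("Amount", 80.0,  "Date", "2019-05-20");
        System.out.println(filter.test(kept));     // true: the row passes the selection
        System.out.println(filter.test(dropped));  // false: the row is filtered out
    }
}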
3. Directional optimization: as described above, the invention implements complete SQL semantics (i.e. the operators for all relational operations) by modifying and extending HBase. As shown in fig. 18, the execution plan of the SQL can run inside HBase, i.e. the operators (tree nodes) in the execution plan can be mapped to the operators implemented in HBase by the invention. The operators implemented on HBase follow relational operation theory while fully exploiting the characteristics of HBase, and a special class of operators called hybrid operators is provided; their execution performance on HBase is specially optimized and exceeds that of the basic operators provided in relational databases.
Directional optimization converts the logical operators in the execution plan into the physical operators in HBase according to the mapping relationship, shown by the dotted lines in fig. 18. Through directional optimization, all operators in the execution plan are converted into physical operators that can actually be executed, and the execution plan is thereby optimized.
After the execution plan of the above example (before optimization) is processed by the optimizer, the final execution plan (after optimization) is as shown in fig. 19: the three operators projection-selection-relation are ultimately mapped to a single PFR operator. The PFR operator is a hybrid operator; when executed, it calls the HBase basic operation Scan to acquire the data. PFR is the English acronym for Projection, Filter (selection), Relation.
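Using only the standard HBase client API, a rough sketch of what a PFR-style hybrid operator accomplishes might look as follows; the table and column-family names are assumptions, the selection is evaluated on the client here for simplicity, and the patented operator implementation module instead runs inside HBase itself (for example, pushing the filter to the server side):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public final class PfrOperatorSketch {
    public static void main(String[] args) throws Exception {
        byte[] cf = Bytes.toBytes("cf");                 // assumed column family
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table orders = conn.getTable(TableName.valueOf("Orders"))) {

            Scan scan = new Scan();                      // Relation: scan the Orders table
            scan.addColumn(cf, Bytes.toBytes("ID"));     // Projection: only the needed columns
            scan.addColumn(cf, Bytes.toBytes("Buyer"));
            scan.addColumn(cf, Bytes.toBytes("Date"));
            scan.addColumn(cf, Bytes.toBytes("Amount"));

            try (ResultScanner rs = orders.getScanner(scan)) {
                for (Result row : rs) {                  // Selection: the filter condition of the example
                    double amount = Bytes.toDouble(row.getValue(cf, Bytes.toBytes("Amount")));
                    String date   = Bytes.toString(row.getValue(cf, Bytes.toBytes("Date")));
                    if (amount > 100.00 && date.compareTo("2019-04-01") >= 0 && date.compareTo("2019-06-30") <= 0) {
                        System.out.println(Bytes.toString(row.getValue(cf, Bytes.toBytes("ID"))) + ", "
                                + Bytes.toString(row.getValue(cf, Bytes.toBytes("Buyer"))));
                    }
                }
            }
        }
    }
}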
The physical operators provided by the operator implementation module and the mapping relationships between the physical operators and the logical operators are as follows: 1) projection-selection-relational operator (i.e. the PFR operator): projection-selection-relation, projection-relation, selection-relation and relation may be mapped to it. 2) Aggregation-selection-relational operator: aggregation-selection-relation and aggregation-relation may be mapped to it. 3) Projection operator: projections other than case 1 may be mapped to it. 4) Selection operator: selections other than cases 1 and 2 may be mapped to it. 5) Relational operator: relations other than cases 1 and 2 may be mapped to it, including: table relations, virtual relations, null relations, sub-queries. 6) Join operator: all joins may be mapped to it, including: inner join, outer join, natural join, Cartesian product. 7) Aggregation operator: aggregations other than case 2 may be mapped to it. 8) Sort operator: all sorts may be mapped to it. 9) Pagination operator: all pagination may be mapped to it. 10) Deletion operator: all deletions may be mapped to it. 11) Update operator: all updates may be mapped to it. 12) Insertion operator: all insertions may be mapped to it. 13) Instruction operator (covering all instructions other than insert, delete, update and query, such as table creation, user authorization and the like): all such instructions may be mapped to it.
In the optimization stage, the input of the optimizer is an execution plan; before the optimization stage it can be called a logical plan, and after the optimization stage a physical plan. This change of name also reflects the role of the optimizer: converting a logical plan that is better suited to describing the meaning of the SQL into a physical plan that is better suited to execution by the computer. The relationship between the optimizer and the other modules is shown in FIG. 20.
Fifth, execution stage
After lexical analysis, syntax analysis, semantic analysis and optimization, the SQL statement has produced a physical execution plan and finally enters the actual execution stage. As described above, the execution plan is a tree structure formed by operators. When running the execution plan, the executor drives it through the root node: the executor calls the root node of the execution plan, the root node calls its lower-layer child nodes, and the calls continue layer by layer until the whole execution plan has been run; the flow is shown in fig. 21. In fig. 21, the solid lines indicate the call flow and the dotted lines indicate the result return flow: the executor first calls the root node a of the execution plan (step 1), node a concurrently calls the child nodes b and c (step 2), node b calls its child node d (step 3), each node collects the returned results upward layer by layer (steps 4, 5 and 6), and the results are finally returned to the executor. The calls go downward layer by layer and the returned results go upward layer by layer.
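The root-driven, layer-by-layer call pattern can be sketched with a classic open/next/close iterator interface; the class names are illustrative only, and the in-memory relation operator below merely stands in for the HBase operators:

import java.util.Iterator;
import java.util.List;
import java.util.Map;

public final class ExecutorSketch {
    interface PhysicalOperator { void open(); Map<String, Object> next(); void close(); }

    /** The executor only talks to the root; each operator pulls rows from its own child. */
    static void run(PhysicalOperator root) {
        root.open();                                     // step 1: call the root node
        for (Map<String, Object> row = root.next(); row != null; row = root.next()) {
            System.out.println(row);                     // results are returned upward layer by layer
        }
        root.close();
    }

    /** A leaf operator over an in-memory table, standing in for the HBase relation operator. */
    static PhysicalOperator relation(List<Map<String, Object>> rows) {
        return new PhysicalOperator() {
            private Iterator<Map<String, Object>> it;
            public void open() { it = rows.iterator(); }
            public Map<String, Object> next() { return it.hasNext() ? it.next() : null; }
            public void close() { }
        };
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(Map.of("ID", "A001", "Buyer", "B17"));
        run(relation(rows));
    }
}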
Running a program occupies computer resources (including CPU, memory, disk, etc.), and the resources of a computer are limited. The executor adopts a technology called a resource pool: when the executor is started, it applies to the computer in advance for a portion of spare idle resources as its resource pool, and when the executor processes an execution plan, the resources are no longer allocated by the computer but are allocated and recycled by the executor itself, as shown in FIG. 22.
Compared with having the computer allocate the resources, the resource pool technology has the following advantages: 1. The computer resources occupied by the executor are constant: the total amount of the resource pool is applied for in advance and remains unchanged, so even when many execution plans need to be executed, the executor does not occupy a large amount of additional computer resources, which effectively guarantees the stable operation of both the executor and the computer. 2. By applying for the resources in advance, the executor prevents the computer resources from being occupied by other programs, which would otherwise leave the executor short of resources and prevent execution plans from running normally. 3. The executor controls the allocation of resources itself, which is more flexible than control by the computer; different resource allocation strategies can be specified (such as first come first served, small-resource first, large-resource first, concurrency first, and the like) to suit different usage scenarios.
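As a minimal sketch of the resource-pool idea (the size, names and use of a semaphore are assumptions made for illustration), the executor can reserve a fixed number of execution slots at start-up and have every execution plan borrow and return a slot from this pool:

import java.util.concurrent.Semaphore;

public final class ExecutorResourcePool {
    private final Semaphore slots;

    ExecutorResourcePool(int reservedSlots) {
        this.slots = new Semaphore(reservedSlots);       // applied for once, when the executor starts
    }

    void runPlan(Runnable plan) throws InterruptedException {
        slots.acquire();                                 // the executor, not the computer, allocates the resource
        try {
            plan.run();
        } finally {
            slots.release();                             // and recycles it afterwards
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorResourcePool pool = new ExecutorResourcePool(4);
        pool.runPlan(() -> System.out.println("execution plan finished"));
    }
}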
The input, output and relationships of the executor to the other modules are shown in FIG. 23: the input of the executor is the execution plan and the output is the final execution result of the SQL. The double dotted lines in the figure indicate that the executor interacts with the operator implementation module multiple times (i.e. every operator in the execution plan calls the operator implementation module), and the operator implementation module completes the actual data reading and writing through the three types of operation modules (the basic operation module, the extended operation module and the transformation operation module).
When the execution plan of the SQL example described above is executed according to the flow shown in fig. 23, the data retrieval process over the original data table (the Orders table) is shown in the following table; the rows marked with diagonal lines in the table are filtered out during execution.
Data retrieval for Orders tables
(The data retrieval over the Orders table is shown as an image in the original document.)
The data finally returned (i.e. the SQL query result) is shown in the following table; only the data meeting the conditions is retained (all orders in the second quarter of 2019 with an amount greater than 100 dollars, returning the order number and the buyer number).
Query results of SQL
(The SQL query result is shown as an image in the original document.)
In summary, through the SQL interpreter, the SQL optimization method, and the modification and extension of HBase provided by the invention, SQL access support can be provided for HBase, i.e. HBase can be operated through SQL; the execution efficiency of SQL is greatly improved compared with the native HBase Client API, the difficulty of using HBase is reduced, and HBase can be used and accessed in the general operating mode of the database field (SQL).
It is pointed out here that the above description is helpful for the person skilled in the art to understand the invention, but does not limit the scope of protection of the invention. Any such equivalents, modifications and/or omissions as may be made without departing from the spirit and scope of the invention may be resorted to.

Claims (10)

1. An SQL interpreter of an HBase, comprising an SQL statement input interface for receiving SQL statements from an application, an operator call interface for connecting the HBase, and a transmission interface for transmitting results of execution of the SQL statements in the HBase between the HBase and the application, wherein the SQL statement input interface is connected to a lexical analyzer, the lexical analyzer is connected to a parser, the parser is connected to a semantic analyzer, the semantic analyzer is connected to an executor, the executor is connected to the operator call interface of the HBase to call physical operators provided in an HBase operator implementation module, and the physical operators comprise one or more of the following combinations: projection-selection-relational operator, aggregation-selection-relational operator, projection operator, selection operator, relational operator, join operator, aggregation operator, sort operator, pagination operator, deletion operator, update operator, insertion operator, and instruction operator, and the semantic analyzer has an information interface connected with the HBase.
2. The HBase SQL interpreter according to claim 1, wherein the lexical analyzer, the syntax analyzer, the semantic analyzer, and the executor are respectively connected to a symbol table management module that stores the lexical units whose word type is identifier; the lexical analyzer, the syntax analyzer, the semantic analyzer and the executor are respectively connected with an error processing module, the error processing module processes errors according to error types, and the processing modes include: attempting automatic correction, terminating the SQL parsing, or ignoring the error.
3. The HBase SQL interpreter according to claim 1, wherein the semantic analyzer is connected to the executor through an optimizer; the optimizer is respectively connected with the symbol table management module and the error processing module.
4. The HBase SQL interpreter according to claim 1, wherein the lexical analyzer employs a Deterministic Finite Automaton (DFA) algorithm to parse an input SQL statement into a sequence of lexical units, the sequence of lexical units being output to the parser, the lexical analyzer outputting error information to an error handling module, and identifiers to a symbol table management module; the lexical units in the sequence of lexical units include the word itself and the word type, and the word type classifies a word as one of: keyword, identifier, delimiter, operator, word size; the lexical unit sequence does not include spaces, line feeds, and comments; the keyword is matched as one of the following words: SELECT, FROM, WHERE, CREATE, DELETE, INSERT, UPDATE; the identifier is matched as a character sequence that begins with a letter, consists of letters, digits and underscores, and is not a keyword; the word size includes an integer word size, a date word size and/or a string word size.
5. The HBase SQL interpreter according to claim 4, characterized in that the deterministic finite automaton DFA algorithm includes a quintuple of data M, M = (K, Σ, f, S, Z), where K is a finite set, each element in the finite set being called a state; Σ is a finite alphabet, each element of which is called an input symbol; f is a transition function, which is a mapping on K × Σ → K; S belongs to K and is the unique initial state; Z ⊂ K is the set of final states; f is either a total transition function or a partial transition function, and if f(ki, a) = kj (ki ∈ K, kj ∈ K), then when the current state is ki and the input symbol is a, the automaton transitions to the next state kj, where kj is a successor state of ki.
6. The HBase SQL interpreter according to claim 2, wherein the parser receives the sequence of lexical units from the lexical analyzer, assembles the sequence of lexical units into a parse tree according to set grammar rules, the parse tree is output to the semantic analyzer, the parser outputs error information to an error processing module, and updates the identifiers in the symbol table management module according to the parsing result; the grammar rules adopt statement composition rules or adopt grammar quadruple data G, G = (N, E, P, B), from the field of computer programming language syntax analysis, where N is the set of non-terminal symbols; E is the set of terminal symbols, and E and N have no intersection; P is a set of production rules of the form (E ∪ N)* → (E ∪ N)*, and the string on the left of a production must include at least one non-terminal symbol; B is the start symbol, and B belongs to N; in the structure of the parse tree, the SQL statement is the root node located at the top, the leaf nodes are the lexical units located at the last layer, and the branch nodes are located between the root node and the leaf nodes; the root node corresponds to the start symbol B, the leaf nodes belong to the terminal symbol set E, and the branch nodes belong to the non-terminal symbol set N.
7. The HBase SQL interpreter according to claim 1, wherein the parser parses the sequence of lexical units according to the set grammar rules using an adaptive LL (k) algorithm, where the first L in the LL indicates that the sequence of lexical units is parsed from left to right, the second L indicates that the left-most derivation will be used during parsing, k ≧ 1, k indicates that k lexical units are matched forward in the matching process according to the grammar rules, and the adaptive LL (k) algorithm can perform parsing on the grammar in a dynamic manner during parsing and can automatically rewrite the left recursion in the grammar to an equivalent non-left recursive form.
8. The HBase SQL interpreter according to claim 1, wherein the semantic analyzer receives the parse tree from the parser and looks up the table and column information from the HBase database through the information interface, and then generates an execution plan based on 3 jobs of constructing an AST abstract syntax tree, performing static type check, and updating identifiers stored in a symbol table management module, the execution plan being output to the executor directly or after being optimized by an optimizer, and the semantic analyzer outputs error information to an error processing module and updates the identifiers in the symbol table management module according to the result of the semantic analysis; the AST abstract syntax tree construction process transforms the structure of the parse tree according to the principles of compactness, ease of use and ideographic representation, and removes lexical units that are not directly associated with semantics to form an association structure of semantic tree nodes carrying table names, column names, operation logic and word sizes; the static type check comprises checking whether the expressions in the SQL statement are logical, and if an illogical expression exists, the semantic analyzer transmits the specific error information to the error processing module, and the error processing module further processes the error; the execution plan converts the semantic tree nodes under the SQL statement root node in the AST abstract syntax tree into corresponding execution plan nodes according to the relational algebra meaning they represent, forming a tree structure composed of operators, and the operators belong to the basic operations of relational algebra.
9. The HBase SQL interpreter according to claim 1, wherein the executor receives an execution plan and drives the execution plan to run through the root node of the execution plan, the root node in turn calls its lower-layer child nodes, and the calls proceed layer by layer to complete the running of the whole execution plan; the executor adopts a resource pool technology, a portion of additional idle resources is applied for from the computer in advance as the executor's resource pool when the executor is started, and when the executor processes an execution plan, the resources are no longer allocated by the computer but are allocated and recycled by the executor.
10. Optimization method for use in the SQL interpreter of the HBase according to one of the preceding claims 1 to 9, characterized by comprising a logical optimization algorithm, a physical optimization algorithm and/or a directional optimization algorithm, the logical optimization algorithm being a combination of one or more of the following algorithms: join reordering, predicate push-down, column pruning, projection merging, selection merging, constraint derivation, constant propagation, In expression optimization, constant folding, Like expression optimization, Between expression optimization and logical expression optimization; the physical optimization algorithm dynamically compiles the computational logic in an execution plan into machine language through a JIT just-in-time compiler, thereby improving the execution efficiency of SQL statements in the HBase database; the directional optimization algorithm converts the logical operators in the execution plan into the mapped physical operators in the HBase according to the mapping relationship.
CN202010405641.0A 2020-05-14 2020-05-14 SQL interpreter and optimization method of HBase Active CN111309757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010405641.0A CN111309757B (en) 2020-05-14 2020-05-14 SQL interpreter and optimization method of HBase

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010405641.0A CN111309757B (en) 2020-05-14 2020-05-14 SQL interpreter and optimization method of HBase

Publications (2)

Publication Number Publication Date
CN111309757A CN111309757A (en) 2020-06-19
CN111309757B true CN111309757B (en) 2020-09-01

Family

ID=71161131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010405641.0A Active CN111309757B (en) 2020-05-14 2020-05-14 SQL interpreter and optimization method of HBase

Country Status (1)

Country Link
CN (1) CN111309757B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069198B (en) * 2020-07-16 2021-09-10 中科驭数(北京)科技有限公司 SQL analysis optimization method and device
CN112052255B (en) * 2020-09-02 2022-05-03 福建天晴在线互动科技有限公司 SQL (structured query language) interpretation method and device for splitting multi-table slow query from top to bottom
CN112346730B (en) * 2020-11-04 2021-08-27 星环信息科技(上海)股份有限公司 Intermediate representation generation method, computer equipment and storage medium
CN112835925B (en) * 2021-02-02 2024-03-29 北京握奇数据股份有限公司 SQL statement analysis method for embedded chip
CN112949172B (en) * 2021-02-24 2023-07-04 重庆中科云从科技有限公司 Data processing method, device, machine-readable medium and equipment
CN113448982A (en) * 2021-06-30 2021-09-28 未鲲(上海)科技服务有限公司 DDL statement analysis method and device, computer equipment and storage medium
CN114090017B (en) * 2022-01-20 2022-06-24 北京大学 Method and device for analyzing programming language and nonvolatile storage medium
CN114461351B (en) 2022-04-13 2022-06-17 之江实验室 Dynamic graph execution method and device for neural network computation
CN115905236B (en) * 2022-11-30 2023-08-22 深圳计算科学研究院 Data processing method, device, equipment and storage medium
CN117971236B (en) * 2024-03-31 2024-06-18 浪潮电子信息产业股份有限公司 Operator analysis method, device, equipment and medium based on lexical and grammatical analysis

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10599654B2 (en) * 2017-06-12 2020-03-24 Salesforce.Com, Inc. Method and system for determining unique events from a stream of events

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1251067C (en) * 2002-06-26 2006-04-12 联想(北京)有限公司 Method for realizing modular query language interpreter in the flush type data base system
US10496640B2 (en) * 2012-12-19 2019-12-03 Salesforce.Com, Inc. Querying a not only structured query language (NoSQL) database using structured query language (SQL) commands
CN107818100B (en) * 2016-09-12 2019-12-20 杭州海康威视数字技术股份有限公司 SQL statement execution method and device
CN106682147A (en) * 2016-12-22 2017-05-17 北京锐安科技有限公司 Mass data based query method and device
CN110019291A (en) * 2017-09-04 2019-07-16 ***通信集团浙江有限公司 A kind of SQL analytic method and SQL resolver
CN107665404B (en) * 2017-09-25 2020-08-18 北京航空航天大学 Domain specific language description system and method for taxi supervision
CN110968579B (en) * 2018-09-30 2023-04-11 阿里巴巴集团控股有限公司 Execution plan generation and execution method, database engine and storage medium
CN110083625A (en) * 2019-03-18 2019-08-02 北京奇艺世纪科技有限公司 Realtime stream processing method, equipment, data processing equipment and medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10599654B2 (en) * 2017-06-12 2020-03-24 Salesforce.Com, Inc. Method and system for determining unique events from a stream of events

Also Published As

Publication number Publication date
CN111309757A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111309757B (en) SQL interpreter and optimization method of HBase
CN110187885B (en) Intermediate code generation method and device for quantum program compiling
Van den Brand et al. Disambiguation filters for scannerless generalized LR parsers
US7680782B2 (en) Method to generate semantically valid queries in the XQuery language
Rheinländer et al. Optimization of complex dataflows with user-defined functions
US20040267760A1 (en) Query intermediate language method and system
Rompf et al. Functional pearl: a SQL to C compiler in 500 lines of code
Abiteboul et al. Object identity as a query language primitive
Ackermann et al. Jet: An embedded DSL for high performance big data processing
Johnstone et al. Modelling GLL parser implementations
US8495055B2 (en) Method and computer program for evaluating database queries involving relational and hierarchical data
Taylor Generalized data base management system data structures and their mappingto physical storage
Abiteboul et al. A logical view of structured files
Rompf et al. A SQL to C compiler in 500 lines of code
EP1504362A1 (en) Cooperation of concurrent, distributed networks of resources
Matsuda et al. A functional reformulation of UnCAL graph-transformations: or, graph transformation as graph reduction
Borkar et al. A Common Compiler Framework for Big Data Languages: Motivation, Opportunities, and Benefits.
Wyss et al. A relational algebra for data/metadata integration in a federated database system
Munnecke et al. MUMPS: Characteristics and comparisons with other programming systems
Bandat On the formal definition of PL/I
CN116560667B (en) Splitting scheduling system and method based on precompiled delay execution
US11656868B1 (en) System and method for translating existential Datalog into differential dataflow
CN111158691B (en) Method for realizing rule engine dynamic
Hao Implementation of the nested relational algebra in Java
Botoeva et al. Expressivity and complexity of MongoDB (Extended version)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant