Semi-structured data query method and distributed NewSQL database system
Technical Field
The invention relates to the technical field of big data, in particular to a semi-structured data query method and a distributed NewSQL database system.
Background
HBase is currently one of the best-known distributed NoSQL databases in the Hadoop ecosystem. Its main components are the HMaster and the HRegionServer. It provides the user with a table-style data model in which a table is divided into a plurality of Regions according to primary-key range; the HMaster is responsible for managing and assigning the Regions, and the HRegionServer is responsible for reading and writing Region data. Data stored in existing HBase has no data type and is held as byte arrays, so querying becomes problematic when semi-structured data such as JSON is stored. To store JSON-format data in HBase, the entire JSON object is conventionally stored as a string. This approach has the following drawbacks:
When records are to be filtered, all records must be read out and then filtered at the client; performance is unacceptable for large data volumes.
When a record needs to be updated, the record must be read out, updated in the specific field, and then written back to HBase to overwrite the original.
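The two drawbacks can be illustrated with a minimal Python sketch. It is not HBase code: the dictionary `table` is a hypothetical in-memory stand-in for a table whose cells are untyped byte arrays, and the function names are invented for illustration only.

```python
import json

# Hypothetical stand-in for a table whose cells are untyped byte arrays.
table = {
    b"row1": json.dumps({"name": "alice", "city": "Paris"}).encode("utf-8"),
    b"row2": json.dumps({"name": "bob", "city": "Berlin"}).encode("utf-8"),
    b"row3": json.dumps({"name": "carol", "city": "Paris"}).encode("utf-8"),
}

def client_side_filter(store, field, value):
    """Return matching row keys and the number of records that had to be read."""
    matches, records_read = [], 0
    for key, raw in store.items():
        records_read += 1  # every record is fetched and decoded at the client
        if json.loads(raw).get(field) == value:
            matches.append(key)
    return matches, records_read

def update_field(store, key, field, value):
    """Read-modify-write: the whole object is rewritten to change one field."""
    doc = json.loads(store[key])
    doc[field] = value
    store[key] = json.dumps(doc).encode("utf-8")

matches, records_read = client_side_filter(table, "city", "Paris")
update_field(table, b"row2", "city", "Paris")
```

In this sketch the filter touches every record regardless of how many match, which is the cost the drawbacks above describe.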
Disclosure of Invention
The embodiments of the invention aim to provide a semi-structured data query method and a distributed NewSQL database system that enable data query in JSON format and address the poor effectiveness and poor performance of processing semi-structured data.
In order to achieve the above object, an embodiment of the present invention provides a method for querying semi-structured data, which is applicable to a distributed NewSQL database system, and includes:
receiving a user request through a JDBC/ODBC interface, wherein the user request comprises a query condition for the JSON data to be queried, and the query result is the JSON data obtained according to the query condition;
parsing and compiling the user request, and generating a corresponding execution plan;
acquiring, according to the execution plan, index data corresponding to the query condition of the user request; wherein an index table stores the index data in the form of an inverted index generated from the JSON data treated as a nested type;
querying a data table according to the acquired index data so as to obtain the corresponding query result; wherein the JSON data is stored as a whole;
and returning the query result to the user.
Further, the parsing and compiling of the user request and the generating of the corresponding execution plan include:
judging whether a pre-stored SQL statement corresponding to the SQL request exists in the shared cache pool; if so, outputting the execution plan corresponding to the pre-stored SQL statement; otherwise,
performing a syntax check on the SQL request; if there is a syntax error, returning error information to the user; otherwise,
performing a semantic check on the SQL request; if there is a semantic error, returning error information to the user; otherwise,
performing view and expression conversion on the SQL request to obtain a corresponding conversion result;
selecting an optimizer according to the conversion result to obtain a corresponding optimizer selection result;
selecting a corresponding data join method and join order according to the optimizer selection result;
selecting a search path according to the join method and the join order;
and generating an execution plan according to the search path, and outputting the execution plan.
Correspondingly, an embodiment of the present invention further provides a distributed NewSQL database system, including:
the JDBC/ODBC interface unit is used for interacting with a user, including receiving a user request and returning a query result to the user; the user request comprises a query condition for the JSON data to be queried, and the query result is the JSON data obtained according to the query condition;
the master unit is used for taking over the user request received by the JDBC/ODBC interface unit, coordinating data communication among a plurality of processors, managing the whole flow, and sending the user request first to the SQLPlanner unit; the master unit is also used for returning the query result to the JDBC/ODBC interface unit;
the SQLPlanner unit is used for parsing the user request, and compiling and customizing an execution plan according to the user request;
a worker unit for executing the plan in parallel, including: according to the execution plan, starting a coprocessor module to obtain index data corresponding to the query condition of the user request, and querying a data table according to the obtained index data so as to obtain the corresponding query result; the worker unit is also used for returning the query result of the Hbase unit to the master unit;
the Hbase unit is used for storing the data table and the index table; the Hbase unit further comprises the coprocessor module; a JSON data type is added at the bottom layer of the Hbase unit, and the JSON data is stored as a whole in the underlying HFile;
and the distributed transaction manager is used for coordinating multiple parties to complete distributed transaction management when the execution plan of the worker unit involves a transaction.
Further, the JDBC/ODBC interface unit is also configured to convert the user request into an SQL request in the form of an SQL statement.
Further, the SQLPlanner unit is configured for:
judging whether a pre-stored SQL statement corresponding to the SQL request exists in the shared cache pool; if so, outputting the execution plan corresponding to the pre-stored SQL statement; otherwise,
performing a syntax check on the SQL request; if there is a syntax error, returning error information to the user; otherwise,
performing a semantic check on the SQL request; if there is a semantic error, returning error information to the user; otherwise,
performing view and expression conversion on the SQL request to obtain a corresponding conversion result;
selecting an optimizer according to the conversion result to obtain a corresponding optimizer selection result;
selecting a corresponding data join method and join order according to the optimizer selection result;
selecting a search path according to the join method and the join order;
and generating an execution plan according to the search path, and outputting the execution plan.
Further, the system also includes:
a monitor for taking charge of metadata management, monitoring the load of the Regions of the Hbase unit, and reallocating the Regions through the coprocessor module of the Hbase unit; the monitor is connected with the master unit.
Further, the monitoring of the load of the Regions of the Hbase unit and the reallocating of the Regions through the coprocessor module of the Hbase unit include:
receiving data distribution information of the Hbase unit, and receiving load information of the worker unit in the master unit, wherein the load information comprises a load deviation value of the worker unit;
comparing the load deviation value of the worker unit with a preset load deviation threshold, and if the load deviation value exceeds the threshold, triggering the Hbase unit to redistribute the Regions between the server with the higher hit rate and the server with the lower hit rate;
acquiring the data volume of each Region, comparing the data volume of each Region with a preset data volume threshold, and if the data volume of a Region exceeds the threshold, triggering the Hbase unit to split that Region into two Regions.
Further, the JDBC/ODBC interface unit includes:
the JDBC application program module is used for receiving the user request, calling JDBC object methods to issue an SQL statement, and extracting the result to return to the user;
the JDBC driver manager module is used for loading and calling the JDBC driver module for the JDBC application program module;
the JDBC driver module is used for executing the calls to the JDBC object methods, sending the SQL statement corresponding to the user request to the underlying database, and returning the result obtained from the underlying database to the JDBC application program module.
Compared with the prior art, the semi-structured data query method and the distributed NewSQL database system provided by the invention receive a user request through a JDBC/ODBC interface, wherein the user request comprises a query condition for the JSON data to be queried, and the query result is the JSON data obtained according to the query condition; parse and compile the user request and generate a corresponding execution plan; acquire, according to the execution plan, index data corresponding to the query condition of the user request, wherein an index table stores the index data in the form of an inverted index generated from the JSON data treated as a nested type; query a data table according to the acquired index data so as to obtain the corresponding query result, wherein the JSON data is stored as a whole; and return the query result to the user. This technical scheme enables data query in JSON format and addresses the poor effectiveness and poor performance of processing semi-structured data.
Drawings
Fig. 1 is a schematic flowchart of a method for semi-structured data query according to embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of a distributed NewSQL database system provided in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the method for semi-structured data query provided in embodiment 1 of the present invention. The method is applicable to a distributed NewSQL database system, and embodiment 1 comprises the following steps:
S1, receiving a user request through a JDBC/ODBC interface, wherein the user request comprises a query condition for the JSON data to be queried, and the query result is the JSON data obtained according to the query condition;
S2, parsing and compiling the user request, and generating a corresponding execution plan;
S3, acquiring, according to the execution plan, index data corresponding to the query condition of the user request; wherein an index table stores the index data in the form of an inverted index generated from the JSON data treated as a nested type;
S4, querying a data table according to the acquired index data so as to obtain the corresponding query result; wherein the JSON data is stored as a whole;
and S5, returning the query result to the user.
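Steps S3 and S4 can be illustrated with a minimal Python sketch of an inverted index over nested JSON. It is an assumption-laden toy, not the patented implementation: `JsonStore`, `flatten`, the dotted path notation, and the in-memory dictionaries standing in for the data table and index table are all hypothetical.

```python
import json

def flatten(doc, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested JSON value."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        for item in doc:
            yield from flatten(item, prefix)  # list items share the parent path
    else:
        yield prefix, doc

class JsonStore:
    def __init__(self):
        self.data_table = {}   # row key -> JSON document stored as a whole
        self.index_table = {}  # (path, value) -> set of row keys (inverted index)

    def put(self, row_key, doc):
        # Assumption: rows are only inserted; updating a row would also
        # require removing its stale index entries.
        self.data_table[row_key] = json.dumps(doc)
        for path, value in flatten(doc):
            self.index_table.setdefault((path, value), set()).add(row_key)

    def query(self, path, value):
        # S3: look up row keys in the index table.
        row_keys = sorted(self.index_table.get((path, value), ()))
        # S4: fetch the whole JSON documents from the data table.
        return [json.loads(self.data_table[k]) for k in row_keys]

store = JsonStore()
store.put("r1", {"user": {"name": "alice", "tags": ["a", "b"]}})
store.put("r2", {"user": {"name": "bob", "tags": ["b"]}})
result = store.query("user.tags", "b")
```

Only the rows named by the index entry are read from the data table, so the lookup cost depends on the number of matches rather than on the table size.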
In the prior art, data stored in HBase has no data type and is held as byte arrays, so querying becomes problematic when semi-structured data such as JSON is stored. To store JSON-format data in HBase, the entire JSON object is conventionally stored as a string. This approach has the following drawbacks: when records are to be filtered, all records must be read out and then filtered at the client, and performance is unacceptable for large data volumes; when a record needs to be updated, the record must be read out, updated in the specific field, and then written back to HBase to overwrite the original. This embodiment supports semi-structured data: a user can directly store data in JSON format, query any field of the JSON, create indexes, and delete data. This addresses the poor effectiveness and poor performance of HBase in the prior art when processing semi-structured data.
Further, step S1 also includes: converting the user request into an SQL request in the form of an SQL statement.
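One possible shape of this conversion is sketched below in Python. The source does not specify the SQL dialect or how JSON fields are addressed, so everything here is an assumption: the helper `to_sql` is hypothetical, and the `JSON_VALUE(column, '$.path')` form is borrowed from the SQL/JSON standard purely for illustration.

```python
def to_sql(table, column, conditions):
    """Build a parameterized SELECT from {json_path: value} query conditions.

    Assumption: table, column and path names come from trusted code; only the
    compared values are passed as bind parameters.
    """
    clauses, params = [], []
    for path, value in conditions.items():
        clauses.append(f"JSON_VALUE({column}, '$.{path}') = ?")
        params.append(value)
    sql = f"SELECT {column} FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = to_sql("docs", "doc", {"user.name": "alice", "user.age": 30})
```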
Further, the parsing, compiling and generating of the corresponding execution plan in step S2 include:
S21, judging whether a pre-stored SQL statement corresponding to the SQL request exists in the shared cache pool; if so, outputting the execution plan corresponding to the pre-stored SQL statement; otherwise,
S22, performing a syntax check on the SQL request; if there is a syntax error, returning error information to the user; otherwise,
S23, performing a semantic check on the SQL request; if there is a semantic error, returning error information to the user; otherwise,
S24, performing view and expression conversion on the SQL request to obtain a corresponding conversion result;
S25, selecting an optimizer according to the conversion result to obtain a corresponding optimizer selection result;
S26, selecting a corresponding data join method and join order according to the optimizer selection result;
S27, selecting a search path according to the join method and the join order;
and S28, generating an execution plan according to the search path, and outputting the execution plan.
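The control flow of steps S21 to S28 can be sketched in Python as follows. This is a toy under stated assumptions, not the planner of the embodiment: the class `Planner`, the exception `SqlError`, the dictionary used as the shared cache pool, and the simplistic checks are all hypothetical, and steps S24 to S27 are collapsed into a single choice between an index lookup and a full scan.

```python
class SqlError(Exception):
    """Raised when the syntax or semantic check fails."""

class Planner:
    def __init__(self, known_tables):
        self.known_tables = set(known_tables)
        self.plan_cache = {}   # stand-in for the shared cache pool: SQL text -> plan
        self.compilations = 0  # counts how often a plan was actually compiled

    def plan(self, sql):
        # S21: reuse the plan of an identical, previously compiled statement.
        if sql in self.plan_cache:
            return self.plan_cache[sql]
        tokens = sql.strip().rstrip(";").split()
        # S22: toy syntax check, accepts only "SELECT <col> FROM <table> ...".
        if (len(tokens) < 4 or tokens[0].upper() != "SELECT"
                or tokens[2].upper() != "FROM"):
            raise SqlError("syntax error")
        # S23: toy semantic check, the table must exist.
        table = tokens[3]
        if table not in self.known_tables:
            raise SqlError(f"semantic error: unknown table {table}")
        # S24-S27 (collapsed): choose a search path from the statement shape.
        has_filter = any(t.upper() == "WHERE" for t in tokens[4:])
        plan = {"table": table,
                "search_path": "index_lookup" if has_filter else "full_scan"}
        # S28: store and output the execution plan.
        self.compilations += 1
        self.plan_cache[sql] = plan
        return plan

planner = Planner(["docs"])
first = planner.plan("SELECT * FROM docs WHERE a = 1")
second = planner.plan("SELECT * FROM docs WHERE a = 1")  # served from the cache
```

The second call returns the cached plan without repeating the checks, which is the purpose of consulting the shared cache pool first.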
In a specific implementation, a user request is received through a JDBC/ODBC interface; the user request is then parsed and compiled, and a corresponding execution plan is generated; then, according to the execution plan, index data corresponding to the query condition of the user request is acquired, wherein an index table stores the index data in the form of an inverted index generated from the JSON data treated as a nested type; a data table is queried according to the acquired index data so as to obtain the corresponding query result, wherein the JSON data is stored as a whole; and finally, the query result is returned to the user.
This embodiment enables data query in JSON format and addresses the poor effectiveness and poor performance of processing semi-structured data.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a distributed NewSQL database system according to embodiment 2 of the present invention, where the embodiment includes:
the JDBC/ODBC interface unit 1 is used for interacting with a user, including receiving a user request and returning a query result to the user; the user request comprises a query condition for the JSON data to be queried, and the query result is the JSON data obtained according to the query condition;
the master unit 2 is used for taking over the user request received by the JDBC/ODBC interface unit 1, coordinating data communication among a plurality of processors, managing the whole process, and sending the user request first to the SQLPlanner unit 3; the master unit 2 is also used for returning the query result to the JDBC/ODBC interface unit 1;
the SQLPlanner unit 3 is used for parsing the user request, and compiling and customizing an execution plan according to the user request;
a worker unit 4 for executing the plan in parallel, including: according to the execution plan, starting a coprocessor module to obtain index data corresponding to the query condition of the user request, and querying a data table according to the obtained index data so as to obtain the corresponding query result; the worker unit 4 is also used for returning the query result of the Hbase unit to the master unit 2;
an Hbase unit 6, configured to store the data table and the index table; the Hbase unit 6 further comprises the coprocessor module 61; a JSON data type is added at the bottom layer of the Hbase unit 6, and the JSON data is stored as a whole in the underlying HFile;
and the distributed transaction manager 5, used for coordinating multiple parties to complete distributed transaction management when the execution plan of the worker unit 4 involves a transaction.
Generally, the distributed NewSQL database system of this embodiment allows a user to flexibly establish secondary indexes according to specific business logic. In practical applications, a user often establishes several secondary indexes; at query time the system dynamically calculates the cost of using each index according to the query conditions and automatically selects the most appropriate one. Because queries by rowkey are extremely efficient, the secondary index is implemented by using the coprocessor module 61 and the Filter module 62 of the Hbase unit 6 to generate an index table for the data.
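The cost-based choice among several secondary indexes can be illustrated with a minimal Python sketch. The source does not describe the cost model, so the one used here is an assumption: the estimated cost of an index is the expected number of matching rows, taken as the row count divided by the number of distinct values of the indexed field. The function `choose_index` and the metadata layout are hypothetical.

```python
def choose_index(indexes, query_fields, row_count):
    """Pick the usable index with the lowest estimated cost.

    indexes: index name -> {"field": str, "distinct_values": int}
    Returns (index name or None, estimated cost); None means a full scan.
    """
    best_name, best_cost = None, float(row_count)  # full scan as the baseline
    for name, meta in indexes.items():
        if meta["field"] not in query_fields:
            continue  # the index does not cover any query condition
        cost = row_count / meta["distinct_values"]  # assumed cost model
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost

indexes = {
    "idx_city": {"field": "city", "distinct_values": 50},
    "idx_user_id": {"field": "user_id", "distinct_values": 10000},
}
choice = choose_index(indexes, {"city", "user_id"}, row_count=1_000_000)
```

Under this assumed model the more selective index on `user_id` is preferred, and a query that touches no indexed field falls back to a full scan.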
Further, the JDBC/ODBC interface unit 1 is also configured to convert the user request into an SQL request in the form of an SQL statement.
Further, the SQLPlanner unit 3 is configured for:
judging whether a pre-stored SQL statement corresponding to the SQL request exists in the shared cache pool; if so, outputting the execution plan corresponding to the pre-stored SQL statement; otherwise,
performing a syntax check on the SQL request; if there is a syntax error, returning error information to the user; otherwise,
performing a semantic check on the SQL request; if there is a semantic error, returning error information to the user; otherwise,
performing view and expression conversion on the SQL request to obtain a corresponding conversion result;
selecting an optimizer according to the conversion result to obtain a corresponding optimizer selection result;
selecting a corresponding data join method and join order according to the optimizer selection result;
selecting a search path according to the join method and the join order;
and generating an execution plan according to the search path, and outputting the execution plan.
Further, this embodiment also includes:
a monitor 8 for taking charge of metadata management, monitoring the load of the Regions of the Hbase unit 6, and reallocating the Regions through the coprocessor module 61 of the Hbase unit 6; the monitor 8 is connected with the master unit 2.
Further, the monitoring of the load of the Regions of the Hbase unit 6 and the reallocating of the Regions through the coprocessor module 61 of the Hbase unit 6 include:
receiving data distribution information of the Hbase unit 6, and receiving load information of the worker unit 4 in the master unit 2, wherein the load information comprises a load deviation value of the worker unit 4;
comparing the load deviation value of the worker unit 4 with a preset load deviation threshold, and if the load deviation value exceeds the threshold, triggering the Hbase unit 6 to redistribute the Regions between the server with the higher hit rate and the server with the lower hit rate;
acquiring the data volume of each Region, comparing the data volume of each Region with a preset data volume threshold, and if the data volume of a Region exceeds the threshold, triggering the Hbase unit 6 to split that Region into two Regions.
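The two threshold comparisons made by the monitor can be sketched in Python. The source does not say how the load deviation value is computed, so this sketch assumes it is the largest absolute deviation of any worker's load from the mean; the function names and the sample figures are hypothetical.

```python
def needs_rebalance(worker_loads, deviation_threshold):
    """Return True when the load deviation exceeds the preset threshold.

    Assumption: deviation = max |load - mean load| over all workers.
    """
    mean = sum(worker_loads.values()) / len(worker_loads)
    deviation = max(abs(load - mean) for load in worker_loads.values())
    return deviation > deviation_threshold

def regions_to_split(region_sizes, size_threshold):
    """Return the Regions whose data volume exceeds the preset threshold."""
    return [name for name, size in region_sizes.items() if size > size_threshold]

# Hypothetical figures: requests per second per worker, gigabytes per Region.
rebalance = needs_rebalance({"w1": 10, "w2": 10, "w3": 40}, deviation_threshold=15)
to_split = regions_to_split({"r1": 5, "r2": 12, "r3": 10}, size_threshold=10)
```

In the embodiment the corresponding actions (redistributing Regions and splitting a Region in two) are triggered in the Hbase unit; the sketch only covers the decision.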
Further, the JDBC/ODBC interface unit 1 includes:
the JDBC application program module 11, used for receiving a user request, calling JDBC object methods to issue an SQL statement, and extracting the result to return to the user;
a JDBC driver manager module 12, configured to load and call a JDBC driver module 13 for the JDBC application program module 11;
the JDBC driver module 13, configured to execute the calls to the JDBC object methods, send the SQL statement corresponding to the user request to the underlying database, and return the result obtained from the underlying database to the JDBC application program module 11.
In a specific implementation, a user request is first received through the JDBC/ODBC interface unit 1. The master unit 2 then takes over the user request received by the JDBC/ODBC interface unit 1, coordinates data communication among a plurality of processors, manages the whole process, and sends the user request first to the SQLPlanner unit 3. The SQLPlanner unit 3 parses the user request, and compiles and customizes an execution plan according to it. The worker unit 4 then executes the plan in parallel: the coprocessor module 61 of the Hbase unit 6 is started to obtain index data corresponding to the query condition of the user request, and a data table is queried according to the obtained index data so as to obtain the corresponding query result. A JSON data type is added at the bottom layer of the Hbase unit 6, and the JSON data is stored as a whole in the underlying HFile. Finally, the query result of the Hbase unit 6 is returned to the master unit 2, and the master unit 2 returns the query result containing the JSON data to the JDBC/ODBC interface unit 1 so that it is returned to the user.
The distributed NewSQL database system of this embodiment enables data query in JSON format and addresses the poor effectiveness and poor performance of processing semi-structured data.
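The division of labour between the master unit and the worker units can be sketched in Python with a thread pool. This is only an illustration under assumptions: each Region is modelled as a dictionary holding its own slice of the index table and data table, and `worker_execute` and `master_run` are hypothetical names, not components of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor

def worker_execute(region, path, value):
    """One worker: index lookup, then data-table fetch, within a single Region."""
    row_keys = sorted(region["index"].get((path, value), ()))
    return [region["data"][key] for key in row_keys]

def master_run(regions, path, value):
    """Master: dispatch the same plan to every Region in parallel and merge."""
    # Assumption: at least one Region exists, so max_workers is positive.
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = pool.map(lambda r: worker_execute(r, path, value), regions)
        return [doc for part in partials for doc in part]

regions = [
    {"index": {("city", "Paris"): {"r1"}},
     "data": {"r1": '{"city": "Paris"}'}},
    {"index": {("city", "Paris"): {"r7"}, ("city", "Rome"): {"r8"}},
     "data": {"r7": '{"city": "Paris", "n": 2}', "r8": '{"city": "Rome"}'}},
]
merged = master_run(regions, "city", "Paris")
```

Because `pool.map` yields results in the order of its input, the merged list follows the order of the Regions even though the lookups run concurrently.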
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.