CN113704290A - Data query system and method

Info

Publication number: CN113704290A
Application number: CN202111026975.8A
Authority: CN
Inventors: 马鹏飞; 邓靖
Original assignee: Hongqiao Hi Tech Group Co ltd
Current assignee: Hongqiao Hi Tech Group Co ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2021-11-26

Abstract

The invention discloses a data query system and a data query method. An SQL query request in a fixed format is sent to a Kylin database for querying, while an SQL query request in a user-defined format is queried step by step through a personal database and then a public database, which improves query efficiency and solves the prior-art problem that querying data directly from the Hive database is time-consuming. For SQL query requests that cannot be answered from the personal database or the public database, the corresponding data are read from the Hive database and stored in the personal database, so that the personal database is kept up to date and the user's subsequent queries become faster.

Description

Data query system and method

Technical Field

The invention belongs to the field of data query, and particularly relates to a data query system and a data query method.

Background

At present, for data warehouses built on Hive (a data warehouse management system), the industry provides numerous query engine tools, such as Presto (a distributed query engine) and Kylin, but each emphasizes a particular aspect. For example, Kylin has a powerful pre-computation framework and largely solves the problem of slow fixed queries, especially analytical and statistical ones. Presto can speed up Hive queries from the minute level to the second level, but it cannot match Kylin for specific analytical queries; it is efficient for relational-database-style queries over small data volumes, but can be very time-consuming when the data volume is particularly large.

Disclosure of Invention

To address the above defects in the prior art, the invention provides a data query system that solves the prior-art problem that queries take a long time.

In order to achieve the purpose of the invention, the invention adopts the following technical scheme: a data query system comprises a query interface module, an SQL parsing module, an SQL forwarding module, a first query module, a second query module, a Kylin database, a Hive database, a personal database and a public database;

the query interface module is connected to the SQL parsing module and is used for receiving and outputting an SQL query request;

the SQL parsing module is connected to the SQL forwarding module and is used for receiving and parsing the SQL query request sent by the query interface module, and for checking the security of the SQL query request so as to intercept illegal SQL query requests;

the SQL forwarding module is connected to the first query module and to the second query module, and is used for forwarding the SQL query request to the first query module or the second query module;

the first query module is connected to the Kylin database and is used for executing an SQL query request, reading the corresponding data from the Kylin database and visualizing the read data;

the Kylin database is used for storing queryable data;

the second query module is connected to the personal database, the public database and the Hive database, and is used for executing SQL query requests, reading the corresponding data from the personal database, the public database or the Hive database, and visualizing the read data;

the personal database is used for storing historical query data of a single user;

the public database is used for storing historical query data of a plurality of users;

the Hive database is used for storing queryable data.

Further, the SQL query request executed by the first query module is an SQL query request in a fixed format.

Further, the SQL query request executed by the second query module is an SQL query request in a custom format.

Further, the second query module may be further configured to add, delete, and/or modify data in the personal database, the public database, and the Hive database.

The invention has the beneficial effects that:

(1) The invention provides a data query system that automatically distributes and executes SQL query requests and constructs a plurality of databases, thereby improving query efficiency.

(2) The invention includes an SQL forwarding module that automatically distributes SQL query requests, so that both fixed-format and user-defined-format SQL query requests can be queried.

A method of data query, comprising:

acquiring an SQL query request, and parsing the SQL query request to obtain an SQL parsing result, wherein the SQL parsing result classifies the request as either an illegal SQL query request or a safe SQL query request;

judging whether the SQL query request is a safe SQL query request or not according to the SQL analysis result, if so, executing the SQL query request, otherwise, intercepting the SQL query request, and ending the data query process;

when the SQL query request is executed, judging whether the SQL query request is a query request corresponding to the Kylin database; if so, reading the corresponding data from the Kylin database and visualizing the read data to complete the data query; otherwise, retrieving data corresponding to the SQL query request from the personal database;

judging whether the data corresponding to the SQL query request is retrieved from the personal database, if so, visualizing the retrieved data to complete the data query, otherwise, retrieving the data corresponding to the SQL query request from the public database;

and judging whether the data corresponding to the SQL query request is retrieved from the public database, if so, visualizing the retrieved data to finish data query, otherwise, reading the data corresponding to the SQL query request from the Hive database, storing the read data into the personal database and visualizing the read data to finish data query.
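The steps above amount to a tiered lookup with write-back. The following is a minimal sketch of that control flow, assuming each database is modelled as an in-memory dictionary keyed by the SQL text; the function name `run_query` and the two predicate arguments are hypothetical and are not taken from the invention.

```python
from typing import Callable, Dict, List, Optional

Rows = List[object]  # query results are modelled as plain lists of rows


def run_query(
    sql: str,
    is_safe: Callable[[str], bool],
    is_kylin_request: Callable[[str], bool],
    kylin: Dict[str, Rows],
    personal: Dict[str, Rows],
    public: Dict[str, Rows],
    hive: Dict[str, Rows],
) -> Optional[Rows]:
    """Return the rows for `sql`, or None if the request is intercepted."""
    if not is_safe(sql):
        return None  # illegal request: intercept and end the process
    if is_kylin_request(sql):
        return kylin.get(sql, [])  # fixed-format request goes to Kylin
    if sql in personal:
        return personal[sql]  # first tier: the user's own history
    if sql in public:
        return public[sql]  # second tier: history shared by several users
    rows = hive.get(sql, [])  # slow path: read from the Hive database
    personal[sql] = rows  # write back so the next lookup hits the first tier
    return rows
```

In this sketch a repeated request that initially fell through to the Hive dictionary is served from the personal dictionary on the second call, which is the effect the write-back step is meant to achieve.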

Further, the personal database and the public database are both PostgreSQL databases.

Further, reading the corresponding data from the Kylin database and visualizing the read data comprises:

constructing a query task in the Kylin database according to the SQL query request;

according to the query task, querying data corresponding to the SQL query request in the Kylin database through an HBase query instruction;

and reading and visualizing the data corresponding to the SQL query request.

Further, retrieving the data corresponding to the SQL query request from the public database comprises:

removing from the SQL query request the part whose data has already been obtained from the personal database, to obtain a reduced SQL query request;

and searching the public database for the corresponding data according to the reduced SQL query request.

The invention has the beneficial effects that:

(1) The invention sends SQL query requests in a fixed format to the Kylin database for querying, and queries SQL query requests in a user-defined format step by step through the personal database and the public database, thereby improving query efficiency and solving the prior-art problem that querying data from the Hive database is time-consuming.

(2) For SQL query requests that cannot be answered from the personal database or the public database, the corresponding data are read from the Hive database and stored in the personal database, so that the personal database is kept up to date and the user's subsequent queries become faster.

(3) The invention can quickly answer both user-defined queries and classified retrieval queries, achieves second-level response to query requests, and improves the user's working efficiency.

Drawings

Fig. 1 is a schematic diagram of a data query system according to an embodiment of the present invention.

Fig. 2 is a flowchart of a data query method according to an embodiment of the present invention.

Detailed Description

The following description of embodiments of the present invention is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are apparent as long as they remain within the spirit and scope of the invention as defined in the appended claims, and everything created using the inventive concept is protected.

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

First, terms referred to in the present application will be explained.

Hive database: Hive is a data warehouse tool based on Hadoop that maps structured data files to database tables and provides SQL-like query functions. A Hive cluster can be scaled freely, generally without restarting the service. Hive supports user-defined functions, which users can implement according to their own needs, and it has good fault tolerance: SQL can still be executed when a node fails.

Kylin database: Kylin is an open-source distributed analysis engine that provides an SQL query interface and multi-dimensional analysis (OLAP) capability on Hadoop/Spark for very large data sets, and can query huge Hive tables with sub-second latency.

SQL: SQL is the abbreviation of Structured Query Language, a database query and programming language for accessing data and for querying, updating and managing relational database systems; .sql is also the file extension of database script files.

PostgreSQL database: PostgreSQL is a free object-relational database server (database management system). Its server-side functions can be written in multiple languages, so complex business-logic computation and large data accesses can be carried out entirely inside the database, which greatly reduces network interaction cost and improves overall application performance.

Example 1

As shown in fig. 1, a data query system includes a query interface module, an SQL parsing module, an SQL forwarding module, a first query module, a second query module, a Kylin database, a Hive database, a personal database, and a public database.

The query interface module is connected to the SQL parsing module and is used for receiving and outputting an SQL query request. The SQL parsing module is connected to the SQL forwarding module and is used for receiving and parsing the SQL query request sent by the query interface module, and for checking the security of the SQL query request so as to intercept illegal SQL query requests. The SQL forwarding module is connected to the first query module and to the second query module, and is used for forwarding the SQL query request to the first query module or the second query module. The first query module is connected to the Kylin database and is used for executing the SQL query request, reading the corresponding data from the Kylin database and visualizing the read data. The Kylin database is used for storing queryable data. The second query module is connected to the personal database, the public database and the Hive database, and is used for executing SQL query requests, reading the corresponding data from the personal database, the public database or the Hive database, and visualizing the read data. The personal database is used for storing historical query data of a single user. The public database is used for storing historical query data of a plurality of users. The Hive database is used for storing queryable data.

In this embodiment, the query interface module, the SQL parsing module, the SQL forwarding module, the first query module, the second query module, the Kylin database, the Hive database, the personal database, and the public database may be implemented by software, or by a combination of software and hardware.

In this embodiment, the personal database and the public database may be constructed by using a PostgreSQL database, and when the SQL parsing module performs SQL parsing, the SQL92 standard may be used to perform format verification on the SQL query request.

When the SQL query request reaches the SQL forwarding module, the SQL forwarding module judges whether the SQL query request corresponds to Kylin; if so, the SQL query request is forwarded to the first query module; otherwise, it is forwarded to the second query module.

The SQL query request executed by the first query module is an SQL query request in a fixed format, and the SQL query request executed by the second query module is an SQL query request in a custom format. A fixed-format SQL query request is used for querying data in the Kylin database, and a custom-format SQL query request is used for querying data in the personal database, the public database and/or the Hive database.

The second query module can also be used for adding, deleting and/or modifying data in the personal database, the public database and the Hive database.

The invention provides a data query system that automatically distributes and executes SQL query requests and constructs a plurality of databases, thereby improving query efficiency. The SQL forwarding module automatically distributes SQL query requests, so that both fixed-format and user-defined-format SQL query requests can be queried.

Example 2

As shown in fig. 2, a data query method, performed based on the system of embodiment 1, includes:

Acquiring the SQL query request, and parsing the SQL query request to obtain an SQL parsing result, wherein the SQL parsing result classifies the request as either an illegal SQL query request or a safe SQL query request.

And judging whether the SQL query request is a safe SQL query request or not according to the SQL analysis result, if so, executing the SQL query request, otherwise, intercepting the SQL query request, and ending the data query process.

When the SQL query request is executed, judging whether the SQL query request is a query request corresponding to the Kylin database; if so, reading the corresponding data from the Kylin database and visualizing the read data to complete the data query; otherwise, retrieving data corresponding to the SQL query request from the personal database.

And judging whether the data corresponding to the SQL query request is retrieved from the personal database, if so, visualizing the retrieved data to complete the data query, and otherwise, retrieving the data corresponding to the SQL query request from the public database.

The execution subject of the embodiment of the application may be a data query system, and the data query system may be implemented by software, or by a combination of software and hardware.

In this embodiment, the data corresponding to the SQL query request refers to data matched with the SQL query request in a certain database.

The personal database and the public database are both PostgreSQL databases.

Reading the corresponding data from the Kylin database and visualizing the read data comprises: constructing a query task in the Kylin database according to the SQL query request; according to the query task, querying data corresponding to the SQL query request in the Kylin database through an HBase query instruction; and reading and visualizing the data corresponding to the SQL query request. The Kylin database is built on HBase (Hadoop Database, a distributed storage system), so an HBase query instruction is required when data in the Kylin database is queried.

Retrieving data corresponding to the SQL query request from the public database comprises: removing from the SQL query request the part whose data has already been obtained from the personal database, to obtain a reduced SQL query request; and searching the public database for the corresponding data according to the reduced SQL query request.

The method sends SQL query requests in a fixed format to the Kylin database for querying, and queries SQL query requests in a user-defined format through the personal database and the public database, thereby improving query efficiency and solving the time-consumption problem of the prior art. The data corresponding to the SQL query request that are read from the Hive database are stored in the personal database, so that the personal database is kept up to date and the user's query efficiency is improved.

Example 3

The present embodiment provides another data query method, including:

A. a personal database and a public database are constructed.

The data query method provided by this embodiment is suitable for a user in a specific post (job position), and the Kylin database and the Hive database store all data that the user may use. A personal database and a public database are each constructed using a PostgreSQL database. Using a user behavior analysis algorithm and data pre-reading, the data related to the user's post and the user's query history data are cached into the personal database, and an index is built; the query history data from the last N days of other users in the same post as the user are cached into the public database, and an index is built.
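The selection of public-database content described above can be sketched as a filter over query history. This is only an illustration under stated assumptions: the `QueryRecord` fields and the function `tables_for_public_cache` are hypothetical, and history is assumed to be recorded per table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class QueryRecord:
    user: str
    post: str
    table: str
    queried_at: datetime


def tables_for_public_cache(
    history: List[QueryRecord],
    user: str,
    post: str,
    n_days: int,
    now: datetime,
) -> List[str]:
    """Tables queried in the last n_days by other users in the same post."""
    cutoff = now - timedelta(days=n_days)
    tables: List[str] = []
    for rec in history:
        same_post_colleague = rec.post == post and rec.user != user
        if same_post_colleague and rec.queried_at >= cutoff:
            if rec.table not in tables:  # keep first-seen order, no duplicates
                tables.append(rec.table)
    return tables
```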

In one possible implementation, user behavior can be analyzed with a logistic regression algorithm and a decision tree algorithm to recommend data that the user uses frequently and data that the user may use (data related to the user's post), and the recommended data are then preloaded into the personal database using the Python-based Surprise and scikit-learn libraries.

B. And acquiring the SQL query request of the user.

The user's SQL query request may be a custom query or a classified retrieval query. A custom query is an SQL query request in a user-defined format, in which the user defines SQL statements that follow no fixed pattern. A classified retrieval query is a regular query performed with a fixed-format SQL query request; it is commonly used to query the statistical charts of an analysis system and the statistical charts of a user's specific business, and the fixed-format SQL query request carries a Kylin identifier. When an SQL query request carrying the Kylin identifier is executed, the corresponding data are retrieved from the Kylin database.

C. And analyzing the SQL query request to intercept the illegal SQL query request.

Format verification is performed on the acquired SQL query request using the SQL92 standard. An SQL query request that fails format verification is judged to be an illegal SQL query request and is intercepted; an SQL query request that passes verification is judged to be a safe SQL query request and is executed.
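The interception step can be illustrated with a very rough structural check. The sketch below is not an SQL92 validator; it is a hypothetical stand-in (`passes_format_check` is an invented name) that only rejects empty requests, stacked statements, and unbalanced parentheses or quotes, to show where a real verifier would sit.

```python
def passes_format_check(sql: str) -> bool:
    """Rough stand-in for format verification; NOT a full SQL92 check."""
    body = sql.strip()
    if body.endswith(";"):
        body = body[:-1]  # allow one trailing statement terminator
    if not body.strip():
        return False  # empty request
    depth = 0
    in_string = False
    for ch in body:
        if ch == "'":
            in_string = not in_string  # '' inside a literal toggles twice
        elif in_string:
            continue  # ignore everything inside string literals
        elif ch == ";":
            return False  # a second statement is stacked on the first
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis without an opener
    return depth == 0 and not in_string
```

A production system would more likely delegate this step to a real SQL parser; the point here is only that requests failing the check are intercepted before any database is touched.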

D. Judging whether the SQL query request is a query request corresponding to the Kylin database; if so, retrieving data corresponding to the SQL query request in the Kylin database and visualizing the retrieved data; otherwise, retrieving data corresponding to the SQL query request in the personal database.

Whether the SQL query request corresponds to the Kylin database can be judged by checking whether the SQL query request carries the Kylin identifier.
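The description does not state how the identifier is encoded. As a minimal sketch, assuming purely for illustration that it is a leading comment `/* kylin */`, the forwarding decision could look like this (the marker and the returned module names are hypothetical):

```python
KYLIN_MARKER = "/* kylin */"  # hypothetical encoding of the identifier


def route(sql: str) -> str:
    """Pick the query module for a request based on the assumed marker."""
    if sql.lstrip().lower().startswith(KYLIN_MARKER):
        return "first_query_module"  # fixed-format request, served by Kylin
    return "second_query_module"  # custom request, served by the tiered stores
```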

In one possible implementation, retrieving data corresponding to the SQL query request in the Kylin database may include: checking whether the query list of the Kylin database includes the SQL query request; if so, querying the data corresponding to the SQL query request in the Kylin database through an HBase query instruction, reading the corresponding data and visualizing it; otherwise, constructing a query task, querying the data corresponding to the SQL query request in the Kylin database through an HBase query instruction, and reading and visualizing that data.

Because the SQL syntax of the query engine of the PostgreSQL database differs from standard SQL syntax, before data corresponding to the SQL query request is retrieved from the personal database, it may be judged whether the SQL query request is one defined by the query engine of the PostgreSQL database. If so, the SQL query request conforms to the SQL syntax defined by the PostgreSQL database and is executed directly; otherwise, the SQL query request does not conform to that syntax, so it is syntax-translated into an SQL query request defined by the query engine of the PostgreSQL database, and the translated SQL query request is executed.

Optionally, executing the SQL query request may include: and matching the corresponding directory index in the retrieval directory of the database according to the SQL query request, and executing the SQL query request and visualizing the query result if the corresponding directory index exists.

Performing syntax translation on the SQL query request may include: defining an SQL mapping table in the query engine of the PostgreSQL database, wherein the SQL mapping table records the correspondence between standard SQL syntax and the SQL syntax of the PostgreSQL database; and converting the SQL query request into the SQL corresponding to the PostgreSQL database through the SQL mapping table, which completes the syntax translation.
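A mapping-table translation can be sketched as a list of pattern-to-replacement rules applied in order. The two entries below are illustrative assumptions only (they rewrite the non-PostgreSQL functions `NVL` and `IFNULL` to `COALESCE`); the actual contents of the mapping table are not given in the description, and `SQL_MAPPING` and `translate` are hypothetical names.

```python
import re
from typing import Dict

# Hypothetical mapping table: source-dialect pattern -> PostgreSQL form.
SQL_MAPPING: Dict[str, str] = {
    r"\bNVL\s*\(": "COALESCE(",
    r"\bIFNULL\s*\(": "COALESCE(",
}


def translate(sql: str, mapping: Dict[str, str] = SQL_MAPPING) -> str:
    """Apply every mapping rule to the request text, case-insensitively."""
    for pattern, replacement in mapping.items():
        sql = re.sub(pattern, replacement, sql, flags=re.IGNORECASE)
    return sql
```

Because the rules operate on raw text, they would also rewrite matches inside string literals; a more careful translator would apply the mapping to a parsed statement instead.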

In a possible implementation manner, when a user queries data in the personal database each time, the query record of the user is stored, and the content corresponding to the SQL query request for querying more than M times can be cached in advance, so as to realize quick query of the user common content.
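The threshold M can be tracked with a simple per-request counter. The sketch below assumes requests are compared after whitespace normalization only; the class name `QueryFrequencyTracker` and its interface are hypothetical.

```python
from collections import Counter
from typing import Set


class QueryFrequencyTracker:
    """Counts queries and flags those issued more than m times."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.counts: Counter = Counter()
        self.precached: Set[str] = set()

    def record(self, sql: str) -> bool:
        """Record one query; return True when it newly qualifies for caching."""
        key = " ".join(sql.split())  # collapse whitespace differences
        self.counts[key] += 1
        if self.counts[key] > self.m and key not in self.precached:
            self.precached.add(key)
            return True
        return False
```

The caller would react to a `True` result by loading the corresponding content into the cache ahead of the next request.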

E. And judging whether the data corresponding to the SQL query request is retrieved from the personal database, if so, reading the corresponding data and visualizing the read data, otherwise, retrieving the data corresponding to the SQL query request from the public database.

In one possible implementation, retrieving data corresponding to the SQL query request in the public database may include: parsing the SQL query request a second time and removing the part for which a query result has already been obtained from the personal database, to obtain a reduced SQL query request; and searching the search directory of the public database for corresponding data according to the reduced SQL query request; if such data exist, performing syntax translation on the SQL query request, translating it into the SQL corresponding to the public database and executing it; otherwise, retrieving the data corresponding to the SQL query request from the Hive database.

Optionally, when searching the public database, the corresponding table or data is matched in the search directory of the public database according to the SQL query request, and the SQL translation and execution steps are performed for an SQL query request that is matched successfully. Because different databases may follow different syntax standards, when a query crosses from the personal database to the public database, the SQL query may be translated into the SQL corresponding to the public database so that the corresponding data can be read from the public database. The SQL syntax translation may use the mapping-table method.

Optionally, when users in the same post have queried the same table more than L times, the table is cached from the Hive database into the public database.
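The removal of the already-answered part can be sketched at table granularity. This assumes, for illustration only, that a request has been reduced to the list of tables it needs and that each store exposes the set of tables it holds; `split_by_store` is a hypothetical helper.

```python
from typing import Iterable, List, Set, Tuple


def split_by_store(
    requested_tables: Iterable[str],
    personal_tables: Set[str],
    public_directory: Set[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Assign each requested table to the first store that can serve it."""
    requested = list(requested_tables)
    from_personal = [t for t in requested if t in personal_tables]
    # What is left after removing the part already served by the personal store.
    remaining = [t for t in requested if t not in personal_tables]
    from_public = [t for t in remaining if t in public_directory]
    from_hive = [t for t in remaining if t not in public_directory]
    return from_personal, from_public, from_hive
```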

F. Judging whether the data corresponding to the SQL query request is retrieved from the public database; if so, reading the corresponding data and visualizing the read data; otherwise, constructing a missing-table query according to the SQL query request, querying the corresponding data in the Hive database, and caching the data queried from the Hive database into the personal database.

In one possible implementation, constructing a missing-table query according to the SQL query request and querying the Hive database for the corresponding data includes: constructing the missing-table query according to the SQL query request; using the query engine to construct an HQL (Hive Query Language) statement for the Hive database; generating a scheduling task at the L2 level; and querying the data corresponding to the missing table in the Hive database through the HQL statement.

The query engine divides scheduling into three levels by default: L1, L2 and L3. Scheduling tasks of different levels are dynamically started or suspended according to the real-time usage of CPU, memory and IO resources. For example, when CPU usage is below 30% and memory usage is below 50%, L2-level scheduling is started; when CPU usage is above 70% and memory usage is above 70%, L2-level scheduling is automatically suspended to guarantee L1-level scheduling.
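The example thresholds can be expressed as a small decision function. The sketch assumes that both conditions in each rule must hold at the same time, which the wording suggests but does not state outright, and it ignores IO usage; `l2_action` is a hypothetical name.

```python
def l2_action(cpu_pct: float, mem_pct: float, l2_running: bool) -> str:
    """Decide what to do with L2-level scheduling at the current load."""
    if l2_running and cpu_pct > 70 and mem_pct > 70:
        return "suspend"  # protect L1-level scheduling under heavy load
    if not l2_running and cpu_pct < 30 and mem_pct < 50:
        return "start"  # enough headroom for L2-level scheduling
    return "keep"  # between the thresholds: leave the current state alone
```

The gap between the start and suspend thresholds acts as hysteresis, so L2-level scheduling does not flap on and off when the load hovers around a single value.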

The method provides a personal database and a public database. Using a user behavior analysis algorithm, database tables that the user is likely to use are cached from the underlying Hive database into the personal database (by default, data from the last 3 years are synchronized), and an index is built; the user behavior analysis algorithm may combine factors such as the user's post, colleagues in the same post, recent hot events related to the user and the user's daily habits, and infer the tables with a machine learning algorithm. The historical data tables of users in the same post within the last N days are synchronized into the public database. The user's SQL query request is queried step by step, so that it is answered as early as possible in the personal database or the public database, or is handled by the query engine of the Kylin database at the pre-computation level. This minimizes the chance that the user queries the Hive database directly; only data that cannot be found in the personal database or the public database are queried and extracted from the Hive database. As a result, more than 90% of user-defined queries achieve second-level response, and even if the first query is slow, subsequent queries become progressively faster. Query statements in a fixed format are distributed by the system to the Kylin data warehouse computing framework for pre-computation, which improves query response speed, so that more than 98% of analytical retrieval queries achieve second-level response.

The embodiment of the application also provides a data query device, which comprises a processor and a memory, wherein the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory, causing the processor to perform any of the data query methods described above.

The embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement any one of the data query methods described above.

Claims

1. A data query system, characterized by comprising a query interface module, an SQL parsing module, an SQL forwarding module, a first query module, a second query module, a Kylin database, a Hive database, a personal database and a public database;

2. The data query system of claim 1, wherein the SQL query request executed by the first query module is a fixed-format SQL query request.

3. The data query system of claim 1, wherein the SQL query request executed by the second query module is a custom formatted SQL query request.

4. The data query system of claim 1, wherein the second query module is further configured to add, delete and/or modify data in the personal database, the public database and the Hive database.

5. A method for querying data, comprising:

6. The method of claim 5, wherein the personal database and the public database are PostgreSQL databases.

7. The data query method of claim 5, wherein reading the corresponding data from the Kylin database and visualizing the read data comprises:

and reading and visualizing the data corresponding to the SQL query request.

8. The data query method of claim 7, wherein the retrieving the data corresponding to the SQL query request from the public database comprises: