CN114238370A - Method and system for applying NER entity recognition algorithm in report query - Google Patents

Method and system for applying NER entity recognition algorithm in report query

Info

Publication number
CN114238370A
Authority
CN
China
Prior art keywords
result
query
data query
model
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111493638.XA
Other languages
Chinese (zh)
Inventor
王恩典
陈浩
侯乐
钟蔚伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd
Priority to CN202111493638.XA
Publication of CN114238370A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/243Natural language query formulation
    • G06F16/245Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for applying an NER (named entity recognition) algorithm in report query. The method comprises: obtaining input language information; inputting the input language information into an NER entity recognition model to obtain key fields; obtaining a data query statement according to the key fields; and executing the data query statement to query indexes in a database system, obtaining a data query result, and sending the data query result to the front end for display. The invention addresses the technical problems in the prior art that writing SQL query statements for report queries has a high threshold, that many business personnel lack the necessary professional skills, and that the application of big data is therefore limited by the technical bottleneck of business personnel.

Description

Method and system for applying NER entity recognition algorithm in report query
Technical Field
The invention relates to the field of database technology, and in particular to a method and a system for applying an NER entity recognition algorithm in report query.
Background
A database stores a large amount of production and operation data of individuals or enterprises, who interact with it every day. Generally, querying the data in a database requires a query language such as SQL (Structured Query Language), so the operation has to be performed by a technician who understands SQL.
In the prior art, to query data in a database, technicians write SQL statements to query data fields and perform operations such as report index comparison, sorting, and index processing. As businesses evolve, self-service report query and exploration BI (Business Intelligence) tools are increasingly adopted by enterprises.
However, in the course of implementing the technical solution of the embodiments of the present application, the inventors found that the above technology has at least the following technical problems:
In the prior art, writing SQL query statements for report queries has a high threshold, many business personnel lack the necessary professional skills, and the application of big data is limited by the technical bottleneck of business personnel. Existing intelligent data modeling technology also sets a high threshold for most queries, is unfriendly to non-professional users, and cannot intelligently process and query natural language.
Disclosure of Invention
The embodiments of the present application provide a method and a system for applying an NER (Named Entity Recognition) algorithm in report query, aiming to solve the technical problems that, in the prior art, writing SQL query statements has a high threshold, existing intelligent data modeling technology sets a high threshold for most queries and is unfriendly to non-professional users, and natural language cannot be intelligently processed and queried.
In view of the foregoing problems, embodiments of the present application provide a method and a system for applying an NER entity identification algorithm to report query.
In a first aspect of the embodiments of the present application, a method for applying an NER entity recognition algorithm to report query is provided, where the method includes: obtaining input language information; inputting the input language information into an NER entity recognition model to obtain a key field; obtaining a data query statement according to the key field; and executing the data query statement to query indexes in a database system, obtaining a data query result, and sending the data query result to the front end for displaying.
In a second aspect of the embodiments of the present application, an application system of the NER entity recognition algorithm in report query is provided, where the system includes: a first obtaining unit configured to obtain input language information; the first processing unit is used for inputting the input language information into an NER entity recognition model to obtain a key field; the second processing unit is used for obtaining a data query statement according to the key field; and the third processing unit is used for executing the index query of the data query statement in a database system, obtaining a data query result and sending the data query result to the front end for display.
In a third aspect of the embodiments of the present application, an application system of an NER entity recognition algorithm in report query is provided, including: a processor coupled to a memory for storing a program that, when executed by the processor, causes a system to perform the steps of the method according to the first aspect.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
The method is based on an NER entity recognition algorithm: a search algorithm uses the recognized keywords to search the database tables for indexes, which completes the "index finding" process. Compared with the prior art, which uses only an NL2SQL (Natural Language to SQL) model that computes the similarity between the natural language question and the table fields to complete the "table finding" process, the present application uses NER entity recognition as the preferred search scheme. This optimizes execution speed and compensates for the results that similarity calculation gets wrong with some probability. Technicians do not need to write SQL query statements, the threshold of data query is lowered, and natural language can be intelligently processed and queried more accurately.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are described below.
Drawings
Fig. 1 is a schematic flowchart of an application method of an NER entity identification algorithm in report query according to an embodiment of the present application;
Fig. 2 is a flowchart of an application method of an NER entity identification algorithm in report query according to an embodiment of the present disclosure;
Fig. 3 is a block diagram of a key field obtained by an NER entity recognition algorithm in an application method in report query according to an embodiment of the present application;
Fig. 4 is a block diagram illustrating standardization of key fields in an application method of an NER entity recognition algorithm in report query according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an application system of the NER entity recognition algorithm in report query according to the embodiment of the present application;
Fig. 6 is a schematic structural diagram of an exemplary electronic device according to an embodiment of the present application.
Description of reference numerals: a first obtaining unit 11, a first processing unit 12, a second processing unit 13, a third processing unit 14, an electronic device 300, a memory 301, a processor 302, a communication interface 303, and a bus architecture 304.
Detailed Description
The embodiments of the present application provide a method and a system for applying an NER entity recognition algorithm in report query, aiming to solve the problems that, in the prior art, writing SQL query statements for report queries has a high threshold, many business personnel lack the necessary professional skills, and the application of big data is limited by the technical bottleneck of business personnel. Existing intelligent data modeling technology also sets a high threshold for most queries, is unfriendly to non-professional users, and cannot intelligently process and query natural language.
The method first corrects and standardizes the natural-language input. The processed input is then passed to a fine-tuned NER entity recognition model, which delimits the entity text in the input and yields the key fields, and a data query statement is generated from the key fields. For entity text that cannot be resolved directly, an SQL algorithm is generated through a rule model to obtain an SQL rule model; the key fields enter the SQL rule model for calculation, which generates an SQL query statement for the query. If the SQL rule model cannot produce a result, the input enters an NL2SQL model for prediction, which generates the SQL query statement; if the probability of the result predicted by the NL2SQL model is below a certain threshold, a no-query result is returned. Whereas a traditional question-answering system classifies language intent through intention recognition, the method fuses several natural language processing technologies: a text error correction model, an entity recognition model, an SQL generation rule model, and a natural-language-to-SQL conversion model. This fusion improves the robustness of the system and yields a more accurate understanding of the language description, so that database query statements are generated accurately and accurate data query and analysis results are obtained.
Compared with the prior art, which uses only an NL2SQL model that computes the similarity between the natural language question and the table fields to complete the "table finding" process, the present application uses NER entity recognition as the preferred search scheme. This optimizes execution speed and compensates for the results that similarity calculation gets wrong with some probability. Technicians do not need to write SQL query statements, the threshold of data query is lowered, and natural language can be intelligently processed and queried more accurately.
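The fallback order described above (index search first, then the SQL rule model, then NL2SQL gated by a probability threshold) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `answer_query`, the callables passed in, the result dictionaries, and the default threshold of 0.5 are all assumptions, since the text only speaks of "a certain threshold".

```python
# Hypothetical sketch of the three-stage fallback; every name here is illustrative.
NO_RESULT = {"status": "no_result", "rows": []}


def answer_query(text, fields, index_search, rule_model, nl2sql, run_sql, threshold=0.5):
    """Try the cheapest stage first and fall back to the next one on failure."""
    # Stage 1: direct index search using the key fields found by NER.
    rows = index_search(fields)
    if rows:
        return {"status": "ok", "source": "index_search", "rows": rows}

    # Stage 2: rule-based SQL generation from the key fields.
    sql = rule_model(fields)
    if sql is not None:
        return {"status": "ok", "source": "sql_rule_model", "rows": run_sql(sql)}

    # Stage 3: NL2SQL prediction, accepted only above the (assumed) threshold.
    sql, probability = nl2sql(text)
    if probability >= threshold:
        return {"status": "ok", "source": "nl2sql", "rows": run_sql(sql)}
    return dict(NO_RESULT)
```

Passing the stages in as callables keeps the sketch independent of any particular search engine, rule set, or NL2SQL model.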
Summary of the application
A database stores a large amount of production and operation data of individuals or enterprises, who interact with it every day. Typically, querying the data in a database requires a query language such as SQL, so the operation has to be performed by a technician who understands SQL. To let non-professional users query the database as required, currently popular technical schemes provide a dedicated interface based on condition screening, in which the user queries the database by clicking different conditions. Another conventional approach is for business personnel to write SQL statements that query data fields in the database and perform operations such as report index comparison, sorting, and index processing. To facilitate data query and related business development, self-service report query and exploration BI tools are increasingly adopted by enterprises.
However, in the prior art, writing SQL query statements for report queries has a high threshold, many business personnel lack the necessary professional skills, and the application of big data is limited by the technical bottleneck of business personnel. Existing intelligent data modeling technology also sets a high threshold for most queries, is unfriendly to non-professional users, and cannot intelligently process and query natural language.
In view of the above technical problems, the technical solution provided by the present application has the following general idea:
the embodiment of the application provides an application method of an NER entity identification algorithm in report query, which comprises the following steps: obtaining input language information; inputting the input language information into an NER entity recognition model to obtain a key field; obtaining a data query statement according to the key field; and executing the data query statement to query indexes in a database system, obtaining a data query result, and sending the data query result to the front end for displaying.
Having described the basic principles of the present application, the following embodiments will be described in detail and fully with reference to the accompanying drawings, it being understood that the embodiments described are only some embodiments of the present application, and not all embodiments of the present application, and that the present application is not limited to the exemplary embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. It should be further noted that, for the convenience of description, only some but not all of the elements relevant to the present application are shown in the drawings.
Example one
As shown in fig. 1, an embodiment of the present application provides an application method of an NER entity identification algorithm in report query, where the method includes:
S100: obtaining input language information;
Specifically, the input language information is the language information entered when querying any database. It may be language information containing query indexes or keywords, entered through any input device, including but not limited to an Automatic Speech Recognition (ASR) interface, a keyboard, and the like. For example, during a query the user can speak the query, and the ASR interface converts the speech into text that contains the query indexes or keywords.
By way of example and not limitation, a business user who needs to make a query can say into the microphone "show me the increase in newly added card issuance volume of the Shenzhen sub-center at the beginning of last month compared with the same period last year"; the ASR interface converts the speech into the corresponding text, which yields the input language information.
S200: inputting the input language information into an NER entity recognition model to obtain a key field;
Specifically, as shown in Fig. 2, a key field is a field in the input language information that carries a query index or keyword; data related to the key field can be obtained by querying the database according to the key field.
Named Entity Recognition (NER) is an important basic tool in application fields such as information extraction, question-answering systems, syntactic analysis, and machine translation, and plays an important role in putting natural language processing technology into practical use. NER is widely applied and can identify key fields in a text. In general, the NER task is to identify three major classes and seven minor classes of entities in the text to be processed. The three major classes are the entity class, the time class, and the numeric class; the seven minor classes are person name, organization name, place name, time, date, currency, and percentage.
For example, in the embodiment of the present application, the categories of key fields are: the index name (index_name), the location or organization name (destination_name), the query time (index_date), the comparison time (compare_date), and the index comparison method (compare_statistics). If the input language information is "show me the increase in newly added card issuance volume of the Shenzhen sub-center at the beginning of last month compared with the same period last year", the key fields obtained by the NER entity recognition model are:
"index_name": "newly added card issuance volume";
"destination_name": "Shenzhen sub-center";
"index_date": "beginning of last month";
"compare_date": "same period last year";
"compare_statistics": "increase".
The five key fields can indicate the query indexes in the input language information, and the query result can be obtained in the corresponding database based on the query indexes. In practical applications, if a corresponding key field is missing, the key field is empty.
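One way to hold the five key fields, with missing fields left empty as the text describes, is a small record type. The sketch below is an assumption about representation only: the class name `KeyFields` and the `missing` helper are hypothetical, and the example values are an English rendering of the query above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class KeyFields:
    """The five key-field categories named in the text; an unfilled field stays None."""
    index_name: Optional[str] = None
    destination_name: Optional[str] = None
    index_date: Optional[str] = None
    compare_date: Optional[str] = None
    compare_statistics: Optional[str] = None

    def missing(self):
        """Return the names of the fields that the NER model did not fill."""
        return [name for name, value in asdict(self).items() if value is None]


# Illustrative values only; a query without a comparison leaves two fields empty.
fields = KeyFields(
    index_name="newly added card issuance volume",
    destination_name="Shenzhen sub-center",
    index_date="beginning of last month",
)
```

Downstream steps can call `fields.missing()` to decide whether a direct index lookup is possible or a comparison clause has to be skipped.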
S300: obtaining a data query statement according to the key field;
Specifically, the data query statement is a query statement assembled from the key fields obtained by the NER entity recognition model. It is machine-readable and can query and retrieve the relevant data in the corresponding database.
In practice, the key fields must first be converted into entity fields that a computer can interpret. For example, the key field "beginning of last month" cannot be interpreted directly; it has to be converted into an exact year, month, and day, or into a time interval, before being assembled into a machine-readable data query statement that can be run against the corresponding database.
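The conversion of a spoken time expression into an exact date interval might look like the sketch below. The function name `resolve_time`, the English phrases, and in particular the reading of "beginning of last month" as the first ten days of the month are assumptions made for illustration; the text does not define the interval.

```python
from datetime import date, timedelta


def resolve_time(phrase, today):
    """Map a spoken time phrase to an inclusive (start, end) pair of YYYY-MM-DD strings."""
    first_this_month = today.replace(day=1)
    last_prev_month = first_this_month - timedelta(days=1)
    first_prev_month = last_prev_month.replace(day=1)

    if phrase == "yesterday":
        start = end = today - timedelta(days=1)
    elif phrase == "last month":
        start, end = first_prev_month, last_prev_month
    elif phrase == "beginning of last month":
        # Assumption: "beginning" covers the first ten days of the month.
        start, end = first_prev_month, first_prev_month + timedelta(days=9)
    else:
        raise ValueError(f"unsupported time phrase: {phrase!r}")
    return start.isoformat(), end.isoformat()
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the conversion deterministic and easy to test.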
S400: and executing the data query statement to query indexes in a database system, obtaining a data query result, and sending the data query result to the front end for displaying.
Specifically, the data query result is a query result obtained by querying in a corresponding database based on the data query statement, and the front end is a terminal for business personnel to input the input language information and perform query. If the data query sentence obtained by inputting the language information is clearly identifiable and corresponding data exists in the corresponding database, the corresponding data obtained by querying can be returned to the front end to be provided for the querying business personnel. If the data query statement obtained by inputting the language information is not identifiable or corresponding data does not exist in the corresponding database, a no-query result is returned to the front end.
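The two outcomes described above (data returned to the front end, or a no-query result) can be illustrated with Python's built-in `sqlite3` module. The table layout, column names, and the shape of the returned dictionary are hypothetical; only the behaviour of returning either rows or a no-result marker follows the text.

```python
import sqlite3


def run_report_query(conn, sql, params=()):
    """Execute a query and wrap the outcome in the form sent to the front end."""
    try:
        rows = conn.execute(sql, params).fetchall()
    except sqlite3.Error:
        # The statement could not be executed: report "no query result".
        return {"status": "no_result", "rows": []}
    if not rows:
        # The statement ran but the database holds no matching data.
        return {"status": "no_result", "rows": []}
    return {"status": "ok", "rows": rows}


# Hypothetical in-memory report table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (index_name TEXT, org TEXT, month TEXT, value REAL)")
conn.execute("INSERT INTO report VALUES ('new_card_volume', 'Shenzhen', '2021-10', 1200.0)")

QUERY = "SELECT value FROM report WHERE index_name = ? AND org = ? AND month = ?"
hit = run_report_query(conn, QUERY, ("new_card_volume", "Shenzhen", "2021-10"))
miss = run_report_query(conn, QUERY, ("new_card_volume", "Shenzhen", "2021-09"))
```

The key-field values are bound through `?` placeholders rather than spliced into the SQL string, which avoids injection when the values come from user speech or text.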
In the embodiment of the application, semantic understanding of the report query language is achieved through the NER entity recognition algorithm. A search algorithm uses the recognized key fields to search the database tables for indexes, which completes the "index finding" process; similarity calculation between the natural language question and the table fields completes the "table finding" process; and NER entity recognition is used as the preferred search scheme to optimize execution speed. Technicians do not need to write SQL query statements, the threshold of data query is lowered, and natural language can be intelligently processed and queried more accurately.
Step S200 in the method provided in the embodiment of the present application includes:
S210: obtaining text information according to the input language information;
S220: performing error recognition on the text information to obtain erroneous words;
S230: performing supplementary error correction on the erroneous words to obtain preprocessed language information, and inputting the preprocessed language information into the NER entity recognition model.
Specifically, in this embodiment of the application, the input language information is speech entered through a microphone and then converted into text by the ASR interface, which yields the text information.
When speech is converted into text through the ASR interface, the spoken query sentence may not be standard. The embodiment of the application accepts both text and speech data as the input through which the user interacts with the system. Speech input is converted into text by an ASR (Automatic Speech Recognition) engine, and the text is passed to a text error correction model for data preprocessing, which comprises rewriting wrong characters, replacing homophones, and replacing invalid characters. The method provided by the embodiment of the application includes an online text error correction service that supports hot update and hot deployment of the model in response to erroneous words encountered online, so erroneous-word entries can be added and deleted online without restarting the system service.
According to the embodiment of the application, non-professional technicians can freely inquire data as required on the premise of not learning and mastering the programming language of the database through voice input, correct preprocessed language information corresponding to the corresponding database can be obtained through the error identification and supplementary error correction processes, and then inquiry is carried out, so that the threshold of data inquiry is lowered, and the technical effects of intelligent processing and inquiry on natural language can be accurately achieved.
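A dictionary-based stand-in for the text error correction step, with rules that can be added and removed while the service is running, is sketched below. The patent describes a correction model; this sketch substitutes a plain lookup table, and the class name `TextCorrector`, its methods, and the definition of "invalid characters" used here are all assumptions.

```python
import re


class TextCorrector:
    """Lookup-table corrector whose rules can be changed without a restart."""

    def __init__(self, replacements=None):
        self.replacements = dict(replacements or {})

    def add_rule(self, wrong, right):
        """Register a wrong form (typo or homophone) and its correction."""
        self.replacements[wrong] = right

    def remove_rule(self, wrong):
        """Delete a rule; unknown keys are ignored."""
        self.replacements.pop(wrong, None)

    def correct(self, text):
        # Assumed notion of "invalid": anything except word characters,
        # whitespace, and the symbols %, ., - that queries commonly need.
        text = re.sub(r"[^\w\s%.-]", "", text)
        # Apply longer wrong forms first so they are not split by shorter ones.
        for wrong in sorted(self.replacements, key=len, reverse=True):
            text = text.replace(wrong, self.replacements[wrong])
        # Collapse runs of whitespace left behind by the removals.
        return re.sub(r"\s+", " ", text).strip()


corrector = TextCorrector({"card isuance": "card issuance"})
```

Because the rules live in an ordinary dictionary, `add_rule` and `remove_rule` take effect on the next call to `correct`; a production service would additionally need locking or an atomic swap of the table under concurrent requests.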
Step S200 in the method provided in the embodiment of the present application further includes:
S240: inputting the preprocessed language information into a BERT pre-trained model for encoding to obtain the fine-tuned BERT output layer;
S250: inputting the fine-tuned BERT output layer into a Bi-LSTM layer for encoding to obtain the Bi-LSTM layer output;
S260: inputting the Bi-LSTM layer output into a CRF classification labeling layer for entity word slot extraction to obtain the key fields.
Specifically, as shown in Fig. 2 and Fig. 3, the embodiment of the present application fuses several algorithms. The preprocessed language information is used to fine-tune a BERT (Bidirectional Encoder Representations from Transformers) pre-trained model: the entities to be recognized are defined, and the text is then encoded into embeddings. The output layer of the fine-tuned BERT is fed into a Bi-LSTM layer to obtain the Bi-LSTM layer output, and the resulting embeddings are finally fed into a CRF (conditional random field) layer, which classifies each token so that the entity word slots can be extracted.
During fine-tuning, the fine-tuned pre-trained models are stored in the Hadoop Distributed File System (HDFS) for subsequent use.
In the embodiment of the application, entity recognition is performed with the BERT algorithm, and a Bi-LSTM and a conditional random field (CRF) are added on top of BERT's uppermost output vectors to constrain the probability distribution of the entity word slots. This improves the accuracy of BERT training and the speed at which the training loss converges, so that the entities in the preprocessed language information are obtained accurately and efficiently.
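The last step, turning the per-token labels produced by the CRF layer into entity word slots, can be illustrated without the neural network itself. The sketch below assumes the common BIO tagging scheme (`B-` begins a slot, `I-` continues it, `O` is outside any slot), which the text does not specify; the function name `decode_slots` and the example tokens are hypothetical.

```python
def decode_slots(tokens, tags):
    """Group B-/I- tagged tokens into a {slot_name: text} dictionary."""
    slots = {}
    label, words = None, []
    # A trailing ("", "O") sentinel flushes the slot that is still open at the end.
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        if tag.startswith("I-") and tag[2:] == label:
            words.append(token)  # continue the currently open slot
            continue
        if label is not None:
            slots[label] = " ".join(words)  # close the previous slot
        if tag.startswith("B-"):
            label, words = tag[2:], [token]  # open a new slot
        else:
            label, words = None, []  # "O" or an orphan "I-" tag: no open slot
    return slots


tokens = ["show", "new", "card", "volume", "for", "Shenzhen", "last", "month"]
tags = ["O", "B-index_name", "I-index_name", "I-index_name", "O",
        "B-destination_name", "B-index_date", "I-index_date"]
slots = decode_slots(tokens, tags)
```

Tokens are joined with a space, which suits the English example; for Chinese text the characters would be joined without a separator.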
Step S240 in the method provided in the embodiment of the present application includes:
S241: labeling the data entity types of the preprocessed language information, wherein the data entity types comprise a table field, a time field, and an organization field;
S242: fine-tuning the BERT pre-trained model based on the table field, the time field, and the organization field to obtain the fine-tuned BERT output layer.
Specifically, in the embodiment of the present application, the data entity types in the preprocessed language information are the table field, the time field, and the organization field, and the BERT pre-trained model is fine-tuned on these three types. Together the three entity types represent the query indexes in the preprocessed language information, and accurate entities can be obtained after entity recognition with the fine-tuned BERT algorithm.
In the embodiment of the application, the natural language question is used to interact with the database tables: the key entities required for table lookup are extracted from the question. Labeling the entity types helps establish a method for accurately obtaining the key entities from the preprocessed language information, which further improves the accuracy of BERT training and the speed of loss convergence, so that the entities in the preprocessed language information are obtained accurately and efficiently.
Step S300 in the method provided in the embodiment of the present application includes:
S310: standardizing the key fields to obtain standard indexes;
S320: querying indexes in a database system according to the standard indexes to obtain an index query result;
S330: when the index query result is a first result, taking the index query result as the data query result, and sending the data query result to the front end for display.
Specifically, in the embodiment of the present application, the key fields comprise five entities: the index name (index_name), the location or organization name (destination_name), the query time (index_date), the comparison time (compare_date), and the index comparison method (compare_statistics).
As shown in Fig. 2 and Fig. 4, the embodiment of the present application standardizes these five entities, specifically by rewriting the time and by normalizing the index name, the organization name, and the comparison method. This is necessary because a spoken query sentence converted into text through the ASR interface may not be standard. The standardization includes but is not limited to: rewriting a spoken index name into a unified, recognizable index name through an index name rewriting module; and converting non-standard spoken time into a standard time format such as YYYY-MM-DD.
For example, if the text input contains a shortened or colloquial form of an index name, such as "new card volume" or "card volume", the corresponding entity index_name needs to be expanded or normalized to the standard name that exists in the database, "newly added card issuance volume", and the extracted entity is then used to retrieve the result from the database through the search engine. If the key fields contain non-standard spoken time such as "beginning of last month", "yesterday", "same period last year", or "last weekend", it needs to be converted into the standard time format of the corresponding database, such as "2021-10-DD". The standard index names and the standard formats are query standards preset in the corresponding database and can be set by a person skilled in the art.
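Index name rewriting can be sketched as a lookup from spoken variants to the standard name held in the database. The alias table, its English entries, and the function name `normalize_index_name` are hypothetical; the text only states that spoken names are rewritten into a unified standard name.

```python
# Hypothetical alias table mapping spoken variants to the standard index name.
STANDARD_NAMES = {
    "new card volume": "newly added card issuance volume",
    "card volume": "newly added card issuance volume",
}


def normalize_index_name(spoken, aliases=STANDARD_NAMES):
    """Return the standard index name for a spoken variant, or None if unknown."""
    key = " ".join(spoken.lower().split())  # fold case and collapse whitespace
    if key in aliases.values():
        return key  # already a standard name
    return aliases.get(key)
```

Returning `None` for an unknown name lets the caller treat the lookup as a miss and fall through to the next stage instead of querying with an unrecognized index.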
The standard indexes are queried in the database system. If a corresponding standard index exists, the index query result is the first result, meaning that the standard index can be queried directly in the corresponding database. In that case the entity corresponding to the index is looked up directly in the database table, and the data query result is obtained and sent to the front end for display.
By standardizing the key fields, the embodiment of the application normalizes spoken query input and obtains the query indexes that correspond to standard index names in the database. This brings a more accurate understanding of spoken input, so that database query statements are generated accurately and accurate data query and analysis results are obtained.
In the method provided in the embodiment of the present application, if the index query result is the second result, step S300 further includes:
S340: generating an SQL algorithm according to the rule model to obtain an SQL rule model;
S350: inputting the standard index into the SQL rule model to obtain a first output result;
S360: when the first output result is the first result, obtaining the data query statement according to the first output result.
Specifically, if the entity corresponding to the standard index cannot be queried directly in the database, an SQL query statement needs to be generated. This situation generally arises with a complex natural language query problem: the corresponding input language information is complex, and the key field search algorithm cannot obtain a query result from the database, which is the second result.
When the index query result is the second result, an SQL algorithm is generated by the rule model to obtain the SQL rule model; the key field results extracted by entity recognition are input into the model to generate an SQL query statement, which is then queried in the database.
For example, when the query is "cities with a sub-center in which the increase in newly added cards exceeds 10% over the same period last year", a result obviously cannot be obtained directly by searching for the index "newly added card amount"; an SQL statement must be generated to perform the calculation and sorting. The spoken input language information is converted into an SQL statement conforming to the SQL rule model, and the statement is then queried in the corresponding database to obtain the query result.
When the standard index is input into the SQL rule model, a first output result is obtained, and it must be judged whether the first output result meets the query requirement. If the SQL statement corresponding to the first output result meets the query requirement and a query result can be obtained, the first output result is the first result, and the SQL statement can be used as the data query statement to query the database.
According to the embodiment of the present application, the SQL rule model is built with the SQL algorithm. When a complex natural language query problem is processed, the query statement can be calculated and sorted through the SQL rule model. Combined with the text error correction model and the entity recognition model, this brings a more accurate understanding of the language description, so that query results can be obtained for more complex language query input, and accurate data query and analysis results can be obtained by accurately generating the database query language.
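A rule that turns extracted key fields into an SQL statement performing calculation and sorting might look like the following sketch. The application does not disclose its rule templates, so the year-over-year template, the function name `build_yoy_growth_sql`, and the table and column names used with it are all hypothetical.

```python
from typing import Tuple


def build_yoy_growth_sql(table: str, group_col: str, value_col: str,
                         year_col: str, this_year: int,
                         threshold: float) -> Tuple[str, tuple]:
    """Build a parameterized year-over-year growth query from key fields.

    Identifiers cannot be bound as SQL parameters, so they are validated
    before being interpolated; values are passed as bound parameters.
    """
    for name in (table, group_col, value_col, year_col):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    growth = f"(cur.{value_col} - prev.{value_col}) * 1.0 / prev.{value_col}"
    sql = (
        f"SELECT cur.{group_col} AS name, {growth} AS growth "
        f"FROM {table} AS cur JOIN {table} AS prev "
        f"ON cur.{group_col} = prev.{group_col} "
        f"WHERE cur.{year_col} = ? AND prev.{year_col} = ? AND {growth} > ? "
        f"ORDER BY growth DESC"
    )
    return sql, (this_year, this_year - 1, threshold)
```

The returned statement and parameter tuple can be handed to any DB-API style `execute(sql, params)` call; the comparison period and the 10% threshold come from the standardized time and comparison-method entities.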
In the method provided in the embodiment of the present application, if the first output result is the second result, step S360 further includes:
S361: inputting the standard index into an NL2SQL model, wherein the NL2SQL model is a prediction model constructed by supplementing an NL2SQL algorithm with an SQL algorithm generated based on the rule model, and the NL2SQL model comprises a first subtask and a second subtask;
S362: obtaining a first prediction result based on the first subtask;
S363: obtaining a second prediction result based on the second subtask;
S364: splicing the first prediction result and the second prediction result to obtain a second output result, wherein the second output result comprises the data query statement.
Specifically, if the SQL statement obtained through calculation and sorting by the SQL rule model cannot obtain a query result in the corresponding database, the first output result is the second result. In this case, on the basis of the SQL statement generation algorithm of the SQL rule model, the NL2SQL algorithm is used as a supplement to return a result, and the query is then performed.
The NL2SQL model is a prediction model constructed by supplementing the NL2SQL algorithm with the SQL algorithm generated based on the rule model. The standard index is input into the NL2SQL model, which recalls the most similar result table from the library tables through table name similarity calculation, and a complete SQL statement is then combined and spliced on the basis of the NL2SQL model.
Furthermore, in the embodiment of the present application, SQL prediction by the NL2SQL model is divided into two subtasks. The first subtask yields the first prediction result, which covers the fields, the WHERE connection condition, and the aggregation function; the second subtask yields the second prediction result, which covers the condition values that follow WHERE. The first prediction result and the second prediction result are spliced into a complete SQL statement, giving the second output result.
According to the embodiment of the application, the NL2SQL model performs re-prediction on the basis of the SQL statement algorithm generated by the SQL rule model, and the NL2SQL prediction task is divided into two subtasks, which improves the accuracy of the model. NL2SQL is also combined with big data technology: the training data set and the model are stored in HDFS, which facilitates unified management of cluster resource files and calling of the model. Based on NER entity recognition combined with the NL2SQL model, the embodiment optimizes the algorithm execution speed and achieves the technical effect of generating the data query statement more efficiently.
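The splicing of the two prediction results can be sketched as follows. The data classes and their fields are hypothetical containers, not structures named in the application; in particular, this sketch assumes for simplicity that the second prediction result carries each condition as a (column, operator, value) triple.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class StructurePrediction:
    """Hypothetical first-subtask output: fields, aggregation, WHERE connector."""
    table: str
    select_fields: List[Tuple[str, Optional[str]]]  # (column, aggregation or None)
    connector: str  # "AND" or "OR", joins the WHERE conditions


@dataclass
class ValuePrediction:
    """Hypothetical second-subtask output: conditions that follow WHERE."""
    conditions: List[Tuple[str, str, Any]]  # (column, operator, value)


def splice(first: StructurePrediction,
           second: ValuePrediction) -> Tuple[str, List[Any]]:
    """Splice both predictions into one parameterized SQL statement."""
    cols = ", ".join(f"{agg}({col})" if agg else col
                     for col, agg in first.select_fields)
    sql = f"SELECT {cols} FROM {first.table}"
    if second.conditions:
        joiner = f" {first.connector} "
        where = joiner.join(f"{col} {op} ?" for col, op, _ in second.conditions)
        sql += f" WHERE {where}"
    params = [value for _, _, value in second.conditions]
    return sql, params
```

Keeping the condition values as bound parameters rather than interpolating them into the string is a design choice of this sketch; it keeps predicted values from altering the statement structure.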
In the method provided in the embodiment of the present application, when the data query statement is the second output result, step S400 includes:
S410: obtaining a prediction result probability according to the data query result;
S420: judging whether the prediction result probability meets a first preset threshold;
S430: when it is met, sending the data query result to a front end for displaying;
S440: when it is not met, returning a preset result.
Specifically, as shown in Fig. 2, when the data query statement is the second output result, the spliced SQL statement output by the NL2SQL model, corresponding to the second output result, is queried in the database to obtain a data query result.
The SQL statement corresponding to the second output result includes the most similar result table, which the NL2SQL model recalls from the database tables through table name similarity calculation. If the similarity is high, the predicted result of the NL2SQL model is more likely to be accurate; if the similarity is low, it is less likely to be accurate. Whether the prediction result probability meets the first preset threshold is judged according to whether the data query result corresponds to the SQL statement of the second output result, and according to the corresponding similarity. If the threshold is met, the data query result meets the query standard and is sent to the front end for display; if not, the data query result does not meet the query standard, and a preset result is returned.
The preset result may be "no query result" or "query failure", in which case the user needs to input more accurate language information and query again.
The first preset threshold is a probability preset in the NL2SQL model and can be set by those skilled in the art as needed; for example, the first preset threshold may be 60%. The NL2SQL model makes its prediction and generates an SQL query statement, which is queried in the Hive system. If the prediction result probability is greater than the first preset threshold, the data query result is sent to the front end for display; if it is less than the first preset threshold, a preset result is returned. According to the embodiment of the application, a preset result is returned when the prediction result probability of the NL2SQL model is below a certain threshold. This supplements hot deployment and hot updating of the model: no training data needs to be added, and a more accurate result can be predicted through the optimized model.
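The threshold check in steps S410 to S440 can be sketched as below. The function name, the shape of the returned dictionary, and the preset message are assumptions for illustration. The text specifies behavior only for probabilities greater than or less than the threshold, so treating a probability exactly equal to the threshold as meeting it is also an assumption of this sketch.

```python
from typing import Any, Dict

# Hypothetical preset result returned when the prediction is not trusted.
PRESET_RESULT = {"status": "no_result",
                 "message": "No query result; please rephrase the query."}


def gate_result(rows: Any, probability: float,
                threshold: float = 0.60) -> Dict[str, Any]:
    """Return the query result for display, or the preset result."""
    if probability >= threshold:  # equality handling is an assumption
        return {"status": "ok", "rows": rows}
    return dict(PRESET_RESULT)  # copy so callers cannot mutate the preset
```

The default of 0.60 mirrors the 60% example given above; in practice the value would be tuned to the deployed model.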
In summary, the embodiment of the present application is based on the NER entity recognition algorithm: a search algorithm uses the recognized keywords to search for indexes in the database tables, thereby completing the process of "finding the index". By comparison, the prior art uses only an NL2SQL model, which performs similarity calculation between the "natural language question" and the table fields to complete the process of "finding the table". Using NER entity recognition as the priority search scheme optimizes the algorithm execution speed and compensates for the inaccurate results that "similarity calculation" produces with a certain probability. The following technical effects are achieved:
1. The embodiment of the application abandons the intention recognition that traditional question-answering systems use to classify and recognize language intentions, and instead integrates several natural language processing technologies: a text error correction model, an entity recognition model, a rule model for generating SQL (structured query language), and a natural-language-to-SQL model. The fusion of these technologies improves the robustness of the system and brings a more accurate understanding of the language description, so that the database query language is generated accurately and accurate data query and analysis results are obtained.
2. The invention uses the BERT algorithm to recognize entities, and adds BiLSTM and the conditional random field (CRF) algorithm above the uppermost layer of BERT to process its output vectors and constrain the probability distribution of the entity word slots, thereby improving the accuracy of BERT training and the convergence speed of the training loss.
3. The invention achieves semantic understanding of the report query language through the NER entity recognition algorithm, and a search algorithm uses the recognized keywords to search for indexes in the database tables, completing the process of "finding the index". By comparison, using only an NL2SQL model performs similarity calculation between the natural language question and the table fields to complete the process of "finding the table". Using NER entity recognition as the priority search scheme optimizes the algorithm execution speed and compensates for the inaccurate results that similarity calculation produces with a certain probability.
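The priority order summarized above (direct index search first, then the SQL rule model, then NL2SQL) can be sketched as a simple cascade. The four callables are hypothetical stand-ins for the components described in this embodiment, and treating an empty row list as "no result" is an assumption of the sketch.

```python
from typing import Callable, List, Optional, Tuple

Rows = List[tuple]


def answer_query(standard_index: str,
                 direct_lookup: Callable[[str], Optional[Rows]],
                 rule_sql: Callable[[str], Optional[str]],
                 nl2sql: Callable[[str], str],
                 run_sql: Callable[[str], Rows]) -> Tuple[str, Rows]:
    """Try each tier in priority order and report which tier answered."""
    rows = direct_lookup(standard_index)  # tier 1: "finding the index"
    if rows:
        return "index", rows
    sql = rule_sql(standard_index)        # tier 2: SQL rule model
    if sql is not None:
        rows = run_sql(sql)
        if rows:
            return "rule", rows
    # Tier 3: NL2SQL, used only when the cheaper tiers produce nothing.
    return "nl2sql", run_sql(nl2sql(standard_index))
```

Placing the cheapest lookup first reflects the stated goal of optimizing execution speed; the probability check on the NL2SQL result would be applied by the caller after this cascade.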
Example two
Based on the same inventive concept as the method for applying the NER entity recognition algorithm to report query in the foregoing embodiment, as shown in fig. 5, an embodiment of the present application provides an application system of the NER entity recognition algorithm to report query, wherein the system includes:
a first obtaining unit 11, the first obtaining unit 11 being configured to obtain input language information;
a first processing unit 12, where the first processing unit 12 is configured to input the input language information into an NER entity recognition model, and obtain a key field;
the second processing unit 13, where the second processing unit 13 is configured to obtain a data query statement according to the key field;
and the third processing unit 14 is configured to execute the data query statement to query the index in the database system, obtain a data query result, and send the data query result to the front end for display.
Further, the system further comprises:
a second obtaining unit configured to obtain text information according to the input language information;
the fourth processing unit is used for carrying out error recognition on the text information to obtain error vocabularies;
and the fifth processing unit is used for performing supplementary error correction on the error vocabulary to obtain preprocessed language information, and inputting the preprocessed language information serving as input information into the NER entity recognition model.
Further, the system further comprises:
a sixth processing unit, configured to input the preprocessed language information into a BERT pre-training model for encoding, so as to obtain a BERT training tuning output layer;
a seventh processing unit, configured to input the BERT training tuning output layer into a Bi-LSTM layer for encoding, so as to obtain a Bi-LSTM layer output;
and the eighth processing unit is used for inputting the Bi-LSTM layer output into a CRF classification labeling layer to perform entity word slot extraction, so as to obtain the key field.
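Entity word slot extraction from the labels emitted by the CRF layer can be sketched as follows, assuming the common BIO tagging scheme. The application does not state its tag set, so the scheme and the labels used with it (such as "index_name", "org", and "time") are assumptions, and tokens are joined with spaces only because the sketch uses English examples.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) key fields."""
    fields: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_type is not None:
            fields.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag ends the current entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return fields
```

An "I-" tag whose type does not match the open entity is dropped here rather than repaired; a CRF layer makes such sequences unlikely, which is one reason it is placed above the Bi-LSTM output.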
Further, the system further comprises:
a ninth processing unit, configured to label a data entity type of the pre-processing language information, where the data entity type includes a table field, a time field, and a mechanism field;
a tenth processing unit, configured to tune the BERT pre-training model based on the table field, the time field, and the mechanism field, and obtain the BERT training tuned output layer.
Further, the system further comprises:
the eleventh processing unit is used for carrying out standardization processing on the key fields to obtain standard indexes;
the twelfth processing unit is used for inquiring indexes in the database system according to the standard indexes to obtain index inquiry results;
and the thirteenth processing unit is used for taking the index query result as the data query result and sending the data query result to the front end for displaying when the index query result is the first result.
Further, the system further comprises:
a fourteenth processing unit, configured to generate an SQL algorithm according to the rule model, and obtain an SQL rule model;
a fifteenth processing unit, configured to input the standard indicator into the SQL rule model, and obtain a first output result;
a first judging unit, configured to, when the first output result is a first result, obtain the data query statement according to the first output result.
Further, the system further comprises:
a sixteenth processing unit, configured to input the standard indicator into an NL2SQL model, where the NL2SQL model is a prediction model constructed by generating an SQL algorithm and supplementing the NL2SQL algorithm based on the rule model, where the NL2SQL model includes a first subtask and a second subtask;
a seventeenth processing unit, configured to obtain a first prediction result based on the first sub-task;
an eighteenth processing unit, configured to obtain a second prediction result based on the second subtask;
a nineteenth processing unit, configured to splice the first prediction result and the second prediction result to obtain a second output result, where the second output result includes the data query statement.
Further, the system further comprises:
a twentieth processing unit, configured to obtain a predicted result probability according to the data query result;
a second determination unit configured to determine whether the prediction result probability satisfies a first predetermined threshold;
the twenty-first processing unit is used for sending the data query result to a front end for displaying when the data query result is met;
a twenty-second processing unit, configured to return a preset result when not satisfied.
Exemplary electronic device
The electronic device of the embodiment of the present application is described below with reference to Fig. 6.
Based on the same inventive concept as the method for applying the NER entity recognition algorithm to report query in the foregoing embodiments, the embodiment of the present application further provides an application system of the NER entity recognition algorithm to report query, including: a processor coupled to a memory, the memory being used for storing a program that, when executed by the processor, causes the system to perform the steps of the method of embodiment one.
The electronic device 300 includes: processor 302, communication interface 303, memory 301. Optionally, the electronic device 300 may also include a bus architecture 304. Wherein, the communication interface 303, the processor 302 and the memory 301 may be connected to each other through a bus architecture 304; the bus architecture 304 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus architecture 304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
Processor 302 may be a CPU, microprocessor, ASIC, or one or more integrated circuits for controlling the execution of programs in accordance with the teachings of the present application.
The communication interface 303 uses any transceiver-like apparatus to communicate with other devices or communication networks, such as Ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), or a wired access network.
The memory 301 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor through the bus architecture 304. The memory may also be integral to the processor.
The memory 301 is used for storing computer-executable instructions for executing the present application, and is controlled by the processor 302 to execute. The processor 302 is configured to execute the computer-executable instructions stored in the memory 301, so as to implement a method for applying the NER entity recognition algorithm to report queries, which is provided by the above embodiments of the present application.
Optionally, the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
The method is based on the NER entity recognition algorithm: a search algorithm uses the recognized keywords to search for indexes in the database tables, thereby completing the process of "finding the index". By comparison, the prior art uses only an NL2SQL model, which performs similarity calculation between the "natural language question" and the table fields to complete the process of "finding the table". Using NER entity recognition as the priority search scheme optimizes the algorithm execution speed and compensates for the inaccurate results that "similarity calculation" produces with a certain probability. The following technical effects are achieved:
1. The embodiment of the application abandons the intention recognition that traditional question-answering systems use to classify and recognize language intentions, and instead integrates several natural language processing technologies: a text error correction model, an entity recognition model, a rule model for generating SQL (structured query language), and a natural-language-to-SQL model. The fusion of these technologies improves the robustness of the system and brings a more accurate understanding of the language description, so that the database query language is generated accurately and accurate data query and analysis results are obtained.
2. The invention uses the BERT algorithm to recognize entities, and adds BiLSTM and the conditional random field (CRF) algorithm above the uppermost layer of BERT to process its output vectors and constrain the probability distribution of the entity word slots, thereby improving the accuracy of BERT training and the convergence speed of the training loss.
3. The invention achieves semantic understanding of the report query language through the NER entity recognition algorithm, and a search algorithm uses the recognized keywords to search for indexes in the database tables, completing the process of "finding the index". By comparison, using only an NL2SQL model performs similarity calculation between the natural language question and the table fields to complete the process of "finding the table". Using NER entity recognition as the priority search scheme optimizes the algorithm execution speed and compensates for the inaccurate results that similarity calculation produces with a certain probability.
Those of ordinary skill in the art will understand that the ordinal terms "first", "second", etc. mentioned in this application are used only for convenience of description; they do not limit the scope of the embodiments of this application, nor do they indicate an order of precedence. "And/or" describes the association relationship of the associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one" means one or more. "At least two" means two or more. "At least one", "any", or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application occur in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable system. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The various illustrative logical units and circuits described in this application may be implemented or operated upon by general purpose processors, digital signal processors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic systems, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing systems, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be disposed in a terminal. In the alternative, the processor and the storage medium may reside in different components within the terminal. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations.

Claims (10)

1. A method for applying NER entity recognition algorithm in report query, wherein the method comprises the following steps:
obtaining input language information;
inputting the input language information into an NER entity recognition model to obtain a key field;
obtaining a data query statement according to the key field;
and executing the data query statement to query indexes in a database system, obtaining a data query result, and sending the data query result to the front end for displaying.
2. The method of claim 1, wherein said entering said input language information into a NER entity recognition model comprises:
acquiring text information according to the input language information;
carrying out error recognition on the text information to obtain an error vocabulary;
and performing supplementary error correction on the error vocabulary to obtain preprocessed language information, and inputting the preprocessed language information serving as input information into the NER entity recognition model.
3. The method of claim 2, wherein said entering said input language information into a NER entity recognition model, obtaining key fields, comprises:
inputting the preprocessed language information into a BERT pre-training model for coding to obtain a BERT training tuning output layer;
inputting the BERT training tuning output layer into a Bi-LSTM layer for coding to obtain Bi-LSTM layer output;
and inputting the Bi-LSTM layer output into a CRF classification labeling layer for entity word slot extraction, and obtaining the key field.
4. The method of claim 3, wherein said inputting said preprocessed linguistic information into a BERT pre-training model for encoding to obtain a BERT training tuned output layer comprises:
marking the data entity type of the preprocessing language information, wherein the data entity type comprises a table field, a time field and a mechanism field;
and adjusting the BERT pre-training model based on the table field, the time field and the mechanism field to obtain the BERT training adjusted output layer.
5. The method of claim 1, wherein obtaining the data query statement according to the key field comprises:
standardizing the key fields to obtain standard indexes;
inquiring indexes in a database system according to the standard indexes to obtain index inquiry results;
and when the index query result is a first result, taking the index query result as the data query result, and sending the data query result to the front end for displaying.
6. The method of claim 5, wherein when the index query result is a second result, the obtaining a data query statement according to the key field comprises:
generating an SQL algorithm according to the rule model to obtain an SQL rule model;
inputting the standard index into the SQL rule model to obtain a first output result;
and when the first output result is the first result, obtaining the data query statement according to the first output result.
7. The method of claim 6, wherein when the first output result is a second result, the method comprises:
inputting the standard index into an NL2SQL model, wherein the NL2SQL model is a prediction model constructed by supplementing an NL2SQL algorithm with an SQL algorithm generated based on the rule model, and the NL2SQL model comprises a first subtask and a second subtask;
obtaining a first prediction result based on the first subtask;
obtaining a second prediction result based on the second subtask;
and splicing the first prediction result and the second prediction result to obtain a second output result, wherein the second output result comprises the data query statement.
8. The method of claim 7, wherein when the data query statement is the second output result, the executing the data query statement queries an index in a database system to obtain a data query result, comprising:
obtaining a prediction result probability according to the data query result;
judging whether the prediction result probability meets a first preset threshold value or not;
when it is met, sending the data query result to a front end for displaying;
and when it is not met, returning a preset result.
9. An application system of NER entity recognition algorithm in report query, wherein the system comprises:
a first obtaining unit configured to obtain input language information;
the first processing unit is used for inputting the input language information into an NER entity recognition model to obtain a key field;
the second processing unit is used for obtaining a data query statement according to the key field;
and the third processing unit is used for executing the index query of the data query statement in a database system, obtaining a data query result and sending the data query result to the front end for display.
10. An application system of NER entity recognition algorithm in report query includes: a processor coupled to a memory for storing a program that, when executed by the processor, causes a system to perform the steps of the method of any of claims 1 to 8.
CN202111493638.XA 2021-12-08 2021-12-08 Method and system for applying NER entity recognition algorithm in report query Pending CN114238370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111493638.XA CN114238370A (en) 2021-12-08 2021-12-08 Method and system for applying NER entity recognition algorithm in report query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111493638.XA CN114238370A (en) 2021-12-08 2021-12-08 Method and system for applying NER entity recognition algorithm in report query

Publications (1)

Publication Number Publication Date
CN114238370A (en) 2022-03-25

Family

ID=80754009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111493638.XA Pending CN114238370A (en) 2021-12-08 2021-12-08 Method and system for applying NER entity recognition algorithm in report query

Country Status (1)

Country Link
CN (1) CN114238370A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306598A (en) * 2023-05-22 2023-06-23 上海蜜度信息技术有限公司 Customized error correction method, system, equipment and medium for words in different fields
CN116306598B (en) * 2023-05-22 2023-09-08 上海蜜度信息技术有限公司 Customized error correction method, system, equipment and medium for words in different fields

Similar Documents

Publication Publication Date Title
US11948058B2 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
US10262062B2 (en) Natural language system question classifier, semantic representations, and logical form templates
CN104361127B (en) The multilingual quick constructive method of question and answer interface based on domain body and template logic
CN104657440B (en) Structured query statement generation system and method
CN114547329A (en) Method for establishing pre-training language model, semantic analysis method and device
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN112817561B (en) Transaction type functional point structured extraction method and system for software demand document
CN110555205B (en) Negative semantic recognition method and device, electronic equipment and storage medium
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN113254581B (en) Financial text formula extraction method and device based on neural semantic analysis
CN113268560A (en) Method and device for text matching
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN114625748A (en) SQL query statement generation method and device, electronic equipment and readable storage medium
CN115359799A (en) Speech recognition method, training method, device, electronic equipment and storage medium
CN113220900B (en) Modeling Method of Entity Disambiguation Model and Entity Disambiguation Prediction Method
CN113239694B (en) Argument role identification method based on argument phrase
CN112036186A (en) Corpus labeling method and device, computer storage medium and electronic equipment
CN114238370A (en) Method and system for applying NER entity recognition algorithm in report query
CN111738008B (en) Entity identification method, device and equipment based on multilayer model and storage medium
CN113095082A (en) Method, device, computer device and computer readable storage medium for text processing based on multitask model
CN110750967B (en) Pronunciation labeling method and device, computer equipment and storage medium
CN111209746A (en) Natural language processing method, device, storage medium and electronic equipment
CN115952770A (en) Data standardization processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination