CN111680043B

CN111680043B - Method for quickly retrieving mass data

Info

Publication number: CN111680043B
Application number: CN202010505012.5A
Authority: CN
Inventors: 徐晓贝; 陈胡; 陈宽; 陶伟洋; 叶兆裕; 王远友
Original assignee: Nanjing LES Information Technology Co. Ltd
Current assignee: Nanjing LES Information Technology Co. Ltd
Priority date: 2020-06-05
Filing date: 2020-06-05
Publication date: 2023-11-28
Anticipated expiration: 2040-06-05
Also published as: CN111680043A

Abstract

The invention discloses a method for quickly retrieving mass data, comprising the following steps: constructing a mass data storage system; establishing a secondary index for the data in the mass data storage system; starting a data retrieval service and listening for HTTP requests; parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result; and, according to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning it. The method can quickly retrieve mass data by multiple conditions and return the query result within a very short time, and overcomes the defects of the prior-art schemes at minimal cost.

Description

Method for quickly retrieving mass data

Technical Field

The invention belongs to the technical field of quick retrieval of big data, and particularly relates to a quick retrieval method for massive data.

Background

With the development of society and technology, massive amounts of data are generated in different fields every day, and storing and using this data has become a very challenging technical problem. For example, in the transportation industry, in a county-level city of 3 million people, video detectors generate 10 million vehicle-passage records. A typical transactional information management system stores this data in a relational database. In the first year the data can be retrieved normally, but once two or more years of data have accumulated, the required data can no longer be queried in a short time even after the query methods and the storage design have been optimized. How to store mass data more effectively and retrieve it quickly is a problem to be solved.

At present, most solutions add storage nodes to a relational database and build a large number of indexes to achieve fast retrieval. However, the indexes are very expensive to maintain: whenever data changes, the indexes are rebuilt in batches, and because the indexes and the data are handled in the same database instance, rebuilding the indexes directly degrades database performance and affects queries that are executing.

At present, two fast query schemes are realized based on big data technology, as follows:

1. The data is stored in HBase, and to support fast queries by condition the RowKey must be designed around the query conditions: the RowKey contains all of the query conditions and thereby acts as a globally unique index. The significant drawback is that once the query conditions change, the RowKey must be redesigned; the original main data can no longer be used and must be regenerated under the new RowKey. The same business data therefore has to be stored in multiple copies under different RowKeys, which wastes a huge amount of storage space. This is almost a fatal problem.

2. The data is likewise stored in the distributed column database HBase, and secondary indexes are designed according to the query conditions. The main data is stored in a table and the secondary index is stored in the same storage region as the main data, so that a query first locates the secondary index and then locates the main data directly in the same region. The advantage is that, because the index and the main data share a storage region, no time is spent retrieving the main data again across nodes. This secondary-index design solves the wasted storage space that scheme 1 suffers when the query conditions change, but it creates a new problem. HBase matches RowKeys by comparing their ASCII codes from front to back, so when there are several query conditions a very large number of secondary indexes is needed to serve the various combined queries. When the number of conditions reaches 7 or 8, there are too many indexes, and the storage space occupied by the indexes may even exceed that of the main data.
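The growth in the number of indexes described for scheme 2 can be made concrete with a little arithmetic. The sketch below is only an illustration and rests on stated assumptions that are not in the original: every query is a set of equality conditions, the naive design keeps one index per non-empty combination of conditions, and the best possible design lets one ordered index also serve each of its prefixes. The function names are hypothetical.

```python
from math import comb


def naive_index_count(n_conditions: int) -> int:
    """One secondary index per non-empty combination of query conditions."""
    return 2 ** n_conditions - 1


def min_prefix_index_count(n_conditions: int) -> int:
    """Fewest ordered indexes needed when an index also serves its prefixes.

    Each ordered index covers a chain of nested condition sets, and the
    combinations of size n // 2 can never share a chain, so at least
    C(n, n // 2) indexes are required.
    """
    return comb(n_conditions, n_conditions // 2)


for n in (3, 7, 8):
    print(n, naive_index_count(n), min_prefix_index_count(n))
```

Under these assumptions, 7 conditions need between 35 and 127 indexes and 8 conditions need between 70 and 255, which is consistent with the observation that the index storage can rival the main data.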

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a rapid retrieval method for massive data; the method can quickly search mass data according to multiple conditions and return the query result within a very short time range, and solves the defects of the prior art scheme with minimum cost.

In order to achieve the above purpose, the invention adopts the following technical scheme:

The invention discloses a method for quickly retrieving mass data, comprising the following steps:

1) Constructing a mass data storage system;

2) Establishing a secondary index for the data in the mass data storage system;

3) Starting a data retrieval service and listening for HTTP requests;

4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);

5) According to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning it.

Further, the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.
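The text does not say how the RowKey is designed to spread data evenly. One common technique, shown here purely as an assumption and not as the patented design, is to prefix the natural key with a small hash-derived salt so that consecutive records land in different regions. The function name, the key layout and the bucket count are all hypothetical.

```python
import hashlib


def salted_rowkey(plate_no: str, pass_time: str, buckets: int = 16) -> str:
    """Build a RowKey whose two-digit prefix spreads writes over `buckets` ranges.

    The salt is derived from the natural key itself, so the same record
    always maps to the same RowKey and can be recomputed when needed.
    """
    natural_key = f"{plate_no}_{pass_time}"
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}_{natural_key}"


print(salted_rowkey("苏A12345", "20200605120000"))
```

Because the secondary index stores the full RowKey, a salted prefix does not complicate retrieval: the lookup in HBase is always by exact key.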

Further, the step 2) specifically includes:

21) Using Elasticsearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the respective query conditions; the index fields include all query fields, plus a RowKey field that holds the value of the corresponding unique primary key once all conditions are matched;

22) Extracting all data from HBase and inserting the values of the fields that correspond to the Elasticsearch index into that index; this process is referred to as indexing the data;

23) When creating the Elasticsearch index, setting a different field type according to how each field is queried: fields queried by fuzzy matching use the IK analyzer, and fields matched on the whole value are set to keyword.
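Steps 21) and 22) can be sketched as a function that reduces each HBase row to its query fields plus the RowKey and formats the result for the Elasticsearch bulk API, which expects newline-delimited JSON with an action line before each document. This is a minimal sketch: the index name `vehicle_pass`, the `OWNER` column and the in-memory `rows` dictionary are illustrative assumptions, and the body would still have to be posted to the `_bulk` endpoint by a client.

```python
import json

# Query fields named in the description; anything else stays only in HBase.
QUERY_FIELDS = ("PLATE_NO", "PLATE_TYPE", "PLATE_COLOR", "pass_time")


def to_index_doc(rowkey: str, row: dict) -> dict:
    """Keep only the query fields and attach the RowKey of the main data."""
    doc = {field: row[field] for field in QUERY_FIELDS if field in row}
    doc["ROWKEY"] = rowkey
    return doc


def bulk_body(index_name: str, rows: dict) -> str:
    """Format rows (rowkey -> column dict) as an NDJSON bulk request body."""
    lines = []
    for rowkey, row in rows.items():
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(to_index_doc(rowkey, row), ensure_ascii=False))
    return "\n".join(lines) + "\n"


rows = {"rk1": {"PLATE_NO": "苏A12345", "PLATE_COLOR": "blue", "OWNER": "n/a"}}
print(bulk_body("vehicle_pass", rows))
```

Only the query fields and the RowKey reach the index, which is why the secondary index stays small relative to the main data.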

Further, step 3) specifically includes: writing a back-end service interface that listens for requests from other programs and returns the data results they require.

Further, the step 4) specifically includes:

the back-end service interface parses the request content, creates an Elasticsearch client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and receives the RowKey result set returned by the query.
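As a rough illustration of step 4), the parsed conditions can be turned into an Elasticsearch query body that filters on each field and asks only for the ROWKEY in `_source`. This is a sketch under assumptions: the condition keys `time_from` and `time_to` are hypothetical, the field names follow the embodiment, and the body would be sent with whichever Elasticsearch client the service uses.

```python
def build_search_body(conditions: dict, size: int = 20) -> dict:
    """Translate parsed request conditions into an Elasticsearch query body."""
    filters = []
    # Exact-value fields are matched with term queries.
    for field in ("PLATE_TYPE", "PLATE_COLOR"):
        if field in conditions:
            filters.append({"term": {field: conditions[field]}})
    # The analyzed plate number is matched with a match query.
    if "PLATE_NO" in conditions:
        filters.append({"match": {"PLATE_NO": conditions["PLATE_NO"]}})
    # Optional time window on the date field.
    bounds = {}
    if "time_from" in conditions:
        bounds["gte"] = conditions["time_from"]
    if "time_to" in conditions:
        bounds["lte"] = conditions["time_to"]
    if bounds:
        filters.append({"range": {"pass_time": bounds}})
    return {
        "query": {"bool": {"filter": filters}},
        "_source": ["ROWKEY"],  # only the RowKey is needed from the index
        "size": size,
    }


def extract_rowkeys(response: dict) -> list:
    """Collect the RowKey set from a search response."""
    return [hit["_source"]["ROWKEY"] for hit in response["hits"]["hits"]]
```

Filter context is used because the service only needs matching RowKeys, not relevance scores.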

Further, the step 5) specifically includes:

creating an HBase client API instance, passing the RowKey set to the HBase client API, accessing the HBase service, and packaging the returned result set into a result list;

the result list is returned to the client along the call stack.
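Step 5) might look like the following sketch. It assumes a table object with a `rows(keys)` method that returns `(row_key, columns)` pairs of bytes, which is the shape of the `Table.rows` call in the happybase client; the patent does not name a client library, so this choice and the `FakeTable` stand-in are assumptions made to keep the example self-contained.

```python
def fetch_records(table, rowkeys: list) -> list:
    """Batch-read rows by RowKey and package them into a result list.

    Results follow the order of `rowkeys`, so the ordering produced by the
    index query is preserved; RowKeys with no stored row are skipped.
    """
    raw = table.rows([rk.encode("utf-8") for rk in rowkeys])
    by_key = {key.decode("utf-8"): columns for key, columns in raw}
    records = []
    for rk in rowkeys:
        columns = by_key.get(rk)
        if columns is None:
            continue
        record = {"ROWKEY": rk}
        for qualifier, value in columns.items():
            record[qualifier.decode("utf-8")] = value.decode("utf-8")
        records.append(record)
    return records


class FakeTable:
    """In-memory stand-in for an HBase table, for illustration only."""

    def __init__(self, data: dict):
        self._data = data

    def rows(self, keys: list) -> list:
        return [(k, self._data[k]) for k in keys if k in self._data]


table = FakeTable({b"rk1": {b"d:PLATE_NO": b"A12345"}})
print(fetch_records(table, ["rk2", "rk1"]))
```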

The invention has the beneficial effects that:

1. Any change to the secondary index is independent of the main data and does not affect the storage of the main data.

2. Only the query conditions and the RowKey of the main data are stored in the secondary index, so the secondary index occupies very little storage space.

3. When the query conditions change, only a new Elasticsearch index needs to be built.

4. Because of the distributed storage design of HBase, queries against HBase cannot provide exact paging; Elasticsearch does provide exact paging, which overcomes the inability to page when querying HBase data directly.

5. With data retrieval implemented in this way, query efficiency does not fluctuate greatly as the data volume changes; the query time for 1 billion records is nearly the same as for 10 billion records.
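The exact paging mentioned in item 4 is typically expressed in Elasticsearch through the `from` and `size` search parameters. The helper below is a small sketch of that calculation; the function name is hypothetical, and the limit of 10,000 reflects the default value of the `index.max_result_window` setting, beyond which a plain `from`/`size` request is rejected.

```python
def page_window(page: int, size: int, max_result_window: int = 10_000) -> dict:
    """Convert a 1-based page number into Elasticsearch from/size parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size
    if start + size > max_result_window:
        raise ValueError("requested page lies beyond max_result_window")
    return {"from": start, "size": size}


print(page_window(3, 20))  # third page of 20 results
```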

Drawings

Fig. 1 shows a schematic diagram of the method of the present invention.

Detailed Description

The invention is further described below with reference to an embodiment and the accompanying drawing; the content of the embodiment does not limit the scope of the invention.

Referring to fig. 1, the method for quickly searching mass data of the present invention comprises the following steps:

1) Constructing a mass data storage system;

The mass data storage system is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed to serve as the unique identifier of each record. The data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.

2) Establishing a secondary index for the data in the mass data storage system;

21) Configuring the index settings:

211) Setting the number of shards, i.e., the value of the number_of_shards parameter, to the number of cluster nodes;

212) Setting the value of the number_of_replicas parameter, which represents the number of redundant copies of the data, to 0, i.e., no replicas;

213) Setting the data compression mode, i.e., setting the value of the codec parameter to best_compression, which compresses the data more effectively and noticeably reduces disk usage;
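The three settings in steps 211) to 213) correspond to the following index settings body, shown here as a Python dictionary that could be serialized and sent when the index is created. The function name and the example node count of 3 are assumptions for illustration.

```python
import json


def index_settings(cluster_nodes: int) -> dict:
    """Index settings from steps 211) to 213)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": cluster_nodes,  # one shard per cluster node
                "number_of_replicas": 0,            # no redundant copies
                "codec": "best_compression",        # smaller stored data on disk
            }
        }
    }


print(json.dumps(index_settings(3), indent=2))
```

Setting the replica count to 0 saves disk space but leaves the index without redundancy; since the index can be rebuilt from the main data in HBase, this trade-off is consistent with the design described here.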

22) Configuring custom analyzers (an analyzer has one tokenizer and zero or more token filters):

221) Defining a custom tokenizer of type edge_ngram, with the min_gram parameter set to 1, the max_gram parameter set to 10, and token_chars set to letter and digit, so that letters and digits are kept in tokens and the tokenizer splits on any other character; this suits the index fields in the requirements;

222) Defining the custom index analyzer for the PLATE_NO field: the custom tokenizer from the previous step is selected as its tokenizer, and its token filters are set to the built-in lowercase token filter, which suits the PLATE_NO field;

223) Defining the custom query analyzer for the PLATE_NO field: its tokenizer is set to the built-in keyword tokenizer and its token filters are set to the built-in lowercase token filter; the custom query analyzer is used together with the custom index analyzer of the PLATE_NO field, which meets the requirements and allows the required data to be found more efficiently and accurately when querying by PLATE_NO;
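The analyzer configuration in steps 221) to 223) can be written as the analysis settings below. The names `plate_edge_ngram`, `plate_index_analyzer` and `plate_search_analyzer` are hypothetical labels chosen for this sketch. The `edge_ngrams` function is only a rough Python emulation of what the index analyzer emits, included to show why a lowercased prefix typed by the user matches at query time.

```python
ANALYSIS_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "plate_edge_ngram": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 10,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            # Index time: prefixes of each letter/digit run, lowercased.
            "plate_index_analyzer": {
                "type": "custom",
                "tokenizer": "plate_edge_ngram",
                "filter": ["lowercase"],
            },
            # Query time: the whole input as a single lowercased token.
            "plate_search_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            },
        },
    }
}


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 10) -> list:
    """Approximate the tokens produced by the index analyzer above."""
    tokens, word = [], ""
    for ch in text + " ":  # the trailing space flushes the last word
        if ch.isalpha() or ch.isdigit():
            word += ch
            continue
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
        word = ""
    return tokens


print(edge_ngrams("苏A12345"))
```

Because the query analyzer keeps the input as one token, a search for a plate prefix matches exactly one of the indexed prefixes rather than being split into n-grams itself.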

23) Configuring the field mappings:

231) Setting the type of the ROWKEY field to object and the value of its enabled parameter to false. The value of this field is only used to look up data in HBase, and data never needs to be retrieved by this field in Elasticsearch. With enabled set to false, Elasticsearch skips parsing of the field content entirely; the value can still be obtained from the _source field, but it cannot be searched, is not indexed, and is not stored in any other way, which reduces disk usage;

232) Setting the type of the cross-INDEX field, the PLATE_TYPE field and the PLATE_COLOR field to keyword. A keyword field can only be retrieved by its exact value, which suits these fields because they hold structured content and are typically used for filtering, sorting and aggregation;

233) Setting the type of the PLATE_NO field to text, which is suitable for fields that need full-text retrieval. A text field stores normalization factors in the index so that documents can be scored; if the field only needs to be matched and the resulting score is of no concern, the norms parameter can be set to false. By default, a text field also stores term frequencies and positions in the index; frequencies are used to calculate scores and positions are used to run phrase queries, so if phrase queries are not needed, the index_options parameter can be set to freqs so that Elasticsearch does not index positions. These settings speed up queries and reduce disk usage. The index-time analyzer of the PLATE_NO field is set to the custom index analyzer and its search_analyzer parameter is set to the custom query analyzer; the text field is additionally mapped as a keyword field through multi-fields for sorting or aggregation;

234) Setting the type of the pass_time field to date, with its format specified by the format parameter.
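Steps 231) to 234) can be collected into a mappings body such as the one sketched below. Several details are assumptions rather than facts from the source: the analyzer names are placeholders that must match whatever analyzers the index settings define, the sub-field name `raw` and the date format string are illustrative, the typeless `properties` layout assumes a recent Elasticsearch version, and the crossing field of step 232) is left out because its exact name is unclear in the text.

```python
def field_mappings(
    index_analyzer: str = "plate_index_analyzer",
    search_analyzer: str = "plate_search_analyzer",
) -> dict:
    """Field mappings following steps 231) to 234)."""
    return {
        "mappings": {
            "properties": {
                # Kept in _source only: never parsed, indexed or searchable.
                "ROWKEY": {"type": "object", "enabled": False},
                # Exact-value fields for filtering, sorting and aggregation.
                "PLATE_TYPE": {"type": "keyword"},
                "PLATE_COLOR": {"type": "keyword"},
                "PLATE_NO": {
                    "type": "text",
                    "analyzer": index_analyzer,
                    "search_analyzer": search_analyzer,
                    "norms": False,            # scores are not needed
                    "index_options": "freqs",  # no positions, no phrase queries
                    "fields": {"raw": {"type": "keyword"}},
                },
                "pass_time": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            }
        }
    }
```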

3) Starting a data retrieval service and listening for HTTP requests;

Writing a back-end service interface that listens for requests from other programs and returns the data results they require.

the results list is returned to the client along the call stack.
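A back-end service interface of the kind described in step 3) could be sketched with the Python standard library as follows. The patent does not specify a framework, URL layout or port, so the `/search` path, port 8080 and the placeholder `run_query` function are assumptions; a real service would perform the index lookup and the HBase read inside `run_query`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def parse_conditions(path: str) -> dict:
    """Turn the query string of a request path into retrieval conditions."""
    query = parse_qs(urlparse(path).query)
    return {name: values[0] for name, values in query.items() if values}


def run_query(conditions: dict) -> list:
    """Placeholder for the index lookup followed by the HBase read."""
    return []


class RetrievalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        conditions = parse_conditions(self.path)
        payload = json.dumps(run_query(conditions), ensure_ascii=False).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Listen for HTTP requests until the process is stopped."""
    HTTPServer(("0.0.0.0", port), RetrievalHandler).serve_forever()
```

With the service running, a request such as `GET /search?PLATE_NO=a12&PLATE_COLOR=blue` would be parsed into the conditions `{"PLATE_NO": "a12", "PLATE_COLOR": "blue"}`.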

The invention stores mass data in the distributed column database HBase and uses the distributed full-text search engine Elasticsearch to build a secondary index for the data. When data is retrieved, HBase is not queried directly: the RowKeys of the data are first looked up through the secondary index, the data is then fetched from HBase by RowKey, and the results that satisfy the conditions are returned.
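The two-phase flow summarized above can be illustrated with in-memory stand-ins for both stores. This is a toy sketch, not the actual implementation: `INDEX` plays the role of the Elasticsearch secondary index, `STORE` plays the role of the HBase table, and all names and sample values are hypothetical.

```python
def retrieve(conditions: dict, search_rowkeys, fetch_by_rowkeys) -> list:
    """Phase 1: resolve conditions to RowKeys. Phase 2: read rows by RowKey."""
    rowkeys = search_rowkeys(conditions)
    if not rowkeys:
        return []
    return fetch_by_rowkeys(rowkeys)


# Secondary index: query fields plus the RowKey of the main data.
INDEX = [
    {"PLATE_COLOR": "blue", "PLATE_TYPE": "02", "ROWKEY": "rk1"},
    {"PLATE_COLOR": "yellow", "PLATE_TYPE": "01", "ROWKEY": "rk2"},
]
# Main data: full records keyed by RowKey.
STORE = {
    "rk1": {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "SPEED": "62"},
    "rk2": {"PLATE_NO": "B67890", "PLATE_COLOR": "yellow", "SPEED": "48"},
}


def search_in_memory(conditions: dict) -> list:
    return [
        doc["ROWKEY"]
        for doc in INDEX
        if all(doc.get(field) == value for field, value in conditions.items())
    ]


def fetch_in_memory(rowkeys: list) -> list:
    return [STORE[rk] for rk in rowkeys if rk in STORE]


print(retrieve({"PLATE_COLOR": "blue"}, search_in_memory, fetch_in_memory))
```

The main store is only ever read by exact key, so changing the query conditions affects the index alone.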

The present invention has been described in terms of the preferred embodiments thereof, and it should be understood by those skilled in the art that various modifications can be made without departing from the principles of the invention, and such modifications should also be considered as being within the scope of the invention.

Claims

1. A method for quickly searching mass data is characterized by comprising the following steps:

1) Constructing a mass data storage system;

2) Establishing a secondary index for the data in the mass data storage system;

3) Starting a data retrieval service and listening for HTTP requests;

4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result;

5) According to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning the parsed structured data;

the step 2) specifically comprises the following steps:

21) Configuring the index settings:

212) Setting the value of the number_of_replicas parameter, which indicates the number of redundant copies of the data, to 0, indicating no replicas;

213) Setting the data compression mode by setting the value of the codec parameter to best_compression;

22) Configuring custom analyzers:

221) Defining a custom tokenizer of type edge_ngram, with the min_gram parameter set to 1, the max_gram parameter set to 10, and token_chars set to letter and digit, meaning that letters and digits are kept in tokens and the tokenizer splits on any other character;

222) Defining the custom index analyzer for the PLATE_NO field, wherein the custom tokenizer from the previous step is selected as its tokenizer and its token filters are set to the built-in lowercase token filter;

223) Defining the custom query analyzer for the PLATE_NO field, wherein its tokenizer is set to the built-in keyword tokenizer and its token filters are set to the built-in lowercase token filter, and the custom query analyzer is used together with the custom index analyzer of the PLATE_NO field;

23) Configuring the field mappings:

231) Setting the type of the ROWKEY field to object and the value of its enabled parameter to false, wherein the value of this field is only used to look up data in HBase and data does not need to be retrieved by this field in Elasticsearch; with the enabled parameter set to false, Elasticsearch skips parsing of the field content entirely, and the value is obtained from the _source field;

232) Setting the type of the cross-word field, the PLATE_TYPE field and the PLATE_COLOR field to keyword, wherein a keyword field is retrieved by its exact value;

233) Setting the type of the PLATE_NO field to text; a text field stores normalization factors in the index so that documents can be scored, and if the field only needs to be matched, the norms parameter is set to false; if phrase queries do not need to be run, the index_options parameter is set to freqs so that Elasticsearch does not index positions; the index-time analyzer of the PLATE_NO field is set to the custom index analyzer and its search_analyzer parameter is set to the custom query analyzer; and the text field is additionally mapped as a keyword field through multi-fields for sorting or aggregation;

2. The method for fast searching for massive data according to claim 1, wherein the massive data storage system in step 1) is Apache HBase, a distributed, scalable massive data storage system built on Hadoop; the business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed to serve as the unique identifier of each record; and the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance.

3. The method for fast searching for massive data according to claim 1, wherein the step 3) specifically includes: writing a back-end service interface that listens for requests from other programs and returns the data results they require.

4. The method for fast searching for massive data according to claim 1, wherein the step 4) specifically includes:

5. The method for fast searching for massive data according to claim 1, wherein the step 5) specifically includes:

the results list is returned to the client along the call stack.