CN111680043B - Method for quickly retrieving mass data - Google Patents

Info

Publication number: CN111680043B
Authority: CN (China)
Prior art keywords: data, field, index, setting, rowkey
Legal status: Active (The legal status is an assumption and is not a legal conclusion.)
Application number: CN202010505012.5A
Other languages: Chinese (zh)
Other versions: CN111680043A (en)
Inventors: 徐晓贝, 陈胡, 陈宽, 陶伟洋, 叶兆裕, 王远友
Current Assignee: Nanjing LES Information Technology Co. Ltd (The listed assignees may be inaccurate.)
Original Assignee: Nanjing LES Information Technology Co. Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G06F16/24 Querying
    • G06F16/242 Query formulation

Abstract

The invention discloses a method for quickly retrieving mass data, which comprises the following steps: constructing a mass data storage system; establishing a secondary index for the data in the mass data storage system; starting a data retrieval service and listening for HTTP requests; parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result; and, according to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning it. The method can quickly retrieve mass data by multiple conditions, return the query result within a very short time, and overcome the shortcomings of the prior art at minimal cost.

Description

Method for quickly retrieving mass data
Technical Field
The invention belongs to the technical field of quick retrieval of big data, and particularly relates to a quick retrieval method for massive data.
Background
With the development of society and technology, massive amounts of data are generated in different fields every day, and storing and using this data has become a very challenging technical problem. For example, in the transportation industry, in a county-level city with 3 million people, the video detectors generate 10 million vehicle-passage records. An ordinary transactional information management system stores this data in a relational database. In the first year the data can still be retrieved normally, but once two or more years of data have accumulated, the data being searched for can no longer be found within a short time, even after the query method and the storage design have been optimized. How to store mass data more effectively and retrieve it quickly is the problem to be solved.
At present, most solutions under construction add storage nodes to the relational database and build a large number of indexes to achieve fast retrieval. However, the maintenance cost of these indexes is very high: once the data changes, the indexes are rebuilt in batches, and because the indexes and the data live in the same database instance, rebuilding the indexes directly degrades database performance and affects queries that are being executed.
At present, there are two fast query schemes based on big data technology, as follows:
1. To support fast queries by condition, the RowKey of HBase is designed around the query conditions: the RowKey contains all of the query conditions and serves as a globally unique index. The significant drawback is that once the query conditions change, the RowKey must be redesigned; the original primary data can no longer be used and must be regenerated under the new RowKey. The same business data therefore has to be stored in multiple copies under different RowKeys, which wastes a huge amount of storage space. This is an almost fatal problem.
2. The data is likewise stored in the distributed column database HBase, and secondary indexes are designed around the query conditions. The secondary index is stored in the same table and the same storage region as the primary data, so a query first locates the secondary index and then locates the primary data directly in the same region. The advantage is that, because the index and the primary data share a storage region, no time is spent fetching the primary data across nodes. This secondary index design solves the storage waste that scheme 1 suffers when the query conditions change, but it creates a new problem. HBase matches a RowKey from front to back by the ASCII codes of its characters, so when there are several query conditions, a very large number of secondary indexes is needed to serve the various combinations of conditions. When the number of conditions reaches 7 to 8, there are so many indexes that the storage space they occupy may exceed that of the primary data.
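The growth described in scheme 2 can be made concrete with a little arithmetic. The sketch below is an illustrative estimate and not part of the cited schemes: it assumes each secondary index is one fixed ordering of the condition fields, and that a query can use an index only when its conditions form a leading prefix of that ordering. The function names are hypothetical.

```python
from math import comb


def condition_combinations(n: int) -> int:
    """Number of non-empty subsets of n query conditions."""
    return 2 ** n - 1


def min_prefix_indexes(n: int) -> int:
    """Lower bound on orderings needed so every subset is a prefix of one.

    An ordering has exactly one prefix of each length k, so covering all
    subsets of size k takes at least C(n, k) orderings; the bound is
    largest at k = n // 2.
    """
    return comb(n, n // 2)


for n in (3, 7, 8):
    print(n, condition_combinations(n), min_prefix_indexes(n))
```

Under these assumptions, 7 conditions give 127 possible combinations and need at least 35 orderings, and 8 conditions give 255 combinations and need at least 70.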
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a method for quickly retrieving mass data. The method can quickly retrieve mass data by multiple conditions, return the query result within a very short time, and overcome the shortcomings of the prior art at minimal cost.
To achieve the above purpose, the invention adopts the following technical scheme:
The invention discloses a method for quickly retrieving mass data, which comprises the following steps:
1) Constructing a mass data storage system;
2) Establishing a secondary index for the data in the mass data storage system;
3) Starting a data retrieval service and listening for HTTP requests;
4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);
5) According to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning it.
Further, the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed as the unique identifier of each record; the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids hotspots.
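One common way to spread records evenly across RegionServers is to prefix the RowKey with a short hash-derived salt. The source does not say which RowKey layout is used, so the layout, the field names and the bucket count in this sketch are assumptions made purely for illustration.

```python
import hashlib


def salted_rowkey(plate_no: str, pass_time: str, buckets: int = 16) -> str:
    """Build a RowKey whose leading salt spreads records across regions.

    Hypothetical layout: <2-digit salt>|<plate_no>|<pass_time>.
    The salt is derived from the plate number, so the same plate always
    lands in the same bucket and its records stay adjacent.
    """
    digest = hashlib.md5(plate_no.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}|{plate_no}|{pass_time}"


key = salted_rowkey("A12345", "20200605083000")
print(key)
```

The salt costs nothing at query time in this design, because lookups always arrive with the full RowKey taken from the secondary index.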
Further, step 2) specifically includes:
21) Using Elasticsearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the respective query conditions; the index fields include all of the query fields, plus a RowKey field that holds the unique primary key of the record matching the conditions;
22) Extracting all of the data in HBase and, for each record, inserting the values of the fields defined in the Elasticsearch index into that index; this process is called indexing the data;
23) When creating the Elasticsearch index, setting different field types according to how each field is queried: fields used in fuzzy queries use the IK analyzer, and fields matched on their full value are set to keyword.
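Step 22) can be sketched as a function that turns HBase records into index documents. This is a minimal illustration under stated assumptions: the list of query fields, the index name `vehicle_pass` and the sample record are hypothetical, and the action dictionaries use the `_index` / `_id` / `_source` shape accepted by the bulk helper of the official Elasticsearch Python client.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical query fields copied from each HBase record into the index.
QUERY_FIELDS = ("PLATE_NO", "PLATE_TYPE", "PLATE_COLOR", "pass_time")


def to_index_actions(
    rows: Iterable[Tuple[str, Dict[str, str]]], index_name: str
) -> Iterator[dict]:
    """Turn (rowkey, record) pairs into bulk-index actions.

    Each document carries only the query fields plus the ROWKEY, so the
    index stays much smaller than the primary data.
    """
    for rowkey, record in rows:
        doc = {f: record[f] for f in QUERY_FIELDS if f in record}
        doc["ROWKEY"] = rowkey
        yield {"_index": index_name, "_id": rowkey, "_source": doc}


rows = [
    (
        "03|A12345|20200605083000",
        {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "image": "<large blob>"},
    )
]
actions = list(to_index_actions(rows, "vehicle_pass"))
print(actions[0]["_source"])
```

Note that the `image` column never reaches the index; only the fields that queries filter on are copied.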
Further, step 3) specifically includes: writing a back-end service interface that listens for requests from other programs and returns the data results they require.
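A back-end interface of this kind could look like the following sketch, which uses only the Python standard library. The source does not name a web framework, URL scheme or parameter names, so the `/search` path, the port and the handler below are assumptions; the handler only echoes the parsed conditions where a real service would run the lookup.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def parse_conditions(path: str) -> dict:
    """Extract single-valued query parameters from a request path."""
    query = parse_qs(urlparse(path).query)
    return {name: values[0] for name, values in query.items()}


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        conditions = parse_conditions(self.path)
        # Placeholder: a real service would run the index lookup and the
        # HBase read here; this sketch only echoes the parsed conditions.
        body = json.dumps({"conditions": conditions}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Listen for HTTP requests until interrupted."""
    HTTPServer(("", port), SearchHandler).serve_forever()


print(parse_conditions("/search?PLATE_NO=a123&PLATE_COLOR=blue"))
```

Calling `serve()` starts the listener; it is left uncalled here so that the example terminates.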
Further, step 4) specifically includes:
the back-end service interface parses the request content, creates an Elasticsearch Client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and returns the result set of queried RowKeys.
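The translation from request conditions to an index query might be sketched as below. The request parameter names `time_from` and `time_to` and the choice of a `bool` query with `filter` clauses are assumptions for illustration; the source only states that retrieval conditions are generated and a RowKey set is returned. The sketch builds and reads plain dictionaries in the Elasticsearch search request and response shape, so it runs without a cluster.

```python
from typing import List

KEYWORD_FIELDS = ("PLATE_TYPE", "PLATE_COLOR")  # exact-match fields


def build_search_body(conditions: dict, offset: int = 0, size: int = 20) -> dict:
    """Translate request conditions into an Elasticsearch bool query."""
    filters = [
        {"term": {field: conditions[field]}}
        for field in KEYWORD_FIELDS
        if field in conditions
    ]
    if "PLATE_NO" in conditions:
        # match applies the field's search analyzer to the input text
        filters.append({"match": {"PLATE_NO": conditions["PLATE_NO"]}})
    bounds = {
        op: conditions[name]
        for op, name in (("gte", "time_from"), ("lte", "time_to"))
        if name in conditions
    }
    if bounds:
        filters.append({"range": {"pass_time": bounds}})
    return {
        "query": {"bool": {"filter": filters}},
        "_source": ["ROWKEY"],  # only the RowKey is needed downstream
        "from": offset,
        "size": size,
    }


def extract_rowkeys(response: dict) -> List[str]:
    """Pull the RowKey out of every hit in a search response."""
    return [hit["_source"]["ROWKEY"] for hit in response["hits"]["hits"]]


body = build_search_body({"PLATE_NO": "a123", "PLATE_COLOR": "blue"})
sample_response = {"hits": {"hits": [{"_source": {"ROWKEY": "r1"}},
                                     {"_source": {"ROWKEY": "r7"}}]}}
print(body["query"], extract_rowkeys(sample_response))
```

Filter clauses skip relevance scoring, which fits a lookup whose only output is a set of keys.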
Further, step 5) specifically includes:
creating an HBase Client API instance, passing the RowKey set to the HBase Client API, accessing the HBase service, and packaging the returned result set into a result list;
the result list is returned to the client along the call stack.
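The batch read in step 5) can be sketched as follows. The source does not name a client library; this example assumes a table object with a `rows(keys)` method in the style of the happybase `Table.rows` call, which returns `(key, {column: value})` pairs as bytes. `FakeTable` is a hypothetical in-memory stand-in so the sketch runs without an HBase cluster, and the UTF-8 decoding assumes text-valued cells.

```python
from typing import Dict, List


def fetch_records(table, rowkeys: List[str]) -> List[Dict[str, str]]:
    """Read rows by RowKey and return them in the order of `rowkeys`."""
    found = {
        key.decode("utf-8"): {
            col.decode("utf-8"): val.decode("utf-8") for col, val in data.items()
        }
        for key, data in table.rows([k.encode("utf-8") for k in rowkeys])
    }
    # Keep the index's ordering and skip keys that the store no longer holds.
    return [{"ROWKEY": k, **found[k]} for k in rowkeys if k in found]


class FakeTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self, data):
        self._data = data

    def rows(self, keys):
        return [(k, self._data[k]) for k in keys if k in self._data]


table = FakeTable({
    b"r1": {b"cf:PLATE_NO": b"A12345"},
    b"r2": {b"cf:PLATE_NO": b"B67890"},
})
print(fetch_records(table, ["r2", "missing", "r1"]))
```

Reordering by the incoming RowKey list matters because the sort order and page boundaries were decided by the index, not by HBase.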
The invention has the following beneficial effects:
1. Any change to the secondary index is independent of the primary data and does not affect how the primary data is stored.
2. Only the query conditions and the RowKey of the primary data are stored in the secondary index, so the secondary index occupies very little storage space.
3. When the query conditions change, only a new Elasticsearch index needs to be built.
4. Because of the distributed storage design of HBase, queries against HBase cannot provide exact paging; Elasticsearch does provide exact paging, which overcomes the inability to page when querying HBase data directly.
5. With this technique, query efficiency does not fluctuate greatly as the data volume changes; the query times for 1 billion and 10 billion records are nearly the same.
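The paging benefit in item 4 rests on the `from` and `size` parameters of an Elasticsearch search request. A minimal sketch of the mapping from a page number to those parameters is shown below; the function name and the 1-based page convention are assumptions.

```python
def page_window(page: int, page_size: int) -> dict:
    """Map a 1-based page number to Elasticsearch from/size parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


print(page_window(3, 20))
```

By default Elasticsearch limits `from + size` to 10,000 results per index (the `index.max_result_window` setting), so very deep paging needs either a raised limit or a different mechanism such as `search_after`.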
Drawings
Fig. 1 shows a schematic diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to an embodiment and the accompanying drawing, which are not intended to limit the scope of the invention.
Referring to Fig. 1, the method for quickly retrieving mass data of the invention comprises the following steps:
1) Constructing a mass data storage system;
the mass data storage system is Apache HBase, a distributed, scalable mass data storage system built on Hadoop. The business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed as the unique identifier of each record; the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids hotspots.
2) Establishing a secondary index for the data in the mass data storage system;
21) Index settings:
211) Setting the number of shards, i.e., the value of the number_of_shards parameter, to the number of cluster nodes;
212) Setting the value of the number_of_replicas parameter, which is the number of redundant copies of the data, to 0, i.e., no replicas;
213) Setting the data compression mode, i.e., setting the value of the codec parameter to best_compression; this compresses the data more effectively and noticeably reduces disk usage;
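Steps 211) to 213) correspond to three index-level settings. The sketch below expresses them as the settings dictionary that would be sent when creating the index; the function name and the way the node count is supplied are assumptions.

```python
def index_settings(cluster_nodes: int) -> dict:
    """Index-level settings described in steps 211) to 213)."""
    return {
        "index": {
            "number_of_shards": cluster_nodes,  # one primary shard per node
            "number_of_replicas": 0,            # no redundant copies
            "codec": "best_compression",        # smaller footprint on disk
        }
    }


print(index_settings(3))
```

The shard count and the codec are fixed when the index is created, whereas the replica count can be changed on a live index.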
22) Configuring the custom analyzers (an analyzer has exactly one tokenizer and zero or more token filters):
221) Defining a custom tokenizer of type edge_ngram, with the parameter min_gram set to 1, the parameter max_gram set to 10, and token_chars set to letter and digit, so that the tokenizer keeps letters and digits in a token and splits on any other character, which suits the index fields in this requirement;
222) Defining the custom index analyzer of the field PLATE_NO: its tokenizer is the custom tokenizer from the previous step and its token filter is the built-in lowercase token filter, which suits the field PLATE_NO;
223) Defining the custom search analyzer of the field PLATE_NO: its tokenizer is the built-in keyword tokenizer and its token filter is the built-in lowercase token filter; the custom search analyzer of PLATE_NO is used together with the custom index analyzer. This suits the requirement and makes queries by PLATE_NO more efficient and accurate;
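The analyzer pair in steps 221) to 223) can be written as an `analysis` settings block, shown here as a Python dictionary. The tokenizer and analyzer names are hypothetical. The two helper functions are a rough pure-Python approximation of what the analyzers produce, included only to show why pairing an edge n-gram index analyzer with a keyword search analyzer yields prefix matching; they are not the Elasticsearch implementation.

```python
import re

ANALYSIS = {
    "tokenizer": {
        "plate_edge_ngram": {  # hypothetical name
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 10,
            "token_chars": ["letter", "digit"],
        }
    },
    "analyzer": {
        "plate_index_analyzer": {  # hypothetical name
            "type": "custom",
            "tokenizer": "plate_edge_ngram",
            "filter": ["lowercase"],
        },
        "plate_search_analyzer": {  # hypothetical name
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"],
        },
    },
}


def index_tokens(text: str, min_gram: int = 1, max_gram: int = 10) -> list:
    """Approximate the index analyzer: split, edge n-gram, lowercase."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):  # runs of letters and digits
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
    return tokens


def search_token(text: str) -> str:
    """Approximate the search analyzer: the whole input, lowercased."""
    return text.lower()


print(index_tokens("AB-12"))
print(search_token("Ab") in index_tokens("AB-12"))
```

Because every prefix of the plate number is already stored as a term at index time, a prefix query becomes a single exact term lookup at search time.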
23) Configuring the field mappings:
231) Setting the type of the ROWKEY field to object and the value of its enabled parameter to false. The value of this field is only used to retrieve data from HBase; the data never needs to be searched by this field in Elasticsearch. With enabled set to false, Elasticsearch skips parsing of the field content entirely; the value can still be obtained from the _source field, but it cannot be searched, is not indexed, and is not stored in any other way, which reduces disk usage;
232) Setting the type of the cross-INDEX field, the PLATE_TYPE field and the PLATE_COLOR field to keyword. A keyword field can only be searched by its exact value; these fields hold structured content and are typically used for filtering, sorting and aggregation;
233) Setting the type of the PLATE_NO field to text; the text type is suited to fields that need full-text search. A text field stores normalization factors in the index so that documents can be scored; if the field only needs to be matched and the resulting score does not matter, the norms parameter may be set to false. By default, a text field also stores term frequencies and positions in the index; frequencies are used to compute scores and positions are used to run phrase queries. If phrase queries are not needed, the index_options parameter may be set to freqs so that Elasticsearch does not index positions. These settings speed up queries and reduce disk usage. The analyzer parameter of the PLATE_NO field is set to the custom index analyzer and the search_analyzer parameter to the custom search analyzer; in addition, the text field is also mapped as a keyword field via multi-fields for sorting or aggregation;
234) Setting the type of the pass_time field to date, with its format set by the format parameter.
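The field mappings in steps 231) to 234) can be collected into one mappings dictionary, sketched below. The analyzer names and the date format string are assumptions, since the source names neither, and the first field of step 232) is omitted because its name is unclear in the source. The helper simply lists which fields remain searchable.

```python
MAPPINGS = {
    "properties": {
        # Kept in _source only; never parsed, indexed or searched.
        "ROWKEY": {"type": "object", "enabled": False},
        "PLATE_TYPE": {"type": "keyword"},
        "PLATE_COLOR": {"type": "keyword"},
        "PLATE_NO": {
            "type": "text",
            "norms": False,               # scores are not needed
            "index_options": "freqs",     # no positions, so no phrase queries
            "analyzer": "plate_index_analyzer",          # hypothetical name
            "search_analyzer": "plate_search_analyzer",  # hypothetical name
            "fields": {"keyword": {"type": "keyword"}},  # for sort/aggregate
        },
        # The format string is an assumed example.
        "pass_time": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
    }
}


def searchable_fields(mappings: dict) -> list:
    """Names of fields that Elasticsearch will actually index."""
    return [
        name
        for name, spec in mappings["properties"].items()
        if spec.get("enabled", True)
    ]


print(searchable_fields(MAPPINGS))
```

With this mapping, sorting or aggregating on the plate number would address the `PLATE_NO.keyword` sub-field rather than the analyzed text field.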
3) Starting a data retrieval service and listening for HTTP requests;
writing a back-end service interface that listens for requests from other programs and returns the data results they require.
4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);
the back-end service interface parses the request content, creates an Elasticsearch Client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and returns the result set of queried RowKeys.
5) According to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning it.
Creating an HBase Client API instance, passing the RowKey set to the HBase Client API, accessing the HBase service, and packaging the returned result set into a result list;
the result list is returned to the client along the call stack.
The invention stores mass data in the distributed column database HBase and builds a secondary index for the data with the distributed full-text search engine Elasticsearch. When retrieving data, HBase is not queried directly; instead, the RowKeys of the data are first looked up through the secondary index, and the data is then read from HBase by RowKey and the results that satisfy the conditions are returned.
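The overall two-stage lookup can be illustrated with in-memory stand-ins. In this sketch a list of small documents plays the role of the Elasticsearch index and a dictionary keyed by RowKey plays the role of HBase; the data, the field names and the `search` function are all hypothetical and only show the shape of the flow.

```python
def search(index_docs, store, predicate, offset=0, size=10):
    """Two-stage lookup: filter the index, then read rows by RowKey."""
    rowkeys = [doc["ROWKEY"] for doc in index_docs if predicate(doc)]
    page = rowkeys[offset:offset + size]
    return [store[key] for key in page if key in store]


# The index holds only the query fields and the RowKey.
index_docs = [
    {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "ROWKEY": "r1"},
    {"PLATE_NO": "B67890", "PLATE_COLOR": "yellow", "ROWKEY": "r2"},
    {"PLATE_NO": "A12999", "PLATE_COLOR": "blue", "ROWKEY": "r3"},
]
# The store holds the full records, including fields that are never queried.
store = {
    "r1": {"PLATE_NO": "A12345", "image": "img-1"},
    "r2": {"PLATE_NO": "B67890", "image": "img-2"},
    "r3": {"PLATE_NO": "A12999", "image": "img-3"},
}

results = search(
    index_docs,
    store,
    lambda d: d["PLATE_NO"].startswith("A12") and d["PLATE_COLOR"] == "blue",
)
print(results)
```

Changing the query conditions only changes what is kept in `index_docs`; the contents of `store` and its keys stay exactly as they are.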
The invention has been described in terms of its preferred embodiment. Those skilled in the art should understand that various modifications can be made without departing from the principles of the invention, and such modifications should also be regarded as falling within the scope of the invention.

Claims (5)

1. A method for quickly retrieving mass data, characterized by comprising the following steps:
1) Constructing a mass data storage system;
2) Establishing a secondary index for the data in the mass data storage system;
3) Starting a data retrieval service and listening for HTTP requests;
4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result;
5) According to the ROWKEY of the data in the response result, reading the structured data corresponding to that ROWKEY from the HBase service, parsing the retrieved structured data and returning the parsed structured data;
step 2) specifically comprises the following steps:
21) Index settings:
211) Setting the number of shards, i.e., the value of the number_of_shards parameter, to the number of cluster nodes;
212) Setting the value of the number_of_replicas parameter, which is the number of redundant copies of the data, to 0, which means no replicas;
213) Setting the data compression mode by setting the value of the codec parameter to best_compression;
22) Configuring the custom analyzers:
221) Defining a custom tokenizer of type edge_ngram, with the parameter min_gram set to 1, the parameter max_gram set to 10, and token_chars set to letter and digit, which means that the tokenizer keeps letters and digits in a token and splits on any other character;
222) Defining the custom index analyzer of the field PLATE_NO, with the custom tokenizer from the previous step as its tokenizer and the built-in lowercase token filter as its token filter;
223) Defining the custom search analyzer of the field PLATE_NO, with the built-in keyword tokenizer as its tokenizer and the built-in lowercase token filter as its token filter, the custom search analyzer of the field PLATE_NO being used together with the custom index analyzer;
23) Configuring the field mappings:
231) Setting the type of the ROWKEY field to object and the value of its enabled parameter to false, wherein the value of the field is only used to retrieve data from HBase and the data does not need to be searched by this field in Elasticsearch; with the enabled parameter set to false, Elasticsearch skips parsing of the field content entirely, and the value is obtained from the _source field;
232) Setting the type of the cross-word field, the PLATE_TYPE field and the PLATE_COLOR field to keyword, a keyword field being searched by its exact value;
233) Setting the type of the PLATE_NO field to text; a text field stores normalization factors in the index so that documents can be scored, and if the field only needs to be matched, the norms parameter is set to false; if phrase queries do not need to be run, the index_options parameter is set to freqs so that Elasticsearch does not index positions; the analyzer parameter of the PLATE_NO field is set to the custom index analyzer and the search_analyzer parameter to the custom search analyzer; and the text field is also mapped as a keyword field via multi-fields for sorting or aggregation;
234) Setting the type of the pass_time field to date, with its format set by the format parameter.
2. The method for quickly retrieving mass data according to claim 1, wherein the mass data storage system in step 1) is Apache HBase, a distributed, scalable mass data storage system built on Hadoop; the business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed as the unique identifier of each record; and the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance.
3. The method for quickly retrieving mass data according to claim 1, wherein step 3) specifically includes: writing a back-end service interface that listens for requests from other programs and returns the data results they require.
4. The method for quickly retrieving mass data according to claim 1, wherein step 4) specifically includes:
the back-end service interface parses the request content, creates an Elasticsearch Client API instance and specifies the index to use, sends the request to the Elasticsearch index service, and returns the result set of queried RowKeys.
5. The method for quickly retrieving mass data according to claim 1, wherein step 5) specifically includes:
creating an HBase Client API instance, passing the RowKey set to the HBase Client API, accessing the HBase service, and packaging the returned result set into a result list;
the result list is returned to the client along the call stack.
CN202010505012.5A 2020-06-05 2020-06-05 Method for quickly retrieving mass data Active CN111680043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010505012.5A CN111680043B (en) 2020-06-05 2020-06-05 Method for quickly retrieving mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010505012.5A CN111680043B (en) 2020-06-05 2020-06-05 Method for quickly retrieving mass data

Publications (2)

Publication Number Publication Date
CN111680043A (en) 2020-09-18
CN111680043B (en) 2023-11-28

Family

ID=72435070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010505012.5A Active CN111680043B (en) 2020-06-05 2020-06-05 Method for quickly retrieving mass data

Country Status (1)

Country Link
CN (1) CN111680043B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100510A (en) * 2020-11-18 2020-12-18 树根互联技术有限公司 Mass data query method and device based on Internet of vehicles platform
CN112632157B (en) * 2021-03-11 2021-07-27 全时云商务服务股份有限公司 Multi-condition paging query method under distributed system
CN114491199A (en) * 2022-01-25 2022-05-13 浙江大华技术股份有限公司 Data retrieval method, device and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682073A (en) * 2016-11-14 2017-05-17 上海轻维软件有限公司 HBase fuzzy retrieval system based on Elastic Search
CN109165222A (en) * 2018-08-20 2019-01-08 福州大学 A kind of HBase secondary index creation method and system based on coprocessor
CN109299102A (en) * 2018-10-23 2019-02-01 中国电子科技集团公司第二十八研究所 A kind of HBase secondary index system and method based on Elastcisearch
CN109800222A (en) * 2018-12-11 2019-05-24 中国科学院信息工程研究所 A kind of HBase secondary index adaptive optimization method and system

Also Published As

Publication number Publication date
CN111680043A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111680043B (en) Method for quickly retrieving mass data
Wei et al. Analyticdb-v: A hybrid analytical engine towards query fusion for structured and unstructured data
US6931408B2 (en) Method of storing, maintaining and distributing computer intelligible electronic data
CN106326429A (en) Hbase second-level query scheme based on solr
US20030097354A1 (en) Method and system for index sampled tablescan
CN107291964B (en) A method of fuzzy query is realized based on HBase
US10372718B2 (en) Systems and methods for enterprise data search and analysis
CN110659282B (en) Data route construction method, device, computer equipment and storage medium
CN105912609A (en) Data file processing method and device
KR20130049111A (en) Forensic index method and apparatus by distributed processing
Yu et al. Two birds, one stone: a fast, yet lightweight, indexing scheme for modern database systems
CN104731945A (en) Full-text searching method and device based on HBase
Cheng et al. Supporting entity search: a large-scale prototype search engine
CN110134717A (en) Research funding system data query system
Wang et al. Efficient query processing framework for big data warehouse: an almost join-free approach
CN113553491A (en) Industrial big data search optimization method based on inverted index
CN110109870A (en) A kind of mass data quick retrieval system based on Solr
CN107291938A (en) Order Query System and method
US20160004749A1 (en) Search system and search method
US10877998B2 (en) Highly atomized segmented and interrogatable data systems (HASIDS)
CN107633094B (en) Method and device for data retrieval in cluster environment
CN112131215B (en) Bottom-up database information acquisition method and device
CN111680072B (en) System and method for dividing social information data
CN114218347A (en) Method for quickly searching index of multiple file contents
CN105868406A (en) Multi-database based patent retrieval system

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant