CN111680043B - Method for quickly retrieving mass data - Google Patents
- Publication number
- CN111680043B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/221—Column-oriented storage; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
Abstract
The invention discloses a method for quickly retrieving mass data, comprising the following steps: constructing a mass data storage system; establishing a secondary index for the data in the mass data storage system; starting a data retrieval service and listening for HTTP requests; parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result; and, according to the ROWKEY of each record in the response result, reading the corresponding structured data from the HBase service, parsing the retrieved structured data, and returning it. The method can quickly retrieve mass data by multiple conditions, return query results within a very short time, and overcome the defects of prior-art schemes at minimal cost.
Description
Technical Field
The invention belongs to the technical field of quick retrieval of big data, and particularly relates to a quick retrieval method for massive data.
Background
With the development of society and technology, massive amounts of data are generated in different fields every day, and storing and using this data has become a very challenging technical problem. For example, in the transportation industry, in a county-level city with a population of 3 million, video detectors generate 10 million vehicle-passage records. A typical transactional information management system stores this data in a relational database. In the first year the data can be retrieved normally, but once two or more years of data have accumulated, the required data can no longer be queried in a short time, even after the query method and the storage design have been optimized. How to store mass data more effectively and retrieve it quickly is therefore a problem to be solved.
At present, most solutions add storage nodes to the relational database and build a large number of indexes to achieve quick retrieval. However, the maintenance cost of these indexes is very high: whenever data changes, the indexes are rebuilt in batches, and because the indexes and the data live in the same database instance, rebuilding the indexes directly degrades database performance and affects queries that are being executed.
At present, there are two fast query schemes based on big data technology, as follows:
1. To support fast conditional queries, the HBase RowKey is designed according to the query conditions: the RowKey contains all query conditions and thereby acts as a globally unique index. The significant drawback is that once the query conditions change, the RowKey must be redesigned, the original main data can no longer be used, and the data must be regenerated under the new RowKey. The same business data therefore has to be stored in multiple copies under different RowKeys, which wastes a huge amount of storage space. This is almost a fatal problem.
2. The data is likewise stored in the distributed column database HBase, and secondary indexes are designed according to the query conditions. The secondary index is stored in the same storage region as the main data, so a query first locates the secondary index and then locates the main data directly within the same region. The advantage is that the index and the main data share a storage region, which saves the time of fetching the main data across nodes. This secondary-index design solves the storage waste that scheme 1 incurs when query conditions change, but it creates a new problem: HBase matches RowKeys from front to back by their ASCII codes, so supporting arbitrary combinations of several query conditions requires a very large number of secondary indexes. When there are 7 to 8 conditions, the number of indexes becomes excessive, and the storage space occupied by the indexes may even exceed that of the main data.
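The front-to-back matching limitation described in scheme 2 can be illustrated with a small sketch. It assumes hypothetical composite row keys of the form `plate|color` held in sorted order, in the way HBase orders row keys lexicographically; the key layout, the sample values and the function names are invented for illustration and are not taken from the patent.

```python
from bisect import bisect_left

# Hypothetical composite row keys (plate number | plate color), kept sorted
# the way HBase orders row keys: lexicographically, character by character.
ROW_KEYS = sorted([
    "A12345|blue",
    "A12399|yellow",
    "B00001|blue",
    "B77777|green",
])


def prefix_scan(keys, prefix):
    """Return every key starting with `prefix` using a range scan on sorted keys."""
    start = bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees that no later key can match
        matches.append(key)
    return matches


def full_scan(keys, suffix):
    """A condition on a non-leading field cannot use the key order: scan everything."""
    return [key for key in keys if key.endswith(suffix)]


by_plate = prefix_scan(ROW_KEYS, "A123")  # served by a short range scan
by_color = full_scan(ROW_KEYS, "|blue")   # needs a full scan or a second index
```

A condition on the leading field is answered by a short range scan, whereas a condition on a trailing field either scans every key or needs another index whose key starts with that field, which is why the number of indexes grows with the number of condition combinations.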
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a rapid retrieval method for massive data; the method can quickly search mass data according to multiple conditions and return the query result within a very short time range, and solves the defects of the prior art scheme with minimum cost.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses a method for quickly searching mass data, which comprises the following steps:
1) Constructing a mass data storage system;
2) Establishing a secondary index for the data in the mass data storage system;
3) Starting a data retrieval service and listening for HTTP requests;
4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);
5) Reading the structured data corresponding to each ROWKEY in the response result from the HBase service, parsing the retrieved structured data, and returning it.
Further, the mass data storage system in step 1) is Apache HBase, a distributed and scalable mass data storage system built on Hadoop. The business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed to serve as the unique identifier of each record; the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.
Further, the step 2) specifically includes:
21) Using Elasticsearch as the carrier of the secondary index of the mass data storage system, and designing the index fields according to the respective query conditions; the index contains all query fields, plus a RowKey field that holds the unique primary key of the record that matches the conditions.
22) Extracting all data from HBase and inserting the values of the fields that correspond to the Elasticsearch index into that index; this process is referred to as indexing the data;
23) When creating the Elasticsearch index, setting different field types according to how each field is queried: fields used in fuzzy queries use the IK analyzer (word segmenter), and fields that require exact whole-value matching are set to keyword.
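Steps 21) and 22) can be sketched as a function that reduces one HBase row to its secondary-index document. This is a minimal sketch under stated assumptions: the row is represented as a plain dictionary, the `IMAGE_URL` and `SPEED` columns and the sample RowKey are hypothetical, and the function name `to_index_document` is introduced only for illustration.

```python
# Query fields copied into the index. PLATE_NO, PLATE_TYPE and PLATE_COLOR are
# field names used in the description; the selection itself is an assumption.
INDEX_FIELDS = ("PLATE_NO", "PLATE_TYPE", "PLATE_COLOR")


def to_index_document(row_key, row):
    """Build the secondary-index document for one HBase row.

    Only the query fields and the RowKey are kept, so the index stays
    much smaller than the main data.
    """
    doc = {field: row[field] for field in INDEX_FIELDS if field in row}
    doc["ROWKEY"] = row_key
    return doc


# Hypothetical HBase row with extra columns that are not query conditions.
row = {
    "PLATE_NO": "A12345",
    "PLATE_TYPE": "02",
    "PLATE_COLOR": "blue",
    "IMAGE_URL": "http://example.invalid/1.jpg",
    "SPEED": "63",
}
doc = to_index_document("20200605-0001", row)
```

Columns that are never used as query conditions stay in HBase only, which is what keeps the index small.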
Further, the step 3) specifically includes: writing a back-end service interface that listens for requests from other programs and returns the data results those programs require.
Further, the step 4) specifically includes:
the back-end service interface parses the request content, creates an Elasticsearch Client API instance, specifies the index to use, sends the request to the Elasticsearch index service, and returns the resulting set of RowKeys.
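The request-parsing part of step 4) might look like the following sketch, which turns an HTTP query string into an Elasticsearch search body. The parameter names `page` and `size`, the choice of `term` and `match` clauses, and the function name `build_search_body` are assumptions made for illustration; the patent does not specify the request format or the query type.

```python
from urllib.parse import parse_qs

# Fields assumed to be mapped as keyword (exact match) in the index.
EXACT_FIELDS = ("PLATE_TYPE", "PLATE_COLOR")


def build_search_body(query_string, default_size=20):
    """Turn an HTTP query string into an Elasticsearch search body.

    Only ROWKEY is requested from _source, because the full record
    is read from HBase afterwards.
    """
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    page = int(params.get("page", 1))
    size = int(params.get("size", default_size))

    # Exact-value conditions go into the non-scoring filter context.
    filters = [{"term": {f: params[f]}} for f in EXACT_FIELDS if f in params]
    must = []
    if "PLATE_NO" in params:
        must.append({"match": {"PLATE_NO": params["PLATE_NO"]}})

    return {
        "query": {"bool": {"must": must, "filter": filters}},
        "_source": ["ROWKEY"],
        "from": (page - 1) * size,  # offset of the first hit on this page
        "size": size,
    }


body = build_search_body("PLATE_NO=A123&PLATE_COLOR=blue&page=3&size=10")
```

The `from` and `size` values carry the paging; note that by default Elasticsearch limits `from + size` to 10,000 hits per index (`index.max_result_window`), so very deep pages would need a different mechanism.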
Further, the step 5) specifically includes:
creating an HBase Client API instance, passing the RowKey set to the HBase Client API, accessing the HBase service, and packaging the returned result set into a result list;
the results list is returned to the client along the call stack.
The invention has the beneficial effects that:
1. Any change to the secondary index is independent of the main data and does not affect how the main data is stored.
2. Only the query condition and the RowKey of the main data are stored in the secondary index, so that the secondary index occupies very small storage space.
3. When the query conditions change, only a new elastic search index needs to be built.
4. Because of its distributed storage design, HBase cannot provide exact paging for data queries; Elasticsearch does provide exact paging, which overcomes the inability to page through HBase data when it is queried directly.
5. With this technique, query efficiency does not fluctuate greatly as the data volume changes; the query time for 1 billion records is nearly the same as for 10 billion records.
Drawings
Fig. 1 shows a schematic diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to an embodiment and the drawing; the description is provided for ease of understanding and is not intended to limit the scope of the invention.
Referring to fig. 1, the method for quickly searching mass data of the present invention comprises the following steps:
1) Constructing a mass data storage system;
the mass data storage system is Apache HBase, a distributed and scalable mass data storage system built on Hadoop. The business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed to serve as the unique identifier of each record; the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance and avoids local hotspots.
2) Establishing a secondary index for the data in the mass data storage system;
21) Configuring the index settings:
211) Setting the number of shards, i.e. the value of the number_of_shards parameter, to the number of cluster nodes;
212) Setting the value of the number_of_replicas parameter, which represents the number of replica copies of the data, to 0, i.e. no replicas;
213) Setting the data compression mode, i.e. setting the value of the codec parameter to best_compression, which compresses the data more effectively and significantly reduces disk usage;
22) Configuring the custom analyzers (an analyzer has one tokenizer and zero or more token filters):
221) Defining a custom tokenizer of type edge_ngram, with min_gram set to 1, max_gram set to 10, and token_chars set to letter and digit, so that tokens are built from letters and digits and the tokenizer splits on any other character; this suits the index fields in the requirements;
222) Defining the custom index analyzer for the PLATE_NO field: its tokenizer is the custom tokenizer from the previous step and its token filters are set to the built-in lowercase token filter, which suits the PLATE_NO field;
223) Defining the custom query analyzer for the PLATE_NO field: its tokenizer is the built-in keyword tokenizer and its token filters are set to the built-in lowercase token filter; it is used together with the custom index analyzer for PLATE_NO, so that the required data can be found more efficiently and accurately when querying by PLATE_NO;
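The effect of pairing an edge_ngram index analyzer with a keyword query analyzer can be approximated in plain Python. The sketch below does not call Elasticsearch; it only emulates the tokens the two analyzers are expected to produce under the configuration above, and the function names are introduced for illustration. The regular expression is an approximation of the letter and digit character classes.

```python
import re


def edge_ngram_tokens(text, min_gram=1, max_gram=10):
    """Emulate an edge_ngram tokenizer (letters and digits) plus a lowercase filter."""
    tokens = []
    # Characters that are neither letters nor digits act as separators.
    for word in re.findall(r"[^\W_]+", text):
        # Emit every prefix of the word from min_gram up to max_gram characters.
        for length in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:length].lower())
    return tokens


def query_token(text):
    """Emulate the query analyzer: keyword tokenizer plus lowercase filter."""
    return text.lower()


def matches(indexed_value, query):
    """A query matches when its single token equals one of the indexed tokens."""
    return query_token(query) in edge_ngram_tokens(indexed_value)


index_tokens = edge_ngram_tokens("A12345")
prefix_hit = matches("A12345", "a123")   # a prefix of the plate number
infix_hit = matches("A12345", "2345")    # not a prefix
```

Because the query is kept as one lowercase token while the indexed value is expanded into its prefixes, a query matches exactly when it is a prefix (of at most 10 characters) of a letter-and-digit run in the stored plate number, regardless of case.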
23) Configuring the field mappings:
231) Setting the type of the ROWKEY field to object and its enabled parameter to false. The value of this field is only used to look up data in HBase, and data never needs to be searched by this field in Elasticsearch. With enabled set to false, Elasticsearch skips parsing the field content entirely; the value can still be retrieved from the _source field, but it cannot be searched, is not indexed, and is not stored in any other way, which reduces disk usage;
232) Setting the type of the cross-INDEX field, the PLATE_TYPE field and the PLATE_COLOR field to keyword; a keyword field can only be searched by its exact value, which is appropriate because these fields hold structured content and are typically used for filtering, sorting and aggregation;
233) Setting the type of the PLATE_NO field to text, which is suited to fields that need full-text search. A text field stores normalization factors (norms) in the index so that documents can be scored; if the field only needs to be matched and the resulting score is of no concern, the norms parameter can be set to false. By default a text field also stores term frequencies and positions in the index: frequencies are used to compute scores and positions are used to run phrase queries; if phrase queries are not needed, the index_options parameter can be set to freqs so that Elasticsearch does not index positions. These settings speed up queries and reduce disk usage. The analyzer parameter of the PLATE_NO field is set to the custom index analyzer and its search_analyzer parameter to the custom query analyzer; in addition, the text field is mapped as a keyword field by means of multi-fields so that it can be used for sorting or aggregation;
234) Setting the type of the pass_time field to date, with its format specified through the format parameter.
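Taken together, the settings, analyzers and mappings of step 2) could be expressed as an index-creation body along the following lines. This is a sketch, not the patent's actual configuration: it assumes the typeless mapping layout of Elasticsearch 7 or later, a 3-node cluster, and a `yyyy-MM-dd HH:mm:ss` date format; the tokenizer and analyzer names are hypothetical, and the cross-INDEX field is omitted because its exact name is not recoverable from the text.

```python
import json

INDEX_BODY = {
    "settings": {
        "index": {
            "number_of_shards": 3,        # assumption: a 3-node cluster
            "number_of_replicas": 0,      # no replica copies
            "codec": "best_compression",  # trade CPU for smaller stored data
        },
        "analysis": {
            "tokenizer": {
                "plate_no_edge_ngram": {  # hypothetical name
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "plate_no_index_analyzer": {   # hypothetical name
                    "type": "custom",
                    "tokenizer": "plate_no_edge_ngram",
                    "filter": ["lowercase"],
                },
                "plate_no_search_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            # Kept in _source only: never parsed, indexed or searched.
            "ROWKEY": {"type": "object", "enabled": False},
            "PLATE_TYPE": {"type": "keyword"},
            "PLATE_COLOR": {"type": "keyword"},
            "PLATE_NO": {
                "type": "text",
                "analyzer": "plate_no_index_analyzer",
                "search_analyzer": "plate_no_search_analyzer",
                "norms": False,             # scores are not needed
                "index_options": "freqs",   # no positions, so no phrase queries
                "fields": {"keyword": {"type": "keyword"}},  # sorting/aggregation
            },
            # The date format is an assumption; the text leaves it unspecified.
            "pass_time": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
        }
    },
}

request_json = json.dumps(INDEX_BODY, indent=2)
```

The resulting JSON would be sent as the body of the index-creation request; sorting or aggregating on the plate number would then address the `PLATE_NO.keyword` sub-field rather than the analyzed text field.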
3) Starting a data retrieval service and listening for HTTP requests;
a back-end service interface is written that listens for requests from other programs and returns the data results those programs require.
4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result (a set of ROWKEYs);
the back-end service interface parses the request content, creates an Elasticsearch Client API instance, specifies the index to use, sends the request to the Elasticsearch index service, and returns the resulting set of RowKeys.
5) Reading the structured data corresponding to each ROWKEY in the response result from the HBase service, parsing the retrieved structured data, and returning it.
An HBase Client API instance is created, the RowKey set is passed to the HBase Client API, the HBase service is accessed, and the returned result set is packaged into a result list;
the result list is returned to the client along the call stack.
The invention stores mass data in the distributed column database HBase and builds a secondary index for the data with the distributed full-text index engine Elasticsearch. When data is retrieved, HBase is not queried directly; instead, the RowKeys of the matching data are first looked up through the secondary index, and the data is then read from HBase by RowKey and the results that satisfy the conditions are returned.
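The two-stage retrieval summarized above can be sketched end to end with in-memory stand-ins. The classes `FakeIndex` and `FakeTable` are hypothetical substitutes for the Elasticsearch index service and the HBase table, introduced only so that the flow can run without either system; the exact-equality matching and the sample rows are likewise assumptions.

```python
class FakeIndex:
    """In-memory stand-in for the Elasticsearch secondary index."""

    def __init__(self, docs):
        self.docs = docs  # each document holds the query fields plus ROWKEY

    def search(self, conditions):
        """Return the RowKeys of documents whose fields equal every condition."""
        return [
            doc["ROWKEY"]
            for doc in self.docs
            if all(doc.get(field) == value for field, value in conditions.items())
        ]


class FakeTable:
    """In-memory stand-in for an HBase table keyed by RowKey."""

    def __init__(self, rows):
        self.rows = rows

    def get_many(self, row_keys):
        """Read the rows for the given RowKeys, skipping keys that do not exist."""
        return [self.rows[key] for key in row_keys if key in self.rows]


def retrieve(conditions, index, table):
    """Two-stage retrieval: secondary index first, then the main data by RowKey."""
    row_keys = index.search(conditions)
    return table.get_many(row_keys)


table = FakeTable({
    "rk1": {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "SPEED": "63"},
    "rk2": {"PLATE_NO": "B00001", "PLATE_COLOR": "green", "SPEED": "48"},
    "rk3": {"PLATE_NO": "C55555", "PLATE_COLOR": "blue", "SPEED": "71"},
})
index = FakeIndex([
    {"PLATE_NO": "A12345", "PLATE_COLOR": "blue", "ROWKEY": "rk1"},
    {"PLATE_NO": "B00001", "PLATE_COLOR": "green", "ROWKEY": "rk2"},
    {"PLATE_NO": "C55555", "PLATE_COLOR": "blue", "ROWKEY": "rk3"},
])
results = retrieve({"PLATE_COLOR": "blue"}, index, table)
```

The index never returns full records and the table is never scanned by condition: each side does only the lookup it is suited to.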
The present invention has been described in terms of the preferred embodiments thereof, and it should be understood by those skilled in the art that various modifications can be made without departing from the principles of the invention, and such modifications should also be considered as being within the scope of the invention.
Claims (5)
1. A method for quickly searching mass data is characterized by comprising the following steps:
1) Constructing a mass data storage system;
2) Establishing a secondary index for the data in the mass data storage system;
3) Starting a data retrieval service and listening for HTTP requests;
4) Parsing the HTTP request sent by the client to generate index retrieval conditions, sending an index retrieval request to the Elasticsearch index service, and obtaining a response result;
5) Reading the structured data corresponding to each ROWKEY in the response result from the HBase service, parsing the retrieved structured data, and returning the parsed structured data;
the step 2) specifically comprises the following steps:
21) Configuring the index settings:
211) Setting the number of shards, i.e. the value of the number_of_shards parameter, to the number of cluster nodes;
212) Setting the value of the number_of_replicas parameter, which indicates the number of replica copies of the data, to 0, which indicates no replicas;
213) Setting the data compression mode by setting the value of the codec parameter to best_compression;
22) Configuring the custom analyzers:
221) Defining a custom tokenizer of type edge_ngram, with min_gram set to 1, max_gram set to 10, and token_chars set to letter and digit, so that tokens are built from letters and digits and the tokenizer splits on any other character;
222) Defining the custom index analyzer for the PLATE_NO field, with the custom tokenizer from the previous step as its tokenizer and the built-in lowercase token filter as its token filters;
223) Defining the custom query analyzer for the PLATE_NO field, with the built-in keyword tokenizer as its tokenizer and the built-in lowercase token filter as its token filters, the custom query analyzer being used together with the custom index analyzer for the PLATE_NO field;
23) Configuring the field mappings:
231) Setting the type of the ROWKEY field to object and its enabled parameter to false, wherein the value of the field is only used to look up data in HBase and data does not need to be searched by this field in Elasticsearch; with the enabled parameter set to false, Elasticsearch skips parsing the field content entirely, and the value is obtained from the _source field;
232) Setting the type of the cross-word field, the PLATE_TYPE field and the PLATE_COLOR field to keyword, a keyword field being searched by its exact value;
233) Setting the type of the PLATE_NO field to text; a text field stores normalization factors in the index in order to score documents, and if the text field only needs to be matched, the norms parameter is set to false; if phrase queries do not need to be run, the index_options parameter is set to freqs so that Elasticsearch does not index positions; the analyzer parameter of the PLATE_NO field is set to the custom index analyzer and its search_analyzer parameter to the custom query analyzer; and the text field is mapped as a keyword field by means of multi-fields for sorting or aggregation;
234) Setting the type of the pass_time field to date, with its format specified through the format parameter.
2. The method for quickly retrieving mass data according to claim 1, wherein the mass data storage system in step 1) is Apache HBase, a distributed and scalable mass data storage system built on Hadoop; the business data to be retrieved is stored in HBase according to the respective business design, and a suitable RowKey is designed to serve as the unique identifier of each record; and the data is distributed evenly across multiple RegionServers of HBase, which improves concurrent processing performance.
3. The method for quickly retrieving mass data according to claim 1, wherein the step 3) specifically includes: writing a back-end service interface that listens for requests from other programs and returns the data results those programs require.
4. The method for fast searching for massive data according to claim 1, wherein the step 4) specifically includes:
the back-end service interface parses the request content, creates an Elasticsearch Client API instance, specifies the index to use, sends the request to the Elasticsearch index service, and returns the resulting set of RowKeys.
5. The method for fast searching for massive data according to claim 1, wherein the step 5) specifically includes:
creating an HBase Client API instance, passing the RowKey set to the HBase Client API, accessing the HBase service, and packaging the returned result set into a result list;
the results list is returned to the client along the call stack.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010505012.5A CN111680043B (en) | 2020-06-05 | 2020-06-05 | Method for quickly retrieving mass data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680043A (en) | 2020-09-18 |
CN111680043B (en) | 2023-11-28 |
Family
ID=72435070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010505012.5A Active CN111680043B (en) | 2020-06-05 | 2020-06-05 | Method for quickly retrieving mass data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680043B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100510A (en) * | 2020-11-18 | 2020-12-18 | 树根互联技术有限公司 | Mass data query method and device based on Internet of vehicles platform |
CN112632157B (en) * | 2021-03-11 | 2021-07-27 | 全时云商务服务股份有限公司 | Multi-condition paging query method under distributed system |
CN114491199A (en) * | 2022-01-25 | 2022-05-13 | 浙江大华技术股份有限公司 | Data retrieval method, device and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682073A (en) * | 2016-11-14 | 2017-05-17 | 上海轻维软件有限公司 | HBase fuzzy retrieval system based on Elastic Search |
CN109165222A (en) * | 2018-08-20 | 2019-01-08 | 福州大学 | A kind of HBase secondary index creation method and system based on coprocessor |
CN109299102A (en) * | 2018-10-23 | 2019-02-01 | 中国电子科技集团公司第二十八研究所 | A kind of HBase secondary index system and method based on Elastcisearch |
CN109800222A (en) * | 2018-12-11 | 2019-05-24 | 中国科学院信息工程研究所 | A kind of HBase secondary index adaptive optimization method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111680043A (en) | 2020-09-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||