CN107679248A - A kind of intelligent data search method - Google Patents
A kind of intelligent data search method
- Publication number
- CN107679248A (application CN201711052166.8A)
- Authority
- CN
- China
- Prior art keywords
- knowledge
- data
- key
- search method
- value pair
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an intelligent data search method. Knowledge data held on different carriers at each node station is transferred by ETL scripts into a raw data warehouse. The raw data warehouse is divided into multiple partitions; in each partition a knowledge similarity algorithm is applied to build a knowledge model, and the knowledge model is de-duplicated and cleaned. Lucene is then used to build an index over the model, and the resulting index is written to an index warehouse as a stream. When a user searches, only the index warehouse catalog needs to be searched, which allows mass knowledge data to be located quickly. The invention breaks down the barriers between knowledge islands in multiple machine rooms, unifies the archiving of knowledge held on multiple carriers, and enables intelligent retrieval of knowledge questions.
Description
Technical field
The present invention relates to a search method, in particular to an intelligent data search method.
Background art
In existing knowledge retrieval bases, the search scope is usually limited to the data storage center of a single machine room, and the knowledge in that data storage center exists on a single carrier only. As a result, a user's search returns nothing more than text matches, and the text content cannot be processed as needed. Although such a system requires only single-point maintenance and has a simple machine architecture, it cannot serve customers well. The retrieval method and architecture of the existing 12345 product knowledge base therefore need to be upgraded. At this stage, retrieval in the 12345 product knowledge base is still limited to a single machine room, a single knowledge carrier (knowledge stored in a database), and single-point maintenance. Knowledge retrieval is significantly limited: the results are only text matches, and hot livelihood issues receive no special processing. What is needed is the processing of multi-carrier knowledge and the extraction and integration of knowledge across multiple machine rooms, so that users are offered the convenience of intelligent retrieval over mass data.
Summary of the invention
The technical problem to be solved by the invention is to provide an intelligent data search method that breaks down the barriers between knowledge islands in multiple machine rooms, unifies the archiving of knowledge held on multiple carriers, and enables intelligent retrieval of knowledge questions.
To solve the above technical problem, the invention adopts the following technical solution:
An intelligent data search method, characterized by comprising the following steps:
Step 1: Knowledge data in several mutually independent node stations is extracted uniformly into a raw database by ETL technology;
Step 2: The mass knowledge data in the raw database is held in several HBase instances, which gather together knowledge content from different carriers and in different knowledge formats;
Step 3: The knowledge data in the raw database is cleaned and de-duplicated, and a collaborative filtering algorithm is then applied to generate a knowledge model;
Step 4: Lucene is used to generate an index for the model data, and the index is stored in an index base;
Step 5: The index base is shared with the users of all node machine rooms.
Further, the ETL technology in Step 1 specifically uses the Kettle tool to transfer regular knowledge data and irregular knowledge data into the raw warehouse. Regular knowledge data refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type, and knowledge score information. Irregular knowledge data is attachment information stored on hard disk, including knowledge in txt, Word, and Excel file formats. Regular knowledge data is imported directly into the raw database by the Kettle tool; for irregular knowledge data, parsing software first parses the text data in the attachment, and the Kettle tool then transfers it into the raw warehouse.
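As an illustration only, the two ETL paths for regular and irregular knowledge data can be sketched in Python. The field names mirror the attributes listed in the text; the function names `extract_regular` and `extract_irregular`, the dictionary layout, and the restriction to plain-text attachments are assumptions made for this sketch and are not part of Kettle or of the claimed method.

```python
from pathlib import PurePath

# Attribute names follow the description; the dict layout itself is hypothetical.
REGULAR_FIELDS = ("title", "content", "upload_time", "keywords", "type", "score")


def extract_regular(row: dict, node: str) -> dict:
    """Regular data: a database record is copied field by field into the raw-store shape."""
    record = {field: row.get(field) for field in REGULAR_FIELDS}
    record["node"] = node
    record["carrier"] = "database"
    return record


def extract_irregular(filename: str, raw: bytes, node: str) -> dict:
    """Irregular data: the attachment is parsed to text first, then loaded.

    Only .txt is handled here; Word and Excel files would need a dedicated parser.
    """
    path = PurePath(filename)
    if path.suffix.lower() != ".txt":
        raise ValueError(f"no parser configured for {path.suffix!r}")
    record = {field: None for field in REGULAR_FIELDS}
    record.update(title=path.stem, content=raw.decode("utf-8"), type="attachment")
    record["node"] = node
    record["carrier"] = "file"
    return record
```

Both paths return the same record shape, so later stages can treat knowledge from either carrier uniformly.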
Further, the raw database consists of an HBase cluster. The HBase cluster stores the different attribute information of the knowledge data in an HTable. The HTable is sorted automatically by row key; each row contains any number of columns, the columns are sorted automatically by column key, and each column contains any number of values.
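The table layout described above (rows ordered by row key, columns ordered by column key, any number of values per column) can be modelled as a sorted map of sorted maps. The sketch below is a toy in-memory model in plain Python, not the HBase client API; the class name `SortedTable` and the column keys used in the usage note are hypothetical, and values are kept in insertion order as a simplifying assumption.

```python
from collections import defaultdict


class SortedTable:
    """Toy model of the described layout: row key -> column key -> list of values."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key: str, column_key: str, value: str) -> None:
        """Append a value to the given cell; a cell may hold any number of values."""
        self._rows[row_key][column_key].append(value)

    def scan(self):
        """Yield (row_key, column_key, values) in row-key order, then column-key order."""
        for row_key in sorted(self._rows):
            columns = self._rows[row_key]
            for column_key in sorted(columns):
                yield row_key, column_key, list(columns[column_key])
```

For example, after `put("k2", "info:title", "B")` and `put("k1", "info:title", "A")`, a scan visits row `k1` before row `k2` regardless of insertion order.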
Further, the knowledge data in the raw database is cleaned and de-duplicated using Hadoop MapReduce and HDFS.
Further, when MapReduce runs, the Mapper task reads the data files in HDFS, calls its own method to process the data, and finally outputs the result.
Further, the Mapper task is a Java process comprising the following steps:
1) the input file is split according to a given standard, each input split having a fixed size;
2) the records in each input split are parsed into key-value pairs according to a given rule;
3) the map method of the Mapper class is called, once for each key-value pair parsed;
4) the output key-value pairs are partitioned by key;
5) the key-value pairs in each partition are sorted, first by key and, for pairs with the same key, by value;
6) reduce processing is performed on the data.
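Steps 1) to 5) of the Mapper task can be simulated in a few lines of Python to make the data flow concrete. This is a single-process sketch that follows the ordering given in the text, not Hadoop code; the helper names, the whitespace and case normalization used as the map logic, and the CRC32-based partitioner are all assumptions chosen for illustration.

```python
from collections import defaultdict
from zlib import crc32


def split_input(lines, split_size):
    """Step 1: cut the input into fixed-size splits (the last one may be shorter)."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def parse_record(offset, line):
    """Step 2: turn a raw record into a key-value pair (record offset, record text)."""
    return offset, line


def map_fn(key, value):
    """Step 3: user map logic; here it emits (normalized text, source offset)."""
    yield " ".join(value.lower().split()), key


def partition(key, num_partitions):
    """Step 4: choose a partition from the key alone, using a stable hash."""
    return crc32(key.encode("utf-8")) % num_partitions


def run_map_task(split, base_offset, num_partitions):
    """Run steps 2 to 5 on one input split and return {partition: sorted pairs}."""
    partitions = defaultdict(list)
    for i, line in enumerate(split):
        key, value = parse_record(base_offset + i, line)
        for out_key, out_value in map_fn(key, value):  # one map call per parsed pair
            partitions[partition(out_key, num_partitions)].append((out_key, out_value))
    # Step 5: sort each partition by key, then by value for equal keys.
    return {p: sorted(pairs) for p, pairs in partitions.items()}
```

Because the normalized text is used as the key, records that differ only in case or spacing land next to each other after sorting, which is what makes de-duplication in the reduce stage straightforward.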
Further, the reduce processing receives the output of the Mapper tasks and, after processing, writes the result to HDFS.
Further, the reduce processing is a Java process comprising the following steps:
1) the key-value pairs output by the Mapper tasks are actively copied;
2) all data copied locally to the Reducer is merged, that is, the scattered data is merged into one large data set, and the merged data is sorted;
3) the reduce method is called on the sorted key-value pairs;
4) the output key-value pairs are written to a file in HDFS.
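The reduce steps can be sketched in the same spirit. The Python below merges several already-sorted mapper outputs, groups them by key, and applies a reduce function; it is a single-process illustration rather than Hadoop code. Using the reduce function to de-duplicate (keeping the smallest source offset for each key) is an assumption about how the cleaning step might work, and the result is returned as a list instead of being written to HDFS.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def shuffle(map_outputs):
    """Steps 1-2: copy each mapper's sorted output and merge it into one sorted list."""
    return list(merge(*map_outputs))


def reduce_fn(key, values):
    """Step 3: user reduce logic; here it keeps one representative value per key."""
    return key, min(values)


def run_reduce_task(map_outputs):
    """Run steps 1 to 3 and return the pairs that step 4 would write to HDFS."""
    merged = shuffle(map_outputs)
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append(reduce_fn(key, [value for _, value in group]))
    return results
```

`heapq.merge` relies on each mapper output already being sorted, which is the state the map-side sort leaves the data in.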
Further, the Lucene processing in Step 4 comprises two processes: indexing and retrieval.
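The two processes, indexing and retrieval, can be illustrated with a minimal inverted index. The Python sketch below is not the Lucene API; the class name `InvertedIndex`, the whitespace tokenizer, and the all-terms-must-match query semantics are assumptions made for illustration. Whitespace tokenization in particular would not suit Chinese text, which needs a word segmenter.

```python
from collections import defaultdict


class InvertedIndex:
    """Minimal inverted index: each term maps to the ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        """Lower-case and split on whitespace (a deliberately naive analyzer)."""
        return text.lower().split()

    def add(self, doc_id, text):
        """Indexing: record which documents contain each term."""
        for term in self._tokens(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Retrieval: return the sorted ids of documents containing every query term."""
        terms = self._tokens(query)
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(term, set()) for term in terms))
        return sorted(hits)
```

Retrieval touches only the postings of the query terms and never scans the documents themselves, which is why searching the index alone is enough to locate matching knowledge quickly.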
Compared with the prior art, the invention has the following advantages and effects:
1. Unification of multi-node knowledge data: ETL technology is introduced to extract the knowledge data in several mutually independent node stations uniformly into the raw database.
2. Unification of multi-carrier knowledge data: Hadoop MapReduce and HDFS are applied to unify and sort the knowledge data from different carriers.
3. Intelligent push: for the knowledge data in the HBase database, a collaborative filtering algorithm is applied and, based on the similarity between knowledge items, a knowledge recommendation model is computed in parallel.
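The text does not say which similarity measure the collaborative filtering algorithm uses. As one possible reading, the Python sketch below ranks knowledge items by the cosine similarity of their user score vectors (item-based collaborative filtering). The choice of cosine similarity, the function names, and the `ratings` layout (item to user to score) are all assumptions, and the computation is sequential rather than parallel.

```python
from math import sqrt


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse score vectors (user -> score)."""
    shared = set(a) & set(b)
    dot = sum(a[user] * b[user] for user in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_items(ratings: dict, item: str, top_n: int = 3):
    """Rank the other knowledge items by similarity to `item`, most similar first."""
    scores = [
        (other, cosine(ratings[item], vector))
        for other, vector in ratings.items()
        if other != item
    ]
    # Ties are broken by item name so the ordering is deterministic.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]
```

Each item's similarity list can be computed independently of the others, so the work divides naturally across partitions when run in parallel.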
Brief description of the drawings
Fig. 1 is a schematic diagram of the intelligent data search method of the invention.
Detailed description
The invention is described in further detail below with reference to the accompanying drawing and by way of embodiments. The following embodiments explain the invention, and the invention is not limited to them.
The intelligent data search method of the invention overcomes the deficiencies of the prior art through data connections between node stations, a mass-knowledge intelligent search engine, and an index warehouse. The intelligent search engine comprises a raw data warehouse module, a Lucene module, and a model. In this technical solution, knowledge data held on different carriers at each node station is transferred by ETL scripts into the raw data warehouse. The raw data warehouse is divided into multiple partitions; in each partition a knowledge similarity algorithm is applied to build a knowledge model, and the knowledge model is de-duplicated and cleaned. Lucene is then used to build an index over the model, and the resulting index is written to the index warehouse as a stream. When a user searches, only the index warehouse catalog needs to be searched, which allows mass knowledge data to be located quickly.
The system used by the intelligent data search method comprises several node stations, a search engine, and an index warehouse; the search engine comprises a raw data warehouse module, a model, and a Lucene module.
As shown in Fig. 1, the intelligent data search method of the invention comprises the following steps:
Step 1: Knowledge data in several mutually independent node stations is extracted uniformly into the raw database by ETL technology;
The ETL technology uses the Kettle tool to transfer regular knowledge data and irregular knowledge data into the raw warehouse. Regular knowledge data generally refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type, and knowledge score information. Irregular knowledge data is usually attachment information stored on hard disk, including knowledge in file formats such as txt, Word, and Excel. Regular knowledge data is imported directly into the raw database by the Kettle tool, whereas for irregular knowledge data, parsing software must first parse the text data in the attachment, after which the Kettle tool transfers it into the raw warehouse.
Step 2: The mass knowledge data in the raw database is held in several HBase instances, which gather together knowledge content from different carriers and in different knowledge formats;
Step 3: The knowledge data in the raw database is cleaned and de-duplicated, and a collaborative filtering algorithm is then applied to generate a knowledge model;
Step 4: Lucene is used to generate an index for the model data, and the index is stored in an index base; the Lucene processing comprises two processes, indexing and retrieval.
Step 5: The index base is shared with the users of all node machine rooms.
The raw database consists of an HBase cluster. The HBase cluster stores the different attribute information of the knowledge data in an HTable. The HTable is sorted automatically by row key; each row contains any number of columns, the columns are sorted automatically by column key, and each column contains any number of values. The knowledge data in the raw database is cleaned and de-duplicated using Hadoop MapReduce and HDFS.
When MapReduce runs, the Mapper task reads the data files in HDFS, calls its own method to process the data, and finally outputs the result.
The Mapper task is a Java process comprising the following steps:
1) the input file is split according to a given standard, each input split having a fixed size;
2) the records in each input split are parsed into key-value pairs according to a given rule;
3) the map method of the Mapper class is called, once for each key-value pair parsed;
4) the output key-value pairs are partitioned by key;
5) the key-value pairs in each partition are sorted, first by key and, for pairs with the same key, by value;
6) reduce processing is performed on the data.
The reduce processing receives the output of the Mapper tasks and, after processing, writes the result to HDFS.
The reduce processing is a Java process comprising the following steps:
1) the key-value pairs output by the Mapper tasks are actively copied;
2) all data copied locally to the Reducer is merged, that is, the scattered data is merged into one large data set, and the merged data is sorted;
3) the reduce method is called on the sorted key-value pairs;
4) the output key-value pairs are written to a file in HDFS.
The content described above in this specification is merely an illustration of the invention. Those skilled in the technical field of the invention may make various modifications or additions to the specific embodiments described, or substitute them in a similar manner; provided these do not depart from the content of the description or go beyond the scope defined in the claims, they all fall within the scope of protection of the invention.
Claims (9)
1. An intelligent data search method, characterized by comprising the following steps:
Step 1: Knowledge data in several mutually independent node stations is extracted uniformly into a raw database by ETL technology;
Step 2: The mass knowledge data in the raw database is held in several HBase instances, which gather together knowledge content from different carriers and in different knowledge formats;
Step 3: The knowledge data in the raw database is cleaned and de-duplicated, and a collaborative filtering algorithm is then applied to generate a knowledge model;
Step 4: Lucene is used to generate an index for the model data, and the index is stored in an index base;
Step 5: The index base is shared with the users of all node machine rooms.
2. The intelligent data search method according to claim 1, characterized in that: the ETL technology in Step 1 specifically uses the Kettle tool to transfer regular knowledge data and irregular knowledge data into the raw warehouse, wherein regular knowledge data refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type, and knowledge score information; irregular knowledge data is attachment information stored on hard disk, including knowledge in txt, Word, and Excel file formats; regular knowledge data is imported directly into the raw database by the Kettle tool; and for irregular knowledge data, parsing software first parses the text data in the attachment, and the Kettle tool then transfers it into the raw warehouse.
3. The intelligent data search method according to claim 1, characterized in that: the raw database consists of an HBase cluster; the HBase cluster stores the different attribute information of the knowledge data in an HTable; the HTable is sorted automatically by row key; each row contains any number of columns; the columns are sorted automatically by column key; and each column contains any number of values.
4. The intelligent data search method according to claim 3, characterized in that: the knowledge data in the raw database is cleaned and de-duplicated using Hadoop MapReduce and HDFS.
5. The intelligent data search method according to claim 4, characterized in that: when MapReduce runs, the Mapper task reads the data files in HDFS, calls its own method to process the data, and finally outputs the result.
6. The intelligent data search method according to claim 5, characterized in that: the Mapper task is a Java process comprising the following steps:
1) the input file is split according to a given standard, each input split having a fixed size;
2) the records in each input split are parsed into key-value pairs according to a given rule;
3) the map method of the Mapper class is called, once for each key-value pair parsed;
4) the output key-value pairs are partitioned by key;
5) the key-value pairs in each partition are sorted, first by key and, for pairs with the same key, by value;
6) reduce processing is performed on the data.
7. The intelligent data search method according to claim 6, characterized in that: the reduce processing receives the output of the Mapper tasks and, after processing, writes the result to HDFS.
8. The intelligent data search method according to claim 7, characterized in that: the reduce processing is a Java process comprising the following steps:
1) the key-value pairs output by the Mapper tasks are actively copied;
2) all data copied locally to the Reducer is merged, that is, the scattered data is merged into one large data set, and the merged data is sorted;
3) the reduce method is called on the sorted key-value pairs;
4) the output key-value pairs are written to a file in HDFS.
9. The intelligent data search method according to claim 1, characterized in that: the Lucene processing in Step 4 comprises two processes, indexing and retrieval.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711052166.8A CN107679248A (en) | 2017-10-30 | 2017-10-30 | A kind of intelligent data search method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711052166.8A CN107679248A (en) | 2017-10-30 | 2017-10-30 | A kind of intelligent data search method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679248A true CN107679248A (en) | 2018-02-09 |
Family
ID=61143985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711052166.8A Pending CN107679248A (en) | 2017-10-30 | 2017-10-30 | A kind of intelligent data search method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679248A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120182891A1 (en) * | 2011-01-19 | 2012-07-19 | Youngseok Lee | Packet analysis system and method using hadoop based parallel computation |
CN102779134A (en) * | 2011-05-12 | 2012-11-14 | 苏州同程旅游网络科技有限公司 | Lucene-based distributed search method |
CN102915365A (en) * | 2012-10-24 | 2013-02-06 | 苏州两江科技有限公司 | Hadoop-based construction method for distributed search engine |
CN104298768A (en) * | 2014-10-29 | 2015-01-21 | 深圳市同洲电子股份有限公司 | Searching method, device and system |
CN104376053A (en) * | 2014-11-04 | 2015-02-25 | 南京信息工程大学 | Storage and retrieval method based on massive meteorological data |
CN105069101A (en) * | 2015-08-07 | 2015-11-18 | 桂林电子科技大学 | Distributed index construction and search method |
-
2017
- 2017-10-30 CN CN201711052166.8A patent/CN107679248A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11354314B2 (en) | Method for connecting a relational data store's meta data with hadoop | |
US9256665B2 (en) | Creation of inverted index system, and data processing method and apparatus | |
US20170242889A1 (en) | Cache Based Efficient Access Scheduling for Super Scaled Stream Processing Systems | |
US20230273898A1 (en) | Lineage data for data records | |
CN104408159B (en) | A kind of data correlation, loading, querying method and device | |
CN106126601A (en) | A kind of social security distributed preprocess method of big data and system | |
CN109241159B (en) | Partition query method and system for data cube and terminal equipment | |
CN107515878A (en) | The management method and device of a kind of data directory | |
CN104111936B (en) | Data query method and system | |
CN104317970A (en) | Data flow type processing method based on data processing center | |
CN103440288A (en) | Big data storage method and device | |
CN109101575A (en) | Calculation method and device | |
WO2014117295A1 (en) | Performing an index operation in a mapreduce environment | |
CN108287889B (en) | A kind of multi-source heterogeneous date storage method and system based on elastic table model | |
CN106407442A (en) | Massive text data processing method and apparatus | |
CN116166191A (en) | Integrated system of lake and storehouse | |
CN109117426A (en) | Distributed networks database query method, apparatus, equipment and storage medium | |
CN117056303B (en) | Data storage method and device suitable for military operation big data | |
CN113779349A (en) | Data retrieval system, apparatus, electronic device, and readable storage medium | |
CN107291938A (en) | Order Query System and method | |
CN107679248A (en) | A kind of intelligent data search method | |
CN105718485B (en) | A kind of method and device by data inputting database | |
CN112860680B (en) | Data processing method and system, and data query method and system | |
CN111159213A (en) | Data query method, device, system and storage medium | |
CN111143329B (en) | Data processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 210029 No. 268, Hanzhoung Road, Nanjing, Jiangsu; Applicant after: CLP Hongxin Information Technology Co., Ltd; Address before: 210029 No. 268, Hanzhoung Road, Nanjing, Jiangsu; Applicant before: Jiangsu Hongxin System Integration Co., Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180209 |