CN107679248A

CN107679248A - An intelligent data retrieval method

Info

Publication number: CN107679248A
Application number: CN201711052166.8A
Authority: CN
Inventors: 汪璞璞; 杨君; 钟启明; 李丰光
Original assignee: JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Current assignee: JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Priority date: 2017-10-30
Filing date: 2017-10-30
Publication date: 2018-02-09

Abstract

The invention discloses an intelligent data retrieval method. Knowledge data held on different carriers at each node site is transferred into a raw data warehouse by ETL scripts. The raw data warehouse is divided into multiple partitions; in each partition a knowledge model is created by a knowledge similarity algorithm, and the knowledge model is de-duplicated and cleaned. Lucene is then used to create an index for the model, and the created index is written to an index warehouse in streaming form. When a user performs a search, only the index warehouse directory needs to be searched to quickly locate items in massive knowledge data. The invention breaks down the barriers between knowledge islands in multiple machine rooms, achieves unified archiving of multi-carrier knowledge, and achieves intelligent retrieval of knowledge problems.

Description

An intelligent data retrieval method

Technical field

The present invention relates to a retrieval method, and in particular to an intelligent data retrieval method.

Background technology

In existing knowledge retrieval bases, the retrieval scope is usually limited to the data storage center of a single machine room, and the knowledge held in that data storage center exists on a single carrier. As a result, what a user's search returns is merely a textual match, and the content of the text cannot be processed as needed. Although such a system requires only single-point maintenance and its machine architecture is simple, it cannot provide better service to customers. The retrieval method and architecture of the existing 12345 product knowledge base therefore need to be upgraded. At the present stage, retrieval in the 12345 product knowledge base is still limited to a single machine room, a single knowledge carrier (storage in a database) and single-point maintenance; knowledge retrieval is very limited, and the retrieved results are only textual matches, with no processing directed at hot livelihood issues. It is therefore necessary to process multi-carrier knowledge and to extract and integrate knowledge across multiple machine rooms, so as to give users convenient, intelligent retrieval over massive data.

Summary of the invention

The technical problem to be solved by the present invention is to provide an intelligent data retrieval method that breaks down the barriers between knowledge islands in multiple machine rooms, achieves unified archiving of multi-carrier knowledge, and achieves intelligent retrieval of knowledge problems.

To solve the above technical problem, the technical solution adopted by the present invention is as follows:

An intelligent data retrieval method, characterized by comprising the following steps:

Step 1: knowledge data in several mutually independent node sites is uniformly extracted into a raw database by means of ETL technology;

Step 2: the massive knowledge data in the raw database is held in several HBase stores, which bring together knowledge from different carriers and knowledge content in different formats;

Step 3: the knowledge data in the raw database is cleaned and de-duplicated, and a collaborative filtering algorithm is then applied to generate a knowledge model;

Step 4: Lucene technology is used to generate an index for the model data, and the index is stored in an index database;

Step 5: the index database is shared with users in all node machine rooms.

Further, the ETL technology in step 1 specifically uses the Kettle tool to transfer structured knowledge data and unstructured knowledge data into the raw warehouse. Structured knowledge data refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type and knowledge rating information. Unstructured knowledge data is attachment information stored on a hard disk, including knowledge in txt, Word and Excel file formats. Structured knowledge data is imported directly into the raw database by the Kettle tool; for unstructured knowledge data, the text data in the attachment is first parsed by parsing software and then transferred into the raw warehouse by the Kettle tool.
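As a rough illustration of the two input paths described above, the following Python sketch brings a database record and a file attachment into one common record shape before loading. It does not use Kettle; the `KnowledgeRecord` fields, the `parse_attachment` helper and the in-memory `warehouse` list are hypothetical stand-ins, and only plain-text attachments are parsed here.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class KnowledgeRecord:
    """Common shape for both kinds of knowledge data (field names are assumptions)."""
    title: str
    content: str
    source: str  # "database" or "attachment"
    keywords: list = field(default_factory=list)


def from_database_row(row: dict) -> KnowledgeRecord:
    # A database record already has named fields, so it is mapped directly.
    return KnowledgeRecord(
        title=row["title"],
        content=row["content"],
        source="database",
        keywords=list(row.get("keywords", [])),
    )


def parse_attachment(path: Path) -> KnowledgeRecord:
    # An attachment must be parsed first; here only plain text is handled.
    text = path.read_text(encoding="utf-8")
    return KnowledgeRecord(title=path.stem, content=text.strip(), source="attachment")


def load(records, warehouse: list) -> int:
    # Stand-in for the transfer into the raw warehouse; returns the new size.
    warehouse.extend(records)
    return len(warehouse)


warehouse = []
with tempfile.TemporaryDirectory() as tmp:
    attachment = Path(tmp) / "water_supply_faq.txt"
    attachment.write_text("How to report a water outage.\n", encoding="utf-8")
    row = {"title": "Heating fees", "content": "Fee schedule.", "keywords": ["heating"]}
    load([from_database_row(row), parse_attachment(attachment)], warehouse)

print([record.source for record in warehouse])
```

Keeping one record shape for both paths means the later cleaning and indexing stages do not need to know where a piece of knowledge came from.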

Further, the raw database is formed by one HBase cluster. The HBase cluster stores the different attribute information of the knowledge data in an HTable. The HTable is automatically sorted by row key; each row contains any number of columns, the columns are automatically sorted by column key, and each column contains any number of values.
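The table layout described above (rows ordered by row key, columns ordered by column key, any number of values per column) can be pictured with a small in-memory sketch. This is not the HBase client API; the `SortedTable` class and the `info:title` style column names are hypothetical and serve only to show the ordering behaviour.

```python
from collections import defaultdict


class SortedTable:
    """Toy model of a table whose rows and columns are read back in key order."""

    def __init__(self):
        # row key -> column key -> list of values, in insertion order
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column_key, value):
        # A column may hold any number of values, so values are appended.
        self._rows[row_key][column_key].append(value)

    def scan(self):
        # Rows come back sorted by row key, columns sorted by column key.
        for row_key in sorted(self._rows):
            columns = self._rows[row_key]
            yield row_key, [(c, list(columns[c])) for c in sorted(columns)]


table = SortedTable()
table.put("k002", "info:title", "Heating fees")
table.put("k001", "info:type", "FAQ")
table.put("k001", "info:title", "Water outage")
table.put("k001", "info:title", "Water outage (revised)")

rows = list(table.scan())
for row_key, columns in rows:
    print(row_key, columns)
```

Although the rows were written out of order, the scan returns `k001` before `k002`, and within `k001` the `info:title` column comes before `info:type`.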

Further, the knowledge data in the raw database is cleaned and de-duplicated by applying Hadoop MapReduce and HDFS.

Further, when MapReduce runs, the data files in HDFS are read by tasks run by the Mapper; each task then invokes its own methods to process the data and finally outputs the result.

Further, a task run by the Mapper is a Java process and comprises the following steps:

1) the input files are divided into splits according to a set standard, each input split having a fixed size;

2) the records in each input split are parsed into key-value pairs according to a set rule;

3) the map method of the Mapper class is called, once for each parsed key-value pair;

4) the output key-value pairs are partitioned by key;

5) the key-value pairs in each partition are sorted, first by key and, for pairs with the same key, by value;

6) reduction processing is performed on the data.
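The map-side steps listed above can be simulated in a few lines of plain Python. This is only a sketch, not Hadoop code: all splits are processed inside one function rather than by separate tasks, the line number is used as the input key, and the `map_fn` that normalises text so that duplicate entries share a key is an assumed example of cleaning logic that the source does not specify.

```python
import zlib


def split_input(lines, split_size):
    # Step 1: cut the input into fixed-size splits.
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def parse_records(split, start_offset):
    # Step 2: parse each record into a (key, value) pair: (line number, text).
    return [(start_offset + i, line) for i, line in enumerate(split)]


def map_fn(key, value):
    # Step 3: called once per pair; emits (normalised text, original line number).
    yield " ".join(value.lower().split()), key


def partition(key, num_partitions):
    # Step 4: choose a partition from the key (crc32 keeps runs reproducible).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_mapper(lines, split_size=2, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    offset = 0
    for split in split_input(lines, split_size):
        for key, value in parse_records(split, offset):
            for out_key, out_value in map_fn(key, value):
                partitions[partition(out_key, num_partitions)].append((out_key, out_value))
        offset += len(split)
    # Step 5: sort each partition by key, then by value for equal keys.
    return [sorted(p) for p in partitions]


lines = ["Water outage", "water  OUTAGE", "Heating fees"]
mapper_output = run_mapper(lines)
for index, pairs in enumerate(mapper_output):
    print(index, pairs)
```

Because partitioning depends only on the key, both spellings of the water outage entry end up in the same partition, which is what allows a later reduce stage to de-duplicate them.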

Further, the reduction processing is reduce processing, which receives the output of the Mapper tasks and writes the processed result to HDFS.

Further, the reduce processing is a Java process and comprises the following steps:

1) the key-value pairs output by the Mapper tasks are actively copied;

2) all the data copied to the Reducer's local storage is merged, that is, the scattered data is merged into one large data set, and the merged data is sorted;

3) the reduce method is called on the sorted key-value pairs;

4) the output key-value pairs are written to an HDFS file.

Further, the Lucene technology in step 4 comprises two processes: indexing and retrieval.
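The two processes, indexing and retrieval, can be illustrated with a minimal inverted index in Python. This sketch does not use Lucene and does not reproduce its analysis or scoring; the `InvertedIndex` class, whitespace tokenisation and the all-terms-must-match query rule are simplifying assumptions.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of ids of documents containing the term."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Indexing: record the document id under every term it contains.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        # Retrieval: return ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


index = InvertedIndex()
index.add(1, "water outage report")
index.add(2, "heating fees report")

print(index.search("report"))
print(index.search("water report"))
```

A query touches only the postings of its own terms rather than every stored document, which is the property that lets a search consult the index alone.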

Compared with the prior art, the present invention has the following advantages and effects:

1. Unification of multi-node knowledge data: ETL technology is introduced to uniformly extract the knowledge data in several mutually independent node sites into the raw database.

2. Unification of multi-carrier knowledge data: Hadoop MapReduce and HDFS are applied to unify and sort the knowledge data from different carriers.

3. Intelligent push: a collaborative filtering algorithm is applied to the knowledge data in the HBase database and, based on the similarity between knowledge items and computed in parallel, forms a knowledge recommendation model.
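To make the idea of a similarity-based recommendation model concrete, the following Python sketch builds an item-to-item model from rating data. The source does not state which similarity measure or input data is used; cosine similarity, the `ratings` layout (knowledge item to user scores) and the sample values are all assumptions, and the computation is sequential rather than parallel.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse score vectors keyed by user id.
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def build_model(ratings):
    # ratings: {item_id: {user_id: score}}
    # model:   {item_id: [(similar_item_id, similarity), ...]}, most similar first
    model = {}
    items = sorted(ratings)
    for item in items:
        neighbours = []
        for other in items:
            if other != item:
                similarity = cosine(ratings[item], ratings[other])
                if similarity > 0:
                    neighbours.append((other, similarity))
        neighbours.sort(key=lambda pair: (-pair[1], pair[0]))
        model[item] = neighbours
    return model


ratings = {
    "k1": {"u1": 1, "u2": 1},
    "k2": {"u1": 1, "u2": 1},
    "k3": {"u2": 1, "u3": 1},
    "k4": {"u4": 1},
}
model = build_model(ratings)
print([item for item, _ in model["k1"]])
```

Items rated by the same users rank as most similar, so a user reading `k1` would be offered `k2` first and `k3` second, while `k4`, which shares no users with any other item, has no neighbours.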

Brief description of the drawings

Fig. 1 is a schematic diagram of the intelligent data retrieval method of the present invention.

Embodiment

The present invention is described in further detail below through an embodiment and with reference to the accompanying drawing. The following embodiment explains the present invention, and the present invention is not limited to it.

The intelligent data retrieval method of the present invention overcomes the deficiencies of the prior art through data connections among the node sites, a massive-knowledge intelligent search engine and an index warehouse. The intelligent search engine comprises a raw data warehouse module, a Lucene module and a model. In this technical solution, knowledge data held on different carriers at each node site is transferred into the raw data warehouse by ETL scripts. The raw data warehouse is divided into multiple partitions; in each partition a knowledge model is created by a knowledge similarity algorithm, and the knowledge model is de-duplicated and cleaned. Lucene is then used to create an index for the model, and the created index is written to the index warehouse in streaming form. When a user performs a search, only the index warehouse directory needs to be searched to quickly locate items in massive knowledge data.

The system used by the intelligent data retrieval method comprises several node sites, a search engine and an index warehouse; the search engine comprises a raw data warehouse module, a model and a Lucene module.

As shown in Fig. 1, the intelligent data retrieval method of the present invention comprises the following steps:

The ETL technology uses the Kettle tool to transfer structured knowledge data and unstructured knowledge data into the raw warehouse. Structured knowledge data generally refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type and knowledge rating information. Unstructured knowledge data is usually attachment information stored on a hard disk, including knowledge in file formats such as txt, Word and Excel. Structured knowledge data is imported directly into the raw database by the Kettle tool; for unstructured knowledge data, the text data in the attachment must first be parsed by parsing software and is then transferred into the raw warehouse by the Kettle tool.

Step 4: Lucene technology is used to generate an index for the model data, and the index is stored in an index database; the Lucene technology comprises two processes, indexing and retrieval.

Step 5: the index database is shared with users in all node machine rooms.

The raw database is formed by one HBase cluster. The HBase cluster stores the different attribute information of the knowledge data in an HTable. The HTable is automatically sorted by row key; each row contains any number of columns, the columns are automatically sorted by column key, and each column contains any number of values. The knowledge data in the raw database is cleaned and de-duplicated by applying Hadoop MapReduce and HDFS.

When MapReduce runs, the data files in HDFS are read by tasks run by the Mapper; each task then invokes its own methods to process the data and finally outputs the result.

A task run by the Mapper is a Java process and comprises the following steps:

4) the output key-value pairs are partitioned by key;

6) reduction processing is performed on the data.

The reduction processing is reduce processing, which receives the output of the Mapper tasks and writes the processed result to HDFS.

The reduce processing is a Java process and comprises the following steps:

3) the reduce method is called on the sorted key-value pairs;

4) the output key-value pairs are written to an HDFS file.

The above content described in this specification is merely an illustration of the present invention. Those skilled in the art to which the present invention belongs may make various modifications or supplements to the described specific embodiment, or substitute it in a similar manner; provided they do not depart from the content of this specification or go beyond the scope defined by the claims, all such changes fall within the protection scope of the present invention.

Claims

1. An intelligent data retrieval method, characterized by comprising the following steps:

Step 1: knowledge data in several mutually independent node sites is uniformly extracted into a raw database by means of ETL technology;

Step 2: the massive knowledge data in the raw database is held in several HBase stores, which bring together knowledge from different carriers and knowledge content in different formats;

Step 3: the knowledge data in the raw database is cleaned and de-duplicated, and a collaborative filtering algorithm is then applied to generate a knowledge model;

Step 5: the index database is shared with users in all node machine rooms.

2. The intelligent data retrieval method according to claim 1, characterized in that: the ETL technology in step 1 specifically uses the Kettle tool to transfer structured knowledge data and unstructured knowledge data into the raw warehouse, wherein structured knowledge data refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type and knowledge rating information; unstructured knowledge data is attachment information stored on a hard disk, including knowledge in txt, Word and Excel file formats; structured knowledge data is imported directly into the raw database by the Kettle tool; and for unstructured knowledge data, the text data in the attachment is first parsed by parsing software and then transferred into the raw warehouse by the Kettle tool.

3. The intelligent data retrieval method according to claim 1, characterized in that: the raw database is formed by one HBase cluster; the HBase cluster stores the different attribute information of the knowledge data in an HTable; the HTable is automatically sorted by row key; each row contains any number of columns, the columns are automatically sorted by column key, and each column contains any number of values.

4. The intelligent data retrieval method according to claim 3, characterized in that: the knowledge data in the raw database is cleaned and de-duplicated by applying Hadoop MapReduce and HDFS.

5. The intelligent data retrieval method according to claim 4, characterized in that: when MapReduce runs, the data files in HDFS are read by tasks run by the Mapper; each task then invokes its own methods to process the data and finally outputs the result.

6. The intelligent data retrieval method according to claim 5, characterized in that: a task run by the Mapper is a Java process and comprises the following steps:

4) the output key-value pairs are partitioned by key;

5) the key-value pairs in each partition are sorted, first by key and, for pairs with the same key, by value;

6) reduction processing is performed on the data.

7. The intelligent data retrieval method according to claim 6, characterized in that: the reduction processing is reduce processing, which receives the output of the Mapper tasks and writes the processed result to HDFS.

8. The intelligent data retrieval method according to claim 7, characterized in that: the reduce processing is a Java process and comprises the following steps:

2) all the data copied to the Reducer's local storage is merged, that is, the scattered data is merged into one large data set, and the merged data is sorted;

3) the reduce method is called on the sorted key-value pairs;

4) the output key-value pairs are written to an HDFS file.

9. The intelligent data retrieval method according to claim 1, characterized in that: the Lucene technology in step 4 comprises two processes, indexing and retrieval.