CN107679248A - A kind of intelligent data search method - Google Patents

A kind of intelligent data search method

Info

Publication number
CN107679248A
Authority
CN
China
Prior art keywords
knowledge
data
key
search method
value pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711052166.8A
Other languages
Chinese (zh)
Inventor
汪璞璞
杨君
钟启明
李丰光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Original Assignee
JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Priority to CN201711052166.8A
Publication of CN107679248A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent data search method. Knowledge data held on different carriers in each node computer room is transferred into a raw data warehouse by ETL scripts. The raw data warehouse is divided into multiple partitions; in each partition a knowledge similarity algorithm is applied to create a knowledge model, and the knowledge model is cleaned and deduplicated. Lucene is then used to build an index over the model, and the resulting index is written to an index warehouse as a stream. When a user searches, only the index warehouse directory needs to be searched to locate items quickly within the mass of knowledge data. The invention breaks down the barriers between knowledge islands in multiple computer rooms, unifies and archives knowledge from multiple carriers, and enables intelligent retrieval of knowledge questions.

Description

An intelligent data search method
Technical field
The present invention relates to a search method, and in particular to an intelligent data search method.
Background
In existing knowledge retrieval bases, the search scope is usually confined to the data storage center of a single computer room, and the data storage center used for retrieval holds knowledge on only a single kind of carrier. As a result, what a user's search returns is usually just a textual match, and the content of the text cannot be processed as needed. Although such a system needs only single-point maintenance and its machine architecture is simple, it cannot serve customers well. The search method and architecture of the existing 12345 product knowledge base therefore need to be upgraded. At present, searching the 12345 product knowledge base is still limited to a single computer room and a single knowledge carrier (knowledge stored in a database) with single-point maintenance; knowledge retrieval is severely limited, and the results are only textual matches that are not processed with respect to hot livelihood issues. It is therefore necessary to process knowledge from multiple carriers and to extract and integrate knowledge across multiple computer rooms, giving users the convenience of intelligent retrieval over mass data.
Summary of the invention
The technical problem to be solved by the invention is to provide an intelligent data search method that breaks down the barriers between knowledge islands in multiple computer rooms, unifies and archives knowledge from multiple carriers, and enables intelligent retrieval of knowledge questions.
To solve the above technical problem, the technical solution adopted by the invention is:
An intelligent data search method, characterized in that it comprises the following steps:
Step 1: Extract the knowledge data in several mutually independent node computer rooms into the raw database in a unified way by ETL technology;
Step 2: The raw database, which holds the mass knowledge data, is made up of several HBase instances and gathers together knowledge content from different carriers and in different knowledge formats;
Step 3: Clean and deduplicate the knowledge data in the raw store, then apply a collaborative filtering algorithm to generate a knowledge model;
Step 4: Use Lucene to generate an index for the model data and store it in the index database;
Step 5: Share the index database with the users of all node computer rooms.
Further, the ETL technology in Step 1 specifically uses the Kettle tool to transfer regular knowledge data and irregular knowledge data into the raw warehouse. Regular knowledge data refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type, and knowledge score information. Irregular knowledge data is attachment information stored on hard disk, including knowledge in txt, Word, and Excel file formats. Regular knowledge data is imported directly into the raw database by the Kettle tool; for irregular knowledge data, parsing software first parses the text data in the attachment, and the Kettle tool then transfers it into the raw warehouse.
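The two ingestion paths described above can be illustrated with a small Python sketch. It is not Kettle and does not reproduce the patent's implementation: `KnowledgeRecord`, `parse_attachment`, and `load_into_raw_store` are hypothetical names, the raw store is a plain list, and the stand-in parser handles only plain-text attachments.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeRecord:
    """Hypothetical shape of one knowledge item in the raw store."""
    title: str
    content: str
    keywords: list = field(default_factory=list)
    source: str = "database"  # "database" = regular, "attachment" = irregular


def parse_attachment(filename: str, raw_bytes: bytes) -> KnowledgeRecord:
    """Stand-in for the parsing software: extracts text from a txt attachment."""
    if not filename.lower().endswith(".txt"):
        raise ValueError(f"no parser configured for {filename}")
    text = raw_bytes.decode("utf-8").strip()
    return KnowledgeRecord(title=filename, content=text, source="attachment")


def load_into_raw_store(db_rows, attachments):
    """Regular rows are loaded as-is; attachments are parsed first."""
    raw_store = []
    for row in db_rows:
        raw_store.append(KnowledgeRecord(**row))
    for name, data in attachments:
        raw_store.append(parse_attachment(name, data))
    return raw_store


rows = [{"title": "Water bill query",
         "content": "How to check a water bill.",
         "keywords": ["water", "bill"]}]
files = [("heating_policy.txt", b"Heating starts on 15 November.\n")]
store = load_into_raw_store(rows, files)
```

In this sketch both paths end in the same record type, which mirrors the idea that regular and irregular knowledge are unified once they reach the raw warehouse.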
Further, the raw database consists of one HBase cluster. The HBase cluster stores the different attribute information of the knowledge data in an HTable. The HTable is sorted automatically by row key; each row contains any number of columns, the columns are sorted automatically by column key, and each column contains any number of values.
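The layout just described (rows ordered by row key, columns ordered by column key, several values per column) can be pictured as a nested sorted map. The Python sketch below is an in-memory illustration only and does not use the HBase client API; `SketchTable` and the `info:` column names are assumptions made for the example, and the sketch sorts at scan time whereas a real store keeps data ordered as it is written.

```python
from collections import defaultdict


class SketchTable:
    """Toy model: row key -> column key -> list of values."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column_key, value):
        # A column may hold any number of values, so values are appended.
        self._rows[row_key][column_key].append(value)

    def scan(self):
        # Rows come back in row-key order, columns in column-key order.
        for row_key in sorted(self._rows):
            columns = self._rows[row_key]
            yield row_key, [(c, list(columns[c])) for c in sorted(columns)]


table = SketchTable()
table.put("k002", "info:title", "Heating policy")
table.put("k001", "info:type", "FAQ")
table.put("k001", "info:title", "Water bill query")
table.put("k001", "info:title", "Water bill query (revised)")
rows = list(table.scan())
```

Although `k002` was written first, the scan returns `k001` first, and within `k001` the `info:title` column precedes `info:type`.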
Further, the knowledge data in the raw database is cleaned and deduplicated using Hadoop MapReduce and HDFS.
Further, when MapReduce runs, the task run by the Mapper reads the data files in HDFS, then calls its own method to process the data, and finally outputs the result.
Further, the task run by the Mapper is a Java process comprising the following steps:
1) The input file is split according to a certain standard, and the size of each input split is fixed;
2) The records in each input split are parsed into key-value pairs according to certain rules;
3) The map method of the Mapper class is called, once for each key-value pair parsed;
4) The output key-value pairs are partitioned by key;
5) The key-value pairs in each partition are sorted, first by key and then, for pairs with the same key, by value;
6) A reduction step is applied to the data.
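Steps 1) to 5) above can be sketched in plain Python. This is a single-process illustration rather than Hadoop code: the function names, the line-count split size, the word-splitting map function, and the CRC-based partitioner are all assumptions made for the example, and step 6) is left out.

```python
import zlib


def split_input(lines, split_size):
    """Step 1: cut the input into fixed-size splits (here: a fixed line count)."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def parse_record(offset, line):
    """Step 2: turn one record into a key-value pair (position, text)."""
    return offset, line


def map_fn(key, value):
    """Step 3: the map method, called once per parsed pair; emits (word, 1)."""
    for word in value.lower().split():
        yield word, 1


def partition(key, num_partitions):
    """Step 4: choose a partition from the key with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_map_task(split, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for offset, line in enumerate(split):
        key, value = parse_record(offset, line)
        for out_key, out_value in map_fn(key, value):
            partitions[partition(out_key, num_partitions)].append((out_key, out_value))
    # Step 5: sort each partition by key, then by value for equal keys.
    return [sorted(p) for p in partitions]


lines = ["water bill", "heating bill", "water supply"]
splits = split_input(lines, 2)
outputs = [run_map_task(s, 2) for s in splits]
```

Sorting tuples gives the key-then-value order for free, and because the partition depends only on the key, every pair with the same key lands in the same partition number across all map tasks.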
Further, the reduction step is the reduce processing, which receives the output of the Mapper tasks and, after processing, writes the result to HDFS.
Further, the reduce processing is a Java process comprising the following steps:
1) Actively copy the output key-value pairs from the Mapper tasks;
2) Merge all the data copied locally to the Reducer, i.e. merge the scattered data into one large data set, and sort the merged data;
3) Call the reduce method on the sorted key-value pairs;
4) Write the output key-value pairs to an HDFS file.
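The four reduce-side steps above can be sketched in the same spirit. Again this is plain Python, not Hadoop: the copied map outputs are given as in-memory sorted lists, the reduce function simply sums the values, and `write_output` returns a tab-separated string instead of writing to HDFS. All of these are assumptions for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_map_outputs(copied_runs):
    """Steps 1-2: merge the sorted runs copied from map tasks into one sorted list."""
    return list(heapq.merge(*copied_runs))


def reduce_fn(key, values):
    """Step 3: the reduce method, called once per key with all of its values."""
    return key, sum(values)


def run_reduce_task(copied_runs):
    merged = merge_map_outputs(copied_runs)
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append(reduce_fn(key, [value for _, value in group]))
    return results


def write_output(results):
    """Step 4 stand-in: format pairs as lines instead of writing an HDFS file."""
    return "\n".join(f"{key}\t{value}" for key, value in results)


runs = [[("bill", 1), ("water", 1)], [("bill", 1), ("heating", 1)]]
results = run_reduce_task(runs)
text = write_output(results)
```

Because each copied run is already sorted, `heapq.merge` produces a globally sorted stream, and `groupby` can then hand each key to the reduce function exactly once.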
Further, the Lucene processing in Step 4 comprises two processes: indexing and retrieval.
Compared with the prior art, the invention has the following advantages and effects:
1. Unification of multi-node knowledge data: ETL technology is introduced to extract the knowledge data in several mutually independent node computer rooms into the raw database in a unified way.
2. Unification of multi-carrier knowledge data: Hadoop MapReduce and HDFS are used to unify and sort knowledge data from different carriers.
3. Intelligent push: for the knowledge data in the HBase database, a collaborative filtering algorithm is applied and, based on the similarity of the knowledge, computed in parallel to form a knowledge recommendation model.
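The text does not say which collaborative filtering variant or similarity measure is used. As one possible reading, the sketch below applies item-based collaborative filtering with cosine similarity over user scores; the score data, the item identifiers, and the function names are all hypothetical, and the computation is sequential rather than parallel.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse score vectors (user -> score)."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(item_id, ratings, top_n=2):
    """Rank the other knowledge items by similarity to item_id."""
    target = ratings[item_id]
    scored = [(cosine(target, vec), other)
              for other, vec in ratings.items() if other != item_id]
    # Highest similarity first; ties broken by item id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:top_n]]


# Hypothetical scores that users gave to knowledge items.
ratings = {
    "water_bill": {"u1": 5, "u2": 4},
    "water_outage": {"u1": 4, "u2": 5},
    "heating_policy": {"u3": 5},
}
recommended = most_similar("water_bill", ratings)
```

Items scored by the same users in a similar way end up close together, so a user reading `water_bill` would be pushed `water_outage` ahead of `heating_policy`.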
Brief description of the drawings
Fig. 1 is a schematic diagram of an intelligent data search method of the present invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawing and by way of an embodiment. The following embodiment explains the invention, and the invention is not limited to it.
In the intelligent data search method of the invention, the node computer rooms, the mass-knowledge intelligent search engine, and the index warehouse are connected by data links, which overcomes the deficiencies of the prior art. The intelligent search engine comprises a raw data warehouse module, a Lucene module, and a model. In this technical solution, knowledge data held on different carriers in each node computer room is transferred into the raw data warehouse by ETL scripts. The raw data warehouse is divided into multiple partitions; in each partition a knowledge similarity algorithm is applied to create a knowledge model, and the knowledge model is cleaned and deduplicated. Lucene is then used to build an index over the model, and the resulting index is written to the index warehouse as a stream. When a user searches, only the index warehouse directory needs to be searched to locate items quickly within the mass of knowledge data.
The system used by the intelligent data search method comprises several node computer rooms, a search engine, and an index warehouse; the search engine comprises a raw data warehouse module, a model, and a Lucene module.
As shown in Fig. 1, the intelligent data search method of the invention comprises the following steps:
Step 1: Extract the knowledge data in several mutually independent node computer rooms into the raw database in a unified way by ETL technology;
The ETL technology uses the Kettle tool to transfer regular and irregular knowledge data into the raw warehouse. Regular knowledge data generally refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type, and knowledge rating information. Irregular knowledge data is usually attachment information stored on hard disk, including knowledge in file formats such as txt, Word, and Excel. Regular knowledge data is imported directly into the raw database by the Kettle tool, whereas irregular knowledge data must first have the text data in its attachments parsed by parsing software before the Kettle tool transfers it into the raw warehouse.
Step 2: The raw database, which holds the mass knowledge data, is made up of several HBase instances and gathers together knowledge content from different carriers and in different knowledge formats;
Step 3: Clean and deduplicate the knowledge data in the raw store, then apply a collaborative filtering algorithm to generate a knowledge model;
Step 4: Use Lucene to generate an index for the model data and store it in the index database; the Lucene processing comprises two processes, indexing and retrieval.
Step 5: Share the index database with the users of all node computer rooms.
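The indexing and retrieval processes named in Step 4 can be illustrated with a minimal inverted index. The sketch below is plain Python and does not use the Lucene API; whitespace tokenization, AND-only matching, and the document identifiers are simplifying assumptions, and no ranking is performed.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple analyzer)."""
    return [token for token in text.lower().split() if token]


def build_index(documents):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Retrieval: return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


documents = {
    "k001": "water bill query",
    "k002": "heating bill policy",
    "k003": "water supply outage",
}
index = build_index(documents)
hits = search(index, "water bill")
```

A query touches only the postings for its own terms rather than scanning every document, which is the property that lets a user search the index warehouse alone instead of the underlying knowledge data.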
The raw database consists of one HBase cluster. The HBase cluster stores the different attribute information of the knowledge data in an HTable. The HTable is sorted automatically by row key; each row contains any number of columns, the columns are sorted automatically by column key, and each column contains any number of values. The knowledge data in the raw database is cleaned and deduplicated using Hadoop MapReduce and HDFS.
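The text does not spell out the cleaning or deduplication rules. One common way to phrase exact-duplicate removal in map, shuffle, and reduce terms is shown below as a Python sketch: the whitespace and case normalization, the SHA-1 content key, and the rule of keeping the smallest record id are assumptions made for illustration, and near-duplicates are not detected.

```python
import hashlib
from collections import defaultdict


def clean(text):
    """Cleaning rule assumed here: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


def map_phase(records):
    """Emit (content hash, (record id, cleaned text)); drop empty records."""
    for record_id, text in records:
        cleaned = clean(text)
        if cleaned:
            digest = hashlib.sha1(cleaned.encode("utf-8")).hexdigest()
            yield digest, (record_id, cleaned)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Keep one record per content hash: the one with the smallest id."""
    return sorted(min(values) for values in groups.values())


records = [
    ("k003", "Water  bill query"),
    ("k001", "water bill query "),
    ("k002", "Heating policy"),
    ("k004", "   "),
]
unique = reduce_phase(shuffle(map_phase(records)))
```

Records `k001` and `k003` differ only in case and spacing, so they hash to the same key and only `k001` survives; the blank record `k004` is dropped during cleaning.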
When MapReduce runs, the task run by the Mapper reads the data files in HDFS, then calls its own method to process the data, and finally outputs the result.
The task run by the Mapper is a Java process comprising the following steps:
1) The input file is split according to a certain standard, and the size of each input split is fixed;
2) The records in each input split are parsed into key-value pairs according to certain rules;
3) The map method of the Mapper class is called, once for each key-value pair parsed;
4) The output key-value pairs are partitioned by key;
5) The key-value pairs in each partition are sorted, first by key and then, for pairs with the same key, by value;
6) A reduction step is applied to the data.
The reduction step is the reduce processing, which receives the output of the Mapper tasks and, after processing, writes the result to HDFS.
The reduce processing is a Java process comprising the following steps:
1) Actively copy the output key-value pairs from the Mapper tasks;
2) Merge all the data copied locally to the Reducer, i.e. merge the scattered data into one large data set, and sort the merged data;
3) Call the reduce method on the sorted key-value pairs;
4) Write the output key-value pairs to an HDFS file.
The above content described in this specification is merely an illustration of the invention. Those skilled in the art to which the invention belongs may make various modifications or supplements to the specific embodiment described, or substitute it in a similar manner; provided they do not depart from the content of this specification or go beyond the scope defined by the claims, all such changes fall within the scope of protection of the invention.

Claims (9)

1. An intelligent data search method, characterized in that it comprises the following steps:
Step 1: Extract the knowledge data in several mutually independent node computer rooms into the raw database in a unified way by ETL technology;
Step 2: The raw database, which holds the mass knowledge data, is made up of several HBase instances and gathers together knowledge content from different carriers and in different knowledge formats;
Step 3: Clean and deduplicate the knowledge data in the raw store, then apply a collaborative filtering algorithm to generate a knowledge model;
Step 4: Use Lucene to generate an index for the model data and store it in the index database;
Step 5: Share the index database with the users of all node computer rooms.
2. The intelligent data search method according to claim 1, characterized in that: the ETL technology in Step 1 specifically uses the Kettle tool to transfer regular knowledge data and irregular knowledge data into the raw warehouse, wherein regular knowledge data refers to knowledge records in a database, including the knowledge content, knowledge title, upload time, keywords, knowledge type, and knowledge score information; irregular knowledge data is attachment information stored on hard disk, including knowledge in txt, Word, and Excel file formats; regular knowledge data is imported directly into the raw database by the Kettle tool; and for irregular knowledge data, parsing software first parses the text data in the attachment, and the Kettle tool then transfers it into the raw warehouse.
3. The intelligent data search method according to claim 1, characterized in that: the raw database consists of one HBase cluster; the HBase cluster stores the different attribute information of the knowledge data in an HTable; the HTable is sorted automatically by row key; each row contains any number of columns; the columns are sorted automatically by column key; and each column contains any number of values.
4. The intelligent data search method according to claim 3, characterized in that: the knowledge data in the raw database is cleaned and deduplicated using Hadoop MapReduce and HDFS.
5. The intelligent data search method according to claim 4, characterized in that: when MapReduce runs, the task run by the Mapper reads the data files in HDFS, then calls its own method to process the data, and finally outputs the result.
6. The intelligent data search method according to claim 5, characterized in that: the task run by the Mapper is a Java process comprising the following steps:
1) The input file is split according to a certain standard, and the size of each input split is fixed;
2) The records in each input split are parsed into key-value pairs according to certain rules;
3) The map method of the Mapper class is called, once for each key-value pair parsed;
4) The output key-value pairs are partitioned by key;
5) The key-value pairs in each partition are sorted, first by key and then, for pairs with the same key, by value;
6) A reduction step is applied to the data.
7. The intelligent data search method according to claim 6, characterized in that: the reduction step is the reduce processing, which receives the output of the Mapper tasks and, after processing, writes the result to HDFS.
8. The intelligent data search method according to claim 7, characterized in that: the reduce processing is a Java process comprising the following steps:
1) Actively copy the output key-value pairs from the Mapper tasks;
2) Merge all the data copied locally to the Reducer, i.e. merge the scattered data into one large data set, and sort the merged data;
3) Call the reduce method on the sorted key-value pairs;
4) Write the output key-value pairs to an HDFS file.
9. The intelligent data search method according to claim 1, characterized in that: the Lucene processing in Step 4 comprises two processes, indexing and retrieval.
CN201711052166.8A 2017-10-30 2017-10-30 A kind of intelligent data search method Pending CN107679248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711052166.8A CN107679248A (en) 2017-10-30 2017-10-30 A kind of intelligent data search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711052166.8A CN107679248A (en) 2017-10-30 2017-10-30 A kind of intelligent data search method

Publications (1)

Publication Number Publication Date
CN107679248A true CN107679248A (en) 2018-02-09

Family

ID=61143985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711052166.8A Pending CN107679248A (en) 2017-10-30 2017-10-30 A kind of intelligent data search method

Country Status (1)

Country Link
CN (1) CN107679248A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation
CN102779134A (en) * 2011-05-12 2012-11-14 苏州同程旅游网络科技有限公司 Lucene-based distributed search method
CN102915365A (en) * 2012-10-24 2013-02-06 苏州两江科技有限公司 Hadoop-based construction method for distributed search engine
CN104298768A (en) * 2014-10-29 2015-01-21 深圳市同洲电子股份有限公司 Searching method, device and system
CN104376053A (en) * 2014-11-04 2015-02-25 南京信息工程大学 Storage and retrieval method based on massive meteorological data
CN105069101A (en) * 2015-08-07 2015-11-18 桂林电子科技大学 Distributed index construction and search method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210029 No. 268, Hanzhoung Road, Nanjing, Jiangsu

Applicant after: CLP Hongxin Information Technology Co., Ltd

Address before: 210029 No. 268, Hanzhoung Road, Nanjing, Jiangsu

Applicant before: Jiangsu Hongxin System Integration Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180209