CN102081655A - Information retrieval method based on Bayesian classification algorithm - Google Patents

Info

Publication number
CN102081655A
Authority
CN
China
Prior art keywords
sample
information retrieval
classification
probability
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110005077
Other languages
Chinese (zh)
Other versions
CN102081655B (en)
Inventor
刘琳
李国栋
问梁军
李国粹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
North China Electric Power University
Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University
Priority to CN 201110005077
Publication of CN102081655A
Application granted
Publication of CN102081655B
Active (current legal status)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information retrieval method based on a Bayesian classification algorithm, in the technical field of information processing. The method comprises the steps of: selecting data tables from a relational database and establishing an information retrieval model; deriving a sample set from the information retrieval model and dividing it into a training data set and a test data set; selecting relevant fields from the information retrieval model as the classification attributes of the sample set, determining the classes of each classification attribute and calculating the prior probability of each class; calculating the posterior probability of the samples; calculating the class probability of the samples according to the Bayesian formula; classifying the samples according to their class probability and generating a data classification set; and having the user perform information retrieval within the data classification set. The invention applies the Bayesian classification algorithm to practical information retrieval and thereby effectively improves retrieval accuracy.

Description

Information retrieval method based on a Bayesian classification algorithm
Technical field
The invention belongs to the technical field of information processing and relates in particular to an information retrieval method based on a Bayesian classification algorithm.
Background
Information retrieval is an important part of Internet applications. As the volume of information on the Internet grows rapidly, increasingly complex classification systems make it harder and harder to retrieve valuable information.
The Bayesian classifier is a basic algorithm in text mining. It uses probability theory to classify text information quickly and accurately, and the classified information can serve as the basis for other applications.
Because information categories are diverse and the relationships among items are complex, a user who queries information on one subject may be led on to other subjects, so that rings of related information form within the system. As a result, when the system recommends information, it may return items whose wording is similar but whose actual meaning differs greatly, which can cause the user unexpected trouble during retrieval.
To address these problems, the present invention applies the Bayesian classification algorithm to information retrieval: information resources are classified to narrow the retrieval range, so that retrieval within a given category achieves higher accuracy.
Summary of the invention
The object of the invention is to provide an information retrieval method based on a Bayesian classification algorithm, in which the original information is classified by the Bayesian classification algorithm to narrow the scope of retrieval, and retrieval is then carried out within a particular category, thereby improving retrieval accuracy.
The technical solution is an information retrieval method based on a Bayesian classification algorithm, characterized in that the method comprises the following steps:
Step 1: select data tables from a relational database and establish an information retrieval model. Establishing the information retrieval model specifically means first defining the primary keys and foreign keys of the data tables and then building a ring-structured information retrieval model according to the primary key and foreign key relationships that exist between the tables;
Step 2: derive a sample set from the information retrieval model and divide the sample set into a training data set and a test data set;
Step 3: select relevant fields from the information retrieval model as the classification attributes of the sample set, determine the classes of each classification attribute, and calculate the prior probability $P(C_i)$ of each class;
Step 4: calculate the posterior probability $P(X \mid C_i)$ of the sample;
Step 5: according to the Bayesian formula
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
calculate the class probability $P(C_i \mid X)$ of the sample;
Step 6: classify the sample according to its class probability and generate a data classification set;
Step 7: the user performs information retrieval within the data classification set.
The prior probability $P(C_i)$ is the proportion of training-set samples that belong to each class. It is calculated with the formula $P(C_i) = s_i / s$, where $s_i$ is the number of samples of class $C_i$ in the training data set and $s$ is the total number of samples in the training data set.
The posterior probability $P(X \mid C_i)$ is the proportion of samples of each class in the test data set, calculated with the formula
$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i)$$
where $P(X_k \mid C_i) = s_{ik} / s_i$, $1 \le k \le n$, $s_{ik}$ is the number of samples of class $C_i$ in the test data set, $s_i$ is the total number of samples in the training data set, and $n$ is the number of classes.
Classifying the sample according to its class probability specifically means comparing the probabilities of a test-set sample under each class and assigning the sample to the class with the largest probability, where the maximum is determined by the formula $X \in C_i \mid P(C_i \mid X) = \max\{P(C_i \mid X)\}$.
The effect of the invention is that the Bayesian classification algorithm is applied to practical information retrieval, which effectively improves retrieval accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of the information retrieval method based on the Bayesian classification algorithm;
Fig. 2 is a schematic diagram of the information retrieval model design;
Fig. 3 is a diagram of a worked example of information retrieval based on the Bayesian classification algorithm.
Detailed description of the embodiments
The preferred embodiment is described in detail below with reference to the drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope or application of the invention.
Fig. 1 is a flowchart of the information retrieval method based on the Bayesian classification algorithm. As shown in Fig. 1, the method comprises the following steps:
Step 1: select data tables from a relational database and establish an information retrieval model. Fig. 2 is a schematic diagram of the information retrieval model design. In Fig. 2, a scientific achievements table, a personal information table and a department information table selected from the database are taken as an example. The information retrieval model is established from the relationships among the three tables as follows: first define the primary keys and foreign keys of the three data tables, then build a ring-structured information retrieval model according to the primary key and foreign key relationships that exist among them.
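The text does not spell out how the ring structure is assembled, so the following is only a minimal sketch of one possible reading: each table points to the next through a foreign key, and the model is a ring when following those links returns to the starting table. All table and column names, and the `build_ring` helper itself, are hypothetical and do not come from the patent.

```python
# Hypothetical schema: every table and column name below is illustrative only.
foreign_keys = [
    # (referencing table, FK column, referenced table, PK column)
    ("achievement", "person_id", "person", "person_id"),
    ("person", "dept_id", "department", "dept_id"),
    ("department", "lead_achievement_id", "achievement", "achievement_id"),
]


def build_ring(foreign_keys):
    """Follow FK links from table to table; return the ring of tables, or None."""
    next_table = {src: dst for src, _, dst, _ in foreign_keys}
    start = foreign_keys[0][0]
    ring, current = [], start
    while current not in ring:
        if current not in next_table:
            return None  # the chain of references breaks, so there is no ring
        ring.append(current)
        current = next_table[current]
    # The ring closes only if the walk returns to the start and visits every table.
    if current == start and len(ring) == len(next_table):
        return ring
    return None


ring = build_ring(foreign_keys)
open_chain = build_ring(foreign_keys[:2])  # drop the closing link
```

With the three links above the walk visits all three tables and returns to the first one; without the closing link the helper reports that no ring exists.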
Step 2: derive a sample set from the information retrieval model and divide the sample set into a training data set and a test data set.
Deriving a sample set from the information retrieval model means randomly exporting, from the constructed model, the data records used to build the data classification set and dividing them into a training data set and a test data set. The records are generally divided at random between the training data set and the test data set in a ratio of 2:1. The training data set is the labeled data used to train the classifier. The test data set is the unlabeled data that the classifier must identify.
Fig. 3 is a diagram of a worked example of information retrieval based on the Bayesian classification algorithm. In Fig. 3, 1000 records are selected at random from the constructed information retrieval model as the sample set, of which 666 form the training data set and 334 form the test data set.
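The 2:1 random division can be sketched as follows. This is an illustration under assumptions: the function name `split_two_to_one`, the fixed seed and the use of integer record IDs are not part of the original text.

```python
import random


def split_two_to_one(records, seed=0):
    """Shuffle a copy of the records and split it 2:1 into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    cut = len(shuffled) * 2 // 3           # integer two-thirds point
    return shuffled[:cut], shuffled[cut:]


# 1000 record IDs stand in for the exported data records.
train, test = split_two_to_one(range(1000))
print(len(train), len(test))  # 666 334
```

Integer division reproduces the 666 and 334 record counts of the example, since 1000 is not divisible by 3.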
Step 3: select relevant fields from the information retrieval model as the classification attributes of the sample set, determine the classes of each classification attribute, and calculate the prior probability $P(C_i)$ of each class.
According to actual requirements, this example selects four fields of the sample as classification attributes: the functional localization attribute, the subject attribute, the avatar attribute and the national economy industry attribute. The functional localization attribute contains 12 classes, the subject attribute 58 classes, the avatar attribute 16 classes, and the national economy industry attribute 98 classes.
To simplify the calculation, this example uses the functional localization attribute as the classification attribute. The subject, avatar and national economy industry attributes are handled in the same way as the functional localization attribute and are not described again here.
The prior probability $P(C_i)$ is calculated for the 12 classes of the functional localization attribute. The prior probability $P(C_i)$ is the proportion of training-set samples that belong to each class and is calculated with the formula $P(C_i) = s_i / s$, where $s_i$ is the number of samples of class $C_i$ in the training data set and $s$ is the total number of samples in the training data set. In this example, the prior probabilities of the 12 classes of the functional localization attribute are 11.4%, 9.0%, 0.6%, 11.7%, 28.5%, 12.7%, 6.6%, 7.8%, 3.5%, 0.4%, 18.8% and 0, respectively.
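The prior calculation $P(C_i) = s_i / s$ amounts to counting labels in the training data set. A minimal sketch follows, in which the function name `class_priors` and the toy labels are assumptions and do not reproduce the data of the example.

```python
from collections import Counter


def class_priors(train_labels):
    """Return {class: s_i / s} from the class labels of the training set."""
    counts = Counter(train_labels)  # s_i for every class that occurs
    s = len(train_labels)           # total number of training samples
    return {label: s_i / s for label, s_i in counts.items()}


# Toy labels, not the patent's data: 3 + 2 + 5 = 10 training samples.
labels = ["C1"] * 3 + ["C2"] * 2 + ["C3"] * 5
priors = class_priors(labels)
```

A class that never occurs in the training data set simply has no entry here, which corresponds to a prior of 0.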
Step 4: calculate the posterior probability $P(X \mid C_i)$ of the sample.
A sample is one data record in the sample set. The posterior probability $P(X \mid C_i)$ is the proportion of samples of each class in the test data set, calculated with the formula $P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i)$, where $P(X_k \mid C_i) = s_{ik} / s_i$, $1 \le k \le n$, $s_{ik}$ is the number of samples of class $C_i$ in the test data set, $s_i$ is the total number of samples in the training data set, and $n$ is the number of classes. In this example the sample is one-dimensional, so its posterior probability is set to 1.
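The product formula can be sketched as follows. The sketch assumes the usual naive Bayes reading, which the text does not state explicitly: $s_i$ is taken as the number of training samples of class $C_i$, and $s_{ik}$ as the number of those samples whose k-th attribute equals the k-th attribute of the sample $X$. The function name `likelihood` and the toy rows are hypothetical.

```python
from math import prod


def likelihood(x, train_rows, train_labels, c):
    """P(X|C_i) as the product over attributes k of s_ik / s_i."""
    rows_c = [row for row, label in zip(train_rows, train_labels) if label == c]
    s_i = len(rows_c)  # training samples of class c (assumed meaning of s_i)
    if s_i == 0:
        return 0.0
    return prod(
        sum(1 for row in rows_c if row[k] == value) / s_i  # s_ik / s_i
        for k, value in enumerate(x)
    )


# Toy training rows with two attributes each; not the patent's data.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]
labels = ["C1", "C1", "C1", "C2"]
p = likelihood(("a", "x"), rows, labels, "C1")  # (2/3) * (2/3)
```

Under this reading each attribute contributes one factor, so the result shrinks as more attributes disagree with the class.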
Step 5: according to the Bayesian formula
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
calculate the class probability $P(C_i \mid X)$ of the sample.
The Bayesian formula assumes that the attributes of each sample are mutually independent and that each attribute has the same influence on a given class. In the formula, each data sample is represented by an n-dimensional feature vector $X = \{X_1, X_2, \ldots, X_n\}$, the classes of the sample are represented by the vector $C = \{C_1, C_2, \ldots, C_m\}$, $P(C_i)$ is the prior probability, $P(X_j \mid C_i)$ is the posterior probability, and $P(X)$ is the total probability of the sample, which is a constant for every class.
For the 12 classes of the functional localization attribute, the calculated class probabilities of the sample are 11.4%, 9.0%, 0.6%, 11.7%, 28.5%, 12.7%, 6.6%, 7.8%, 3.5%, 0.4%, 18.8% and 0, respectively. The maximum is 28.5%, so the class probability of the sample is 28.5%.
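When $P(X \mid C_i)$ is 1 for every class, as in this example, the Bayesian formula leaves each class probability equal to its prior, which is why the values listed here match those of step 3. A minimal sketch of the calculation follows; it assumes that $P(X)$ is obtained by summing $P(X \mid C_i)\,P(C_i)$ over all classes, and the function name and toy priors are illustrative only.

```python
def class_probabilities(priors, likelihoods):
    """Return P(C_i|X) for every class using the Bayesian formula."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    p_x = sum(joint.values())  # P(X), assumed to be the sum over all classes
    if p_x == 0:
        raise ValueError("the sample has zero probability under every class")
    return {c: value / p_x for c, value in joint.items()}


# Toy priors that sum to 1; not the patent's figures.
priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}
ones = {c: 1.0 for c in priors}  # P(X|C_i) = 1 for a one-dimensional sample
posterior = class_probabilities(priors, ones)
```

Because $P(X)$ is the same for every class, it rescales the values without changing which class ranks first.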
Step 6: classify the sample according to its class probability and generate the data classification set.
Classifying according to the class probability means comparing the probabilities of each test-set sample under every class and assigning the sample to the class with the largest probability. The maximum is determined by the formula $X \in C_i \mid P(C_i \mid X) = \max\{P(C_i \mid X)\}$. Because the largest class probability calculated in step 5 is 28.5%, the sample is assigned to the corresponding class.
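Selecting the class with the largest probability can be sketched with the twelve values quoted in the example. The assumption that the values are listed in class order (class 1 to class 12) and the helper name `classify` are not taken from the original text.

```python
# Class probabilities from the example, assumed to be listed in class order 1..12.
class_probs = [0.114, 0.090, 0.006, 0.117, 0.285, 0.127,
               0.066, 0.078, 0.035, 0.004, 0.188, 0.0]


def classify(probs):
    """Return (1-based class number, probability) of the most probable class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best]


number, probability = classify(class_probs)
print(number, probability)  # 5 0.285
```

Under that ordering assumption the sample falls into the fifth class, the one with probability 28.5%.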
Step 7: the user performs information retrieval within the data classification set.
The user enters a query keyword, and retrieval is performed by keyword within the completed data classification set.
For example, the user enters the keyword "large electric power plant unit" and searches the completed data classification set to obtain the desired information for that keyword. Part of the results is as follows:
"Research and application of key technologies for the air-cooling design and operation of large electric power plant units in China", "Research and application of comprehensive energy saving for large electric power plant units based on energy-saving theory analysis", "Application of advanced control strategies in large electric power plant units and development of a control software package".
To search again, the user enters a new keyword; otherwise the retrieval ends.
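Keyword retrieval inside one class of the data classification set can be sketched as below. The sketch assumes that the class to search has already been chosen and that matching is a plain case-insensitive substring test; the text does not specify either point. The record titles, class labels and function names are hypothetical.

```python
from collections import defaultdict


def build_classification_set(records):
    """Group (class label, title) pairs into {label: [titles]}."""
    classified = defaultdict(list)
    for label, title in records:
        classified[label].append(title)
    return classified


def search(classified, label, keyword):
    """Return the titles in one class that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return [title for title in classified.get(label, []) if needle in title.lower()]


# Illustrative records only; labels and titles are not the patent's data.
records = [
    ("C5", "Energy saving for a large electric power plant unit"),
    ("C5", "Grid dispatch scheduling"),
    ("C2", "Large electric power plant unit control software"),
]
classified = build_classification_set(records)
hits = search(classified, "C5", "large electric power plant unit")
```

Restricting the search to one class skips the matching title that was classified elsewhere, which is the narrowing effect the method relies on.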
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be defined by the claims.

Claims (4)

1. An information retrieval method based on a Bayesian classification algorithm, characterized in that the method comprises the following steps:
Step 1: select data tables from a relational database and establish an information retrieval model, wherein establishing the information retrieval model specifically means first defining the primary keys and foreign keys of the data tables and then building a ring-structured information retrieval model according to the primary key and foreign key relationships that exist between the tables;
Step 2: derive a sample set from the information retrieval model and divide the sample set into a training data set and a test data set;
Step 3: select relevant fields from the information retrieval model as the classification attributes of the sample set, determine the classes of each classification attribute, and calculate the prior probability $P(C_i)$ of each class;
Step 4: calculate the posterior probability $P(X \mid C_i)$ of the sample;
Step 5: according to the Bayesian formula
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
calculate the class probability $P(C_i \mid X)$ of the sample;
Step 6: classify the sample according to its class probability and generate a data classification set;
Step 7: the user performs information retrieval within the data classification set.
2. The information retrieval method based on a Bayesian classification algorithm according to claim 1, characterized in that the prior probability $P(C_i)$ is the proportion of training-set samples that belong to each class and is calculated with the formula $P(C_i) = s_i / s$, where $s_i$ is the number of samples of class $C_i$ in the training data set and $s$ is the total number of samples in the training data set.
3. The information retrieval method based on a Bayesian classification algorithm according to claim 1, characterized in that the posterior probability $P(X \mid C_i)$ is the proportion of samples of each class in the test data set and is calculated with the formula
$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i)$$
where $P(X_k \mid C_i) = s_{ik} / s_i$, $1 \le k \le n$, $s_{ik}$ is the number of samples of class $C_i$ in the test data set, $s_i$ is the total number of samples in the training data set, and $n$ is the number of classes.
4. The information retrieval method based on a Bayesian classification algorithm according to claim 1, characterized in that classifying the sample according to its class probability specifically means comparing the probabilities of a test-set sample under each class and assigning the sample to the class with the largest probability, where the maximum is determined by the formula $X \in C_i \mid P(C_i \mid X) = \max\{P(C_i \mid X)\}$.
CN 201110005077 2011-01-11 2011-01-11 Information retrieval method based on Bayesian classification algorithm Active CN102081655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110005077 CN102081655B (en) 2011-01-11 2011-01-11 Information retrieval method based on Bayesian classification algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110005077 CN102081655B (en) 2011-01-11 2011-01-11 Information retrieval method based on Bayesian classification algorithm

Publications (2)

Publication Number Publication Date
CN102081655A (en) 2011-06-01
CN102081655B (en) 2013-06-05

Family

ID=44087618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110005077 Active CN102081655B (en) 2011-01-11 2011-01-11 Information retrieval method based on Bayesian classification algorithm

Country Status (1)

Country Link
CN (1) CN102081655B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722537A (en) * 2012-05-22 2012-10-10 苏州阔地网络科技有限公司 Database test data generation method and system thereof
CN102956023A (en) * 2012-08-30 2013-03-06 南京信息工程大学 Bayes classification-based method for fusing traditional meteorological data with perception data
CN103064939A (en) * 2012-12-25 2013-04-24 深圳先进技术研究院 Method and system for re-ordering data
CN103294828A (en) * 2013-06-25 2013-09-11 厦门市美亚柏科信息股份有限公司 Verification method and verification device of data mining model dimension
CN103345676A (en) * 2013-06-20 2013-10-09 南京邮电大学 Materials management system oriented missing information estimation method based on Bayesian classification
CN104268260A (en) * 2014-10-10 2015-01-07 中国科学院重庆绿色智能技术研究院 Method, device and system for classifying streaming data
CN106204083A (en) * 2015-04-30 2016-12-07 ***通信集团山东有限公司 A kind of targeted customer's sorting technique, Apparatus and system
CN106372670A (en) * 2016-09-06 2017-02-01 南京理工大学 Loyalty index prediction method based on improved nearest neighbor algorithm
CN109495558A (en) * 2018-11-06 2019-03-19 中国铁道科学研究院集团有限公司通信信号研究所 Vehicle applied to City Rail Transit System ground multi-internet integration wireless communications method
CN109784047A (en) * 2018-12-07 2019-05-21 中国人民解放军战略支援部队航天工程大学 Program detecting method based on multiple features
CN109804362A (en) * 2016-07-15 2019-05-24 伊欧-塔霍有限责任公司 Primary key-foreign key relationship is determined by machine learning
CN110580483A (en) * 2018-05-21 2019-12-17 上海大唐移动通信设备有限公司 indoor and outdoor user distinguishing method and device
CN110737700A (en) * 2019-10-16 2020-01-31 百卓网络科技有限公司 purchase, sales and inventory user classification method and system based on Bayesian algorithm

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699772B (en) * 2015-03-05 2018-03-23 内蒙古科技大学 A kind of big data file classification method based on cloud computing
CN108334590B (en) * 2018-01-30 2021-06-29 苏州龙御上宾信息科技有限公司 Information retrieval system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211769A (en) * 1997-06-26 1999-03-24 香港中文大学 Method and equipment for file retrieval based on Bayesian network
CN1535431A (en) * 2000-07-28 2004-10-06 �ʼҷ����ֵ������޹�˾ Context and content based information processing for multimedia segmentation and indexing

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722537A (en) * 2012-05-22 2012-10-10 苏州阔地网络科技有限公司 Database test data generation method and system thereof
CN102956023B (en) * 2012-08-30 2016-02-03 南京信息工程大学 A kind of method that traditional meteorological data based on Bayes's classification and perception data merge
CN102956023A (en) * 2012-08-30 2013-03-06 南京信息工程大学 Bayes classification-based method for fusing traditional meteorological data with perception data
CN103064939A (en) * 2012-12-25 2013-04-24 深圳先进技术研究院 Method and system for re-ordering data
CN103064939B (en) * 2012-12-25 2015-09-30 深圳先进技术研究院 data reordering method and system
CN103345676B (en) * 2013-06-20 2016-06-15 南京邮电大学 A kind of missing information method of estimation classified based on Bayes towards material Management System
CN103345676A (en) * 2013-06-20 2013-10-09 南京邮电大学 Materials management system oriented missing information estimation method based on Bayesian classification
CN103294828B (en) * 2013-06-25 2016-04-27 厦门市美亚柏科信息股份有限公司 The verification method of data mining model dimension and demo plant
CN103294828A (en) * 2013-06-25 2013-09-11 厦门市美亚柏科信息股份有限公司 Verification method and verification device of data mining model dimension
CN104268260A (en) * 2014-10-10 2015-01-07 中国科学院重庆绿色智能技术研究院 Method, device and system for classifying streaming data
CN106204083A (en) * 2015-04-30 2016-12-07 ***通信集团山东有限公司 A kind of targeted customer's sorting technique, Apparatus and system
CN109804362A (en) * 2016-07-15 2019-05-24 伊欧-塔霍有限责任公司 Primary key-foreign key relationship is determined by machine learning
CN106372670A (en) * 2016-09-06 2017-02-01 南京理工大学 Loyalty index prediction method based on improved nearest neighbor algorithm
CN110580483A (en) * 2018-05-21 2019-12-17 上海大唐移动通信设备有限公司 indoor and outdoor user distinguishing method and device
CN109495558A (en) * 2018-11-06 2019-03-19 中国铁道科学研究院集团有限公司通信信号研究所 Vehicle applied to City Rail Transit System ground multi-internet integration wireless communications method
CN109784047A (en) * 2018-12-07 2019-05-21 中国人民解放军战略支援部队航天工程大学 Program detecting method based on multiple features
CN109784047B (en) * 2018-12-07 2021-03-30 中国人民解放军战略支援部队航天工程大学 Program detection method based on multiple features
CN110737700A (en) * 2019-10-16 2020-01-31 百卓网络科技有限公司 purchase, sales and inventory user classification method and system based on Bayesian algorithm

Also Published As

Publication number Publication date
CN102081655B (en) 2013-06-05

Similar Documents

Publication Publication Date Title
CN102081655B (en) Information retrieval method based on Bayesian classification algorithm
CN106845717B (en) Energy efficiency evaluation method based on multi-model fusion strategy
CN105205096B (en) A kind of data retrieval method across text modality and image modalities
CN103617157A (en) Text similarity calculation method based on semantics
CN102289522B (en) Method of intelligently classifying texts
Xu et al. Activity auto-completion: Predicting human activities from partial videos
CN107066555A (en) Towards the online topic detection method of professional domain
CN102999615B (en) Based on variety of images mark and the search method of radial basis function neural network
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN103412888A (en) Point of interest (POI) identification method and device
CN104216993A (en) Tag-co-occurred tag clustering method
CN103778206A (en) Method for providing network service resources
CN103761286B (en) A kind of Service Source search method based on user interest
Agrawal et al. A novel algorithm for automatic document clustering
CN105183792B (en) Distributed fast text classification method based on locality sensitive hashing
CN103984700B (en) A kind of isomeric data analysis method for scientific and technological information vertical search
CN104537392B (en) A kind of method for checking object based on the semantic part study of identification
Yang et al. Microblog sentiment analysis algorithm research and implementation based on classification
Qian et al. Weakly supervised part-based method for combined object detection in remote sensing imagery
Sutanto et al. Ranking Based Clustering for Social Event Detection.
CN103207893B (en) The sorting technique of two class texts based on Vector Groups mapping
CN103761433A (en) Network service resource classifying method
Saad et al. Efficient content based image retrieval using SVM and color histogram
Feng et al. Chinese short text classification based on domain knowledge
Zhu et al. Chinese texts classification system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: STATE GRID CORPORATION OF CHINA INFORMATION COMMUN

Effective date: 20140925

C41 Transfer of patent application or patent right or utility model
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Liu Lin

Inventor after: Li Guodong

Inventor after: Wen Liangjun

Inventor after: Li Guocui

Inventor after: Yin Jun

Inventor after: Zhou Wenting

Inventor after: Nijiati.Najimi

Inventor after: Ma Tianfu

Inventor after: Li Kai

Inventor before: Liu Lin

Inventor before: Li Guodong

Inventor before: Wen Liangjun

Inventor before: Li Guocui

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: LIU LIN LI GUODONG WEN LIANGJUN LI GUOCUI TO: LIU LIN LI GUODONG WEN LIANGJUN LI GUOCUI YIN JUN ZHOU WENTING NIJIATI NAJIMI MA TIANFU LI KAI

TR01 Transfer of patent right

Effective date of registration: 20140925

Address after: 102206 Changping District North Road, No. 2, Beijing

Patentee after: North China Electric Power University

Patentee after: State Grid Corporation of China

Patentee after: INFORMATION & TELECOMMUNICATION COMPANY OF STATE GRID XINJIANG ELECTRIC POWER COMPANY

Address before: 102206, Beijing, Changping District, Beijing Desheng outside the door, Zhu Xin, North China Electric Power University

Patentee before: North China Electric Power University